# Deployment Checklist

Use this checklist when deploying a new service to ensure you don't miss critical steps.

## Pre-Deployment

### Application Requirements

- [ ] **Health endpoint implemented**
  - `/health` returns 200 OK
  - Returns JSON with status
  - Responds quickly (<500ms)
  - Doesn't block on external services

- [ ] **Port configuration**
  - App reads `PORT` from environment
  ```python
  import os

  PORT = int(os.getenv('PORT', 5000))
  app.run(host='0.0.0.0', port=PORT)
  ```

- [ ] **Graceful shutdown**
  - App handles SIGTERM signal
  - Closes connections cleanly
  - Finishes current requests
  - Exits within 30 seconds

- [ ] **Logging configured**
  - Uses stdout/stderr
  - Structured logging (JSON preferred)
  - Includes timestamps
  - No log files (Nomad captures stdout)

### Docker Image

- [ ] **Dockerfile complete**
  - Based on `Dockerfile.complete`
  - Multi-stage build (smaller image)
  - Non-root user (uid 1000)
  - Health check defined
  - Minimal base image

- [ ] **Image tested locally**
  ```bash
  docker build -t myapp:test .
  docker run -p 5000:5000 -e PORT=5000 myapp:test
  curl http://localhost:5000/health
  ```

- [ ] **Image pushed to registry**
  ```bash
  docker tag myapp:test registry.i80.dk/gitea/myapp:latest
  docker push registry.i80.dk/gitea/myapp:latest
  ```

### Nomad Job Configuration

- [ ] **Job file created**
  - Copy from `nomad-job-complete.hcl.tmpl`
  - Replace `[[PROJECT_NAME]]` with actual name
  - Replace `[[PORT]]` with app port (usually 5000)
  - Update resource limits (CPU/memory)

- [ ] **Health check configured**
  - Uses named port, not hardcoded
  ```hcl
  check {
    port = "http"  # NOT "5000"!
  }
  ```

- [ ] **Traefik tags correct**
  - Domain matches expected URL
  ```hcl
  "traefik.http.routers.myapp.rule=Host(`myapp.i80.dk`)"
  ```

- [ ] **Volumes declared** (if needed)
  - Volume source matches Autobox config
  - Mount path correct
  - Permissions considered

- [ ] **Secrets configured**
  - Using chosen workaround method
  - Environment variables OR
  - File-based secrets OR
  - Consul KV

- [ ] **Job validates**
  - No syntax errors
  ```bash
  nomad job validate nomad-job.hcl
  ```

### Autobox Configuration

- [ ] **Volumes created** (if needed)
  ```bash
  # Run on Autobox
  sudo ./setup-nomad-volumes.sh myapp
  ```

- [ ] **Volumes show in agent-info**
  ```bash
  nomad agent-info | grep myapp-data
  ```

- [ ] **Secrets file created** (if using file-based secrets)
  ```bash
  sudo vim /opt/nomad-secrets/myapp/secrets.env
  ```

- [ ] **Permissions correct**
  ```bash
  ls -la /opt/nomad-volumes/myapp-data
  # Should be 1000:1000
  ```

### Gitea CI/CD (if using)

- [ ] **Workflow file created**
  - Copy from `main.yml.tmpl`
  - Replace `[[PROJECT_NAME]]` everywhere
  - Registry credentials configured

- [ ] **Secrets configured**
  - In Gitea repository settings
  - `secrets.username` - Registry username
  - `secrets.password` - Registry password

- [ ] **Self-hosted runner**
  - Has necessary access
  - Docker installed
  - Nomad CLI installed
  - SSH access to Nomad server

## Deployment

### Initial Deployment

- [ ] **Job submitted**
  ```bash
  nomad job run nomad-job.hcl
  ```

- [ ] **Allocation running**
  ```bash
  nomad job status myapp
  # Should show: Running = 1
  ```

- [ ] **No errors in logs**
  ```bash
  # <alloc-id> is the allocation ID shown by `nomad job status myapp`
  nomad alloc logs -f <alloc-id> myapp-task
  ```

### Consul Registration

- [ ] **Service registered**
  ```bash
  consul catalog service myapp
  ```

- [ ] **Service healthy**
  ```bash
  consul catalog service myapp
  # Look for: Checks: http_health: passing
  ```

- [ ] **Tags correct**
  ```bash
  consul catalog service myapp
  # Verify traefik tags present
  ```

### DNS & Access

- [ ] **DNS record created**
  - Check consul-template output
  ```bash
  grep myapp /certs/consul/trinity_powerdns_records.txt
  ```

- [ ] **Nginx config generated**
  ```bash
  grep myapp /certs/consul-nginx/conf.d/services.conf
  ```

- [ ] **Nginx reloaded**
  - Check watcher logs
  ```bash
  tail -f /var/log/nginx_restater.log
  ```

- [ ] **Service accessible**
  - Test public URL
  ```bash
  curl https://myapp.i80.dk
  curl https://myapp.i80.dk/health
  ```

## Post-Deployment

### Verification

- [ ] **Health check passing**
  - For at least 5 minutes
  ```bash
  watch -n 5 'consul catalog service myapp'
  ```

- [ ] **No restarts**
  - Allocation stable
  ```bash
  nomad alloc status <alloc-id>
  # Check "Recent Events" - no restarts
  ```

- [ ] **Logs clean**
  - No errors or warnings
  ```bash
  nomad alloc logs -f <alloc-id> myapp-task
  ```

- [ ] **Performance acceptable**
  - Response time < 1s
  - Memory usage stable
  - CPU usage reasonable

### Monitoring

- [ ] **Metrics accessible**
  - If implemented
  ```bash
  curl https://myapp.i80.dk/metrics
  ```

- [ ] **Logs searchable**
  - Can find application logs
  ```bash
  nomad alloc logs -f <alloc-id> myapp-task | grep ERROR
  ```

- [ ] **Alerts configured**
  - If using monitoring system
  - Health check failures
  - High error rate
  - High memory usage

### Documentation

- [ ] **Service documented**
  - In team wiki/docs
  - What it does
  - Where it's deployed
  - How to access it
  - Known issues

- [ ] **Runbook created**
  - For operational issues
  - How to restart
  - How to check logs
  - Common troubleshooting steps
  - Escalation path

- [ ] **Secrets documented**
  - Where they're stored
  - Which Consul KV keys
  - Which files on Autobox
  - Who has access

## Rollback Plan

- [ ] **Previous version tagged**
  - In case of issues
  ```bash
  docker tag myapp:latest myapp:stable
  ```

- [ ] **Rollback tested**
  - Know how to revert
  ```bash
  # Update job file to use :stable tag
  # nomad job run nomad-job.hcl
  ```

- [ ] **Data backup**
  - Before first deployment
  ```bash
  # If using volumes
  sudo tar -czf /backup/myapp-data.tar.gz /opt/nomad-volumes/myapp-data
  ```

## Common Issues Checklist

If deployment fails, check:

- [ ] Is `/health` endpoint implemented and returning 200?
- [ ] Is app binding to `0.0.0.0` (not `127.0.0.1`)?
- [ ] Is app reading `PORT` from environment variable?
- [ ] Are health check port references correct (no hardcoded ports)?
- [ ] Do volume paths match between Autobox and Nomad job?
- [ ] Are volume permissions correct (uid 1000)?
- [ ] Are secrets accessible (environment or files)?
- [ ] Is Docker image pulling successfully?
- [ ] Is allocation getting scheduled (not pending)?
- [ ] Are there port conflicts?

## Quick Debugging Commands

```bash
# Service status
consul catalog service myapp
nomad job status myapp

# Allocation details
# Match the Status column of the allocation table; a plain `grep running`
# would hit the job-level "Status = running" line first.
ALLOC_ID=$(nomad job status myapp | awk '$6 == "running" {print $1; exit}')
nomad alloc status "$ALLOC_ID"

# Logs
nomad alloc logs -f "$ALLOC_ID" myapp-task
nomad alloc logs -stderr -f "$ALLOC_ID" myapp-task

# Exec into container
nomad alloc exec -i -t "$ALLOC_ID" /bin/sh

# Health check test
PORT=$(nomad alloc status "$ALLOC_ID" | grep "Port.*http" | awk '{print $3}' | cut -d':' -f2)
curl "http://192.168.15.124:$PORT/health"

# Restart
nomad job restart myapp

# Force reschedule
nomad job stop -purge myapp
nomad job run nomad-job.hcl
```

---

**Print this checklist and use it for every deployment until the process becomes second nature!**
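
For reference, the Application Requirements items (health endpoint, `PORT` from the environment, binding to `0.0.0.0`, SIGTERM handling) can be sketched in one small program. This is a minimal illustration using only the Python standard library rather than whatever framework sits behind `app.run` in the checklist; the names `HealthHandler`, `make_server`, and `install_sigterm_handler` are hypothetical and not part of any existing template.

```python
import json
import os
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers /health from memory so the check never waits on external services."""

    def do_GET(self):
        if self.path == "/health":
            status, payload = 200, {"status": "ok"}
        else:
            status, payload = 404, {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Suppress the default access log; a real app would log structured JSON to stdout.
        pass


def make_server(port=None):
    """Build a server bound to 0.0.0.0 on PORT (default 5000, as in the checklist)."""
    if port is None:
        port = int(os.getenv("PORT", "5000"))
    server = ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
    # Non-daemon request threads let server_close() wait for in-flight requests.
    server.daemon_threads = False
    return server


def install_sigterm_handler(server):
    """On SIGTERM, stop accepting connections so serve_forever() returns."""

    def handle_sigterm(signum, frame):
        # shutdown() blocks until serve_forever() exits, so run it on another thread.
        threading.Thread(target=server.shutdown, daemon=True).start()

    signal.signal(signal.SIGTERM, handle_sigterm)


if __name__ == "__main__":
    server = make_server()
    install_sigterm_handler(server)
    try:
        server.serve_forever()
    finally:
        server.server_close()  # waits for requests still being handled
```

Running it with `PORT=5000` and then `curl http://localhost:5000/health` should return the JSON status body. Sending SIGTERM to the process lets `serve_forever()` return and the process exit once current requests finish. The 30-second limit from the checklist is not enforced here and would need a separate timeout.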