Deployment Checklist
Use this checklist when deploying a new service to ensure you don't miss critical steps.
Pre-Deployment
Application Requirements
- Health endpoint implemented
  - /health returns 200 OK
  - Returns JSON with status
  - Responds quickly (<500ms)
  - Doesn't block on external services
- Port configuration
  - App reads PORT from environment
    PORT = int(os.getenv('PORT', 5000))
    app.run(host='0.0.0.0', port=PORT)
- Graceful shutdown
  - App handles SIGTERM signal
  - Closes connections cleanly
  - Finishes current requests
  - Exits within 30 seconds
- Logging configured
  - Uses stdout/stderr
  - Structured logging (JSON preferred)
  - Includes timestamps
  - No log files (Nomad captures stdout)
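The application requirements above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual service: the names JsonFormatter, make_logger, HealthHandler, AppServer, make_server and main are hypothetical, and a real app would likely use its own web framework.

```python
import json
import logging
import os
import signal
import sys
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a UTC timestamp."""

    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        })


def make_logger(stream=sys.stdout):
    """Log to stdout only (no log files), one JSON object per line."""
    logger = logging.getLogger("myapp")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(logging.INFO)
    return logger


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Answer from memory; never call external services here.
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # suppress the default unstructured access log on stderr


class AppServer(ThreadingHTTPServer):
    # Non-daemon handler threads, so server_close() waits for in-flight requests.
    daemon_threads = False


def make_server(port):
    # Bind to 0.0.0.0 so the port is reachable from outside the container.
    return AppServer(("0.0.0.0", port), HealthHandler)


def main():
    log = make_logger()
    port = int(os.getenv("PORT", "5000"))
    server = make_server(port)

    def on_sigterm(signum, frame):
        # shutdown() blocks until serve_forever() returns, so it must run
        # in a different thread than the one executing serve_forever().
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, on_sigterm)
    log.info("listening on port %d", port)
    server.serve_forever()
    server.server_close()
    log.info("shutdown complete")
```

Call main() from the container entry point. The sketch does not enforce the 30-second exit limit itself; it assumes requests are short enough to finish well within that window.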
Docker Image
- Dockerfile complete
  - Based on Dockerfile.complete
  - Multi-stage build (smaller image)
  - Non-root user (uid 1000)
  - Health check defined
  - Minimal base image
- Image tested locally
    docker build -t myapp:test .
    docker run -p 5000:5000 -e PORT=5000 myapp:test
    curl http://localhost:5000/health
- Image pushed to registry
    docker tag myapp:test registry.i80.dk/gitea/myapp:latest
    docker push registry.i80.dk/gitea/myapp:latest
Nomad Job Configuration
- Job file created
  - Copy from nomad-job-complete.hcl.tmpl
  - Replace [[PROJECT_NAME]] with actual name
  - Replace [[PORT]] with app port (usually 5000)
  - Update resource limits (CPU/memory)
- Health check configured
  - Uses named port, not hardcoded
    check {
      port = "http"  # NOT "5000"!
    }
- Traefik tags correct
  - Domain matches expected URL
    "traefik.http.routers.myapp.rule=Host(`myapp.i80.dk`)"
- Volumes declared (if needed)
  - Volume source matches Autobox config
  - Mount path correct
  - Permissions considered
- Secrets configured
  - Using chosen workaround method:
    - Environment variables OR
    - File-based secrets OR
    - Consul KV
- Job validates
  - No syntax errors
    nomad job validate nomad-job.hcl
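The placeholder replacement in the "Job file created" step can be scripted so that no [[...]] marker is left behind by accident. The sketch below is one possible approach; render_template is a hypothetical helper, and the inline template is a made-up stand-in, not the real contents of nomad-job-complete.hcl.tmpl.

```python
import re

# Matches markers such as [[PROJECT_NAME]] or [[PORT]].
PLACEHOLDER = re.compile(r"\[\[([A-Z_]+)\]\]")


def render_template(text, values):
    """Replace every [[NAME]] marker; fail loudly if one has no value."""
    missing = sorted({name for name in PLACEHOLDER.findall(text) if name not in values})
    if missing:
        raise KeyError("no value for placeholders: " + ", ".join(missing))
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), text)


# Hypothetical stand-in for the template file, for illustration only.
template = 'job "[[PROJECT_NAME]]" {\n  # app port: [[PORT]]\n}\n'
rendered = render_template(template, {"PROJECT_NAME": "myapp", "PORT": 5000})
print(rendered)
```

In practice the template text would be read from the copied .tmpl file and the result written to nomad-job.hcl before running the validate step.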
Autobox Configuration
- Volumes created (if needed)
    # Run on Autobox
    sudo ./setup-nomad-volumes.sh myapp
- Volumes show in agent-info
    nomad agent-info | grep myapp-data
- Secrets file created (if using file-based secrets)
    sudo vim /opt/nomad-secrets/myapp/secrets.env
- Permissions correct
    ls -la /opt/nomad-volumes/myapp-data
    # Should be 1000:1000
Gitea CI/CD (if using)
- Workflow file created
  - Copy from main.yml.tmpl
  - Replace [[PROJECT_NAME]] everywhere
  - Registry credentials configured
- Secrets configured
  - In Gitea repository settings
    - secrets.username - Registry username
    - secrets.password - Registry password
- Self-hosted runner
  - Has necessary access
  - Docker installed
  - Nomad CLI installed
  - SSH access to Nomad server
Deployment
Initial Deployment
- Job submitted
    nomad job run nomad-job.hcl
- Allocation running
    nomad job status myapp
    # Should show: Running = 1
- No errors in logs
    nomad alloc logs -f <alloc-id> myapp-task
Consul Registration
- Service registered
    consul catalog service myapp
- Service healthy
    consul catalog service myapp
    # Look for: Checks: http_health: passing
- Tags correct
    consul catalog service myapp
    # Verify traefik tags present
DNS & Access
- DNS record created
  - Check consul-template output
    cat /certs/consul/trinity_powerdns_records.txt | grep myapp
- Nginx config generated
    grep myapp /certs/consul-nginx/conf.d/services.conf
- Nginx reloaded
  - Check watcher logs
    tail -f /var/log/nginx_restater.log
- Service accessible
  - Test public URL
    curl https://myapp.i80.dk
    curl https://myapp.i80.dk/health
Post-Deployment
Verification
- Health check passing
  - For at least 5 minutes
    watch -n 5 'consul catalog service myapp'
- No restarts
  - Allocation stable
    nomad alloc status <alloc-id>
    # Check "Recent Events" - no restarts
- Logs clean
  - No errors or warnings
    nomad alloc logs -f <alloc-id> myapp-task
- Performance acceptable
  - Response time < 1s
  - Memory usage stable
  - CPU usage reasonable
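The "passing for at least 5 minutes" and "response time < 1s" checks can also be automated against the app's own /health endpoint. The sketch below is a complement to the watch command above, not a replacement for the Consul check; check_health and watch_health are hypothetical helper names, and the default values mirror the thresholds in this checklist.

```python
import time
import urllib.request


def check_health(url, timeout=5.0):
    """Return (ok, seconds) for a single GET; ok means HTTP 200."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except OSError:
        # URLError, HTTPError (non-2xx responses) and socket timeouts
        # are all subclasses of OSError.
        ok = False
    return ok, time.monotonic() - start


def watch_health(url, duration=300.0, interval=5.0, max_latency=1.0):
    """Poll url for about `duration` seconds.

    Returns True only if every probe returned 200 within max_latency.
    """
    deadline = time.monotonic() + duration
    while True:
        ok, elapsed = check_health(url)
        if not ok or elapsed > max_latency:
            return False
        if time.monotonic() + interval > deadline:
            return True
        time.sleep(interval)
```

For example, watch_health("https://myapp.i80.dk/health") would return False as soon as one probe fails or takes longer than a second, which makes it usable as a post-deploy gate in a script.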
Monitoring
- Metrics accessible
  - If implemented
    curl https://myapp.i80.dk/metrics
- Logs searchable
  - Can find application logs
    nomad alloc logs -f <alloc-id> myapp-task | grep ERROR
- Alerts configured
  - If using monitoring system
  - Health check failures
  - High error rate
  - High memory usage
Documentation
- Service documented
  - In team wiki/docs
  - What it does
  - Where it's deployed
  - How to access it
  - Known issues
- Runbook created
  - For operational issues
  - How to restart
  - How to check logs
  - Common troubleshooting steps
  - Escalation path
- Secrets documented
  - Where they're stored
  - Which Consul KV keys
  - Which files on Autobox
  - Who has access
Rollback Plan
- Previous version tagged
  - In case of issues
    docker tag myapp:latest myapp:stable
- Rollback tested
  - Know how to revert
    # Update job file to use :stable tag
    # nomad job run nomad-job.hcl
- Data backup
  - Before first deployment
    # If using volumes
    sudo tar -czf /backup/myapp-data.tar.gz /opt/nomad-volumes/myapp-data
Common Issues Checklist
If deployment fails, check:
- Is /health endpoint implemented and returning 200?
- Is app binding to 0.0.0.0 (not 127.0.0.1)?
- Is app reading PORT from environment variable?
- Are health check port references correct (no hardcoded ports)?
- Do volume paths match between Autobox and Nomad job?
- Are volume permissions correct (uid 1000)?
- Are secrets accessible (environment or files)?
- Is Docker image pulling successfully?
- Is allocation getting scheduled (not pending)?
- Are there port conflicts?
Quick Debugging Commands
# Service status
consul catalog service myapp
nomad job status myapp
# Allocation details
ALLOC_ID=$(nomad job status myapp | grep running | head -1 | awk '{print $1}')
nomad alloc status $ALLOC_ID
# Logs
nomad alloc logs -f $ALLOC_ID myapp-task
nomad alloc logs -stderr -f $ALLOC_ID myapp-task
# Exec into container
nomad alloc exec -i -t $ALLOC_ID /bin/sh
# Health check test
PORT=$(nomad alloc status $ALLOC_ID | grep "Port.*http" | awk '{print $3}' | cut -d':' -f2)
curl http://192.168.15.124:$PORT/health
# Restart
nomad job restart myapp
# Force reschedule
nomad job stop -purge myapp
nomad job run nomad-job.hcl
Print this checklist and use it for every deployment until the process becomes second nature!