2025-11-28 23:21:07 +01:00
# Deployment Checklist
Use this checklist when deploying a new service to ensure you don't miss critical steps.
## Pre-Deployment
### Application Requirements
- [ ] **Health endpoint implemented** - `/health` returns 200 OK
- Returns JSON with status
- Responds quickly (<500ms)
- Doesn't block on external services
- [ ] **Port configuration** - App reads `PORT` from environment
```python
import os

PORT = int(os.getenv('PORT', 5000))  # fall back to 5000 when PORT is unset
app.run(host='0.0.0.0', port=PORT)   # bind to all interfaces, not 127.0.0.1
```
- [ ] **Graceful shutdown** - App handles SIGTERM signal
- Closes connections cleanly
- Finishes current requests
- Exits within 30 seconds
- [ ] **Logging configured** - Uses stdout/stderr
- Structured logging (JSON preferred)
- Includes timestamps
- No log files (Nomad captures stdout)
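The four requirements above can be sketched together in one small service. This is a minimal illustration using only the Python standard library, not the template's actual application code; the names `JsonFormatter`, `HealthHandler`, `build_server`, and `main` are hypothetical, and a Flask or FastAPI app would wire the same pieces differently.

```python
import json
import logging
import os
import signal
import sys
import threading
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        stamp = datetime.fromtimestamp(record.created, timezone.utc)
        return json.dumps({
            "timestamp": stamp.isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        })


def configure_logging():
    """Send structured logs to stdout only; no log files."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        # Answer from local state only; no calls to external services.
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Route access logs through the logging module instead of raw stderr.
        logging.info(fmt % args)


def build_server(port):
    server = ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
    # Non-daemon request threads are joined by server_close(),
    # so in-flight requests finish before the process exits.
    server.daemon_threads = False
    return server


def main():
    configure_logging()
    port = int(os.getenv("PORT", "5000"))
    server = build_server(port)

    def handle_sigterm(signum, frame):
        logging.info("SIGTERM received, shutting down")
        # shutdown() blocks until serve_forever() returns, so it must run
        # in a different thread from the one that is serving.
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, handle_sigterm)
    logging.info("listening on port %d", port)
    server.serve_forever()
    server.server_close()
```

Calling `main()` starts the server. Sending the process a SIGTERM (for example with `kill <pid>`) should make it log the shutdown message and exit once open requests have completed.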
### Docker Image
- [ ] **Dockerfile complete** - Based on `Dockerfile.complete`
- Multi-stage build (smaller image)
- Non-root user (uid 1000)
- Health check defined
- Minimal base image
- [ ] **Image tested locally**
```bash
docker build -t myapp:test .
docker run -p 5000:5000 -e PORT=5000 myapp:test
curl http://localhost:5000/health
```
- [ ] **Image pushed to registry**
```bash
docker tag myapp:test registry.i80.dk/gitea/myapp:latest
docker push registry.i80.dk/gitea/myapp:latest
```
### Nomad Job Configuration
- [ ] **Job file created** - Copy from `nomad-job-complete.hcl.tmpl`
- Replace `[[PROJECT_NAME]]` with actual name
- Replace `[[PORT]]` with app port (usually 5000)
- Update resource limits (CPU/memory)
- [ ] **Health check configured** - Uses named port, not hardcoded
```hcl
check {
port = "http" # NOT "5000"!
}
```
- [ ] **Traefik tags correct** - Domain matches expected URL
```hcl
"traefik.http.routers.myapp.rule=Host(`myapp.i80.dk`)"
```
- [ ] **Volumes declared** (if needed)
- Volume source matches Autobox config
- Mount path correct
- Permissions considered
- [ ] **Secrets configured** - Using chosen workaround method
- Environment variables OR
- File-based secrets OR
- Consul KV
- [ ] **Job validates** - No syntax errors
```bash
nomad job validate nomad-job.hcl
```
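Replacing the template placeholders by hand is easy to get wrong, so the substitution step can be scripted. The sketch below assumes the template only uses the `[[NAME]]` placeholder style shown in this checklist; `render_template` and `render_job_file` are hypothetical helper names, not part of the template project.

```python
import re
from pathlib import Path

# Matches placeholders such as [[PROJECT_NAME]] or [[PORT]].
PLACEHOLDER = re.compile(r"\[\[([A-Z_]+)\]\]")


def render_template(text, values):
    """Replace every [[NAME]] placeholder; raise KeyError if a value is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder [[{name}]]")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, text)


def render_job_file(template_path, output_path, values):
    """Read the template, fill in the placeholders, and write the job file."""
    text = Path(template_path).read_text()
    Path(output_path).write_text(render_template(text, values))
```

For example, `render_job_file("nomad-job-complete.hcl.tmpl", "nomad-job.hcl", {"PROJECT_NAME": "myapp", "PORT": 5000})` would produce a job file that can then be passed to `nomad job validate`. Failing on an unknown placeholder is a deliberate choice: a leftover `[[...]]` in the job file is harder to diagnose later.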
### Autobox Configuration
- [ ] **Volumes created** (if needed)
```bash
# Run on Autobox
sudo ./setup-nomad-volumes.sh myapp
```
- [ ] **Volumes registered on the client** - Listed under "Host Volumes"
```bash
# Run on Autobox
nomad node status -self -verbose | grep myapp-data
```
- [ ] **Secrets file created** (if using file-based secrets)
```bash
sudo vim /opt/nomad-secrets/myapp/secrets.env
```
- [ ] **Permissions correct**
```bash
ls -la /opt/nomad-volumes/myapp-data # Should be 1000:1000
```
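The ownership check can also be automated rather than read off `ls -la` output. This is a small sketch assuming the container runs as uid and gid 1000, as stated in the Docker Image section; `check_owner` is a hypothetical helper name.

```python
import os


def check_owner(path, uid=1000, gid=1000):
    """Return a list of problems; an empty list means ownership matches."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]

    problems = []
    if info.st_uid != uid:
        problems.append(f"{path} is owned by uid {info.st_uid}, expected {uid}")
    if info.st_gid != gid:
        problems.append(f"{path} has gid {info.st_gid}, expected {gid}")
    return problems
```

Running `check_owner("/opt/nomad-volumes/myapp-data")` on Autobox should return an empty list when the volume is ready for the non-root container user.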
### Gitea CI/CD (if using)
- [ ] **Workflow file created** - Copy from `main.yml.tmpl`
- Replace `[[PROJECT_NAME]]` everywhere
- Registry credentials configured
- [ ] **Secrets configured** - In Gitea repository settings
- `secrets.username` - Registry username
- `secrets.password` - Registry password
- [ ] **Self-hosted runner** - Has necessary access
- Docker installed
- Nomad CLI installed
- SSH access to Nomad server
## Deployment
### Initial Deployment
- [ ] **Job submitted**
```bash
nomad job run nomad-job.hcl
```
- [ ] **Allocation running**
```bash
nomad job status myapp
# Should show: Running = 1
```
- [ ] **No errors in logs**
```bash
nomad alloc logs -f <alloc-id> myapp-task
```
### Consul Registration
- [ ] **Service registered**
```bash
consul catalog services | grep myapp
```
- [ ] **Service healthy**
```bash
# Assumes a Consul agent on the default local address
curl -s http://localhost:8500/v1/health/checks/myapp
# Look for "Status": "passing" on every check
```
- [ ] **Tags correct**
```bash
consul catalog services -tags | grep myapp
# Verify traefik tags present
```
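The registration and health checks can be scripted against Consul's HTTP API, which returns one JSON object per check from `/v1/health/checks/<service>`. The sketch below assumes a Consul agent reachable at `http://localhost:8500` (the default address, not confirmed by this checklist); `failing_checks` and `service_is_healthy` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: a local Consul agent listening on the default HTTP port.
CONSUL_ADDR = "http://localhost:8500"


def failing_checks(checks):
    """Return the names of checks whose Status is anything but 'passing'."""
    return [
        check.get("Name", check.get("CheckID", "unknown"))
        for check in checks
        if check.get("Status") != "passing"
    ]


def service_is_healthy(service, consul_addr=CONSUL_ADDR):
    """True when the service has at least one check and all of them pass."""
    url = f"{consul_addr}/v1/health/checks/{service}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        checks = json.load(resp)
    # An empty list usually means the service is not registered at all.
    return bool(checks) and not failing_checks(checks)
```

`service_is_healthy("myapp")` treats an empty check list as unhealthy, so a service that never registered is reported as a failure rather than silently passing.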
### DNS & Access
- [ ] **DNS record created** - Check consul-template output
```bash
grep myapp /certs/consul/trinity_powerdns_records.txt
```
- [ ] **Nginx config generated**
```bash
grep myapp /certs/consul-nginx/conf.d/services.conf
```
- [ ] **Nginx reloaded** - Check watcher logs
```bash
tail -f /var/log/nginx_restater.log
```
- [ ] **Service accessible** - Test public URL
```bash
curl https://myapp.i80.dk
curl https://myapp.i80.dk/health
```
## Post-Deployment
### Verification
- [ ] **Health check passing** - For at least 5 minutes
```bash
watch -n 5 'curl -s http://localhost:8500/v1/health/checks/myapp'
```
- [ ] **No restarts** - Allocation stable
```bash
nomad alloc status <alloc-id>
# Check "Recent Events" - no restarts
```
- [ ] **Logs clean** - No errors or warnings
```bash
nomad alloc logs -f <alloc-id> myapp-task
```
- [ ] **Performance acceptable**
- Response time < 1s
- Memory usage stable
- CPU usage reasonable
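The "passing for at least 5 minutes" and "response time < 1s" criteria can be checked by polling the health endpoint directly. This is a rough sketch with hypothetical names (`fetch_status`, `watch_health`); the defaults of 300 seconds, a 5 second interval, and a 1 second latency limit mirror the numbers in this checklist.

```python
import time
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return (HTTP status, elapsed seconds) for one GET request."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        return resp.status, time.monotonic() - start


def watch_health(url, duration=300, interval=5, max_latency=1.0,
                 fetch=fetch_status, sleep=time.sleep):
    """Poll url for roughly `duration` seconds; return failure descriptions."""
    failures = []
    for attempt in range(max(1, duration // interval)):
        try:
            status, elapsed = fetch(url)
        except OSError as exc:
            # URLError, HTTPError, and socket timeouts are all OSError subclasses.
            failures.append(f"attempt {attempt}: request failed ({exc})")
        else:
            if status != 200:
                failures.append(f"attempt {attempt}: HTTP {status}")
            elif elapsed > max_latency:
                failures.append(f"attempt {attempt}: slow response ({elapsed:.2f}s)")
        sleep(interval)
    return failures
```

An empty list from `watch_health("https://myapp.i80.dk/health")` suggests the service stayed healthy and responsive for the whole window. The `fetch` and `sleep` parameters exist so the loop can be exercised without real network calls or waiting.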
### Monitoring
- [ ] **Metrics accessible** - If implemented
```bash
curl https://myapp.i80.dk/metrics
```
- [ ] **Logs searchable** - Can find application logs
```bash
nomad alloc logs <alloc-id> myapp-task | grep ERROR
```
- [ ] **Alerts configured** - If using monitoring system
- Health check failures
- High error rate
- High memory usage
### Documentation
- [ ] **Service documented** - In team wiki/docs
- What it does
- Where it's deployed
- How to access it
- Known issues
- [ ] **Runbook created** - For operational issues
- How to restart
- How to check logs
- Common troubleshooting steps
- Escalation path
- [ ] **Secrets documented** - Where they're stored
- Which Consul KV keys
- Which files on Autobox
- Who has access
## Rollback Plan
- [ ] **Previous version tagged** - In case of issues
```bash
docker tag registry.i80.dk/gitea/myapp:latest registry.i80.dk/gitea/myapp:stable
docker push registry.i80.dk/gitea/myapp:stable
```
- [ ] **Rollback tested** - Know how to revert
```bash
# Update job file to use :stable tag
# nomad job run nomad-job.hcl
```
- [ ] **Data backup** - Before first deployment
```bash
# If using volumes
sudo tar -czf /backup/myapp-data.tar.gz /opt/nomad-volumes/myapp-data
```
## Common Issues Checklist
If deployment fails, check:
- [ ] Is `/health` endpoint implemented and returning 200?
- [ ] Is app binding to `0.0.0.0` (not `127.0.0.1`)?
- [ ] Is app reading `PORT` from environment variable?
- [ ] Are health check port references correct (no hardcoded ports)?
- [ ] Do volume paths match between Autobox and Nomad job?
- [ ] Are volume permissions correct (uid 1000)?
- [ ] Are secrets accessible (environment or files)?
- [ ] Is Docker image pulling successfully?
- [ ] Is allocation getting scheduled (not pending)?
- [ ] Are there port conflicts?
## Quick Debugging Commands
```bash
# Service status
consul catalog services -tags | grep myapp
nomad job status myapp
# Allocation details
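# ID of the first allocation whose client status is "running"
ALLOC_ID=$(nomad job allocs -t '{{range .}}{{if eq .ClientStatus "running"}}{{println .ID}}{{end}}{{end}}' myapp | head -1)
nomad alloc status "$ALLOC_ID"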
# Logs
nomad alloc logs -f $ALLOC_ID myapp-task
nomad alloc logs -stderr -f $ALLOC_ID myapp-task
# Exec into container
nomad alloc exec -i -t $ALLOC_ID /bin/sh
# Health check test
PORT=$(nomad alloc status $ALLOC_ID | grep "Port.*http" | awk '{print $3}' | cut -d':' -f2)
curl http://192.168.15.124:$PORT/health
# Restart
nomad job restart myapp
# Force reschedule
nomad job stop -purge myapp
nomad job run nomad-job.hcl
```
---
**Print this checklist and use it for every deployment until the process becomes second nature!**