# Deployment Checklist

Use this checklist when deploying a new service to ensure you don't miss critical steps.

## Pre-Deployment

### Application Requirements

- [ ] **Health endpoint implemented** - `/health` returns 200 OK
  - Returns JSON with status
  - Responds quickly (<500ms)
  - Doesn't block on external services

- [ ] **Port configuration** - App reads `PORT` from environment
  ```python
  import os

  PORT = int(os.getenv('PORT', '5000'))
  app.run(host='0.0.0.0', port=PORT)
  ```

- [ ] **Graceful shutdown** - App handles SIGTERM signal
  - Closes connections cleanly
  - Finishes current requests
  - Exits within 30 seconds

- [ ] **Logging configured** - Uses stdout/stderr
  - Structured logging (JSON preferred)
  - Includes timestamps
  - No log files (Nomad captures stdout and stderr)

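The requirements above can be sketched in one small program. This is a minimal illustration using only the Python standard library, not the service's actual code: the names `HealthHandler`, `GracefulServer`, `make_server` and `main` are hypothetical, and a real app would most likely use its own web framework.

```python
import json
import os
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers /health from memory, without calling external services."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        # One JSON object per line on stdout, with a timestamp.
        record = {"time": self.log_date_time_string(), "message": fmt % args}
        print(json.dumps(record), flush=True)


class GracefulServer(ThreadingHTTPServer):
    # Non-daemon request threads: server_close() waits for in-flight requests.
    daemon_threads = False


def make_server(port=None):
    """Bind to 0.0.0.0 on PORT from the environment (default 5000)."""
    if port is None:
        port = int(os.getenv("PORT", "5000"))
    return GracefulServer(("0.0.0.0", port), HealthHandler)


def main():
    server = make_server()

    def on_sigterm(signum, frame):
        # shutdown() blocks until serve_forever() returns, so it must run
        # in a different thread than the one serving requests.
        threading.Thread(target=server.shutdown).start()

    signal.signal(signal.SIGTERM, on_sigterm)
    server.serve_forever()
    server.server_close()
```

Calling `main()` starts the server; sending SIGTERM to the process stops accepting new requests and lets running ones finish. The sketch does not enforce the 30-second exit limit itself.
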
### Docker Image

- [ ] **Dockerfile complete** - Based on `Dockerfile.complete`
  - Multi-stage build (smaller image)
  - Non-root user (uid 1000)
  - Health check defined
  - Minimal base image

- [ ] **Image tested locally**
  ```bash
  docker build -t myapp:test .
  docker run -p 5000:5000 -e PORT=5000 myapp:test
  # In a second terminal, while the container is running:
  curl http://localhost:5000/health
  ```

- [ ] **Image pushed to registry**
  ```bash
  docker tag myapp:test registry.i80.dk/gitea/myapp:latest
  docker push registry.i80.dk/gitea/myapp:latest
  ```

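When testing the image locally, the container may need a moment before `/health` answers. The following is a rough sketch of a polling helper for that step; `wait_for_health` and its default timings are hypothetical choices, not part of the existing tooling.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, timeout=30.0, interval=1.0):
    """Poll url until it returns HTTP 200 or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not accepting connections yet, or a non-2xx answer
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For the local test above, `wait_for_health("http://localhost:5000/health")` would return `True` once the container responds.
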
### Nomad Job Configuration

- [ ] **Job file created** - Copy from `nomad-job-complete.hcl.tmpl`
  - Replace `[[PROJECT_NAME]]` with actual name
  - Replace `[[PORT]]` with app port (usually 5000)
  - Update resource limits (CPU/memory)

- [ ] **Health check configured** - Uses named port, not hardcoded
  ```hcl
  check {
    port = "http" # NOT "5000"!
  }
  ```

- [ ] **Traefik tags correct** - Domain matches expected URL
  ```hcl
  "traefik.http.routers.myapp.rule=Host(`myapp.i80.dk`)"
  ```

- [ ] **Volumes declared** (if needed)
  - Volume source matches Autobox config
  - Mount path correct
  - Permissions considered

- [ ] **Secrets configured** - Using chosen workaround method
  - Environment variables OR
  - File-based secrets OR
  - Consul KV

- [ ] **Job validates** - No syntax errors
  ```bash
  nomad job validate nomad-job.hcl
  ```

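Replacing the `[[PROJECT_NAME]]` and `[[PORT]]` markers by hand is easy to get wrong, so a small script can do it and refuse to continue if a marker has no value. This is only a sketch: `render_job` is a hypothetical helper, and it assumes every marker is written in upper case with underscores, which may not hold for the real template.

```python
import re

# Matches markers such as [[PROJECT_NAME]] (assumed naming convention).
PLACEHOLDER = re.compile(r"\[\[([A-Z_]+)\]\]")


def render_job(template, values):
    """Replace every [[NAME]] marker with values[NAME]; fail on unknown markers."""
    missing = sorted({name for name in PLACEHOLDER.findall(template) if name not in values})
    if missing:
        raise KeyError(f"no value for placeholder(s): {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)
```

A typical call would read the `.tmpl` file, pass `{"PROJECT_NAME": "myapp", "PORT": 5000}`, write the result to `nomad-job.hcl`, and then run `nomad job validate` on it.
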
### Autobox Configuration

- [ ] **Volumes created** (if needed)
  ```bash
  # Run on Autobox
  sudo ./setup-nomad-volumes.sh myapp
  ```

- [ ] **Volumes show in node status**
  ```bash
  # Run on Autobox; host volumes are listed in the verbose node status
  nomad node status -self -verbose | grep myapp-data
  ```

- [ ] **Secrets file created** (if using file-based secrets)
  ```bash
  sudo vim /opt/nomad-secrets/myapp/secrets.env
  ```

- [ ] **Permissions correct**
  ```bash
  ls -lan /opt/nomad-volumes/myapp-data # Owner and group should be 1000 1000
  ```

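The ownership check can also be scripted so that files deeper in the volume are covered, not just the top-level directory. The helper below is a hypothetical sketch; it assumes the container user is uid/gid 1000, as stated in the Docker Image section.

```python
import os


def wrong_owner(root, uid=1000, gid=1000):
    """Yield paths under root (including root) not owned by uid:gid."""
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then every file inside it.
        for path in [dirpath] + [os.path.join(dirpath, name) for name in filenames]:
            st = os.lstat(path)
            if (st.st_uid, st.st_gid) != (uid, gid):
                yield path
```

An empty result from `list(wrong_owner("/opt/nomad-volumes/myapp-data"))` means every directory and file under the volume has the expected owner.
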
### Gitea CI/CD (if using)

- [ ] **Workflow file created** - Copy from `main.yml.tmpl`
  - Replace `[[PROJECT_NAME]]` everywhere
  - Registry credentials configured

- [ ] **Secrets configured** - In Gitea repository settings
  - `secrets.username` - Registry username
  - `secrets.password` - Registry password

- [ ] **Self-hosted runner** - Has necessary access
  - Docker installed
  - Nomad CLI installed
  - SSH access to Nomad server

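The runner prerequisites can be checked before the first workflow run. This sketch only tests that the executables are on `PATH`; it does not verify registry credentials or that SSH access to the Nomad server actually works. `missing_tools` is a hypothetical name.

```python
import shutil


def missing_tools(tools=("docker", "nomad", "ssh")):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

On a correctly prepared runner, `missing_tools()` returns an empty list.
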
## Deployment

### Initial Deployment

- [ ] **Job submitted**
  ```bash
  nomad job run nomad-job.hcl
  ```

- [ ] **Allocation running**
  ```bash
  nomad job status myapp
  # Should show: Running = 1
  ```

- [ ] **No errors in logs**
  ```bash
  nomad alloc logs -f <alloc-id> myapp-task
  ```

### Consul Registration

- [ ] **Service registered**
  ```bash
  consul catalog services | grep myapp
  ```

- [ ] **Service healthy**
  ```bash
  # Assumes the Consul agent listens on its default HTTP address
  curl -s http://127.0.0.1:8500/v1/health/checks/myapp
  # Look for: "Status": "passing"
  ```

- [ ] **Tags correct**
  ```bash
  consul catalog services -tags | grep myapp
  # Verify traefik tags present
  ```

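The health and tag checks above can be combined by reading Consul's `/v1/health/service/<name>` HTTP endpoint, which returns one entry per service instance with its `Service` and `Checks` data. The sketch below assumes the agent is reachable at `http://127.0.0.1:8500`; `summarize` and `fetch` are hypothetical helper names.

```python
import json
import urllib.request


def summarize(entries):
    """Reduce a /v1/health/service/<name> response to health and tags per instance."""
    summary = []
    for entry in entries:
        service = entry["Service"]
        checks = entry.get("Checks", [])
        summary.append({
            "id": service["ID"],
            # An instance with no checks at all counts as healthy here.
            "healthy": all(c["Status"] == "passing" for c in checks),
            "traefik_tags": [t for t in (service.get("Tags") or []) if t.startswith("traefik.")],
        })
    return summary


def fetch(service, consul="http://127.0.0.1:8500"):
    """Query the local Consul agent (address is an assumption)."""
    with urllib.request.urlopen(f"{consul}/v1/health/service/{service}", timeout=5) as resp:
        return json.load(resp)
```

`summarize(fetch("myapp"))` then gives one dictionary per instance; an instance with `healthy` set to `False` or an empty `traefik_tags` list needs attention before moving on.
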
### DNS & Access

- [ ] **DNS record created** - Check consul-template output
  ```bash
  grep myapp /certs/consul/trinity_powerdns_records.txt
  ```

- [ ] **Nginx config generated**
  ```bash
  grep myapp /certs/consul-nginx/conf.d/services.conf
  ```

- [ ] **Nginx reloaded** - Check watcher logs
  ```bash
  tail -f /var/log/nginx_restater.log
  ```

- [ ] **Service accessible** - Test public URL
  ```bash
  curl https://myapp.i80.dk
  curl https://myapp.i80.dk/health
  ```

## Post-Deployment

### Verification

- [ ] **Health check passing** - For at least 5 minutes
  ```bash
  # Assumes the Consul agent listens on its default HTTP address
  watch -n 5 'curl -s http://127.0.0.1:8500/v1/health/checks/myapp'
  ```

- [ ] **No restarts** - Allocation stable
  ```bash
  nomad alloc status <alloc-id>
  # Check "Recent Events" - no restarts
  ```

- [ ] **Logs clean** - No errors or warnings
  ```bash
  nomad alloc logs -f <alloc-id> myapp-task
  ```

- [ ] **Performance acceptable**
  - Response time < 1s
  - Memory usage stable
  - CPU usage reasonable

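The "response time < 1s" criterion can be measured rather than estimated. The following sketch times a few requests and compares the slowest one against the budget; `response_time` and `within_budget` are hypothetical helpers, and the number of samples to take is left to the caller.

```python
import time
import urllib.request


def response_time(url, timeout=5.0):
    """Return the seconds taken to fetch url, including reading the body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def within_budget(samples, budget=1.0):
    """True when there is at least one sample and all are below budget seconds."""
    return bool(samples) and max(samples) < budget
```

For example, `within_budget([response_time("https://myapp.i80.dk/health") for _ in range(5)])` checks five consecutive requests against the one-second limit.
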
### Monitoring

- [ ] **Metrics accessible** - If implemented
  ```bash
  curl https://myapp.i80.dk/metrics
  ```

- [ ] **Logs searchable** - Can find application logs
  ```bash
  nomad alloc logs -f <alloc-id> myapp-task | grep ERROR
  ```

- [ ] **Alerts configured** - If using monitoring system
  - Health check failures
  - High error rate
  - High memory usage

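If the app writes structured JSON logs, filtering on a field is more reliable than a plain `grep ERROR`. The sketch below assumes each log line is a JSON object with a `level` field; that field name is an assumption, so adjust it to whatever the application actually emits. Lines that are not JSON are kept only if they contain the text `ERROR`.

```python
import json


def error_lines(lines, levels=("ERROR", "CRITICAL")):
    """Yield log records whose level is in levels."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Plain-text line: fall back to a substring match.
            if "ERROR" in line:
                yield {"message": line}
            continue
        if isinstance(record, dict) and str(record.get("level", "")).upper() in levels:
            yield record
```

Fed with `sys.stdin`, this can sit at the end of a pipe from `nomad alloc logs <alloc-id> myapp-task`.
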
### Documentation

- [ ] **Service documented** - In team wiki/docs
  - What it does
  - Where it's deployed
  - How to access it
  - Known issues

- [ ] **Runbook created** - For operational issues
  - How to restart
  - How to check logs
  - Common troubleshooting steps
  - Escalation path

- [ ] **Secrets documented** - Where they're stored
  - Which Consul KV keys
  - Which files on Autobox
  - Who has access

## Rollback Plan

- [ ] **Previous version tagged** - In case of issues
  ```bash
  docker tag myapp:latest myapp:stable
  ```

- [ ] **Rollback tested** - Know how to revert
  ```bash
  # Update job file to use :stable tag
  # nomad job run nomad-job.hcl
  ```

- [ ] **Data backup** - Before first deployment
  ```bash
  # If using volumes
  sudo tar -czf /backup/myapp-data.tar.gz /opt/nomad-volumes/myapp-data
  ```

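The `tar` command above overwrites the same archive on every run. As one possible variation, the sketch below writes a timestamped archive so earlier backups are kept; `backup` is a hypothetical helper and the file naming scheme is an assumption.

```python
import tarfile
import time
from pathlib import Path


def backup(source, dest_dir):
    """Archive the directory source into dest_dir/<name>-<timestamp>.tar.gz."""
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the directory name, not the absolute path.
        tar.add(source, arcname=source.name)
    return archive
```

`backup("/opt/nomad-volumes/myapp-data", "/backup")` would need the same privileges as the `sudo tar` command it stands in for.
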
## Common Issues Checklist

If deployment fails, check:

- [ ] Is `/health` endpoint implemented and returning 200?
- [ ] Is app binding to `0.0.0.0` (not `127.0.0.1`)?
- [ ] Is app reading `PORT` from environment variable?
- [ ] Are health check port references correct (no hardcoded ports)?
- [ ] Do volume paths match between Autobox and Nomad job?
- [ ] Are volume permissions correct (uid 1000)?
- [ ] Are secrets accessible (environment or files)?
- [ ] Is Docker image pulling successfully?
- [ ] Is allocation getting scheduled (not pending)?
- [ ] Are there port conflicts?

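For the port-conflict question, a quick connection attempt shows whether something is already listening on a given port. This is a minimal sketch with a hypothetical `port_in_use` helper; it only detects TCP listeners reachable at the given host.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True when something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

The same call also helps with the binding question: if the app answers on `127.0.0.1` inside the container but not on the host's address, it is probably not bound to `0.0.0.0`.
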
## Quick Debugging Commands

```bash
# Service status
consul catalog services -tags | grep myapp
nomad job status myapp

# Allocation details
# Match only rows of the allocations table (Desired = run, Status = running),
# not the "Status = running" line in the job header
ALLOC_ID=$(nomad job status myapp | awk '$5 == "run" && $6 == "running" {print $1; exit}')
nomad alloc status "$ALLOC_ID"

# Logs
nomad alloc logs -f "$ALLOC_ID" myapp-task
nomad alloc logs -stderr -f "$ALLOC_ID" myapp-task

# Exec into container
nomad alloc exec -i -t "$ALLOC_ID" /bin/sh

# Health check test
PORT=$(nomad alloc status "$ALLOC_ID" | grep "Port.*http" | awk '{print $3}' | cut -d':' -f2)
curl "http://192.168.15.124:$PORT/health"

# Restart
nomad job restart myapp

# Force reschedule
nomad job stop -purge myapp
nomad job run nomad-job.hcl
```

---

**Print this checklist and use it for every deployment until the process becomes second nature!**