PythonTemplateProject/DEPLOYMENT_CHECKLIST.md
Henrik Jess Nielsen e73ac7ca3b Updated network
2025-11-28 23:21:07 +01:00


Deployment Checklist

Use this checklist when deploying a new service to ensure you don't miss critical steps.

Pre-Deployment

Application Requirements

  • Health endpoint implemented - /health returns 200 OK

    • Returns JSON with status
    • Responds quickly (<500ms)
    • Doesn't block on external services
  • Port configuration - App reads PORT from environment

    import os
    PORT = int(os.getenv('PORT', '5000'))  # falls back to 5000 when PORT is unset
    app.run(host='0.0.0.0', port=PORT)  # bind all interfaces, not 127.0.0.1
    
  • Graceful shutdown - App handles SIGTERM signal

    • Closes connections cleanly
    • Finishes current requests
    • Exits within 30 seconds
  • Logging configured - Uses stdout/stderr

    • Structured logging (JSON preferred)
    • Includes timestamps
    • No log files (Nomad captures stdout)
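The four requirements above can be sketched together in one small program. This is a minimal illustration that uses only the Python standard library rather than the web framework implied by `app.run(...)`; the names `JsonFormatter`, `HealthHandler`, `build_server`, and `serve` are hypothetical and not part of the template.

```python
import json
import logging
import os
import signal
import sys
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        })


def configure_logging():
    """Send structured logs to stdout so Nomad can capture them."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Answer from local state only; no calls to external services.
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        # Route access logs through the logging module instead of raw stderr.
        logging.getLogger("http").info(fmt % args)


def build_server(port=None):
    """Bind to 0.0.0.0 on PORT from the environment (default 5000)."""
    if port is None:
        port = int(os.getenv("PORT", "5000"))
    server = ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
    # Non-daemon handler threads let server_close() wait for in-flight requests.
    server.daemon_threads = False
    return server


def serve(server):
    """Run until SIGTERM or SIGINT, then stop accepting and exit cleanly."""
    def _stop(signum, frame):
        # shutdown() blocks until serve_forever() returns, so call it
        # from another thread rather than from the signal handler itself.
        threading.Thread(target=server.shutdown, daemon=True).start()

    signal.signal(signal.SIGTERM, _stop)
    signal.signal(signal.SIGINT, _stop)
    server.serve_forever()
    server.server_close()
```

A typical entry point would call `configure_logging()` followed by `serve(build_server())`. The sketch does not enforce the 30-second exit limit; a long-running request would still need its own timeout.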

Docker Image

  • Dockerfile complete - Based on Dockerfile.complete

    • Multi-stage build (smaller image)
    • Non-root user (uid 1000)
    • Health check defined
    • Minimal base image
  • Image tested locally

    docker build -t myapp:test .
    docker run -p 5000:5000 -e PORT=5000 myapp:test
    curl http://localhost:5000/health
    
  • Image pushed to registry

    docker tag myapp:test registry.i80.dk/gitea/myapp:latest
    docker push registry.i80.dk/gitea/myapp:latest
    

Nomad Job Configuration

  • Job file created - Copy from nomad-job-complete.hcl.tmpl

    • Replace [[PROJECT_NAME]] with actual name
    • Replace [[PORT]] with app port (usually 5000)
    • Update resource limits (CPU/memory)
  • Health check configured - Uses named port, not hardcoded

    check {
      port = "http"  # NOT "5000"!
    }
    
  • Traefik tags correct - Domain matches expected URL

    "traefik.http.routers.myapp.rule=Host(`myapp.i80.dk`)"
    
  • Volumes declared (if needed)

    • Volume source matches Autobox config
    • Mount path correct
    • Permissions considered
  • Secrets configured - Using chosen workaround method

    • Environment variables OR
    • File-based secrets OR
    • Consul KV
  • Job validates - No syntax errors

    nomad job validate nomad-job.hcl
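Placeholder replacement is easy to get partly wrong, so it may help to script it. The following is a rough sketch, assuming the template uses only upper-case `[[NAME]]` placeholders such as `[[PROJECT_NAME]]` and `[[PORT]]`; `render_job_template` is a hypothetical helper, not something shipped with the template.

```python
import re

# Matches placeholders of the form [[PROJECT_NAME]] or [[PORT]].
PLACEHOLDER = re.compile(r"\[\[([A-Z_]+)\]\]")


def render_job_template(template, values):
    """Replace [[NAME]] placeholders; raise KeyError if one has no value."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for [[{name}]]")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)
```

Reading `nomad-job-complete.hcl.tmpl`, passing its text through `render_job_template(text, {"PROJECT_NAME": "myapp", "PORT": 5000})`, and writing the result to `nomad-job.hcl` leaves a file that is ready for `nomad job validate`. Raising on an unknown placeholder means a forgotten value fails before the job is submitted.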
    

Autobox Configuration

  • Volumes created (if needed)

    # Run on Autobox
    sudo ./setup-nomad-volumes.sh myapp
    
  • Volumes registered on the client node

    nomad node status -self -verbose | grep myapp-data
    
  • Secrets file created (if using file-based secrets)

    sudo vim /opt/nomad-secrets/myapp/secrets.env
    
  • Permissions correct

    ls -la /opt/nomad-volumes/myapp-data  # Should be 1000:1000
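The ownership check can also be automated instead of read off `ls` output. A small sketch, assuming the container user is uid and gid 1000 as stated in the Docker Image section; `check_owner` is a hypothetical helper name.

```python
import os


def check_owner(path, uid=1000, gid=1000):
    """Return a list of ownership problems for path (empty if all is fine)."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]
    problems = []
    if info.st_uid != uid:
        problems.append(f"{path} is owned by uid {info.st_uid}, expected {uid}")
    if info.st_gid != gid:
        problems.append(f"{path} has gid {info.st_gid}, expected {gid}")
    return problems
```

Calling `check_owner("/opt/nomad-volumes/myapp-data")` on Autobox returns an empty list when the directory is owned by 1000:1000.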
    

Gitea CI/CD (if using)

  • Workflow file created - Copy from main.yml.tmpl

    • Replace [[PROJECT_NAME]] everywhere
    • Registry credentials configured
  • Secrets configured - In Gitea repository settings

    • secrets.username - Registry username
    • secrets.password - Registry password
  • Self-hosted runner - Has necessary access

    • Docker installed
    • Nomad CLI installed
    • SSH access to Nomad server

Deployment

Initial Deployment

  • Job submitted

    nomad job run nomad-job.hcl
    
  • Allocation running

    nomad job status myapp
    # The Summary table should show 1 in the Running column
    
  • No errors in logs

    nomad alloc logs -f <alloc-id> myapp-task
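The same status check can be scripted against the Nomad HTTP API, which lists a job's allocations at `/v1/job/<job>/allocations` with a `ClientStatus` field on each entry. This sketch assumes the API is reachable without an ACL token at the default `http://127.0.0.1:4646`; both function names are hypothetical.

```python
import json
import urllib.request


def running_allocations(allocations):
    """Return the IDs of allocations whose ClientStatus is "running"."""
    return [a["ID"] for a in allocations if a.get("ClientStatus") == "running"]


def fetch_allocations(job, nomad_addr="http://127.0.0.1:4646"):
    """Fetch the allocation list for a job from the Nomad HTTP API."""
    url = f"{nomad_addr}/v1/job/{job}/allocations"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)
```

`running_allocations(fetch_allocations("myapp"))` should return exactly one ID after a successful first deployment.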
    

Consul Registration

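  • Service registered

    consul catalog services | grep myapp
    
  • Service healthy

    # Assumes the Consul HTTP API is reachable on the default 127.0.0.1:8500
    curl -s http://127.0.0.1:8500/v1/health/checks/myapp
    # Look for "Status": "passing" on every check
    
  • Tags correct

    consul catalog services -tags | grep myapp
    # Verify traefik tags present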
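Health and tags can be verified in one pass from the JSON that Consul returns for `/v1/health/service/<name>`, where each entry carries a `Service` object and a `Checks` list. A minimal sketch, assuming Traefik tags start with `traefik.`; `summarize_service` is a hypothetical helper.

```python
def summarize_service(entries, required_tag_prefix="traefik."):
    """Summarize entries returned by Consul's /v1/health/service/<name>."""
    summary = []
    for entry in entries:
        service = entry["Service"]
        checks = entry.get("Checks") or []
        tags = service.get("Tags") or []
        summary.append({
            "id": service["ID"],
            # Healthy only if at least one check exists and all are passing.
            "healthy": bool(checks) and all(c["Status"] == "passing" for c in checks),
            "has_traefik_tags": any(t.startswith(required_tag_prefix) for t in tags),
        })
    return summary
```

Feeding it the parsed output of `curl -s http://127.0.0.1:8500/v1/health/service/myapp` gives one row per registered instance, which is easier to scan than the raw response.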
    

DNS & Access

  • DNS record created - Check consul-template output

    grep myapp /certs/consul/trinity_powerdns_records.txt
    
  • Nginx config generated

    grep myapp /certs/consul-nginx/conf.d/services.conf
    
  • Nginx reloaded - Check watcher logs

    tail -f /var/log/nginx_restater.log
    
  • Service accessible - Test public URL

    curl https://myapp.i80.dk
    curl https://myapp.i80.dk/health
    

Post-Deployment

Verification

  • Health check passing - For at least 5 minutes

    # Assumes the Consul HTTP API is reachable on the default 127.0.0.1:8500
    watch -n 5 'curl -s http://127.0.0.1:8500/v1/health/checks/myapp'
    
  • No restarts - Allocation stable

    nomad alloc status <alloc-id>
    # Check "Recent Events" - no restarts
    
  • Logs clean - No errors or warnings

    nomad alloc logs -f <alloc-id> myapp-task
    
  • Performance acceptable

    • Response time < 1s
    • Memory usage stable
    • CPU usage reasonable
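The stability and response-time checks above can be combined into one probe. This is only a sketch: the defaults of 60 samples at 5-second intervals cover roughly five minutes, the 1-second limit comes from the list above, and `probe` is a hypothetical name.

```python
import time
import urllib.request


def probe(url, samples=60, interval=5.0, max_latency=1.0):
    """Request url repeatedly; return (ok, latencies) for the whole run."""
    latencies = []
    ok = True
    for i in range(samples):
        started = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=max_latency * 5) as response:
                response.read()
                status = response.status
        except OSError:
            # Covers connection errors, timeouts, and HTTP error statuses.
            status = None
        elapsed = time.monotonic() - started
        latencies.append(elapsed)
        if status != 200 or elapsed > max_latency:
            ok = False
        if i < samples - 1:
            time.sleep(interval)
    return ok, latencies
```

Running `probe("https://myapp.i80.dk/health")` after deployment returns `ok` as `False` if any sample failed or exceeded the latency limit, and the latency list shows whether response times drift over the run.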

Monitoring

  • Metrics accessible - If implemented

    curl https://myapp.i80.dk/metrics
    
  • Logs searchable - Can find application logs

    nomad alloc logs -f <alloc-id> myapp-task | grep ERROR
    
  • Alerts configured - If using monitoring system

    • Health check failures
    • High error rate
    • High memory usage

Documentation

  • Service documented - In team wiki/docs

    • What it does
    • Where it's deployed
    • How to access it
    • Known issues
  • Runbook created - For operational issues

    • How to restart
    • How to check logs
    • Common troubleshooting steps
    • Escalation path
  • Secrets documented - Where they're stored

    • Which Consul KV keys
    • Which files on Autobox
    • Who has access

Rollback Plan

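  • Previous version tagged - Before pushing a new :latest, in case of issues

    docker tag registry.i80.dk/gitea/myapp:latest registry.i80.dk/gitea/myapp:stable
    docker push registry.i80.dk/gitea/myapp:stable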
    
  • Rollback tested - Know how to revert

    # Update the image tag in nomad-job.hcl to :stable, then resubmit
    nomad job run nomad-job.hcl
    
  • Data backup - Before first deployment

    # If using volumes
    sudo tar -czf /backup/myapp-data.tar.gz /opt/nomad-volumes/myapp-data
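A fixed archive name is overwritten by the next backup, so a timestamped variant may be safer. A minimal sketch using the standard `tarfile` module; `backup_volume` and its naming scheme are hypothetical, and the script would need enough privileges to read the volume.

```python
import tarfile
import time
from pathlib import Path


def backup_volume(source, backup_dir, now=None):
    """Write a timestamped .tar.gz of source into backup_dir; return its path."""
    source = Path(source)
    # now=None means the current local time.
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    target = Path(backup_dir) / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(target, "w:gz") as archive:
        # arcname keeps archive paths relative to the volume directory.
        archive.add(source, arcname=source.name)
    return target
```

`backup_volume("/opt/nomad-volumes/myapp-data", "/backup")` would produce a file such as `/backup/myapp-data-<timestamp>.tar.gz` and leave earlier backups in place.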
    

Common Issues Checklist

If deployment fails, check:

  • Is /health endpoint implemented and returning 200?
  • Is app binding to 0.0.0.0 (not 127.0.0.1)?
  • Is app reading PORT from environment variable?
  • Are health check port references correct (no hardcoded ports)?
  • Do volume paths match between Autobox and Nomad job?
  • Are volume permissions correct (uid 1000)?
  • Are secrets accessible (environment or files)?
  • Is Docker image pulling successfully?
  • Is allocation getting scheduled (not pending)?
  • Are there port conflicts?
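For the port-conflict question, a quick bind test on the host shows whether something is already listening. This is a sketch under the assumption that the check runs on the same machine and address the app would bind to; `port_is_free` is a hypothetical helper.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True
```

`port_is_free(5000)` returning `False` suggests another process already holds the port, which would explain an app that fails on startup.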

Quick Debugging Commands

# Service status
consul catalog services -tags | grep myapp
nomad job status myapp

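# Allocation details (column 6 of the Allocations table is the client status;
# a plain grep for "running" would also match the job's "Status = running" line)
ALLOC_ID=$(nomad job status myapp | awk '$6 == "running" {print $1; exit}')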
nomad alloc status $ALLOC_ID

# Logs
nomad alloc logs -f $ALLOC_ID myapp-task
nomad alloc logs -stderr -f $ALLOC_ID myapp-task

# Exec into container
nomad alloc exec -i -t $ALLOC_ID /bin/sh

# Health check test
PORT=$(nomad alloc status $ALLOC_ID | grep "Port.*http" | awk '{print $3}' | cut -d':' -f2)
curl http://192.168.15.124:$PORT/health

# Restart
nomad job restart myapp

# Force reschedule
nomad job stop -purge myapp
nomad job run nomad-job.hcl

Print this checklist and use it for every deployment until the process becomes second nature!