# Nomad Deployment Guide for i80.dk Infrastructure

**Last Updated:** 2025-11-28

This guide covers deploying Python applications to your Nomad cluster with proper health checks, volumes, and Vault workarounds.

## 📋 Table of Contents

- [Quick Start](#quick-start)
- [Health Checks - The #1 Pain Point](#health-checks---the-1-pain-point)
- [Host Volumes - The #2 Pain Point](#host-volumes---the-2-pain-point)
- [Vault Workarounds](#vault-workarounds)
- [Complete Nomad Job Example](#complete-nomad-job-example)
- [Dockerfile Best Practices](#dockerfile-best-practices)
- [Gitea CI/CD Workflow](#gitea-cicd-workflow)
- [Troubleshooting](#troubleshooting)

---

## Quick Start

### 1. Add Health Endpoint to Your App

**CRITICAL:** Your app MUST respond to `/health` with HTTP 200 OK.

```python
@app.route('/health')
def health():
    return jsonify({'status': 'healthy'}), 200
```

### 2. Use Complete Nomad Job Template

Copy `.gitea/workflows/nomad-job-complete.hcl.tmpl` to your project and customize:

```bash
cp .gitea/workflows/nomad-job-complete.hcl.tmpl .gitea/workflows/nomad-job.hcl
```

Replace `[[PROJECT_NAME]]` and `[[PORT]]` with your values.

### 3. Build and Deploy

```bash
# Build Docker image
docker build -t registry.i80.dk/gitea/myapp:latest .

# Push to registry
docker push registry.i80.dk/gitea/myapp:latest

# Deploy to Nomad
nomad job run .gitea/workflows/nomad-job.hcl
```

---

## Health Checks - The #1 Pain Point

### Why Health Checks Fail

**Common mistakes:**

1. ❌ **No /health endpoint** - App doesn't implement a health endpoint
2. ❌ **Wrong port** - Health check uses the wrong port variable
3. ❌ **App not ready** - Health check runs before the app has started
4. ❌ **Blocking endpoint** - /health takes too long to respond
5. ❌ **Wrong HTTP method** - App expects POST, Consul sends GET

### Proper Health Check Implementation

**In your Flask app:**

```python
import time

from flask import Flask, jsonify

app = Flask(__name__)
app_start_time = time.time()

@app.route('/health')
def health():
    """
    Health check endpoint for Consul/Nomad.
    Returns:
        200 OK: Service is healthy
        503: Service is not ready or shutting down
    """
    # Give app time to initialize (optional)
    if time.time() - app_start_time < 5:
        return jsonify({'status': 'starting'}), 503

    # Add your health checks
    try:
        # Check database connection
        # db.execute("SELECT 1")

        # Check external dependencies
        # api_client.ping()

        return jsonify({
            'status': 'healthy',
            'uptime': time.time() - app_start_time
        }), 200
    except Exception as e:
        return jsonify({
            'status': 'unhealthy',
            'error': str(e)
        }), 503
```

**In your Nomad job:**

```hcl
service {
  name = "myapp"
  port = "http"

  check {
    name     = "http_health"
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"
    port     = "http"  # Use named port, NOT hardcoded!

    # Give app time to start before failed checks can trigger a restart
    check_restart {
      limit           = 3
      grace           = "10s"
      ignore_warnings = false
    }
  }
}
```

### Testing Health Checks Locally

```bash
# Start your app
python app.py

# Test health endpoint
curl http://localhost:5000/health

# Should return:
# {"status": "healthy", "uptime": 123.45}
```

### Common Health Check Issues

**Issue: Service marked unhealthy immediately**

**Solution:** Add a `check_restart` grace period so that checks failing during startup do not count toward a restart:

```hcl
check_restart {
  limit = 3
  grace = "10s"  # Ignore failing checks for 10s after the task starts
}
```

**Issue: Health check timeout**

**Symptoms:**

```
Health check timed out (timeout: 2s)
```

**Solutions:**

- Make the /health endpoint faster
- Increase the timeout: `timeout = "5s"`
- Remove slow operations from the health check

**Issue: Wrong port**

**Symptoms:**

```
Connection refused on port 5000
```

**Solution:** Use the dynamic port in the Nomad job:

```hcl
# ❌ WRONG - hardcoded port
check {
  port = "5000"
}

# ✅ CORRECT - use named port
check {
  port = "http"
}

# And in your app environment:
env {
  PORT = "${NOMAD_PORT_http}"
}
```

---

## Host Volumes - The #2 Pain Point

### Why Host Volumes Fail

**Common mistakes:**

1. ❌ **Volume not declared on Nomad client** - Must be configured on Autobox first!
2. ❌ **Wrong source name** - Source must match the client config
3. ❌ **Permission issues** - Volume owned by root, app runs as a non-root user
4. ❌ **Mount path conflicts** - Path already exists in the container

### Setting Up Host Volumes

**Step 1: Configure on Nomad Client (Autobox)**

**File:** `/etc/nomad.d/client.hcl` on Autobox

```hcl
client {
  enabled = true

  host_volume "myapp-data" {
    path      = "/opt/nomad-volumes/myapp-data"
    read_only = false
  }
}
```

**Create directory:**

```bash
# On Autobox
sudo mkdir -p /opt/nomad-volumes/myapp-data
sudo chown 1000:1000 /opt/nomad-volumes/myapp-data  # Match container user
sudo chmod 755 /opt/nomad-volumes/myapp-data
```

**Restart Nomad client:**

```bash
sudo systemctl restart nomad
```

**Step 2: Use Volume in Nomad Job**

```hcl
group "myapp-group" {
  volume "data" {
    type      = "host"
    source    = "myapp-data"  # Must match name in client.hcl
    read_only = false
  }

  task "myapp-task" {
    volume_mount {
      volume      = "data"
      destination = "/app/data"
      read_only   = false
    }

    config {
      image = "registry.i80.dk/gitea/myapp:latest"
    }
  }
}
```

**Step 3: Use in Your App**

```python
import os

# Data directory from mounted volume
DATA_DIR = os.getenv('DATA_DIR', '/app/data')

# SQLite database in persistent volume
db_path = os.path.join(DATA_DIR, 'app.db')
```

### Volume Permissions

**Best Practice: Run container as non-root user**

**In Dockerfile:**

```dockerfile
# Create non-root user
RUN useradd -m -u 1000 appuser

# Switch to user
USER appuser
```

**On Autobox:**

```bash
# Set ownership to match container user (uid 1000)
sudo chown -R 1000:1000 /opt/nomad-volumes/myapp-data
```

### Checking Volume Mounts

```bash
# On Nomad - check allocation
nomad alloc status <alloc-id>

# Look for volume mounts section:
# Mounted Volumes:
#   data -> /opt/nomad-volumes/myapp-data

# SSH to Autobox and verify
ls -la /opt/nomad-volumes/myapp-data
```

### Volume Backup

**Simple backup script:**

```bash
#!/bin/bash
# backup-volumes.sh
VOLUME_PATH="/opt/nomad-volumes/myapp-data"
BACKUP_PATH="/backup/$(date +%Y%m%d)"

mkdir -p "$BACKUP_PATH"
tar -czf \
  "$BACKUP_PATH/myapp-data.tar.gz" "$VOLUME_PATH"
```

---

## Vault Workarounds

### Problem

Vault is currently not working, so proper secret management through Vault is not available.

### Temporary Solutions

**Option 1: Environment Variables in Nomad Job (NOT RECOMMENDED)**

```hcl
env {
  APP_ENV      = "production"
  PORT         = "${NOMAD_PORT_http}"
  DATABASE_URL = "sqlite:///app/data/app.db"
  API_KEY      = "your-secret-key-here"  # BAD: Secret in plain text!
}
```

**Pros:**

- Simple
- Works immediately

**Cons:**

- ❌ Secrets visible in Nomad UI
- ❌ Secrets in version control (if committed)
- ❌ Hard to rotate secrets

**Option 2: File-Based Secrets (BETTER)**

**Store secrets in a file on Autobox:**

```bash
# On Autobox
sudo mkdir -p /opt/nomad-secrets/myapp
sudo vim /opt/nomad-secrets/myapp/secrets.env

# Content:
# API_KEY=your-secret-key
# DB_PASSWORD=your-db-password

sudo chown 1000:1000 /opt/nomad-secrets/myapp/secrets.env
sudo chmod 600 /opt/nomad-secrets/myapp/secrets.env
```

**Mount as host volume:**

```hcl
group "myapp-group" {
  volume "secrets" {
    type      = "host"
    source    = "myapp-secrets"  # Must be declared as a host_volume in client.hcl
    read_only = true             # Read-only for security
  }

  task "myapp-task" {
    volume_mount {
      volume      = "secrets"
      destination = "/app/secrets"
      read_only   = true
    }

    # Read secrets file at startup; "set -a" exports every variable it defines
    config {
      command = "sh"
      args    = ["-c", "set -a && . /app/secrets/secrets.env && set +a && exec flask run --host 0.0.0.0 --port $PORT"]
    }
  }
}
```

**Pros:**

- ✅ Secrets not in Nomad job file
- ✅ Can be backed up separately
- ✅ Easier to rotate

**Cons:**

- ⚠️ Still manual management
- ⚠️ Need to manage file permissions

**Option 3: Consul KV Store (RECOMMENDED TEMPORARY)**

```bash
# Store secret in Consul
consul kv put secret/myapp/api_key "your-secret-key"
```

**In Nomad job template:**

```hcl
task "myapp-task" {
  template {
    data = < myapp-task
```

**Common causes:**

- /health endpoint not implemented
- App crashed
- Wrong port
- Slow startup

### Container Keeps Restarting

**Check allocation status:**

```bash
nomad alloc status <alloc-id>

# Look at Recent Events:
# Started -> Restart Signaled ->
# Started ...
```

**Common causes:**

- Failed health checks
- App crash on startup
- Missing dependencies
- Port already in use

### Volume Mount Issues

**Check Nomad client config:**

```bash
# On Autobox
nomad node status -self -verbose | grep -A 10 "Host Volumes"
```

**Check permissions:**

```bash
# On Autobox
ls -la /opt/nomad-volumes/myapp-data
# Should be owned by uid 1000 (or your container user)
```

**Check allocation:**

```bash
nomad alloc status <alloc-id>
# Look for Mounted Volumes section
```

### Port Conflicts

**Symptoms:**

```
Failed to start task: bind: address already in use
```

**Solution:** Let Nomad assign the host port dynamically instead of hardcoding it:

```hcl
network {
  port "http" {
    to = 5000  # Container internal port
    # Nomad picks the external (host) port (20000-32000 by default)
  }
}

env {
  PORT = "${NOMAD_PORT_http}"  # Port the app should listen on inside the container
}
```

### Secrets Not Loading

**Check Consul KV:**

```bash
consul kv get secret/myapp/api_key
```

**Check template rendering:**

```bash
nomad alloc fs <alloc-id> secrets/
# Should see config.env or your secret files
```

**View rendered template:**

```bash
nomad alloc fs <alloc-id> secrets/config.env
```

---

## Quick Reference

### Essential Commands

```bash
# Check service health (Consul HTTP API)
curl -s https://consul.i80.dk:8500/v1/health/checks/myapp

# View allocation
nomad alloc status <alloc-id>

# View logs
nomad alloc logs -f <alloc-id> myapp-task

# Exec into container
nomad alloc exec -i -t <alloc-id> /bin/sh

# Restart job
nomad job restart myapp

# Stop job
nomad job stop myapp

# Force reschedule
nomad job eval -force-reschedule myapp
```

### Health Check URL

```bash
# Find allocated port
nomad alloc status <alloc-id> | grep "Port.*http"

# Test health endpoint
curl http://192.168.15.124:30123/health
```

### Volume Locations

- **Client config:** `/etc/nomad.d/client.hcl` (on Autobox)
- **Volume data:** `/opt/nomad-volumes/` (on Autobox)
- **Secrets:** `/opt/nomad-secrets/` (on Autobox)

---

**For more information, see:**

- Main infrastructure docs: `~/Projects/i80_network.md`
- Nomad UI: https://nomad.i80.dk:4646
- Consul UI: https://consul.i80.dk:8500
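
As a companion to the manual `curl` test in the Quick Reference, the following is a minimal sketch of a script that polls a `/health` endpoint until it returns HTTP 200, which can be handy right after a deploy while the app is still answering 503. The function name `wait_for_healthy` and its parameters (`attempts`, `delay`, `timeout`, `opener`) are illustrative assumptions and not part of the job templates in this guide; it uses only the Python standard library.

```python
import time
import urllib.error
import urllib.request


def wait_for_healthy(url, attempts=10, delay=1.0, timeout=2.0,
                     opener=urllib.request.urlopen):
    """Poll url until it answers HTTP 200; return True on success, else False.

    `opener` is a hypothetical injection point so the function can be
    exercised without a real server; by default it is urllib's urlopen.
    """
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return True
                print(f"attempt {attempt}: unexpected HTTP {resp.status}")
        except urllib.error.HTTPError as exc:
            # urlopen raises HTTPError for 4xx/5xx, e.g. 503 while starting
            print(f"attempt {attempt}: HTTP {exc.code}")
        except OSError as exc:
            # Connection refused, DNS failure, timeout (URLError is an OSError)
            print(f"attempt {attempt}: {exc}")
        if attempt < attempts:
            time.sleep(delay)
    return False
```

For example, `wait_for_healthy("http://192.168.15.124:30123/health")` would poll the address used in the Health Check URL example above; substitute the host and port that `nomad alloc status` reports for your allocation. Keeping `timeout` at or below the `timeout` configured in the Nomad `check` block makes the script fail in roughly the same situations as the Consul check.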