PythonTemplateProject/Docs/NOMAD_DEPLOYMENT_GUIDE.md
Henrik Jess Nielsen d3177b82d8 Updated network
2025-11-28 23:21:33 +01:00


Nomad Deployment Guide for i80.dk Infrastructure

Last Updated: 2025-11-28

This guide covers deploying Python applications to your Nomad cluster with proper health checks, volumes, and Vault workarounds.


Quick Start

1. Add Health Endpoint to Your App

CRITICAL: Your app MUST respond to /health with HTTP 200 OK.

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health():
    return jsonify({'status': 'healthy'}), 200

2. Use Complete Nomad Job Template

Copy .gitea/workflows/nomad-job-complete.hcl.tmpl to your project and customize:

cp .gitea/workflows/nomad-job-complete.hcl.tmpl .gitea/workflows/nomad-job.hcl

Replace [[PROJECT_NAME]] and [[PORT]] with your values.

3. Build and Deploy

# Build Docker image
docker build -t registry.i80.dk/gitea/myapp:latest .

# Push to registry
docker push registry.i80.dk/gitea/myapp:latest

# Deploy to Nomad
nomad job run .gitea/workflows/nomad-job.hcl

Health Checks - The #1 Pain Point

Why Health Checks Fail

Common mistakes:

  1. No /health endpoint - App doesn't implement health endpoint
  2. Wrong port - Health check uses wrong port variable
  3. App not ready - Health check runs before app starts
  4. Blocking endpoint - /health takes too long to respond
  5. Wrong HTTP method - App expects POST, Consul sends GET

Proper Health Check Implementation

In your Flask app:

import time

from flask import Flask, jsonify

app = Flask(__name__)
app_start_time = time.time()

@app.route('/health')
def health():
    """
    Health check endpoint for Consul/Nomad.
    
    Returns:
        200 OK: Service is healthy
        503: Service is not ready or shutting down
    """
    # Give app time to initialize (optional)
    if time.time() - app_start_time < 5:
        return jsonify({'status': 'starting'}), 503
    
    # Add your health checks
    try:
        # Check database connection
        # db.execute("SELECT 1")
        
        # Check external dependencies
        # api_client.ping()
        
        return jsonify({
            'status': 'healthy',
            'uptime': time.time() - app_start_time
        }), 200
        
    except Exception as e:
        return jsonify({
            'status': 'unhealthy',
            'error': str(e)
        }), 503
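
The docstring above mentions returning 503 while shutting down, but the handler only covers startup and dependency failures. A minimal sketch of one way to track that state follows. The `HealthState` class and `install_sigterm_handler` helper are hypothetical names, and the sketch assumes the task receives SIGTERM when Nomad stops it.

```python
import signal
import threading
import time


class HealthState:
    """Tracks startup and shutdown so /health can answer 200 or 503."""

    def __init__(self, startup_grace=5.0, clock=time.monotonic):
        self._clock = clock
        self._started_at = clock()
        self._startup_grace = startup_grace
        self._shutting_down = threading.Event()

    def begin_shutdown(self, signum=None, frame=None):
        # Signature matches a signal handler so it can be registered directly.
        self._shutting_down.set()

    def report(self):
        """Return (payload, http_status) for the health endpoint."""
        if self._shutting_down.is_set():
            return {"status": "shutting_down"}, 503
        uptime = self._clock() - self._started_at
        if uptime < self._startup_grace:
            return {"status": "starting"}, 503
        return {"status": "healthy", "uptime": uptime}, 200


def install_sigterm_handler(state):
    # Must be called from the main thread; signal.signal raises ValueError otherwise.
    signal.signal(signal.SIGTERM, state.begin_shutdown)
```

In the Flask handler this would be used as `payload, status = state.report()` followed by `return jsonify(payload), status`.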

In your Nomad job:

service {
  name = "myapp"
  port = "http"
  
  check {
    name     = "http_health"
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"
    port     = "http"  # Use named port, NOT hardcoded!
    
    # Restart the task after 3 consecutive failed checks;
    # check results during the first 10s after start are ignored
    check_restart {
      limit           = 3
      grace           = "10s"
      ignore_warnings = false
    }
  }
}

Testing Health Checks Locally

# Start your app
python app.py

# Test health endpoint
curl http://localhost:5000/health

# Should return:
# {"status": "healthy", "uptime": 123.45}
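
The same test can be scripted. The sketch below uses only the standard library and mirrors the 2s timeout from the Nomad check; `probe_health` is a hypothetical helper name, not part of the project.

```python
import json
import urllib.error
import urllib.request


def probe_health(url, timeout=2.0):
    """GET the health URL and return (http_status, parsed_json_or_None).

    Returns (None, None) when the connection fails or times out.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # Non-2xx responses (for example 503 while starting) still carry a body.
        status, body = err.code, err.read()
    except (urllib.error.URLError, OSError):
        return None, None
    try:
        return status, json.loads(body)
    except ValueError:
        return status, None
```

Calling `probe_health("http://localhost:5000/health")` against the running app should give a 200 status once the startup delay has passed, and 503 before that.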

Common Health Check Issues

Issue: Service marked unhealthy immediately

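Solution: Add a check_restart grace period so Nomad does not restart the task for checks that fail while the app is still starting. Consul may still report the check as critical until the first passing response.

check_restart {
  limit = 3
  grace = "10s"  # Ignore check results for 10s after the task starts
}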

Issue: Health check timeout

Symptoms:

Health check timed out (timeout: 2s)

Solutions:

  • Make /health endpoint faster
  • Increase timeout: timeout = "5s"
  • Remove slow operations from health check

Issue: Wrong port

Symptoms:

Connection refused on port 5000

Solution: Use dynamic port in Nomad job:

# ❌ WRONG - hardcoded port
check {
  port = "5000"
}

# ✅ CORRECT - use named port
check {
  port = "http"
}

# And in your app environment:
env {
  PORT = "${NOMAD_PORT_http}"
}
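
On the application side, the PORT variable set above has to be read instead of a hardcoded value. A minimal sketch, assuming the app falls back to 5000 for local runs; `resolve_port` is a hypothetical helper name.

```python
import os


def resolve_port(environ=None, default=5000):
    """Return the port to bind to, taken from PORT or falling back to `default`."""
    environ = os.environ if environ is None else environ
    raw = environ.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port
```

The result can then be passed to the server, for example `app.run(host="0.0.0.0", port=resolve_port())`.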

Host Volumes - The #2 Pain Point

Why Host Volumes Fail

Common mistakes:

  1. Volume not declared on Nomad client - Must configure on Autobox first!
  2. Wrong source name - Source must match client config
  3. Permission issues - Volume owned by root, app runs as user
  4. Mount path conflicts - Path already exists in container

Setting Up Host Volumes

Step 1: Configure on Nomad Client (Autobox)

File: /etc/nomad.d/client.hcl on Autobox

client {
  enabled = true
  
  host_volume "myapp-data" {
    path      = "/opt/nomad-volumes/myapp-data"
    read_only = false
  }
}

Create directory:

# On Autobox
sudo mkdir -p /opt/nomad-volumes/myapp-data
sudo chown 1000:1000 /opt/nomad-volumes/myapp-data  # Match container user
sudo chmod 755 /opt/nomad-volumes/myapp-data

Restart Nomad client:

sudo systemctl restart nomad

Step 2: Use Volume in Nomad Job

group "myapp-group" {
  volume "data" {
    type      = "host"
    source    = "myapp-data"  # Must match name in client.hcl
    read_only = false
  }
  
  task "myapp-task" {
    volume_mount {
      volume      = "data"
      destination = "/app/data"
      read_only   = false
    }
    
    config {
      image = "registry.i80.dk/gitea/myapp:latest"
    }
  }
}

Step 3: Use in Your App

import os

# Data directory from mounted volume
DATA_DIR = os.getenv('DATA_DIR', '/app/data')

# SQLite database in persistent volume
db_path = os.path.join(DATA_DIR, 'app.db')
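
Because permission problems on the mounted volume often surface only on the first write, it can help to test the directory at startup. The sketch below is one way to do that; `ensure_writable_dir` is a hypothetical helper and assumes a Unix container.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create `path` if needed and verify this process can write files in it."""
    os.makedirs(path, exist_ok=True)
    try:
        # Creating and deleting a temp file exercises the real permissions.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".write-test-"):
            pass
    except OSError as exc:
        raise RuntimeError(
            f"{path} is not writable by uid {os.getuid()}: {exc}"
        ) from exc
    return path
```

Calling `ensure_writable_dir(DATA_DIR)` before opening the database turns a later, harder to read SQLite error into an immediate message that names the directory and the uid.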

Volume Permissions

Best Practice: Run container as non-root user

In Dockerfile:

# Create non-root user
RUN useradd -m -u 1000 appuser

# Switch to user
USER appuser

On Autobox:

# Set ownership to match container user (uid 1000)
sudo chown -R 1000:1000 /opt/nomad-volumes/myapp-data

Checking Volume Mounts

# On Nomad - check allocation
nomad alloc status <alloc-id>

# Look for volume mounts section:
# Mounted Volumes:
#   data -> /opt/nomad-volumes/myapp-data

# SSH to Autobox and verify
ls -la /opt/nomad-volumes/myapp-data

Volume Backup

Simple backup script:

#!/bin/bash
# backup-volumes.sh

VOLUME_PATH="/opt/nomad-volumes/myapp-data"
BACKUP_PATH="/backup/$(date +%Y%m%d)"

mkdir -p "$BACKUP_PATH"
tar -czf "$BACKUP_PATH/myapp-data.tar.gz" "$VOLUME_PATH"
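
A backup is only useful if it can be read back. As a small sketch, assuming archives are produced by the script above, the following hypothetical helper lists the members of an archive and raises if the file is not a readable gzip tarball.

```python
import tarfile


def list_backup(archive_path):
    """Return the sorted member names of a .tar.gz backup."""
    # tarfile.open raises tarfile.ReadError if the file is not a valid archive.
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(member.name for member in tar.getmembers())
```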

Vault Workarounds

Problem

Vault is currently not working on this cluster, so proper secret management is not available.

Temporary Solutions

Option 1: Environment Variables in Nomad Job (NOT RECOMMENDED)

env {
  APP_ENV      = "production"
  PORT         = "${NOMAD_PORT_http}"
  DATABASE_URL = "sqlite:///app/data/app.db"
  API_KEY      = "your-secret-key-here"  # BAD: Secret in plain text!
}

Pros:

  • Simple
  • Works immediately

Cons:

  • Secrets visible in Nomad UI
  • Secrets in version control (if committed)
  • Hard to rotate secrets

Option 2: File-Based Secrets (BETTER)

Store secrets in a file on Autobox:

# On Autobox
sudo mkdir -p /opt/nomad-secrets/myapp
sudo vim /opt/nomad-secrets/myapp/secrets.env

# Content:
# API_KEY=your-secret-key
# DB_PASSWORD=your-db-password

sudo chown 1000:1000 /opt/nomad-secrets/myapp/secrets.env
sudo chmod 600 /opt/nomad-secrets/myapp/secrets.env

Mount as host volume:

group "myapp-group" {
  volume "secrets" {
    type      = "host"
    source    = "myapp-secrets"  # Must be declared as a host_volume in client.hcl
    read_only = true  # Read-only for security
  }
  
  task "myapp-task" {
    volume_mount {
      volume      = "secrets"
      destination = "/app/secrets"
      read_only   = true
    }
    
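    # Export the variables from the secrets file, then start the app.
    # "." is used instead of "source" because sh is not always bash,
    # and "set -a" exports the variables so the flask process sees them.
    config {
      command = "sh"
      args = ["-c", "set -a && . /app/secrets/secrets.env && set +a && exec flask run --port $PORT"]
    }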
  }
}

Pros:

  • Secrets not in Nomad job file
  • Can be backed up separately
  • Easier to rotate

Cons:

  • ⚠️ Still manual management
  • ⚠️ Need to manage file permissions
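
As an alternative to a shell wrapper, the app could read the mounted secrets file itself at startup. This is a minimal sketch that assumes the simple KEY=VALUE format shown above; `load_env_file` is a hypothetical helper, not an existing project function.

```python
import os


def load_env_file(path, environ=None):
    """Parse KEY=VALUE lines from `path` into `environ` (default: os.environ).

    Blank lines and lines starting with '#' are skipped. Existing keys are
    left untouched so values set in the Nomad job take precedence.
    """
    environ = os.environ if environ is None else environ
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip()
            # Drop one pair of matching surrounding quotes, if present.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                value = value[1:-1]
            if key and key not in environ:
                environ[key] = value
                loaded[key] = value
    return loaded
```

With this in place the app would call `load_env_file("/app/secrets/secrets.env")` before reading its configuration, and the job could start flask directly.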

Option 3: Consul KV Store (RECOMMENDED TEMPORARY)

# Store secret in Consul
consul kv put secret/myapp/api_key "your-secret-key"

In Nomad job template:

task "myapp-task" {
  template {
    data = <<EOH
{{ with key "secret/myapp/api_key" }}
API_KEY="{{ . }}"
{{ end }}
EOH
    destination = "secrets/config.env"
    env         = true
  }
}

Pros:

  • Uses existing infrastructure (Consul)
  • Can be managed via API
  • Not visible in Nomad UI

Cons:

  • ⚠️ Not as secure as Vault
  • ⚠️ Manual secret rotation

When Vault is Fixed

Proper Vault integration:

task "myapp-task" {
  vault {
    policies = ["myapp-policy"]
  }
  
  template {
    data = <<EOH
{{ with secret "secret/data/myapp" }}
API_KEY="{{ .Data.data.api_key }}"
DATABASE_URL="{{ .Data.data.database_url }}"
{{ end }}
EOH
    destination = "secrets/config.env"
    env         = true
  }
}
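
Both the Consul KV template and the Vault template deliver secrets to the app as environment variables (env = true). A small sketch of failing fast when one is missing follows; `require_env` and `MissingSettings` are hypothetical names used for illustration.

```python
import os


class MissingSettings(RuntimeError):
    """Raised when one or more required environment variables are absent."""


def require_env(names, environ=None):
    """Return {name: value} for all `names`, or raise listing every missing one."""
    environ = os.environ if environ is None else environ
    # Empty strings count as missing, since a failed render can leave them blank.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise MissingSettings("missing required settings: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `require_env(["API_KEY"])` at startup makes a template that did not render show up as a clear error in the allocation logs rather than as a later authentication failure.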

Complete Nomad Job Example

See .gitea/workflows/nomad-job-complete.hcl.tmpl for a fully documented example with:

  • Proper health checks with grace period
  • Host volume configuration
  • Vault workarounds
  • Auto-revert on failed deployments
  • Graceful shutdown handling
  • Resource limits
  • Log rotation

Dockerfile Best Practices

Multi-Stage Build

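# Builder stage
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Runtime stage (smaller)
FROM python:3.11-slim
RUN useradd -m -u 1000 appuser
WORKDIR /app
# Copy installed packages and make them owned by the runtime user
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .
ENV PATH=/home/appuser/.local/bin:$PATH
USER appuser
# Bind to 0.0.0.0 so the app is reachable through the mapped port
CMD ["flask", "run", "--host=0.0.0.0"]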

Benefits:

  • Smaller final image
  • Faster deployment
  • Less attack surface

Non-Root User

# Create user
RUN useradd -m -u 1000 appuser

# Switch to user
USER appuser

Why:

  • Security best practice
  • Required for some volume mounts
  • Prevents privilege escalation

Health Check

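# Requires curl in the image (python:3.11-slim does not include it)
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s \
  CMD curl -f http://localhost:${PORT}/health || exit 1

Benefits:

  • Docker can detect unhealthy containers
  • Health status is visible in docker ps when debugging on the host
  • Extra layer of monitoring

Note: Nomad does not act on Docker's health status. The Consul check in the Nomad job remains the check that drives restarts and service health.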

Gitea CI/CD Workflow

Complete Workflow Example

See .gitea/workflows/main.yml.tmpl for a complete Gitea Actions workflow that:

  1. Builds Docker image
  2. Tags with commit hash + latest
  3. Pushes to private registry
  4. Validates Nomad job
  5. Stops old deployment
  6. Deploys new version
  7. Updates nginx configuration
  8. Updates forwarder configuration

Secrets in Gitea

Configure in Gitea repository settings:

  • secrets.username - Registry username
  • secrets.password - Registry password

Self-Hosted Runner

Your runner must have:

  • Docker installed
  • Nomad CLI installed
  • SSH access to Nomad server
  • Access to private registry

Troubleshooting

Service Marked Unhealthy

Check Consul:

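# On Nomad (assumes a local Consul agent on the default port 8500)
curl -s http://127.0.0.1:8500/v1/health/checks/myapp

# Look for the check status in the JSON output:
#   "Name": "http_health",
#   "Status": "critical"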

Check allocation logs:

nomad alloc logs -f <alloc-id> myapp-task

Common causes:

  • /health endpoint not implemented
  • App crashed
  • Wrong port
  • Slow startup

Container Keeps Restarting

Check allocation status:

nomad alloc status <alloc-id>

# Look at Recent Events:
# Started -> Restart Signaled -> Started ...

Common causes:

  • Failed health checks
  • App crash on startup
  • Missing dependencies
  • Port already in use

Volume Mount Issues

Check Nomad client config:

# On Autobox
nomad node status -self -verbose | grep -A 10 "Host Volumes"

Check permissions:

# On Autobox
ls -la /opt/nomad-volumes/myapp-data

# Should be owned by uid 1000 (or your container user)

Check allocation:

nomad alloc status <alloc-id>

# Look for Mounted Volumes section

Port Conflicts

Symptoms:

Failed to start task: bind: address already in use

Solution: Nomad assigns dynamic ports automatically:

network {
  port "http" {
    to = 5000  # Container internal port
    # Nomad picks the host port (default dynamic range is 20000-32000)
  }
}

env {
  PORT = "${NOMAD_PORT_http}"  # Use Nomad's assigned port
}

Secrets Not Loading

Check Consul KV:

consul kv get secret/myapp/api_key

Check template rendering:

nomad alloc fs <alloc-id> secrets/

# Should see config.env or your secret files

View rendered template:

nomad alloc fs <alloc-id> secrets/config.env

Quick Reference

Essential Commands

# Check service health (Consul HTTP API on the local agent)
curl -s http://127.0.0.1:8500/v1/health/checks/myapp

# View allocation
nomad alloc status <alloc-id>

# View logs
nomad alloc logs -f <alloc-id> myapp-task

# Exec into container
nomad alloc exec -i -t <alloc-id> /bin/sh

# Restart job
nomad job restart myapp

# Stop job
nomad job stop myapp

# Force reschedule of failed allocations
nomad job eval -force-reschedule myapp

Health Check URL

# Find allocated port
nomad alloc status <alloc-id> | grep "Port.*http"

# Test health endpoint
curl http://192.168.15.124:30123/health

Volume Locations

  • Client config: /etc/nomad.d/client.hcl (on Autobox)
  • Volume data: /opt/nomad-volumes/<volume-name> (on Autobox)
  • Secrets: /opt/nomad-secrets/<app-name> (on Autobox)

For more information, see: