Updated network
NOMAD_DEPLOYMENT_GUIDE.md
# Nomad Deployment Guide for i80.dk Infrastructure

**Last Updated:** 2025-11-28

This guide covers deploying Python applications to your Nomad cluster with proper health checks, volumes, and Vault workarounds.

## 📋 Table of Contents

- [Quick Start](#quick-start)
- [Health Checks - The #1 Pain Point](#health-checks---the-1-pain-point)
- [Host Volumes - The #2 Pain Point](#host-volumes---the-2-pain-point)
- [Vault Workarounds](#vault-workarounds)
- [Complete Nomad Job Example](#complete-nomad-job-example)
- [Dockerfile Best Practices](#dockerfile-best-practices)
- [Gitea CI/CD Workflow](#gitea-cicd-workflow)
- [Troubleshooting](#troubleshooting)

---

## Quick Start

### 1. Add Health Endpoint to Your App

**CRITICAL:** Your app MUST respond to `/health` with HTTP 200 OK.

```python
@app.route('/health')
def health():
    return jsonify({'status': 'healthy'}), 200
```

### 2. Use Complete Nomad Job Template

Copy `.gitea/workflows/nomad-job-complete.hcl.tmpl` to your project and customize:

```bash
cp .gitea/workflows/nomad-job-complete.hcl.tmpl .gitea/workflows/nomad-job.hcl
```

Replace `[[PROJECT_NAME]]` and `[[PORT]]` with your values.

### 3. Build and Deploy

```bash
# Build Docker image
docker build -t registry.i80.dk/gitea/myapp:latest .

# Push to registry
docker push registry.i80.dk/gitea/myapp:latest

# Deploy to Nomad
nomad job run .gitea/workflows/nomad-job.hcl
```

---

## Health Checks - The #1 Pain Point

### Why Health Checks Fail

**Common mistakes:**

1. ❌ **No /health endpoint** - App doesn't implement health endpoint
2. ❌ **Wrong port** - Health check uses wrong port variable
3. ❌ **App not ready** - Health check runs before app starts
4. ❌ **Blocking endpoint** - /health takes too long to respond
5. ❌ **Wrong HTTP method** - App expects POST, Consul sends GET

### Proper Health Check Implementation

**In your Flask app:**

```python
import time

from flask import Flask, jsonify

app = Flask(__name__)
app_start_time = time.time()

@app.route('/health')
def health():
    """
    Health check endpoint for Consul/Nomad.

    Returns:
        200 OK: Service is healthy
        503: Service is not ready or shutting down
    """
    # Give app time to initialize (optional)
    if time.time() - app_start_time < 5:
        return jsonify({'status': 'starting'}), 503

    # Add your health checks
    try:
        # Check database connection
        # db.execute("SELECT 1")

        # Check external dependencies
        # api_client.ping()

        return jsonify({
            'status': 'healthy',
            'uptime': time.time() - app_start_time
        }), 200

    except Exception as e:
        return jsonify({
            'status': 'unhealthy',
            'error': str(e)
        }), 503
```

**In your Nomad job:**

```hcl
service {
  name = "myapp"
  port = "http"

  check {
    name     = "http_health"
    type     = "http"
    path     = "/health"
    interval = "10s"
    timeout  = "2s"
    port     = "http"  # Use named port, NOT hardcoded!

    # Restart the task after 3 consecutive failed checks;
    # failures during the first 10s after start are ignored
    check_restart {
      limit           = 3
      grace           = "10s"
      ignore_warnings = false
    }
  }
}
```

### Testing Health Checks Locally

```bash
# Start your app
python app.py

# Test health endpoint
curl http://localhost:5000/health

# Should return:
# {"status": "healthy", "uptime": 123.45}
```
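
When the app needs a few seconds to come up, a single `curl` can hit it too early. The sketch below polls the endpoint until it answers HTTP 200 or a deadline passes, using only the standard library. The function name `wait_until_healthy` and its parameters are illustrative, not part of the project; you might call it as `wait_until_healthy("http://localhost:5000/health")` in a local smoke test.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url, timeout_s=30.0, interval_s=1.0):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an error status such as 503
            pass
        time.sleep(interval_s)
    return False
```
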

### Common Health Check Issues

**Issue: Service marked unhealthy immediately**

**Solution:** Add a `check_restart` grace period so that failures during startup do not trigger a restart:

```hcl
check_restart {
  limit = 3
  grace = "10s"  # Ignore failed checks during the first 10s after start
}
```

**Issue: Health check timeout**

**Symptoms:**
```
Health check timed out (timeout: 2s)
```

**Solutions:**
- Make /health endpoint faster
- Increase timeout: `timeout = "5s"`
- Remove slow operations from health check
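
One way to keep `/health` fast without dropping a slow dependency check entirely is to run that check at most once per interval and serve the cached result in between. This is a minimal sketch under that assumption; `CachedCheck`, `ttl_s`, and the `clock` parameter are hypothetical names, not part of Flask or Nomad. Inside `health()` you would call something like `db_check.ok()` instead of querying the database on every request. Note that the first call still runs the check inline.

```python
import time


class CachedCheck:
    """Run a slow check at most once per `ttl_s` seconds and cache the result."""

    def __init__(self, check_fn, ttl_s=15.0, clock=time.monotonic):
        self._check_fn = check_fn
        self._ttl_s = ttl_s
        self._clock = clock
        self._last_run = None
        self._last_ok = False

    def ok(self):
        now = self._clock()
        if self._last_run is None or now - self._last_run >= self._ttl_s:
            try:
                self._last_ok = bool(self._check_fn())
            except Exception:
                # A check that raises counts as unhealthy
                self._last_ok = False
            self._last_run = now
        return self._last_ok
```
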

**Issue: Wrong port**

**Symptoms:**
```
Connection refused on port 5000
```

**Solution:** Use the named port in the Nomad job:

```hcl
# ❌ WRONG - hardcoded port
check {
  port = "5000"
}

# ✅ CORRECT - use named port
check {
  port = "http"
}

# And in your app environment:
env {
  PORT = "${NOMAD_PORT_http}"
}
```
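
On the application side, the `PORT` variable set in the `env` block still has to be read and validated. A minimal sketch, assuming the app falls back to 5000 when `PORT` is unset (the helper name `get_port` is hypothetical); it could be used as `app.run(host="0.0.0.0", port=get_port())`.

```python
import os


def get_port(default=5000):
    """Return the listen port from the PORT environment variable."""
    raw = os.environ.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    return port
```
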

---

## Host Volumes - The #2 Pain Point

### Why Host Volumes Fail

**Common mistakes:**

1. ❌ **Volume not declared on Nomad client** - Must configure on Autobox first!
2. ❌ **Wrong source name** - Source must match client config
3. ❌ **Permission issues** - Volume owned by root, app runs as user
4. ❌ **Mount path conflicts** - Path already exists in container

### Setting Up Host Volumes

**Step 1: Configure on Nomad Client (Autobox)**

**File:** `/etc/nomad.d/client.hcl` on Autobox

```hcl
client {
  enabled = true

  host_volume "myapp-data" {
    path      = "/opt/nomad-volumes/myapp-data"
    read_only = false
  }
}
```

**Create directory:**

```bash
# On Autobox
sudo mkdir -p /opt/nomad-volumes/myapp-data
sudo chown 1000:1000 /opt/nomad-volumes/myapp-data  # Match container user
sudo chmod 755 /opt/nomad-volumes/myapp-data
```

**Restart Nomad client:**

```bash
sudo systemctl restart nomad
```

**Step 2: Use Volume in Nomad Job**

```hcl
group "myapp-group" {
  volume "data" {
    type      = "host"
    source    = "myapp-data"  # Must match name in client.hcl
    read_only = false
  }

  task "myapp-task" {
    volume_mount {
      volume      = "data"
      destination = "/app/data"
      read_only   = false
    }

    config {
      image = "registry.i80.dk/gitea/myapp:latest"
    }
  }
}
```

**Step 3: Use in Your App**

```python
import os

# Data directory from mounted volume
DATA_DIR = os.getenv('DATA_DIR', '/app/data')

# SQLite database in persistent volume
db_path = os.path.join(DATA_DIR, 'app.db')
```
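
Because a missing or root-owned mount tends to surface only on the first write, it can help to check the data directory at startup and fail with a clear message. The helper below is a sketch; `ensure_writable_dir` is a hypothetical name, and you would call it with the same `DATA_DIR` value before opening the database.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Fail fast if `path` is missing or not writable by the current user."""
    if not os.path.isdir(path):
        raise RuntimeError(f"data directory missing: {path}")
    try:
        # Creating a temp file exercises real write permission on the mount
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        raise RuntimeError(f"data directory not writable: {path}") from exc
    return path
```
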

### Volume Permissions

**Best Practice: Run container as non-root user**

**In Dockerfile:**

```dockerfile
# Create non-root user
RUN useradd -m -u 1000 appuser

# Switch to user
USER appuser
```

**On Autobox:**

```bash
# Set ownership to match container user (uid 1000)
sudo chown -R 1000:1000 /opt/nomad-volumes/myapp-data
```

### Checking Volume Mounts

```bash
# On Nomad - check allocation
nomad alloc status <alloc-id>

# Look for volume mounts section:
# Mounted Volumes:
#   data -> /opt/nomad-volumes/myapp-data

# SSH to Autobox and verify
ls -la /opt/nomad-volumes/myapp-data
```

### Volume Backup

**Simple backup script:**

```bash
#!/bin/bash
# backup-volumes.sh
set -euo pipefail

VOLUME_PATH="/opt/nomad-volumes/myapp-data"
BACKUP_PATH="/backup/$(date +%Y%m%d)"

mkdir -p "$BACKUP_PATH"
# -C stores relative paths, so the archive can be restored into any directory
tar -czf "$BACKUP_PATH/myapp-data.tar.gz" -C "$(dirname "$VOLUME_PATH")" "$(basename "$VOLUME_PATH")"
```

---

## Vault Workarounds

### Problem

Vault is currently not working, so proper secret management is not available yet.

### Temporary Solutions

**Option 1: Environment Variables in Nomad Job (NOT RECOMMENDED)**

```hcl
env {
  APP_ENV      = "production"
  PORT         = "${NOMAD_PORT_http}"
  DATABASE_URL = "sqlite:////app/data/app.db"  # SQLAlchemy-style URL: four slashes for an absolute path
  API_KEY      = "your-secret-key-here"  # BAD: Secret in plain text!
}
```

**Pros:**
- Simple
- Works immediately

**Cons:**
- ❌ Secrets visible in Nomad UI
- ❌ Secrets in version control (if committed)
- ❌ Hard to rotate secrets

**Option 2: File-Based Secrets (BETTER)**

**Store secrets in file on Autobox:**

```bash
# On Autobox
sudo mkdir -p /opt/nomad-secrets/myapp
sudo vim /opt/nomad-secrets/myapp/secrets.env

# Content:
# API_KEY=your-secret-key
# DB_PASSWORD=your-db-password

sudo chown 1000:1000 /opt/nomad-secrets/myapp/secrets.env
sudo chmod 600 /opt/nomad-secrets/myapp/secrets.env
```

**Mount as host volume:**

```hcl
group "myapp-group" {
  volume "secrets" {
    type      = "host"
    source    = "myapp-secrets"  # Must be declared as a host_volume in client.hcl
    read_only = true             # Read-only for security
  }

  task "myapp-task" {
    volume_mount {
      volume      = "secrets"
      destination = "/app/secrets"
      read_only   = true
    }

    # Read secrets file at startup. POSIX sh has no `source`, so use `.`;
    # `set -a` exports the loaded variables to the Flask process.
    config {
      command = "sh"
      args    = ["-c", "set -a && . /app/secrets/secrets.env && set +a && exec flask run --host 0.0.0.0 --port $PORT"]
    }
  }
}
```

**Pros:**
- ✅ Secrets not in Nomad job file
- ✅ Can be backed up separately
- ✅ Easier to rotate

**Cons:**
- ⚠️ Still manual management
- ⚠️ Need to manage file permissions
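
If you would rather not depend on a shell wrapper, the app can read the mounted file itself. The sketch below parses simple `KEY=VALUE` lines like the ones in `secrets.env`; `load_env_file` is a hypothetical helper, and it deliberately ignores shell features such as variable expansion. A typical call would be `os.environ.update(load_env_file("/app/secrets/secrets.env"))` at startup.

```python
def load_env_file(path):
    """Parse KEY=VALUE lines; skip blank lines and '#' comments."""
    values = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            value = value.strip()
            # Drop one pair of matching surrounding quotes, if present
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                value = value[1:-1]
            values[key.strip()] = value
    return values
```
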

**Option 3: Consul KV Store (RECOMMENDED TEMPORARY)**

```bash
# Store secret in Consul
consul kv put secret/myapp/api_key "your-secret-key"
```

**In Nomad job template:**

```hcl
task "myapp-task" {
  template {
    data = <<EOH
{{ with key "secret/myapp/api_key" }}
API_KEY="{{ . }}"
{{ end }}
EOH
    destination = "secrets/config.env"
    env         = true
  }
}
```

**Pros:**
- ✅ Uses existing infrastructure (Consul)
- ✅ Can be managed via API
- ✅ Not visible in Nomad UI

**Cons:**
- ⚠️ Not as secure as Vault
- ⚠️ Manual secret rotation
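
"Managed via API" refers to Consul's HTTP KV endpoint, `GET /v1/kv/<key>`, which returns a JSON array whose `Value` field is base64-encoded. The sketch below decodes that response with the standard library. The base URL `http://127.0.0.1:8500` is an assumption (a local Consul agent without ACLs), and the function names are hypothetical; with ACLs enabled you would also need to send a token header.

```python
import base64
import json
import urllib.request


def decode_kv_response(body):
    """Decode the JSON body returned by Consul's GET /v1/kv/<key>."""
    entries = json.loads(body)
    value = entries[0]["Value"]  # base64 string, or null for an empty value
    return "" if value is None else base64.b64decode(value).decode("utf-8")


def fetch_kv(key, base_url="http://127.0.0.1:8500"):
    """Fetch one key; urlopen raises HTTPError (404) if the key is absent."""
    with urllib.request.urlopen(f"{base_url}/v1/kv/{key}", timeout=5) as resp:
        return decode_kv_response(resp.read())
```
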

### When Vault is Fixed

**Proper Vault integration:**

```hcl
task "myapp-task" {
  vault {
    policies = ["myapp-policy"]
  }

  template {
    data = <<EOH
{{ with secret "secret/data/myapp" }}
API_KEY="{{ .Data.data.api_key }}"
DATABASE_URL="{{ .Data.data.database_url }}"
{{ end }}
EOH
    destination = "secrets/config.env"
    env         = true
  }
}
```

---

## Complete Nomad Job Example

See `.gitea/workflows/nomad-job-complete.hcl.tmpl` for a fully documented example with:

- ✅ Proper health checks with grace period
- ✅ Host volume configuration
- ✅ Vault workarounds
- ✅ Auto-revert on failed deployments
- ✅ Graceful shutdown handling
- ✅ Resource limits
- ✅ Log rotation

---

## Dockerfile Best Practices

### Multi-Stage Build

```dockerfile
# Builder stage
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Runtime stage (smaller)
FROM python:3.11-slim
RUN useradd -m -u 1000 appuser
WORKDIR /app
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .
ENV PATH=/home/appuser/.local/bin:$PATH
USER appuser
# Bind to all interfaces so the mapped port is reachable from outside the container
CMD ["flask", "run", "--host=0.0.0.0"]
```

**Benefits:**
- Smaller final image
- Faster deployment
- Less attack surface

### Non-Root User

```dockerfile
# Create user
RUN useradd -m -u 1000 appuser

# Switch to user
USER appuser
```

**Why:**
- Security best practice
- Required for some volume mounts
- Prevents privilege escalation

### Health Check

```dockerfile
# python:3.11-slim does not ship curl, so probe with the Python standard library
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:${PORT}/health', timeout=2)" || exit 1
```

**Benefits:**
- Docker can detect unhealthy containers
- Extra layer of monitoring when inspecting containers with `docker ps`
- Note: Nomad does not act on the Docker health status; the `check` block in the job remains what drives restarts and service health

---

## Gitea CI/CD Workflow

### Complete Workflow Example

See `.gitea/workflows/main.yml.tmpl` for a complete Gitea Actions workflow that:

1. ✅ Builds Docker image
2. ✅ Tags with commit hash + latest
3. ✅ Pushes to private registry
4. ✅ Validates Nomad job
5. ✅ Stops old deployment
6. ✅ Deploys new version
7. ✅ Updates nginx configuration
8. ✅ Updates forwarder configuration

### Secrets in Gitea

Configure in Gitea repository settings:

- `secrets.username` - Registry username
- `secrets.password` - Registry password

### Self-Hosted Runner

Your runner must have:

- Docker installed
- Nomad CLI installed
- SSH access to Nomad server
- Access to private registry

---

## Troubleshooting

### Service Marked Unhealthy

**Check Consul:**

```bash
# On Nomad
# Confirm the service is registered, then query its checks
consul catalog services
curl -s http://localhost:8500/v1/health/checks/myapp

# Look for a failing check in the JSON output:
#   "Name": "http_health",
#   "Status": "critical"
```
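
Consul's HTTP API reports check status at `/v1/health/checks/<service>` as a JSON array with `Name`, `Status`, and `Output` fields. When a service has several checks, a small filter makes the failing ones easier to spot. This is a sketch only; `failing_checks` is a hypothetical helper that takes the raw response body as a string.

```python
import json


def failing_checks(body):
    """Return (name, status, output) for every check that is not passing."""
    return [
        (c.get("Name", ""), c.get("Status", ""), c.get("Output", ""))
        for c in json.loads(body)
        if c.get("Status") != "passing"
    ]
```
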

**Check allocation logs:**

```bash
nomad alloc logs -f <alloc-id> myapp-task
```

**Common causes:**
- /health endpoint not implemented
- App crashed
- Wrong port
- Slow startup

### Container Keeps Restarting

**Check allocation status:**

```bash
nomad alloc status <alloc-id>

# Look at Recent Events:
# Started -> Restart Signaled -> Started ...
```

**Common causes:**
- Failed health checks
- App crash on startup
- Missing dependencies
- Port already in use

### Volume Mount Issues

**Check Nomad client config:**

```bash
# On Autobox
# The verbose node status lists the host volumes the client has registered
nomad node status -self -verbose | grep -A 10 "Host Volumes"
```

**Check permissions:**

```bash
# On Autobox
ls -la /opt/nomad-volumes/myapp-data

# Should be owned by uid 1000 (or your container user)
```

**Check allocation:**

```bash
nomad alloc status <alloc-id>

# Look for Mounted Volumes section
```

### Port Conflicts

**Symptoms:**
```
Failed to start task: bind: address already in use
```

**Solution:** Nomad assigns dynamic ports automatically:

```hcl
network {
  port "http" {
    to = 5000  # Container internal port
    # Nomad picks the host port from its dynamic range (20000-32000 by default)
  }
}

env {
  PORT = "${NOMAD_PORT_http}"  # Port the app should listen on
}
```

### Secrets Not Loading

**Check Consul KV:**

```bash
consul kv get secret/myapp/api_key
```

**Check template rendering:**

```bash
# `nomad alloc fs` cannot read a task's secrets directory, so look from inside the task
nomad alloc exec -task myapp-task <alloc-id> ls -la /secrets

# Should see config.env or your secret files
```

**View rendered template:**

```bash
nomad alloc exec -task myapp-task <alloc-id> cat /secrets/config.env
```

---

## Quick Reference

### Essential Commands

```bash
# Check service health
curl -s http://localhost:8500/v1/health/checks/myapp

# View allocation
nomad alloc status <alloc-id>

# View logs
nomad alloc logs -f <alloc-id> myapp-task

# Exec into container
nomad alloc exec -i -t <alloc-id> /bin/sh

# Restart job
nomad job restart myapp

# Stop job
nomad job stop myapp

# Force reschedule
nomad job eval -force-reschedule myapp
```

### Health Check URL

```bash
# Find allocated port
nomad alloc status <alloc-id> | grep -A 3 "Allocation Addresses"

# Test health endpoint
curl http://192.168.15.124:30123/health
```

### Volume Locations

- **Client config:** `/etc/nomad.d/client.hcl` (on Autobox)
- **Volume data:** `/opt/nomad-volumes/<volume-name>` (on Autobox)
- **Secrets:** `/opt/nomad-secrets/<app-name>` (on Autobox)

---

**For more information, see:**
- Main infrastructure docs: `~/Projects/i80_network.md`
- Nomad UI: https://nomad.i80.dk:4646
- Consul UI: https://consul.i80.dk:8500