Files
fil/docs/guides/docker.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

271 lines
9.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Docker Deployment <span class="version-badge">v4.0.0</span>
Official Docker images built on the Rust core with Debian 13 (Trixie). Each image supports three execution modes: API server (default), command-line tool, and MCP server.
## Quick Start
### Pull and Run
=== "API Server"
--8<-- "snippets/docker/api_server_basic.md"
=== "CLI Mode"
--8<-- "snippets/docker/cli_mode_basic.md"
=== "MCP Server"
--8<-- "snippets/docker/mcp_basic.md"
### Pull Image
=== "Core"
--8<-- "snippets/docker/core_pull.md"
=== "Full"
--8<-- "snippets/docker/full_pull.md"
## Image Variants
| | **Core** | **Full** |
| ----------------- | -------------------------------------- | ---------------------------------------- |
| **Image** | `ghcr.io/kreuzberg-dev/kreuzberg:core` | `ghcr.io/kreuzberg-dev/kreuzberg:latest` |
| **Size** | ~1.01.3 GB | ~1.52.1 GB |
| **Tesseract OCR** | 12 languages | 12 languages |
| **Modern Office** | DOCX, PPTX, XLSX | DOCX, PPTX, XLSX |
| **Legacy Office** | DOC, PPT, XLS (native OLE/CFB) | DOC, PPT, XLS (native OLE/CFB) |
| **Startup** | ~1s | ~1s |
**Core** is optimized for production deployments where image size matters. Both images support all major formats — choose based on deployment constraints.
All images include: Tesseract OCR (eng, spa, fra, deu, ita, por, chi-sim, chi-tra, jpn, ara, rus, hin), PDF (pdf_oxide), images, HTML, email, and archives.
## Execution Modes
### API Server (Default)
```bash title="Terminal"
docker run -p 8000:8000 ghcr.io/kreuzberg-dev/kreuzberg:latest
# Custom port and CORS
docker run -p 9000:9000 \
-e KREUZBERG_CORS_ORIGINS="https://myapp.com" \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
serve --host 0.0.0.0 --port 9000
# With config file
docker run -p 8000:8000 \
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
serve --config /config/kreuzberg.toml
```
See [API Server Guide](api-server.md) for endpoint documentation.
### CLI Mode
```bash title="Terminal"
# Extract a file
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
extract /data/document.pdf
# Extract with OCR
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
extract /data/scanned.pdf --ocr true
# Batch processing
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
batch /data/*.pdf --format json
# MIME detection
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
detect /data/unknown-file.bin
```
### MCP Server
```bash title="Terminal"
docker run ghcr.io/kreuzberg-dev/kreuzberg:latest mcp
# With config
docker run \
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
mcp --config /config/kreuzberg.toml
```
See [API Server Guide - MCP Section](api-server.md#mcp-server) for integration details.
## Environment Variables
| Variable | Default | Description |
| ------------------------------ | ----------------------------- | ------------------------------------------------------------------------------------------------ |
| `KREUZBERG_MAX_UPLOAD_SIZE_MB` | `100` | Max upload size in MB |
| `KREUZBERG_CORS_ORIGINS` | `*` | Comma-separated allowed origins |
| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
| `KREUZBERG_CACHE_DIR` | `/app/.kreuzberg` | Cache directory (set explicitly in Docker; outside containers defaults to platform global cache) |
| `HF_HOME` | `/app/.kreuzberg/huggingface` | HuggingFace model cache |
Host and port are set via CLI args: `serve --host 0.0.0.0 --port 8000`.
## Volume Mounts
```bash title="Terminal"
# Cache persistence (embedding models, OCR cache)
docker run -p 8000:8000 \
-v kreuzberg-cache:/app/.kreuzberg \
ghcr.io/kreuzberg-dev/kreuzberg:latest
# Config file
docker run -p 8000:8000 \
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
serve --config /config/kreuzberg.toml
# Documents (read-only)
docker run -v $(pwd)/documents:/data:ro \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
extract /data/document.pdf
```
!!! Note "Model Downloads" Embedding models download on first use (~90 MB 1.2 GB depending on preset). Use a persistent volume for `/app/.kreuzberg` in production to avoid re-downloading on container restart. Outside Docker, models are cached in the platform-specific global cache directory (for example, `~/.cache/kreuzberg/` on Linux, `~/Library/Caches/kreuzberg/` on macOS).
## Docker Compose
```yaml title="docker-compose.yaml"
services:
kreuzberg-api:
image: ghcr.io/kreuzberg-dev/kreuzberg:latest
ports:
- "8000:8000"
environment:
- KREUZBERG_CORS_ORIGINS=https://myapp.com
- KREUZBERG_MAX_UPLOAD_SIZE_MB=500
- RUST_LOG=info
volumes:
- ./config:/config
- cache-data:/app/.kreuzberg
command: serve --host 0.0.0.0 --port 8000 --config /config/kreuzberg.toml
restart: unless-stopped
healthcheck:
test: ["CMD", "kreuzberg", "--version"]
interval: 30s
timeout: 10s
retries: 3
start_period: 5s
volumes:
cache-data:
```
## Security
Images run as non-root user `kreuzberg` (UID 1000). For hardened deployments:
```bash title="Terminal"
docker run --security-opt no-new-privileges \
--read-only \
--tmpfs /tmp \
-p 8000:8000 \
ghcr.io/kreuzberg-dev/kreuzberg:latest
```
Ensure mounted volumes have correct permissions:
```bash title="Terminal"
chown -R 1000:1000 /path/to/mounted/directory
```
## Resource Allocation
| Workload | Memory | CPU | Notes |
| -------- | ------ | --------- | --------------------------------------- |
| Light | 512 MB | 0.5 cores | Small documents, low concurrency |
| Medium | 1 GB | 1 core | Typical documents, moderate concurrency |
| Heavy | 2 GB+ | 2+ cores | Large documents, OCR, high concurrency |
```bash title="Terminal"
docker run -p 8000:8000 --memory=1g --cpus=1 \
ghcr.io/kreuzberg-dev/kreuzberg:latest
```
## Building Custom Images
=== "Core Image"
--8<-- "snippets/docker/build_core.md"
=== "Full Image"
--8<-- "snippets/docker/build_full.md"
```dockerfile title="Custom Dockerfile"
FROM ghcr.io/kreuzberg-dev/kreuzberg:latest
USER root
RUN apt-get update && \
apt-get install -y --no-install-recommends your-package-here && \
apt-get clean && rm -rf /var/lib/apt/lists/*
USER kreuzberg
COPY kreuzberg.toml /app/kreuzberg.toml
CMD ["serve", "--config", "/app/kreuzberg.toml"]
```
## Other Image Variants
The published Core and Full images cover most use cases. For specialized needs, the `docker/` directory has additional Dockerfiles:
| Dockerfile | What it builds |
| ------------------------- | ------------------------------------------------------------------------------------- |
| `Dockerfile.cli` | Minimal image with just the `kreuzberg` binary — good for CI pipelines and batch jobs |
| `Dockerfile.musl-build` | Fully static Linux binaries via MUSL — runs on any distro, no dynamic libs |
| `Dockerfile.musl-ffi` | Static C FFI library for language bindings (Go, Ruby, R, PHP, Elixir) |
| `Dockerfile.musl-rustler` | MUSL-based Rustler NIF for Elixir |
### CLI Image
A stripped-down image with only the CLI binary. No server, no API — just extraction:
```bash title="Terminal"
docker build -f docker/Dockerfile.cli -t kreuzberg-cli .
docker run -v $(pwd):/data kreuzberg-cli extract /data/document.pdf
docker run -v $(pwd):/data kreuzberg-cli batch /data/*.pdf --format json
docker run -v $(pwd):/data kreuzberg-cli detect /data/unknown-file.bin
```
### MUSL Static Builds
These produce binaries with zero dynamic library dependencies. A single file that runs on any Linux — Alpine, scratch containers, bare EC2 instances, whatever.
```bash title="Terminal"
docker build -f docker/Dockerfile.musl-build -t kreuzberg-musl-build .
docker build -f docker/Dockerfile.musl-ffi -t kreuzberg-musl-ffi .
```
The FFI variant builds a shared library used by the Go, Ruby, R, PHP, and Elixir bindings for portable cross-platform distribution.
## Troubleshooting
??? Question "Container won't start"
Check logs with `docker logs <container-id>`. Common causes: port conflict (change `-p` mapping), insufficient memory (increase `--memory`), volume permission errors.
??? Question "Permission errors on mounted volumes"
Images run as UID 1000. Fix with: `chown -R 1000:1000 /path/to/mounted/directory`
??? Question "Large file processing fails"
Increase memory limit (`--memory=4g`) and upload size (`-e KREUZBERG_MAX_UPLOAD_SIZE_MB=1000`).
## Next Steps
- [Kubernetes Deployment](kubernetes.md) — production K8s with OCR config and health checks
- [API Server Guide](api-server.md) — endpoint documentation
- [Configuration](configuration.md) — all configuration options