# Docker Deployment <span class="version-badge">v4.0.0</span>

Official Docker images are built on the Rust core with Debian 13 (Trixie). Each image supports three execution modes: API server (default), command-line tool, and MCP server.

## Quick Start

### Pull and Run

=== "API Server"

    --8<-- "snippets/docker/api_server_basic.md"

=== "CLI Mode"

    --8<-- "snippets/docker/cli_mode_basic.md"

=== "MCP Server"

    --8<-- "snippets/docker/mcp_basic.md"

### Pull Image

=== "Core"

    --8<-- "snippets/docker/core_pull.md"

=== "Full"

    --8<-- "snippets/docker/full_pull.md"

## Image Variants

|                   | **Core**                               | **Full**                                 |
| ----------------- | -------------------------------------- | ---------------------------------------- |
| **Image**         | `ghcr.io/kreuzberg-dev/kreuzberg:core` | `ghcr.io/kreuzberg-dev/kreuzberg:latest` |
| **Size**          | ~1.0–1.3 GB                            | ~1.5–2.1 GB                              |
| **Tesseract OCR** | 12 languages                           | 12 languages                             |
| **Modern Office** | DOCX, PPTX, XLSX                       | DOCX, PPTX, XLSX                         |
| **Legacy Office** | DOC, PPT, XLS (native OLE/CFB)         | DOC, PPT, XLS (native OLE/CFB)           |
| **Startup**       | ~1s                                    | ~1s                                      |

**Core** is optimized for production deployments where image size matters. Both images support all major formats, so choose based on deployment constraints.

All images include: Tesseract OCR (eng, spa, fra, deu, ita, por, chi_sim, chi_tra, jpn, ara, rus, hin), PDF (pdf_oxide), images, HTML, email, and archives.

## Execution Modes

### API Server (Default)

```bash title="Terminal"
# Default: API server on port 8000
docker run -p 8000:8000 ghcr.io/kreuzberg-dev/kreuzberg:latest

# Custom port and CORS
docker run -p 9000:9000 \
  -e KREUZBERG_CORS_ORIGINS="https://myapp.com" \
  ghcr.io/kreuzberg-dev/kreuzberg:latest \
  serve --host 0.0.0.0 --port 9000

# With config file
docker run -p 8000:8000 \
  -v "$(pwd)/kreuzberg.toml:/config/kreuzberg.toml" \
  ghcr.io/kreuzberg-dev/kreuzberg:latest \
  serve --config /config/kreuzberg.toml
```

See [API Server Guide](api-server.md) for endpoint documentation.

### CLI Mode

```bash title="Terminal"
# Extract a file
docker run -v "$(pwd):/data" ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf

# Extract with OCR
docker run -v "$(pwd):/data" ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/scanned.pdf --ocr true

# Batch processing
docker run -v "$(pwd):/data" ghcr.io/kreuzberg-dev/kreuzberg:latest \
  batch /data/*.pdf --format json

# MIME detection
docker run -v "$(pwd):/data" ghcr.io/kreuzberg-dev/kreuzberg:latest \
  detect /data/unknown-file.bin
```
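
A pattern such as `/data/*.pdf` is seen by the host shell first, where `/data` usually does not exist, so whether it expands depends on how the CLI handles the literal pattern. If you prefer to build the file list on the host, a minimal sketch follows. It assumes the files sit directly in the mounted directory; `container_path` and `container_pdfs` are hypothetical helpers, not part of the `kreuzberg` CLI.

```bash
#!/usr/bin/env bash
# Sketch: map host files to their paths under the container mount point,
# so the file list is built on the host rather than by glob expansion.
# MOUNT_POINT must match the target of the -v flag (assumed: /data).
MOUNT_POINT="/data"

# Print the container-side path for a host file located directly in the
# mounted directory (hypothetical helper).
container_path() {
  printf '%s/%s\n' "$MOUNT_POINT" "$(basename "$1")"
}

# Print container-side paths for every PDF in a host directory.
container_pdfs() {
  local dir="$1" file
  for file in "$dir"/*.pdf; do
    # Skip the unexpanded pattern when the directory has no PDFs.
    [ -e "$file" ] || continue
    container_path "$file"
  done
}

# Usage (requires Docker, not run here):
#   mapfile -t files < <(container_pdfs "$(pwd)")
#   docker run -v "$(pwd):/data" ghcr.io/kreuzberg-dev/kreuzberg:latest \
#     batch "${files[@]}" --format json
```

Passing explicit paths keeps the behavior the same across shells, since some shells reject an unmatched pattern instead of passing it through.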

### MCP Server

```bash title="Terminal"
docker run ghcr.io/kreuzberg-dev/kreuzberg:latest mcp

# With config
docker run \
  -v "$(pwd)/kreuzberg.toml:/config/kreuzberg.toml" \
  ghcr.io/kreuzberg-dev/kreuzberg:latest \
  mcp --config /config/kreuzberg.toml
```

See [API Server Guide - MCP Section](api-server.md#mcp-server) for integration details.

## Environment Variables

| Variable                       | Default                       | Description                                                                                      |
| ------------------------------ | ----------------------------- | ------------------------------------------------------------------------------------------------ |
| `KREUZBERG_MAX_UPLOAD_SIZE_MB` | `100`                         | Max upload size in MB                                                                            |
| `KREUZBERG_CORS_ORIGINS`       | `*`                           | Comma-separated allowed origins                                                                  |
| `RUST_LOG`                     | `info`                        | Log level: `error`, `warn`, `info`, `debug`, `trace`                                             |
| `KREUZBERG_CACHE_DIR`          | `/app/.kreuzberg`             | Cache directory (set explicitly in Docker; outside containers defaults to platform global cache) |
| `HF_HOME`                      | `/app/.kreuzberg/huggingface` | HuggingFace model cache                                                                          |

Host and port are set via CLI args: `serve --host 0.0.0.0 --port 8000`.
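
To see how the variables and the CLI args fit together, here is a minimal dry-run sketch. It only prints a `docker run` command, using the defaults from the table when a variable is unset. `build_run_command` is a hypothetical helper introduced for illustration.

```bash
#!/usr/bin/env bash
# Sketch: print (not execute) a `docker run` command that forwards the
# documented variables, falling back to the defaults from the table.
# build_run_command is a hypothetical helper, not shipped with the image.
IMAGE="ghcr.io/kreuzberg-dev/kreuzberg:latest"

build_run_command() {
  # The first argument is the port, used for both the mapping and --port.
  local port="${1:-8000}"
  printf 'docker run -p %s:%s' "$port" "$port"
  printf ' -e KREUZBERG_MAX_UPLOAD_SIZE_MB=%s' "${KREUZBERG_MAX_UPLOAD_SIZE_MB:-100}"
  # Single quotes keep a bare * from being expanded if the line is pasted.
  printf " -e KREUZBERG_CORS_ORIGINS='%s'" "${KREUZBERG_CORS_ORIGINS:-*}"
  printf ' -e RUST_LOG=%s' "${RUST_LOG:-info}"
  printf ' %s serve --host 0.0.0.0 --port %s\n' "$IMAGE" "$port"
}
```

Calling `build_run_command 9000` with `KREUZBERG_CORS_ORIGINS` exported prints the command for review before you run it.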

## Volume Mounts

```bash title="Terminal"
# Cache persistence (embedding models, OCR cache)
docker run -p 8000:8000 \
  -v kreuzberg-cache:/app/.kreuzberg \
  ghcr.io/kreuzberg-dev/kreuzberg:latest

# Config file
docker run -p 8000:8000 \
  -v "$(pwd)/kreuzberg.toml:/config/kreuzberg.toml" \
  ghcr.io/kreuzberg-dev/kreuzberg:latest \
  serve --config /config/kreuzberg.toml

# Documents (read-only)
docker run -v "$(pwd)/documents:/data:ro" \
  ghcr.io/kreuzberg-dev/kreuzberg:latest \
  extract /data/document.pdf
```

!!! Note "Model Downloads"

    Embedding models download on first use (~90 MB – 1.2 GB depending on preset). Use a persistent volume for `/app/.kreuzberg` in production to avoid re-downloading on container restart. Outside Docker, models are cached in the platform-specific global cache directory (for example, `~/.cache/kreuzberg/` on Linux, `~/Library/Caches/kreuzberg/` on macOS).

## Docker Compose

```yaml title="docker-compose.yaml"
services:
  kreuzberg-api:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    ports:
      - "8000:8000"
    environment:
      - KREUZBERG_CORS_ORIGINS=https://myapp.com
      - KREUZBERG_MAX_UPLOAD_SIZE_MB=500
      - RUST_LOG=info
    volumes:
      - ./config:/config
      - cache-data:/app/.kreuzberg
    command: serve --host 0.0.0.0 --port 8000 --config /config/kreuzberg.toml
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "kreuzberg", "--version"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s

volumes:
  cache-data:
```
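
Scripts that start the stack often need to wait until Docker marks the service healthy. A minimal host-side sketch of such a wait loop follows. `wait_until` and `is_healthy` are hypothetical helpers; the container name is an assumption that depends on your Compose project name.

```bash
#!/usr/bin/env bash
# Sketch: retry a command up to N times with a fixed delay, similar in
# spirit to the healthcheck's `retries` and `interval` settings.
# wait_until is a hypothetical helper, not a Docker or kreuzberg command.
wait_until() {
  local retries="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$retries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only between attempts, not after the final failure.
    if [ "$attempt" -le "$retries" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Succeeds once Docker reports the named container as healthy.
is_healthy() {
  [ "$(docker inspect --format '{{.State.Health.Status}}' "$1" 2>/dev/null)" = "healthy" ]
}

# Usage (requires Docker; the container name is an assumption):
#   docker compose up -d
#   wait_until 10 3 is_healthy myproject-kreuzberg-api-1
```

Note that the health check in the Compose file runs `kreuzberg --version`, so "healthy" here means the binary executes inside the container.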

## Security

Images run as non-root user `kreuzberg` (UID 1000). For hardened deployments:

```bash title="Terminal"
docker run --security-opt no-new-privileges \
  --read-only \
  --tmpfs /tmp \
  -p 8000:8000 \
  ghcr.io/kreuzberg-dev/kreuzberg:latest
```

With `--read-only`, only `/tmp` is writable in this example, so mount a volume at `/app/.kreuzberg` if the cache needs to be written or persisted.

Ensure mounted volumes have correct permissions:

```bash title="Terminal"
chown -R 1000:1000 /path/to/mounted/directory
```
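
Before mounting a directory, you can check whether it already belongs to the container user. The sketch below assumes GNU coreutils `stat` (Linux); BSD and macOS `stat` use different flags. `owned_by_uid` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: verify that a host directory is owned by the UID the container
# runs as (1000, per the text above). Assumes GNU `stat -c`.
CONTAINER_UID=1000

# Return 0 if the directory's owner matches the expected UID.
# The second argument overrides the default UID (useful for testing).
owned_by_uid() {
  local dir="$1" expected="${2:-$CONTAINER_UID}"
  [ "$(stat -c '%u' "$dir")" = "$expected" ]
}

# Usage:
#   owned_by_uid /path/to/mounted/directory || echo "run chown first"
```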

## Resource Allocation

| Workload | Memory | CPU       | Notes                                   |
| -------- | ------ | --------- | --------------------------------------- |
| Light    | 512 MB | 0.5 cores | Small documents, low concurrency        |
| Medium   | 1 GB   | 1 core    | Typical documents, moderate concurrency |
| Heavy    | 2 GB+  | 2+ cores  | Large documents, OCR, high concurrency  |

```bash title="Terminal"
docker run -p 8000:8000 --memory=1g --cpus=1 \
  ghcr.io/kreuzberg-dev/kreuzberg:latest
```
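
The table can be turned into `docker run` flags with a small lookup. This is a sketch: `resource_flags` is a hypothetical helper, and the `heavy` tier uses the table's minimum values, which you would likely raise for real workloads.

```bash
#!/usr/bin/env bash
# Sketch: map a workload tier from the table to docker resource flags.
# resource_flags is a hypothetical helper. "heavy" uses the table's
# lower bound (2 GB, 2 cores); raise it as needed.
resource_flags() {
  case "$1" in
    light)  echo "--memory=512m --cpus=0.5" ;;
    medium) echo "--memory=1g --cpus=1" ;;
    heavy)  echo "--memory=2g --cpus=2" ;;
    *)      echo "unknown workload: $1" >&2; return 1 ;;
  esac
}

# Usage (requires Docker, not run here). The flags are left unquoted on
# purpose so they split into separate arguments:
#   docker run -p 8000:8000 $(resource_flags medium) \
#     ghcr.io/kreuzberg-dev/kreuzberg:latest
```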

## Building Custom Images

=== "Core Image"

    --8<-- "snippets/docker/build_core.md"

=== "Full Image"

    --8<-- "snippets/docker/build_full.md"

```dockerfile title="Custom Dockerfile"
FROM ghcr.io/kreuzberg-dev/kreuzberg:latest

USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends your-package-here && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

USER kreuzberg
COPY kreuzberg.toml /app/kreuzberg.toml
CMD ["serve", "--config", "/app/kreuzberg.toml"]
```

## Other Image Variants

The published Core and Full images cover most use cases. For specialized needs, the `docker/` directory has additional Dockerfiles:

| Dockerfile                | What it builds                                                                        |
| ------------------------- | ------------------------------------------------------------------------------------- |
| `Dockerfile.cli`          | Minimal image with just the `kreuzberg` binary; good for CI pipelines and batch jobs  |
| `Dockerfile.musl-build`   | Fully static Linux binaries via MUSL; runs on any distro, no dynamic libs             |
| `Dockerfile.musl-ffi`     | Static C FFI library for language bindings (Go, Ruby, R, PHP, Elixir)                 |
| `Dockerfile.musl-rustler` | MUSL-based Rustler NIF for Elixir                                                     |

### CLI Image

A stripped-down image with only the CLI binary. No server, no API, just extraction:

```bash title="Terminal"
docker build -f docker/Dockerfile.cli -t kreuzberg-cli .

docker run -v "$(pwd):/data" kreuzberg-cli extract /data/document.pdf
docker run -v "$(pwd):/data" kreuzberg-cli batch /data/*.pdf --format json
docker run -v "$(pwd):/data" kreuzberg-cli detect /data/unknown-file.bin
```

### MUSL Static Builds

These produce binaries with zero dynamic library dependencies: a single file that runs on any Linux system, including Alpine, scratch containers, and bare EC2 instances.

```bash title="Terminal"
docker build -f docker/Dockerfile.musl-build -t kreuzberg-musl-build .
docker build -f docker/Dockerfile.musl-ffi -t kreuzberg-musl-ffi .
```

The FFI variant builds the static C FFI library used by the Go, Ruby, R, PHP, and Elixir bindings for portable distribution.

## Troubleshooting

??? Question "Container won't start"

    Check logs with `docker logs <container-id>`. Common causes: port conflict (change `-p` mapping), insufficient memory (increase `--memory`), volume permission errors.

??? Question "Permission errors on mounted volumes"

    Images run as UID 1000. Fix with: `chown -R 1000:1000 /path/to/mounted/directory`

??? Question "Large file processing fails"

    Increase memory limit (`--memory=4g`) and upload size (`-e KREUZBERG_MAX_UPLOAD_SIZE_MB=1000`).
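
To pick an upload limit for a known file rather than guessing, you can round its size up to whole megabytes. This sketch assumes 1 MB means 1024 × 1024 bytes; the server may count differently, so leave some headroom. `required_upload_mb` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: smallest whole-MB value that fits a file, rounding up.
# Assumes 1 MB = 1048576 bytes (an assumption about how the limit is
# measured, not confirmed by the documentation).
required_upload_mb() {
  local bytes
  bytes="$(wc -c < "$1")"
  # Ceiling division: add (divisor - 1) before dividing.
  echo $(( (bytes + 1048575) / 1048576 ))
}

# Usage:
#   docker run -p 8000:8000 \
#     -e KREUZBERG_MAX_UPLOAD_SIZE_MB="$(required_upload_mb big.pdf)" \
#     ghcr.io/kreuzberg-dev/kreuzberg:latest
```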

## Next Steps

- [Kubernetes Deployment](kubernetes.md): production K8s with OCR config and health checks
- [API Server Guide](api-server.md): endpoint documentation
- [Configuration](configuration.md): all configuration options