# Environment Variables Reference

Configuration precedence in Kreuzberg follows this order (highest to lowest):

1. **Environment Variables** - Highest priority, overrides all other sources
2. **Configuration Files** - TOML, YAML, or JSON config files
3. **Defaults** - Built-in sensible defaults

This document covers all `KREUZBERG_*` environment variables for version 4.3.8.

## When to Use Environment Variables

Environment variables are ideal for:

- **Container/Cloud Deployments**: Docker, Kubernetes, serverless environments where config files are impractical
- **CI/CD Pipelines**: Override settings per environment (dev, staging, production)
- **Simple Overrides**: Changing one or two settings without managing a config file
- **Secrets Management**: Using secret management systems that inject values as env vars

For complex configurations with many settings, configuration files are recommended:

```toml title="Example Configuration File"
# kreuzberg.toml is cleaner for multiple settings
[ocr]
language = "eng"
backend = "tesseract"

[chunking]
max_chars = 2000
max_overlap = 300
```

## API Server Configuration

These variables control the Kreuzberg server's network behavior and request handling.

### KREUZBERG_HOST

**Type**: `String`
**Default**: `127.0.0.1`
**Valid Values**: Any IPv4 or IPv6 address, or hostname

The server bind address. Use `0.0.0.0` to listen on all interfaces.

```bash title="Server Bind Address Examples"
# Listen only on localhost (default)
export KREUZBERG_HOST=127.0.0.1

# Listen on all interfaces (Docker, cloud deployments)
export KREUZBERG_HOST=0.0.0.0

# Listen on specific interface
export KREUZBERG_HOST=192.168.1.100
```

### KREUZBERG_PORT

**Type**: `u16` (1-65535)
**Default**: `8000`

The server port number.

```bash title="Server Port Examples"
export KREUZBERG_PORT=3000
export KREUZBERG_PORT=8080
```

**Error**: A value that cannot be parsed as a `u16` is rejected with a message such as:

```text
KREUZBERG_PORT must be a valid u16 number, got 'invalid': invalid digit found in string
```

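A deployment script can catch a bad port before the server starts. The following is a minimal sketch, not Kreuzberg's own validation: the helper name `is_valid_port` is hypothetical, and the 1-65535 range is taken from the type description above.

```bash
# Hypothetical pre-flight check; Kreuzberg still performs its own validation at startup.
is_valid_port() {
  case "$1" in
    '' | *[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  [ "${#1}" -le 5 ] || return 1  # more than five digits cannot fit in a u16
  [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

if is_valid_port "${KREUZBERG_PORT:-8000}"; then
  echo "port ok"
else
  echo "KREUZBERG_PORT is not a valid port: ${KREUZBERG_PORT}" >&2
fi
```

Checking the digit pattern first keeps the numeric comparisons from ever seeing non-numeric input.
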
### KREUZBERG_CORS_ORIGINS

**Type**: `String` (comma-separated list)
**Default**: Empty (allows all origins)

Whitelist of allowed CORS origins. When empty, the server accepts requests from any origin.

```bash title="CORS Origins Configuration"
# Allow all origins (default)
# unset KREUZBERG_CORS_ORIGINS

# Allow specific origins
export KREUZBERG_CORS_ORIGINS="https://api.example.com, https://app.example.com"

# Single origin
export KREUZBERG_CORS_ORIGINS="https://trusted.com"
```

**Security Warning**: Be explicit with CORS origins in production. Allowing all origins (`*`) means any website can call your API on behalf of users. In Kreuzberg, an empty list allows all origins - be intentional about this choice.

```bash title="CORS Security Best Practices"
# Production: Restrict to known origins
export KREUZBERG_CORS_ORIGINS="https://app.mycompany.com, https://admin.mycompany.com"

# Development: Leaving the variable unset allows all origins (wildcard behavior);
# understand the security implications
# Don't use wildcard in production unless absolutely necessary
```

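To review which origins a given value would allow, the comma-separated list can be split in the shell. This sketch assumes that whitespace around each entry is ignored, as the spacing in the examples above suggests; `list_cors_origins` is a hypothetical helper and not part of Kreuzberg.

```bash
# Hypothetical helper: print one origin per line, trimming surrounding whitespace
# and dropping empty entries.
list_cors_origins() {
  printf '%s\n' "$1" |
    tr ',' '\n' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

list_cors_origins "https://api.example.com, https://app.example.com"
```

An empty result means the value contains no origins, which per the default above corresponds to allowing every origin.
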
### KREUZBERG_MAX_REQUEST_BODY_BYTES

**Type**: `usize` (bytes)
**Default**: `104857600` (100 MB)

Maximum size of HTTP request bodies. Prevents oversized requests from consuming server resources.

```bash title="Max Request Body Size Examples"
# 50 MB
export KREUZBERG_MAX_REQUEST_BODY_BYTES=52428800

# 200 MB
export KREUZBERG_MAX_REQUEST_BODY_BYTES=209715200

# 500 MB
export KREUZBERG_MAX_REQUEST_BODY_BYTES=524288000
```

**Note**: Both `KREUZBERG_MAX_REQUEST_BODY_BYTES` and `KREUZBERG_MAX_MULTIPART_FIELD_BYTES` control upload limits. Adjust both for consistent behavior.

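The byte values in this reference use 1 MB = 1048576 bytes. A small helper avoids hand-computing them and makes it easy to keep both upload limits in step. This is only a sketch; `mib_to_bytes` is a hypothetical name.

```bash
# Hypothetical helper: convert megabytes (as used in this document, 1048576 bytes each) to bytes.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

limit=$(mib_to_bytes 200)
export KREUZBERG_MAX_REQUEST_BODY_BYTES="$limit"
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES="$limit"
echo "$KREUZBERG_MAX_REQUEST_BODY_BYTES"  # 209715200
```
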
### KREUZBERG_MAX_MULTIPART_FIELD_BYTES

**Type**: `usize` (bytes)
**Default**: `104857600` (100 MB)

Maximum size of individual multipart form fields. Controls the size of file uploads in multipart requests.

```bash title="Max Multipart Field Size Examples"
# 100 MB (default)
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=104857600

# 500 MB for large document processing
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=524288000

# 1 GB for extreme cases
export KREUZBERG_MAX_MULTIPART_FIELD_BYTES=1073741824
```

## Extraction Configuration

These variables control document extraction behavior, including OCR, text chunking, and caching.

### KREUZBERG_OCR_LANGUAGE

**Type**: `String` (ISO 639-1 or 639-3 language code)
**Default**: `eng` (English)

OCR language for scanned documents. Must be a valid language code recognized by the OCR backend.

```bash title="OCR Language Configuration"
# English (default)
export KREUZBERG_OCR_LANGUAGE=eng

# German
export KREUZBERG_OCR_LANGUAGE=deu

# French
export KREUZBERG_OCR_LANGUAGE=fra

# Spanish
export KREUZBERG_OCR_LANGUAGE=spa

# Chinese (Simplified)
export KREUZBERG_OCR_LANGUAGE=chi_sim

# Japanese
export KREUZBERG_OCR_LANGUAGE=jpn
```

**Supported Codes**: Language codes are backend-agnostic. The following formats are recognized:

- **Tesseract codes** (mostly ISO 639-3, plus Tesseract-specific names such as `chi_sim` and `chi_tra`): `eng`, `deu`, `fra`, `spa`, `ita`, `por`, `rus`, `chi_sim`, `chi_tra`, `jpn`, `kor`
- **PaddleOCR codes**: `en`, `ch`, `french`, `german`, `korean`, `thai`, `greek`, `cyrillic`, `latin`, `arabic`, `devanagari`, `tamil`, `telugu`
- **ISO 639-1 codes**: `en`, `de`, `fr`, `es`, `ja`, `ko`, `zh`, `ru`, `ar`, `th`, `el`

All code formats are accepted regardless of backend; Kreuzberg automatically maps between them.

### KREUZBERG_OCR_BACKEND

**Type**: `String`
**Default**: `tesseract`
**Valid Values**: `tesseract`, `easyocr`, `paddleocr`

OCR engine to use for text extraction from images and scanned documents.

```bash title="OCR Backend Selection"
# Tesseract (open source, good for English)
export KREUZBERG_OCR_BACKEND=tesseract

# EasyOCR (better multilingual support, slower)
export KREUZBERG_OCR_BACKEND=easyocr

# PaddleOCR (fast, good accuracy across languages)
export KREUZBERG_OCR_BACKEND=paddleocr
```

**Performance Notes**:

- **tesseract**: Fastest, best for English and Latin scripts
- **easyocr**: Slower, excellent multilingual support
- **paddleocr**: Fast with good accuracy for many languages

### KREUZBERG_CHUNKING_MAX_CHARS

**Type**: `usize` (positive integer)
**Default**: `1000` (characters)

Maximum number of characters per text chunk. Smaller chunks are useful for LLM context windows.

```bash title="Chunk Size Configuration"
# Small chunks for token-constrained LLMs
export KREUZBERG_CHUNKING_MAX_CHARS=512

# Default: balanced for most use cases
export KREUZBERG_CHUNKING_MAX_CHARS=1000

# Larger chunks for fewer splits
export KREUZBERG_CHUNKING_MAX_CHARS=2000

# Very large chunks for comprehensive context
export KREUZBERG_CHUNKING_MAX_CHARS=4000
```

**Validation**: Must be greater than 0. Must be greater than `KREUZBERG_CHUNKING_MAX_OVERLAP`.

### KREUZBERG_CHUNKING_MAX_OVERLAP

**Type**: `usize` (non-negative integer)
**Default**: `200` (characters)

Character overlap between consecutive chunks. Maintains context across chunk boundaries.

```bash title="Chunk Overlap Configuration"
# No overlap (creates discontinuities)
export KREUZBERG_CHUNKING_MAX_OVERLAP=0

# Default: 20% overlap with 1000-char chunks
export KREUZBERG_CHUNKING_MAX_OVERLAP=200

# More overlap: 30% for better context continuity
export KREUZBERG_CHUNKING_MAX_OVERLAP=300

# High overlap for sensitive documents
export KREUZBERG_CHUNKING_MAX_OVERLAP=500
```

**Validation**: Must be less than `KREUZBERG_CHUNKING_MAX_CHARS`.

**Example Error** (overlap set to 1200 with the default chunk size of 1000):

```text
Chunking overlap (1200) cannot be greater than or equal to max_chars (1000)
```

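The two chunking rules (a positive chunk size, and an overlap strictly smaller than it) can be checked before launch. The sketch below mirrors the documented rules only; `check_chunking` and `overlap_percent` are hypothetical helpers, not Kreuzberg functions.

```bash
# Hypothetical pre-flight check mirroring the documented rules:
# max_chars > 0 and 0 <= overlap < max_chars.
check_chunking() {
  max_chars=$1
  overlap=$2
  [ "$max_chars" -gt 0 ] && [ "$overlap" -ge 0 ] && [ "$overlap" -lt "$max_chars" ]
}

# Overlap as a whole-number percentage of the chunk size.
overlap_percent() {
  echo $(( $2 * 100 / $1 ))
}

if check_chunking 1000 200; then
  echo "overlap is $(overlap_percent 1000 200)% of each chunk"
fi
```

With the defaults this reports the 20% figure quoted in the overlap examples above.
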
### KREUZBERG_CACHE_ENABLED

**Type**: `Boolean` (`true` or `false`, case-insensitive)
**Default**: `true`

Enable or disable extraction result caching. Cache stores results to avoid reprocessing identical documents.

```bash title="Cache Enable/Disable"
# Enable cache (default, recommended for production)
export KREUZBERG_CACHE_ENABLED=true

# Disable cache (development, testing, or when cache is problematic)
export KREUZBERG_CACHE_ENABLED=false

# Case insensitive
export KREUZBERG_CACHE_ENABLED=TRUE
export KREUZBERG_CACHE_ENABLED=False
```

### KREUZBERG_OUTPUT_FORMAT

**Type**: `String`
**Default**: `plain`
**Valid Values**: `plain`, `markdown`, `djot`, `html`

Controls how the extracted text content is formatted in extraction results.

```bash title="Output Format Options"
# Plain text content only (default)
export KREUZBERG_OUTPUT_FORMAT=plain

# Markdown formatted output
export KREUZBERG_OUTPUT_FORMAT=markdown

# Djot markup format
export KREUZBERG_OUTPUT_FORMAT=djot

# HTML formatted output
export KREUZBERG_OUTPUT_FORMAT=html
```

**Use Cases**:

| Format     | Use Case                                                        |
| ---------- | --------------------------------------------------------------- |
| `plain`    | Raw extracted text without formatting                           |
| `markdown` | Structured text with headings, lists, emphasis (RAG, LLM input) |
| `djot`     | Lightweight markup, alternative to Markdown                     |
| `html`     | Rich formatted output for web display                           |

**Example:**

```bash title="Extract with markdown formatting"
export KREUZBERG_OUTPUT_FORMAT=markdown
kreuzberg
```

### KREUZBERG_TOKEN_REDUCTION_MODE

**Type**: `String`
**Default**: `off`
**Valid Values**: `off`, `light`, `moderate`, `aggressive`, `maximum`

Token reduction aggressiveness for compressing extracted text while preserving meaning. Useful when working with token-limited LLMs.

```bash title="Token Reduction Mode Options"
# No reduction (keep all text as-is)
export KREUZBERG_TOKEN_REDUCTION_MODE=off

# Light reduction: Remove common stopwords, minimal impact
export KREUZBERG_TOKEN_REDUCTION_MODE=light

# Moderate reduction: Balance between compression and meaning preservation
export KREUZBERG_TOKEN_REDUCTION_MODE=moderate

# Aggressive reduction: Significant compression, some detail loss
export KREUZBERG_TOKEN_REDUCTION_MODE=aggressive

# Maximum reduction: Extreme compression for token-constrained scenarios
export KREUZBERG_TOKEN_REDUCTION_MODE=maximum
```

**Impact on Tokens**:

| Mode         | Typical Reduction | Use Case                                    |
| ------------ | ----------------- | ------------------------------------------- |
| `off`        | 0%                | Full preservation, no compression           |
| `light`      | 10-15%            | Minimal impact, clean up obvious redundancy |
| `moderate`   | 25-35%            | Balanced approach for most scenarios        |
| `aggressive` | 40-50%            | Significant compression, still readable     |
| `maximum`    | 50-70%            | Extreme compression, loses some detail      |

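Several variables in this section accept only a fixed set of values. A startup script can check them against the lists given above; the sketch below assumes an exact, lowercase match (this document does not say whether other casings are accepted), and `is_one_of` is a hypothetical helper.

```bash
# Hypothetical helper: succeed if the first argument equals one of the remaining ones.
is_one_of() {
  value=$1
  shift
  for allowed in "$@"; do
    [ "$value" = "$allowed" ] && return 0
  done
  return 1
}

# Unset variables fall back to the documented defaults before being checked.
is_one_of "${KREUZBERG_OCR_BACKEND:-tesseract}" tesseract easyocr paddleocr ||
  echo "unsupported KREUZBERG_OCR_BACKEND" >&2
is_one_of "${KREUZBERG_OUTPUT_FORMAT:-plain}" plain markdown djot html ||
  echo "unsupported KREUZBERG_OUTPUT_FORMAT" >&2
is_one_of "${KREUZBERG_TOKEN_REDUCTION_MODE:-off}" off light moderate aggressive maximum ||
  echo "unsupported KREUZBERG_TOKEN_REDUCTION_MODE" >&2
```
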
## Runtime Configuration

Control cache location, debug output, and runtime behavior.

### KREUZBERG_CACHE_DIR

**Type**: `String` (file system path)
**Default**: Platform-specific global cache directory

Override the default cache directory for storing extraction cache, models, and intermediate files. When unset, Kreuzberg uses a platform-appropriate global cache:

- **Linux**: `~/.cache/kreuzberg/` (or `$XDG_CACHE_HOME/kreuzberg/`)
- **macOS**: `~/Library/Caches/kreuzberg/`
- **Windows**: `%LOCALAPPDATA%/kreuzberg/`

If the platform cache directory cannot be determined, Kreuzberg falls back to `~/.cache/kreuzberg/`, then `.kreuzberg/` in the current working directory as a last resort.

```bash title="Cache Directory Configuration"
# Default: uses platform-specific global cache (recommended)
# unset KREUZBERG_CACHE_DIR

# Store cache in specific location
export KREUZBERG_CACHE_DIR=/var/cache/kreuzberg

# Docker: Use volume mount
export KREUZBERG_CACHE_DIR=/data/kreuzberg-cache

# Development: Quick local cleanup
export KREUZBERG_CACHE_DIR=/tmp/kreuzberg-cache
```

**Directory Structure**: Kreuzberg creates subdirectories for different cache types:

```text
$KREUZBERG_CACHE_DIR/
  ocr/           # OCR result cache
  embeddings/    # Chunk embedding cache
  extractions/   # Full extraction cache
```

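The lookup order described above can be sketched for Linux as follows. This illustrates the documented order rather than Kreuzberg's actual code: `resolve_cache_dir` is a hypothetical helper, and treating an empty variable the same as an unset one is an assumption.

```bash
# Hypothetical helper: print the cache directory implied by the documented lookup order (Linux).
resolve_cache_dir() {
  if [ -n "${KREUZBERG_CACHE_DIR:-}" ]; then
    printf '%s\n' "$KREUZBERG_CACHE_DIR"        # explicit override
  elif [ -n "${XDG_CACHE_HOME:-}" ]; then
    printf '%s\n' "$XDG_CACHE_HOME/kreuzberg"   # platform cache directory
  elif [ -n "${HOME:-}" ]; then
    printf '%s\n' "$HOME/.cache/kreuzberg"      # fallback under the home directory
  else
    printf '%s\n' ".kreuzberg"                  # last resort: current working directory
  fi
}

resolve_cache_dir
```

Running it in a container entrypoint is a quick way to see where cache files are expected to land before mounting a volume.
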
### KREUZBERG_CI_DEBUG

**Type**: `Boolean` (presence check: set to any value to enable)
**Default**: Disabled (unset)

Enable detailed debug logging for CI environments. Outputs step-by-step timing and parameter information for OCR operations.

```bash title="Enable CI Debug Logging"
# Enable CI debug output
export KREUZBERG_CI_DEBUG=1
export KREUZBERG_CI_DEBUG=true
export KREUZBERG_CI_DEBUG=yes

# Output example:
# [kreuzberg::ocr] perform_ocr:start bytes=1024000 language=eng output=text use_cache=true
# [kreuzberg::ocr] perform_ocr:end duration_ms=2534
```

**Use Cases**:

- Debugging slow OCR operations
- Tracing cache hits/misses
- Performance profiling in CI pipelines
- Understanding extraction pipeline behavior

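Because this is a presence check rather than a parsed boolean, a value such as `0` or `false` still enables the output; unset the variable to turn it off. The sketch below illustrates that semantic in the shell. `ci_debug_enabled` is a hypothetical helper, and counting an empty value as "set" is an assumption.

```bash
# Hypothetical helper: true when the variable is set at all, whatever its value.
ci_debug_enabled() {
  [ -n "${KREUZBERG_CI_DEBUG+x}" ]
}

KREUZBERG_CI_DEBUG=false
ci_debug_enabled && echo "debug on"    # still enabled: only presence matters

unset KREUZBERG_CI_DEBUG
ci_debug_enabled || echo "debug off"   # unsetting is the way to disable
```

The same reasoning applies to the other presence-check variables in this reference.
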
### KREUZBERG_DEBUG_OCR

**Type**: `Boolean` (presence check: set to any value to enable)
**Default**: Disabled (unset)

Enable OCR-specific debug output. Outputs diagnostic information about OCR decisions, fallbacks, and text coverage metrics.

```bash title="Enable OCR Debug Logging"
# Enable OCR debug logging
export KREUZBERG_DEBUG_OCR=1

# Output example:
# [kreuzberg::pdf::ocr] fallback=true non_whitespace=8543 alnum=7234 meaningful_words=312
# [kreuzberg::pdf::ocr] avg_non_whitespace=45.2 avg_alnum=38.1 alnum_ratio=0.847
```

**Diagnostic Information**:

- Whether OCR fallback was triggered
- Character counts (non-whitespace, alphanumeric)
- Word counts and coverage ratios
- Coverage thresholds and decisions

## Memory & Performance

Configure caching for string encoding operations to optimize performance.

### KREUZBERG_ENCODING_CACHE_MAX_ENTRIES

**Type**: `usize` (positive integer)
**Default**: `10000`

Maximum number of strings cached in the encoding cache. Each entry consumes memory proportional to string length.

```bash title="Encoding Cache Entry Limit"
# Default: reasonable for most applications
export KREUZBERG_ENCODING_CACHE_MAX_ENTRIES=10000

# Higher for very large batches
export KREUZBERG_ENCODING_CACHE_MAX_ENTRIES=50000

# Lower to reduce memory usage
export KREUZBERG_ENCODING_CACHE_MAX_ENTRIES=1000
```

### KREUZBERG_ENCODING_CACHE_MAX_BYTES

**Type**: `usize` (bytes)
**Default**: `104857600` (100 MB)

Maximum total size of cached strings in bytes. Once exceeded, least-used entries are evicted.

```bash title="Encoding Cache Size Limit"
# Default: 100 MB
export KREUZBERG_ENCODING_CACHE_MAX_BYTES=104857600

# Larger cache for high-throughput scenarios
export KREUZBERG_ENCODING_CACHE_MAX_BYTES=524288000  # 500 MB

# Smaller cache for memory-constrained environments
export KREUZBERG_ENCODING_CACHE_MAX_BYTES=10485760  # 10 MB
```

## LLM Integration

Configure LLM-powered features such as structured extraction, vision-based OCR, and provider-hosted embeddings.

### KREUZBERG_LLM_MODEL

**Type**: `String`
**Default**: None (must be set explicitly or via config)

Default LLM model for structured extraction. Uses [liter-llm](https://github.com/kreuzberg-dev/liter-llm) model format (`provider/model-name`).

```bash title="LLM Model Configuration"
# OpenAI
export KREUZBERG_LLM_MODEL=openai/gpt-4o-mini

# Anthropic
export KREUZBERG_LLM_MODEL=anthropic/claude-sonnet-4-20250514

# Local provider
export KREUZBERG_LLM_MODEL=ollama/llama3
```

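When a script needs the provider and the model name separately (for example, to decide which provider API key to inject), the `provider/model-name` string can be split with shell parameter expansion. This sketch assumes the provider is everything before the first slash; the helper names are hypothetical and this is not liter-llm's own parsing.

```bash
# Hypothetical helpers: split "provider/model-name" at the first slash.
model_provider() { printf '%s\n' "${1%%/*}"; }
model_name()     { printf '%s\n' "${1#*/}"; }

model="openai/gpt-4o-mini"
echo "provider=$(model_provider "$model") model=$(model_name "$model")"
```
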
### KREUZBERG_LLM_API_KEY

**Type**: `String`
**Default**: None

API key for the structured extraction LLM provider. When not set, liter-llm falls back to provider-standard environment variables (for example, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).

```bash title="LLM API Key Configuration"
export KREUZBERG_LLM_API_KEY=sk-...
```

**Security Warning**: Prefer using provider-standard environment variables or a secrets manager over setting this directly. This variable is provided for cases where multiple providers are used and explicit key routing is needed.

### KREUZBERG_LLM_BASE_URL

**Type**: `String`
**Default**: None (uses provider default)

Custom base URL for the structured extraction LLM provider. Useful for self-hosted models, proxies, or alternative API-compatible endpoints.

```bash title="LLM Base URL Configuration"
# Custom OpenAI-compatible endpoint
export KREUZBERG_LLM_BASE_URL=https://api.example.com

# Local Ollama instance
export KREUZBERG_LLM_BASE_URL=http://localhost:11434
```

### KREUZBERG_VLM_OCR_MODEL

**Type**: `String`
**Default**: None (must be set explicitly or via config)

Vision language model (VLM) used for vision-based OCR. When configured, Kreuzberg can use a vision model as an OCR backend, sending document images directly to the VLM for text extraction.

```bash title="VLM OCR Model Configuration"
# OpenAI GPT-4o for vision OCR
export KREUZBERG_VLM_OCR_MODEL=openai/gpt-4o

# Anthropic Claude for vision OCR
export KREUZBERG_VLM_OCR_MODEL=anthropic/claude-sonnet-4-20250514
```

### KREUZBERG_VLM_EMBEDDING_MODEL

**Type**: `String`
**Default**: None (must be set explicitly or via config)

LLM model for provider-hosted embeddings. Instead of running local ONNX embedding models, Kreuzberg can delegate embedding generation to a cloud provider's embedding API.

```bash title="VLM Embedding Model Configuration"
# OpenAI embeddings
export KREUZBERG_VLM_EMBEDDING_MODEL=openai/text-embedding-3-small

# Cohere embeddings
export KREUZBERG_VLM_EMBEDDING_MODEL=cohere/embed-english-v3.0
```

**Note**: When `api_key` is not set in config, liter-llm falls back to provider-standard environment variables (for example, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`).

| Variable                        | Description                                        | Example                         |
| ------------------------------- | -------------------------------------------------- | ------------------------------- |
| `KREUZBERG_LLM_MODEL`           | Default LLM model for structured extraction        | `openai/gpt-4o-mini`            |
| `KREUZBERG_LLM_API_KEY`         | API key for structured extraction LLM provider     | `sk-...`                        |
| `KREUZBERG_LLM_BASE_URL`        | Custom base URL for structured extraction provider | `https://api.example.com`       |
| `KREUZBERG_VLM_OCR_MODEL`       | VLM model for vision-based OCR                     | `openai/gpt-4o`                 |
| `KREUZBERG_VLM_EMBEDDING_MODEL` | LLM model for provider-hosted embeddings           | `openai/text-embedding-3-small` |

---

## Testing Variables

Variables for development, testing, and quality assurance.

### KREUZBERG_RUN_FULL_OCR

**Type**: `Boolean` (presence check: set to any value to enable)
**Default**: Disabled (skips expensive tests)
**Status**: Testing only

Enable expensive OCR quality tests. These tests perform full OCR on large documents and are slow (can take minutes).

```bash title="Enable Full OCR Tests"
# Skip expensive OCR tests (default, fast test runs)
# unset KREUZBERG_RUN_FULL_OCR

# Run full OCR quality tests
export KREUZBERG_RUN_FULL_OCR=1

# Test output when the variable is unset:
# test test_ocr_quality_multi_page_consistency ... SKIPPED
# Skipping test_ocr_quality_multi_page_consistency: set KREUZBERG_RUN_FULL_OCR=1 to enable
```

**Warning**:

- These tests can take 10+ minutes
- They require OCR backends to be installed and working
- They produce large temporary files
- Run them only in CI/CD for comprehensive validation

## Docker Compose Examples

### Basic Configuration

```yaml title="Docker Compose - Basic Setup"
version: "3.8"
services:
  kreuzberg:
    image: kreuzberg:latest
    ports:
      - "3000:3000"
    environment:
      KREUZBERG_HOST: "0.0.0.0"
      KREUZBERG_PORT: "3000"
      KREUZBERG_OCR_LANGUAGE: "eng"
      KREUZBERG_CACHE_ENABLED: "true"
```

### Production Configuration

```yaml title="Docker Compose - Production Setup"
version: "3.8"
services:
  kreuzberg:
    image: kreuzberg:latest
    ports:
      - "8000:8000"
    volumes:
      - kreuzberg_cache:/data/cache
    environment:
      KREUZBERG_HOST: "0.0.0.0"
      KREUZBERG_PORT: "8000"
      KREUZBERG_CORS_ORIGINS: "https://app.example.com, https://admin.example.com"
      KREUZBERG_MAX_REQUEST_BODY_BYTES: "209715200" # 200 MB
      KREUZBERG_MAX_MULTIPART_FIELD_BYTES: "209715200"
      KREUZBERG_CACHE_DIR: "/data/cache"
      KREUZBERG_OCR_LANGUAGE: "eng"
      KREUZBERG_OCR_BACKEND: "tesseract"
      KREUZBERG_CHUNKING_MAX_CHARS: "2000"
      KREUZBERG_CHUNKING_MAX_OVERLAP: "300"
      KREUZBERG_TOKEN_REDUCTION_MODE: "moderate"

volumes:
  kreuzberg_cache:
    driver: local
```

### Multilingual Configuration

```yaml title="Docker Compose - Multilingual Setup"
version: "3.8"
services:
  kreuzberg:
    image: kreuzberg:latest
    ports:
      - "8000:8000"
    environment:
      KREUZBERG_HOST: "0.0.0.0"
      KREUZBERG_PORT: "8000"
      KREUZBERG_OCR_BACKEND: "easyocr" # Better multilingual support
      KREUZBERG_OCR_LANGUAGE: "fra" # French
      KREUZBERG_CACHE_ENABLED: "true"
```

### Development Configuration

```yaml title="Docker Compose - Development Setup"
version: "3.8"
services:
  kreuzberg:
    image: kreuzberg:latest
    ports:
      - "127.0.0.1:8000:8000" # Publish on the host's loopback interface only
    environment:
      KREUZBERG_HOST: "0.0.0.0" # Bind all interfaces inside the container so the published port is reachable
      KREUZBERG_PORT: "8000"
      KREUZBERG_CACHE_ENABLED: "false" # Disable for fresh testing
      KREUZBERG_CI_DEBUG: "1" # Enable debug output
      KREUZBERG_DEBUG_OCR: "1"
      KREUZBERG_CACHE_DIR: "/tmp/kreuzberg"
```

## Environment Variable Loading Order

Kreuzberg resolves its configuration in this order:

1. Load configuration file (TOML/YAML/JSON) if specified
2. Parse environment variables using `apply_env_overrides()`
3. Validate all settings

Because overrides are applied after the file is loaded, environment variables always win over file configuration:

```rust title="Rust - Applying Environment Overrides"
let mut config = ExtractionConfig::from_file("kreuzberg.toml")?;
config.apply_env_overrides()?; // Overrides file values
```

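The effect of this ordering on a single setting can be sketched in the shell. This is an illustration of the precedence rule only, not Kreuzberg's implementation: `resolve_setting` is a hypothetical helper, and treating an empty value as "not provided" is an assumption.

```bash
# Hypothetical helper: pick the first non-empty value among env, file, and default.
resolve_setting() {
  env_value=$1
  file_value=$2
  default_value=$3
  if [ -n "$env_value" ]; then
    printf '%s\n' "$env_value"
  elif [ -n "$file_value" ]; then
    printf '%s\n' "$file_value"
  else
    printf '%s\n' "$default_value"
  fi
}

# Suppose the config file sets the OCR language to "fra" and the default is "eng".
resolve_setting "deu" "fra" "eng"   # environment variable set: it wins
resolve_setting ""    "fra" "eng"   # no environment variable: file value is used
resolve_setting ""    ""    "eng"   # neither set: built-in default
```
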
## Common Patterns

### Using with Config Files

Combine files with environment overrides for flexibility:

```bash title="Combining Config Files with Env Overrides"
# Load base config from file
# Override specific values for this deployment
export KREUZBERG_OCR_LANGUAGE=deu
export KREUZBERG_CACHE_DIR=/mnt/cache
kreuzberg --config kreuzberg.toml
```

### Shell Script Initialization

```bash title="Environment-Based Shell Script"
#!/bin/bash
# Load deployment-specific settings

if [ "$ENVIRONMENT" = "production" ]; then
  export KREUZBERG_HOST="0.0.0.0"
  export KREUZBERG_CORS_ORIGINS="https://app.example.com"
  export KREUZBERG_CACHE_ENABLED="true"
  export KREUZBERG_MAX_REQUEST_BODY_BYTES=$((200 * 1048576))
elif [ "$ENVIRONMENT" = "development" ]; then
  export KREUZBERG_HOST="127.0.0.1"
  export KREUZBERG_CACHE_ENABLED="false"
  export KREUZBERG_CI_DEBUG="1"
fi

kreuzberg
```

### Kubernetes ConfigMap

```yaml title="Kubernetes ConfigMap and Pod Configuration"
apiVersion: v1
kind: ConfigMap
metadata:
  name: kreuzberg-config
data:
  KREUZBERG_HOST: "0.0.0.0"
  KREUZBERG_PORT: "8000"
  KREUZBERG_CORS_ORIGINS: "https://api.example.com"
  KREUZBERG_CACHE_DIR: "/data/cache"
  KREUZBERG_OCR_BACKEND: "tesseract"
  KREUZBERG_TOKEN_REDUCTION_MODE: "moderate"
---
apiVersion: v1
kind: Pod
metadata:
  name: kreuzberg-server
spec:
  containers:
    - name: kreuzberg
      image: kreuzberg:latest
      ports:
        - containerPort: 8000
      envFrom:
        - configMapRef:
            name: kreuzberg-config
      volumeMounts:
        - name: cache
          mountPath: /data/cache
  volumes:
    - name: cache
      persistentVolumeClaim:
        claimName: kreuzberg-cache-pvc
```

## ONNX Runtime Configuration

### ORT_DYLIB_PATH

**Type**: `String`
**Default**: Not set (a system-installed ONNX Runtime is auto-discovered if present; otherwise the bundled CPU-only version is used)

Path to a custom ONNX Runtime shared library. Set this to use a GPU-enabled ONNX Runtime instead of the bundled CPU-only version.

Required for GPU acceleration (`cuda`, `tensorrt`) with PaddleOCR, layout detection, embeddings, and document orientation detection.

```bash title="GPU Acceleration Setup"
# Linux: using ONNX Runtime GPU release
export ORT_DYLIB_PATH=/usr/local/lib/libonnxruntime.so

# Linux: using pip-installed onnxruntime-gpu
export ORT_DYLIB_PATH=$(python -c "import onnxruntime; print(onnxruntime.__path__[0])")/capi/libonnxruntime.so

# macOS: using Homebrew
export ORT_DYLIB_PATH=/opt/homebrew/lib/libonnxruntime.dylib

# Windows (cmd.exe syntax, not bash)
set ORT_DYLIB_PATH=C:\path\to\onnxruntime.dll
```

When not set, Kreuzberg auto-discovers system-installed ONNX Runtime on common paths. If no system library is found, the bundled CPU-only version is used.

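Since the variable holds a file path, it may be worth confirming that the file exists before starting Kreuzberg. The sketch below is a hypothetical pre-flight check; how Kreuzberg itself reacts to a missing library is not covered in this document.

```bash
# Hypothetical pre-flight check: an unset variable is fine (auto-discovery or the
# bundled runtime applies), but a set variable should point at an existing file.
check_ort_dylib_path() {
  if [ -z "${ORT_DYLIB_PATH:-}" ]; then
    return 0
  fi
  [ -f "$ORT_DYLIB_PATH" ]
}

check_ort_dylib_path ||
  echo "ORT_DYLIB_PATH does not point to a file: $ORT_DYLIB_PATH" >&2
```
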
## See Also

- [Configuration Guide](./configuration.md) - Detailed configuration file format and options
- [File Size Limits](./file-size-limits.md) - Upload and processing limits
- [Types Reference](./types.md) - API type definitions and structures