Configuration Reference
Kreuzberg uses a hierarchical configuration system supporting multiple formats and auto-discovery mechanisms. This reference covers all available configuration options, field names across programming languages, and loading strategies.
Supported Formats
Kreuzberg configurations can be defined in three formats:
- TOML (recommended): kreuzberg.toml
- YAML: kreuzberg.yaml
- JSON: kreuzberg.json
All formats support the same schema and configuration options.
Auto-Discovery
When no configuration file is explicitly specified, Kreuzberg searches for configuration files in the following order:
- Current working directory: kreuzberg.toml, kreuzberg.yaml, kreuzberg.json
- Parent directories (recursively up the tree, same file name pattern)
The first matching configuration file is loaded.
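The search order above can be sketched with the standard library alone. This is an illustration of the documented order, not Kreuzberg's implementation; `find_config` and `CONFIG_NAMES` are hypothetical names, and the sketch assumes the three file names are tried in the listed order within each directory.

```python
from pathlib import Path
from typing import Optional

# File names in the order the reference lists them for each directory.
CONFIG_NAMES = ("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json")


def find_config(start: Path) -> Optional[Path]:
    """Return the first config file found in `start` or any parent directory."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for directory in (start, *start.parents):
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

Calling `find_config(Path.cwd())` mirrors what happens when no configuration file is given explicitly; a `None` result means no file was found anywhere up the tree.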
Programmatic Loading
Python
from kreuzberg import ExtractionConfig
# Load from explicit path
config = ExtractionConfig.from_file("kreuzberg.toml")
# Auto-discover configuration
config = ExtractionConfig.discover()
Node.js / TypeScript
import { ExtractionConfig } from "@kreuzberg/node";
// Load from explicit path
const config = ExtractionConfig.fromFile("kreuzberg.toml");
// Auto-discover configuration (separate name: `config` is already declared above)
const discovered = ExtractionConfig.discover();
CLI
# Explicit configuration file
kreuzberg extract --config kreuzberg.toml document.pdf
# Auto-discovery (searches default locations)
kreuzberg extract document.pdf
Configuration Schema
The complete TOML schema with all available sections and options:
Top-Level Options
use_cache = true
enable_quality_processing = true
force_ocr = false
output_format = "markdown"
result_format = "text"
max_concurrent_extractions = 4
| Option | Type | Default | Description |
|---|---|---|---|
| use_cache | boolean | true | Enable caching of extraction results |
| enable_quality_processing | boolean | true | Enable post-processing for output quality |
| force_ocr | boolean | false | Force OCR processing even for searchable PDFs |
| disable_ocr | boolean | false | Disable OCR entirely; image files return empty content instead of errors (v4.7.0+) |
| output_format | string | "markdown" | Output format (markdown, html, text) |
| result_format | string | "text" | Result format for structured output |
| max_concurrent_extractions | integer | 4 | Maximum concurrent document extractions |
OCR Configuration
[ocr]
backend = "tesseract"
language = "eng"
| Option | Type | Default | Description |
|---|---|---|---|
| backend | string | "tesseract" | OCR backend (currently tesseract) |
| language | string | "eng" | ISO 639-3 language code (eng, deu, fra, etc.) |
Tesseract Configuration
[ocr.tesseract_config]
psm = 3
oem = 3
min_confidence = 0.0
output_format = "text"
enable_table_detection = false
table_min_confidence = 0.5
table_column_threshold = 50
table_row_threshold_ratio = 0.5
use_cache = true
| Option | Type | Default | Description |
|---|---|---|---|
| psm | integer | 3 | Page Segmentation Mode (0-13) |
| oem | integer | 3 | OCR Engine Mode (0-3) |
| min_confidence | float | 0.0 | Minimum OCR confidence threshold (0.0-1.0) |
| output_format | string | "text" | Output format from OCR |
| enable_table_detection | boolean | false | Enable table detection during OCR |
| table_min_confidence | float | 0.5 | Minimum confidence for table cells |
| table_column_threshold | integer | 50 | Pixel threshold for column detection |
| table_row_threshold_ratio | float | 0.5 | Row height ratio threshold |
| use_cache | boolean | true | Cache OCR results |
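The documented ranges for psm, oem, and min_confidence can be checked before a configuration is handed to Kreuzberg. The following is a minimal sketch that works on a plain dictionary such as a parsed [ocr.tesseract_config] section; `RANGES` and `check_tesseract_ranges` are hypothetical helpers, not part of the Kreuzberg API.

```python
from typing import Any, Dict, List

# Documented ranges from the table above: (minimum, maximum), both inclusive.
RANGES = {
    "psm": (0, 13),
    "oem": (0, 3),
    "min_confidence": (0.0, 1.0),
}


def check_tesseract_ranges(section: Dict[str, Any]) -> List[str]:
    """Return one message per out-of-range option; an empty list means none found."""
    problems = []
    for key, (low, high) in RANGES.items():
        if key not in section:
            continue  # Unset options keep their defaults and need no check.
        value = section[key]
        if not low <= value <= high:
            problems.append(f"{key}={value} is outside {low}-{high}")
    return problems
```

For example, `check_tesseract_ranges({"psm": 14})` reports the invalid page segmentation mode, while a section that only uses the defaults produces no messages.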
Tesseract Preprocessing
[ocr.tesseract_config.preprocessing]
target_dpi = 300
auto_rotate = true
deskew = true
denoise = true
contrast_enhance = true
binarization_method = "otsu"
invert_colors = false
| Option | Type | Default | Description |
|---|---|---|---|
| target_dpi | integer | 300 | Target DPI for preprocessing |
| auto_rotate | boolean | true | Automatically detect and correct page rotation |
| deskew | boolean | true | Correct skewed pages |
| denoise | boolean | true | Remove noise from images |
| contrast_enhance | boolean | true | Enhance image contrast |
| binarization_method | string | "otsu" | Method for image binarization |
| invert_colors | boolean | false | Invert image colors if needed |
PDF Options
[pdf_options]
extract_images = true
extract_metadata = true
[pdf_options.hierarchy]
enabled = true
k_clusters = 6
include_bbox = true
ocr_coverage_threshold = 0.5
| Option | Type | Default | Description |
|---|---|---|---|
| extract_images | boolean | true | Extract images from PDF documents |
| extract_metadata | boolean | true | Extract PDF metadata |
| hierarchy.enabled | boolean | true | Enable PDF hierarchy extraction (v4.0.0+) |
| hierarchy.k_clusters | integer | 6 | Number of clusters for hierarchy detection |
| hierarchy.include_bbox | boolean | true | Include bounding boxes in hierarchy |
| hierarchy.ocr_coverage_threshold | float | 0.5 | OCR coverage threshold for hierarchy (0.0-1.0) |
Image Processing
[images]
extract_images = true
target_dpi = 300
max_image_dimension = 4096
auto_adjust_dpi = true
min_dpi = 72
max_dpi = 600
| Option | Type | Default | Description |
|---|---|---|---|
| extract_images | boolean | true | Extract images from documents |
| target_dpi | integer | 300 | Target DPI for image processing |
| max_image_dimension | integer | 4096 | Maximum image dimension in pixels |
| auto_adjust_dpi | boolean | true | Automatically adjust DPI based on image size |
| min_dpi | integer | 72 | Minimum DPI threshold |
| max_dpi | integer | 600 | Maximum DPI threshold |
Chunking Configuration
[chunking]
max_chars = 1000
max_overlap = 200
[chunking.embedding]
batch_size = 32
normalize = true
show_download_progress = true
cache_dir = "~/.cache/kreuzberg/embeddings"
[chunking.embedding.model]
type = "preset"
name = "balanced"
| Option | Type | Default | Description |
|---|---|---|---|
| max_chars | integer | 1000 | Maximum characters per chunk |
| max_overlap | integer | 200 | Overlap between consecutive chunks |
| embedding.batch_size | integer | 32 | Batch size for embedding generation |
| embedding.normalize | boolean | true | Normalize embeddings to unit length |
| embedding.show_download_progress | boolean | true | Show progress when downloading models |
| embedding.cache_dir | string | "~/.cache/kreuzberg/embeddings" | Directory for caching embeddings |
| embedding.model.type | string | "preset" | Model type: preset, fastembed, or custom |
| embedding.model.name | string | "balanced" | Preset model name (balanced, fast, accurate, multilingual) |
| embedding.model.model | string | | FastEmbed model identifier |
| embedding.model.model_id | string | | Custom HuggingFace model ID |
| embedding.model.dimensions | integer | | Embedding dimensions |
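To build intuition for how max_chars and max_overlap interact, here is a naive fixed-window sketch. It is not Kreuzberg's chunker, which may split on different boundaries; `chunk_text` is a hypothetical function that simply assumes each chunk starts max_chars - max_overlap characters after the previous one.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> List[str]:
    """Split text into windows of at most max_chars that overlap by max_overlap."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be at least 0 and less than max_chars")
    step = max_chars - max_overlap  # Distance between consecutive chunk starts.
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # This window already reached the end of the text.
    return chunks
```

Under this model, a larger max_overlap repeats more context between neighbouring chunks at the cost of producing more chunks for the same document.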
Keywords Configuration
[keywords]
algorithm = "yake"
max_keywords = 10
min_score = 0.0
ngram_range = [1, 3]
language = "en"
| Option | Type | Default | Description |
|---|---|---|---|
| algorithm | string | "yake" | Keyword extraction algorithm (yake or rake) |
| max_keywords | integer | 10 | Maximum keywords to extract |
| min_score | float | 0.0 | Minimum relevance score for keywords |
| ngram_range | array | [1, 3] | N-gram size range [min, max] |
| language | string | "en" | Language code for keyword extraction |
Token Reduction
[token_reduction]
mode = "off"
preserve_important_words = true
| Option | Type | Default | Description |
|---|---|---|---|
| mode | string | "off" | Mode: off, aggressive, moderate, minimal |
| preserve_important_words | boolean | true | Preserve important words during reduction |
Language Detection
[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable automatic language detection |
| min_confidence | float | 0.8 | Minimum confidence threshold for detection |
| detect_multiple | boolean | false | Detect multiple languages in document |
Post-Processor
[postprocessor]
enabled = true
| Option | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable post-processing of extracted content |
FileExtractionConfig (Per-File Overrides)
Passed as an optional parameter to batch_extract_file / batch_extract_bytes (and their sync variants) to override settings for individual files in a batch. All fields are optional; None means the batch default is used. The separate _with_configs functions were removed in v4.5.0.
Overridable fields: enable_quality_processing, ocr, force_ocr, chunking, images, pdf_options, token_reduction, language_detection, pages, keywords, postprocessor, html_options, result_format, output_format, include_document_structure, layout.
Batch-level only (not overridable): max_concurrent_extractions, use_cache, acceleration, security_limits.
Merge semantics: For each file, FileExtractionConfig fields are overlaid on the batch ExtractionConfig. None falls through to batch default; Some(value) replaces the batch default for that file.
# FileExtractionConfig cannot be specified in config files —
# it is a programmatic API for per-file overrides at runtime.
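The merge semantics described above can be modelled with two small dataclasses. `BatchConfig`, `FileOverride`, and `overlay` are hypothetical stand-ins that carry only three of the overridable fields; they are not Kreuzberg's ExtractionConfig or FileExtractionConfig classes, and the sketch only illustrates the "None falls through" rule.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class BatchConfig:
    """Stand-in for the batch-level configuration (illustrative subset)."""
    force_ocr: bool = False
    output_format: str = "markdown"
    result_format: str = "text"


@dataclass(frozen=True)
class FileOverride:
    """Stand-in for a per-file override; None means 'use the batch default'."""
    force_ocr: Optional[bool] = None
    output_format: Optional[str] = None
    result_format: Optional[str] = None


def overlay(batch: BatchConfig, override: FileOverride) -> BatchConfig:
    """Return the effective config for one file: set fields replace batch values."""
    changes = {
        f.name: getattr(override, f.name)
        for f in fields(override)
        if getattr(override, f.name) is not None
    }
    return replace(batch, **changes)
```

With this model, `overlay(BatchConfig(), FileOverride(force_ocr=True))` forces OCR for that one file while the output and result formats stay at the batch values.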
Naming Conventions
Kreuzberg uses consistent naming conventions across different contexts:
| Context | Convention | Example |
|---|---|---|
| Python | snake_case | max_chars, pdf_options, use_cache |
| Node.js / TypeScript | camelCase | maxChars, pdfOptions, useCache |
| Rust | snake_case | max_chars, pdf_options, use_cache |
| TOML / YAML / JSON | snake_case | max_chars, pdf_options, use_cache |
| CLI flags | kebab-case | --max-chars, --pdf-options, --use-cache |
When switching between languages, apply the appropriate conversion:
- Python → Node.js: snake_case to camelCase
- CLI → Python: kebab-case to snake_case
- TOML → Python: No conversion needed (both use snake_case)
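The conversions listed above are mechanical, so they can be expressed as two short helpers. Both function names are hypothetical, and the sketch assumes plain word-by-word capitalization (so force_ocr becomes forceOcr); check the Node.js typings for fields containing acronyms.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case field name to camelCase (Python -> Node.js)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def kebab_flag_to_snake(flag: str) -> str:
    """Convert a CLI flag such as --max-chars to a snake_case field name."""
    return flag.lstrip("-").replace("-", "_")
```

For the examples in the table, `snake_to_camel("max_chars")` gives `maxChars` and `kebab_flag_to_snake("--use-cache")` gives `use_cache`.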
Environment Variables
The following environment variables can override configuration:
| Variable | Purpose | Example |
|---|---|---|
| KREUZBERG_HOST | Server bind address (serve command) | 127.0.0.1 |
| KREUZBERG_PORT | Server port (serve command) | 8080 |
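As a rough illustration of how these variables take effect, the sketch below lets KREUZBERG_HOST and KREUZBERG_PORT replace values that would otherwise come from configuration. `resolve_bind_address` is a hypothetical helper, and the fallback values are whatever the caller passes in; the reference does not state the server's built-in defaults.

```python
import os
from typing import Mapping, Tuple


def resolve_bind_address(
    host: str, port: int, env: Mapping[str, str] = os.environ
) -> Tuple[str, int]:
    """Return (host, port), letting the KREUZBERG_* variables override the inputs."""
    host = env.get("KREUZBERG_HOST", host)
    # Environment values are strings, so the port is converted explicitly.
    port = int(env.get("KREUZBERG_PORT", port))
    return host, port
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.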
Configuration Merging
Configuration sources are merged in priority order (highest to lowest):
- CLI flags (highest priority)
- Inline JSON configuration (programmatic)
- Configuration file (lowest priority)
Higher-priority sources override lower-priority ones. For example, the CLI flag --max-chars 2000 overrides max_chars = 1000 in the configuration file.
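The priority order listed above can be sketched as a layered dictionary merge in which the configuration file is applied first and CLI flags last. `deep_merge` and `resolve` are hypothetical helpers; in particular, merging nested sections key by key (rather than replacing a whole section) is an assumption, not documented Kreuzberg behavior.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides have a section: merge it key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(
    file_cfg: Dict[str, Any], inline_cfg: Dict[str, Any], cli_cfg: Dict[str, Any]
) -> Dict[str, Any]:
    """Apply the lowest-priority layer first so each later layer overrides it."""
    result: Dict[str, Any] = {}
    for layer in (file_cfg, inline_cfg, cli_cfg):
        result = deep_merge(result, layer)
    return result
```

With a file that sets max_chars to 1000 and a CLI layer that sets it to 2000, the resolved value is 2000, while untouched keys such as max_overlap keep the file's value.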
Example Configurations
Minimal Configuration
use_cache = true
enable_quality_processing = true
[ocr]
backend = "tesseract"
language = "eng"
High-Quality PDF Extraction
use_cache = true
enable_quality_processing = true
force_ocr = false
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
psm = 3
oem = 3
enable_table_detection = true
table_min_confidence = 0.7
[pdf_options]
extract_images = true
extract_metadata = true
[pdf_options.hierarchy]
enabled = true
k_clusters = 6
[images]
extract_images = true
target_dpi = 300
Semantic Search Configuration
[chunking]
max_chars = 800
max_overlap = 150
[chunking.embedding]
batch_size = 32
normalize = true
cache_dir = "~/.cache/kreuzberg/embeddings"
[chunking.embedding.model]
type = "preset"
name = "accurate"
[keywords]
algorithm = "yake"
max_keywords = 15
Field Name Reference
Critical field names to use in configuration files:
- max_chars (NOT max_characters)
- max_overlap (NOT overlap)
- table_min_confidence
- table_column_threshold
- table_row_threshold_ratio
- ocr_coverage_threshold
- k_clusters
- include_bbox
- enable_table_detection
- auto_rotate
- auto_adjust_dpi
- show_download_progress
- min_confidence
- detect_multiple
Always verify field names against the source configuration file when adding new options.
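A quick guard against the two misspellings called out above is to scan a parsed configuration for them. The sketch below works on a plain nested dictionary (for example, the result of parsing kreuzberg.toml with any TOML library); `KNOWN_MISTAKES` and `find_misnamed_fields` are hypothetical names and cover only the two incorrect spellings this reference mentions.

```python
from typing import Any, Dict, List

# Incorrect spellings named in this reference, mapped to the correct field.
KNOWN_MISTAKES = {"max_characters": "max_chars", "overlap": "max_overlap"}


def find_misnamed_fields(config: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return a hint for every known-bad key, using dotted paths for nesting."""
    hints = []
    for key, value in config.items():
        path = f"{prefix}{key}"
        if key in KNOWN_MISTAKES:
            hints.append(f"{path}: use {KNOWN_MISTAKES[key]}")
        if isinstance(value, dict):
            # Recurse into nested sections such as [chunking].
            hints.extend(find_misnamed_fields(value, prefix=f"{path}."))
    return hints
```

An empty result only means neither of those two misspellings is present; it does not validate the rest of the schema.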