fil/skills/kreuzberg/references/configuration.md
Henrik Jess Nielsen b4c07d3693
Nomad changes
2026-06-01 23:40:55 +02:00

Configuration Reference

Kreuzberg uses a hierarchical configuration system supporting multiple formats and auto-discovery mechanisms. This reference covers all available configuration options, field names across programming languages, and loading strategies.

Supported Formats

Kreuzberg configurations can be defined in three formats:

  • TOML (recommended): kreuzberg.toml
  • YAML: kreuzberg.yaml
  • JSON: kreuzberg.json

All formats support the same schema and configuration options.

Auto-Discovery

When no configuration file is explicitly specified, Kreuzberg searches for configuration files in the following order:

  1. Current working directory: kreuzberg.toml, kreuzberg.yaml, kreuzberg.json
  2. Parent directories (recursively up the tree, same file name pattern)

The first matching configuration file is loaded.
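
The search order above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not Kreuzberg's actual implementation; the function name find_config is hypothetical.

```python
from pathlib import Path
from typing import Optional

# File names checked in each directory, in the order listed above.
CONFIG_NAMES = ("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json")


def find_config(start: Path) -> Optional[Path]:
    """Return the first config file found in `start` or any ancestor directory."""
    start = start.resolve()
    # Check the starting directory first, then walk up towards the root.
    for directory in (start, *start.parents):
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None  # nothing found anywhere up the tree


print(find_config(Path.cwd()))
```

A file in a nearer directory wins over one further up the tree, and within one directory the TOML file is checked before YAML and JSON.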

Programmatic Loading

Python

from kreuzberg import ExtractionConfig

# Load from explicit path
config = ExtractionConfig.from_file("kreuzberg.toml")

# Auto-discover configuration
config = ExtractionConfig.discover()

Node.js / TypeScript

import { ExtractionConfig } from "@kreuzberg/node";

// Load from explicit path
const config = ExtractionConfig.fromFile("kreuzberg.toml");

// Auto-discover configuration
// (separate name: `config` is already declared above in this scope)
const discoveredConfig = ExtractionConfig.discover();

CLI

# Explicit configuration file
kreuzberg extract --config kreuzberg.toml document.pdf

# Auto-discovery (searches default locations)
kreuzberg extract document.pdf

Configuration Schema

The complete TOML schema with all available sections and options:

Top-Level Options

use_cache = true
enable_quality_processing = true
force_ocr = false
output_format = "markdown"
result_format = "text"
max_concurrent_extractions = 4

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| use_cache | boolean | true | Enable caching of extraction results |
| enable_quality_processing | boolean | true | Enable post-processing for output quality |
| force_ocr | boolean | false | Force OCR processing even for searchable PDFs |
| disable_ocr | boolean | false | Disable OCR entirely; image files return empty content instead of errors (v4.7.0+) |
| output_format | string | "markdown" | Output format (markdown, html, text) |
| result_format | string | "text" | Result format for structured output |
| max_concurrent_extractions | integer | 4 | Maximum concurrent document extractions |

OCR Configuration

[ocr]
backend = "tesseract"
language = "eng"

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| backend | string | "tesseract" | OCR backend (currently tesseract) |
| language | string | "eng" | ISO 639-3 language code (eng, deu, fra, etc.) |

Tesseract Configuration

[ocr.tesseract_config]
psm = 3
oem = 3
min_confidence = 0.0
output_format = "text"
enable_table_detection = false
table_min_confidence = 0.5
table_column_threshold = 50
table_row_threshold_ratio = 0.5
use_cache = true

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| psm | integer | 3 | Page Segmentation Mode (0-13) |
| oem | integer | 3 | OCR Engine Mode (0-3) |
| min_confidence | float | 0.0 | Minimum OCR confidence threshold (0.0-1.0) |
| output_format | string | "text" | Output format from OCR |
| enable_table_detection | boolean | false | Enable table detection during OCR |
| table_min_confidence | float | 0.5 | Minimum confidence for table cells |
| table_column_threshold | integer | 50 | Pixel threshold for column detection |
| table_row_threshold_ratio | float | 0.5 | Row height ratio threshold |
| use_cache | boolean | true | Cache OCR results |

Tesseract Preprocessing

[ocr.tesseract_config.preprocessing]
target_dpi = 300
auto_rotate = true
deskew = true
denoise = true
contrast_enhance = true
binarization_method = "otsu"
invert_colors = false

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| target_dpi | integer | 300 | Target DPI for preprocessing |
| auto_rotate | boolean | true | Automatically detect and correct page rotation |
| deskew | boolean | true | Correct skewed pages |
| denoise | boolean | true | Remove noise from images |
| contrast_enhance | boolean | true | Enhance image contrast |
| binarization_method | string | "otsu" | Method for image binarization |
| invert_colors | boolean | false | Invert image colors if needed |

PDF Options

[pdf_options]
extract_images = true
extract_metadata = true

[pdf_options.hierarchy]
enabled = true
k_clusters = 6
include_bbox = true
ocr_coverage_threshold = 0.5

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| extract_images | boolean | true | Extract images from PDF documents |
| extract_metadata | boolean | true | Extract PDF metadata |
| hierarchy.enabled | boolean | true | Enable PDF hierarchy extraction (v4.0.0+) |
| hierarchy.k_clusters | integer | 6 | Number of clusters for hierarchy detection |
| hierarchy.include_bbox | boolean | true | Include bounding boxes in hierarchy |
| hierarchy.ocr_coverage_threshold | float | 0.5 | OCR coverage threshold for hierarchy (0.0-1.0) |

Image Processing

[images]
extract_images = true
target_dpi = 300
max_image_dimension = 4096
auto_adjust_dpi = true
min_dpi = 72
max_dpi = 600

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| extract_images | boolean | true | Extract images from documents |
| target_dpi | integer | 300 | Target DPI for image processing |
| max_image_dimension | integer | 4096 | Maximum image dimension in pixels |
| auto_adjust_dpi | boolean | true | Automatically adjust DPI based on image size |
| min_dpi | integer | 72 | Minimum DPI threshold |
| max_dpi | integer | 600 | Maximum DPI threshold |

Chunking Configuration

[chunking]
max_chars = 1000
max_overlap = 200

[chunking.embedding]
batch_size = 32
normalize = true
show_download_progress = true
cache_dir = "~/.cache/kreuzberg/embeddings"

[chunking.embedding.model]
type = "preset"
name = "balanced"

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| max_chars | integer | 1000 | Maximum characters per chunk |
| max_overlap | integer | 200 | Overlap between consecutive chunks |
| embedding.batch_size | integer | 32 | Batch size for embedding generation |
| embedding.normalize | boolean | true | Normalize embeddings to unit length |
| embedding.show_download_progress | boolean | true | Show progress when downloading models |
| embedding.cache_dir | string | "~/.cache/kreuzberg/embeddings" | Directory for caching embeddings |
| embedding.model.type | string | "preset" | Model type: preset, fastembed, or custom |
| embedding.model.name | string | "balanced" | Preset model name (balanced, fast, accurate, multilingual) |
| embedding.model.model | string | | FastEmbed model identifier |
| embedding.model.model_id | string | | Custom HuggingFace model ID |
| embedding.model.dimensions | integer | | Embedding dimensions |
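
To make the interaction between max_chars and max_overlap concrete, here is a minimal fixed-window sketch. It is an assumption-laden illustration only: Kreuzberg's real chunker may split on sentence or structural boundaries, and chunk_text is a hypothetical helper, not a library function.

```python
def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars characters, where each
    window repeats the last max_overlap characters of the previous one."""
    if max_overlap >= max_chars:
        raise ValueError("max_overlap must be smaller than max_chars")
    step = max_chars - max_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("x" * 2500)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

With the defaults, each new chunk starts 800 characters after the previous one, so a larger max_overlap means more chunks for the same document.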

Keywords Configuration

[keywords]
algorithm = "yake"
max_keywords = 10
min_score = 0.0
ngram_range = [1, 3]
language = "en"

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| algorithm | string | "yake" | Keyword extraction algorithm (yake or rake) |
| max_keywords | integer | 10 | Maximum keywords to extract |
| min_score | float | 0.0 | Minimum relevance score for keywords |
| ngram_range | array | [1, 3] | N-gram size range [min, max] |
| language | string | "en" | Language code for keyword extraction |
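
As a rough illustration of what ngram_range = [1, 3] means, the sketch below enumerates candidate phrases of one to three words. It shows only candidate generation under the assumption that both bounds are inclusive; the YAKE and RAKE scoring steps are not reproduced, and ngrams is a hypothetical helper.

```python
def ngrams(tokens: list[str], ngram_range: tuple[int, int] = (1, 3)) -> list[str]:
    """Return every contiguous phrase whose word count lies in ngram_range."""
    shortest, longest = ngram_range
    phrases = []
    for n in range(shortest, longest + 1):
        for i in range(len(tokens) - n + 1):
            phrases.append(" ".join(tokens[i:i + n]))
    return phrases


print(ngrams(["optical", "character", "recognition"]))
# ['optical', 'character', 'recognition', 'optical character',
#  'character recognition', 'optical character recognition']
```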

Token Reduction

[token_reduction]
mode = "off"
preserve_important_words = true

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| mode | string | "off" | Mode: off, aggressive, moderate, minimal |
| preserve_important_words | boolean | true | Preserve important words during reduction |

Language Detection

[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable automatic language detection |
| min_confidence | float | 0.8 | Minimum confidence threshold for detection |
| detect_multiple | boolean | false | Detect multiple languages in document |

Post-Processor

[postprocessor]
enabled = true

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | true | Enable post-processing of extracted content |

FileExtractionConfig (Per-File Overrides)

FileExtractionConfig is passed as an optional parameter to batch_extract_file / batch_extract_bytes (and their sync variants) to override settings per file in a batch. All fields are optional; None means the batch default is used. The separate _with_configs functions were removed in v4.5.0.

Overridable fields: enable_quality_processing, ocr, force_ocr, chunking, images, pdf_options, token_reduction, language_detection, pages, keywords, postprocessor, html_options, result_format, output_format, include_document_structure, layout.

Batch-level only (not overridable): max_concurrent_extractions, use_cache, acceleration, security_limits.

Merge semantics: For each file, FileExtractionConfig fields are overlaid on the batch ExtractionConfig. None falls through to batch default; Some(value) replaces the batch default for that file.

# FileExtractionConfig cannot be specified in config files —
# it is a programmatic API for per-file overrides at runtime.
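
The overlay rule can be sketched with two small dataclasses. BatchConfig and FileOverrides are hypothetical stand-ins with only a few fields, used here to show the None fall-through behaviour; they are not the real ExtractionConfig and FileExtractionConfig classes.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class BatchConfig:  # hypothetical stand-in for ExtractionConfig
    force_ocr: bool = False
    output_format: str = "markdown"
    max_concurrent_extractions: int = 4  # batch-level only, not overridable


@dataclass(frozen=True)
class FileOverrides:  # hypothetical stand-in for FileExtractionConfig
    force_ocr: Optional[bool] = None
    output_format: Optional[str] = None


def effective_config(batch: BatchConfig, overrides: FileOverrides) -> BatchConfig:
    """Overlay every non-None override field on the batch configuration."""
    changes = {
        f.name: getattr(overrides, f.name)
        for f in fields(overrides)
        if getattr(overrides, f.name) is not None
    }
    return replace(batch, **changes)


batch = BatchConfig()
scanned = effective_config(batch, FileOverrides(force_ocr=True))
print(scanned.force_ocr, scanned.output_format)  # True markdown
```

Leaving batch-level fields out of the override type, as in this sketch, is one way to make it impossible to override them per file.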

Naming Conventions

Kreuzberg uses consistent naming conventions across different contexts:

| Context | Convention | Example |
| --- | --- | --- |
| Python | snake_case | max_chars, pdf_options, use_cache |
| Node.js / TypeScript | camelCase | maxChars, pdfOptions, useCache |
| Rust | snake_case | max_chars, pdf_options, use_cache |
| TOML / YAML / JSON | snake_case | max_chars, pdf_options, use_cache |
| CLI flags | kebab-case | --max-chars, --pdf-options, --use-cache |

When switching between languages, apply the appropriate conversion:

  • Python → Node.js: snake_case to camelCase
  • CLI → Python: kebab-case to snake_case
  • TOML → Python: No conversion needed (both use snake_case)
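
The conversions listed above are mechanical and can be sketched as small string helpers. The function names are hypothetical, and whether every field is exposed under exactly the converted name in each binding is an assumption to verify per field.

```python
def snake_to_camel(name: str) -> str:
    """max_chars -> maxChars (Python/TOML name to Node.js name)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def kebab_flag_to_snake(flag: str) -> str:
    """--max-chars -> max_chars (CLI flag to Python/TOML name)."""
    return flag.lstrip("-").replace("-", "_")


def snake_to_kebab_flag(name: str) -> str:
    """max_chars -> --max-chars (Python/TOML name to CLI flag)."""
    return "--" + name.replace("_", "-")


print(snake_to_camel("pdf_options"))       # pdfOptions
print(kebab_flag_to_snake("--use-cache"))  # use_cache
print(snake_to_kebab_flag("max_chars"))    # --max-chars
```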

Environment Variables

The following environment variables can override configuration:

| Variable | Purpose | Example |
| --- | --- | --- |
| KREUZBERG_HOST | Server bind address (serve command) | 127.0.0.1 |
| KREUZBERG_PORT | Server port (serve command) | 8080 |
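
A wrapper or health-check script that needs the same address as the server could read these variables as follows. The fallback values are assumptions taken from the Example column, not documented defaults, and server_address is a hypothetical helper.

```python
import os
from typing import Mapping


def server_address(env: Mapping[str, str] = os.environ) -> tuple[str, int]:
    """Read the bind address from the environment, with assumed fallbacks."""
    host = env.get("KREUZBERG_HOST", "127.0.0.1")  # fallback is an assumption
    port = int(env.get("KREUZBERG_PORT", "8080"))  # fallback is an assumption
    return host, port


print(server_address({"KREUZBERG_PORT": "9000"}))  # ('127.0.0.1', 9000)
```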

Configuration Merging

Configuration sources are merged in priority order (highest to lowest):

  1. CLI flags (highest priority)
  2. Inline JSON configuration (programmatic)
  3. Configuration file (lowest priority)

Higher-priority sources override lower-priority ones. For example, a CLI flag --max-chars 2000 overrides max_chars = 1000 in the configuration file.
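
The precedence order can be sketched as a layered dictionary merge, applied from lowest to highest priority so that each later layer wins. Whether Kreuzberg merges nested sections key by key (as here) or replaces whole sections is an assumption of this sketch, and deep_merge is a hypothetical helper.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


file_config = {"chunking": {"max_chars": 1000, "max_overlap": 200}}  # lowest priority
inline_config = {"chunking": {"max_overlap": 150}}                   # programmatic
cli_flags = {"chunking": {"max_chars": 2000}}                        # highest priority

effective = deep_merge(deep_merge(file_config, inline_config), cli_flags)
print(effective)  # {'chunking': {'max_chars': 2000, 'max_overlap': 150}}
```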

Example Configurations

Minimal Configuration

use_cache = true
enable_quality_processing = true

[ocr]
backend = "tesseract"
language = "eng"

High-Quality PDF Extraction

use_cache = true
enable_quality_processing = true
force_ocr = false

[ocr]
backend = "tesseract"
language = "eng"

[ocr.tesseract_config]
psm = 3
oem = 3
enable_table_detection = true
table_min_confidence = 0.7

[pdf_options]
extract_images = true
extract_metadata = true

[pdf_options.hierarchy]
enabled = true
k_clusters = 6

[images]
extract_images = true
target_dpi = 300

Semantic Search Configuration

[chunking]
max_chars = 800
max_overlap = 150

[chunking.embedding]
batch_size = 32
normalize = true
cache_dir = "~/.cache/kreuzberg/embeddings"

[chunking.embedding.model]
type = "preset"
name = "accurate"

[keywords]
algorithm = "yake"
max_keywords = 15

Field Name Reference

Critical field names to use in configuration files:

  • max_chars (NOT max_characters)
  • max_overlap (NOT overlap)
  • table_min_confidence
  • table_column_threshold
  • table_row_threshold_ratio
  • ocr_coverage_threshold
  • k_clusters
  • include_bbox
  • enable_table_detection
  • auto_rotate
  • auto_adjust_dpi
  • show_download_progress
  • min_confidence
  • detect_multiple

Always verify field names against the source configuration file when adding new options.