hjess/fil

Fork 0

Files

Henrik Jess Nielsen b4c07d3693

Deploy fil (kreuzberg) / deploy (push) Successful in 49s

Details

Nomad changes

2026-06-01 23:40:55 +02:00

16 KiB

Raw Blame History

Configuration Reference

Kreuzberg uses a hierarchical configuration system supporting multiple formats and auto-discovery mechanisms. This reference covers all available configuration options, field names across programming languages, and loading strategies.

Supported Formats

Kreuzberg configurations can be defined in three formats:

TOML (recommended): kreuzberg.toml
YAML: kreuzberg.yaml
JSON: kreuzberg.json

All formats support the same schema and configuration options.

Auto-Discovery

When no configuration file is explicitly specified, Kreuzberg searches for configuration files in the following order:

Current working directory: kreuzberg.toml, kreuzberg.yaml, kreuzberg.json
Parent directories (recursively up the tree, same file name pattern)

The first matching configuration file is loaded.

Programmatic Loading

Python

from kreuzberg import ExtractionConfig

# Load from explicit path
config = ExtractionConfig.from_file("kreuzberg.toml")

# Auto-discover configuration
config = ExtractionConfig.discover()

Node.js / TypeScript

import { ExtractionConfig } from "@kreuzberg/node";

// Load from explicit path
const config = ExtractionConfig.fromFile("kreuzberg.toml");

// Auto-discover configuration
const config = ExtractionConfig.discover();

CLI

# Explicit configuration file
kreuzberg extract --config kreuzberg.toml document.pdf

# Auto-discovery (searches default locations)
kreuzberg extract document.pdf

Configuration Schema

The complete TOML schema with all available sections and options:

Top-Level Options

use_cache = true
enable_quality_processing = true
force_ocr = false
output_format = "markdown"
result_format = "text"
max_concurrent_extractions = 4

Option	Type	Default	Description
`use_cache`	boolean	`true`	Enable caching of extraction results
`enable_quality_processing`	boolean	`true`	Enable post-processing for output quality
`force_ocr`	boolean	`false`	Force OCR processing even for searchable PDFs
`disable_ocr`	boolean	`false`	Disable OCR entirely — image files return empty content instead of errors (v4.7.0+)
`output_format`	string	`"markdown"`	Output format (markdown, html, text)
`result_format`	string	`"text"`	Result format for structured output
`max_concurrent_extractions`	integer	`4`	Maximum concurrent document extractions

OCR Configuration

[ocr]
backend = "tesseract"
language = "eng"

Option	Type	Default	Description
`backend`	string	`"tesseract"`	OCR backend (currently tesseract)
`language`	string	`"eng"`	ISO 639-3 language code (eng, deu, fra, etc.)

Tesseract Configuration

[ocr.tesseract_config]
psm = 3
oem = 3
min_confidence = 0.0
output_format = "text"
enable_table_detection = false
table_min_confidence = 0.5
table_column_threshold = 50
table_row_threshold_ratio = 0.5
use_cache = true

Option	Type	Default	Description
`psm`	integer	`3`	Page Segmentation Mode (0-13)
`oem`	integer	`3`	OCR Engine Mode (0-3)
`min_confidence`	float	`0.0`	Minimum OCR confidence threshold (0.0-1.0)
`output_format`	string	`"text"`	Output format from OCR
`enable_table_detection`	boolean	`false`	Enable table detection during OCR
`table_min_confidence`	float	`0.5`	Minimum confidence for table cells
`table_column_threshold`	integer	`50`	Pixel threshold for column detection
`table_row_threshold_ratio`	float	`0.5`	Row height ratio threshold
`use_cache`	boolean	`true`	Cache OCR results

Tesseract Preprocessing

[ocr.tesseract_config.preprocessing]
target_dpi = 300
auto_rotate = true
deskew = true
denoise = true
contrast_enhance = true
binarization_method = "otsu"
invert_colors = false

Option	Type	Default	Description
`target_dpi`	integer	`300`	Target DPI for preprocessing
`auto_rotate`	boolean	`true`	Automatically detect and correct page rotation
`deskew`	boolean	`true`	Correct skewed pages
`denoise`	boolean	`true`	Remove noise from images
`contrast_enhance`	boolean	`true`	Enhance image contrast
`binarization_method`	string	`"otsu"`	Method for image binarization
`invert_colors`	boolean	`false`	Invert image colors if needed

PDF Options

[pdf_options]
extract_images = true
extract_metadata = true

[pdf_options.hierarchy]
enabled = true
k_clusters = 6
include_bbox = true
ocr_coverage_threshold = 0.5

Option	Type	Default	Description
`extract_images`	boolean	`true`	Extract images from PDF documents
`extract_metadata`	boolean	`true`	Extract PDF metadata
`hierarchy.enabled`	boolean	`true`	Enable PDF hierarchy extraction (v4.0.0+)
`hierarchy.k_clusters`	integer	`6`	Number of clusters for hierarchy detection
`hierarchy.include_bbox`	boolean	`true`	Include bounding boxes in hierarchy
`hierarchy.ocr_coverage_threshold`	float	`0.5`	OCR coverage threshold for hierarchy (0.0-1.0)

Image Processing

[images]
extract_images = true
target_dpi = 300
max_image_dimension = 4096
auto_adjust_dpi = true
min_dpi = 72
max_dpi = 600

Option	Type	Default	Description
`extract_images`	boolean	`true`	Extract images from documents
`target_dpi`	integer	`300`	Target DPI for image processing
`max_image_dimension`	integer	`4096`	Maximum image dimension in pixels
`auto_adjust_dpi`	boolean	`true`	Automatically adjust DPI based on image size
`min_dpi`	integer	`72`	Minimum DPI threshold
`max_dpi`	integer	`600`	Maximum DPI threshold

Chunking Configuration

[chunking]
max_chars = 1000
max_overlap = 200

[chunking.embedding]
batch_size = 32
normalize = true
show_download_progress = true
cache_dir = "~/.cache/kreuzberg/embeddings"

[chunking.embedding.model]
type = "preset"
name = "balanced"

Option	Type	Default	Description
`max_chars`	integer	`1000`	Maximum characters per chunk
`max_overlap`	integer	`200`	Overlap between consecutive chunks
`embedding.batch_size`	integer	`32`	Batch size for embedding generation
`embedding.normalize`	boolean	`true`	Normalize embeddings to unit length
`embedding.show_download_progress`	boolean	`true`	Show progress when downloading models
`embedding.cache_dir`	string	`"~/.cache/kreuzberg/embeddings"`	Directory for caching embeddings
`embedding.model.type`	string	`"preset"`	Model type: preset, fastembed, or custom
`embedding.model.name`	string	`"balanced"`	Preset model name (balanced, fast, accurate, multilingual)
`embedding.model.model`	string		FastEmbed model identifier
`embedding.model.model_id`	string		Custom HuggingFace model ID
`embedding.model.dimensions`	integer		Embedding dimensions

Keywords Configuration

[keywords]
algorithm = "yake"
max_keywords = 10
min_score = 0.0
ngram_range = [1, 3]
language = "en"

Option	Type	Default	Description
`algorithm`	string	`"yake"`	Keyword extraction algorithm (yake or rake)
`max_keywords`	integer	`10`	Maximum keywords to extract
`min_score`	float	`0.0`	Minimum relevance score for keywords
`ngram_range`	array	`[1, 3]`	N-gram size range [min, max]
`language`	string	`"en"`	Language code for keyword extraction

Token Reduction

[token_reduction]
mode = "off"
preserve_important_words = true

Option	Type	Default	Description
`mode`	string	`"off"`	Mode: off, aggressive, moderate, minimal
`preserve_important_words`	boolean	`true`	Preserve important words during reduction

Language Detection

[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false

Option	Type	Default	Description
`enabled`	boolean	`true`	Enable automatic language detection
`min_confidence`	float	`0.8`	Minimum confidence threshold for detection
`detect_multiple`	boolean	`false`	Detect multiple languages in document

Post-Processor

[postprocessor]
enabled = true

Option	Type	Default	Description
`enabled`	boolean	`true`	Enable post-processing of extracted content

FileExtractionConfig (Per-File Overrides)

Passed as an optional parameter to batch_extract_file / batch_extract_bytes (and their sync variants) to override settings per file in a batch. All fields optional — None = use batch default. The separate _with_configs functions were removed in v4.5.0.

Overridable fields: enable_quality_processing, ocr, force_ocr, chunking, images, pdf_options, token_reduction, language_detection, pages, keywords, postprocessor, html_options, result_format, output_format, include_document_structure, layout.

Batch-level only (not overridable): max_concurrent_extractions, use_cache, acceleration, security_limits.

Merge semantics: For each file, FileExtractionConfig fields are overlaid on the batch ExtractionConfig. None falls through to batch default; Some(value) replaces the batch default for that file.

# FileExtractionConfig cannot be specified in config files —
# it is a programmatic API for per-file overrides at runtime.

Naming Conventions

Kreuzberg uses consistent naming conventions across different contexts:

Context	Convention	Example
Python	snake_case	`max_chars`, `pdf_options`, `use_cache`
Node.js / TypeScript	camelCase	`maxChars`, `pdfOptions`, `useCache`
Rust	snake_case	`max_chars`, `pdf_options`, `use_cache`
TOML / YAML / JSON	snake_case	`max_chars`, `pdf_options`, `use_cache`
CLI flags	kebab-case	`--max-chars`, `--pdf-options`, `--use-cache`

When switching between languages, apply the appropriate conversion:

Python → Node.js: snake_case to camelCase
CLI → Python: kebab-case to snake_case
TOML → Python: No conversion needed (both use snake_case)

Environment Variables

The following environment variables can override configuration:

Variable	Purpose	Example
`KREUZBERG_HOST`	Server bind address (serve command)	`127.0.0.1`
`KREUZBERG_PORT`	Server port (serve command)	`8080`

Configuration Merging

Configuration sources are merged in priority order (highest to lowest):

CLI flags (highest priority)
Inline JSON configuration (programmatic)
Configuration file (lowest priority)

Later sources override earlier ones. For example, a CLI flag --max-chars 2000 overrides max_chars = 1000 in the configuration file.

Example Configurations

Minimal Configuration

use_cache = true
enable_quality_processing = true

[ocr]
backend = "tesseract"
language = "eng"

High-Quality PDF Extraction

use_cache = true
enable_quality_processing = true
force_ocr = false

[ocr]
backend = "tesseract"
language = "eng"

[ocr.tesseract_config]
psm = 3
oem = 3
enable_table_detection = true
table_min_confidence = 0.7

[pdf_options]
extract_images = true
extract_metadata = true

[pdf_options.hierarchy]
enabled = true
k_clusters = 6

[images]
extract_images = true
target_dpi = 300

Semantic Search Configuration

[chunking]
max_chars = 800
max_overlap = 150

[chunking.embedding]
batch_size = 32
normalize = true
cache_dir = "~/.cache/kreuzberg/embeddings"

[chunking.embedding.model]
type = "preset"
name = "accurate"

[keywords]
algorithm = "yake"
max_keywords = 15

Field Name Reference

Critical field names to use in configuration files:

max_chars (NOT max_characters)
max_overlap (NOT overlap)
table_min_confidence
table_column_threshold
table_row_threshold_ratio
ocr_coverage_threshold
k_clusters
include_bbox
enable_table_detection
auto_rotate
auto_adjust_dpi
show_download_progress
min_confidence
detect_multiple

Always verify field names against the source configuration file when adding new options.

16 KiB Raw Blame History