Files

418 lines
16 KiB
Markdown
Raw Permalink Normal View History

2026-06-01 23:40:55 +02:00
# Configuration Reference
Kreuzberg uses a hierarchical configuration system supporting multiple formats and auto-discovery mechanisms. This reference covers all available configuration options, field names across programming languages, and loading strategies.
## Supported Formats
Kreuzberg configurations can be defined in three formats:
- **TOML** (recommended): `kreuzberg.toml`
- **YAML**: `kreuzberg.yaml`
- **JSON**: `kreuzberg.json`
All formats support the same schema and configuration options.
## Auto-Discovery
When no configuration file is explicitly specified, Kreuzberg searches for configuration files in the following order:
1. Current working directory: `kreuzberg.toml`, `kreuzberg.yaml`, `kreuzberg.json`
2. Parent directories (recursively up the tree, same file name pattern)
The first matching configuration file is loaded.
## Programmatic Loading
### Python
```python
from kreuzberg import ExtractionConfig
# Load from explicit path
config = ExtractionConfig.from_file("kreuzberg.toml")
# Auto-discover configuration
config = ExtractionConfig.discover()
```
### Node.js / TypeScript
```typescript
import { ExtractionConfig } from "@kreuzberg/node";
// Load from explicit path
const config = ExtractionConfig.fromFile("kreuzberg.toml");
// Auto-discover configuration
const config = ExtractionConfig.discover();
```
### CLI
```bash
# Explicit configuration file
kreuzberg extract --config kreuzberg.toml document.pdf
# Auto-discovery (searches default locations)
kreuzberg extract document.pdf
```
## Configuration Schema
The complete TOML schema with all available sections and options:
### Top-Level Options
```toml
use_cache = true
enable_quality_processing = true
force_ocr = false
output_format = "markdown"
result_format = "text"
max_concurrent_extractions = 4
```
| Option | Type | Default | Description |
| ---------------------------- | ------- | ------------ | ----------------------------------------------------------------------------------- |
| `use_cache` | boolean | `true` | Enable caching of extraction results |
| `enable_quality_processing` | boolean | `true` | Enable post-processing for output quality |
| `force_ocr` | boolean | `false` | Force OCR processing even for searchable PDFs |
| `disable_ocr` | boolean | `false` | Disable OCR entirely — image files return empty content instead of errors (v4.7.0+) |
| `output_format` | string | `"markdown"` | Output format (markdown, html, text) |
| `result_format` | string | `"text"` | Result format for structured output |
| `max_concurrent_extractions` | integer | `4` | Maximum concurrent document extractions |
### OCR Configuration
```toml
[ocr]
backend = "tesseract"
language = "eng"
```
| Option | Type | Default | Description |
| ---------- | ------ | ------------- | --------------------------------------------- |
| `backend` | string | `"tesseract"` | OCR backend (currently tesseract) |
| `language` | string | `"eng"` | ISO 639-3 language code (eng, deu, fra, etc.) |
#### Tesseract Configuration
```toml
[ocr.tesseract_config]
psm = 3
oem = 3
min_confidence = 0.0
output_format = "text"
enable_table_detection = false
table_min_confidence = 0.5
table_column_threshold = 50
table_row_threshold_ratio = 0.5
use_cache = true
```
| Option | Type | Default | Description |
| --------------------------- | ------- | -------- | ------------------------------------------ |
| `psm` | integer | `3` | Page Segmentation Mode (0-13) |
| `oem` | integer | `3` | OCR Engine Mode (0-3) |
| `min_confidence` | float | `0.0` | Minimum OCR confidence threshold (0.0-1.0) |
| `output_format` | string | `"text"` | Output format from OCR |
| `enable_table_detection` | boolean | `false` | Enable table detection during OCR |
| `table_min_confidence` | float | `0.5` | Minimum confidence for table cells |
| `table_column_threshold` | integer | `50` | Pixel threshold for column detection |
| `table_row_threshold_ratio` | float | `0.5` | Row height ratio threshold |
| `use_cache` | boolean | `true` | Cache OCR results |
#### Tesseract Preprocessing
```toml
[ocr.tesseract_config.preprocessing]
target_dpi = 300
auto_rotate = true
deskew = true
denoise = true
contrast_enhance = true
binarization_method = "otsu"
invert_colors = false
```
| Option | Type | Default | Description |
| --------------------- | ------- | -------- | ---------------------------------------------- |
| `target_dpi` | integer | `300` | Target DPI for preprocessing |
| `auto_rotate` | boolean | `true` | Automatically detect and correct page rotation |
| `deskew` | boolean | `true` | Correct skewed pages |
| `denoise` | boolean | `true` | Remove noise from images |
| `contrast_enhance` | boolean | `true` | Enhance image contrast |
| `binarization_method` | string | `"otsu"` | Method for image binarization |
| `invert_colors` | boolean | `false` | Invert image colors if needed |
### PDF Options
```toml
[pdf_options]
extract_images = true
extract_metadata = true
[pdf_options.hierarchy]
enabled = true
k_clusters = 6
include_bbox = true
ocr_coverage_threshold = 0.5
```
| Option | Type | Default | Description |
| ---------------------------------- | ------- | ------- | ---------------------------------------------- |
| `extract_images` | boolean | `true` | Extract images from PDF documents |
| `extract_metadata` | boolean | `true` | Extract PDF metadata |
| `hierarchy.enabled` | boolean | `true` | Enable PDF hierarchy extraction (v4.0.0+) |
| `hierarchy.k_clusters` | integer | `6` | Number of clusters for hierarchy detection |
| `hierarchy.include_bbox` | boolean | `true` | Include bounding boxes in hierarchy |
| `hierarchy.ocr_coverage_threshold` | float | `0.5` | OCR coverage threshold for hierarchy (0.0-1.0) |
### Image Processing
```toml
[images]
extract_images = true
target_dpi = 300
max_image_dimension = 4096
auto_adjust_dpi = true
min_dpi = 72
max_dpi = 600
```
| Option | Type | Default | Description |
| --------------------- | ------- | ------- | -------------------------------------------- |
| `extract_images` | boolean | `true` | Extract images from documents |
| `target_dpi` | integer | `300` | Target DPI for image processing |
| `max_image_dimension` | integer | `4096` | Maximum image dimension in pixels |
| `auto_adjust_dpi` | boolean | `true` | Automatically adjust DPI based on image size |
| `min_dpi` | integer | `72` | Minimum DPI threshold |
| `max_dpi` | integer | `600` | Maximum DPI threshold |
### Chunking Configuration
```toml
[chunking]
max_chars = 1000
max_overlap = 200
[chunking.embedding]
batch_size = 32
normalize = true
show_download_progress = true
cache_dir = "~/.cache/kreuzberg/embeddings"
[chunking.embedding.model]
type = "preset"
name = "balanced"
```
| Option | Type | Default | Description |
| ---------------------------------- | ------- | --------------------------------- | ---------------------------------------------------------- |
| `max_chars` | integer | `1000` | Maximum characters per chunk |
| `max_overlap` | integer | `200` | Overlap between consecutive chunks |
| `embedding.batch_size` | integer | `32` | Batch size for embedding generation |
| `embedding.normalize` | boolean | `true` | Normalize embeddings to unit length |
| `embedding.show_download_progress` | boolean | `true` | Show progress when downloading models |
| `embedding.cache_dir` | string | `"~/.cache/kreuzberg/embeddings"` | Directory for caching embeddings |
| `embedding.model.type` | string | `"preset"` | Model type: preset, fastembed, or custom |
| `embedding.model.name` | string | `"balanced"` | Preset model name (balanced, fast, accurate, multilingual) |
| `embedding.model.model` | string | | FastEmbed model identifier |
| `embedding.model.model_id` | string | | Custom HuggingFace model ID |
| `embedding.model.dimensions` | integer | | Embedding dimensions |
### Keywords Configuration
```toml
[keywords]
algorithm = "yake"
max_keywords = 10
min_score = 0.0
ngram_range = [1, 3]
language = "en"
```
| Option | Type | Default | Description |
| -------------- | ------- | -------- | ------------------------------------------- |
| `algorithm` | string | `"yake"` | Keyword extraction algorithm (yake or rake) |
| `max_keywords` | integer | `10` | Maximum keywords to extract |
| `min_score` | float | `0.0` | Minimum relevance score for keywords |
| `ngram_range` | array | `[1, 3]` | N-gram size range [min, max] |
| `language` | string | `"en"` | Language code for keyword extraction |
### Token Reduction
```toml
[token_reduction]
mode = "off"
preserve_important_words = true
```
| Option | Type | Default | Description |
| -------------------------- | ------- | ------- | ----------------------------------------- |
| `mode` | string | `"off"` | Mode: off, aggressive, moderate, minimal |
| `preserve_important_words` | boolean | `true` | Preserve important words during reduction |
### Language Detection
```toml
[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false
```
| Option | Type | Default | Description |
| ----------------- | ------- | ------- | ------------------------------------------ |
| `enabled` | boolean | `true` | Enable automatic language detection |
| `min_confidence` | float | `0.8` | Minimum confidence threshold for detection |
| `detect_multiple` | boolean | `false` | Detect multiple languages in document |
### Post-Processor
```toml
[postprocessor]
enabled = true
```
| Option | Type | Default | Description |
| --------- | ------- | ------- | ------------------------------------------- |
| `enabled` | boolean | `true` | Enable post-processing of extracted content |
## FileExtractionConfig (Per-File Overrides)
Passed as an optional parameter to `batch_extract_file` / `batch_extract_bytes` (and their sync variants) to override settings per file in a batch. All fields optional — `None` = use batch default. The separate `_with_configs` functions were removed in v4.5.0.
**Overridable fields:** `enable_quality_processing`, `ocr`, `force_ocr`, `chunking`, `images`, `pdf_options`, `token_reduction`, `language_detection`, `pages`, `keywords`, `postprocessor`, `html_options`, `result_format`, `output_format`, `include_document_structure`, `layout`.
**Batch-level only (not overridable):** `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`.
**Merge semantics:** For each file, `FileExtractionConfig` fields are overlaid on the batch `ExtractionConfig`. `None` falls through to batch default; `Some(value)` replaces the batch default for that file.
```toml
# FileExtractionConfig cannot be specified in config files —
# it is a programmatic API for per-file overrides at runtime.
```
## Naming Conventions
Kreuzberg uses consistent naming conventions across different contexts:
| Context | Convention | Example |
| -------------------- | ---------- | --------------------------------------------- |
| Python | snake_case | `max_chars`, `pdf_options`, `use_cache` |
| Node.js / TypeScript | camelCase | `maxChars`, `pdfOptions`, `useCache` |
| Rust | snake_case | `max_chars`, `pdf_options`, `use_cache` |
| TOML / YAML / JSON | snake_case | `max_chars`, `pdf_options`, `use_cache` |
| CLI flags | kebab-case | `--max-chars`, `--pdf-options`, `--use-cache` |
When switching between languages, apply the appropriate conversion:
- Python → Node.js: `snake_case` to `camelCase`
- CLI → Python: `kebab-case` to `snake_case`
- TOML → Python: No conversion needed (both use `snake_case`)
## Environment Variables
The following environment variables can override configuration:
| Variable | Purpose | Example |
| ---------------- | ----------------------------------- | ----------- |
| `KREUZBERG_HOST` | Server bind address (serve command) | `127.0.0.1` |
| `KREUZBERG_PORT` | Server port (serve command) | `8080` |
## Configuration Merging
Configuration sources are merged in priority order (highest to lowest):
1. **CLI flags** (highest priority)
2. **Inline JSON configuration** (programmatic)
3. **Configuration file** (lowest priority)
Later sources override earlier ones. For example, a CLI flag `--max-chars 2000` overrides `max_chars = 1000` in the configuration file.
## Example Configurations
### Minimal Configuration
```toml
use_cache = true
enable_quality_processing = true
[ocr]
backend = "tesseract"
language = "eng"
```
### High-Quality PDF Extraction
```toml
use_cache = true
enable_quality_processing = true
force_ocr = false
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
psm = 3
oem = 3
enable_table_detection = true
table_min_confidence = 0.7
[pdf_options]
extract_images = true
extract_metadata = true
[pdf_options.hierarchy]
enabled = true
k_clusters = 6
[images]
extract_images = true
target_dpi = 300
```
### Semantic Search Configuration
```toml
[chunking]
max_chars = 800
max_overlap = 150
[chunking.embedding]
batch_size = 32
normalize = true
cache_dir = "~/.cache/kreuzberg/embeddings"
[chunking.embedding.model]
type = "preset"
name = "accurate"
[keywords]
algorithm = "yake"
max_keywords = 15
```
## Field Name Reference
Critical field names to use in configuration files:
- `max_chars` (NOT `max_characters`)
- `max_overlap` (NOT `overlap`)
- `table_min_confidence`
- `table_column_threshold`
- `table_row_threshold_ratio`
- `ocr_coverage_threshold`
- `k_clusters`
- `include_bbox`
- `enable_table_detection`
- `auto_rotate`
- `auto_adjust_dpi`
- `show_download_progress`
- `min_confidence`
- `detect_multiple`
Always verify field names against the source configuration file when adding new options.