4.8 KiB
Configuration Guide v4.0.0
All extraction behavior is controlled through ExtractionConfig. Pass it directly in code or load it from a TOML/YAML/JSON file. Every field is optional. For per-field documentation, see the Configuration Reference.
Quick Start
=== "Python"
--8<-- "snippets/python/config/config_basic.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/config_basic.md"
=== "Rust"
--8<-- "snippets/rust/config/config_basic.md"
=== "Go"
--8<-- "snippets/go/config/config_basic.md"
=== "Java"
--8<-- "snippets/java/config/config_basic.md"
=== "C#"
--8<-- "snippets/csharp/config_basic.md"
=== "Ruby"
--8<-- "snippets/ruby/config/config_basic.md"
=== "R"
--8<-- "snippets/r/config/config_basic.md"
Configuration Files
Three formats are supported. TOML is recommended.
=== "TOML (Recommended)"
```toml title="kreuzberg.toml"
use_cache = true
enable_quality_processing = true
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
psm = 3
```
=== "YAML"
```yaml title="kreuzberg.yaml"
use_cache: true
enable_quality_processing: true
ocr:
backend: tesseract
language: eng
tesseract_config:
psm: 3
```
=== "JSON"
```json title="kreuzberg.json"
{
"use_cache": true,
"enable_quality_processing": true,
"ocr": {
"backend": "tesseract",
"language": "eng",
"tesseract_config": {
"psm": 3
}
}
}
```
Automatic Discovery
When no --config path is supplied, Kreuzberg walks up from the current working directory looking for kreuzberg.toml and uses the first match. YAML and JSON files are supported only when passed explicitly via --config. If nothing is found, defaults are used.
=== "Python"
--8<-- "snippets/python/config/config_discover.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/config_discover.md"
=== "Rust"
--8<-- "snippets/rust/config/config_discover.md"
=== "Go"
--8<-- "snippets/go/config/config_discover.md"
=== "Java"
--8<-- "snippets/java/config/config_discover.md"
=== "C#"
--8<-- "snippets/csharp/config_discover.md"
=== "Ruby"
--8<-- "snippets/ruby/config/config_discover.md"
=== "R"
--8<-- "snippets/r/config/config_discover.md"
=== "Wasm"
--8<-- "snippets/wasm/config/config_discover.md"
Common Use Cases
Setting Up OCR
=== "Python"
--8<-- "snippets/python/config/config_ocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/config_ocr.md"
=== "Rust"
--8<-- "snippets/rust/ocr/config_ocr.md"
=== "Go"
--8<-- "snippets/go/config/config_ocr.md"
=== "Java"
--8<-- "snippets/java/config/config_ocr.md"
=== "C#"
--8<-- "snippets/csharp/config_ocr.md"
=== "Ruby"
--8<-- "snippets/ruby/config/config_ocr.md"
=== "R"
--8<-- "snippets/r/config/config_ocr.md"
For backend selection and language packs, see OCR Guide. For fine-grained Tesseract tuning, see TesseractConfig Reference.
Chunking for RAG
=== "Python"
--8<-- "snippets/python/utils/chunking.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/chunking.md"
=== "Rust"
--8<-- "snippets/rust/advanced/chunking.md"
=== "Go"
--8<-- "snippets/go/utils/chunking.md"
=== "Java"
--8<-- "snippets/java/utils/chunking.md"
=== "C#"
--8<-- "snippets/csharp/advanced/embedding_with_chunking.md"
=== "Ruby"
--8<-- "snippets/ruby/utils/chunking.md"
=== "R"
--8<-- "snippets/r/utils/chunking.md"
All Configuration Categories
- ExtractionConfig — top-level options
- OcrConfig — OCR backend, language, acceleration
- TesseractConfig — Tesseract PSM, confidence, table detection
- ChunkingConfig — chunk size, overlap
- TokenReductionConfig — LLM prompt token reduction
- ContentFilterConfig — header/footer/watermark filtering
- PageConfig — page tracking and markers
- AccelerationConfig — ONNX Runtime execution provider
Next Steps
- Extraction Basics — core extraction API and supported formats
- OCR Guide — backend installation and language setup
- Advanced Features — embeddings, language detection, page tracking
- Plugins Guide — custom post-processors and validators