Nomad changes
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s

This commit is contained in:
Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

View File

@@ -0,0 +1,214 @@
# Configuration Guide <span class="version-badge">v4.0.0</span>
All extraction behavior is controlled through `ExtractionConfig`. Pass it directly in code or load it from a TOML/YAML/JSON file. Every field is optional. For per-field documentation, see the [Configuration Reference](../reference/configuration.md).
## Quick Start
=== "Python"
--8<-- "snippets/python/config/config_basic.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/config_basic.md"
=== "Rust"
--8<-- "snippets/rust/config/config_basic.md"
=== "Go"
--8<-- "snippets/go/config/config_basic.md"
=== "Java"
--8<-- "snippets/java/config/config_basic.md"
=== "C#"
--8<-- "snippets/csharp/config_basic.md"
=== "Ruby"
--8<-- "snippets/ruby/config/config_basic.md"
=== "R"
--8<-- "snippets/r/config/config_basic.md"
## Configuration Files
Three formats are supported. TOML is recommended.
=== "TOML (Recommended)"
```toml title="kreuzberg.toml"
use_cache = true
enable_quality_processing = true
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
psm = 3
```
=== "YAML"
```yaml title="kreuzberg.yaml"
use_cache: true
enable_quality_processing: true
ocr:
backend: tesseract
language: eng
tesseract_config:
psm: 3
```
=== "JSON"
```json title="kreuzberg.json"
{
"use_cache": true,
"enable_quality_processing": true,
"ocr": {
"backend": "tesseract",
"language": "eng",
"tesseract_config": {
"psm": 3
}
}
}
```
### Automatic Discovery
When no `--config` path is supplied, Kreuzberg walks up from the current working directory looking for `kreuzberg.toml` and uses the first match. YAML and JSON files are supported only when passed explicitly via `--config`. If nothing is found, defaults are used.
=== "Python"
--8<-- "snippets/python/config/config_discover.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/config_discover.md"
=== "Rust"
--8<-- "snippets/rust/config/config_discover.md"
=== "Go"
--8<-- "snippets/go/config/config_discover.md"
=== "Java"
--8<-- "snippets/java/config/config_discover.md"
=== "C#"
--8<-- "snippets/csharp/config_discover.md"
=== "Ruby"
--8<-- "snippets/ruby/config/config_discover.md"
=== "R"
--8<-- "snippets/r/config/config_discover.md"
=== "Wasm"
--8<-- "snippets/wasm/config/config_discover.md"
## Common Use Cases
### Setting Up OCR
=== "Python"
--8<-- "snippets/python/config/config_ocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/config_ocr.md"
=== "Rust"
--8<-- "snippets/rust/ocr/config_ocr.md"
=== "Go"
--8<-- "snippets/go/config/config_ocr.md"
=== "Java"
--8<-- "snippets/java/config/config_ocr.md"
=== "C#"
--8<-- "snippets/csharp/config_ocr.md"
=== "Ruby"
--8<-- "snippets/ruby/config/config_ocr.md"
=== "R"
--8<-- "snippets/r/config/config_ocr.md"
For backend selection and language packs, see [OCR Guide](ocr.md). For fine-grained Tesseract tuning, see [TesseractConfig Reference](../reference/configuration.md#tesseractconfig).
### Chunking for RAG
=== "Python"
--8<-- "snippets/python/utils/chunking.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/chunking.md"
=== "Rust"
--8<-- "snippets/rust/advanced/chunking.md"
=== "Go"
--8<-- "snippets/go/utils/chunking.md"
=== "Java"
--8<-- "snippets/java/utils/chunking.md"
=== "C#"
--8<-- "snippets/csharp/advanced/embedding_with_chunking.md"
=== "Ruby"
--8<-- "snippets/ruby/utils/chunking.md"
=== "R"
--8<-- "snippets/r/utils/chunking.md"
## All Configuration Categories
- [ExtractionConfig](../reference/configuration.md#extractionconfig) — top-level options
- [OcrConfig](../reference/configuration.md#ocrconfig) — OCR backend, language, acceleration
- [TesseractConfig](../reference/configuration.md#tesseractconfig) — Tesseract PSM, confidence, table detection
- [ChunkingConfig](../reference/configuration.md#chunkingconfig) — chunk size, overlap
- [TokenReductionConfig](../reference/configuration.md#tokenreductionconfig) — LLM prompt token reduction
- [ContentFilterConfig](../reference/configuration.md#contentfilterconfig) — header/footer/watermark filtering
- [PageConfig](../reference/configuration.md#pageconfig) — page tracking and markers
- [AccelerationConfig](../reference/configuration.md#accelerationconfig) — ONNX Runtime execution provider
## Next Steps
- [Extraction Basics](extraction.md) — core extraction API and supported formats
- [OCR Guide](ocr.md) — backend installation and language setup
- [Advanced Features](advanced.md) — embeddings, language detection, page tracking
- [Plugins Guide](plugins.md) — custom post-processors and validators