215 lines
4.8 KiB
Markdown
215 lines
4.8 KiB
Markdown
# Configuration Guide <span class="version-badge">v4.0.0</span>
|
|
|
|
All extraction behavior is controlled through `ExtractionConfig`. Pass it directly in code or load it from a TOML/YAML/JSON file. Every field is optional. For per-field documentation, see the [Configuration Reference](../reference/configuration.md).
|
|
|
|
## Quick Start
|
|
|
|
=== "Python"
|
|
|
|
--8<-- "snippets/python/config/config_basic.md"
|
|
|
|
=== "TypeScript"
|
|
|
|
--8<-- "snippets/typescript/config/config_basic.md"
|
|
|
|
=== "Rust"
|
|
|
|
--8<-- "snippets/rust/config/config_basic.md"
|
|
|
|
=== "Go"
|
|
|
|
--8<-- "snippets/go/config/config_basic.md"
|
|
|
|
=== "Java"
|
|
|
|
--8<-- "snippets/java/config/config_basic.md"
|
|
|
|
=== "C#"
|
|
|
|
--8<-- "snippets/csharp/config_basic.md"
|
|
|
|
=== "Ruby"
|
|
|
|
--8<-- "snippets/ruby/config/config_basic.md"
|
|
|
|
=== "R"
|
|
|
|
--8<-- "snippets/r/config/config_basic.md"
|
|
|
|
## Configuration Files
|
|
|
|
Three formats are supported. TOML is recommended.
|
|
|
|
=== "TOML (Recommended)"
|
|
|
|
```toml title="kreuzberg.toml"
|
|
use_cache = true
|
|
enable_quality_processing = true
|
|
|
|
[ocr]
|
|
backend = "tesseract"
|
|
language = "eng"
|
|
|
|
[ocr.tesseract_config]
|
|
psm = 3
|
|
```
|
|
|
|
=== "YAML"
|
|
|
|
```yaml title="kreuzberg.yaml"
|
|
use_cache: true
|
|
enable_quality_processing: true
|
|
|
|
ocr:
|
|
backend: tesseract
|
|
language: eng
|
|
tesseract_config:
|
|
psm: 3
|
|
```
|
|
|
|
=== "JSON"
|
|
|
|
```json title="kreuzberg.json"
|
|
{
|
|
"use_cache": true,
|
|
"enable_quality_processing": true,
|
|
"ocr": {
|
|
"backend": "tesseract",
|
|
"language": "eng",
|
|
"tesseract_config": {
|
|
"psm": 3
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### Automatic Discovery
|
|
|
|
When no `--config` path is supplied, Kreuzberg walks up from the current working directory looking for `kreuzberg.toml` and uses the first match. YAML and JSON files are supported only when passed explicitly via `--config`. If nothing is found, defaults are used.
|
|
|
|
=== "Python"
|
|
|
|
--8<-- "snippets/python/config/config_discover.md"
|
|
|
|
=== "TypeScript"
|
|
|
|
--8<-- "snippets/typescript/config/config_discover.md"
|
|
|
|
=== "Rust"
|
|
|
|
--8<-- "snippets/rust/config/config_discover.md"
|
|
|
|
=== "Go"
|
|
|
|
--8<-- "snippets/go/config/config_discover.md"
|
|
|
|
=== "Java"
|
|
|
|
--8<-- "snippets/java/config/config_discover.md"
|
|
|
|
=== "C#"
|
|
|
|
--8<-- "snippets/csharp/config_discover.md"
|
|
|
|
=== "Ruby"
|
|
|
|
--8<-- "snippets/ruby/config/config_discover.md"
|
|
|
|
=== "R"
|
|
|
|
--8<-- "snippets/r/config/config_discover.md"
|
|
|
|
=== "Wasm"
|
|
|
|
--8<-- "snippets/wasm/config/config_discover.md"
|
|
|
|
## Common Use Cases
|
|
|
|
### Setting Up OCR
|
|
|
|
=== "Python"
|
|
|
|
--8<-- "snippets/python/config/config_ocr.md"
|
|
|
|
=== "TypeScript"
|
|
|
|
--8<-- "snippets/typescript/config/config_ocr.md"
|
|
|
|
=== "Rust"
|
|
|
|
--8<-- "snippets/rust/ocr/config_ocr.md"
|
|
|
|
=== "Go"
|
|
|
|
--8<-- "snippets/go/config/config_ocr.md"
|
|
|
|
=== "Java"
|
|
|
|
--8<-- "snippets/java/config/config_ocr.md"
|
|
|
|
=== "C#"
|
|
|
|
--8<-- "snippets/csharp/config_ocr.md"
|
|
|
|
=== "Ruby"
|
|
|
|
--8<-- "snippets/ruby/config/config_ocr.md"
|
|
|
|
=== "R"
|
|
|
|
--8<-- "snippets/r/config/config_ocr.md"
|
|
|
|
For backend selection and language packs, see [OCR Guide](ocr.md). For fine-grained Tesseract tuning, see [TesseractConfig Reference](../reference/configuration.md#tesseractconfig).
|
|
|
|
### Chunking for RAG
|
|
|
|
=== "Python"
|
|
|
|
--8<-- "snippets/python/utils/chunking.md"
|
|
|
|
=== "TypeScript"
|
|
|
|
--8<-- "snippets/typescript/utils/chunking.md"
|
|
|
|
=== "Rust"
|
|
|
|
--8<-- "snippets/rust/advanced/chunking.md"
|
|
|
|
=== "Go"
|
|
|
|
--8<-- "snippets/go/utils/chunking.md"
|
|
|
|
=== "Java"
|
|
|
|
--8<-- "snippets/java/utils/chunking.md"
|
|
|
|
=== "C#"
|
|
|
|
--8<-- "snippets/csharp/advanced/embedding_with_chunking.md"
|
|
|
|
=== "Ruby"
|
|
|
|
--8<-- "snippets/ruby/utils/chunking.md"
|
|
|
|
=== "R"
|
|
|
|
--8<-- "snippets/r/utils/chunking.md"
|
|
|
|
## All Configuration Categories
|
|
|
|
- [ExtractionConfig](../reference/configuration.md#extractionconfig) — top-level options
|
|
- [OcrConfig](../reference/configuration.md#ocrconfig) — OCR backend, language, acceleration
|
|
- [TesseractConfig](../reference/configuration.md#tesseractconfig) — Tesseract PSM, confidence, table detection
|
|
- [ChunkingConfig](../reference/configuration.md#chunkingconfig) — chunk size, overlap
|
|
- [TokenReductionConfig](../reference/configuration.md#tokenreductionconfig) — LLM prompt token reduction
|
|
- [ContentFilterConfig](../reference/configuration.md#contentfilterconfig) — header/footer/watermark filtering
|
|
- [PageConfig](../reference/configuration.md#pageconfig) — page tracking and markers
|
|
- [AccelerationConfig](../reference/configuration.md#accelerationconfig) — ONNX Runtime execution provider
|
|
|
|
## Next Steps
|
|
|
|
- [Extraction Basics](extraction.md) — core extraction API and supported formats
|
|
- [OCR Guide](ocr.md) — backend installation and language setup
|
|
- [Advanced Features](advanced.md) — embeddings, language detection, page tracking
|
|
- [Plugins Guide](plugins.md) — custom post-processors and validators
|