418 lines
16 KiB
Markdown
418 lines
16 KiB
Markdown
|
|
# Configuration Reference
|
||
|
|
|
||
|
|
Kreuzberg uses a hierarchical configuration system supporting multiple formats and auto-discovery mechanisms. This reference covers all available configuration options, field names across programming languages, and loading strategies.
|
||
|
|
|
||
|
|
## Supported Formats
|
||
|
|
|
||
|
|
Kreuzberg configurations can be defined in three formats:
|
||
|
|
|
||
|
|
- **TOML** (recommended): `kreuzberg.toml`
|
||
|
|
- **YAML**: `kreuzberg.yaml`
|
||
|
|
- **JSON**: `kreuzberg.json`
|
||
|
|
|
||
|
|
All formats support the same schema and configuration options.
|
||
|
|
|
||
|
|
## Auto-Discovery
|
||
|
|
|
||
|
|
When no configuration file is explicitly specified, Kreuzberg searches for configuration files in the following order:
|
||
|
|
|
||
|
|
1. Current working directory: `kreuzberg.toml`, `kreuzberg.yaml`, `kreuzberg.json`
|
||
|
|
2. Parent directories (recursively up the tree, same file name pattern)
|
||
|
|
|
||
|
|
The first matching configuration file is loaded.
|
||
|
|
|
||
|
|
## Programmatic Loading
|
||
|
|
|
||
|
|
### Python
|
||
|
|
|
||
|
|
```python
|
||
|
|
from kreuzberg import ExtractionConfig
|
||
|
|
|
||
|
|
# Load from explicit path
|
||
|
|
config = ExtractionConfig.from_file("kreuzberg.toml")
|
||
|
|
|
||
|
|
# Auto-discover configuration
|
||
|
|
config = ExtractionConfig.discover()
|
||
|
|
```
|
||
|
|
|
||
|
|
### Node.js / TypeScript
|
||
|
|
|
||
|
|
```typescript
|
||
|
|
import { ExtractionConfig } from "@kreuzberg/node";
|
||
|
|
|
||
|
|
// Load from explicit path
|
||
|
|
const config = ExtractionConfig.fromFile("kreuzberg.toml");
|
||
|
|
|
||
|
|
// Auto-discover configuration
|
||
|
|
const config = ExtractionConfig.discover();
|
||
|
|
```
|
||
|
|
|
||
|
|
### CLI
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Explicit configuration file
|
||
|
|
kreuzberg extract --config kreuzberg.toml document.pdf
|
||
|
|
|
||
|
|
# Auto-discovery (searches default locations)
|
||
|
|
kreuzberg extract document.pdf
|
||
|
|
```
|
||
|
|
|
||
|
|
## Configuration Schema
|
||
|
|
|
||
|
|
The complete TOML schema with all available sections and options:
|
||
|
|
|
||
|
|
### Top-Level Options
|
||
|
|
|
||
|
|
```toml
|
||
|
|
use_cache = true
|
||
|
|
enable_quality_processing = true
|
||
|
|
force_ocr = false
|
||
|
|
output_format = "markdown"
|
||
|
|
result_format = "text"
|
||
|
|
max_concurrent_extractions = 4
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| ---------------------------- | ------- | ------------ | ----------------------------------------------------------------------------------- |
|
||
|
|
| `use_cache` | boolean | `true` | Enable caching of extraction results |
|
||
|
|
| `enable_quality_processing` | boolean | `true` | Enable post-processing for output quality |
|
||
|
|
| `force_ocr` | boolean | `false` | Force OCR processing even for searchable PDFs |
|
||
|
|
| `disable_ocr` | boolean | `false` | Disable OCR entirely — image files return empty content instead of errors (v4.7.0+) |
|
||
|
|
| `output_format` | string | `"markdown"` | Output format (markdown, html, text) |
|
||
|
|
| `result_format` | string | `"text"` | Result format for structured output |
|
||
|
|
| `max_concurrent_extractions` | integer | `4` | Maximum concurrent document extractions |
|
||
|
|
|
||
|
|
### OCR Configuration
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[ocr]
|
||
|
|
backend = "tesseract"
|
||
|
|
language = "eng"
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| ---------- | ------ | ------------- | --------------------------------------------- |
|
||
|
|
| `backend` | string | `"tesseract"` | OCR backend (currently tesseract) |
|
||
|
|
| `language` | string | `"eng"` | ISO 639-3 language code (eng, deu, fra, etc.) |
|
||
|
|
|
||
|
|
#### Tesseract Configuration
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[ocr.tesseract_config]
|
||
|
|
psm = 3
|
||
|
|
oem = 3
|
||
|
|
min_confidence = 0.0
|
||
|
|
output_format = "text"
|
||
|
|
enable_table_detection = false
|
||
|
|
table_min_confidence = 0.5
|
||
|
|
table_column_threshold = 50
|
||
|
|
table_row_threshold_ratio = 0.5
|
||
|
|
use_cache = true
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| --------------------------- | ------- | -------- | ------------------------------------------ |
|
||
|
|
| `psm` | integer | `3` | Page Segmentation Mode (0-13) |
|
||
|
|
| `oem` | integer | `3` | OCR Engine Mode (0-3) |
|
||
|
|
| `min_confidence` | float | `0.0` | Minimum OCR confidence threshold (0.0-1.0) |
|
||
|
|
| `output_format` | string | `"text"` | Output format from OCR |
|
||
|
|
| `enable_table_detection` | boolean | `false` | Enable table detection during OCR |
|
||
|
|
| `table_min_confidence` | float | `0.5` | Minimum confidence for table cells |
|
||
|
|
| `table_column_threshold` | integer | `50` | Pixel threshold for column detection |
|
||
|
|
| `table_row_threshold_ratio` | float | `0.5` | Row height ratio threshold |
|
||
|
|
| `use_cache` | boolean | `true` | Cache OCR results |
|
||
|
|
|
||
|
|
#### Tesseract Preprocessing
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[ocr.tesseract_config.preprocessing]
|
||
|
|
target_dpi = 300
|
||
|
|
auto_rotate = true
|
||
|
|
deskew = true
|
||
|
|
denoise = true
|
||
|
|
contrast_enhance = true
|
||
|
|
binarization_method = "otsu"
|
||
|
|
invert_colors = false
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| --------------------- | ------- | -------- | ---------------------------------------------- |
|
||
|
|
| `target_dpi` | integer | `300` | Target DPI for preprocessing |
|
||
|
|
| `auto_rotate` | boolean | `true` | Automatically detect and correct page rotation |
|
||
|
|
| `deskew` | boolean | `true` | Correct skewed pages |
|
||
|
|
| `denoise` | boolean | `true` | Remove noise from images |
|
||
|
|
| `contrast_enhance` | boolean | `true` | Enhance image contrast |
|
||
|
|
| `binarization_method` | string | `"otsu"` | Method for image binarization |
|
||
|
|
| `invert_colors` | boolean | `false` | Invert image colors if needed |
|
||
|
|
|
||
|
|
### PDF Options
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[pdf_options]
|
||
|
|
extract_images = true
|
||
|
|
extract_metadata = true
|
||
|
|
|
||
|
|
[pdf_options.hierarchy]
|
||
|
|
enabled = true
|
||
|
|
k_clusters = 6
|
||
|
|
include_bbox = true
|
||
|
|
ocr_coverage_threshold = 0.5
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| ---------------------------------- | ------- | ------- | ---------------------------------------------- |
|
||
|
|
| `extract_images` | boolean | `true` | Extract images from PDF documents |
|
||
|
|
| `extract_metadata` | boolean | `true` | Extract PDF metadata |
|
||
|
|
| `hierarchy.enabled` | boolean | `true` | Enable PDF hierarchy extraction (v4.0.0+) |
|
||
|
|
| `hierarchy.k_clusters` | integer | `6` | Number of clusters for hierarchy detection |
|
||
|
|
| `hierarchy.include_bbox` | boolean | `true` | Include bounding boxes in hierarchy |
|
||
|
|
| `hierarchy.ocr_coverage_threshold` | float | `0.5` | OCR coverage threshold for hierarchy (0.0-1.0) |
|
||
|
|
|
||
|
|
### Image Processing
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[images]
|
||
|
|
extract_images = true
|
||
|
|
target_dpi = 300
|
||
|
|
max_image_dimension = 4096
|
||
|
|
auto_adjust_dpi = true
|
||
|
|
min_dpi = 72
|
||
|
|
max_dpi = 600
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| --------------------- | ------- | ------- | -------------------------------------------- |
|
||
|
|
| `extract_images` | boolean | `true` | Extract images from documents |
|
||
|
|
| `target_dpi` | integer | `300` | Target DPI for image processing |
|
||
|
|
| `max_image_dimension` | integer | `4096` | Maximum image dimension in pixels |
|
||
|
|
| `auto_adjust_dpi` | boolean | `true` | Automatically adjust DPI based on image size |
|
||
|
|
| `min_dpi` | integer | `72` | Minimum DPI threshold |
|
||
|
|
| `max_dpi` | integer | `600` | Maximum DPI threshold |
|
||
|
|
|
||
|
|
### Chunking Configuration
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[chunking]
|
||
|
|
max_chars = 1000
|
||
|
|
max_overlap = 200
|
||
|
|
|
||
|
|
[chunking.embedding]
|
||
|
|
batch_size = 32
|
||
|
|
normalize = true
|
||
|
|
show_download_progress = true
|
||
|
|
cache_dir = "~/.cache/kreuzberg/embeddings"
|
||
|
|
|
||
|
|
[chunking.embedding.model]
|
||
|
|
type = "preset"
|
||
|
|
name = "balanced"
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| ---------------------------------- | ------- | --------------------------------- | ---------------------------------------------------------- |
|
||
|
|
| `max_chars` | integer | `1000` | Maximum characters per chunk |
|
||
|
|
| `max_overlap` | integer | `200` | Overlap between consecutive chunks |
|
||
|
|
| `embedding.batch_size` | integer | `32` | Batch size for embedding generation |
|
||
|
|
| `embedding.normalize` | boolean | `true` | Normalize embeddings to unit length |
|
||
|
|
| `embedding.show_download_progress` | boolean | `true` | Show progress when downloading models |
|
||
|
|
| `embedding.cache_dir` | string | `"~/.cache/kreuzberg/embeddings"` | Directory for caching embeddings |
|
||
|
|
| `embedding.model.type` | string | `"preset"` | Model type: preset, fastembed, or custom |
|
||
|
|
| `embedding.model.name` | string | `"balanced"` | Preset model name (balanced, fast, accurate, multilingual) |
|
||
|
|
| `embedding.model.model` | string | | FastEmbed model identifier |
|
||
|
|
| `embedding.model.model_id` | string | | Custom HuggingFace model ID |
|
||
|
|
| `embedding.model.dimensions` | integer | | Embedding dimensions |
|
||
|
|
|
||
|
|
### Keywords Configuration
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[keywords]
|
||
|
|
algorithm = "yake"
|
||
|
|
max_keywords = 10
|
||
|
|
min_score = 0.0
|
||
|
|
ngram_range = [1, 3]
|
||
|
|
language = "en"
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| -------------- | ------- | -------- | ------------------------------------------- |
|
||
|
|
| `algorithm` | string | `"yake"` | Keyword extraction algorithm (yake or rake) |
|
||
|
|
| `max_keywords` | integer | `10` | Maximum keywords to extract |
|
||
|
|
| `min_score` | float | `0.0` | Minimum relevance score for keywords |
|
||
|
|
| `ngram_range` | array | `[1, 3]` | N-gram size range [min, max] |
|
||
|
|
| `language` | string | `"en"` | Language code for keyword extraction |
|
||
|
|
|
||
|
|
### Token Reduction
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[token_reduction]
|
||
|
|
mode = "off"
|
||
|
|
preserve_important_words = true
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| -------------------------- | ------- | ------- | ----------------------------------------- |
|
||
|
|
| `mode` | string | `"off"` | Mode: off, aggressive, moderate, minimal |
|
||
|
|
| `preserve_important_words` | boolean | `true` | Preserve important words during reduction |
|
||
|
|
|
||
|
|
### Language Detection
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[language_detection]
|
||
|
|
enabled = true
|
||
|
|
min_confidence = 0.8
|
||
|
|
detect_multiple = false
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| ----------------- | ------- | ------- | ------------------------------------------ |
|
||
|
|
| `enabled` | boolean | `true` | Enable automatic language detection |
|
||
|
|
| `min_confidence` | float | `0.8` | Minimum confidence threshold for detection |
|
||
|
|
| `detect_multiple` | boolean | `false` | Detect multiple languages in document |
|
||
|
|
|
||
|
|
### Post-Processor
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[postprocessor]
|
||
|
|
enabled = true
|
||
|
|
```
|
||
|
|
|
||
|
|
| Option | Type | Default | Description |
|
||
|
|
| --------- | ------- | ------- | ------------------------------------------- |
|
||
|
|
| `enabled` | boolean | `true` | Enable post-processing of extracted content |
|
||
|
|
|
||
|
|
## FileExtractionConfig (Per-File Overrides)
|
||
|
|
|
||
|
|
Passed as an optional parameter to `batch_extract_file` / `batch_extract_bytes` (and their sync variants) to override settings per file in a batch. All fields optional — `None` = use batch default. The separate `_with_configs` functions were removed in v4.5.0.
|
||
|
|
|
||
|
|
**Overridable fields:** `enable_quality_processing`, `ocr`, `force_ocr`, `chunking`, `images`, `pdf_options`, `token_reduction`, `language_detection`, `pages`, `keywords`, `postprocessor`, `html_options`, `result_format`, `output_format`, `include_document_structure`, `layout`.
|
||
|
|
|
||
|
|
**Batch-level only (not overridable):** `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`.
|
||
|
|
|
||
|
|
**Merge semantics:** For each file, `FileExtractionConfig` fields are overlaid on the batch `ExtractionConfig`. `None` falls through to batch default; `Some(value)` replaces the batch default for that file.
|
||
|
|
|
||
|
|
```toml
|
||
|
|
# FileExtractionConfig cannot be specified in config files —
|
||
|
|
# it is a programmatic API for per-file overrides at runtime.
|
||
|
|
```
|
||
|
|
|
||
|
|
## Naming Conventions
|
||
|
|
|
||
|
|
Kreuzberg uses consistent naming conventions across different contexts:
|
||
|
|
|
||
|
|
| Context | Convention | Example |
|
||
|
|
| -------------------- | ---------- | --------------------------------------------- |
|
||
|
|
| Python | snake_case | `max_chars`, `pdf_options`, `use_cache` |
|
||
|
|
| Node.js / TypeScript | camelCase | `maxChars`, `pdfOptions`, `useCache` |
|
||
|
|
| Rust | snake_case | `max_chars`, `pdf_options`, `use_cache` |
|
||
|
|
| TOML / YAML / JSON | snake_case | `max_chars`, `pdf_options`, `use_cache` |
|
||
|
|
| CLI flags | kebab-case | `--max-chars`, `--pdf-options`, `--use-cache` |
|
||
|
|
|
||
|
|
When switching between languages, apply the appropriate conversion:
|
||
|
|
|
||
|
|
- Python → Node.js: `snake_case` to `camelCase`
|
||
|
|
- CLI → Python: `kebab-case` to `snake_case`
|
||
|
|
- TOML → Python: No conversion needed (both use `snake_case`)
|
||
|
|
|
||
|
|
## Environment Variables
|
||
|
|
|
||
|
|
The following environment variables can override configuration:
|
||
|
|
|
||
|
|
| Variable | Purpose | Example |
|
||
|
|
| ---------------- | ----------------------------------- | ----------- |
|
||
|
|
| `KREUZBERG_HOST` | Server bind address (serve command) | `127.0.0.1` |
|
||
|
|
| `KREUZBERG_PORT` | Server port (serve command) | `8080` |
|
||
|
|
|
||
|
|
## Configuration Merging
|
||
|
|
|
||
|
|
Configuration sources are merged in priority order (highest to lowest):
|
||
|
|
|
||
|
|
1. **CLI flags** (highest priority)
|
||
|
|
2. **Inline JSON configuration** (programmatic)
|
||
|
|
3. **Configuration file** (lowest priority)
|
||
|
|
|
||
|
|
Later sources override earlier ones. For example, a CLI flag `--max-chars 2000` overrides `max_chars = 1000` in the configuration file.
|
||
|
|
|
||
|
|
## Example Configurations
|
||
|
|
|
||
|
|
### Minimal Configuration
|
||
|
|
|
||
|
|
```toml
|
||
|
|
use_cache = true
|
||
|
|
enable_quality_processing = true
|
||
|
|
|
||
|
|
[ocr]
|
||
|
|
backend = "tesseract"
|
||
|
|
language = "eng"
|
||
|
|
```
|
||
|
|
|
||
|
|
### High-Quality PDF Extraction
|
||
|
|
|
||
|
|
```toml
|
||
|
|
use_cache = true
|
||
|
|
enable_quality_processing = true
|
||
|
|
force_ocr = false
|
||
|
|
|
||
|
|
[ocr]
|
||
|
|
backend = "tesseract"
|
||
|
|
language = "eng"
|
||
|
|
|
||
|
|
[ocr.tesseract_config]
|
||
|
|
psm = 3
|
||
|
|
oem = 3
|
||
|
|
enable_table_detection = true
|
||
|
|
table_min_confidence = 0.7
|
||
|
|
|
||
|
|
[pdf_options]
|
||
|
|
extract_images = true
|
||
|
|
extract_metadata = true
|
||
|
|
|
||
|
|
[pdf_options.hierarchy]
|
||
|
|
enabled = true
|
||
|
|
k_clusters = 6
|
||
|
|
|
||
|
|
[images]
|
||
|
|
extract_images = true
|
||
|
|
target_dpi = 300
|
||
|
|
```
|
||
|
|
|
||
|
|
### Semantic Search Configuration
|
||
|
|
|
||
|
|
```toml
|
||
|
|
[chunking]
|
||
|
|
max_chars = 800
|
||
|
|
max_overlap = 150
|
||
|
|
|
||
|
|
[chunking.embedding]
|
||
|
|
batch_size = 32
|
||
|
|
normalize = true
|
||
|
|
cache_dir = "~/.cache/kreuzberg/embeddings"
|
||
|
|
|
||
|
|
[chunking.embedding.model]
|
||
|
|
type = "preset"
|
||
|
|
name = "accurate"
|
||
|
|
|
||
|
|
[keywords]
|
||
|
|
algorithm = "yake"
|
||
|
|
max_keywords = 15
|
||
|
|
```
|
||
|
|
|
||
|
|
## Field Name Reference
|
||
|
|
|
||
|
|
Critical field names to use in configuration files:
|
||
|
|
|
||
|
|
- `max_chars` (NOT `max_characters`)
|
||
|
|
- `max_overlap` (NOT `overlap`)
|
||
|
|
- `table_min_confidence`
|
||
|
|
- `table_column_threshold`
|
||
|
|
- `table_row_threshold_ratio`
|
||
|
|
- `ocr_coverage_threshold`
|
||
|
|
- `k_clusters`
|
||
|
|
- `include_bbox`
|
||
|
|
- `enable_table_detection`
|
||
|
|
- `auto_rotate`
|
||
|
|
- `auto_adjust_dpi`
|
||
|
|
- `show_download_progress`
|
||
|
|
- `min_confidence`
|
||
|
|
- `detect_multiple`
|
||
|
|
|
||
|
|
Always verify field names against the source configuration file when adding new options.
|