skills/kreuzberg/references/configuration.md

# Configuration Reference

Kreuzberg uses a hierarchical configuration system supporting multiple formats and auto-discovery mechanisms. This reference covers all available configuration options, field names across programming languages, and loading strategies.

## Supported Formats

Kreuzberg configurations can be defined in three formats:

- **TOML** (recommended): `kreuzberg.toml`
- **YAML**: `kreuzberg.yaml`
- **JSON**: `kreuzberg.json`

All formats support the same schema and configuration options.

## Auto-Discovery

When no configuration file is explicitly specified, Kreuzberg searches for configuration files in the following order:

1. Current working directory: `kreuzberg.toml`, `kreuzberg.yaml`, `kreuzberg.json`
2. Parent directories (recursively up the tree, same file name pattern)

The first matching configuration file is loaded.

## Programmatic Loading

### Python

```python
from kreuzberg import ExtractionConfig

# Load from explicit path
config = ExtractionConfig.from_file("kreuzberg.toml")

# Auto-discover configuration
config = ExtractionConfig.discover()
```

### Node.js / TypeScript

```typescript
import { ExtractionConfig } from "@kreuzberg/node";

// Load from explicit path
const config = ExtractionConfig.fromFile("kreuzberg.toml");

// Auto-discover configuration
const config = ExtractionConfig.discover();
```

### CLI

```bash
# Explicit configuration file
kreuzberg extract --config kreuzberg.toml document.pdf

# Auto-discovery (searches default locations)
kreuzberg extract document.pdf
```

## Configuration Schema

The complete TOML schema with all available sections and options:

### Top-Level Options

```toml
use_cache = true
enable_quality_processing = true
force_ocr = false
output_format = "markdown"
result_format = "text"
max_concurrent_extractions = 4
```

| Option                       | Type    | Default      | Description                                                                         |
| ---------------------------- | ------- | ------------ | ----------------------------------------------------------------------------------- |
| `use_cache`                  | boolean | `true`       | Enable caching of extraction results                                                |
| `enable_quality_processing`  | boolean | `true`       | Enable post-processing for output quality                                           |
| `force_ocr`                  | boolean | `false`      | Force OCR processing even for searchable PDFs                                       |
| `disable_ocr`                | boolean | `false`      | Disable OCR entirely — image files return empty content instead of errors (v4.7.0+) |
| `output_format`              | string  | `"markdown"` | Output format (markdown, html, text)                                                |
| `result_format`              | string  | `"text"`     | Result format for structured output                                                 |
| `max_concurrent_extractions` | integer | `4`          | Maximum concurrent document extractions                                             |

### OCR Configuration

```toml
[ocr]
backend = "tesseract"
language = "eng"
```

| Option     | Type   | Default       | Description                                   |
| ---------- | ------ | ------------- | --------------------------------------------- |
| `backend`  | string | `"tesseract"` | OCR backend (currently tesseract)             |
| `language` | string | `"eng"`       | ISO 639-3 language code (eng, deu, fra, etc.) |

#### Tesseract Configuration

```toml
[ocr.tesseract_config]
psm = 3
oem = 3
min_confidence = 0.0
output_format = "text"
enable_table_detection = false
table_min_confidence = 0.5
table_column_threshold = 50
table_row_threshold_ratio = 0.5
use_cache = true
```

| Option                      | Type    | Default  | Description                                |
| --------------------------- | ------- | -------- | ------------------------------------------ |
| `psm`                       | integer | `3`      | Page Segmentation Mode (0-13)              |
| `oem`                       | integer | `3`      | OCR Engine Mode (0-3)                      |
| `min_confidence`            | float   | `0.0`    | Minimum OCR confidence threshold (0.0-1.0) |
| `output_format`             | string  | `"text"` | Output format from OCR                     |
| `enable_table_detection`    | boolean | `false`  | Enable table detection during OCR          |
| `table_min_confidence`      | float   | `0.5`    | Minimum confidence for table cells         |
| `table_column_threshold`    | integer | `50`     | Pixel threshold for column detection       |
| `table_row_threshold_ratio` | float   | `0.5`    | Row height ratio threshold                 |
| `use_cache`                 | boolean | `true`   | Cache OCR results                          |

#### Tesseract Preprocessing

```toml
[ocr.tesseract_config.preprocessing]
target_dpi = 300
auto_rotate = true
deskew = true
denoise = true
contrast_enhance = true
binarization_method = "otsu"
invert_colors = false
```

| Option                | Type    | Default  | Description                                    |
| --------------------- | ------- | -------- | ---------------------------------------------- |
| `target_dpi`          | integer | `300`    | Target DPI for preprocessing                   |
| `auto_rotate`         | boolean | `true`   | Automatically detect and correct page rotation |
| `deskew`              | boolean | `true`   | Correct skewed pages                           |
| `denoise`             | boolean | `true`   | Remove noise from images                       |
| `contrast_enhance`    | boolean | `true`   | Enhance image contrast                         |
| `binarization_method` | string  | `"otsu"` | Method for image binarization                  |
| `invert_colors`       | boolean | `false`  | Invert image colors if needed                  |

### PDF Options

```toml
[pdf_options]
extract_images = true
extract_metadata = true

[pdf_options.hierarchy]
enabled = true
k_clusters = 6
include_bbox = true
ocr_coverage_threshold = 0.5
```

| Option                             | Type    | Default | Description                                    |
| ---------------------------------- | ------- | ------- | ---------------------------------------------- |
| `extract_images`                   | boolean | `true`  | Extract images from PDF documents              |
| `extract_metadata`                 | boolean | `true`  | Extract PDF metadata                           |
| `hierarchy.enabled`                | boolean | `true`  | Enable PDF hierarchy extraction (v4.0.0+)      |
| `hierarchy.k_clusters`             | integer | `6`     | Number of clusters for hierarchy detection     |
| `hierarchy.include_bbox`           | boolean | `true`  | Include bounding boxes in hierarchy            |
| `hierarchy.ocr_coverage_threshold` | float   | `0.5`   | OCR coverage threshold for hierarchy (0.0-1.0) |

### Image Processing

```toml
[images]
extract_images = true
target_dpi = 300
max_image_dimension = 4096
auto_adjust_dpi = true
min_dpi = 72
max_dpi = 600
```

| Option                | Type    | Default | Description                                  |
| --------------------- | ------- | ------- | -------------------------------------------- |
| `extract_images`      | boolean | `true`  | Extract images from documents                |
| `target_dpi`          | integer | `300`   | Target DPI for image processing              |
| `max_image_dimension` | integer | `4096`  | Maximum image dimension in pixels            |
| `auto_adjust_dpi`     | boolean | `true`  | Automatically adjust DPI based on image size |
| `min_dpi`             | integer | `72`    | Minimum DPI threshold                        |
| `max_dpi`             | integer | `600`   | Maximum DPI threshold                        |

### Chunking Configuration

```toml
[chunking]
max_chars = 1000
max_overlap = 200

[chunking.embedding]
batch_size = 32
normalize = true
show_download_progress = true
cache_dir = "~/.cache/kreuzberg/embeddings"

[chunking.embedding.model]
type = "preset"
name = "balanced"
```

| Option                             | Type    | Default                           | Description                                                |
| ---------------------------------- | ------- | --------------------------------- | ---------------------------------------------------------- |
| `max_chars`                        | integer | `1000`                            | Maximum characters per chunk                               |
| `max_overlap`                      | integer | `200`                             | Overlap between consecutive chunks                         |
| `embedding.batch_size`             | integer | `32`                              | Batch size for embedding generation                        |
| `embedding.normalize`              | boolean | `true`                            | Normalize embeddings to unit length                        |
| `embedding.show_download_progress` | boolean | `true`                            | Show progress when downloading models                      |
| `embedding.cache_dir`              | string  | `"~/.cache/kreuzberg/embeddings"` | Directory for caching embeddings                           |
| `embedding.model.type`             | string  | `"preset"`                        | Model type: preset, fastembed, or custom                   |
| `embedding.model.name`             | string  | `"balanced"`                      | Preset model name (balanced, fast, accurate, multilingual) |
| `embedding.model.model`            | string  |                                   | FastEmbed model identifier                                 |
| `embedding.model.model_id`         | string  |                                   | Custom HuggingFace model ID                                |
| `embedding.model.dimensions`       | integer |                                   | Embedding dimensions                                       |

### Keywords Configuration

```toml
[keywords]
algorithm = "yake"
max_keywords = 10
min_score = 0.0
ngram_range = [1, 3]
language = "en"
```

| Option         | Type    | Default  | Description                                 |
| -------------- | ------- | -------- | ------------------------------------------- |
| `algorithm`    | string  | `"yake"` | Keyword extraction algorithm (yake or rake) |
| `max_keywords` | integer | `10`     | Maximum keywords to extract                 |
| `min_score`    | float   | `0.0`    | Minimum relevance score for keywords        |
| `ngram_range`  | array   | `[1, 3]` | N-gram size range [min, max]                |
| `language`     | string  | `"en"`   | Language code for keyword extraction        |

### Token Reduction

```toml
[token_reduction]
mode = "off"
preserve_important_words = true
```

| Option                     | Type    | Default | Description                               |
| -------------------------- | ------- | ------- | ----------------------------------------- |
| `mode`                     | string  | `"off"` | Mode: off, aggressive, moderate, minimal  |
| `preserve_important_words` | boolean | `true`  | Preserve important words during reduction |

### Language Detection

```toml
[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false
```

| Option            | Type    | Default | Description                                |
| ----------------- | ------- | ------- | ------------------------------------------ |
| `enabled`         | boolean | `true`  | Enable automatic language detection        |
| `min_confidence`  | float   | `0.8`   | Minimum confidence threshold for detection |
| `detect_multiple` | boolean | `false` | Detect multiple languages in document      |

### Post-Processor

```toml
[postprocessor]
enabled = true
```

| Option    | Type    | Default | Description                                 |
| --------- | ------- | ------- | ------------------------------------------- |
| `enabled` | boolean | `true`  | Enable post-processing of extracted content |

## FileExtractionConfig (Per-File Overrides)

Passed as an optional parameter to `batch_extract_file` / `batch_extract_bytes` (and their sync variants) to override settings per file in a batch. All fields optional — `None` = use batch default. The separate `_with_configs` functions were removed in v4.5.0.

**Overridable fields:** `enable_quality_processing`, `ocr`, `force_ocr`, `chunking`, `images`, `pdf_options`, `token_reduction`, `language_detection`, `pages`, `keywords`, `postprocessor`, `html_options`, `result_format`, `output_format`, `include_document_structure`, `layout`.

**Batch-level only (not overridable):** `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`.

**Merge semantics:** For each file, `FileExtractionConfig` fields are overlaid on the batch `ExtractionConfig`. `None` falls through to batch default; `Some(value)` replaces the batch default for that file.

```toml
# FileExtractionConfig cannot be specified in config files —
# it is a programmatic API for per-file overrides at runtime.
```

## Naming Conventions

Kreuzberg uses consistent naming conventions across different contexts:

| Context              | Convention | Example                                       |
| -------------------- | ---------- | --------------------------------------------- |
| Python               | snake_case | `max_chars`, `pdf_options`, `use_cache`       |
| Node.js / TypeScript | camelCase  | `maxChars`, `pdfOptions`, `useCache`          |
| Rust                 | snake_case | `max_chars`, `pdf_options`, `use_cache`       |
| TOML / YAML / JSON   | snake_case | `max_chars`, `pdf_options`, `use_cache`       |
| CLI flags            | kebab-case | `--max-chars`, `--pdf-options`, `--use-cache` |

When switching between languages, apply the appropriate conversion:

- Python → Node.js: `snake_case` to `camelCase`
- CLI → Python: `kebab-case` to `snake_case`
- TOML → Python: No conversion needed (both use `snake_case`)

## Environment Variables

The following environment variables can override configuration:

| Variable         | Purpose                             | Example     |
| ---------------- | ----------------------------------- | ----------- |
| `KREUZBERG_HOST` | Server bind address (serve command) | `127.0.0.1` |
| `KREUZBERG_PORT` | Server port (serve command)         | `8080`      |

## Configuration Merging

Configuration sources are merged in priority order (highest to lowest):

1. **CLI flags** (highest priority)
2. **Inline JSON configuration** (programmatic)
3. **Configuration file** (lowest priority)

Later sources override earlier ones. For example, a CLI flag `--max-chars 2000` overrides `max_chars = 1000` in the configuration file.

## Example Configurations

### Minimal Configuration

```toml
use_cache = true
enable_quality_processing = true

[ocr]
backend = "tesseract"
language = "eng"
```

### High-Quality PDF Extraction

```toml
use_cache = true
enable_quality_processing = true
force_ocr = false

[ocr]
backend = "tesseract"
language = "eng"

[ocr.tesseract_config]
psm = 3
oem = 3
enable_table_detection = true
table_min_confidence = 0.7

[pdf_options]
extract_images = true
extract_metadata = true

[pdf_options.hierarchy]
enabled = true
k_clusters = 6

[images]
extract_images = true
target_dpi = 300
```

### Semantic Search Configuration

```toml
[chunking]
max_chars = 800
max_overlap = 150

[chunking.embedding]
batch_size = 32
normalize = true
cache_dir = "~/.cache/kreuzberg/embeddings"

[chunking.embedding.model]
type = "preset"
name = "accurate"

[keywords]
algorithm = "yake"
max_keywords = 15
```

## Field Name Reference

Critical field names to use in configuration files:

- `max_chars` (NOT `max_characters`)
- `max_overlap` (NOT `overlap`)
- `table_min_confidence`
- `table_column_threshold`
- `table_row_threshold_ratio`
- `ocr_coverage_threshold`
- `k_clusters`
- `include_bbox`
- `enable_table_detection`
- `auto_rotate`
- `auto_adjust_dpi`
- `show_download_progress`
- `min_confidence`
- `detect_multiple`

Always verify field names against the source configuration file when adding new options.
Nomad changes 2026-06-01 23:40:55 +02:00			`# Configuration Reference`

			`Kreuzberg uses a hierarchical configuration system supporting multiple formats and auto-discovery mechanisms. This reference covers all available configuration options, field names across programming languages, and loading strategies.`

			`## Supported Formats`

			`Kreuzberg configurations can be defined in three formats:`

			- TOML (recommended): `kreuzberg.toml`
			- YAML: `kreuzberg.yaml`
			- JSON: `kreuzberg.json`

			`All formats support the same schema and configuration options.`

			`## Auto-Discovery`

			`When no configuration file is explicitly specified, Kreuzberg searches for configuration files in the following order:`

			1. Current working directory: `kreuzberg.toml`, `kreuzberg.yaml`, `kreuzberg.json`
			`2. Parent directories (recursively up the tree, same file name pattern)`

			`The first matching configuration file is loaded.`

			`## Programmatic Loading`

			`### Python`

			```python
			`from kreuzberg import ExtractionConfig`

			`# Load from explicit path`
			`config = ExtractionConfig.from_file("kreuzberg.toml")`

			`# Auto-discover configuration`
			`config = ExtractionConfig.discover()`
			```

			`### Node.js / TypeScript`

			```typescript
			`import { ExtractionConfig } from "@kreuzberg/node";`

			`// Load from explicit path`
			`const config = ExtractionConfig.fromFile("kreuzberg.toml");`

			`// Auto-discover configuration`
			`const config = ExtractionConfig.discover();`
			```

			`### CLI`

			```bash
			`# Explicit configuration file`
			`kreuzberg extract --config kreuzberg.toml document.pdf`

			`# Auto-discovery (searches default locations)`
			`kreuzberg extract document.pdf`
			```

			`## Configuration Schema`

			`The complete TOML schema with all available sections and options:`

			`### Top-Level Options`

			```toml
			`use_cache = true`
			`enable_quality_processing = true`
			`force_ocr = false`
			`output_format = "markdown"`
			`result_format = "text"`
			`max_concurrent_extractions = 4`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| ---------------------------- \| ------- \| ------------ \| ----------------------------------------------------------------------------------- \|`
			\| `use_cache` \| boolean \| `true` \| Enable caching of extraction results \|
			\| `enable_quality_processing` \| boolean \| `true` \| Enable post-processing for output quality \|
			\| `force_ocr` \| boolean \| `false` \| Force OCR processing even for searchable PDFs \|
			\| `disable_ocr` \| boolean \| `false` \| Disable OCR entirely — image files return empty content instead of errors (v4.7.0+) \|
			\| `output_format` \| string \| `"markdown"` \| Output format (markdown, html, text) \|
			\| `result_format` \| string \| `"text"` \| Result format for structured output \|
			\| `max_concurrent_extractions` \| integer \| `4` \| Maximum concurrent document extractions \|

			`### OCR Configuration`

			```toml
			`[ocr]`
			`backend = "tesseract"`
			`language = "eng"`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| ---------- \| ------ \| ------------- \| --------------------------------------------- \|`
			\| `backend` \| string \| `"tesseract"` \| OCR backend (currently tesseract) \|
			\| `language` \| string \| `"eng"` \| ISO 639-3 language code (eng, deu, fra, etc.) \|

			`#### Tesseract Configuration`

			```toml
			`[ocr.tesseract_config]`
			`psm = 3`
			`oem = 3`
			`min_confidence = 0.0`
			`output_format = "text"`
			`enable_table_detection = false`
			`table_min_confidence = 0.5`
			`table_column_threshold = 50`
			`table_row_threshold_ratio = 0.5`
			`use_cache = true`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| --------------------------- \| ------- \| -------- \| ------------------------------------------ \|`
			\| `psm` \| integer \| `3` \| Page Segmentation Mode (0-13) \|
			\| `oem` \| integer \| `3` \| OCR Engine Mode (0-3) \|
			\| `min_confidence` \| float \| `0.0` \| Minimum OCR confidence threshold (0.0-1.0) \|
			\| `output_format` \| string \| `"text"` \| Output format from OCR \|
			\| `enable_table_detection` \| boolean \| `false` \| Enable table detection during OCR \|
			\| `table_min_confidence` \| float \| `0.5` \| Minimum confidence for table cells \|
			\| `table_column_threshold` \| integer \| `50` \| Pixel threshold for column detection \|
			\| `table_row_threshold_ratio` \| float \| `0.5` \| Row height ratio threshold \|
			\| `use_cache` \| boolean \| `true` \| Cache OCR results \|

			`#### Tesseract Preprocessing`

			```toml
			`[ocr.tesseract_config.preprocessing]`
			`target_dpi = 300`
			`auto_rotate = true`
			`deskew = true`
			`denoise = true`
			`contrast_enhance = true`
			`binarization_method = "otsu"`
			`invert_colors = false`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| --------------------- \| ------- \| -------- \| ---------------------------------------------- \|`
			\| `target_dpi` \| integer \| `300` \| Target DPI for preprocessing \|
			\| `auto_rotate` \| boolean \| `true` \| Automatically detect and correct page rotation \|
			\| `deskew` \| boolean \| `true` \| Correct skewed pages \|
			\| `denoise` \| boolean \| `true` \| Remove noise from images \|
			\| `contrast_enhance` \| boolean \| `true` \| Enhance image contrast \|
			\| `binarization_method` \| string \| `"otsu"` \| Method for image binarization \|
			\| `invert_colors` \| boolean \| `false` \| Invert image colors if needed \|

			`### PDF Options`

			```toml
			`[pdf_options]`
			`extract_images = true`
			`extract_metadata = true`

			`[pdf_options.hierarchy]`
			`enabled = true`
			`k_clusters = 6`
			`include_bbox = true`
			`ocr_coverage_threshold = 0.5`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| ---------------------------------- \| ------- \| ------- \| ---------------------------------------------- \|`
			\| `extract_images` \| boolean \| `true` \| Extract images from PDF documents \|
			\| `extract_metadata` \| boolean \| `true` \| Extract PDF metadata \|
			\| `hierarchy.enabled` \| boolean \| `true` \| Enable PDF hierarchy extraction (v4.0.0+) \|
			\| `hierarchy.k_clusters` \| integer \| `6` \| Number of clusters for hierarchy detection \|
			\| `hierarchy.include_bbox` \| boolean \| `true` \| Include bounding boxes in hierarchy \|
			\| `hierarchy.ocr_coverage_threshold` \| float \| `0.5` \| OCR coverage threshold for hierarchy (0.0-1.0) \|

			`### Image Processing`

			```toml
			`[images]`
			`extract_images = true`
			`target_dpi = 300`
			`max_image_dimension = 4096`
			`auto_adjust_dpi = true`
			`min_dpi = 72`
			`max_dpi = 600`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| --------------------- \| ------- \| ------- \| -------------------------------------------- \|`
			\| `extract_images` \| boolean \| `true` \| Extract images from documents \|
			\| `target_dpi` \| integer \| `300` \| Target DPI for image processing \|
			\| `max_image_dimension` \| integer \| `4096` \| Maximum image dimension in pixels \|
			\| `auto_adjust_dpi` \| boolean \| `true` \| Automatically adjust DPI based on image size \|
			\| `min_dpi` \| integer \| `72` \| Minimum DPI threshold \|
			\| `max_dpi` \| integer \| `600` \| Maximum DPI threshold \|

			`### Chunking Configuration`

			```toml
			`[chunking]`
			`max_chars = 1000`
			`max_overlap = 200`

			`[chunking.embedding]`
			`batch_size = 32`
			`normalize = true`
			`show_download_progress = true`
			`cache_dir = "~/.cache/kreuzberg/embeddings"`

			`[chunking.embedding.model]`
			`type = "preset"`
			`name = "balanced"`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| ---------------------------------- \| ------- \| --------------------------------- \| ---------------------------------------------------------- \|`
			\| `max_chars` \| integer \| `1000` \| Maximum characters per chunk \|
			\| `max_overlap` \| integer \| `200` \| Overlap between consecutive chunks \|
			\| `embedding.batch_size` \| integer \| `32` \| Batch size for embedding generation \|
			\| `embedding.normalize` \| boolean \| `true` \| Normalize embeddings to unit length \|
			\| `embedding.show_download_progress` \| boolean \| `true` \| Show progress when downloading models \|
			\| `embedding.cache_dir` \| string \| `"~/.cache/kreuzberg/embeddings"` \| Directory for caching embeddings \|
			\| `embedding.model.type` \| string \| `"preset"` \| Model type: preset, fastembed, or custom \|
			\| `embedding.model.name` \| string \| `"balanced"` \| Preset model name (balanced, fast, accurate, multilingual) \|
			\| `embedding.model.model` \| string \| \| FastEmbed model identifier \|
			\| `embedding.model.model_id` \| string \| \| Custom HuggingFace model ID \|
			\| `embedding.model.dimensions` \| integer \| \| Embedding dimensions \|

			`### Keywords Configuration`

			```toml
			`[keywords]`
			`algorithm = "yake"`
			`max_keywords = 10`
			`min_score = 0.0`
			`ngram_range = [1, 3]`
			`language = "en"`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| -------------- \| ------- \| -------- \| ------------------------------------------- \|`
			\| `algorithm` \| string \| `"yake"` \| Keyword extraction algorithm (yake or rake) \|
			\| `max_keywords` \| integer \| `10` \| Maximum keywords to extract \|
			\| `min_score` \| float \| `0.0` \| Minimum relevance score for keywords \|
			\| `ngram_range` \| array \| `[1, 3]` \| N-gram size range [min, max] \|
			\| `language` \| string \| `"en"` \| Language code for keyword extraction \|

			`### Token Reduction`

			```toml
			`[token_reduction]`
			`mode = "off"`
			`preserve_important_words = true`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| -------------------------- \| ------- \| ------- \| ----------------------------------------- \|`
			\| `mode` \| string \| `"off"` \| Mode: off, aggressive, moderate, minimal \|
			\| `preserve_important_words` \| boolean \| `true` \| Preserve important words during reduction \|

			`### Language Detection`

			```toml
			`[language_detection]`
			`enabled = true`
			`min_confidence = 0.8`
			`detect_multiple = false`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| ----------------- \| ------- \| ------- \| ------------------------------------------ \|`
			\| `enabled` \| boolean \| `true` \| Enable automatic language detection \|
			\| `min_confidence` \| float \| `0.8` \| Minimum confidence threshold for detection \|
			\| `detect_multiple` \| boolean \| `false` \| Detect multiple languages in document \|

			`### Post-Processor`

			```toml
			`[postprocessor]`
			`enabled = true`
			```

			`\| Option \| Type \| Default \| Description \|`
			`\| --------- \| ------- \| ------- \| ------------------------------------------- \|`
			\| `enabled` \| boolean \| `true` \| Enable post-processing of extracted content \|

			`## FileExtractionConfig (Per-File Overrides)`

			Passed as an optional parameter to `batch_extract_file` / `batch_extract_bytes` (and their sync variants) to override settings per file in a batch. All fields optional — `None` = use batch default. The separate `_with_configs` functions were removed in v4.5.0.

			Overridable fields: `enable_quality_processing`, `ocr`, `force_ocr`, `chunking`, `images`, `pdf_options`, `token_reduction`, `language_detection`, `pages`, `keywords`, `postprocessor`, `html_options`, `result_format`, `output_format`, `include_document_structure`, `layout`.

			Batch-level only (not overridable): `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`.

			Merge semantics: For each file, `FileExtractionConfig` fields are overlaid on the batch `ExtractionConfig`. `None` falls through to batch default; `Some(value)` replaces the batch default for that file.

			```toml
			`# FileExtractionConfig cannot be specified in config files —`
			`# it is a programmatic API for per-file overrides at runtime.`
			```

			`## Naming Conventions`

			`Kreuzberg uses consistent naming conventions across different contexts:`

			`\| Context \| Convention \| Example \|`
			`\| -------------------- \| ---------- \| --------------------------------------------- \|`
			\| Python \| snake_case \| `max_chars`, `pdf_options`, `use_cache` \|
			\| Node.js / TypeScript \| camelCase \| `maxChars`, `pdfOptions`, `useCache` \|
			\| Rust \| snake_case \| `max_chars`, `pdf_options`, `use_cache` \|
			\| TOML / YAML / JSON \| snake_case \| `max_chars`, `pdf_options`, `use_cache` \|
			\| CLI flags \| kebab-case \| `--max-chars`, `--pdf-options`, `--use-cache` \|

			`When switching between languages, apply the appropriate conversion:`

			- Python → Node.js: `snake_case` to `camelCase`
			- CLI → Python: `kebab-case` to `snake_case`
			- TOML → Python: No conversion needed (both use `snake_case`)

			`## Environment Variables`

			`The following environment variables can override configuration:`

			`\| Variable \| Purpose \| Example \|`
			`\| ---------------- \| ----------------------------------- \| ----------- \|`
			\| `KREUZBERG_HOST` \| Server bind address (serve command) \| `127.0.0.1` \|
			\| `KREUZBERG_PORT` \| Server port (serve command) \| `8080` \|

			`## Configuration Merging`

			`Configuration sources are merged in priority order (highest to lowest):`

			`1. CLI flags (highest priority)`
			`2. Inline JSON configuration (programmatic)`
			`3. Configuration file (lowest priority)`

			Later sources override earlier ones. For example, a CLI flag `--max-chars 2000` overrides `max_chars = 1000` in the configuration file.

			`## Example Configurations`

			`### Minimal Configuration`

			```toml
			`use_cache = true`
			`enable_quality_processing = true`

			`[ocr]`
			`backend = "tesseract"`
			`language = "eng"`
			```

			`### High-Quality PDF Extraction`

			```toml
			`use_cache = true`
			`enable_quality_processing = true`
			`force_ocr = false`

			`[ocr]`
			`backend = "tesseract"`
			`language = "eng"`

			`[ocr.tesseract_config]`
			`psm = 3`
			`oem = 3`
			`enable_table_detection = true`
			`table_min_confidence = 0.7`

			`[pdf_options]`
			`extract_images = true`
			`extract_metadata = true`

			`[pdf_options.hierarchy]`
			`enabled = true`
			`k_clusters = 6`

			`[images]`
			`extract_images = true`
			`target_dpi = 300`
			```

			`### Semantic Search Configuration`

			```toml
			`[chunking]`
			`max_chars = 800`
			`max_overlap = 150`

			`[chunking.embedding]`
			`batch_size = 32`
			`normalize = true`
			`cache_dir = "~/.cache/kreuzberg/embeddings"`

			`[chunking.embedding.model]`
			`type = "preset"`
			`name = "accurate"`

			`[keywords]`
			`algorithm = "yake"`
			`max_keywords = 15`
			```

			`## Field Name Reference`

			`Critical field names to use in configuration files:`

			- `max_chars` (NOT `max_characters`)
			- `max_overlap` (NOT `overlap`)
			- `table_min_confidence`
			- `table_column_threshold`
			- `table_row_threshold_ratio`
			- `ocr_coverage_threshold`
			- `k_clusters`
			- `include_bbox`
			- `enable_table_detection`
			- `auto_rotate`
			- `auto_adjust_dpi`
			- `show_download_progress`
			- `min_confidence`
			- `detect_multiple`

			`Always verify field names against the source configuration file when adding new options.`