# Configuration Reference

Kreuzberg uses a hierarchical configuration system supporting multiple formats and auto-discovery mechanisms. This reference covers all available configuration options, field names across programming languages, and loading strategies.

## Supported Formats

Kreuzberg configurations can be defined in three formats:

- **TOML** (recommended): `kreuzberg.toml`
- **YAML**: `kreuzberg.yaml`
- **JSON**: `kreuzberg.json`

All formats support the same schema and configuration options.

## Auto-Discovery

When no configuration file is explicitly specified, Kreuzberg searches for configuration files in the following order:

1. Current working directory: `kreuzberg.toml`, `kreuzberg.yaml`, `kreuzberg.json`
2. Parent directories (recursively up the tree, same file name pattern)

The first matching configuration file is loaded.

## Programmatic Loading

### Python

```python
from kreuzberg import ExtractionConfig

# Load from explicit path
config = ExtractionConfig.from_file("kreuzberg.toml")

# Auto-discover configuration
config = ExtractionConfig.discover()
```

### Node.js / TypeScript

```typescript
import { ExtractionConfig } from "@kreuzberg/node";

// Load from explicit path
const config = ExtractionConfig.fromFile("kreuzberg.toml");

// Auto-discover configuration
const config = ExtractionConfig.discover();
```

### CLI

```bash
# Explicit configuration file
kreuzberg extract --config kreuzberg.toml document.pdf

# Auto-discovery (searches default locations)
kreuzberg extract document.pdf
```

## Configuration Schema

The complete TOML schema with all available sections and options:

### Top-Level Options

```toml
use_cache = true
enable_quality_processing = true
force_ocr = false
output_format = "markdown"
result_format = "text"
max_concurrent_extractions = 4
```

| Option                       | Type    | Default      | Description                                                                         |
| ---------------------------- | ------- | ------------ | ----------------------------------------------------------------------------------- |
| `use_cache`                  | boolean | `true`       | Enable caching of extraction results                                                |
| `enable_quality_processing`  | boolean | `true`       | Enable post-processing for output quality                                           |
| `force_ocr`                  | boolean | `false`      | Force OCR processing even for searchable PDFs                                       |
| `disable_ocr`                | boolean | `false`      | Disable OCR entirely — image files return empty content instead of errors (v4.7.0+) |
| `output_format`              | string  | `"markdown"` | Output format (markdown, html, text)                                                |
| `result_format`              | string  | `"text"`     | Result format for structured output                                                 |
| `max_concurrent_extractions` | integer | `4`          | Maximum concurrent document extractions                                             |

### OCR Configuration

```toml
[ocr]
backend = "tesseract"
language = "eng"
```

| Option     | Type   | Default       | Description                                   |
| ---------- | ------ | ------------- | --------------------------------------------- |
| `backend`  | string | `"tesseract"` | OCR backend (currently tesseract)             |
| `language` | string | `"eng"`       | ISO 639-3 language code (eng, deu, fra, etc.) |

#### Tesseract Configuration

```toml
[ocr.tesseract_config]
psm = 3
oem = 3
min_confidence = 0.0
output_format = "text"
enable_table_detection = false
table_min_confidence = 0.5
table_column_threshold = 50
table_row_threshold_ratio = 0.5
use_cache = true
```

| Option                      | Type    | Default  | Description                                |
| --------------------------- | ------- | -------- | ------------------------------------------ |
| `psm`                       | integer | `3`      | Page Segmentation Mode (0-13)              |
| `oem`                       | integer | `3`      | OCR Engine Mode (0-3)                      |
| `min_confidence`            | float   | `0.0`    | Minimum OCR confidence threshold (0.0-1.0) |
| `output_format`             | string  | `"text"` | Output format from OCR                     |
| `enable_table_detection`    | boolean | `false`  | Enable table detection during OCR          |
| `table_min_confidence`      | float   | `0.5`    | Minimum confidence for table cells         |
| `table_column_threshold`    | integer | `50`     | Pixel threshold for column detection       |
| `table_row_threshold_ratio` | float   | `0.5`    | Row height ratio threshold                 |
| `use_cache`                 | boolean | `true`   | Cache OCR results                          |

#### Tesseract Preprocessing

```toml
[ocr.tesseract_config.preprocessing]
target_dpi = 300
auto_rotate = true
deskew = true
denoise = true
contrast_enhance = true
binarization_method = "otsu"
invert_colors = false
```

| Option                | Type    | Default  | Description                                    |
| --------------------- | ------- | -------- | ---------------------------------------------- |
| `target_dpi`          | integer | `300`    | Target DPI for preprocessing                   |
| `auto_rotate`         | boolean | `true`   | Automatically detect and correct page rotation |
| `deskew`              | boolean | `true`   | Correct skewed pages                           |
| `denoise`             | boolean | `true`   | Remove noise from images                       |
| `contrast_enhance`    | boolean | `true`   | Enhance image contrast                         |
| `binarization_method` | string  | `"otsu"` | Method for image binarization                  |
| `invert_colors`       | boolean | `false`  | Invert image colors if needed                  |

### PDF Options

```toml
[pdf_options]
extract_images = true
extract_metadata = true

[pdf_options.hierarchy]
enabled = true
k_clusters = 6
include_bbox = true
ocr_coverage_threshold = 0.5
```

| Option                             | Type    | Default | Description                                    |
| ---------------------------------- | ------- | ------- | ---------------------------------------------- |
| `extract_images`                   | boolean | `true`  | Extract images from PDF documents              |
| `extract_metadata`                 | boolean | `true`  | Extract PDF metadata                           |
| `hierarchy.enabled`                | boolean | `true`  | Enable PDF hierarchy extraction (v4.0.0+)      |
| `hierarchy.k_clusters`             | integer | `6`     | Number of clusters for hierarchy detection     |
| `hierarchy.include_bbox`           | boolean | `true`  | Include bounding boxes in hierarchy            |
| `hierarchy.ocr_coverage_threshold` | float   | `0.5`   | OCR coverage threshold for hierarchy (0.0-1.0) |

### Image Processing

```toml
[images]
extract_images = true
target_dpi = 300
max_image_dimension = 4096
auto_adjust_dpi = true
min_dpi = 72
max_dpi = 600
```

| Option                | Type    | Default | Description                                  |
| --------------------- | ------- | ------- | -------------------------------------------- |
| `extract_images`      | boolean | `true`  | Extract images from documents                |
| `target_dpi`          | integer | `300`   | Target DPI for image processing              |
| `max_image_dimension` | integer | `4096`  | Maximum image dimension in pixels            |
| `auto_adjust_dpi`     | boolean | `true`  | Automatically adjust DPI based on image size |
| `min_dpi`             | integer | `72`    | Minimum DPI threshold                        |
| `max_dpi`             | integer | `600`   | Maximum DPI threshold                        |

### Chunking Configuration

```toml
[chunking]
max_chars = 1000
max_overlap = 200

[chunking.embedding]
batch_size = 32
normalize = true
show_download_progress = true
cache_dir = "~/.cache/kreuzberg/embeddings"

[chunking.embedding.model]
type = "preset"
name = "balanced"
```

| Option                             | Type    | Default                           | Description                                                |
| ---------------------------------- | ------- | --------------------------------- | ---------------------------------------------------------- |
| `max_chars`                        | integer | `1000`                            | Maximum characters per chunk                               |
| `max_overlap`                      | integer | `200`                             | Overlap between consecutive chunks                         |
| `embedding.batch_size`             | integer | `32`                              | Batch size for embedding generation                        |
| `embedding.normalize`              | boolean | `true`                            | Normalize embeddings to unit length                        |
| `embedding.show_download_progress` | boolean | `true`                            | Show progress when downloading models                      |
| `embedding.cache_dir`              | string  | `"~/.cache/kreuzberg/embeddings"` | Directory for caching embeddings                           |
| `embedding.model.type`             | string  | `"preset"`                        | Model type: preset, fastembed, or custom                   |
| `embedding.model.name`             | string  | `"balanced"`                      | Preset model name (balanced, fast, accurate, multilingual) |
| `embedding.model.model`            | string  |                                   | FastEmbed model identifier                                 |
| `embedding.model.model_id`         | string  |                                   | Custom HuggingFace model ID                                |
| `embedding.model.dimensions`       | integer |                                   | Embedding dimensions                                       |

### Keywords Configuration

```toml
[keywords]
algorithm = "yake"
max_keywords = 10
min_score = 0.0
ngram_range = [1, 3]
language = "en"
```

| Option         | Type    | Default  | Description                                 |
| -------------- | ------- | -------- | ------------------------------------------- |
| `algorithm`    | string  | `"yake"` | Keyword extraction algorithm (yake or rake) |
| `max_keywords` | integer | `10`     | Maximum keywords to extract                 |
| `min_score`    | float   | `0.0`    | Minimum relevance score for keywords        |
| `ngram_range`  | array   | `[1, 3]` | N-gram size range [min, max]                |
| `language`     | string  | `"en"`   | Language code for keyword extraction        |

### Token Reduction

```toml
[token_reduction]
mode = "off"
preserve_important_words = true
```

| Option                     | Type    | Default | Description                               |
| -------------------------- | ------- | ------- | ----------------------------------------- |
| `mode`                     | string  | `"off"` | Mode: off, aggressive, moderate, minimal  |
| `preserve_important_words` | boolean | `true`  | Preserve important words during reduction |

### Language Detection

```toml
[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false
```

| Option            | Type    | Default | Description                                |
| ----------------- | ------- | ------- | ------------------------------------------ |
| `enabled`         | boolean | `true`  | Enable automatic language detection        |
| `min_confidence`  | float   | `0.8`   | Minimum confidence threshold for detection |
| `detect_multiple` | boolean | `false` | Detect multiple languages in document      |

### Post-Processor

```toml
[postprocessor]
enabled = true
```

| Option    | Type    | Default | Description                                 |
| --------- | ------- | ------- | ------------------------------------------- |
| `enabled` | boolean | `true`  | Enable post-processing of extracted content |

## FileExtractionConfig (Per-File Overrides)

Passed as an optional parameter to `batch_extract_file` / `batch_extract_bytes` (and their sync variants) to override settings per file in a batch. All fields optional — `None` = use batch default. The separate `_with_configs` functions were removed in v4.5.0.

**Overridable fields:** `enable_quality_processing`, `ocr`, `force_ocr`, `chunking`, `images`, `pdf_options`, `token_reduction`, `language_detection`, `pages`, `keywords`, `postprocessor`, `html_options`, `result_format`, `output_format`, `include_document_structure`, `layout`.

**Batch-level only (not overridable):** `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`.

**Merge semantics:** For each file, `FileExtractionConfig` fields are overlaid on the batch `ExtractionConfig`. `None` falls through to batch default; `Some(value)` replaces the batch default for that file.

```toml
# FileExtractionConfig cannot be specified in config files —
# it is a programmatic API for per-file overrides at runtime.
```

## Naming Conventions

Kreuzberg uses consistent naming conventions across different contexts:

| Context              | Convention | Example                                       |
| -------------------- | ---------- | --------------------------------------------- |
| Python               | snake_case | `max_chars`, `pdf_options`, `use_cache`       |
| Node.js / TypeScript | camelCase  | `maxChars`, `pdfOptions`, `useCache`          |
| Rust                 | snake_case | `max_chars`, `pdf_options`, `use_cache`       |
| TOML / YAML / JSON   | snake_case | `max_chars`, `pdf_options`, `use_cache`       |
| CLI flags            | kebab-case | `--max-chars`, `--pdf-options`, `--use-cache` |

When switching between languages, apply the appropriate conversion:

- Python → Node.js: `snake_case` to `camelCase`
- CLI → Python: `kebab-case` to `snake_case`
- TOML → Python: No conversion needed (both use `snake_case`)

## Environment Variables

The following environment variables can override configuration:

| Variable         | Purpose                             | Example     |
| ---------------- | ----------------------------------- | ----------- |
| `KREUZBERG_HOST` | Server bind address (serve command) | `127.0.0.1` |
| `KREUZBERG_PORT` | Server port (serve command)         | `8080`      |

## Configuration Merging

Configuration sources are merged in priority order (highest to lowest):

1. **CLI flags** (highest priority)
2. **Inline JSON configuration** (programmatic)
3. **Configuration file** (lowest priority)

Later sources override earlier ones. For example, a CLI flag `--max-chars 2000` overrides `max_chars = 1000` in the configuration file.

## Example Configurations

### Minimal Configuration

```toml
use_cache = true
enable_quality_processing = true

[ocr]
backend = "tesseract"
language = "eng"
```

### High-Quality PDF Extraction

```toml
use_cache = true
enable_quality_processing = true
force_ocr = false

[ocr]
backend = "tesseract"
language = "eng"

[ocr.tesseract_config]
psm = 3
oem = 3
enable_table_detection = true
table_min_confidence = 0.7

[pdf_options]
extract_images = true
extract_metadata = true

[pdf_options.hierarchy]
enabled = true
k_clusters = 6

[images]
extract_images = true
target_dpi = 300
```

### Semantic Search Configuration

```toml
[chunking]
max_chars = 800
max_overlap = 150

[chunking.embedding]
batch_size = 32
normalize = true
cache_dir = "~/.cache/kreuzberg/embeddings"

[chunking.embedding.model]
type = "preset"
name = "accurate"

[keywords]
algorithm = "yake"
max_keywords = 15
```

## Field Name Reference

Critical field names to use in configuration files:

- `max_chars` (NOT `max_characters`)
- `max_overlap` (NOT `overlap`)
- `table_min_confidence`
- `table_column_threshold`
- `table_row_threshold_ratio`
- `ocr_coverage_threshold`
- `k_clusters`
- `include_bbox`
- `enable_table_detection`
- `auto_rotate`
- `auto_adjust_dpi`
- `show_download_progress`
- `min_confidence`
- `detect_multiple`

Always verify field names against the source configuration file when adding new options.