# Kreuzberg Python API Reference

Reference for the Kreuzberg Python API. The core extraction logic is implemented in Rust; the Python layer adds OCR backends (EasyOCR, PaddleOCR) and support for custom post-processors.

## Extraction Functions

### Synchronous File Extraction

```python
def extract_file_sync(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
```

Extract content from a file (synchronous).

**Parameters:**

- `file_path` (str | Path): Path to the file
- `mime_type` (str | None): Optional MIME type hint (auto-detected if None)
- `config` (ExtractionConfig | None): Extraction configuration (uses defaults if None)
- `easyocr_kwargs` (dict | None): EasyOCR initialization options (languages, use_gpu, beam_width, etc.)
- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options (lang, use_angle_cls, show_log, etc.)

**Returns:** ExtractionResult with content, metadata, and tables

**Example:**

```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig

# Basic usage
result = extract_file_sync("document.pdf")

# With Tesseract configuration
config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    )
)
result = extract_file_sync("invoice.pdf", config=config)

# With EasyOCR custom options
config = ExtractionConfig(ocr=OcrConfig(backend="easyocr", language="eng"))
result = extract_file_sync("scanned.pdf", config=config, easyocr_kwargs={"use_gpu": True})
```

### Asynchronous File Extraction

```python
async def extract_file(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
```

Extract content from a file (asynchronous). Same parameters and behavior as `extract_file_sync`.

### Synchronous Bytes Extraction

```python
def extract_bytes_sync(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
```

Extract content from bytes (synchronous).

**Parameters:**

- `data` (bytes | bytearray): File content as bytes or bytearray
- `mime_type` (str): MIME type of the data (required for format detection)
- `config` (ExtractionConfig | None): Extraction configuration
- `easyocr_kwargs` (dict | None): EasyOCR initialization options
- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options

**Returns:** ExtractionResult with content, metadata, and tables
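
Because `mime_type` is required here, callers that only have a filename need to derive it first. A minimal sketch using the standard library's `mimetypes` module; `guess_mime_type` is a hypothetical helper (not part of Kreuzberg), and the fallback value is an assumption:

```python
import mimetypes


def guess_mime_type(filename: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from a filename's extension, falling back to `default`."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime or default


# The guessed type would then be passed as the required second argument, e.g.
# extract_bytes_sync(data, guess_mime_type("report.pdf"))
print(guess_mime_type("report.pdf"))
```

Extension-based guessing can be wrong for misnamed files, so an explicit MIME type is preferable whenever the caller already knows it.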

### Asynchronous Bytes Extraction

```python
async def extract_bytes(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
```

Extract content from bytes (asynchronous). Same parameters and behavior as `extract_bytes_sync`.

### Batch File Extraction

```python
async def batch_extract_files(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```

Extract content from multiple files in parallel (asynchronous).

**Parameters:**

- `paths` (list[str | Path]): List of file paths
- `config` (ExtractionConfig | None): Extraction configuration
- `easyocr_kwargs` (dict | None): EasyOCR initialization options
- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options

**Returns:** List of ExtractionResults (one per file)

### Batch File Extraction (Synchronous)

```python
def batch_extract_files_sync(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```

Extract content from multiple files in parallel (synchronous).

### Batch Bytes Extraction

```python
async def batch_extract_bytes(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```

Extract content from multiple byte arrays in parallel (asynchronous).

**Parameters:**

- `data_list` (list[bytes | bytearray]): List of file contents as bytes/bytearray
- `mime_types` (list[str]): List of MIME types (one per data item)
- `config` (ExtractionConfig | None): Extraction configuration
- `easyocr_kwargs` (dict | None): EasyOCR initialization options
- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options

**Returns:** List of ExtractionResults (one per data item)

### Batch Bytes Extraction (Synchronous)

```python
def batch_extract_bytes_sync(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```

Extract content from multiple byte arrays in parallel (synchronous).

### Per-File Config in Batch Functions

As of v4.5.0, per-file configuration overrides are passed as an optional `file_configs` parameter on the unified batch functions:

```python
def batch_extract_files_sync(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    file_configs: list[FileExtractionConfig | None] | None = None,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```

The `file_configs` list must have the same length as `paths`. Each element is either a `FileExtractionConfig` override or `None` to use batch defaults. The same parameter is available on `batch_extract_files`, `batch_extract_bytes_sync`, and `batch_extract_bytes`.

> **Note:** The separate `batch_extract_files_with_configs_sync`, `batch_extract_files_with_configs`, `batch_extract_bytes_with_configs_sync`, and `batch_extract_bytes_with_configs` functions were removed in v4.5.0.
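
Since `file_configs` must line up with `paths` element by element, it can help to build the list from a mapping of overrides. A minimal sketch, assuming overrides are keyed by path; `align_file_configs` is a hypothetical helper, and a plain string stands in for a `FileExtractionConfig` instance so the snippet runs without the library:

```python
from typing import Any, Mapping, Optional, Sequence


def align_file_configs(
    paths: Sequence[str],
    overrides: Mapping[str, Any],
) -> list[Optional[Any]]:
    """Return one entry per path: its override, or None to use batch defaults."""
    unknown = set(overrides) - set(paths)
    if unknown:
        raise KeyError(f"overrides reference paths not in the batch: {sorted(unknown)}")
    return [overrides.get(path) for path in paths]


paths = ["a.pdf", "scan.png", "b.docx"]
# A plain string stands in for a FileExtractionConfig instance here.
file_configs = align_file_configs(paths, {"scan.png": "force-ocr-override"})
assert len(file_configs) == len(paths)
print(file_configs)
```

The resulting list would be passed as `file_configs=file_configs` alongside `paths`.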

## Configuration Classes

### ExtractionConfig

Main extraction configuration for document processing. All attributes are optional and use sensible defaults when not specified.

**Attributes:**

| Field                        | Type                            | Default       | Description                                                                               |
| ---------------------------- | ------------------------------- | ------------- | ----------------------------------------------------------------------------------------- |
| `use_cache`                  | bool                            | True          | Enable caching of extraction results to improve performance on repeated extractions       |
| `enable_quality_processing`  | bool                            | True          | Enable quality post-processing to clean and normalize extracted text                      |
| `ocr`                        | OcrConfig \| None               | None          | OCR configuration for extracting text from images. None = OCR disabled                    |
| `force_ocr`                  | bool                            | False         | Force OCR processing even for searchable PDFs that contain extractable text               |
| `chunking`                   | ChunkingConfig \| None          | None          | Text chunking configuration for dividing content into manageable chunks. None = disabled  |
| `images`                     | ImageExtractionConfig \| None   | None          | Image extraction configuration for extracting images FROM documents. None = no extraction |
| `pdf_options`                | PdfConfig \| None               | None          | PDF-specific options like password handling and metadata extraction                       |
| `token_reduction`            | TokenReductionConfig \| None    | None          | Token reduction configuration for reducing token count in extracted content               |
| `language_detection`         | LanguageDetectionConfig \| None | None          | Language detection configuration for identifying document language(s)                     |
| `keywords`                   | KeywordConfig \| None           | None          | Keyword extraction configuration for identifying important terms and phrases              |
| `postprocessor`              | PostProcessorConfig \| None     | None          | Post-processor configuration for custom text processing                                   |
| `max_concurrent_extractions` | int \| None                     | num_cpus \* 2 | Maximum concurrent extractions in batch operations                                        |
| `html_options`               | HtmlConversionOptions \| None   | None          | HTML conversion options for converting documents to markdown                              |
| `pages`                      | PageConfig \| None              | None          | Page extraction configuration for tracking page boundaries                                |
| `security_limits`            | dict[str, int] \| None          | None          | Security limits configuration                                                             |
| `result_format`              | str                             | "unified"     | Result format: "unified" or "element_based"                                               |
| `output_format`              | str                             | "plain"       | Output content format: "plain", "markdown", "djot", or "html"                             |

**Example:**

```python
from kreuzberg import ExtractionConfig, ChunkingConfig, OcrConfig

# Basic extraction with defaults
config = ExtractionConfig()

# Enable chunking with 512-char chunks and 100-char overlap
config = ExtractionConfig(chunking=ChunkingConfig(max_chars=512, max_overlap=100))

# Enable OCR with Tesseract
config = ExtractionConfig(ocr=OcrConfig(backend="tesseract", language="eng"))

# Multiple options
config = ExtractionConfig(
    use_cache=True,
    enable_quality_processing=True,
    output_format="markdown",
    result_format="unified"
)
```

### FileExtractionConfig

Per-file extraction overrides for batch operations. All fields optional (`None` = use batch default).

**Key fields:** `enable_quality_processing`, `ocr`, `force_ocr`, `chunking`, `images`, `pdf_options`, `token_reduction`, `language_detection`, `pages`, `keywords`, `postprocessor`, `html_options`, `result_format`, `output_format`, `include_document_structure`, `layout`.

Excluded (batch-level only): `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`.

```python
from kreuzberg import FileExtractionConfig, OcrConfig

per_file = FileExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(backend="tesseract", language="deu"),
)
```

### OcrConfig

OCR configuration for extracting text from images.

**Attributes:**

| Field              | Type                    | Default     | Description                                                                                           |
| ------------------ | ----------------------- | ----------- | ----------------------------------------------------------------------------------------------------- |
| `backend`          | str                     | "tesseract" | OCR backend: "tesseract", "easyocr", or "paddleocr"                                                   |
| `language`         | str                     | "eng"       | Language code (ISO 639-3 three-letter: "eng", "deu", "fra" or ISO 639-1 two-letter: "en", "de", "fr") |
| `tesseract_config` | TesseractConfig \| None | None        | Tesseract-specific configuration (only used when backend="tesseract")                                 |

**Example:**

```python
from kreuzberg import OcrConfig

# Tesseract with German language
config = OcrConfig(backend="tesseract", language="deu")

# EasyOCR backend
config = OcrConfig(backend="easyocr", language="eng")

# PaddleOCR backend with Simplified Chinese
config = OcrConfig(backend="paddleocr", language="chi_sim")
```

### TesseractConfig

Detailed Tesseract OCR configuration for tuning recognition to specific document types and quality levels.

**Attributes:**

| Field                                | Type                             | Default    | Description                                                                               |
| ------------------------------------ | -------------------------------- | ---------- | ----------------------------------------------------------------------------------------- |
| `language`                           | str                              | "eng"      | OCR language (ISO 639-3 three-letter code)                                                |
| `psm`                                | int                              | 3          | Page Segmentation Mode: 0 (detection only), 3 (auto), 6 (uniform block), 11 (sparse text) |
| `output_format`                      | str                              | "markdown" | Output format for OCR results                                                             |
| `oem`                                | int                              | 3          | OCR Engine Mode: 0 (legacy), 1 (LSTM), 2 (both), 3 (auto)                                 |
| `min_confidence`                     | float                            | 0.0        | Minimum confidence threshold (0.0-1.0) for accepting OCR results                          |
| `preprocessing`                      | ImagePreprocessingConfig \| None | None       | Image preprocessing configuration before OCR                                              |
| `enable_table_detection`             | bool                             | True       | Enable automatic table detection and extraction                                           |
| `table_min_confidence`               | float                            | 0.0        | Minimum confidence for table detection (0.0-1.0)                                          |
| `table_column_threshold`             | int                              | 50         | Minimum pixel width between columns                                                       |
| `table_row_threshold_ratio`          | float                            | 0.5        | Minimum row height ratio                                                                  |
| `use_cache`                          | bool                             | True       | Cache OCR results for improved performance                                                |
| `classify_use_pre_adapted_templates` | bool                             | True       | Use pre-adapted character templates                                                       |
| `language_model_ngram_on`            | bool                             | False      | Enable language model n-gram processing                                                   |
| `tessedit_dont_blkrej_good_wds`      | bool                             | True       | Don't block-reject good words                                                             |
| `tessedit_dont_rowrej_good_wds`      | bool                             | True       | Don't row-reject good words                                                               |
| `tessedit_enable_dict_correction`    | bool                             | True       | Enable dictionary-based spelling correction                                               |
| `tessedit_char_whitelist`            | str                              | ""         | Whitelist of characters to recognize (empty = all)                                        |
| `tessedit_char_blacklist`            | str                              | ""         | Blacklist of characters to ignore                                                         |
| `tessedit_use_primary_params_model`  | bool                             | True       | Use primary parameters model                                                              |
| `textord_space_size_is_variable`     | bool                             | True       | Allow variable space sizes                                                                |
| `thresholding_method`                | bool                             | False      | Thresholding method for binarization                                                      |

**Example:**

```python
from kreuzberg import TesseractConfig, ImagePreprocessingConfig

# General document OCR
config = TesseractConfig(psm=3, oem=3)

# Invoice/form OCR with table detection
config = TesseractConfig(psm=6, oem=2, enable_table_detection=True, min_confidence=0.6)

# High-precision technical document OCR
config = TesseractConfig(
    psm=3,
    oem=2,
    preprocessing=ImagePreprocessingConfig(denoise=True, contrast_enhance=True, auto_rotate=True),
    min_confidence=0.7,
    tessedit_enable_dict_correction=True,
)

# Numeric-only OCR (for receipts, barcodes)
config = TesseractConfig(psm=6, tessedit_char_whitelist="0123456789.-,", min_confidence=0.8)

# Multiple language document
config = TesseractConfig(language="eng+fra+deu", psm=3, oem=2)
```
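
The `psm` and `oem` fields take Tesseract's integer mode codes. A small sketch of named constants for the most common ones, following Tesseract's own numbering; the `PSM` and `OEM` enums are hypothetical helpers defined here, not part of Kreuzberg:

```python
from enum import IntEnum


class PSM(IntEnum):
    """Subset of Tesseract page segmentation modes."""
    OSD_ONLY = 0       # orientation and script detection only
    AUTO = 3           # fully automatic page segmentation
    SINGLE_COLUMN = 4  # single column of variable-size text
    SINGLE_BLOCK = 6   # single uniform block of text
    SINGLE_LINE = 7    # single text line
    SPARSE_TEXT = 11   # find as much text as possible, in no particular order


class OEM(IntEnum):
    """Tesseract OCR engine modes."""
    LEGACY = 0
    LSTM = 1
    LEGACY_AND_LSTM = 2
    DEFAULT = 3


# Plain ints, ready to be unpacked into a config constructor.
tesseract_kwargs = {"psm": int(PSM.SINGLE_BLOCK), "oem": int(OEM.LSTM)}
print(tesseract_kwargs)
```

These values could then be passed as `TesseractConfig(**tesseract_kwargs)`, which reads more clearly than bare numbers in code review.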

### ChunkingConfig

Text chunking configuration for dividing content into chunks. Chunking is useful for preparing content for embedding, indexing, or processing with length-limited systems (like LLM context windows).

**Attributes:**

| Field         | Type                    | Default | Description                                                                                                                            |
| ------------- | ----------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `max_chars`   | int                     | 1000    | Maximum number of characters per chunk. Chunks larger than this will be split intelligently at sentence/paragraph boundaries           |
| `max_overlap` | int                     | 200     | Overlap between consecutive chunks in characters. Creates context bridges between chunks for smoother processing                       |
| `embedding`   | EmbeddingConfig \| None | None    | Configuration for generating embeddings for each chunk using ONNX models. None = no embeddings                                         |
| `preset`      | str \| None             | None    | Use a preset chunking configuration (overrides individual settings if provided). Use list_embedding_presets() to see available presets |

**IMPORTANT:** The fields are `max_chars` and `max_overlap` (NOT `max_characters` or `overlap`).

**Example:**

```python
from kreuzberg import ExtractionConfig, ChunkingConfig, EmbeddingConfig, EmbeddingModelType

# Basic chunking with defaults
config = ExtractionConfig(chunking=ChunkingConfig())

# Custom chunk size with overlap
config = ExtractionConfig(chunking=ChunkingConfig(max_chars=512, max_overlap=100))

# Chunking with embeddings
config = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=512,
        embedding=EmbeddingConfig(model=EmbeddingModelType.preset("balanced"))
    )
)

# Using a preset (call list_embedding_presets() for valid names)
config = ExtractionConfig(chunking=ChunkingConfig(preset="balanced"))
```
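
To make the interaction between `max_chars` and `max_overlap` concrete, here is a simplified fixed-window sketch. It is not the library's algorithm (which, per the table above, splits at sentence and paragraph boundaries); `chunk_text` is a hypothetical function that only illustrates how each chunk starts `max_chars - max_overlap` characters after the previous one:

```python
def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars, overlapping by max_overlap."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be >= 0 and smaller than max_chars")

    step = max_chars - max_overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


for chunk in chunk_text("abcdefghij", max_chars=4, max_overlap=1):
    print(chunk)
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks and more duplicated text.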

### PdfConfig

PDF-specific extraction configuration.

**Attributes:**

| Field              | Type                    | Default | Description                                                                                         |
| ------------------ | ----------------------- | ------- | --------------------------------------------------------------------------------------------------- |
| `extract_images`   | bool                    | False   | Extract images from PDF documents                                                                   |
| `passwords`        | list[str] \| None       | None    | List of passwords to try when opening encrypted PDFs. Each is tried in order until one succeeds     |
| `extract_metadata` | bool                    | True    | Extract PDF metadata (title, author, creation date, etc.)                                           |
| `hierarchy`        | HierarchyConfig \| None | None    | Document hierarchy detection configuration. None = no hierarchy detection                           |

**Example:**

```python
from kreuzberg import ExtractionConfig, PdfConfig, HierarchyConfig

# Basic PDF configuration
config = ExtractionConfig(pdf_options=PdfConfig())

# Extract metadata and images from PDF
config = ExtractionConfig(pdf_options=PdfConfig(extract_images=True, extract_metadata=True))

# Handle encrypted PDFs
config = ExtractionConfig(pdf_options=PdfConfig(passwords=["password123", "fallback_password"]))

# Enable hierarchy detection
config = ExtractionConfig(pdf_options=PdfConfig(hierarchy=HierarchyConfig(k_clusters=6)))
```

### ImageExtractionConfig

Configuration for extracting images FROM documents. This is NOT for preprocessing images before OCR.

**Attributes:**

| Field                 | Type | Default | Description                                                                          |
| --------------------- | ---- | ------- | ------------------------------------------------------------------------------------ |
| `extract_images`      | bool | True    | Enable image extraction from documents                                               |
| `target_dpi`          | int  | 300     | Target DPI for image normalization. Images are resampled to this DPI for consistency |
| `max_image_dimension` | int  | 4096    | Maximum width or height for extracted images. Larger images are downscaled to fit    |
| `auto_adjust_dpi`     | bool | True    | Automatically adjust DPI based on image content quality                              |
| `min_dpi`             | int  | 72      | Minimum DPI threshold. Images with lower DPI are upscaled                            |
| `max_dpi`             | int  | 600     | Maximum DPI threshold. Images with higher DPI are downscaled                         |

**Example:**

```python
from kreuzberg import ExtractionConfig, ImageExtractionConfig

# Basic image extraction
config = ExtractionConfig(images=ImageExtractionConfig())

# Extract images with custom DPI settings
config = ExtractionConfig(
    images=ImageExtractionConfig(target_dpi=150, max_image_dimension=2048, auto_adjust_dpi=False)
)
```
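
The size and DPI limits above amount to simple clamping arithmetic. A rough sketch of that arithmetic, assuming aspect ratio is preserved and dimensions are rounded to whole pixels (both assumptions; the library's exact behavior may differ). `fit_within` and `clamp_dpi` are hypothetical helpers:

```python
def fit_within(width: int, height: int, max_dimension: int = 4096) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_dimension."""
    longest = max(width, height)
    if longest <= max_dimension:
        return width, height  # already small enough, leave untouched
    scale = max_dimension / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def clamp_dpi(dpi: int, min_dpi: int = 72, max_dpi: int = 600) -> int:
    """Constrain a DPI value to the [min_dpi, max_dpi] range."""
    return max(min_dpi, min(dpi, max_dpi))


print(fit_within(8192, 4096))
print(clamp_dpi(1200))
```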

### EmbeddingConfig

Embedding generation configuration for text chunks, using ONNX models via fastembed-rs.

**Attributes:**

| Field                    | Type               | Default           | Description                                                                                 |
| ------------------------ | ------------------ | ----------------- | ------------------------------------------------------------------------------------------- |
| `model`                  | EmbeddingModelType | Preset "balanced" | The embedding model to use (preset, fastembed, or custom)                                   |
| `normalize`              | bool               | True              | Whether to normalize embedding vectors to unit length (recommended for cosine similarity)   |
| `batch_size`             | int                | 32                | Number of texts to process simultaneously. Higher values use more memory but may be faster  |
| `show_download_progress` | bool               | False             | Display progress during embedding model download                                            |
| `cache_dir`              | str \| None        | None              | Custom directory for caching downloaded models (defaults to ~/.cache/kreuzberg/embeddings/) |

**Example:**

```python
from kreuzberg import EmbeddingConfig, EmbeddingModelType

# Basic preset embedding (recommended)
config = EmbeddingConfig()

# Specific preset with settings
config = EmbeddingConfig(
    model=EmbeddingModelType.preset("balanced"),
    normalize=True,
    batch_size=64
)

# Custom ONNX model
config = EmbeddingConfig(
    model=EmbeddingModelType.custom(model_id="sentence-transformers/all-MiniLM-L6-v2", dimensions=384)
)

# With custom cache directory
config = EmbeddingConfig(cache_dir="/path/to/model/cache")
```
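
The `normalize` option matters because, for unit-length vectors, cosine similarity reduces to a plain dot product. A standard-library sketch of that relationship; `l2_normalize` and `dot` are illustrative helpers, not Kreuzberg functions:

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
# For normalized vectors, this dot product is the cosine similarity.
print(dot(a, b))
```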

### EmbeddingModelType

Selects the embedding model: a named preset, a specific fastembed model, or a custom ONNX model.

**Static Methods:**

```python
@staticmethod
def preset(name: str) -> EmbeddingModelType
```

Use a preset configuration (recommended for most use cases). Available presets: balanced, compact, large.

```python
@staticmethod
def fastembed(model: str, dimensions: int) -> EmbeddingModelType
```

Use a specific fastembed model by name.

```python
@staticmethod
def custom(model_id: str, dimensions: int) -> EmbeddingModelType
```

Use a custom ONNX model from HuggingFace (e.g., sentence-transformers/\*).

**Example:**

```python
from kreuzberg import EmbeddingModelType, list_embedding_presets

# Using the balanced preset (recommended for general use)
model = EmbeddingModelType.preset("balanced")

# Using a specific fastembed model
model = EmbeddingModelType.fastembed(model="BAAI/bge-small-en-v1.5", dimensions=384)

# Using a custom HuggingFace model
model = EmbeddingModelType.custom(
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    dimensions=384
)

# Listing available presets
presets = list_embedding_presets()
print(f"Available presets: {presets}")
```

### TokenReductionConfig

Configuration for reducing the token count of extracted content, which lowers costs when working with LLM APIs.

**Attributes:**

| Field                      | Type | Default | Description                                                                                      |
| -------------------------- | ---- | ------- | ------------------------------------------------------------------------------------------------ |
| `mode`                     | str  | "off"   | Token reduction mode: "off", "light", "moderate", "aggressive", or "maximum"                     |
| `preserve_important_words` | bool | True    | Preserve capitalized words, technical terms, and proper nouns even in aggressive reduction modes |

**Example:**

```python
from kreuzberg import ExtractionConfig, TokenReductionConfig

# Moderate token reduction
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="moderate", preserve_important_words=True)
)

# Maximum reduction for large batches
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="maximum", preserve_important_words=True)
)

# No reduction (default)
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="off")
)
```

### LanguageDetectionConfig

Configuration for detecting document language(s).

**Attributes:**

| Field             | Type  | Default | Description                                                                                         |
| ----------------- | ----- | ------- | --------------------------------------------------------------------------------------------------- |
| `enabled`         | bool  | True    | Enable language detection for extracted content                                                     |
| `min_confidence`  | float | 0.8     | Minimum confidence threshold (0.0-1.0) for language detection                                       |
| `detect_multiple` | bool  | False   | Detect multiple languages in the document. When False, only the most confident language is returned |

**Example:**

```python
from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file_sync

# Basic language detection
config = ExtractionConfig(language_detection=LanguageDetectionConfig())

# Detect multiple languages with lower confidence threshold
config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(detect_multiple=True, min_confidence=0.6)
)

# Access detected languages in result
result = extract_file_sync("multilingual.pdf", config=config)
print(f"Languages: {result.detected_languages}")
```
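
How `min_confidence` and `detect_multiple` combine can be sketched as a filter over scored candidates. This is an illustration of the documented semantics only: detection itself runs inside the library, and `select_languages` plus the candidate scores below are made up for the example:

```python
def select_languages(candidates, min_confidence=0.8, detect_multiple=False):
    """Pick language codes from (code, confidence) pairs, most confident first."""
    kept = sorted(
        (pair for pair in candidates if pair[1] >= min_confidence),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if not detect_multiple:
        kept = kept[:1]  # only the single most confident language
    return [code for code, _confidence in kept]


# Illustrative scores, not real detector output.
candidates = [("eng", 0.92), ("deu", 0.65), ("fra", 0.85)]
print(select_languages(candidates))
print(select_languages(candidates, min_confidence=0.6, detect_multiple=True))
```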

### KeywordConfig

Keyword extraction configuration.

**Attributes:**

| Field          | Type               | Default | Description                                                                   |
| -------------- | ------------------ | ------- | ----------------------------------------------------------------------------- |
| `algorithm`    | KeywordAlgorithm   | -       | Keyword extraction algorithm (KeywordAlgorithm.Yake or KeywordAlgorithm.Rake) |
| `max_keywords` | int                | 10      | Maximum number of keywords to extract                                         |
| `min_score`    | float              | 0.0     | Minimum score threshold                                                       |
| `ngram_range`  | tuple[int, int]    | (1, 3)  | N-gram range for keyword extraction                                           |
| `language`     | str \| None        | "en"    | Optional language hint                                                        |
| `yake_params`  | YakeParams \| None | None    | YAKE-specific tuning parameters                                               |
| `rake_params`  | RakeParams \| None | None    | RAKE-specific tuning parameters                                               |
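
To illustrate what `ngram_range` controls, the sketch below enumerates candidate phrases of each length in the range. It assumes the range is inclusive at both ends (a common convention, not confirmed here), and `ngrams` is a hypothetical helper rather than the library's keyword extractor:

```python
def ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous n-grams for n in the inclusive ngram_range."""
    low, high = ngram_range
    result = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result


for phrase in ngrams(["optical", "character", "recognition"]):
    print(phrase)
```

With the default `(1, 3)`, single words, two-word phrases, and three-word phrases are all candidates; a narrower range such as `(1, 1)` restricts candidates to single words.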

### PageConfig

Page extraction and tracking configuration.

**Attributes:**

| Field                 | Type | Default                                | Description                                  |
| --------------------- | ---- | -------------------------------------- | -------------------------------------------- |
| `extract_pages`       | bool | False                                  | Enable page tracking and per-page extraction |
| `insert_page_markers` | bool | False                                  | Insert page markers into content             |
| `marker_format`       | str  | "\\n\\n<!-- PAGE {page_num} -->\\n\\n" | Marker template containing {page_num}        |

**Example:**

```python
from kreuzberg import ExtractionConfig, PageConfig

config = ExtractionConfig(pages=PageConfig(extract_pages=True))
```
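
When `insert_page_markers` is enabled, downstream code may want to split the content back into pages. A sketch under two assumptions: the `{page_num}` placeholder is filled by plain substitution, and each marker precedes its page's text. `render_marker` and `split_on_markers` are hypothetical helpers that only handle the default marker format:

```python
import re

DEFAULT_MARKER = "\n\n<!-- PAGE {page_num} -->\n\n"


def render_marker(page_num: int, marker_format: str = DEFAULT_MARKER) -> str:
    """Fill the {page_num} placeholder in a marker template."""
    return marker_format.replace("{page_num}", str(page_num))


def split_on_markers(content: str) -> dict[int, str]:
    """Map page numbers to the text that follows each default-format marker."""
    parts = re.split(r"\s*<!-- PAGE (\d+) -->\s*", content)
    # parts alternates: [text before first marker, num, text, num, text, ...]
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}


content = render_marker(1) + "first page" + render_marker(2) + "second page"
print(split_on_markers(content))
```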

### PostProcessorConfig

Configuration for post-processors in the extraction pipeline.

**Attributes:**

| Field                 | Type              | Default | Description                                                 |
| --------------------- | ----------------- | ------- | ----------------------------------------------------------- |
| `enabled`             | bool              | True    | Enable post-processors in the extraction pipeline           |
| `enabled_processors`  | list[str] \| None | None    | Whitelist of processor names to run. None = run all enabled |
| `disabled_processors` | list[str] \| None | None    | Blacklist of processor names to skip. None = none disabled  |

**Example:**

```python
from kreuzberg import ExtractionConfig, PostProcessorConfig

# Basic post-processing with defaults
config = ExtractionConfig(postprocessor=PostProcessorConfig())

# Enable only specific processors
config = ExtractionConfig(
    postprocessor=PostProcessorConfig(
        enabled=True,
        enabled_processors=["normalize_whitespace", "fix_encoding"]
    )
)

# Disable specific processors
config = ExtractionConfig(
    postprocessor=PostProcessorConfig(
        enabled=True,
        disabled_processors=["experimental_cleanup"]
    )
)

# Disable all post-processing
config = ExtractionConfig(postprocessor=PostProcessorConfig(enabled=False))
```
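
The three fields can be read as a filter over the registered processors. A sketch of one plausible resolution order; that the blacklist wins when a name appears in both lists is an assumption, and `resolve_processors` is a hypothetical helper:

```python
def resolve_processors(
    available,
    enabled=True,
    enabled_processors=None,
    disabled_processors=None,
):
    """Return the processor names that would run, in their registered order."""
    if not enabled:
        return []  # post-processing switched off entirely
    allowed = set(available) if enabled_processors is None else set(enabled_processors)
    blocked = set(disabled_processors or [])
    return [name for name in available if name in allowed and name not in blocked]


available = ["normalize_whitespace", "fix_encoding", "experimental_cleanup"]
print(resolve_processors(available, disabled_processors=["experimental_cleanup"]))
```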

### ImagePreprocessingConfig

Configuration for preprocessing images before OCR. This is NOT for extracting images from documents.

**Attributes:**

| Field                 | Type | Default | Description                                       |
| --------------------- | ---- | ------- | ------------------------------------------------- |
| `target_dpi`          | int  | 300     | Target DPI for image normalization before OCR     |
| `auto_rotate`         | bool | True    | Automatically detect and correct image rotation   |
| `deskew`              | bool | True    | Correct skewed images to improve OCR accuracy     |
| `denoise`             | bool | False   | Apply denoising filters to reduce noise in images |
| `contrast_enhance`    | bool | False   | Enhance contrast to improve text readability      |
| `binarization_method` | str  | "otsu"  | Method for converting images to black and white   |
| `invert_colors`       | bool | False   | Invert colors (white text on black background)    |

**Example:**

```python
from kreuzberg import TesseractConfig, ImagePreprocessingConfig

# Basic preprocessing for OCR
config = TesseractConfig(preprocessing=ImagePreprocessingConfig())

# Aggressive preprocessing for low-quality scans
config = TesseractConfig(
    preprocessing=ImagePreprocessingConfig(
        target_dpi=300,
        denoise=True,
        contrast_enhance=True,
        auto_rotate=True,
        deskew=True
    )
)
```

## ExtractionResult

Result object returned by extraction functions.

**Attributes:**

| Field                 | Type                           | Description                                                                      |
| --------------------- | ------------------------------ | -------------------------------------------------------------------------------- |
| `content`             | str                            | Main extracted text content in the specified output_format                       |
| `mime_type`           | str                            | MIME type of the processed document                                              |
| `metadata`            | Metadata                       | Extracted document metadata (title, author, created_at, format_type, etc.)       |
| `tables`              | list[ExtractedTable]           | Extracted tables from the document                                               |
| `detected_languages`  | list[str] \| None              | Detected language codes (e.g., ["en", "de"]) if language detection is enabled    |
| `chunks`              | list[Chunk] \| None            | Text chunks if chunking is enabled (each chunk has content, embedding, metadata) |
| `images`              | list[ExtractedImage] \| None   | Extracted images if image extraction is enabled                                  |
| `pages`               | list[PageContent] \| None      | Per-page content and metadata if page extraction is enabled                      |
| `elements`            | list[Element] \| None          | Semantic elements if result_format="element_based"                               |
| `output_format`       | str \| None                    | Format of the content field (plain, markdown, djot, html)                        |
| `result_format`       | str \| None                    | Result format used (unified or element_based)                                    |
| `extracted_keywords`  | list[ExtractedKeyword] \| None | Extracted keywords with relevance scores if keyword extraction enabled           |
| `quality_score`       | float \| None                  | Overall quality score for the extraction result (0.0-1.0)                        |
| `processing_warnings` | list[ProcessingWarning]        | Non-fatal warnings encountered during extraction pipeline                        |

**Methods:**

```python
def get_page_count(self) -> int
```

Get the total number of pages in the document.

```python
def get_chunk_count(self) -> int
```

Get the total number of chunks if chunking is enabled.

```python
def get_detected_language(self) -> str | None
```

Get the most confident detected language code.

```python
def get_metadata_field(self, field_name: str) -> Any | None
```

Get a specific metadata field by name.

**Example:**

```python
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=512),
    output_format="markdown"
)
result = extract_file_sync("document.pdf", config=config)

print(f"Content preview: {result.content[:200]}")
print(f"MIME type: {result.mime_type}")
print(f"Page count: {result.get_page_count()}")
print(f"Chunk count: {result.get_chunk_count()}")
print(f"Detected language: {result.get_detected_language()}")

if result.tables:
    print(f"Found {len(result.tables)} tables")

if result.chunks:
    first_chunk = result.chunks[0]
    print(f"First chunk: {first_chunk.content[:100]}")
    if first_chunk.embedding:
        print(f"Embedding dimensions: {len(first_chunk.embedding)}")
```

## Error Classes

All exceptions inherit from `KreuzbergError`, the base exception class.

### KreuzbergError

Base exception class for all Kreuzberg errors.

```python
class KreuzbergError(Exception):
    """Base exception for all Kreuzberg errors."""
```

### ParsingError

Raised when document parsing fails.

```python
class ParsingError(KreuzbergError):
    """Document parsing failed (corrupt, malformed, etc.)."""
```

### OCRError

Raised when OCR processing fails.

```python
class OCRError(KreuzbergError):
    """OCR operation failed."""
```

### ValidationError

Raised when validation fails.

```python
class ValidationError(KreuzbergError):
    """Validation failed (invalid parameters, constraints, format mismatches)."""
```

### MissingDependencyError

Raised when required dependencies are not available.

```python
class MissingDependencyError(KreuzbergError):
    """Required dependency not available (easyocr, paddleocr, tesseract, etc.)."""

    @staticmethod
    def create_for_package(dependency_group: str, functionality: str, package_name: str) -> MissingDependencyError
```

**Example:**

```python
from kreuzberg import extract_file_sync, MissingDependencyError, OCRError, ParsingError

try:
    result = extract_file_sync("document.pdf")
except ParsingError as e:
    print(f"Failed to parse document: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
```
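
Because every exception derives from `KreuzbergError`, a handler for the base class can act as a catch-all after the more specific handlers. The following is a minimal sketch of that ordering. `run_extraction` and `failing_extract` are hypothetical names, and the fallback classes exist only so the snippet runs when kreuzberg is not installed.

```python
try:
    from kreuzberg import KreuzbergError, MissingDependencyError, ParsingError
except ImportError:
    # Stand-ins mirroring the hierarchy documented above (illustration only).
    class KreuzbergError(Exception):
        pass

    class ParsingError(KreuzbergError):
        pass

    class MissingDependencyError(KreuzbergError):
        pass


def run_extraction(extract):
    """Call `extract()` and return (result, error_message)."""
    try:
        return extract(), None
    except MissingDependencyError as exc:
        # Most specific handler first: this failure is actionable by the user.
        return None, f"install the missing dependency: {exc}"
    except KreuzbergError as exc:
        # The base class catches ParsingError, OCRError, ValidationError, etc.
        return None, f"extraction failed: {exc}"


def failing_extract():
    # Simulates an extraction call that hits a corrupt document.
    raise ParsingError("corrupt file")


result, message = run_extraction(failing_extract)
print(message)
```

Handlers are tried top to bottom, so placing `KreuzbergError` first would shadow the specific cases.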

## Utility Functions

### MIME Type Detection

```python
def detect_mime_type(data: bytes | bytearray) -> str
```

Detect MIME type from file bytes using magic number detection.

**Parameters:**

- `data` (bytes | bytearray): File content as bytes or bytearray

**Returns:** Detected MIME type (e.g., "application/pdf", "image/png")

```python
def detect_mime_type_from_path(path: str | Path) -> str
```

Read the file at the given path and detect its MIME type.

**Parameters:**

- `path` (str | Path): Path to the file

**Returns:** Detected MIME type

**Raises:**

- `OSError`: If file cannot be read (file not found, permission denied, etc.)
- `RuntimeError`: If MIME type detection fails

**Example:**

```python
from kreuzberg import detect_mime_type, detect_mime_type_from_path

# From bytes
pdf_bytes = b"%PDF-1.4\n"
mime_type = detect_mime_type(pdf_bytes)

# From path
mime_type = detect_mime_type_from_path("document.pdf")
```
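
For readers unfamiliar with magic numbers: many file formats begin with a fixed byte signature, and detection amounts to comparing the first bytes of the data against known signatures. The sketch below illustrates the idea with four well-known signatures. It is not Kreuzberg's detection code, and `sniff_mime_type` is a hypothetical name.

```python
from __future__ import annotations

# (signature, MIME type) pairs for a few common formats.
MAGIC_NUMBERS = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    # ZIP is also the container for DOCX/XLSX/PPTX, so a real detector
    # must inspect the archive contents to tell those apart.
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mime_type(data: bytes | bytearray) -> str | None:
    """Return the MIME type whose signature prefixes `data`, else None."""
    for signature, mime_type in MAGIC_NUMBERS:
        if data.startswith(signature):
            return mime_type
    return None


print(sniff_mime_type(b"%PDF-1.4\n"))
```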

### MIME Type Validation

```python
def validate_mime_type(mime_type: str) -> str
```

Validate a MIME type string and return the canonical form.

```python
def get_extensions_for_mime(mime_type: str) -> list[str]
```

Get file extensions associated with a MIME type.

**Example:**

```python
from kreuzberg import validate_mime_type, get_extensions_for_mime

canonical = validate_mime_type("application/pdf")
extensions = get_extensions_for_mime("application/pdf")  # Returns ["pdf"]
```

### Configuration Loading

```python
def load_extraction_config_from_file(path: str | Path) -> ExtractionConfig
```

Load extraction configuration from a specific file.

**Parameters:**

- `path` (str | Path): Path to the configuration file (.toml, .yaml, or .json)

**Returns:** ExtractionConfig parsed from the file

**Raises:**

- `FileNotFoundError`: If the configuration file does not exist
- `RuntimeError`: If the file cannot be read or parsed
- `ValueError`: If the file format is invalid or unsupported

```python
def discover_extraction_config() -> ExtractionConfig | None
```

Discover extraction configuration from the environment (deprecated).

Attempts to locate a Kreuzberg configuration file using:

1. KREUZBERG_CONFIG_PATH environment variable
2. Search for kreuzberg.toml, kreuzberg.yaml, or kreuzberg.json in current and parent directories

**Returns:** ExtractionConfig if found, None otherwise

**Note:** Deprecated in favor of `load_extraction_config_from_file` for more predictable behavior.

**Example:**

```python
from kreuzberg import load_extraction_config_from_file, extract_file_sync

# Load from specific file
config = load_extraction_config_from_file("kreuzberg.toml")
result = extract_file_sync("document.pdf", config=config)

# Auto-discover configuration (deprecated)
import os
from kreuzberg import discover_extraction_config

os.environ["KREUZBERG_CONFIG_PATH"] = "config/kreuzberg.yaml"
config = discover_extraction_config()  # None if no configuration file is found
```
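
The discovery order described above (environment variable first, then an upward directory search) can be sketched with the standard library alone. This is an illustration of the documented lookup, not the actual implementation. `find_config_path` is a hypothetical helper, and the order in which the three file names are tried within one directory is an assumption.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path
from typing import Mapping

CONFIG_NAMES = ("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json")


def find_config_path(start: Path, env: Mapping[str, str] = os.environ) -> Path | None:
    """Hypothetical lookup: env var first, then `start` and its parents."""
    explicit = env.get("KREUZBERG_CONFIG_PATH")
    if explicit:
        return Path(explicit)
    start = start.resolve()
    for directory in (start, *start.parents):
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None


# Demo: a config file two levels above the working directory is still found.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "kreuzberg.toml").write_text("use_cache = true\n")  # placeholder content
    nested = root / "a" / "b"
    nested.mkdir(parents=True)
    found = find_config_path(nested, env={})
    found_name = found.name
    found_in_root = found.parent == root

print(found_name, found_in_root)
```

Passing the resulting path to `load_extraction_config_from_file` keeps the lookup explicit, which is the behavior the deprecation note recommends.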

## Plugin System

### Registering Post-Processors

```python
def register_post_processor(processor: Any) -> None
```

Register a Python PostProcessor with the Rust core. Once registered, the processor will be called automatically after extraction to enrich results.

**Required Methods:**

- `name() -> str`: Return processor name (must be non-empty)
- `process(result: ExtractionResult) -> ExtractionResult`: Process and enrich the extraction result
- `processing_stage() -> str`: Return "early", "middle", or "late"

**Optional Methods:**

- `initialize() -> None`: Called when processor is registered
- `shutdown() -> None`: Called when processor is unregistered

**Example:**

```python
from kreuzberg import register_post_processor, ExtractionResult

class EntityExtractor:
    def name(self) -> str:
        return "entity_extraction"

    def processing_stage(self) -> str:
        return "early"

    def process(self, result: ExtractionResult) -> ExtractionResult:
        entities = {"PERSON": ["John Doe"], "ORG": ["Microsoft"]}
        result.metadata["entities"] = entities
        return result

register_post_processor(EntityExtractor())
```
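
The optional lifecycle methods are useful when a processor needs setup or teardown. The sketch below implements all five methods and exercises them by hand. `WordCountProcessor` is a hypothetical processor, the `SimpleNamespace` object is a stand-in for `ExtractionResult`, and `metadata` is assumed to be dict-like as in the example above. In real use, `register_post_processor(WordCountProcessor())` would replace the manual calls.

```python
import re
from types import SimpleNamespace


class WordCountProcessor:
    """Hypothetical post-processor that records a word count in the metadata."""

    def __init__(self) -> None:
        self._word_pattern = None

    def name(self) -> str:
        return "word_count"

    def processing_stage(self) -> str:
        return "late"

    def initialize(self) -> None:
        # Documented as being called when the processor is registered.
        self._word_pattern = re.compile(r"\w+")

    def shutdown(self) -> None:
        # Documented as being called when the processor is unregistered.
        self._word_pattern = None

    def process(self, result):
        words = self._word_pattern.findall(result.content)
        result.metadata["word_count"] = len(words)
        return result


# Simulate the lifecycle manually with a stand-in result object.
processor = WordCountProcessor()
processor.initialize()
fake_result = SimpleNamespace(content="Kreuzberg extracts text from documents.", metadata={})
processed = processor.process(fake_result)
processor.shutdown()
print(processed.metadata)
```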

### Registering OCR Backends

```python
def register_ocr_backend(backend: Any) -> None
```

Register a Python OCR backend with the Rust core.

**Required Methods:**

- `name() -> str`: Return backend name (must be non-empty)
- `supported_languages() -> list[str]`: Return list of supported language codes
- `process_image(image_bytes: bytes, language: str) -> OcrResult`: Process image and return OCR result
- `process_file(path: str, language: str) -> OcrResult`: Process file and return OCR result
- `initialize() -> None`: Called when backend is registered
- `shutdown() -> None`: Called when backend is unregistered
- `version() -> str`: Return backend version string

**Example:**

```python
from kreuzberg import register_ocr_backend

class MyOcrBackend:
    def name(self) -> str:
        return "my-ocr"

    def supported_languages(self) -> list[str]:
        return ["eng", "deu", "fra"]

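    def process_image(self, image_bytes: bytes, language: str) -> dict:
        return {
            "content": "extracted text",
            "metadata": {"confidence": 0.95},
            "tables": []
        }

    def process_file(self, path: str, language: str) -> dict:
        # Read the file and reuse the in-memory implementation
        with open(path, "rb") as f:
            return self.process_image(f.read(), language)

    def initialize(self) -> None:
        pass  # Load models or other resources here

    def shutdown(self) -> None:
        pass  # Release resources here

    def version(self) -> str:
        return "1.0.0"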

register_ocr_backend(MyOcrBackend())
```

### Registering Validators

```python
def register_validator(validator: Any) -> None
```

Register a Python Validator with the Rust core. Validators are called automatically after extraction to validate results.

**Required Methods:**

- `name() -> str`: Return validator name (must be non-empty)
- `validate(result: ExtractionResult) -> None`: Validate the extraction result (raise error to fail)

**Optional Methods:**

- `should_validate(result: ExtractionResult) -> bool`: Check if validator should run (defaults to True)
- `priority() -> int`: Return priority (defaults to 50, higher runs first)

**Example:**

```python
from kreuzberg import register_validator, ValidationError, ExtractionResult

class MinLengthValidator:
    def name(self) -> str:
        return "min_length_validator"

    def priority(self) -> int:
        return 100

    def validate(self, result: ExtractionResult) -> None:
        if len(result.content) < 100:
            raise ValidationError("Content too short")

register_validator(MinLengthValidator())
```
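
The optional methods determine whether and when a validator runs. The sketch below models the documented rules (default priority 50, higher priority first, `should_validate` defaulting to True) with a hypothetical `run_validators` dispatcher. It is an illustration only: the real dispatch happens in the Rust core, the `SimpleNamespace` objects stand in for `ExtractionResult`, and the fallback `ValidationError` exists so the snippet runs without kreuzberg installed.

```python
from types import SimpleNamespace

try:
    from kreuzberg import ValidationError
except ImportError:
    class ValidationError(Exception):
        """Stand-in so the sketch runs without kreuzberg installed."""


class NonEmptyValidator:
    # No priority() or should_validate(): the documented defaults apply.
    def name(self) -> str:
        return "non_empty"

    def validate(self, result) -> None:
        if not result.content.strip():
            raise ValidationError("Content is empty")


class PdfTableValidator:
    def name(self) -> str:
        return "pdf_table_validator"

    def priority(self) -> int:
        return 10  # Lower than the default of 50, so it runs later

    def should_validate(self, result) -> bool:
        return result.mime_type == "application/pdf"

    def validate(self, result) -> None:
        if not result.tables:
            raise ValidationError("Expected at least one table in PDF")


def run_validators(validators, result) -> list[str]:
    """Hypothetical dispatcher following the documented ordering rules."""
    def priority_of(validator) -> int:
        return validator.priority() if hasattr(validator, "priority") else 50

    ran = []
    for validator in sorted(validators, key=priority_of, reverse=True):
        should_validate = getattr(validator, "should_validate", lambda _result: True)
        if should_validate(result):
            validator.validate(result)
            ran.append(validator.name())
    return ran


text_result = SimpleNamespace(content="hello", mime_type="text/plain", tables=[])
pdf_result = SimpleNamespace(content="hello", mime_type="application/pdf", tables=[["cell"]])
ran_for_text = run_validators([PdfTableValidator(), NonEmptyValidator()], text_result)
ran_for_pdf = run_validators([PdfTableValidator(), NonEmptyValidator()], pdf_result)
print(ran_for_text, ran_for_pdf)
```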

### Plugin Management Functions

```python
def list_post_processors() -> list[str]
```

List names of all registered post-processors.

```python
def list_validators() -> list[str]
```

List names of all registered validators.

```python
def list_ocr_backends() -> list[str]
```

List names of all available OCR backends.

```python
def unregister_post_processor(name: str) -> None
```

Unregister a post-processor by name.

```python
def unregister_validator(name: str) -> None
```

Unregister a validator by name.

```python
def unregister_ocr_backend(name: str) -> None
```

Unregister an OCR backend by name.

```python
def clear_post_processors() -> None
```

Clear all registered post-processors.

```python
def clear_validators() -> None
```

Clear all registered validators.

```python
def clear_ocr_backends() -> None
```

Clear all registered OCR backends.

## Format Enums

### OutputFormat

Output format for extraction results.

```python
class OutputFormat(str, Enum):
    PLAIN = "plain"        # Plain text format
    MARKDOWN = "markdown"  # Markdown format
    DJOT = "djot"          # Djot lightweight markup format
    HTML = "html"          # HTML format
```

### ResultFormat

Result format controlling extraction output structure.

```python
class ResultFormat(str, Enum):
    UNIFIED = "unified"              # All content in `content` field
    ELEMENT_BASED = "element_based"  # Unstructured-compatible output with semantic elements
```
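
Because both enums subclass `str`, their members compare equal to the plain string values used elsewhere in this reference (for example `output_format="markdown"`). The sketch below uses a local copy of `OutputFormat`, reproduced from the definition above so that it runs standalone, to show the standard-library behavior of such mixed-in enums.

```python
from enum import Enum


# Local copy of the definition above, so the snippet runs without kreuzberg.
class OutputFormat(str, Enum):
    PLAIN = "plain"
    MARKDOWN = "markdown"
    DJOT = "djot"
    HTML = "html"


# Members compare equal to their string values.
is_markdown = OutputFormat.MARKDOWN == "markdown"

# A raw string can be converted to a member by value lookup.
parsed = OutputFormat("html")

# Unknown values raise ValueError, which gives cheap input validation.
try:
    OutputFormat("pdf")
    rejected = False
except ValueError:
    rejected = True

print(is_markdown, parsed is OutputFormat.HTML, rejected)
```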

## Error Handling

### Error Code Functions

```python
def get_last_error_code() -> int
```

Get the last error code from the FFI layer.

**Returns:**

- 0 (SUCCESS): No error occurred
- 1 (GENERIC_ERROR): Generic unspecified error
- 2 (PANIC): A panic occurred in the Rust core
- 3 (INVALID_ARGUMENT): Invalid argument provided
- 4 (IO_ERROR): I/O operation failed
- 5 (PARSING_ERROR): Document parsing failed
- 6 (OCR_ERROR): OCR operation failed
- 7 (MISSING_DEPENDENCY): Required dependency not available

```python
def get_error_details() -> dict[str, Any]
```

Get detailed error information from the FFI layer.

**Returns:** dict with keys:

- `message` (str): Human-readable error message
- `error_code` (int): Numeric error code (0-7)
- `error_type` (str): Error type name (e.g., "validation", "ocr")
- `source_file` (str | None): Source file path if available
- `source_function` (str | None): Function name if available
- `source_line` (int): Line number (0 if unknown)
- `context_info` (str | None): Additional context if available
- `is_panic` (bool): Whether error came from a panic

```python
def classify_error(message: str) -> int
```

Classify an error message into a Kreuzberg error code.

**Parameters:**

- `message` (str): The error message to classify

**Returns:** int error code (0-7) representing the classification

```python
def error_code_name(code: int) -> str
```

Get the human-readable name of an error code.

**Parameters:**

- `code` (int): Numeric error code (0-7)

**Returns:** Human-readable error code name (e.g., "validation", "ocr")

**Example:**

```python
from kreuzberg import (
    classify_error,
    error_code_name,
    extract_file_sync,
    get_error_details,
    get_last_error_code,
)

try:
    result = extract_file_sync("document.pdf")
except Exception as e:
    code = get_last_error_code()
    if code:
        print(f"Error code: {code} ({error_code_name(code)})")

    details = get_error_details()
    print(f"Error: {details['message']}")
    print(f"Type: {details['error_type']}")

    classified = classify_error(str(e))
    print(f"Classified as: {error_code_name(classified)}")
```
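
When error codes are logged or compared in application code, named constants are easier to read than bare integers. The sketch below mirrors the code table above in a local `IntEnum`. `ErrorCode` and `describe_code` are hypothetical names defined here for illustration and are not exported by kreuzberg.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    """Local mirror of the error codes listed for get_last_error_code()."""

    SUCCESS = 0
    GENERIC_ERROR = 1
    PANIC = 2
    INVALID_ARGUMENT = 3
    IO_ERROR = 4
    PARSING_ERROR = 5
    OCR_ERROR = 6
    MISSING_DEPENDENCY = 7


def describe_code(code: int) -> str:
    """Return the symbolic name for a code, tolerating unknown values."""
    try:
        return ErrorCode(code).name
    except ValueError:
        return f"UNKNOWN({code})"


print(describe_code(5))
print(describe_code(42))
```

Since `IntEnum` members are integers, a value returned by `get_last_error_code()` can be compared directly, for example `code == ErrorCode.OCR_ERROR`.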

## Validation Functions

### Parameter Validation

```python
def validate_chunking_params(max_chars: int, max_overlap: int) -> bool
```

Validate chunking parameters.

```python
def validate_confidence(confidence: float) -> bool
```

Validate confidence value (0.0-1.0).

```python
def validate_dpi(dpi: int) -> bool
```

Validate DPI value.

```python
def validate_tesseract_psm(psm: int) -> bool
```

Validate Tesseract Page Segmentation Mode.

```python
def validate_tesseract_oem(oem: int) -> bool
```

Validate Tesseract OCR Engine Mode.

```python
def validate_ocr_backend(backend: str) -> bool
```

Validate OCR backend name.

```python
def validate_language_code(code: str) -> bool
```

Validate language code format.

```python
def validate_token_reduction_level(level: str) -> bool
```

Validate token reduction level.

```python
def validate_output_format(output_format: str) -> bool
```

Validate output format string.

```python
def validate_binarization_method(method: str) -> bool
```

Validate binarization method for image preprocessing.
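
To make the boolean-returning style concrete, the sketch below shows range checks of the kind these validators presumably perform. The functions are local stand-ins prefixed with `sketch_`, not the library functions. The confidence range comes from the description above, while the PSM range (0 to 13) and OEM range (0 to 3) are the values accepted by Tesseract itself and are assumed, not confirmed, to match what Kreuzberg enforces.

```python
def sketch_validate_confidence(confidence: float) -> bool:
    # Documented range: 0.0 to 1.0 inclusive.
    return 0.0 <= confidence <= 1.0


def sketch_validate_tesseract_psm(psm: int) -> bool:
    # Tesseract defines page segmentation modes 0 through 13.
    return 0 <= psm <= 13


def sketch_validate_tesseract_oem(oem: int) -> bool:
    # Tesseract defines OCR engine modes 0 through 3.
    return 0 <= oem <= 3


def collect_problems(psm: int, oem: int, confidence: float) -> list[str]:
    """Gather all invalid settings before building a configuration."""
    problems = []
    if not sketch_validate_tesseract_psm(psm):
        problems.append(f"invalid psm: {psm}")
    if not sketch_validate_tesseract_oem(oem):
        problems.append(f"invalid oem: {oem}")
    if not sketch_validate_confidence(confidence):
        problems.append(f"invalid confidence: {confidence}")
    return problems


print(collect_problems(psm=6, oem=1, confidence=0.8))
print(collect_problems(psm=14, oem=1, confidence=1.5))
```

Checking every parameter up front reports all problems at once, where a failing constructor would stop at the first.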

### Getting Valid Values

```python
def get_valid_binarization_methods() -> list[str]
```

Get list of valid binarization methods.

```python
def get_valid_language_codes() -> list[str]
```

Get list of valid language codes.

```python
def get_valid_ocr_backends() -> list[str]
```

Get list of valid OCR backend names.

```python
def get_valid_token_reduction_levels() -> list[str]
```

Get list of valid token reduction levels.

```python
def list_embedding_presets() -> list[str]
```

List available embedding presets.

```python
def get_embedding_preset(name: str) -> EmbeddingPreset | None
```

Get details about a specific embedding preset.

**Example:**

```python
from kreuzberg import (
    validate_dpi,
    get_valid_binarization_methods,
    list_embedding_presets,
    get_embedding_preset
)

# Validate parameters
if not validate_dpi(300):
    print("Invalid DPI")

# List valid values
binarization_methods = get_valid_binarization_methods()
presets = list_embedding_presets()

# Get preset details
preset = get_embedding_preset("balanced")
if preset:
    print(f"Balanced preset: {preset.description}")
    print(f"Dimensions: {preset.dimensions}")
    print(f"Recommended chunk size: {preset.chunk_size}")
```

## Configuration Utilities

### Config Manipulation

```python
def config_to_json(config: ExtractionConfig) -> str
```

Convert ExtractionConfig to JSON string.

```python
def config_get_field(config: ExtractionConfig, field_name: str) -> Any | None
```

Get a specific field value from ExtractionConfig.

```python
def config_merge(base: ExtractionConfig, override: ExtractionConfig) -> None
```

Merge override config into base config (mutates base).

**Example:**

```python
from kreuzberg import ExtractionConfig, config_to_json, config_get_field, config_merge

config = ExtractionConfig(use_cache=True, enable_quality_processing=False)

# Convert to JSON
json_str = config_to_json(config)
print(json_str)

# Get field
use_cache = config_get_field(config, "use_cache")
print(f"use_cache: {use_cache}")

# Merge configs
override = ExtractionConfig(use_cache=False)
config_merge(config, override)
```

## Version Information

```python
__version__: str
```

Current version of the kreuzberg package.

**Example:**

```python
from kreuzberg import __version__

print(f"Kreuzberg version: {__version__}")
```