2026-06-01 23:40:55 +02:00
# Kreuzberg Python API Reference
Reference documentation for the Kreuzberg Python API. The extraction logic is implemented in Rust; the Python layer adds OCR backends (EasyOCR, PaddleOCR) and support for custom post-processors.
## Extraction Functions
### Synchronous File Extraction
```python
def extract_file_sync(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
```
Extract content from a file (synchronous).
**Parameters:**
- `file_path` (str | Path): Path to the file
- `mime_type` (str | None): Optional MIME type hint (auto-detected if None)
- `config` (ExtractionConfig | None): Extraction configuration (uses defaults if None)
- `easyocr_kwargs` (dict | None): EasyOCR initialization options (languages, use_gpu, beam_width, etc.)
- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options (lang, use_angle_cls, show_log, etc.)
**Returns:** ExtractionResult with content, metadata, and tables
**Example:**
```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig
# Basic usage
result = extract_file_sync("document.pdf")
# With Tesseract configuration
config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    )
)
result = extract_file_sync("invoice.pdf", config=config)
# With EasyOCR custom options
config = ExtractionConfig(ocr=OcrConfig(backend="easyocr", language="eng"))
result = extract_file_sync("scanned.pdf", config=config, easyocr_kwargs={"use_gpu": True})
```
### Asynchronous File Extraction
```python
async def extract_file(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
```
Extract content from a file (asynchronous). Same parameters and behavior as `extract_file_sync`.
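Because `extract_file` is a coroutine, it has to be awaited inside an event loop. The following is a minimal sketch of running several extractions concurrently with `asyncio.gather`. The helper name `extract_concurrently` and its `extract` parameter are illustrative and not part of Kreuzberg; the extraction function is passed in so the helper does not depend on the library.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


async def extract_concurrently(
    extract: Callable[[str], Awaitable[Any]],
    paths: Sequence[str],
) -> List[Any]:
    """Await one extraction coroutine per path; results keep the input order.

    Exceptions are returned in place of results rather than raised,
    so one failing file does not discard the other results.
    """
    return await asyncio.gather(
        *(extract(path) for path in paths),
        return_exceptions=True,
    )
```

With Kreuzberg installed, this could be called as `asyncio.run(extract_concurrently(extract_file, ["a.pdf", "b.pdf"]))`. For larger sets of files, `batch_extract_files` is the built-in alternative.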
### Synchronous Bytes Extraction
```python
def extract_bytes_sync(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
```
Extract content from bytes (synchronous).
**Parameters:**
- `data` (bytes | bytearray): File content as bytes or bytearray
- `mime_type` (str): MIME type of the data (required for format detection)
- `config` (ExtractionConfig | None): Extraction configuration
- `easyocr_kwargs` (dict | None): EasyOCR initialization options
- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options
**Returns:** ExtractionResult with content, metadata, and tables
### Asynchronous Bytes Extraction
```python
async def extract_bytes(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult
```
Extract content from bytes (asynchronous). Same parameters and behavior as `extract_bytes_sync`.
### Batch File Extraction
```python
async def batch_extract_files(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```
Extract content from multiple files in parallel (asynchronous).
**Parameters:**
- `paths` (list[str | Path]): List of file paths
- `config` (ExtractionConfig | None): Extraction configuration
- `easyocr_kwargs` (dict | None): EasyOCR initialization options
- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options
**Returns:** List of ExtractionResults (one per file)
### Batch File Extraction (Synchronous)
```python
def batch_extract_files_sync(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```
Extract content from multiple files in parallel (synchronous).
### Batch Bytes Extraction
```python
async def batch_extract_bytes(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```
Extract content from multiple byte arrays in parallel (asynchronous).
**Parameters:**
- `data_list` (list[bytes | bytearray]): List of file contents as bytes/bytearray
- `mime_types` (list[str]): List of MIME types (one per data item)
- `config` (ExtractionConfig | None): Extraction configuration
- `easyocr_kwargs` (dict | None): EasyOCR initialization options
- `paddleocr_kwargs` (dict | None): PaddleOCR initialization options
**Returns:** List of ExtractionResults (one per data item)
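Since `data_list` and `mime_types` are parallel lists, they need to be built together so that the entries stay aligned. The sketch below is one way to prepare both lists from files on disk using only the standard library. `load_batch_inputs` is a hypothetical helper, and guessing the MIME type from the file extension is an assumption made for illustration.

```python
import mimetypes
from pathlib import Path
from typing import List, Tuple


def load_batch_inputs(paths: List[Path]) -> Tuple[List[bytes], List[str]]:
    """Read each file and guess its MIME type from the file extension.

    Returns two lists of equal length, in the same order as `paths`.
    """
    data_list: List[bytes] = []
    mime_types: List[str] = []
    for path in paths:
        guessed, _encoding = mimetypes.guess_type(path.name)
        if guessed is None:
            raise ValueError(f"Cannot guess MIME type for {path}")
        data_list.append(path.read_bytes())
        mime_types.append(guessed)
    return data_list, mime_types
```

The two returned lists can then be passed as the `data_list` and `mime_types` arguments of `batch_extract_bytes` or `batch_extract_bytes_sync`.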
### Batch Bytes Extraction (Synchronous)
```python
def batch_extract_bytes_sync(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```
Extract content from multiple byte arrays in parallel (synchronous).
### Per-File Config in Batch Functions
As of v4.5.0, per-file configuration overrides are passed as an optional `file_configs` parameter on the unified batch functions:
```python
def batch_extract_files_sync(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    file_configs: list[FileExtractionConfig | None] | None = None,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]
```
The `file_configs` list must have the same length as `paths`. Each element is either a `FileExtractionConfig` override or `None` to use batch defaults. The same parameter is available on `batch_extract_files`, `batch_extract_bytes_sync`, and `batch_extract_bytes`.
> **Note:** The separate `batch_extract_files_with_configs_sync` / `batch_extract_files_with_configs` / `batch_extract_bytes_with_configs_sync` / `batch_extract_bytes_with_configs` functions have been removed in v4.5.0.
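The length requirement on `file_configs` is easiest to satisfy by deriving the list from `paths`. The sketch below assumes overrides are kept in a dictionary keyed by path. `align_file_configs` is a hypothetical helper, and the override values can be any object (in practice, `FileExtractionConfig` instances).

```python
from typing import Dict, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def align_file_configs(paths: Sequence[str], overrides: Dict[str, T]) -> List[Optional[T]]:
    """Return one entry per path: the override if one exists, otherwise None."""
    unknown = set(overrides) - set(paths)
    if unknown:
        # An override for a path outside the batch is almost certainly a typo.
        raise KeyError(f"Overrides given for paths not in the batch: {sorted(unknown)}")
    return [overrides.get(path) for path in paths]
```

The result always has the same length as `paths`, with `None` marking files that use the batch defaults.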
## Configuration Classes
### ExtractionConfig
Main extraction configuration for document processing. All attributes are optional and use sensible defaults when not specified.
**Attributes:**
| Field | Type | Default | Description |
| ---------------------------- | ------------------------------- | ------------- | ----------------------------------------------------------------------------------------- |
| `use_cache` | bool | True | Enable caching of extraction results to improve performance on repeated extractions |
| `enable_quality_processing` | bool | True | Enable quality post-processing to clean and normalize extracted text |
| `ocr` | OcrConfig \| None | None | OCR configuration for extracting text from images. None = OCR disabled |
| `force_ocr` | bool | False | Force OCR processing even for searchable PDFs that contain extractable text |
| `chunking` | ChunkingConfig \| None | None | Text chunking configuration for dividing content into manageable chunks. None = disabled |
| `images` | ImageExtractionConfig \| None | None | Image extraction configuration for extracting images FROM documents. None = no extraction |
| `pdf_options` | PdfConfig \| None | None | PDF-specific options like password handling and metadata extraction |
| `token_reduction` | TokenReductionConfig \| None | None | Token reduction configuration for reducing token count in extracted content |
| `language_detection` | LanguageDetectionConfig \| None | None | Language detection configuration for identifying document language(s) |
| `keywords` | KeywordConfig \| None | None | Keyword extraction configuration for identifying important terms and phrases |
| `postprocessor` | PostProcessorConfig \| None | None | Post-processor configuration for custom text processing |
| `max_concurrent_extractions` | int \| None | num_cpus \* 2 | Maximum concurrent extractions in batch operations |
| `html_options` | HtmlConversionOptions \| None | None | HTML conversion options for converting documents to markdown |
| `pages` | PageConfig \| None | None | Page extraction configuration for tracking page boundaries |
| `security_limits` | dict[str, int] \| None | None | Security limits configuration |
| `result_format` | str | "unified" | Result format: "unified" or "element_based" |
| `output_format` | str | "plain" | Output content format: "plain", "markdown", "djot", or "html" |
**Example:**
```python
from kreuzberg import ExtractionConfig, ChunkingConfig, OcrConfig
# Basic extraction with defaults
config = ExtractionConfig()
# Enable chunking with 512-char chunks and 100-char overlap
config = ExtractionConfig(chunking=ChunkingConfig(max_chars=512, max_overlap=100))
# Enable OCR with Tesseract
config = ExtractionConfig(ocr=OcrConfig(backend="tesseract", language="eng"))
# Multiple options
config = ExtractionConfig(
    use_cache=True,
    enable_quality_processing=True,
    output_format="markdown",
    result_format="unified",
)
```
### FileExtractionConfig
Per-file extraction overrides for batch operations. All fields optional (`None` = use batch default).
**Key fields:** `enable_quality_processing`, `ocr`, `force_ocr`, `chunking`, `images`, `pdf_options`, `token_reduction`, `language_detection`, `pages`, `keywords`, `postprocessor`, `html_options`, `result_format`, `output_format`, `include_document_structure`, `layout`.
Excluded (batch-level only): `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`.
```python
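from kreuzberg import FileExtractionConfig, OcrConfig
# Force German-language OCR for one file in the batch
per_file = FileExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(backend="tesseract", language="deu"),
)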
```
### OcrConfig
OCR configuration for extracting text from images.
**Attributes:**
| Field | Type | Default | Description |
| ------------------ | ----------------------- | ----------- | ----------------------------------------------------------------------------------------------------- |
| `backend` | str | "tesseract" | OCR backend: "tesseract", "easyocr", or "paddleocr" |
| `language` | str | "eng" | Language code (ISO 639-3 three-letter: "eng", "deu", "fra" or ISO 639-1 two-letter: "en", "de", "fr") |
| `tesseract_config` | TesseractConfig \| None | None | Tesseract-specific configuration (only used when backend="tesseract") |
**Example:**
```python
from kreuzberg import OcrConfig
# Tesseract with German language
config = OcrConfig(backend="tesseract", language="deu")
# EasyOCR backend
config = OcrConfig(backend="easyocr", language="eng")
# PaddleOCR backend
config = OcrConfig(backend="paddleocr", language="chi_sim")
```
### TesseractConfig
Tesseract-specific OCR settings for tuning recognition to particular document types and quality levels.
**Attributes:**
| Field | Type | Default | Description |
| ------------------------------------ | -------------------------------- | ---------- | ----------------------------------------------------------------------------------------- |
| `language` | str | "eng" | OCR language (ISO 639-3 three-letter code) |
| `psm` | int | 3 | Page Segmentation Mode: 0 (detection only), 3 (auto), 6 (uniform block), 11 (sparse text) |
| `output_format` | str | "markdown" | Output format for OCR results |
| `oem` | int | 3 | OCR Engine Mode: 0 (legacy), 1 (LSTM), 2 (both), 3 (auto) |
| `min_confidence` | float | 0.0 | Minimum confidence threshold (0.0-1.0) for accepting OCR results |
| `preprocessing` | ImagePreprocessingConfig \| None | None | Image preprocessing configuration before OCR |
| `enable_table_detection` | bool | True | Enable automatic table detection and extraction |
| `table_min_confidence` | float | 0.0 | Minimum confidence for table detection (0.0-1.0) |
| `table_column_threshold` | int | 50 | Minimum pixel width between columns |
| `table_row_threshold_ratio` | float | 0.5 | Minimum row height ratio |
| `use_cache` | bool | True | Cache OCR results for improved performance |
| `classify_use_pre_adapted_templates` | bool | True | Use pre-adapted character templates |
| `language_model_ngram_on` | bool | False | Enable language model n-gram processing |
| `tessedit_dont_blkrej_good_wds` | bool | True | Don't block-reject good words |
| `tessedit_dont_rowrej_good_wds` | bool | True | Don't row-reject good words |
| `tessedit_enable_dict_correction` | bool | True | Enable dictionary-based spelling correction |
| `tessedit_char_whitelist` | str | "" | Whitelist of characters to recognize (empty = all) |
| `tessedit_char_blacklist` | str | "" | Blacklist of characters to ignore |
| `tessedit_use_primary_params_model` | bool | True | Use primary parameters model |
| `textord_space_size_is_variable` | bool | True | Allow variable space sizes |
| `thresholding_method` | bool | False | Thresholding method for binarization |
**Example:**
```python
from kreuzberg import TesseractConfig, ImagePreprocessingConfig
# General document OCR
config = TesseractConfig(psm=3, oem=3)
# Invoice/form OCR with table detection
config = TesseractConfig(psm=6, oem=2, enable_table_detection=True, min_confidence=0.6)
# High-precision technical document OCR
config = TesseractConfig(
    psm=3,
    oem=2,
    preprocessing=ImagePreprocessingConfig(denoise=True, contrast_enhance=True, auto_rotate=True),
    min_confidence=0.7,
    tessedit_enable_dict_correction=True,
)
# Numeric-only OCR (for receipts, barcodes)
config = TesseractConfig(psm=6, tessedit_char_whitelist="0123456789.-,", min_confidence=0.8)
# Multiple language document
config = TesseractConfig(language="eng+fra+deu", psm=3, oem=2)
```
### ChunkingConfig
Text chunking configuration for dividing content into chunks. Chunking is useful for preparing content for embedding, indexing, or processing with length-limited systems (like LLM context windows).
**Attributes:**
| Field | Type | Default | Description |
| ------------- | ----------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `max_chars` | int | 1000 | Maximum number of characters per chunk. Chunks larger than this will be split intelligently at sentence/paragraph boundaries |
| `max_overlap` | int | 200 | Overlap between consecutive chunks in characters. Creates context bridges between chunks for smoother processing |
| `embedding` | EmbeddingConfig \| None | None | Configuration for generating embeddings for each chunk using ONNX models. None = no embeddings |
| `preset` | str \| None | None | Use a preset chunking configuration (overrides individual settings if provided). Use list_embedding_presets() to see available presets |
**IMPORTANT:** The fields are `max_chars` and `max_overlap` (NOT `max_characters` or `overlap`).
**Example:**
```python
from kreuzberg import ExtractionConfig, ChunkingConfig, EmbeddingConfig, EmbeddingModelType
# Basic chunking with defaults
config = ExtractionConfig(chunking=ChunkingConfig())
# Custom chunk size with overlap
config = ExtractionConfig(chunking=ChunkingConfig(max_chars=512, max_overlap=100))
# Chunking with embeddings
config = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=512,
        embedding=EmbeddingConfig(model=EmbeddingModelType.preset("balanced")),
    )
)
# Using a preset configuration (call list_embedding_presets() for the valid names)
config = ExtractionConfig(chunking=ChunkingConfig(preset="balanced"))
```
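To make the roles of `max_chars` and `max_overlap` concrete, here is a simplified sliding-window sketch in plain Python. It is not Kreuzberg's implementation: it cuts purely by character count and ignores the sentence and paragraph boundaries that the library takes into account. The function name `chunk_text` is illustrative.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> List[str]:
    """Split text into windows of at most max_chars characters.

    Each window starts max_chars - max_overlap characters after the previous
    one, so consecutive chunks share max_overlap characters.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be >= 0 and smaller than max_chars")
    step = max_chars - max_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

For example, a 10-character string with `max_chars=4` and `max_overlap=1` yields three chunks, each sharing one character with its neighbor.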
### PdfConfig
PDF-specific extraction configuration.
**Attributes:**
| Field | Type | Default | Description |
| ------------------ | ----------------------- | ------- | --------------------------------------------------------------------------------------------------- |
| `extract_images` | bool | False | Extract images from PDF documents |
| `passwords` | list[str] \| None | None | List of passwords to try when opening encrypted PDFs. Try each password in order until one succeeds |
| `extract_metadata` | bool | True | Extract PDF metadata (title, author, creation date, etc.) |
| `hierarchy` | HierarchyConfig \| None | None | Document hierarchy detection configuration. None = no hierarchy detection |
**Example:**
```python
from kreuzberg import ExtractionConfig, PdfConfig, HierarchyConfig
# Basic PDF configuration
config = ExtractionConfig(pdf_options=PdfConfig())
# Extract metadata and images from PDF
config = ExtractionConfig(pdf_options=PdfConfig(extract_images=True, extract_metadata=True))
# Handle encrypted PDFs
config = ExtractionConfig(pdf_options=PdfConfig(passwords=["password123", "fallback_password"]))
# Enable hierarchy detection
config = ExtractionConfig(pdf_options=PdfConfig(hierarchy=HierarchyConfig(k_clusters=6)))
```
### ImageExtractionConfig
Configuration for extracting images FROM documents. This is NOT for preprocessing images before OCR.
**Attributes:**
| Field | Type | Default | Description |
| --------------------- | ---- | ------- | ------------------------------------------------------------------------------------ |
| `extract_images` | bool | True | Enable image extraction from documents |
| `target_dpi` | int | 300 | Target DPI for image normalization. Images are resampled to this DPI for consistency |
| `max_image_dimension` | int | 4096 | Maximum width or height for extracted images. Larger images are downscaled to fit |
| `auto_adjust_dpi` | bool | True | Automatically adjust DPI based on image content quality |
| `min_dpi` | int | 72 | Minimum DPI threshold. Images with lower DPI are upscaled |
| `max_dpi` | int | 600 | Maximum DPI threshold. Images with higher DPI are downscaled |
**Example:**
```python
from kreuzberg import ExtractionConfig, ImageExtractionConfig
# Basic image extraction
config = ExtractionConfig(images=ImageExtractionConfig())
# Extract images with custom DPI settings
config = ExtractionConfig(
    images=ImageExtractionConfig(target_dpi=150, max_image_dimension=2048, auto_adjust_dpi=False)
)
```
### EmbeddingConfig
Embedding generation configuration for text chunks. Configures embedding generation using ONNX models via fastembed-rs.
**Attributes:**
| Field | Type | Default | Description |
| ------------------------ | ------------------ | ----------------- | ------------------------------------------------------------------------------------------- |
| `model` | EmbeddingModelType | Preset "balanced" | The embedding model to use (preset, fastembed, or custom) |
| `normalize` | bool | True | Whether to normalize embedding vectors to unit length (recommended for cosine similarity) |
| `batch_size` | int | 32 | Number of texts to process simultaneously. Higher values use more memory but may be faster |
| `show_download_progress` | bool | False | Display progress during embedding model download |
| `cache_dir` | str \| None | None | Custom directory for caching downloaded models (defaults to ~/.cache/kreuzberg/embeddings/) |
**Example:**
```python
from kreuzberg import EmbeddingConfig, EmbeddingModelType
# Basic preset embedding (recommended)
config = EmbeddingConfig()
# Specific preset with settings
config = EmbeddingConfig(
    model=EmbeddingModelType.preset("balanced"),
    normalize=True,
    batch_size=64,
)
# Custom ONNX model
config = EmbeddingConfig(
    model=EmbeddingModelType.custom(model_id="sentence-transformers/all-MiniLM-L6-v2", dimensions=384)
)
# With custom cache directory
config = EmbeddingConfig(cache_dir="/path/to/model/cache")
```
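The `normalize` option matters because, for unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below illustrates that relationship with the standard library only. The helper names are illustrative and the code is independent of how Kreuzberg computes embeddings.

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

With normalized embeddings, `dot(normalize(a), normalize(b))` gives the same value as `cosine_similarity(a, b)` up to floating-point rounding, which lets similarity search skip the per-query norm computation.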
### EmbeddingModelType
Selects the embedding model: a named preset, a specific fastembed model, or a custom ONNX model.
**Static Methods:**
```python
@staticmethod
def preset(name: str) -> EmbeddingModelType
```
Use a preset configuration (recommended for most use cases). Available presets: balanced, compact, large.
```python
@staticmethod
def fastembed(model: str, dimensions: int) -> EmbeddingModelType
```
Use a specific fastembed model by name.
```python
@staticmethod
def custom(model_id: str, dimensions: int) -> EmbeddingModelType
```
Use a custom ONNX model from HuggingFace (e.g., sentence-transformers/\*).
**Example:**
```python
from kreuzberg import EmbeddingModelType, list_embedding_presets
# Using the balanced preset (recommended for general use)
model = EmbeddingModelType.preset("balanced")
# Using a specific fast embedding model
model = EmbeddingModelType.fastembed(model="BAAI/bge-small-en-v1.5", dimensions=384)
# Using a custom HuggingFace model
model = EmbeddingModelType.custom(
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    dimensions=384,
)
# Listing available presets
presets = list_embedding_presets()
print(f"Available presets: {presets}")
```
### TokenReductionConfig
Configuration for reducing the token count of extracted content, which lowers costs when the output is sent to LLM APIs.
**Attributes:**
| Field | Type | Default | Description |
| -------------------------- | ---- | ------- | ------------------------------------------------------------------------------------------------ |
| `mode` | str | "off" | Token reduction mode: "off", "light", "moderate", "aggressive", or "maximum" |
| `preserve_important_words` | bool | True | Preserve capitalized words, technical terms, and proper nouns even in aggressive reduction modes |
**Example:**
```python
from kreuzberg import ExtractionConfig, TokenReductionConfig
# Moderate token reduction
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="moderate", preserve_important_words=True)
)
# Maximum reduction for large batches
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="maximum", preserve_important_words=True)
)
# No reduction (default)
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="off")
)
```
### LanguageDetectionConfig
Configuration for detecting document language(s).
**Attributes:**
| Field | Type | Default | Description |
| ----------------- | ----- | ------- | --------------------------------------------------------------------------------------------------- |
| `enabled` | bool | True | Enable language detection for extracted content |
| `min_confidence` | float | 0.8 | Minimum confidence threshold (0.0-1.0) for language detection |
| `detect_multiple` | bool | False | Detect multiple languages in the document. When False, only the most confident language is returned |
**Example:**
```python
from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file_sync
# Basic language detection
config = ExtractionConfig(language_detection=LanguageDetectionConfig())
# Detect multiple languages with lower confidence threshold
config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(detect_multiple=True, min_confidence=0.6)
)
# Access detected languages in result
result = extract_file_sync("multilingual.pdf", config=config)
print(f"Languages: {result.detected_languages}")
```
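The interaction between `min_confidence` and `detect_multiple` can be pictured as a filter followed by an optional truncation. The sketch below is an assumption about those semantics based on the field descriptions above, not Kreuzberg's actual detection code. `select_languages` and the `(code, confidence)` candidate format are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_languages(
    candidates: Sequence[Tuple[str, float]],
    min_confidence: float = 0.8,
    detect_multiple: bool = False,
) -> List[str]:
    """Keep candidates at or above the threshold, most confident first.

    When detect_multiple is False, only the top candidate (if any) is returned.
    """
    accepted = sorted(
        (c for c in candidates if c[1] >= min_confidence),
        key=lambda c: c[1],
        reverse=True,
    )
    codes = [code for code, _confidence in accepted]
    return codes if detect_multiple else codes[:1]
```

Under these assumptions, lowering `min_confidence` widens the set of accepted languages, while `detect_multiple` controls whether more than the best match is reported.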
### KeywordConfig
Keyword extraction configuration.
**Attributes:**
| Field | Type | Default | Description |
| -------------- | ------------------ | ------- | ----------------------------------------------------------------------------- |
| `algorithm` | KeywordAlgorithm | - | Keyword extraction algorithm (KeywordAlgorithm.Yake or KeywordAlgorithm.Rake) |
| `max_keywords` | int | 10 | Maximum number of keywords to extract |
| `min_score` | float | 0.0 | Minimum score threshold |
| `ngram_range` | tuple[int, int] | (1, 3) | N-gram range for keyword extraction |
| `language` | str \| None | "en" | Optional language hint |
| `yake_params` | YakeParams \| None | None | YAKE-specific tuning parameters |
| `rake_params` | RakeParams \| None | None | RAKE-specific tuning parameters |
### PageConfig
Page extraction and tracking configuration.
**Attributes:**
| Field | Type | Default | Description |
| --------------------- | ---- | -------------------------------------- | -------------------------------------------- |
| `extract_pages` | bool | False | Enable page tracking and per-page extraction |
| `insert_page_markers` | bool | False | Insert page markers into content |
| `marker_format` | str | "\\n\\n<!-- PAGE {page_num} -->\\n\\n" | Marker template containing {page_num} |
**Example:**
```python
from kreuzberg import ExtractionConfig, PageConfig
config = ExtractionConfig(pages=PageConfig(extract_pages=True))
```
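As an illustration of how a `marker_format` template expands, the sketch below joins per-page text with markers using `str.format`. It assumes `{page_num}` is substituted in `str.format` style, that numbering starts at 1, and that a marker precedes each page. None of this is confirmed by the reference above, and `join_pages` is a hypothetical helper.

```python
from typing import Sequence

DEFAULT_MARKER = "\n\n<!-- PAGE {page_num} -->\n\n"


def join_pages(pages: Sequence[str], marker_format: str = DEFAULT_MARKER) -> str:
    """Concatenate page texts, placing a formatted marker before each page."""
    return "".join(
        marker_format.format(page_num=number) + text
        for number, text in enumerate(pages, start=1)
    )
```

Because the default marker is an HTML comment, it stays invisible when the content is rendered as Markdown or HTML but remains easy to find when splitting the text by page.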
### PostProcessorConfig
Configuration for post-processors in the extraction pipeline.
**Attributes:**
| Field | Type | Default | Description |
| --------------------- | ----------------- | ------- | ----------------------------------------------------------- |
| `enabled` | bool | True | Enable post-processors in the extraction pipeline |
| `enabled_processors` | list[str] \| None | None | Whitelist of processor names to run. None = run all enabled |
| `disabled_processors` | list[str] \| None | None | Blacklist of processor names to skip. None = none disabled |
**Example:**
```python
from kreuzberg import ExtractionConfig, PostProcessorConfig
# Basic post-processing with defaults
config = ExtractionConfig(postprocessor=PostProcessorConfig())
# Enable only specific processors
config = ExtractionConfig(
    postprocessor=PostProcessorConfig(
        enabled=True,
        enabled_processors=["normalize_whitespace", "fix_encoding"],
    )
)
# Disable specific processors
config = ExtractionConfig(
    postprocessor=PostProcessorConfig(
        enabled=True,
        disabled_processors=["experimental_cleanup"],
    )
)
# Disable all post-processing
config = ExtractionConfig(postprocessor=PostProcessorConfig(enabled=False))
```
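The three `PostProcessorConfig` fields combine into a single decision about which processors run. The sketch below shows one plausible reading of the table above, in which the whitelist is applied first and the blacklist second. That ordering is an assumption, and `select_processors` is a hypothetical helper rather than a Kreuzberg function.

```python
from typing import List, Optional, Sequence


def select_processors(
    registered: Sequence[str],
    enabled: bool = True,
    enabled_processors: Optional[Sequence[str]] = None,
    disabled_processors: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return the names of processors that would run, in registration order."""
    if not enabled:
        return []  # post-processing switched off entirely
    selected = list(registered)
    if enabled_processors is not None:
        allowed = set(enabled_processors)
        selected = [name for name in selected if name in allowed]
    if disabled_processors is not None:
        blocked = set(disabled_processors)
        selected = [name for name in selected if name not in blocked]
    return selected
```

Under this reading, leaving both lists as `None` runs every registered processor, and `enabled=False` overrides both lists.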
### ImagePreprocessingConfig
Configuration for preprocessing images before OCR. This is NOT for extracting images from documents.
**Attributes:**
| Field | Type | Default | Description |
| --------------------- | ---- | ------- | ------------------------------------------------- |
| `target_dpi` | int | 300 | Target DPI for image normalization before OCR |
| `auto_rotate` | bool | True | Automatically detect and correct image rotation |
| `deskew` | bool | True | Correct skewed images to improve OCR accuracy |
| `denoise` | bool | False | Apply denoising filters to reduce noise in images |
| `contrast_enhance` | bool | False | Enhance contrast to improve text readability |
| `binarization_method` | str | "otsu" | Method for converting images to black and white |
| `invert_colors` | bool | False | Invert colors (white text on black background) |
**Example:**
```python
from kreuzberg import TesseractConfig, ImagePreprocessingConfig
# Basic preprocessing for OCR
config = TesseractConfig(preprocessing=ImagePreprocessingConfig())
# Aggressive preprocessing for low-quality scans
config = TesseractConfig(
    preprocessing=ImagePreprocessingConfig(
        target_dpi=300,
        denoise=True,
        contrast_enhance=True,
        auto_rotate=True,
        deskew=True,
    )
)
```
## ExtractionResult
Result object returned by extraction functions.
**Attributes:**
| Field | Type | Description |
| --------------------- | ------------------------------ | -------------------------------------------------------------------------------- |
| `content` | str | Main extracted text content in the specified output_format |
| `mime_type` | str | MIME type of the processed document |
| `metadata` | Metadata | Extracted document metadata (title, author, created_at, format_type, etc.) |
| `tables` | list[ExtractedTable] | Extracted tables from the document |
| `detected_languages` | list[str] \| None | Detected language codes (e.g., ["en", "de"]) if language detection is enabled |
| `chunks` | list[Chunk] \| None | Text chunks if chunking is enabled (each chunk has content, embedding, metadata) |
| `images` | list[ExtractedImage] \| None | Extracted images if image extraction is enabled |
| `pages` | list[PageContent] \| None | Per-page content and metadata if page extraction is enabled |
| `elements` | list[Element] \| None | Semantic elements if result_format="element_based" |
| `output_format` | str \| None | Format of the content field (plain, markdown, djot, html) |
| `result_format` | str \| None | Result format used (unified or element_based) |
| `extracted_keywords` | list[ExtractedKeyword] \| None | Extracted keywords with relevance scores if keyword extraction enabled |
| `quality_score` | float \| None | Overall quality score for the extraction result (0.0-1.0) |
| `processing_warnings` | list[ProcessingWarning] | Non-fatal warnings encountered during extraction pipeline |
**Methods:**
```python
def get_page_count(self) -> int
```
Get the total number of pages in the document.
```python
def get_chunk_count(self) -> int
```
Get the total number of chunks if chunking is enabled.
```python
def get_detected_language(self) -> str | None
```
Get the most confident detected language code.
```python
def get_metadata_field(self, field_name: str) -> Any | None
```
Get a specific metadata field by name.
**Example:**
```python
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig
config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=512),
    output_format="markdown",
)
result = extract_file_sync("document.pdf", config=config)
print(f"Content preview: {result.content[:200]}")
print(f"MIME type: {result.mime_type}")
print(f"Page count: {result.get_page_count()}")
print(f"Chunk count: {result.get_chunk_count()}")
print(f"Detected language: {result.get_detected_language()}")
if result.tables:
    print(f"Found {len(result.tables)} tables")
if result.chunks:
    first_chunk = result.chunks[0]
    print(f"First chunk: {first_chunk.content[:100]}")
    if first_chunk.embedding:
        print(f"Embedding dimensions: {len(first_chunk.embedding)}")
```
## Error Classes
All exceptions inherit from `KreuzbergError`, the base exception class.
### KreuzbergError
Base exception class for all Kreuzberg errors.
```python
class KreuzbergError(Exception):
    """Base exception for all Kreuzberg errors."""
```
### ParsingError
Raised when document parsing fails.
```python
class ParsingError(KreuzbergError):
    """Document parsing failed (corrupt, malformed, etc.)."""
```
### OCRError
Raised when OCR processing fails.
```python
class OCRError(KreuzbergError):
    """OCR operation failed."""
```
### ValidationError
Raised when validation fails.
```python
class ValidationError(KreuzbergError):
    """Validation failed (invalid parameters, constraints, format mismatches)."""
```
### MissingDependencyError
Raised when required dependencies are not available.
```python
class MissingDependencyError(KreuzbergError):
    """Required dependency not available (easyocr, paddleocr, tesseract, etc.)."""
    @staticmethod
    def create_for_package(dependency_group: str, functionality: str, package_name: str) -> MissingDependencyError
```
**Example:**
```python
from kreuzberg import extract_file_sync, MissingDependencyError, OCRError, ParsingError
try:
    result = extract_file_sync("document.pdf")
except ParsingError as e:
    print(f"Failed to parse document: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
```
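Because every error derives from `KreuzbergError`, a final `except KreuzbergError` clause can act as a catch-all, provided it comes after the more specific clauses. The sketch below demonstrates the ordering with local stand-in classes that mirror the hierarchy described above. They are defined here only so the example runs without the library, and `classify` is an illustrative helper.

```python
class KreuzbergError(Exception):
    """Stand-in for the library's base exception."""


class ParsingError(KreuzbergError):
    """Stand-in for the library's parsing error."""


class OCRError(KreuzbergError):
    """Stand-in for the library's OCR error."""


def classify(exc: Exception) -> str:
    """Show which except clause handles a given exception."""
    try:
        raise exc
    except ParsingError:
        return "parsing"
    except OCRError:
        return "ocr"
    except KreuzbergError:
        # Reached only for Kreuzberg errors not matched above.
        return "other kreuzberg error"
```

If the `KreuzbergError` clause were listed first, it would also intercept `ParsingError` and `OCRError`, so the specific handlers would never run.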
## Utility Functions
### MIME Type Detection
```python
def detect_mime_type(data: bytes | bytearray) -> str
```
Detect MIME type from file bytes using magic number detection.
**Parameters:**
- `data` (bytes | bytearray): File content as bytes or bytearray
**Returns:** Detected MIME type (e.g., "application/pdf", "image/png")
```python
def detect_mime_type_from_path(path: str | Path) -> str
```
Detect the MIME type of a file by reading it from the given path.
**Parameters:**
- `path` (str | Path): Path to the file
**Returns:** Detected MIME type
**Raises:**
- `OSError`: If file cannot be read (file not found, permission denied, etc.)
- `RuntimeError`: If MIME type detection fails
**Example:**
```python
from kreuzberg import detect_mime_type, detect_mime_type_from_path
# From bytes
pdf_bytes = b"%PDF-1.4\n"
mime_type = detect_mime_type(pdf_bytes)
# From path
mime_type = detect_mime_type_from_path("document.pdf")
```
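Magic-number detection means comparing the first bytes of the data against known file signatures. The sketch below illustrates the idea for a handful of well-known formats. It is not Kreuzberg's detector, which is implemented in Rust and covers far more types, and `sniff_mime_type` is a hypothetical name.

```python
from typing import Optional, Union

# (signature bytes at offset 0, MIME type) for a few common formats
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mime_type(data: Union[bytes, bytearray]) -> Optional[str]:
    """Return the MIME type whose signature prefixes the data, or None."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return None
```

Unlike extension-based guessing, this approach works on raw bytes with no file name, which is why `detect_mime_type` pairs naturally with `extract_bytes`.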
### MIME Type Validation
```python
def validate_mime_type(mime_type: str) -> str
```
Validate a MIME type string and return the canonical form.
```python
def get_extensions_for_mime(mime_type: str) -> list[str]
```
Get file extensions associated with a MIME type.
**Example:**
```python
from kreuzberg import validate_mime_type, get_extensions_for_mime
canonical = validate_mime_type("application/pdf")
extensions = get_extensions_for_mime("application/pdf") # Returns ["pdf"]
```
### Configuration Loading
```python
def load_extraction_config_from_file(path: str | Path) -> ExtractionConfig
```
Load extraction configuration from a specific file.
**Parameters:**
- `path` (str | Path): Path to the configuration file (.toml, .yaml, or .json)
**Returns:** ExtractionConfig parsed from the file
**Raises:**
- `FileNotFoundError`: If the configuration file does not exist
- `RuntimeError`: If the file cannot be read or parsed
- `ValueError`: If the file format is invalid or unsupported
```python
def discover_extraction_config() -> ExtractionConfig | None
```
Discover extraction configuration from the environment (deprecated).
Attempts to locate a Kreuzberg configuration file using:
1. KREUZBERG_CONFIG_PATH environment variable
2. Search for kreuzberg.toml, kreuzberg.yaml, or kreuzberg.json in current and parent directories
**Returns:** ExtractionConfig if found, None otherwise
**Note:** Deprecated in favor of `load_extraction_config_from_file` for more predictable behavior.
**Example:**
```python
import os

from kreuzberg import discover_extraction_config, extract_file_sync, load_extraction_config_from_file

# Load from a specific file
config = load_extraction_config_from_file("kreuzberg.toml")
result = extract_file_sync("document.pdf", config=config)

# Auto-discover configuration (deprecated): KREUZBERG_CONFIG_PATH is checked first
os.environ["KREUZBERG_CONFIG_PATH"] = "config/kreuzberg.yaml"
discovered = discover_extraction_config()
if discovered is not None:
    result = extract_file_sync("document.pdf", config=discovered)
```
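
The lookup order described for `discover_extraction_config` (environment variable first, then the current directory and its parents) can be sketched in plain Python. This is only an approximation of the documented behavior: `find_config_path` and `CONFIG_NAMES` are hypothetical names, and returning the environment-variable path without checking that it exists is an assumption of the sketch.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path

CONFIG_NAMES = ("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json")


def find_config_path(
    start: Path | None = None,
    env: Mapping[str, str] | None = None,
) -> Path | None:
    """Return the first configuration path found, or None."""
    env = os.environ if env is None else env

    # 1. An explicit path from the environment takes precedence.
    explicit = env.get("KREUZBERG_CONFIG_PATH")
    if explicit:
        return Path(explicit)

    # 2. Walk from the start directory up to the filesystem root.
    current = (start or Path.cwd()).resolve()
    for directory in (current, *current.parents):
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

A path found this way could then be passed to `load_extraction_config_from_file`, which keeps the choice of configuration file explicit in the calling code.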
## Plugin System
### Registering Post-Processors
```python
def register_post_processor(processor: Any) -> None
```
Register a Python PostProcessor with the Rust core. Once registered, the processor will be called automatically after extraction to enrich results.
**Required Methods:**
- `name() -> str`: Return processor name (must be non-empty)
- `process(result: ExtractionResult) -> ExtractionResult`: Process and enrich the extraction result
- `processing_stage() -> str`: Return "early", "middle", or "late"
**Optional Methods:**
- `initialize() -> None`: Called when processor is registered
- `shutdown() -> None`: Called when processor is unregistered
**Example:**
```python
from kreuzberg import register_post_processor, ExtractionResult

class EntityExtractor:
    def name(self) -> str:
        return "entity_extraction"

    def processing_stage(self) -> str:
        return "early"

    def process(self, result: ExtractionResult) -> ExtractionResult:
        entities = {"PERSON": ["John Doe"], "ORG": ["Microsoft"]}
        result.metadata["entities"] = entities
        return result

register_post_processor(EntityExtractor())
```
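
The optional `initialize` and `shutdown` hooks can hold setup and teardown logic. Below is a minimal sketch of a "late" stage processor that implements them; `WordCountProcessor` is a hypothetical example class, and a `SimpleNamespace` stands in for `ExtractionResult` so the snippet runs without Kreuzberg installed. It assumes `metadata` behaves like a dict, as in the example above.

```python
from types import SimpleNamespace


class WordCountProcessor:
    """Adds a word count to the result metadata (illustrative only)."""

    def __init__(self) -> None:
        self.ready = False

    def name(self) -> str:
        return "word_count"

    def processing_stage(self) -> str:
        return "late"

    def initialize(self) -> None:
        # Optional hook: acquire resources here.
        self.ready = True

    def shutdown(self) -> None:
        # Optional hook: release resources here.
        self.ready = False

    def process(self, result):
        result.metadata["word_count"] = len(result.content.split())
        return result


# Stand-in object with the two attributes the processor touches.
fake_result = SimpleNamespace(content="one two three", metadata={})
processed = WordCountProcessor().process(fake_result)
print(processed.metadata)  # {'word_count': 3}
```

Passing an instance to `register_post_processor` would register it in the same way as `EntityExtractor`.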
### Registering OCR Backends
```python
def register_ocr_backend(backend: Any) -> None
```
Register a Python OCR backend with the Rust core.
**Required Methods:**
- `name() -> str`: Return backend name (must be non-empty)
- `supported_languages() -> list[str]`: Return list of supported language codes
- `process_image(image_bytes: bytes, language: str) -> OcrResult`: Process image and return OCR result
- `process_file(path: str, language: str) -> OcrResult`: Process file and return OCR result
- `initialize() -> None`: Called when backend is registered
- `shutdown() -> None`: Called when backend is unregistered
- `version() -> str`: Return backend version string
**Example:**
```python
from kreuzberg import register_ocr_backend
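
class MyOcrBackend:
    def name(self) -> str:
        return "my-ocr"

    def version(self) -> str:
        return "0.1.0"

    def supported_languages(self) -> list[str]:
        return ["eng", "deu", "fra"]

    def initialize(self) -> None:
        pass  # Load models or other resources here

    def shutdown(self) -> None:
        pass  # Release resources here

    def process_image(self, image_bytes: bytes, language: str) -> dict:
        # Placeholder result; a real backend would run OCR on image_bytes
        return {
            "content": "extracted text",
            "metadata": {"confidence": 0.95},
            "tables": [],
        }

    def process_file(self, path: str, language: str) -> dict:
        # Read the file and reuse the image pipeline
        with open(path, "rb") as f:
            return self.process_image(f.read(), language)

register_ocr_backend(MyOcrBackend())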
```
### Registering Validators
```python
def register_validator(validator: Any) -> None
```
Register a Python Validator with the Rust core. Validators are called automatically after extraction to validate results.
**Required Methods:**
- `name() -> str`: Return validator name (must be non-empty)
- `validate(result: ExtractionResult) -> None`: Validate the extraction result (raise error to fail)
**Optional Methods:**
- `should_validate(result: ExtractionResult) -> bool`: Check if validator should run (defaults to True)
- `priority() -> int`: Return priority (defaults to 50, higher runs first)
**Example:**
```python
from kreuzberg import register_validator, ValidationError, ExtractionResult

class MinLengthValidator:
    def name(self) -> str:
        return "min_length_validator"

    def priority(self) -> int:
        return 100

    def validate(self, result: ExtractionResult) -> None:
        if len(result.content) < 100:
            raise ValidationError("Content too short")

register_validator(MinLengthValidator())
```
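
The optional `should_validate` method lets a validator skip results it does not apply to. The following sketch shows one way to use it; `GarbledTextValidator` and its `max_ratio` threshold are hypothetical, and the fallback `ValidationError` class exists only so the snippet runs when Kreuzberg is not installed.

```python
from types import SimpleNamespace

try:
    from kreuzberg import ValidationError
except ImportError:  # fallback so the sketch is runnable on its own
    class ValidationError(Exception):
        """Stand-in for kreuzberg.ValidationError."""


class GarbledTextValidator:
    """Fails results with too many Unicode replacement characters."""

    def __init__(self, max_ratio: float = 0.05) -> None:
        self.max_ratio = max_ratio

    def name(self) -> str:
        return "garbled_text_validator"

    def priority(self) -> int:
        return 10  # lower than MinLengthValidator, so it would run later

    def should_validate(self, result) -> bool:
        # Nothing to check on empty content; this also avoids dividing by zero.
        return bool(result.content)

    def validate(self, result) -> None:
        ratio = result.content.count("\ufffd") / len(result.content)
        if ratio > self.max_ratio:
            raise ValidationError(f"Too many undecodable characters ({ratio:.0%})")


validator = GarbledTextValidator()
sample = SimpleNamespace(content="clean text")  # stand-in for ExtractionResult
if validator.should_validate(sample):
    validator.validate(sample)  # passes silently
```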
### Plugin Management Functions
```python
def list_post_processors() -> list[str]
```
List names of all registered post-processors.
```python
def list_validators() -> list[str]
```
List names of all registered validators.
```python
def list_ocr_backends() -> list[str]
```
List names of all available OCR backends.
```python
def unregister_post_processor(name: str) -> None
```
Unregister a post-processor by name.
```python
def unregister_validator(name: str) -> None
```
Unregister a validator by name.
```python
def unregister_ocr_backend(name: str) -> None
```
Unregister an OCR backend by name.
```python
def clear_post_processors() -> None
```
Clear all registered post-processors.
```python
def clear_validators() -> None
```
Clear all registered validators.
```python
def clear_ocr_backends() -> None
```
Clear all registered OCR backends.
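
Because registrations are global, tests or short-lived scripts may want a plugin to be active only for a limited scope. One possible pattern is a context manager that pairs a register call with the matching unregister call. `temporary_plugin` is a hypothetical helper, not part of Kreuzberg; it takes the register and unregister functions as arguments so it works for post-processors, validators, and OCR backends alike.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def temporary_plugin(
    plugin: Any,
    register: Callable[[Any], None],
    unregister: Callable[[str], None],
) -> Iterator[Any]:
    """Register a plugin on entry and unregister it by name on exit."""
    register(plugin)
    try:
        yield plugin
    finally:
        # Runs even if the body raises, so the registry is always restored.
        unregister(plugin.name())
```

With Kreuzberg this could be used as `with temporary_plugin(EntityExtractor(), register_post_processor, unregister_post_processor): ...`, where `EntityExtractor` is the class from the post-processor example.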
## Format Enums
### OutputFormat
Output format for extraction results.
```python
class OutputFormat(str, Enum):
    PLAIN = "plain"        # Plain text format
    MARKDOWN = "markdown"  # Markdown format
    DJOT = "djot"          # Djot lightweight markup format
    HTML = "html"          # HTML format
```
### ResultFormat
Result format controlling extraction output structure.
```python
class ResultFormat(str, Enum):
    UNIFIED = "unified"              # All content in `content` field
    ELEMENT_BASED = "element_based"  # Unstructured-compatible output with semantic elements
```
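
Since both enums subclass `str`, members compare equal to their string values and can be constructed from those values. The sketch below uses a local copy of `OutputFormat` so it runs on its own; `parse_output_format` is a hypothetical helper for normalizing user input.

```python
from __future__ import annotations

from enum import Enum


class OutputFormat(str, Enum):  # local copy of the definition above
    PLAIN = "plain"
    MARKDOWN = "markdown"
    DJOT = "djot"
    HTML = "html"


def parse_output_format(value: str) -> OutputFormat | None:
    """Map user input to an OutputFormat member, or None if unknown."""
    try:
        return OutputFormat(value.strip().lower())
    except ValueError:
        return None


print(OutputFormat("markdown") is OutputFormat.MARKDOWN)  # True
print(OutputFormat.HTML == "html")                        # True
print(parse_output_format("pdf"))                         # None
```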
## Error Handling
### Error Code Functions
```python
def get_last_error_code() -> int
```
Get the last error code from the FFI layer.
**Returns:**
- 0 (SUCCESS): No error occurred
- 1 (GENERIC_ERROR): Generic unspecified error
- 2 (PANIC): A panic occurred in the Rust core
- 3 (INVALID_ARGUMENT): Invalid argument provided
- 4 (IO_ERROR): I/O operation failed
- 5 (PARSING_ERROR): Document parsing failed
- 6 (OCR_ERROR): OCR operation failed
- 7 (MISSING_DEPENDENCY): Required dependency not available
```python
def get_error_details() -> dict[str, Any]
```
Get detailed error information from the FFI layer.
**Returns:** dict with keys:
- `message` (str): Human-readable error message
- `error_code` (int): Numeric error code (0-7)
- `error_type` (str): Error type name (e.g., "validation", "ocr")
- `source_file` (str | None): Source file path if available
- `source_function` (str | None): Function name if available
- `source_line` (int): Line number (0 if unknown)
- `context_info` (str | None): Additional context if available
- `is_panic` (bool): Whether error came from a panic
```python
def classify_error(message: str) -> int
```
Classify an error message into a Kreuzberg error code.
**Parameters:**
- `message` (str): The error message to classify
**Returns:** int error code (0-7) representing the classification
```python
def error_code_name(code: int) -> str
```
Get the human-readable name of an error code.
**Parameters:**
- `code` (int): Numeric error code (0-7)
**Returns:** Human-readable error code name (e.g., "validation", "ocr")
**Example:**
```python
from kreuzberg import classify_error, error_code_name, extract_file_sync, get_error_details, get_last_error_code

try:
    result = extract_file_sync("document.pdf")
except Exception as e:
    code = get_last_error_code()
    if code:
        print(f"Error code: {code} ({error_code_name(code)})")

    details = get_error_details()
    print(f"Error: {details['message']}")
    print(f"Type: {details['error_type']}")

    classified = classify_error(str(e))
    print(f"Classified as: {error_code_name(classified)}")
```
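
To avoid bare integers in calling code, the numeric codes listed for `get_last_error_code` can be mirrored in an `IntEnum`. `ErrorCode` and `describe_code` are hypothetical names; Kreuzberg is not known to export such an enum, and the values simply copy the table in this section.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    """Local mirror of the documented FFI error codes (illustrative)."""

    SUCCESS = 0
    GENERIC_ERROR = 1
    PANIC = 2
    INVALID_ARGUMENT = 3
    IO_ERROR = 4
    PARSING_ERROR = 5
    OCR_ERROR = 6
    MISSING_DEPENDENCY = 7


def describe_code(code: int) -> str:
    """Return the symbolic name for a code, tolerating unknown values."""
    try:
        return ErrorCode(code).name
    except ValueError:
        return f"UNKNOWN({code})"


print(describe_code(5))   # PARSING_ERROR
print(describe_code(42))  # UNKNOWN(42)
```

A comparison such as `get_last_error_code() == ErrorCode.OCR_ERROR` then reads more clearly than comparing against `6`.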
## Validation Functions
### Parameter Validation
```python
def validate_chunking_params(max_chars: int, max_overlap: int) -> bool
```
Validate chunking parameters.
```python
def validate_confidence(confidence: float) -> bool
```
Validate confidence value (0.0-1.0).
```python
def validate_dpi(dpi: int) -> bool
```
Validate DPI value.
```python
def validate_tesseract_psm(psm: int) -> bool
```
Validate Tesseract Page Segmentation Mode.
```python
def validate_tesseract_oem(oem: int) -> bool
```
Validate Tesseract OCR Engine Mode.
```python
def validate_ocr_backend(backend: str) -> bool
```
Validate OCR backend name.
```python
def validate_language_code(code: str) -> bool
```
Validate language code format.
```python
def validate_token_reduction_level(level: str) -> bool
```
Validate token reduction level.
```python
def validate_output_format(output_format: str) -> bool
```
Validate output format string.
```python
def validate_binarization_method(method: str) -> bool
```
Validate binarization method for image preprocessing.
### Getting Valid Values
```python
def get_valid_binarization_methods() -> list[str]
```
Get list of valid binarization methods.
```python
def get_valid_language_codes() -> list[str]
```
Get list of valid language codes.
```python
def get_valid_ocr_backends() -> list[str]
```
Get list of valid OCR backend names.
```python
def get_valid_token_reduction_levels() -> list[str]
```
Get list of valid token reduction levels.
```python
def list_embedding_presets() -> list[str]
```
List available embedding presets.
```python
def get_embedding_preset(name: str) -> EmbeddingPreset | None
```
Get details about a specific embedding preset.
**Example:**
```python
from kreuzberg import (
    validate_dpi,
    get_valid_binarization_methods,
    list_embedding_presets,
    get_embedding_preset,
)

# Validate parameters
if not validate_dpi(300):
    print("Invalid DPI")

# List valid values
binarization_methods = get_valid_binarization_methods()
presets = list_embedding_presets()

# Get preset details
preset = get_embedding_preset("balanced")
if preset:
    print(f"Balanced preset: {preset.description}")
    print(f"Dimensions: {preset.dimensions}")
    print(f"Recommended chunk size: {preset.chunk_size}")
```
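
Because each `validate_*` function returns a bool, several checks can be run together and the failures reported at once. The helper below is a sketch under that assumption; `failed_checks` is a hypothetical name, and the lambdas are stand-ins for the Kreuzberg validators so the snippet runs without the library.

```python
from __future__ import annotations

from typing import Any, Callable


def failed_checks(checks: dict[str, tuple[Callable[[Any], bool], Any]]) -> list[str]:
    """Return the labels of all checks whose validator returned False."""
    return [label for label, (validator, value) in checks.items() if not validator(value)]


# Stand-in validators (assumptions for illustration only).
psm_in_range = lambda psm: 0 <= psm <= 13             # Tesseract PSM values are 0-13
confidence_in_range = lambda conf: 0.0 <= conf <= 1.0

problems = failed_checks({
    "psm": (psm_in_range, 6),
    "confidence": (confidence_in_range, 1.5),
})
print(problems)  # ['confidence']
```

With Kreuzberg installed, `validate_tesseract_psm` and `validate_confidence` could be passed in place of the stand-ins.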
## Configuration Utilities
### Config Manipulation
```python
def config_to_json(config: ExtractionConfig) -> str
```
Convert ExtractionConfig to JSON string.
```python
def config_get_field(config: ExtractionConfig, field_name: str) -> Any | None
```
Get a specific field value from ExtractionConfig.
```python
def config_merge(base: ExtractionConfig, override: ExtractionConfig) -> None
```
Merge override config into base config (mutates base).
**Example:**
```python
from kreuzberg import ExtractionConfig, config_to_json, config_get_field, config_merge
config = ExtractionConfig(use_cache=True, enable_quality_processing=False)
# Convert to JSON
json_str = config_to_json(config)
print(json_str)
# Get field
use_cache = config_get_field(config, "use_cache")
print(f"use_cache: {use_cache}")
# Merge configs
override = ExtractionConfig(use_cache=False)
config_merge(config, override)
```
## Version Information
```python
__version__: str
```
Current version of the kreuzberg package.
**Example:**
```python
from kreuzberg import __version__
print(f"Kreuzberg version: {__version__}")
```