fil/skills/kreuzberg/references/python-api.md
Henrik Jess Nielsen b4c07d3693
Nomad changes
2026-06-01 23:40:55 +02:00

Kreuzberg Python API Reference

Reference documentation for the Kreuzberg Python API. The extraction logic is implemented in Rust; the Python layer adds OCR backends (EasyOCR, PaddleOCR) and support for custom post-processors.

Extraction Functions

Synchronous File Extraction

def extract_file_sync(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult

Extract content from a file (synchronous).

Parameters:

  • file_path (str | Path): Path to the file
  • mime_type (str | None): Optional MIME type hint (auto-detected if None)
  • config (ExtractionConfig | None): Extraction configuration (uses defaults if None)
  • easyocr_kwargs (dict | None): EasyOCR initialization options (languages, use_gpu, beam_width, etc.)
  • paddleocr_kwargs (dict | None): PaddleOCR initialization options (lang, use_angle_cls, show_log, etc.)

Returns: ExtractionResult with content, metadata, and tables

Example:

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig

# Basic usage
result = extract_file_sync("document.pdf")

# With Tesseract configuration
config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    )
)
result = extract_file_sync("invoice.pdf", config=config)

# With EasyOCR custom options
config = ExtractionConfig(ocr=OcrConfig(backend="easyocr", language="eng"))
result = extract_file_sync("scanned.pdf", config=config, easyocr_kwargs={"use_gpu": True})

Asynchronous File Extraction

async def extract_file(
    file_path: str | Path,
    mime_type: str | None = None,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult

Extract content from a file (asynchronous). Same parameters and behavior as extract_file_sync.
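
Because extract_file is a coroutine, several calls can be awaited together. The following is a minimal sketch, not part of the Kreuzberg API: `extract_many` is a hypothetical helper that takes the extraction coroutine as a parameter, and `fake_extract` is a stand-in so the pattern can be run without Kreuzberg installed.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[Any]],
    limit: int = 4,
) -> list:
    """Await `extract` for every path, running at most `limit` calls at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path: str) -> Any:
        async with semaphore:
            return await extract(path)

    # gather preserves input order, so results[i] belongs to paths[i]
    return await asyncio.gather(*(bounded(p) for p in paths))


async def fake_extract(path: str) -> str:
    """Stand-in for kreuzberg.extract_file (hypothetical, for illustration only)."""
    await asyncio.sleep(0)
    return f"text of {path}"


results = asyncio.run(extract_many(["a.pdf", "b.docx"], fake_extract))
```

With Kreuzberg installed, extract_file could be passed as `extract`, for example wrapped in `functools.partial(extract_file, config=config)`. For plain parallel extraction the batch functions documented below are the more direct route.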

Synchronous Bytes Extraction

def extract_bytes_sync(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult

Extract content from bytes (synchronous).

Parameters:

  • data (bytes | bytearray): File content as bytes or bytearray
  • mime_type (str): MIME type of the data (required for format detection)
  • config (ExtractionConfig | None): Extraction configuration
  • easyocr_kwargs (dict | None): EasyOCR initialization options
  • paddleocr_kwargs (dict | None): PaddleOCR initialization options

Returns: ExtractionResult with content, metadata, and tables

Asynchronous Bytes Extraction

async def extract_bytes(
    data: bytes | bytearray,
    mime_type: str,
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> ExtractionResult

Extract content from bytes (asynchronous). Same parameters and behavior as extract_bytes_sync.

Batch File Extraction

async def batch_extract_files(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]

Extract content from multiple files in parallel (asynchronous).

Parameters:

  • paths (list[str | Path]): List of file paths
  • config (ExtractionConfig | None): Extraction configuration
  • easyocr_kwargs (dict | None): EasyOCR initialization options
  • paddleocr_kwargs (dict | None): PaddleOCR initialization options

Returns: List of ExtractionResults (one per file)

Batch File Extraction (Synchronous)

def batch_extract_files_sync(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]

Extract content from multiple files in parallel (synchronous).

Batch Bytes Extraction

async def batch_extract_bytes(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]

Extract content from multiple byte arrays in parallel (asynchronous).

Parameters:

  • data_list (list[bytes | bytearray]): List of file contents as bytes/bytearray
  • mime_types (list[str]): List of MIME types (one per data item)
  • config (ExtractionConfig | None): Extraction configuration
  • easyocr_kwargs (dict | None): EasyOCR initialization options
  • paddleocr_kwargs (dict | None): PaddleOCR initialization options

Returns: List of ExtractionResults (one per data item)

Batch Bytes Extraction (Synchronous)

def batch_extract_bytes_sync(
    data_list: list[bytes | bytearray],
    mime_types: list[str],
    config: ExtractionConfig | None = None,
    *,
    easyocr_kwargs: dict[str, Any] | None = None,
    paddleocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]

Extract content from multiple byte arrays in parallel (synchronous).

Per-File Config in Batch Functions

As of v4.5.0, per-file configuration overrides are passed as an optional file_configs parameter on the unified batch functions:

def batch_extract_files_sync(
    paths: list[str | Path],
    config: ExtractionConfig | None = None,
    *,
    file_configs: list[FileExtractionConfig | None] | None = None,
    easyocr_kwargs: dict[str, Any] | None = None,
) -> list[ExtractionResult]

The file_configs list must have the same length as paths. Each element is either a FileExtractionConfig override or None to use batch defaults. The same parameter is available on batch_extract_files, batch_extract_bytes_sync, and batch_extract_bytes.

Note: The separate batch_extract_files_with_configs_sync / batch_extract_files_with_configs / batch_extract_bytes_with_configs_sync / batch_extract_bytes_with_configs functions have been removed in v4.5.0.
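
Since file_configs must line up index-for-index with paths, it can help to build it from a mapping. A minimal sketch, assuming a hypothetical helper `build_file_configs`; plain dicts stand in for FileExtractionConfig objects so the snippet runs without Kreuzberg installed.

```python
from typing import Any, Optional


def build_file_configs(
    paths: list,
    overrides: dict,
) -> list:
    """Return one entry per path: the override for that path, or None."""
    unknown = set(overrides) - set(paths)
    if unknown:
        raise ValueError(f"overrides given for paths not in the batch: {sorted(unknown)}")
    # None tells the batch function to fall back to the batch-level config
    return [overrides.get(path) for path in paths]


paths = ["report.pdf", "scan_de.pdf", "notes.txt"]
# In real use the value would be a FileExtractionConfig; a dict stands in here.
overrides = {"scan_de.pdf": {"force_ocr": True}}
file_configs = build_file_configs(paths, overrides)
```

The resulting list has the same length as paths and could then be passed as `file_configs=file_configs` to batch_extract_files_sync.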

Configuration Classes

ExtractionConfig

Main extraction configuration for document processing. All attributes are optional and use sensible defaults when not specified.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `use_cache` | `bool` | `True` | Enable caching of extraction results to improve performance on repeated extractions |
| `enable_quality_processing` | `bool` | `True` | Enable quality post-processing to clean and normalize extracted text |
| `ocr` | `OcrConfig \| None` | `None` | OCR configuration for extracting text from images. None = OCR disabled |
| `force_ocr` | `bool` | `False` | Force OCR processing even for searchable PDFs that contain extractable text |
| `chunking` | `ChunkingConfig \| None` | `None` | Text chunking configuration for dividing content into manageable chunks. None = disabled |
| `images` | `ImageExtractionConfig \| None` | `None` | Image extraction configuration for extracting images FROM documents. None = no extraction |
| `pdf_options` | `PdfConfig \| None` | `None` | PDF-specific options like password handling and metadata extraction |
| `token_reduction` | `TokenReductionConfig \| None` | `None` | Token reduction configuration for reducing token count in extracted content |
| `language_detection` | `LanguageDetectionConfig \| None` | `None` | Language detection configuration for identifying document language(s) |
| `keywords` | `KeywordConfig \| None` | `None` | Keyword extraction configuration for identifying important terms and phrases |
| `postprocessor` | `PostProcessorConfig \| None` | `None` | Post-processor configuration for custom text processing |
| `max_concurrent_extractions` | `int \| None` | `num_cpus * 2` | Maximum concurrent extractions in batch operations |
| `html_options` | `HtmlConversionOptions \| None` | `None` | HTML conversion options for converting documents to markdown |
| `pages` | `PageConfig \| None` | `None` | Page extraction configuration for tracking page boundaries |
| `security_limits` | `dict[str, int] \| None` | `None` | Security limits configuration |
| `result_format` | `str` | `"unified"` | Result format: "unified" or "element_based" |
| `output_format` | `str` | `"plain"` | Output content format: "plain", "markdown", "djot", or "html" |

Example:

from kreuzberg import ExtractionConfig, ChunkingConfig, OcrConfig

# Basic extraction with defaults
config = ExtractionConfig()

# Enable chunking with 512-char chunks and 100-char overlap
config = ExtractionConfig(chunking=ChunkingConfig(max_chars=512, max_overlap=100))

# Enable OCR with Tesseract
config = ExtractionConfig(ocr=OcrConfig(backend="tesseract", language="eng"))

# Multiple options
config = ExtractionConfig(
    use_cache=True,
    enable_quality_processing=True,
    output_format="markdown",
    result_format="unified"
)

FileExtractionConfig

Per-file extraction overrides for batch operations. All fields optional (None = use batch default).

Key fields: enable_quality_processing, ocr, force_ocr, chunking, images, pdf_options, token_reduction, language_detection, pages, keywords, postprocessor, html_options, result_format, output_format, include_document_structure, layout.

Excluded (batch-level only): max_concurrent_extractions, use_cache, acceleration, security_limits.

per_file = FileExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(backend="tesseract", language="deu"),
)

OcrConfig

OCR configuration for extracting text from images.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `backend` | `str` | `"tesseract"` | OCR backend: "tesseract", "easyocr", or "paddleocr" |
| `language` | `str` | `"eng"` | Language code (ISO 639-3 three-letter: "eng", "deu", "fra" or ISO 639-1 two-letter: "en", "de", "fr") |
| `tesseract_config` | `TesseractConfig \| None` | `None` | Tesseract-specific configuration (only used when backend="tesseract") |

Example:

from kreuzberg import OcrConfig

# Tesseract with German language
config = OcrConfig(backend="tesseract", language="deu")

# EasyOCR for faster recognition
config = OcrConfig(backend="easyocr", language="eng")

# PaddleOCR for production deployments
config = OcrConfig(backend="paddleocr", language="chi_sim")

TesseractConfig

Tesseract-specific OCR configuration, used to tune recognition for particular document types and scan quality levels.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `language` | `str` | `"eng"` | OCR language (ISO 639-3 three-letter code) |
| `psm` | `int` | `3` | Page Segmentation Mode: 0 (detection only), 3 (auto), 6 (uniform block), 11 (sparse text) |
| `output_format` | `str` | `"markdown"` | Output format for OCR results |
| `oem` | `int` | `3` | OCR Engine Mode: 0 (legacy), 1 (LSTM), 2 (both), 3 (auto) |
| `min_confidence` | `float` | `0.0` | Minimum confidence threshold (0.0-1.0) for accepting OCR results |
| `preprocessing` | `ImagePreprocessingConfig \| None` | `None` | Image preprocessing configuration before OCR |
| `enable_table_detection` | `bool` | `True` | Enable automatic table detection and extraction |
| `table_min_confidence` | `float` | `0.0` | Minimum confidence for table detection (0.0-1.0) |
| `table_column_threshold` | `int` | `50` | Minimum pixel width between columns |
| `table_row_threshold_ratio` | `float` | `0.5` | Minimum row height ratio |
| `use_cache` | `bool` | `True` | Cache OCR results for improved performance |
| `classify_use_pre_adapted_templates` | `bool` | `True` | Use pre-adapted character templates |
| `language_model_ngram_on` | `bool` | `False` | Enable language model n-gram processing |
| `tessedit_dont_blkrej_good_wds` | `bool` | `True` | Don't block-reject good words |
| `tessedit_dont_rowrej_good_wds` | `bool` | `True` | Don't row-reject good words |
| `tessedit_enable_dict_correction` | `bool` | `True` | Enable dictionary-based spelling correction |
| `tessedit_char_whitelist` | `str` | `""` | Whitelist of characters to recognize (empty = all) |
| `tessedit_char_blacklist` | `str` | `""` | Blacklist of characters to ignore |
| `tessedit_use_primary_params_model` | `bool` | `True` | Use primary parameters model |
| `textord_space_size_is_variable` | `bool` | `True` | Allow variable space sizes |
| `thresholding_method` | `bool` | `False` | Thresholding method for binarization |

Example:

from kreuzberg import TesseractConfig, ImagePreprocessingConfig

# General document OCR
config = TesseractConfig(psm=3, oem=3)

# Invoice/form OCR with table detection
config = TesseractConfig(psm=6, oem=2, enable_table_detection=True, min_confidence=0.6)

# High-precision technical document OCR
config = TesseractConfig(
    psm=3,
    oem=2,
    preprocessing=ImagePreprocessingConfig(denoise=True, contrast_enhance=True, auto_rotate=True),
    min_confidence=0.7,
    tessedit_enable_dict_correction=True,
)

# Numeric-only OCR (for receipts, barcodes)
config = TesseractConfig(psm=6, tessedit_char_whitelist="0123456789.-,", min_confidence=0.8)

# Multiple language document
config = TesseractConfig(language="eng+fra+deu", psm=3, oem=2)

ChunkingConfig

Text chunking configuration for dividing content into chunks. Chunking is useful for preparing content for embedding, indexing, or processing with length-limited systems (like LLM context windows).

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `max_chars` | `int` | `1000` | Maximum number of characters per chunk. Chunks larger than this will be split intelligently at sentence/paragraph boundaries |
| `max_overlap` | `int` | `200` | Overlap between consecutive chunks in characters. Creates context bridges between chunks for smoother processing |
| `embedding` | `EmbeddingConfig \| None` | `None` | Configuration for generating embeddings for each chunk using ONNX models. None = no embeddings |
| `preset` | `str \| None` | `None` | Use a preset chunking configuration (overrides individual settings if provided). Use list_embedding_presets() to see available presets |

IMPORTANT: The fields are max_chars and max_overlap (NOT max_characters or overlap).

Example:

from kreuzberg import ExtractionConfig, ChunkingConfig, EmbeddingConfig, EmbeddingModelType

# Basic chunking with defaults
config = ExtractionConfig(chunking=ChunkingConfig())

# Custom chunk size with overlap
config = ExtractionConfig(chunking=ChunkingConfig(max_chars=512, max_overlap=100))

# Chunking with embeddings
config = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=512,
        embedding=EmbeddingConfig(model=EmbeddingModelType.preset("balanced"))
    )
)

# Using preset configuration
config = ExtractionConfig(chunking=ChunkingConfig(preset="semantic"))
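
To make the relationship between max_chars and max_overlap concrete, here is a deliberately naive sliding-window sketch. It is not Kreuzberg's algorithm (which, per the table above, splits at sentence/paragraph boundaries); `sliding_chunks` is a hypothetical function that only shows how the two numbers interact.

```python
def sliding_chunks(text: str, max_chars: int, max_overlap: int) -> list:
    """Cut `text` into windows of at most `max_chars`, sharing `max_overlap` chars."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be non-negative and smaller than max_chars")

    step = max_chars - max_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", max_chars=10, max_overlap=3)
# chunks == ["abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"]
```

Each chunk starts with the last three characters of the previous one, which is the "context bridge" role that max_overlap plays.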

PdfConfig

PDF-specific extraction configuration.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `extract_images` | `bool` | `False` | Extract images from PDF documents |
| `passwords` | `list[str] \| None` | `None` | List of passwords to try when opening encrypted PDFs. Each password is tried in order until one succeeds |
| `extract_metadata` | `bool` | `True` | Extract PDF metadata (title, author, creation date, etc.) |
| `hierarchy` | `HierarchyConfig \| None` | `None` | Document hierarchy detection configuration. None = no hierarchy detection |

Example:

from kreuzberg import ExtractionConfig, PdfConfig, HierarchyConfig

# Basic PDF configuration
config = ExtractionConfig(pdf_options=PdfConfig())

# Extract metadata and images from PDF
config = ExtractionConfig(pdf_options=PdfConfig(extract_images=True, extract_metadata=True))

# Handle encrypted PDFs
config = ExtractionConfig(pdf_options=PdfConfig(passwords=["password123", "fallback_password"]))

# Enable hierarchy detection
config = ExtractionConfig(pdf_options=PdfConfig(hierarchy=HierarchyConfig(k_clusters=6)))

ImageExtractionConfig

Configuration for extracting images FROM documents. This is NOT for preprocessing images before OCR.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `extract_images` | `bool` | `True` | Enable image extraction from documents |
| `target_dpi` | `int` | `300` | Target DPI for image normalization. Images are resampled to this DPI for consistency |
| `max_image_dimension` | `int` | `4096` | Maximum width or height for extracted images. Larger images are downscaled to fit |
| `auto_adjust_dpi` | `bool` | `True` | Automatically adjust DPI based on image content quality |
| `min_dpi` | `int` | `72` | Minimum DPI threshold. Images with lower DPI are upscaled |
| `max_dpi` | `int` | `600` | Maximum DPI threshold. Images with higher DPI are downscaled |

Example:

from kreuzberg import ExtractionConfig, ImageExtractionConfig

# Basic image extraction
config = ExtractionConfig(images=ImageExtractionConfig())

# Extract images with custom DPI settings
config = ExtractionConfig(
    images=ImageExtractionConfig(target_dpi=150, max_image_dimension=2048, auto_adjust_dpi=False)
)
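
The DPI and dimension limits above can be read as simple clamping rules. The sketch below is one plausible interpretation, not the library's implementation: `clamp_dpi` and `fit_within` are hypothetical helpers, and the real resampling logic (including auto_adjust_dpi) may differ.

```python
def clamp_dpi(dpi: int, min_dpi: int = 72, max_dpi: int = 600) -> int:
    """Keep a DPI value inside the [min_dpi, max_dpi] range."""
    return max(min_dpi, min(dpi, max_dpi))


def fit_within(width: int, height: int, max_image_dimension: int = 4096) -> tuple:
    """Downscale (width, height) so the longer side fits, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_image_dimension:
        return width, height  # already small enough, leave untouched
    # Integer arithmetic avoids floating-point rounding surprises
    return (
        max(1, width * max_image_dimension // longest),
        max(1, height * max_image_dimension // longest),
    )


dpi_high = clamp_dpi(1200)       # above max_dpi
dpi_low = clamp_dpi(50)          # below min_dpi
size = fit_within(6000, 3000)    # longer side exceeds 4096
```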

EmbeddingConfig

Embedding generation configuration for text chunks. Configures embedding generation using ONNX models via fastembed-rs.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `EmbeddingModelType` | Preset "balanced" | The embedding model to use (preset, fastembed, or custom) |
| `normalize` | `bool` | `True` | Whether to normalize embedding vectors to unit length (recommended for cosine similarity) |
| `batch_size` | `int` | `32` | Number of texts to process simultaneously. Higher values use more memory but may be faster |
| `show_download_progress` | `bool` | `False` | Display progress during embedding model download |
| `cache_dir` | `str \| None` | `None` | Custom directory for caching downloaded models (defaults to ~/.cache/kreuzberg/embeddings/) |

Example:

from kreuzberg import EmbeddingConfig, EmbeddingModelType

# Basic preset embedding (recommended)
config = EmbeddingConfig()

# Specific preset with settings
config = EmbeddingConfig(
    model=EmbeddingModelType.preset("balanced"),
    normalize=True,
    batch_size=64
)

# Custom ONNX model
config = EmbeddingConfig(
    model=EmbeddingModelType.custom(model_id="sentence-transformers/all-MiniLM-L6-v2", dimensions=384)
)

# With custom cache directory
config = EmbeddingConfig(cache_dir="/path/to/model/cache")
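
The normalize option is recommended for cosine similarity because, once vectors have unit length, cosine similarity reduces to a plain dot product. A small standalone illustration of that fact (pure Python, independent of Kreuzberg; the helper names are illustrative):

```python
import math


def normalize(vector: list) -> list:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list, b: list) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = normalize(a), normalize(b)

# For unit vectors the dot product equals the cosine similarity (0.96 here)
similarity_raw = cosine_similarity(a, b)
similarity_unit = dot(unit_a, unit_b)
```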

EmbeddingModelType

Embedding model type selector with multiple configurations.

Static Methods:

@staticmethod
def preset(name: str) -> EmbeddingModelType

Use a preset configuration (recommended for most use cases). Available presets: balanced, compact, large.

@staticmethod
def fastembed(model: str, dimensions: int) -> EmbeddingModelType

Use a specific fastembed model by name.

@staticmethod
def custom(model_id: str, dimensions: int) -> EmbeddingModelType

Use a custom ONNX model from HuggingFace (e.g., sentence-transformers/*).

Example:

from kreuzberg import EmbeddingModelType, list_embedding_presets

# Using the balanced preset (recommended for general use)
model = EmbeddingModelType.preset("balanced")

# Using a specific fastembed model
model = EmbeddingModelType.fastembed(model="BAAI/bge-small-en-v1.5", dimensions=384)

# Using a custom HuggingFace model
model = EmbeddingModelType.custom(
    model_id="sentence-transformers/all-MiniLM-L6-v2",
    dimensions=384
)

# Listing available presets
presets = list_embedding_presets()
print(f"Available presets: {presets}")

TokenReductionConfig

Configuration for reducing the token count of extracted content, which lowers costs when the text is sent to LLM APIs.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `mode` | `str` | `"off"` | Token reduction mode: "off", "light", "moderate", "aggressive", or "maximum" |
| `preserve_important_words` | `bool` | `True` | Preserve capitalized words, technical terms, and proper nouns even in aggressive reduction modes |

Example:

from kreuzberg import ExtractionConfig, TokenReductionConfig

# Moderate token reduction
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="moderate", preserve_important_words=True)
)

# Maximum reduction for large batches
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="maximum", preserve_important_words=True)
)

# No reduction (default)
config = ExtractionConfig(
    token_reduction=TokenReductionConfig(mode="off")
)

LanguageDetectionConfig

Configuration for detecting document language(s).

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | `bool` | `True` | Enable language detection for extracted content |
| `min_confidence` | `float` | `0.8` | Minimum confidence threshold (0.0-1.0) for language detection |
| `detect_multiple` | `bool` | `False` | Detect multiple languages in the document. When False, only the most confident language is returned |

Example:

from kreuzberg import ExtractionConfig, LanguageDetectionConfig, extract_file_sync

# Basic language detection
config = ExtractionConfig(language_detection=LanguageDetectionConfig())

# Detect multiple languages with lower confidence threshold
config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(detect_multiple=True, min_confidence=0.6)
)

# Access detected languages in result
result = extract_file_sync("multilingual.pdf", config=config)
print(f"Languages: {result.detected_languages}")

KeywordConfig

Keyword extraction configuration.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `algorithm` | `KeywordAlgorithm` | - | Keyword extraction algorithm (KeywordAlgorithm.Yake or KeywordAlgorithm.Rake) |
| `max_keywords` | `int` | `10` | Maximum number of keywords to extract |
| `min_score` | `float` | `0.0` | Minimum score threshold |
| `ngram_range` | `tuple[int, int]` | `(1, 3)` | N-gram range for keyword extraction |
| `language` | `str \| None` | `"en"` | Optional language hint |
| `yake_params` | `YakeParams \| None` | `None` | YAKE-specific tuning parameters |
| `rake_params` | `RakeParams \| None` | `None` | RAKE-specific tuning parameters |

PageConfig

Page extraction and tracking configuration.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `extract_pages` | `bool` | `False` | Enable page tracking and per-page extraction |
| `insert_page_markers` | `bool` | `False` | Insert page markers into content |
| `marker_format` | `str` | `"\n\n<!-- PAGE {page_num} -->\n\n"` | Marker template containing {page_num} |

Example:

from kreuzberg import ExtractionConfig, PageConfig

config = ExtractionConfig(pages=PageConfig(extract_pages=True))

PostProcessorConfig

Configuration for post-processors in the extraction pipeline.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | `bool` | `True` | Enable post-processors in the extraction pipeline |
| `enabled_processors` | `list[str] \| None` | `None` | Whitelist of processor names to run. None = run all enabled |
| `disabled_processors` | `list[str] \| None` | `None` | Blacklist of processor names to skip. None = none disabled |

Example:

from kreuzberg import ExtractionConfig, PostProcessorConfig

# Basic post-processing with defaults
config = ExtractionConfig(postprocessor=PostProcessorConfig())

# Enable only specific processors
config = ExtractionConfig(
    postprocessor=PostProcessorConfig(
        enabled=True,
        enabled_processors=["normalize_whitespace", "fix_encoding"]
    )
)

# Disable specific processors
config = ExtractionConfig(
    postprocessor=PostProcessorConfig(
        enabled=True,
        disabled_processors=["experimental_cleanup"]
    )
)

# Disable all post-processing
config = ExtractionConfig(postprocessor=PostProcessorConfig(enabled=False))
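
How the three PostProcessorConfig fields combine can be sketched as a small filter. This is an illustration under an assumed precedence (master switch first, then whitelist, then blacklist); `select_processors` is hypothetical and the actual pipeline may resolve conflicts differently.

```python
from typing import Optional, Sequence


def select_processors(
    registered: Sequence[str],
    enabled: bool = True,
    enabled_processors: Optional[Sequence[str]] = None,
    disabled_processors: Optional[Sequence[str]] = None,
) -> list:
    """Return the processor names that would run, in registration order."""
    if not enabled:
        return []  # master switch off: nothing runs
    selected = list(registered)
    if enabled_processors is not None:
        allowed = set(enabled_processors)
        selected = [name for name in selected if name in allowed]
    if disabled_processors is not None:
        blocked = set(disabled_processors)
        selected = [name for name in selected if name not in blocked]
    return selected


registered = ["normalize_whitespace", "fix_encoding", "experimental_cleanup"]
only_encoding = select_processors(registered, enabled_processors=["fix_encoding"])
without_experimental = select_processors(registered, disabled_processors=["experimental_cleanup"])
nothing = select_processors(registered, enabled=False)
```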

ImagePreprocessingConfig

Configuration for preprocessing images before OCR. This is NOT for extracting images from documents.

Attributes:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `target_dpi` | `int` | `300` | Target DPI for image normalization before OCR |
| `auto_rotate` | `bool` | `True` | Automatically detect and correct image rotation |
| `deskew` | `bool` | `True` | Correct skewed images to improve OCR accuracy |
| `denoise` | `bool` | `False` | Apply denoising filters to reduce noise in images |
| `contrast_enhance` | `bool` | `False` | Enhance contrast to improve text readability |
| `binarization_method` | `str` | `"otsu"` | Method for converting images to black and white |
| `invert_colors` | `bool` | `False` | Invert colors (white text on black background) |

Example:

from kreuzberg import TesseractConfig, ImagePreprocessingConfig

# Basic preprocessing for OCR
config = TesseractConfig(preprocessing=ImagePreprocessingConfig())

# Aggressive preprocessing for low-quality scans
config = TesseractConfig(
    preprocessing=ImagePreprocessingConfig(
        target_dpi=300,
        denoise=True,
        contrast_enhance=True,
        auto_rotate=True,
        deskew=True
    )
)

ExtractionResult

Result object returned by extraction functions.

Attributes:

| Field | Type | Description |
| --- | --- | --- |
| `content` | `str` | Main extracted text content in the specified output_format |
| `mime_type` | `str` | MIME type of the processed document |
| `metadata` | `Metadata` | Extracted document metadata (title, author, created_at, format_type, etc.) |
| `tables` | `list[ExtractedTable]` | Extracted tables from the document |
| `detected_languages` | `list[str] \| None` | Detected language codes (e.g., ["en", "de"]) if language detection is enabled |
| `chunks` | `list[Chunk] \| None` | Text chunks if chunking is enabled (each chunk has content, embedding, metadata) |
| `images` | `list[ExtractedImage] \| None` | Extracted images if image extraction is enabled |
| `pages` | `list[PageContent] \| None` | Per-page content and metadata if page extraction is enabled |
| `elements` | `list[Element] \| None` | Semantic elements if result_format="element_based" |
| `output_format` | `str \| None` | Format of the content field (plain, markdown, djot, html) |
| `result_format` | `str \| None` | Result format used (unified or element_based) |
| `extracted_keywords` | `list[ExtractedKeyword] \| None` | Extracted keywords with relevance scores if keyword extraction is enabled |
| `quality_score` | `float \| None` | Overall quality score for the extraction result (0.0-1.0) |
| `processing_warnings` | `list[ProcessingWarning]` | Non-fatal warnings encountered during the extraction pipeline |

Methods:

def get_page_count(self) -> int

Get the total number of pages in the document.

def get_chunk_count(self) -> int

Get the total number of chunks if chunking is enabled.

def get_detected_language(self) -> str | None

Get the most confident detected language code.

def get_metadata_field(self, field_name: str) -> Any | None

Get a specific metadata field by name.

Example:

from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=512),
    output_format="markdown"
)
result = extract_file_sync("document.pdf", config=config)

print(f"Content preview: {result.content[:200]}")
print(f"MIME type: {result.mime_type}")
print(f"Page count: {result.get_page_count()}")
print(f"Chunk count: {result.get_chunk_count()}")
print(f"Detected language: {result.get_detected_language()}")

if result.tables:
    print(f"Found {len(result.tables)} tables")

if result.chunks:
    first_chunk = result.chunks[0]
    print(f"First chunk: {first_chunk.content[:100]}")
    if first_chunk.embedding:
        print(f"Embedding dimensions: {len(first_chunk.embedding)}")

Error Classes

All exceptions inherit from KreuzbergError, the base exception class.

KreuzbergError

Base exception class for all Kreuzberg errors.

class KreuzbergError(Exception):
    """Base exception for all Kreuzberg errors."""

ParsingError

Raised when document parsing fails.

class ParsingError(KreuzbergError):
    """Document parsing failed (corrupt, malformed, etc.)."""

OCRError

Raised when OCR processing fails.

class OCRError(KreuzbergError):
    """OCR operation failed."""

ValidationError

Raised when validation fails.

class ValidationError(KreuzbergError):
    """Validation failed (invalid parameters, constraints, format mismatches)."""

MissingDependencyError

Raised when required dependencies are not available.

class MissingDependencyError(KreuzbergError):
    """Required dependency not available (easyocr, paddleocr, tesseract, etc.)."""

    @staticmethod
    def create_for_package(dependency_group: str, functionality: str, package_name: str) -> MissingDependencyError

Example:

from kreuzberg import extract_file_sync, MissingDependencyError, OCRError, ParsingError

try:
    result = extract_file_sync("document.pdf")
except ParsingError as e:
    print(f"Failed to parse document: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
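
Because every exception derives from KreuzbergError, handlers should list specific subclasses first and the base class last as a catch-all. The sketch below uses local stand-in classes that mirror the hierarchy above so it runs without Kreuzberg installed; `describe_failure` and `fail_with` are hypothetical helpers.

```python
# Local stand-ins mirroring the documented hierarchy (not the real classes)
class KreuzbergError(Exception):
    pass


class ParsingError(KreuzbergError):
    pass


class OCRError(KreuzbergError):
    pass


def describe_failure(action) -> str:
    """Run `action` and report which handler caught its exception, if any."""
    try:
        action()
    except ParsingError as exc:      # most specific handler first
        return f"parse: {exc}"
    except KreuzbergError as exc:    # base class catches every other subclass
        return f"kreuzberg: {exc}"
    return "ok"


def fail_with(exc: Exception):
    """Build a zero-argument callable that raises `exc`."""
    def action() -> None:
        raise exc
    return action


parse_report = describe_failure(fail_with(ParsingError("bad xref table")))
ocr_report = describe_failure(fail_with(OCRError("no text found")))
ok_report = describe_failure(lambda: None)
```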

Utility Functions

MIME Type Detection

def detect_mime_type(data: bytes | bytearray) -> str

Detect MIME type from file bytes using magic number detection.

Parameters:

  • data (bytes | bytearray): File content as bytes or bytearray

Returns: Detected MIME type (e.g., "application/pdf", "image/png")

def detect_mime_type_from_path(path: str | Path) -> str

Detect the MIME type of the file at the given path by reading its contents.

Parameters:

  • path (str | Path): Path to the file

Returns: Detected MIME type

Raises:

  • OSError: If file cannot be read (file not found, permission denied, etc.)
  • RuntimeError: If MIME type detection fails

Example:

from kreuzberg import detect_mime_type, detect_mime_type_from_path

# From bytes
pdf_bytes = b"%PDF-1.4\n"
mime_type = detect_mime_type(pdf_bytes)

# From path
mime_type = detect_mime_type_from_path("document.pdf")
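
Magic number detection means comparing the first bytes of a file against known signatures. The following is a simplified sketch of that idea with a handful of well-known signatures; it is not Kreuzberg's detector, and `sniff_mime_type` and `SIGNATURES` are illustrative names.

```python
# (signature bytes, MIME type) pairs for a few common formats
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mime_type(data, default: str = "application/octet-stream") -> str:
    """Return the MIME type whose signature prefixes `data`, else `default`."""
    head = bytes(data[:16])  # accepts bytes or bytearray
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return default


pdf_type = sniff_mime_type(b"%PDF-1.4\n")
png_type = sniff_mime_type(bytearray(b"\x89PNG\r\n\x1a\n\x00\x00"))
unknown_type = sniff_mime_type(b"hello world")
```

ZIP-based formats such as DOCX and XLSX share the same leading bytes, so a real detector has to inspect the archive contents as well; that is one reason to rely on detect_mime_type rather than a hand-rolled table.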

MIME Type Validation

def validate_mime_type(mime_type: str) -> str

Validate a MIME type string and return the canonical form.

def get_extensions_for_mime(mime_type: str) -> list[str]

Get file extensions associated with a MIME type.

Example:

from kreuzberg import validate_mime_type, get_extensions_for_mime

canonical = validate_mime_type("application/pdf")
extensions = get_extensions_for_mime("application/pdf")  # Returns ["pdf"]

Configuration Loading

def load_extraction_config_from_file(path: str | Path) -> ExtractionConfig

Load extraction configuration from a specific file.

Parameters:

  • path (str | Path): Path to the configuration file (.toml, .yaml, or .json)

Returns: ExtractionConfig parsed from the file

Raises:

  • FileNotFoundError: If the configuration file does not exist
  • RuntimeError: If the file cannot be read or parsed
  • ValueError: If the file format is invalid or unsupported

def discover_extraction_config() -> ExtractionConfig | None

Discover extraction configuration from the environment (deprecated).

Attempts to locate a Kreuzberg configuration file using:

  1. KREUZBERG_CONFIG_PATH environment variable
  2. Search for kreuzberg.toml, kreuzberg.yaml, or kreuzberg.json in current and parent directories

Returns: ExtractionConfig if found, None otherwise

Note: Deprecated in favor of load_extraction_config_from_file for more predictable behavior.

Example:

from kreuzberg import load_extraction_config_from_file, extract_file_sync

# Load from specific file
config = load_extraction_config_from_file("kreuzberg.toml")
result = extract_file_sync("document.pdf", config=config)

# Auto-discover configuration (deprecated)
import os
from kreuzberg import discover_extraction_config

os.environ["KREUZBERG_CONFIG_PATH"] = "config/kreuzberg.yaml"
config = discover_extraction_config()  # None if no configuration file is found
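
The discovery order described above (environment variable first, then a walk from the current directory up through its parents) can be sketched as follows. `find_config_path` is a hypothetical function that only locates a path; the order in which the three file names are tried within one directory is an assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional

CONFIG_NAMES = ("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json")


def find_config_path(start: Path, env: Optional[Mapping] = None) -> Optional[Path]:
    """Locate a config file: env var first, then `start` and its parents."""
    env = os.environ if env is None else env
    explicit = env.get("KREUZBERG_CONFIG_PATH")
    if explicit:
        return Path(explicit)
    start = start.resolve()
    for directory in (start, *start.parents):  # nearest directory wins
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "kreuzberg.toml").write_text("# placeholder\n")
    nested = root / "project" / "docs"
    nested.mkdir(parents=True)

    found = find_config_path(nested, env={})
    found_in_root = found is not None and found.parent == root
    found_name = found.name if found is not None else None
    from_env = find_config_path(nested, env={"KREUZBERG_CONFIG_PATH": "config/kreuzberg.yaml"})
```

The parent-directory walk is what makes discovery less predictable than load_extraction_config_from_file: the file that wins depends on where the process happens to be started.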

Plugin System

Registering Post-Processors

def register_post_processor(processor: Any) -> None

Register a Python PostProcessor with the Rust core. Once registered, the processor will be called automatically after extraction to enrich results.

Required Methods:

  • name() -> str: Return processor name (must be non-empty)
  • process(result: ExtractionResult) -> ExtractionResult: Process and enrich the extraction result
  • processing_stage() -> str: Return "early", "middle", or "late"

Optional Methods:

  • initialize() -> None: Called when processor is registered
  • shutdown() -> None: Called when processor is unregistered

Example:

from kreuzberg import register_post_processor, ExtractionResult

class EntityExtractor:
    def name(self) -> str:
        return "entity_extraction"

    def processing_stage(self) -> str:
        return "early"

    def process(self, result: ExtractionResult) -> ExtractionResult:
        entities = {"PERSON": ["John Doe"], "ORG": ["Microsoft"]}
        result.metadata["entities"] = entities
        return result

register_post_processor(EntityExtractor())
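
The value returned by processing_stage() determines where a processor sits in the pipeline. A minimal sketch of such ordering, assuming that "early" runs before "middle" and "middle" before "late", and that registration order is kept within a stage; `order_by_stage` and `StubProcessor` are hypothetical.

```python
STAGE_ORDER = {"early": 0, "middle": 1, "late": 2}


class StubProcessor:
    """Tiny stand-in exposing the name() and processing_stage() methods."""

    def __init__(self, name: str, stage: str) -> None:
        self._name = name
        self._stage = stage

    def name(self) -> str:
        return self._name

    def processing_stage(self) -> str:
        return self._stage


def order_by_stage(processors: list) -> list:
    """Sort processors by stage; sorted() is stable, so ties keep their order."""
    for processor in processors:
        if processor.processing_stage() not in STAGE_ORDER:
            raise ValueError(f"unknown stage for {processor.name()!r}")
    return sorted(processors, key=lambda p: STAGE_ORDER[p.processing_stage()])


ordered = order_by_stage([
    StubProcessor("summary", "late"),
    StubProcessor("entities", "early"),
    StubProcessor("cleanup", "middle"),
    StubProcessor("dates", "early"),
])
names = [p.name() for p in ordered]
```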

Registering OCR Backends

def register_ocr_backend(backend: Any) -> None

Register a Python OCR backend with the Rust core.

Required Methods:

  • name() -> str: Return backend name (must be non-empty)
  • supported_languages() -> list[str]: Return list of supported language codes
  • process_image(image_bytes: bytes, language: str) -> OcrResult: Process image and return OCR result
  • process_file(path: str, language: str) -> OcrResult: Process file and return OCR result
  • initialize() -> None: Called when backend is registered
  • shutdown() -> None: Called when backend is unregistered
  • version() -> str: Return backend version string

Example:

from kreuzberg import register_ocr_backend

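class MyOcrBackend:
    def name(self) -> str:
        return "my-ocr"

    def version(self) -> str:
        return "0.1.0"

    def supported_languages(self) -> list[str]:
        return ["eng", "deu", "fra"]

    def initialize(self) -> None:
        pass  # Load models or other resources here

    def shutdown(self) -> None:
        pass  # Release resources here

    def process_image(self, image_bytes: bytes, language: str) -> dict:
        # Placeholder result; a real backend would run OCR on image_bytes
        return {
            "content": "extracted text",
            "metadata": {"confidence": 0.95},
            "tables": []
        }

    def process_file(self, path: str, language: str) -> dict:
        with open(path, "rb") as f:
            return self.process_image(f.read(), language)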

register_ocr_backend(MyOcrBackend())
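
The example above implements only `name`, `supported_languages`, and `process_image`. As a rough sketch, a backend that also provides `process_file`, `initialize`, `shutdown`, and `version` might look like the class below. `EchoOcrBackend` and its `ready` flag are hypothetical names for illustration, the returned text is a placeholder rather than real OCR output, and the class is exercised directly here instead of being passed to `register_ocr_backend`.

```python
class EchoOcrBackend:
    """Hypothetical backend that also implements the lifecycle and file hooks."""

    def __init__(self) -> None:
        self.ready = False

    def name(self) -> str:
        return "echo-ocr"

    def version(self) -> str:
        return "0.1.0"

    def supported_languages(self) -> list[str]:
        return ["eng"]

    def initialize(self) -> None:
        # Load models or open connections here.
        self.ready = True

    def shutdown(self) -> None:
        # Release whatever initialize() acquired.
        self.ready = False

    def process_image(self, image_bytes: bytes, language: str) -> dict:
        if language not in self.supported_languages():
            raise ValueError(f"unsupported language: {language}")
        return {
            "content": f"{len(image_bytes)} bytes seen",  # placeholder text
            "metadata": {"language": language},
            "tables": [],
        }

    def process_file(self, path: str, language: str) -> dict:
        # Delegate to process_image so both entry points share one code path.
        with open(path, "rb") as handle:
            return self.process_image(handle.read(), language)


backend = EchoOcrBackend()
backend.initialize()
result = backend.process_image(b"\x89PNG", "eng")
backend.shutdown()
print(result["content"])  # 4 bytes seen
```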

Registering Validators

def register_validator(validator: Any) -> None

Register a Python Validator with the Rust core. Validators are called automatically after extraction to validate results.

Required Methods:

  • name() -> str: Return validator name (must be non-empty)
  • validate(result: ExtractionResult) -> None: Validate the extraction result (raise error to fail)

Optional Methods:

  • should_validate(result: ExtractionResult) -> bool: Check if validator should run (defaults to True)
  • priority() -> int: Return priority (defaults to 50, higher runs first)

Example:

from kreuzberg import register_validator, ValidationError, ExtractionResult

class MinLengthValidator:
    def name(self) -> str:
        return "min_length_validator"

    def priority(self) -> int:
        return 100

    def validate(self, result: ExtractionResult) -> None:
        if len(result.content) < 100:
            raise ValidationError("Content too short")

register_validator(MinLengthValidator())
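
To make the documented defaults concrete (priority 50 when `priority()` is absent, higher priority first, and `should_validate` treated as True when absent), here is a small plain-Python runner. It is only an illustration of those rules, not the Rust core's implementation. `run_validators`, `NonEmpty`, and `PdfOnly` are hypothetical, and a dict stands in for `ExtractionResult`.

```python
from typing import Any


class NonEmpty:
    def name(self) -> str:
        return "non_empty"

    def priority(self) -> int:
        return 100

    def validate(self, result: dict[str, Any]) -> None:
        if not result["content"]:
            raise ValueError("content is empty")


class PdfOnly:
    # No priority() method, so the runner falls back to the documented default of 50.
    def name(self) -> str:
        return "pdf_only"

    def should_validate(self, result: dict[str, Any]) -> bool:
        return result.get("mime_type") == "application/pdf"

    def validate(self, result: dict[str, Any]) -> None:
        pass


def run_validators(validators: list[Any], result: dict[str, Any]) -> list[str]:
    """Run validators from highest to lowest priority; return the names that ran."""

    def priority_of(validator: Any) -> int:
        return validator.priority() if hasattr(validator, "priority") else 50

    ran = []
    for validator in sorted(validators, key=priority_of, reverse=True):
        if hasattr(validator, "should_validate") and not validator.should_validate(result):
            continue  # validator opted out for this result
        validator.validate(result)
        ran.append(validator.name())
    return ran


ran = run_validators([PdfOnly(), NonEmpty()], {"content": "hello", "mime_type": "text/plain"})
print(ran)  # ['non_empty']
```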

Plugin Management Functions

def list_post_processors() -> list[str]

List names of all registered post-processors.

def list_validators() -> list[str]

List names of all registered validators.

def list_ocr_backends() -> list[str]

List names of all available OCR backends.

def unregister_post_processor(name: str) -> None

Unregister a post-processor by name.

def unregister_validator(name: str) -> None

Unregister a validator by name.

def unregister_ocr_backend(name: str) -> None

Unregister an OCR backend by name.

def clear_post_processors() -> None

Clear all registered post-processors.

def clear_validators() -> None

Clear all registered validators.

def clear_ocr_backends() -> None

Clear all registered OCR backends.
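
Because registration is global, tests and short-lived scripts may want to guarantee that a plugin is removed again. One possible pattern is a context manager that pairs a register call with the matching unregister call. `temporary_plugin` below is a hypothetical helper, not part of Kreuzberg, and the `fake_*` functions are in-memory stand-ins so the sketch runs on its own.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def temporary_plugin(
    plugin: Any,
    register: Callable[[Any], None],
    unregister: Callable[[str], None],
) -> Iterator[Any]:
    """Register `plugin` on entry and unregister it by name on exit."""
    register(plugin)
    try:
        yield plugin
    finally:
        unregister(plugin.name())


# In-memory stand-ins for the registry functions (assumption for this sketch).
_registry: dict[str, Any] = {}


def fake_register(plugin: Any) -> None:
    _registry[plugin.name()] = plugin


def fake_unregister(name: str) -> None:
    _registry.pop(name, None)


def fake_list() -> list[str]:
    return sorted(_registry)


class Upper:
    def name(self) -> str:
        return "upper"


with temporary_plugin(Upper(), fake_register, fake_unregister):
    inside = fake_list()
after = fake_list()
print(inside, after)  # ['upper'] []
```

With Kreuzberg installed, the same helper would presumably be called with a matching pair such as `register_validator` and `unregister_validator`.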

Format Enums

OutputFormat

Output format for extraction results.

class OutputFormat(str, Enum):
    PLAIN = "plain"         # Plain text format
    MARKDOWN = "markdown"   # Markdown format
    DJOT = "djot"           # Djot lightweight markup format
    HTML = "html"           # HTML format

ResultFormat

Result format controlling extraction output structure.

class ResultFormat(str, Enum):
    UNIFIED = "unified"               # All content in `content` field
    ELEMENT_BASED = "element_based"   # Unstructured-compatible output with semantic elements
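
Both enums inherit from `str`, so members compare equal to their string values and can be looked up from a plain string. The snippet below redeclares `OutputFormat` locally, copied from the definition above, so that it runs without Kreuzberg installed.

```python
from enum import Enum


class OutputFormat(str, Enum):  # local copy of the documented definition
    PLAIN = "plain"
    MARKDOWN = "markdown"
    DJOT = "djot"
    HTML = "html"


fmt = OutputFormat("markdown")          # look up a member by its string value
print(fmt is OutputFormat.MARKDOWN)     # True
print(fmt == "markdown")                # True, because members are also str
print([f.value for f in OutputFormat])  # ['plain', 'markdown', 'djot', 'html']

try:
    OutputFormat("pdf")                 # not a defined member
except ValueError as exc:
    print(f"rejected: {exc}")
```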

Error Handling

Error Code Functions

def get_last_error_code() -> int

Get the last error code from the FFI layer.

Returns:

  • 0 (SUCCESS): No error occurred
  • 1 (GENERIC_ERROR): Generic unspecified error
  • 2 (PANIC): A panic occurred in the Rust core
  • 3 (INVALID_ARGUMENT): Invalid argument provided
  • 4 (IO_ERROR): I/O operation failed
  • 5 (PARSING_ERROR): Document parsing failed
  • 6 (OCR_ERROR): OCR operation failed
  • 7 (MISSING_DEPENDENCY): Required dependency not available

def get_error_details() -> dict[str, Any]

Get detailed error information from the FFI layer.

Returns: dict with keys:

  • message (str): Human-readable error message
  • error_code (int): Numeric error code (0-7)
  • error_type (str): Error type name (e.g., "validation", "ocr")
  • source_file (str | None): Source file path if available
  • source_function (str | None): Function name if available
  • source_line (int): Line number (0 if unknown)
  • context_info (str | None): Additional context if available
  • is_panic (bool): Whether error came from a panic

def classify_error(message: str) -> int

Classify an error message into a Kreuzberg error code.

Parameters:

  • message (str): The error message to classify

Returns: int error code (0-7) representing the classification

def error_code_name(code: int) -> str

Get the human-readable name of an error code.

Parameters:

  • code (int): Numeric error code (0-7)

Returns: Human-readable error code name (e.g., "validation", "ocr")

Example:

from kreuzberg import (
    classify_error,
    error_code_name,
    extract_file_sync,
    get_error_details,
    get_last_error_code,
)

try:
    result = extract_file_sync("document.pdf")
except Exception as e:
    code = get_last_error_code()
    if code:
        print(f"Error code: {code} ({error_code_name(code)})")

    details = get_error_details()
    print(f"Error: {details['message']}")
    print(f"Type: {details['error_type']}")

    classified = classify_error(str(e))
    print(f"Classified as: {error_code_name(classified)}")
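
When the numeric codes returned by `get_last_error_code()` appear in logs or branches, a local enum can keep them readable. The `FfiErrorCode` class and `describe` helper below are hypothetical conveniences that simply mirror the code list documented for `get_last_error_code()`. They are not exported by Kreuzberg.

```python
from enum import IntEnum


class FfiErrorCode(IntEnum):
    """Local mirror of the codes documented for get_last_error_code()."""

    SUCCESS = 0
    GENERIC_ERROR = 1
    PANIC = 2
    INVALID_ARGUMENT = 3
    IO_ERROR = 4
    PARSING_ERROR = 5
    OCR_ERROR = 6
    MISSING_DEPENDENCY = 7


def describe(code: int) -> str:
    """Return the symbolic name for a code, or a marker for unknown values."""
    try:
        return FfiErrorCode(code).name
    except ValueError:
        return f"UNKNOWN({code})"


print(describe(6))   # OCR_ERROR
print(describe(42))  # UNKNOWN(42)
```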

Validation Functions

Parameter Validation

def validate_chunking_params(max_chars: int, max_overlap: int) -> bool

Validate chunking parameters.

def validate_confidence(confidence: float) -> bool

Validate confidence value (0.0-1.0).

def validate_dpi(dpi: int) -> bool

Validate DPI value.

def validate_tesseract_psm(psm: int) -> bool

Validate Tesseract Page Segmentation Mode.

def validate_tesseract_oem(oem: int) -> bool

Validate Tesseract OCR Engine Mode.

def validate_ocr_backend(backend: str) -> bool

Validate OCR backend name.

def validate_language_code(code: str) -> bool

Validate language code format.

def validate_token_reduction_level(level: str) -> bool

Validate token reduction level.

def validate_output_format(output_format: str) -> bool

Validate output format string.

def validate_binarization_method(method: str) -> bool

Validate binarization method for image preprocessing.
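
Since these functions return a bool rather than raising, callers that prefer exceptions can wrap them. The `require` helper below is a hypothetical sketch of that pattern. `fake_validate_confidence` is a stand-in that applies the documented 0.0-1.0 range so the example runs without Kreuzberg. In real code the corresponding `validate_*` function would be passed instead.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def require(check: Callable[[T], bool], value: T, label: str) -> T:
    """Return `value` if `check(value)` is true, otherwise raise ValueError."""
    if not check(value):
        raise ValueError(f"invalid {label}: {value!r}")
    return value


# Stand-in for validate_confidence, using the documented 0.0-1.0 range.
def fake_validate_confidence(confidence: float) -> bool:
    return 0.0 <= confidence <= 1.0


threshold = require(fake_validate_confidence, 0.8, "confidence")
print(threshold)  # 0.8

try:
    require(fake_validate_confidence, 1.5, "confidence")
except ValueError as exc:
    print(exc)  # invalid confidence: 1.5
```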

Getting Valid Values

def get_valid_binarization_methods() -> list[str]

Get list of valid binarization methods.

def get_valid_language_codes() -> list[str]

Get list of valid language codes.

def get_valid_ocr_backends() -> list[str]

Get list of valid OCR backend names.

def get_valid_token_reduction_levels() -> list[str]

Get list of valid token reduction levels.

def list_embedding_presets() -> list[str]

List available embedding presets.

def get_embedding_preset(name: str) -> EmbeddingPreset | None

Get details about a specific embedding preset.

Example:

from kreuzberg import (
    validate_dpi,
    get_valid_binarization_methods,
    list_embedding_presets,
    get_embedding_preset
)

# Validate parameters
if not validate_dpi(300):
    print("Invalid DPI")

# List valid values
binarization_methods = get_valid_binarization_methods()
presets = list_embedding_presets()

# Get preset details
preset = get_embedding_preset("balanced")
if preset:
    print(f"Balanced preset: {preset.description}")
    print(f"Dimensions: {preset.dimensions}")
    print(f"Recommended chunk size: {preset.chunk_size}")

Configuration Utilities

Config Manipulation

def config_to_json(config: ExtractionConfig) -> str

Convert ExtractionConfig to JSON string.

def config_get_field(config: ExtractionConfig, field_name: str) -> Any | None

Get a specific field value from ExtractionConfig.

def config_merge(base: ExtractionConfig, override: ExtractionConfig) -> None

Merge override config into base config (mutates base).

Example:

from kreuzberg import ExtractionConfig, config_to_json, config_get_field, config_merge

config = ExtractionConfig(use_cache=True, enable_quality_processing=False)

# Convert to JSON
json_str = config_to_json(config)
print(json_str)

# Get field
use_cache = config_get_field(config, "use_cache")
print(f"use_cache: {use_cache}")

# Merge configs
override = ExtractionConfig(use_cache=False)
config_merge(config, override)
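
Because `config_merge` mutates `base` in place, it can be useful to snapshot the config with `config_to_json` before and after the merge and compare the two strings. The `changed_fields` helper below is a hypothetical sketch that works on any pair of JSON object strings. The literal inputs are illustrative only and assume a flat layout, which may not match the JSON that `config_to_json` actually emits.

```python
import json
from typing import Any


def changed_fields(before_json: str, after_json: str) -> dict[str, tuple[Any, Any]]:
    """Map each top-level key whose value differs to its (before, after) pair."""
    before = json.loads(before_json)
    after = json.loads(after_json)
    keys = before.keys() | after.keys()
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(keys)
        if before.get(key) != after.get(key)
    }


# Illustrative snapshots; real ones would come from config_to_json(config).
before = '{"use_cache": true, "enable_quality_processing": false}'
after = '{"use_cache": false, "enable_quality_processing": false}'

diff = changed_fields(before, after)
print(diff)  # {'use_cache': (True, False)}
```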

Version Information

__version__: str

Current version of the kreuzberg package.

Example:

from kreuzberg import __version__

print(f"Kreuzberg version: {__version__}")