Kreuzberg
{% include 'partials/badges.html.jinja' %}
{{ description }}
What This Package Provides
- Python-native extraction — sync and async APIs for files, bytes, URLs, and batch ingestion.
- Structured results — text, tables, images, metadata, language detection, chunks, and warnings in typed Python objects.
- OCR choices — Tesseract, EasyOCR, PaddleOCR, and VLM OCR where configured.
- Same Rust engine as every binding — behavior matches the Node.js, Ruby, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.
Installation
pip install kreuzberg
With OCR Support
pip install "kreuzberg[easyocr]"
pip install "kreuzberg[paddleocr]"
All Features
pip install "kreuzberg[all]"
Quick Start
Basic Usage
{{ 'getting-started/basic_usage.md' | include_snippet('python') }}
Simple Extraction
{{ 'getting-started/extract_file.md' | include_snippet('python') }}
Reading Content
{{ 'getting-started/read_content.md' | include_snippet('python') }}
OCR Support
Using OCR
{{ 'getting-started/extract_with_ocr.md' | include_snippet('python') }}
EasyOCR (GPU-Accelerated)
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="en")
)
result = extract_file_sync(
    "photo.jpg",
    config=config,
    easyocr_kwargs={"use_gpu": True}
)
PaddleOCR (Complex Layouts)
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="ch")
)
result = extract_file_sync(
    "invoice.pdf",
    config=config,
)
Table Extraction
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig
config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        tesseract_config=TesseractConfig(
            enable_table_detection=True
        )
    )
)
result = extract_file_sync("invoice.pdf", config=config)
for table in result.tables:
    print(table.markdown)
    print(table.cells)
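The `cells` attribute printed above can be exported for spreadsheet tools. The following is a minimal sketch, assuming `table.cells` is a list of rows where each row is a list of cell values (this README does not state the exact shape). `cells_to_csv` is a hypothetical helper, not part of Kreuzberg.

```python
import csv
import io


def cells_to_csv(cells):
    """Serialize a table given as rows of cell values into CSV text.

    Assumption: `cells` is a list of rows, each row a list of values.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in cells:
        # None becomes an empty field; everything else is converted with str().
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buffer.getvalue()


# Stand-in data so the sketch runs without a real document.
sample = [["Item", "Qty"], ["Widget, large", "2"]]
print(cells_to_csv(sample))
```

With a real extraction result, the same helper could be called as `cells_to_csv(table.cells)` for each `table` in `result.tables`, provided the assumed shape holds.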
Configuration
Complete Configuration Example
from kreuzberg import (
    extract_file_sync,
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ChunkingConfig,
    ImageExtractionConfig,
    PdfConfig,
    TokenReductionConfig,
    LanguageDetectionConfig,
)
config = ExtractionConfig(
    use_cache=True,
    enable_quality_processing=True,
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(
            psm=6,
            enable_table_detection=True,
            min_confidence=50.0,
        ),
    ),
    force_ocr=False,
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200,
    ),
    images=ImageExtractionConfig(
        extract_images=True,
        target_dpi=300,
        max_image_dimension=4096,
        auto_adjust_dpi=True,
    ),
    pdf_options=PdfConfig(
        extract_images=True,
        passwords=["password1", "password2"],
        extract_metadata=True,
    ),
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_important_words=True,
    ),
    language_detection=LanguageDetectionConfig(
        enabled=True,
        min_confidence=0.8,
        detect_multiple=False,
    ),
)
result = extract_file_sync("document.pdf", config=config)
HTML Conversion Options & Batch Concurrency
from kreuzberg import ExtractionConfig
config = ExtractionConfig(
    max_concurrent_extractions=8,
    html_options={
        "extract_metadata": True,
        "wrap": True,
        "wrap_width": 100,
        "strip_tags": ["script", "style"],
        "preprocessing": {"enabled": True, "preset": "standard"},
    },
)
Metadata Extraction
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
if result.images:
    print(f"Extracted {len(result.images)} inline images")
if result.chunks:
    print(f"First chunk tokens: {result.chunks[0]['metadata']['token_count']}")
print(result.metadata.get("pdf", {}))
print(result.metadata.get("language"))
print(result.metadata.get("format"))
if "pdf" in result.metadata:
    pdf_meta = result.metadata["pdf"]
    print(f"Title: {pdf_meta.get('title')}")
    print(f"Author: {pdf_meta.get('author')}")
    print(f"Pages: {pdf_meta.get('page_count')}")
    print(f"Created: {pdf_meta.get('creation_date')}")
Password-Protected PDFs
from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig
config = ExtractionConfig(
    pdf_options=PdfConfig(
        passwords=["password1", "password2", "password3"]
    )
)
result = extract_file_sync("protected.pdf", config=config)
Language Detection
from kreuzberg import extract_file_sync, ExtractionConfig, LanguageDetectionConfig
config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(enabled=True)
)
result = extract_file_sync("multilingual.pdf", config=config)
print(result.detected_languages)
Text Chunking
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig
config = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200,
    )
)
result = extract_file_sync("long_document.pdf", config=config)
for chunk in result.chunks:
    print(chunk)
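To make the interaction of `max_chars` and `max_overlap` concrete, here is a pure-Python illustration of fixed-size windows that each repeat the tail of the previous one. It is not Kreuzberg's chunker, which runs in the Rust engine and may split on different boundaries; `sliding_chunks` is a hypothetical name used only for this sketch.

```python
def sliding_chunks(text, max_chars, max_overlap):
    """Split text into windows of at most max_chars that overlap by max_overlap."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")
    step = max_chars - max_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", max_chars=4, max_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Under this simplified model, `max_chars=1000` with `max_overlap=200` would start each chunk 800 characters after the previous one.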
Extract from Bytes
from kreuzberg import extract_bytes_sync
with open("document.pdf", "rb") as f:
    data = f.read()
result = extract_bytes_sync(data, "application/pdf")
print(result.content)
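When the bytes come from an upload or a download, the MIME type is often not known in advance. One option is to derive it from the file name with the standard library's `mimetypes` module. This is a sketch: `guess_mime_type` is a hypothetical helper, and whether Kreuzberg accepts every type that `mimetypes` can return is not covered by this README.

```python
import mimetypes


def guess_mime_type(filename, default="application/octet-stream"):
    """Return the MIME type implied by a file name's extension."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default


print(guess_mime_type("document.pdf"))
# → application/pdf
```

The returned string can then be passed as the second argument, for example `extract_bytes_sync(data, guess_mime_type("document.pdf"))`.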
API Reference
Extraction Functions
- extract_file(file_path, mime_type=None, config=None, **kwargs) – Async extraction
- extract_file_sync(file_path, mime_type=None, config=None, **kwargs) – Sync extraction
- extract_bytes(data, mime_type, config=None, **kwargs) – Async extraction from bytes
- extract_bytes_sync(data, mime_type, config=None, **kwargs) – Sync extraction from bytes
- batch_extract_files(paths, config=None, **kwargs) – Async batch extraction
- batch_extract_files_sync(paths, config=None, **kwargs) – Sync batch extraction
- batch_extract_bytes(data_list, mime_types, config=None, **kwargs) – Async batch from bytes
- batch_extract_bytes_sync(data_list, mime_types, config=None, **kwargs) – Sync batch from bytes
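The async variants can be combined with `asyncio` to process several files concurrently. The sketch below assumes `extract_file` is a coroutine function that returns one result per call, as the list above suggests. `extract_all` and its `limit` parameter are hypothetical, and the extractor is passed in as an argument so the pattern can be tried without real documents.

```python
import asyncio


async def extract_all(paths, extract, limit=4):
    """Run `extract(path)` for every path, at most `limit` at a time.

    Results are returned in the same order as `paths`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path):
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(bounded(p) for p in paths))


async def fake_extract(path):
    # Stand-in for kreuzberg.extract_file so the sketch runs anywhere.
    await asyncio.sleep(0)
    return f"content of {path}"


results = asyncio.run(extract_all(["a.pdf", "b.pdf"], fake_extract, limit=1))
print(results)
# → ['content of a.pdf', 'content of b.pdf']
```

With the package installed, `fake_extract` would be replaced by `extract_file` imported from `kreuzberg`. The `batch_extract_files` function listed above covers the same use case in a single call.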
Configuration Classes
- ExtractionConfig – Main configuration
- OcrConfig – OCR settings
- TesseractConfig – Tesseract-specific options
- ChunkingConfig – Text chunking settings
- ImageExtractionConfig – Image extraction settings
- PdfConfig – PDF-specific options
- TokenReductionConfig – Token reduction settings
- LanguageDetectionConfig – Language detection settings
Result Types
- ExtractionResult – Main result object with content, metadata, tables, detected_languages, chunks
- ExtractedTable – Table with cells, markdown, page_number
- Metadata – Typed metadata dictionary
Exceptions
- KreuzbergError – Base exception
- ValidationError – Invalid configuration or input
- ParsingError – Document parsing failure
- OCRError – OCR processing failure
- MissingDependencyError – Missing optional dependency
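A common pattern with an exception hierarchy like this is to catch the base class so that one bad document does not stop a larger job. The sketch below assumes the specific exceptions subclass `KreuzbergError`, which the "Base exception" label suggests but this README does not state outright. `try_extract` and `extract_with_kreuzberg` are hypothetical helpers.

```python
def try_extract(path, extract, handled):
    """Call extract(path); return (result, None) on success or (None, error)."""
    try:
        return extract(path), None
    except handled as error:
        return None, error


def extract_with_kreuzberg(path):
    # Imported lazily so try_extract stays usable without the package installed.
    from kreuzberg import KreuzbergError, extract_file_sync

    return try_extract(path, extract_file_sync, KreuzbergError)


def failing_extract(path):
    # Stand-in extractor that always fails, used to exercise the error path.
    raise ValueError(f"cannot parse {path}")


result, error = try_extract("broken.pdf", failing_extract, ValueError)
print(result, error)
# → None cannot parse broken.pdf
```

Catching a narrower class such as `OCRError` instead of the base class lets unrelated failures propagate.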
Examples
Custom Processing
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
text = result.content
text = text.lower()
text = text.replace("old", "new")
print(text)
Multiple Files with Progress
from kreuzberg import extract_file_sync
from pathlib import Path
files = list(Path("documents").glob("*.pdf"))
results = []
for i, file in enumerate(files, 1):
    print(f"Processing {i}/{len(files)}: {file.name}")
    result = extract_file_sync(str(file))
    results.append((file.name, result))
for name, result in results:
    print(f"{name}: {len(result.content)} characters")
Filter by Language
from kreuzberg import extract_file_sync, ExtractionConfig, LanguageDetectionConfig
config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(enabled=True)
)
result = extract_file_sync("document.pdf", config=config)
if result.detected_languages and "en" in result.detected_languages:
    print("English document detected")
    print(result.content)
System Requirements
ONNX Runtime (for embeddings)
To use the embeddings functionality, install ONNX Runtime version 1.22.x:
# macOS
brew install onnxruntime
# Ubuntu/Debian (download from GitHub - Debian packages may have older versions)
# Download from https://github.com/microsoft/onnxruntime/releases
# Windows
# Download from https://github.com/microsoft/onnxruntime/releases
Important: Kreuzberg requires ONNX Runtime version 1.22.x for embeddings. Without it, embedding calls raise MissingDependencyError with installation instructions.
Tesseract OCR (Required for OCR)
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
Pandoc (Optional, for some formats)
# macOS
brew install pandoc
# Ubuntu/Debian
sudo apt-get install pandoc
Troubleshooting
Import Error: No module named '_kreuzberg'
This usually means the Rust extension wasn't built correctly. Try:
pip install --force-reinstall --no-cache-dir kreuzberg
OCR Not Working
Make sure Tesseract is installed:
tesseract --version
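The same check can be done from Python before starting a long OCR job. This is a sketch using only the standard library; `binary_on_path` is a hypothetical helper, and whether Kreuzberg itself locates Tesseract through `PATH` is not stated in this README.

```python
import shutil


def binary_on_path(name):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if not binary_on_path("tesseract"):
    print("tesseract not found on PATH; install it or adjust PATH")
```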
Memory Issues with Large PDFs
Use streaming or enable chunking:
from kreuzberg import ChunkingConfig, ExtractionConfig
config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=1000)
)
PDFium Integration
PDF extraction is powered by PDFium, which is automatically bundled with this package. No system installation required.
Platform Support
| Platform | Status | Notes |
|---|---|---|
| Linux x86_64 | ✅ | Bundled |
| macOS ARM64 | ✅ | Bundled |
| macOS x86_64 | ✅ | Bundled |
| Windows x86_64 | ✅ | Bundled |
Binary Size Impact
PDFium adds approximately 8-15 MB to the package size depending on platform. This ensures consistent PDF extraction across all environments without external dependencies.
Documentation
For comprehensive documentation, visit https://kreuzberg.dev
Part of Kreuzberg.dev
- Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
- kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- html-to-markdown — fast, lossless HTML→Markdown engine.
- liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
- tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
- alef — the polyglot binding generator that produces this README and all per-language bindings.
- Discord — community, roadmap, announcements.
License
{{ license }} License - see LICENSE for details.