Kreuzberg

Extract text, tables, images, and metadata from 90+ file formats (including PDF, Office documents, and images) and from source code in 300+ programming languages. Native Python bindings with async/await support, multiple OCR backends (Tesseract, EasyOCR, PaddleOCR), and an extensible plugin system.

What This Package Provides

Python-native extraction — sync and async APIs for files, bytes, URLs, and batch ingestion.
Structured results — text, tables, images, metadata, language detection, chunks, and warnings in typed Python objects.
OCR choices — Tesseract, EasyOCR, PaddleOCR, and VLM OCR where configured.
Same Rust engine as every binding — behavior matches the Node.js, Ruby, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.

Installation

pip install kreuzberg

With OCR Support

pip install "kreuzberg[easyocr]"
pip install "kreuzberg[paddleocr]"

All Features

pip install "kreuzberg[all]"

Quick Start

Basic Usage

import asyncio
from kreuzberg import extract_file, ExtractionConfig

async def main() -> None:
    config = ExtractionConfig(
        use_cache=True,
        enable_quality_processing=True
    )
    result = await extract_file("document.pdf", config=config)
    print(result.content)

asyncio.run(main())

Simple Extraction

import asyncio
from pathlib import Path
from kreuzberg import extract_file

async def main() -> None:
    file_path: Path = Path("document.pdf")

    result = await extract_file(file_path)

    print(f"Content: {result.content}")
    print(f"Format: {result.metadata.format.format_type if result.metadata.format else None}")
    print(f"Tables: {len(result.tables)}")

asyncio.run(main())

Reading Content

import asyncio
from kreuzberg import extract_file

async def main() -> None:
    result = await extract_file("document.pdf")

    content: str = result.content
    tables: int = len(result.tables)
    format_type: str | None = result.metadata.format.format_type if result.metadata.format else None

    print(f"Content length: {len(content)} characters")
    print(f"Tables found: {tables}")
    print(f"Format: {format_type}")

asyncio.run(main())

OCR Support

Using OCR

import asyncio
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig

async def main() -> None:
    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(
            backend="tesseract",
            language="eng",
            tesseract_config=TesseractConfig(psm=3)
        )
    )
    result = await extract_file("scanned.pdf", config=config)
    print(result.content)
    print(f"Detected Languages: {result.detected_languages}")

asyncio.run(main())

EasyOCR (GPU-Accelerated)

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="en")
)

result = extract_file_sync(
    "photo.jpg",
    config=config,
    easyocr_kwargs={"use_gpu": True}
)

PaddleOCR (Complex Layouts)

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="ch")
)

result = extract_file_sync(
    "invoice.pdf",
    config=config,
)

Table Extraction

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        tesseract_config=TesseractConfig(
            enable_table_detection=True
        )
    )
)

result = extract_file_sync("invoice.pdf", config=config)

for table in result.tables:
    print(table.markdown)
    print(table.cells)
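
Once tables are extracted, a common next step is exporting them. The sketch below writes a table to CSV using only the standard library. It assumes `table.cells` is a list of rows, each a list of cell strings; that shape is an assumption, so check it against the actual objects you get back. `cells_to_csv` is a hypothetical helper, not part of Kreuzberg.

```python
import csv
import io
from typing import Sequence


def cells_to_csv(cells: Sequence[Sequence[str]]) -> str:
    """Serialize a table given as rows of cell strings to CSV text."""
    buffer = io.StringIO()
    # An explicit line terminator keeps the output identical across platforms.
    writer = csv.writer(buffer, lineterminator="\n")
    for row in cells:
        writer.writerow(row)
    return buffer.getvalue()


# Stand-in data shaped like the assumed `table.cells` value.
sample_cells = [["Item", "Price"], ["Widget, large", "9.50"]]
csv_text = cells_to_csv(sample_cells)
print(csv_text)
```

With a real result, the call would look like `cells_to_csv(table.cells)` inside the loop over `result.tables`.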

Configuration

Complete Configuration Example

from kreuzberg import (
    extract_file_sync,
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ChunkingConfig,
    ImageExtractionConfig,
    PdfConfig,
    TokenReductionConfig,
    LanguageDetectionConfig,
)

config = ExtractionConfig(
    use_cache=True,
    enable_quality_processing=True,
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(
            psm=6,
            enable_table_detection=True,
            min_confidence=50.0,
        ),
    ),
    force_ocr=False,
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200,
    ),
    images=ImageExtractionConfig(
        extract_images=True,
        target_dpi=300,
        max_image_dimension=4096,
        auto_adjust_dpi=True,
    ),
    pdf_options=PdfConfig(
        extract_images=True,
        passwords=["password1", "password2"],
        extract_metadata=True,
    ),
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_important_words=True,
    ),
    language_detection=LanguageDetectionConfig(
        enabled=True,
        min_confidence=0.8,
        detect_multiple=False,
    ),
)

result = extract_file_sync("document.pdf", config=config)

HTML Conversion Options & Batch Concurrency

from kreuzberg import ExtractionConfig

config = ExtractionConfig(
    max_concurrent_extractions=8,
    html_options={
        "extract_metadata": True,
        "wrap": True,
        "wrap_width": 100,
        "strip_tags": ["script", "style"],
        "preprocessing": {"enabled": True, "preset": "standard"},
    },
)

Metadata Extraction

from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")

if result.images:
    print(f"Extracted {len(result.images)} inline images")

if result.chunks:
    print(f"First chunk tokens: {result.chunks[0]['metadata']['token_count']}")

print(result.metadata.get("pdf", {}))
print(result.metadata.get("language"))
print(result.metadata.get("format"))

if "pdf" in result.metadata:
    pdf_meta = result.metadata["pdf"]
    print(f"Title: {pdf_meta.get('title')}")
    print(f"Author: {pdf_meta.get('author')}")
    print(f"Pages: {pdf_meta.get('page_count')}")
    print(f"Created: {pdf_meta.get('creation_date')}")

Password-Protected PDFs

from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        passwords=["password1", "password2", "password3"]
    )
)

result = extract_file_sync("protected.pdf", config=config)

Language Detection

from kreuzberg import extract_file_sync, ExtractionConfig, LanguageDetectionConfig

config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(enabled=True)
)

result = extract_file_sync("multilingual.pdf", config=config)
print(result.detected_languages)

Text Chunking

from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200,
    )
)

result = extract_file_sync("long_document.pdf", config=config)

for chunk in result.chunks:
    print(chunk)
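
To build intuition for how `max_chars` and `max_overlap` interact, here is a plain sliding-window chunker. It is an illustration only and is not Kreuzberg's chunking algorithm, which may split on different boundaries. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, max_chars: int, max_overlap: int) -> list[str]:
    """Split text into windows of at most max_chars that share max_overlap characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in the range [0, max_chars)")

    step = max_chars - max_overlap  # how far each new window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window already reaches the end of the text.
        if start + max_chars >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", max_chars=4, max_overlap=1))
```

Each chunk repeats the last `max_overlap` characters of the previous one, which is why a larger overlap produces more chunks for the same text.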

Extract from Bytes

from kreuzberg import extract_bytes_sync

with open("document.pdf", "rb") as f:
    data = f.read()

result = extract_bytes_sync(data, "application/pdf")
print(result.content)
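
The bytes API needs a MIME type. When the bytes come from a file whose name you know, the standard library can usually guess it. The sketch below uses `mimetypes.guess_type`; `guess_mime_type` and `read_with_mime` are hypothetical helpers, and the fallback value is an arbitrary choice.

```python
import mimetypes
from pathlib import Path
from typing import Tuple, Union


def guess_mime_type(path: Union[str, Path], default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the file extension, falling back to a default."""
    mime_type, _encoding = mimetypes.guess_type(str(path))
    return mime_type or default


def read_with_mime(path: Union[str, Path]) -> Tuple[bytes, str]:
    """Read a file and pair its bytes with the guessed MIME type."""
    file_path = Path(path)
    return file_path.read_bytes(), guess_mime_type(file_path)


print(guess_mime_type("document.pdf"))
```

The pair returned by `read_with_mime("document.pdf")` can then be passed as the two positional arguments of `extract_bytes_sync`.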

API Reference

Extraction Functions

extract_file(file_path, mime_type=None, config=None, **kwargs) – Async extraction
extract_file_sync(file_path, mime_type=None, config=None, **kwargs) – Sync extraction
extract_bytes(data, mime_type, config=None, **kwargs) – Async extraction from bytes
extract_bytes_sync(data, mime_type, config=None, **kwargs) – Sync extraction from bytes
batch_extract_files(paths, config=None, **kwargs) – Async batch extraction
batch_extract_files_sync(paths, config=None, **kwargs) – Sync batch extraction
batch_extract_bytes(data_list, mime_types, config=None, **kwargs) – Async batch from bytes
batch_extract_bytes_sync(data_list, mime_types, config=None, **kwargs) – Sync batch from bytes
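
The batch functions are not shown elsewhere in this README, so here is a minimal sketch of how one might be used on a directory. The batch function is passed in as a parameter to keep the example independent of the installed package. The sketch assumes results come back in the same order as the input paths, which is not confirmed above. `extract_directory` is a hypothetical helper.

```python
import asyncio
from pathlib import Path
from typing import Any, Awaitable, Callable, Sequence, Union

BatchExtract = Callable[[list[str]], Awaitable[Sequence[Any]]]


async def extract_directory(
    directory: Union[str, Path],
    batch_extract: BatchExtract,
    pattern: str = "*.pdf",
) -> dict[str, Any]:
    """Run one batch extraction over all matching files and key results by file name."""
    paths = sorted(Path(directory).glob(pattern))
    if not paths:
        return {}
    results = await batch_extract([str(path) for path in paths])
    # Assumption: results are returned in the same order as the input paths.
    return {path.name: result for path, result in zip(paths, results)}
```

Usage would be along the lines of `asyncio.run(extract_directory("documents", batch_extract_files))`, with `batch_extract_files` imported from `kreuzberg`.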

Configuration Classes

ExtractionConfig – Main configuration
OcrConfig – OCR settings
TesseractConfig – Tesseract-specific options
ChunkingConfig – Text chunking settings
ImageExtractionConfig – Image extraction settings
PdfConfig – PDF-specific options
TokenReductionConfig – Token reduction settings
LanguageDetectionConfig – Language detection settings

Result Types

ExtractionResult – Main result object with content, metadata, tables, detected_languages, chunks
ExtractedTable – Table with cells, markdown, page_number
Metadata – Typed metadata dictionary

Exceptions

KreuzbergError – Base exception
ValidationError – Invalid configuration or input
ParsingError – Document parsing failure
OCRError – OCR processing failure
MissingDependencyError – Missing optional dependency
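
A sketch of handling these exceptions, most specific first. It assumes all of them can be imported from the top-level `kreuzberg` package and that the specific errors subclass `KreuzbergError`; the list above suggests this but does not state it outright. `try_extract` is a hypothetical helper.

```python
from typing import Any, Optional, Tuple


def try_extract(path: str) -> Tuple[Optional[Any], Optional[str]]:
    """Return (result, None) on success or (None, message) on a Kreuzberg failure."""
    # Imported lazily so this module loads even when kreuzberg is not installed.
    from kreuzberg import (
        KreuzbergError,
        MissingDependencyError,
        OCRError,
        ParsingError,
        ValidationError,
        extract_file_sync,
    )

    try:
        return extract_file_sync(path), None
    except MissingDependencyError as exc:
        return None, f"missing dependency: {exc}"
    except ValidationError as exc:
        return None, f"invalid input: {exc}"
    except (ParsingError, OCRError) as exc:
        return None, f"extraction failed: {exc}"
    except KreuzbergError as exc:
        # Catch-all for any other library error (assumed common base class).
        return None, f"kreuzberg error: {exc}"
```

Returning a message instead of raising keeps a long-running loop over many files alive; re-raise instead if a failure should stop the run.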

Examples

Custom Processing

from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")

text = result.content
text = text.lower()
text = text.replace("old", "new")

print(text)

Multiple Files with Progress

from kreuzberg import extract_file_sync
from pathlib import Path

files = list(Path("documents").glob("*.pdf"))
results = []

for i, file in enumerate(files, 1):
    print(f"Processing {i}/{len(files)}: {file.name}")
    result = extract_file_sync(str(file))
    results.append((file.name, result))

for name, result in results:
    print(f"{name}: {len(result.content)} characters")
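
The loop above handles one file at a time. An async variant with bounded concurrency could look like the sketch below. The extraction coroutine is passed in (for example `extract_file`), and `extract_concurrently` and its `limit` parameter are hypothetical names for this example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def extract_concurrently(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[Any]],
    limit: int = 4,
) -> list[Any]:
    """Extract all paths with at most `limit` extractions in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    done = 0

    async def worker(path: str) -> Any:
        nonlocal done
        async with semaphore:  # blocks while `limit` workers are already running
            result = await extract(path)
        done += 1
        print(f"Processed {done}/{len(paths)}: {path}")
        return result

    # gather returns results in the order of `paths`, not completion order.
    return list(await asyncio.gather(*(worker(path) for path in paths)))
```

Progress lines appear in completion order, while the returned list stays aligned with the input list.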

Filter by Language

from kreuzberg import extract_file_sync, ExtractionConfig, LanguageDetectionConfig

config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(enabled=True)
)

result = extract_file_sync("document.pdf", config=config)

if result.detected_languages and "en" in result.detected_languages:
    print("English document detected")
    print(result.content)
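
For a whole collection, the same check can be turned into a grouping step. The sketch below works on a plain mapping from file name to detected languages, assuming `detected_languages` is either a list of language codes such as `"en"` or empty/`None`. `group_by_language` and the `"unknown"` bucket are hypothetical choices for this example.

```python
from collections import defaultdict
from typing import Mapping, Optional, Sequence


def group_by_language(
    detected: Mapping[str, Optional[Sequence[str]]],
) -> dict[str, list[str]]:
    """Group file names by each language detected in them."""
    groups: defaultdict[str, list[str]] = defaultdict(list)
    for name, languages in detected.items():
        if not languages:
            # No detection result: keep the file visible instead of dropping it.
            groups["unknown"].append(name)
            continue
        for language in languages:
            groups[language].append(name)
    return dict(groups)


print(group_by_language({"a.pdf": ["en"], "b.pdf": ["en", "de"], "c.pdf": None}))
```

A multilingual document appears under every language detected in it.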

System Requirements

ONNX Runtime (for embeddings)

If using embeddings functionality, ONNX Runtime version 1.22.x must be installed:

# macOS
brew install onnxruntime

# Ubuntu/Debian (download from GitHub - Debian packages may have older versions)
# Download from https://github.com/microsoft/onnxruntime/releases

# Windows
# Download from https://github.com/microsoft/onnxruntime/releases

Without ONNX Runtime, embeddings raise MissingDependencyError with installation instructions.

Tesseract OCR (Required for OCR)

# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

Pandoc (Optional, for some formats)

# macOS
brew install pandoc

# Ubuntu/Debian
sudo apt-get install pandoc

Troubleshooting

Import Error: No module named '_kreuzberg'

This usually means the Rust extension wasn't built correctly. Try:

pip install --force-reinstall --no-cache-dir kreuzberg

OCR Not Working

Make sure Tesseract is installed:

tesseract --version
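
The same check can be done from Python before starting a long run. This sketch uses only the standard library to look for the `tesseract` executable on `PATH`; `tesseract_version` is a hypothetical helper, not a Kreuzberg function.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if it is not on PATH."""
    executable = shutil.which("tesseract")
    if executable is None:
        return None
    completed = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some builds print the version banner to stderr rather than stdout.
    output = (completed.stdout or completed.stderr).strip()
    return output.splitlines()[0] if output else "unknown"


version = tesseract_version()
print(version if version else "Tesseract not found on PATH")
```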

Memory Issues with Large PDFs

Use streaming or enable chunking:

from kreuzberg import ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=1000)
)

PDFium Integration

PDF extraction is powered by PDFium, which is automatically bundled with this package. No system installation required.

Platform Support

| Platform | Status | Notes |
|----------|--------|-------|
| Linux x86_64 | ✅ | Bundled |
| macOS ARM64 | ✅ | Bundled |
| macOS x86_64 | ✅ | Bundled |
| Windows x86_64 | ✅ | Bundled |

Binary Size Impact

PDFium adds approximately 8-15 MB to the package size depending on platform. This ensures consistent PDF extraction across all environments without external dependencies.

Documentation

For comprehensive documentation, visit https://kreuzberg.dev

Part of Kreuzberg.dev

Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
html-to-markdown — fast, lossless HTML→Markdown engine.
liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
alef — the polyglot binding generator that produces this README and all per-language bindings.
Discord — community, roadmap, announcements.

License

Elastic-2.0 License - see LICENSE for details.
