{{ name }}
{% include 'partials/badges.html.jinja' %}
{{ description }}
What This Package Provides
- Document intelligence core — extract text, tables, images, metadata, entities, keywords, and code intelligence from one API.
- Format coverage — PDF, Office, images, HTML/XML, email, archives, notebooks, citations, scientific formats, and plain text.
- OCR choices — Tesseract, PaddleOCR, EasyOCR where supported, VLM OCR through liter-llm, and plugin hooks for custom backends.
- Same engine as every binding — Rust, Python, Node.js, Go, Java, PHP, Ruby, .NET, Elixir, R, WASM, Kotlin Android, Swift, Dart, Zig, and C FFI share the same Rust implementation. {% if language == "typescript" %}
- Node-first TypeScript API — NAPI-RS package with typed options/results and async extraction. {% elif language == "python" %}
- Python package — sync and async APIs with typed results for ingestion, RAG, and data workflows. {% elif language == "go" %}
- Go module — context-aware API over the shared native library. {% elif language == "java" %}
- Java package — FFM binding for direct native document extraction. {% elif language == "php" %}
- PHP package — PHP 8.2+ API with generated types. {% elif language == "ruby" %}
- Ruby package — native extension with idiomatic Ruby objects. {% elif language == "csharp" %}
- .NET package — async/await API with nullable-aware result types. {% elif language == "elixir" %}
- BEAM package — Rustler NIF binding for OTP pipelines. {% elif language == "wasm" %}
- WASM package — browser and edge-compatible extraction where native libraries are unavailable. {% elif language == "r" %}
- R package — data workflow binding with data-frame-friendly extracted structures. {% elif language == "ffi" %}
- C ABI — stable shared library surface for custom hosts and secondary bindings. {% elif language == "kotlin_android" %}
- Android AAR — JNI-backed package for mobile extraction workloads. {% elif language == "swift" %}
- SwiftPM package — Swift Concurrency API for Apple targets. {% elif language == "dart" %}
- Dart package — Future/Stream API through flutter_rust_bridge. {% elif language == "zig" %}
- Zig package — wrapper over the C FFI with explicit memory ownership. {% endif %}
Installation
{% include 'partials/installation.md.jinja' %}
Quick Start
{% include 'partials/quick_start.md.jinja' %}
{% if language == "typescript" %} {% include 'partials/napi_implementation.md.jinja' %}
{% endif %}
Features
{% include 'partials/features.md.jinja' %}
{% if features.ocr %}
OCR Support
Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
{% for backend in ocr_backends %}
- {{ backend | title }} {% endfor %}
OCR Configuration Example
{{ snippets.ocr_configuration | include_snippet(language) }}
{% endif %} {% if features.async %}
Async Support
This binding provides full async/await support for non-blocking document processing:
{{ snippets.async_extraction | include_snippet(language) }}
{% endif %} {% if features.plugin_system %}
Plugin System
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
For detailed plugin documentation, see the Plugin System Guide.
{% if snippets.plugin_system %}
Plugin Example
{{ snippets.plugin_system | include_snippet(language) }}
{% endif %} {% endif %} {% if features.embeddings %}
Embeddings Support
Generate vector embeddings for extracted text through the built-in ONNX Runtime integration. ONNX Runtime must be installed for this feature to work.
See the Embeddings Guide. {% endif %}
{% if snippets.batch_processing %}
Batch Processing
Process multiple documents efficiently:
{{ snippets.batch_processing | include_snippet(language) }}
{% endif %}
Configuration
For advanced configuration options, including language detection, table extraction, and OCR settings, see the Documentation.
Contributing
Contributions are welcome! See the Contributing Guide.
Part of Kreuzberg.dev
- Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
- kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- html-to-markdown — fast, lossless HTML→Markdown engine.
- liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
- tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
- alef — the polyglot binding generator that produces this README and all per-language bindings.
- Discord — community, roadmap, announcements.
License
{{ license }} License — see LICENSE for details.
Support
- Discord Community: Join our Discord
- GitHub Issues: Report bugs
- Discussions: Ask questions