# {{ name }}

{% include 'partials/badges.html.jinja' %}

{{ description }}

## What This Package Provides

- **Document intelligence core** — extract text, tables, images, metadata, entities, keywords, and code intelligence from one API.
- **Format coverage** — PDF, Office, images, HTML/XML, email, archives, notebooks, citations, scientific formats, and plain text.
- **OCR choices** — Tesseract, PaddleOCR, EasyOCR where supported, VLM OCR through liter-llm, and plugin hooks for custom backends.
- **Same engine as every binding** — Rust, Python, Node.js, Go, Java, PHP, Ruby, .NET, Elixir, R, WASM, Kotlin Android, Swift, Dart, Zig, and C FFI share the same Rust implementation.
{% if language == "typescript" %}
- **Node-first TypeScript API** — NAPI-RS package with typed options/results and async extraction.
{% elif language == "python" %}
- **Python package** — sync and async APIs with typed results for ingestion, RAG, and data workflows.
{% elif language == "go" %}
- **Go module** — context-aware API over the shared native library.
{% elif language == "java" %}
- **Java package** — FFM binding for direct native document extraction.
{% elif language == "php" %}
- **PHP package** — PHP 8.2+ API with generated types.
{% elif language == "ruby" %}
- **Ruby package** — native extension with idiomatic Ruby objects.
{% elif language == "csharp" %}
- **.NET package** — async/await API with nullable-aware result types.
{% elif language == "elixir" %}
- **BEAM package** — Rustler NIF binding for OTP pipelines.
{% elif language == "wasm" %}
- **WASM package** — browser and edge-compatible extraction where native libraries are unavailable.
{% elif language == "r" %}
- **R package** — data workflow binding with data-frame-friendly extracted structures.
{% elif language == "ffi" %}
- **C ABI** — stable shared library surface for custom hosts and secondary bindings.
{% elif language == "kotlin_android" %}
- **Android AAR** — JNI-backed package for mobile extraction workloads.
{% elif language == "swift" %}
- **SwiftPM package** — Swift Concurrency API for Apple targets.
{% elif language == "dart" %}
- **Dart package** — Future/Stream API through flutter_rust_bridge.
{% elif language == "zig" %}
- **Zig package** — wrapper over the C FFI with explicit memory ownership.
{% endif %}

## Installation

{% include 'partials/installation.md.jinja' %}

## Quick Start

{% include 'partials/quick_start.md.jinja' %}

{% if language == "typescript" %}
{% include 'partials/napi_implementation.md.jinja' %}

{% endif %}

## Features

{% include 'partials/features.md.jinja' %}

{% if features.ocr %}

## OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

{% for backend in ocr_backends %}

- **{{ backend | title }}**
{% endfor %}

### OCR Configuration Example

{{ snippets.ocr_configuration | include_snippet(language) }}

{% endif %}
{% if features.async %}

## Async Support

This binding provides full async/await support for non-blocking document processing:

{{ snippets.async_extraction | include_snippet(language) }}

{% endif %}
{% if features.plugin_system %}

## Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit the [Plugin System Guide](https://docs.kreuzberg.dev/guides/plugins/).

{% if snippets.plugin_system %}

### Plugin Example

{{ snippets.plugin_system | include_snippet(language) }}

{% endif %}
{% endif %}
{% if features.embeddings %}

## Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime integration. This requires ONNX Runtime to be installed.

**[Embeddings Guide](https://docs.kreuzberg.dev/features/#embeddings)**
{% endif %}

{% if snippets.batch_processing %}

## Batch Processing

Process multiple documents efficiently:

{{ snippets.batch_processing | include_snippet(language) }}

{% endif %}

## Configuration

For advanced configuration options, including language detection, table extraction, and OCR settings, see:

**[Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/)**

## Documentation

- **[Official Documentation](https://docs.kreuzberg.dev/)**
- **[API Reference](https://docs.kreuzberg.dev/reference/api-python/)**
- **[Examples & Guides](https://docs.kreuzberg.dev/)**

## Contributing

Contributions are welcome! See the [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).

## Part of Kreuzberg.dev

- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.

## License

{{ license }} License — see [LICENSE](../../LICENSE) for details.

## Support

- **Discord Community**: [Join our Discord](https://discord.gg/xt9WY3GnKR)
- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)