Nomad changes

Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

273
templates/readme/go.md Normal file

@@ -0,0 +1,273 @@
# Kreuzberg
{% include 'partials/badges.html.jinja' %}
High-performance document intelligence for Go, backed by the Rust core that powers every Kreuzberg binding.
> **Version {{ version }}**
> Report issues at [github.com/kreuzberg-dev/kreuzberg](https://github.com/kreuzberg-dev/kreuzberg/issues).
## What This Package Provides
- **Go module over the Rust core** — context-aware extraction with Go structs and errors.
- **Structured results** — text, tables, images, metadata, language detection, chunks, and warnings.
- **Static-link workflow** — build against `kreuzberg-ffi` and ship a self-contained Go binary.
- **Cross-binding parity** — output matches the Python, Node.js, Ruby, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.
## Install
Kreuzberg Go binaries are **statically linked**: once built, they need no Kreuzberg shared library at runtime. The static library is required only at build time. ONNX Runtime, if used for embeddings, may still be linked dynamically (see System Requirements).
### Quick Start (Monorepo Development)
For development in the Kreuzberg monorepo:
```bash
# Build the static FFI library
cargo build -p kreuzberg-ffi --release
# Go build will automatically link against the static library
# (from target/release/libkreuzberg_ffi.a)
cd packages/go/v5
go build -v
# Run your binary (no library path needed - it's statically linked)
./v4
```
That's it! The resulting binary is self-contained and has no runtime dependencies on Kreuzberg libraries.
### Using Go Modules
To use this package via `go get`:
```bash
# Get the latest release
go get {{ package_name }}@latest
# Or a specific version
go get {{ package_name }}@v{{ version }}
```
You'll need to provide the static library at build time. See [Building with Static Libraries](#building-with-static-libraries) below.
### Building with Static Libraries
When building outside the Kreuzberg monorepo, you need to provide the static library (`.a` file on Unix, `.lib` on Windows).
#### Option 1: Download Pre-built Static Library
Download the static library for your platform from [GitHub Releases](https://github.com/kreuzberg-dev/kreuzberg/releases):
```bash
# Example: Linux x86_64
curl -LO https://github.com/kreuzberg-dev/kreuzberg/releases/download/v{{ version }}/go-ffi-linux-x86_64.tar.gz
tar -xzf go-ffi-linux-x86_64.tar.gz
# Copy to a permanent location
mkdir -p ~/kreuzberg/lib
cp kreuzberg-ffi/lib/libkreuzberg_ffi.a ~/kreuzberg/lib/
```
Then build with `CGO_LDFLAGS`:
```bash
# Linux/macOS
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build
# Windows (MSVC)
set CGO_LDFLAGS=-L%USERPROFILE%\kreuzberg\lib -lkreuzberg_ffi
go build
```
#### Option 2: Build Static Library Yourself
If pre-built libraries aren't available for your platform:
```bash
# Clone the repository
git clone https://github.com/kreuzberg-dev/kreuzberg.git
cd kreuzberg
# Build the static library
cargo build -p kreuzberg-ffi --release
# The static library is now at: target/release/libkreuzberg_ffi.a
# Copy it to a permanent location
mkdir -p ~/kreuzberg/lib
cp target/release/libkreuzberg_ffi.a ~/kreuzberg/lib/
# Now you can build Go projects
cd ~/my-go-project
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build
```
### System Requirements
#### ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime must be installed **at build time**:
```bash
# macOS
brew install onnxruntime
# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev
# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
```
Whether the resulting binary links ONNX Runtime statically or dynamically depends on how the FFI library was built. Check the FFI build configuration to find out whether ONNX Runtime must also be installed at runtime.
**Note:** Windows MinGW builds do not support embeddings (ONNX Runtime requires MSVC). Use Windows MSVC for embeddings support.
## Quickstart
```go
package main

import (
	"fmt"
	"log"

	"{{ package_name }}"
)

func main() {
	result, err := v4.ExtractFileSync("document.pdf", nil)
	if err != nil {
		log.Fatalf("extract failed: %v", err)
	}
	fmt.Println("MIME:", result.MimeType)

	// Guard the slice: content shorter than 200 bytes would otherwise panic.
	preview := result.Content
	if len(preview) > 200 {
		preview = preview[:200]
	}
	fmt.Println("First 200 bytes:")
	fmt.Println(preview)
}
```
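When printing a preview of extracted text, a fixed byte slice can panic on short content or cut a multi-byte UTF-8 character in half. The following is a minimal sketch of a rune-safe alternative. The `preview` helper is hypothetical and not part of the Kreuzberg API.

```go
package main

import "fmt"

// preview returns at most limit runes of s without splitting a UTF-8 character.
// Hypothetical helper for illustration; not part of the Kreuzberg API.
func preview(s string, limit int) string {
	if limit <= 0 {
		return ""
	}
	count := 0
	for i := range s { // i is the byte offset at which each rune starts
		if count == limit {
			return s[:i]
		}
		count++
	}
	return s // s contains limit runes or fewer
}

func main() {
	fmt.Println(preview("Grüße aus Kreuzberg", 5)) // prints "Grüße"
	fmt.Println(preview("short", 200))             // prints "short"
}
```

In the quickstart, this would be called as `preview(result.Content, 200)`.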
Build and run:
```bash
# Build (make sure you have the static library available - see Install)
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build
# Run - no library paths needed!
./myapp
```
The binary is self-contained and can be distributed without any Kreuzberg library dependencies.
## Examples
### Extract bytes
```go
data, err := os.ReadFile("slides.pptx")
if err != nil {
	log.Fatal(err)
}
result, err := v4.ExtractBytesSync(data, "application/vnd.openxmlformats-officedocument.presentationml.presentation", nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println(result.Metadata.FormatType())
```
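Because byte-based extraction takes the MIME type as an explicit argument, callers that start from a file name need a way to look it up. The following is a minimal sketch under stated assumptions: the `mimeByExt` table and `mimeForPath` helper are hypothetical, cover only a few formats, and are not part of the Kreuzberg API.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mimeByExt is a hypothetical lookup table for a handful of common formats.
var mimeByExt = map[string]string{
	".pdf":  "application/pdf",
	".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
	".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
	".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}

// mimeForPath returns the MIME type for a file name and reports whether
// the extension was found in the table. Matching is case-insensitive.
func mimeForPath(path string) (string, bool) {
	ext := strings.ToLower(filepath.Ext(path))
	mime, ok := mimeByExt[ext]
	return mime, ok
}

func main() {
	if mime, ok := mimeForPath("slides.PPTX"); ok {
		fmt.Println(mime)
	}
}
```

Returning a boolean rather than a default MIME type lets the caller decide how to handle an unknown extension.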
### Use advanced configuration
```go
lang := "eng"
cfg := &v4.ExtractionConfig{
	UseCache:        true,
	ForceOCR:        false,
	ImageExtraction: &v4.ImageExtractionConfig{Enabled: true},
	OCR: &v4.OcrConfig{
		Backend:  "tesseract",
		Language: &lang,
	},
}
result, err := v4.ExtractFileSync("scanned.pdf", cfg)
```
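Optional fields such as `Language` are pointers, which is why the example needs a temporary `lang` variable. A small generic helper can remove that boilerplate. This is only a sketch: `ptr` and `ocrConfig` are hypothetical stand-ins, not Kreuzberg types, and generics require Go 1.18 or newer.

```go
package main

import "fmt"

// ptr returns a pointer to a copy of v, so literals can fill pointer fields.
func ptr[T any](v T) *T {
	return &v
}

// ocrConfig is a hypothetical stand-in with an optional pointer field;
// it is not the real Kreuzberg configuration type.
type ocrConfig struct {
	Backend  string
	Language *string
}

func main() {
	cfg := ocrConfig{Backend: "tesseract", Language: ptr("eng")}
	fmt.Println(cfg.Backend, *cfg.Language) // prints "tesseract eng"
}
```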
### Async (context-aware) extraction
```go
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
result, err := v4.ExtractFile(ctx, "large.pdf", nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println("Content length:", len(result.Content))
```
### Batch extract
```go
paths := []string{"doc1.pdf", "doc2.docx", "report.xlsx"}
results, err := v4.BatchExtractFilesSync(paths, nil)
if err != nil {
	log.Fatal(err)
}
for i, res := range results {
	if res == nil {
		continue
	}
	fmt.Printf("[%d] %s => %d bytes\n", i, res.MimeType, len(res.Content))
}
```
### Register a validator
```go
//export customValidator
func customValidator(resultJSON *C.char) *C.char {
	// Validate JSON payload and return an error string (or NULL if ok)
	return nil
}

func init() {
	if err := v4.RegisterValidator("go-validator", 50, (C.ValidatorCallback)(C.customValidator)); err != nil {
		log.Fatalf("validator registration failed: %v", err)
	}
}
```
## API Reference
- **GoDoc**: [pkg.go.dev/{{ package_name }}](<https://pkg.go.dev/{{ package_name }}>)
- **Full documentation**: [kreuzberg.dev](https://kreuzberg.dev) (configuration, formats, OCR backends)
## Troubleshooting
| Issue | Fix |
| ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ld returned 1 exit status` or `undefined reference to 'html_to_markdown_...'` | The static library wasn't found. Make sure `CGO_LDFLAGS` points to the directory containing `libkreuzberg_ffi.a`: `CGO_LDFLAGS="-L/path/to/lib -lkreuzberg_ffi" go build` |
| `cannot find -lkreuzberg_ffi` | The static library file is missing or in the wrong location. Download it from [GitHub Releases](https://github.com/kreuzberg-dev/kreuzberg/releases) or build it yourself: `cargo build -p kreuzberg-ffi --release` |
| `undefined: v4.ExtractFile` | This function was removed in v4.1.0. Use `ExtractFileSync` and wrap it in a goroutine if needed (see the migration guide). |
| `Missing dependency: tesseract` | Install the OCR backend and ensure it is on `PATH`. Errors bubble up as `*v4.MissingDependencyError`. |
| `undefined: C.customValidator` during build | Export the callback with `//export` in a `*_cgo.go` file before using it in `Register*` helpers. |
| `Missing dependency: onnxruntime` | Install ONNX Runtime at build time: `brew install onnxruntime` (macOS), `apt install libonnxruntime libonnxruntime-dev` (Linux), `scoop install onnxruntime` (Windows). Required for embeddings functionality. |
| Embeddings not available on Windows MinGW | Windows MinGW builds cannot link ONNX Runtime (MSVC-only). Use Windows MSVC build for embeddings support, or build without embeddings feature. |
## Testing / Tooling
- `task go:lint` runs `gofmt` and `golangci-lint` (`golangci-lint` pinned to v2.11.3).
- `task go:test` executes `go test ./...` (after building the static FFI library).
- `task e2e:go:verify` regenerates fixtures via the e2e generator and runs `go test ./...` inside `e2e/go`.
Need help? Join the [Discord](https://discord.gg/xt9WY3GnKR) or open an issue with logs, platform info, and the steps you tried.
## Part of Kreuzberg.dev
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.


@@ -0,0 +1,156 @@
# {{ name }}
{% include 'partials/badges.html.jinja' %}
{{ description }}
## What This Package Provides
- **Document intelligence core** — extract text, tables, images, metadata, entities, keywords, and code intelligence from one API.
- **Format coverage** — PDF, Office, images, HTML/XML, email, archives, notebooks, citations, scientific formats, and plain text.
- **OCR choices** — Tesseract, PaddleOCR, EasyOCR where supported, VLM OCR through liter-llm, and plugin hooks for custom backends.
- **Same engine as every binding** — Rust, Python, Node.js, Go, Java, PHP, Ruby, .NET, Elixir, R, WASM, Kotlin Android, Swift, Dart, Zig, and C FFI share the same Rust implementation.
{% if language == "typescript" %}
- **Node-first TypeScript API** — NAPI-RS package with typed options/results and async extraction.
{% elif language == "python" %}
- **Python package** — sync and async APIs with typed results for ingestion, RAG, and data workflows.
{% elif language == "go" %}
- **Go module** — context-aware API over the shared native library.
{% elif language == "java" %}
- **Java package** — FFM binding for direct native document extraction.
{% elif language == "php" %}
- **PHP package** — PHP 8.2+ API with generated types.
{% elif language == "ruby" %}
- **Ruby package** — native extension with idiomatic Ruby objects.
{% elif language == "csharp" %}
- **.NET package** — async/await API with nullable-aware result types.
{% elif language == "elixir" %}
- **BEAM package** — Rustler NIF binding for OTP pipelines.
{% elif language == "wasm" %}
- **WASM package** — browser and edge-compatible extraction where native libraries are unavailable.
{% elif language == "r" %}
- **R package** — data workflow binding with data-frame-friendly extracted structures.
{% elif language == "ffi" %}
- **C ABI** — stable shared library surface for custom hosts and secondary bindings.
{% elif language == "kotlin_android" %}
- **Android AAR** — JNI-backed package for mobile extraction workloads.
{% elif language == "swift" %}
- **SwiftPM package** — Swift Concurrency API for Apple targets.
{% elif language == "dart" %}
- **Dart package** — Future/Stream API through flutter_rust_bridge.
{% elif language == "zig" %}
- **Zig package** — wrapper over the C FFI with explicit memory ownership.
{% endif %}
## Installation
{% include 'partials/installation.md.jinja' %}
## Quick Start
{% include 'partials/quick_start.md.jinja' %}
{% if language == "typescript" %}
{% include 'partials/napi_implementation.md.jinja' %}
{% endif %}
## Features
{% include 'partials/features.md.jinja' %}
{% if features.ocr %}
## OCR Support
Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
{% for backend in ocr_backends %}
- **{{ backend | title }}**
{% endfor %}
### OCR Configuration Example
{{ snippets.ocr_configuration | include_snippet(language) }}
{% endif %}
{% if features.async %}
## Async Support
This binding provides full async/await support for non-blocking document processing:
{{ snippets.async_extraction | include_snippet(language) }}
{% endif %}
{% if features.plugin_system %}
## Plugin System
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
For detailed plugin documentation, visit [Plugin System Guide](https://docs.kreuzberg.dev/guides/plugins/).
{% if snippets.plugin_system %}
### Plugin Example
{{ snippets.plugin_system | include_snippet(language) }}
{% endif %}
{% endif %}
{% if features.embeddings %}
## Embeddings Support
Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
**[Embeddings Guide](https://docs.kreuzberg.dev/features/#embeddings)**
{% endif %}
{% if snippets.batch_processing %}
## Batch Processing
Process multiple documents efficiently:
{{ snippets.batch_processing | include_snippet(language) }}
{% endif %}
## Configuration
For advanced configuration options including language detection, table extraction, OCR settings, and more:
**[Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/)**
## Documentation
- **[Official Documentation](https://docs.kreuzberg.dev/)**
- **[API Reference](https://docs.kreuzberg.dev/reference/api-python/)**
- **[Examples & Guides](https://docs.kreuzberg.dev/)**
## Contributing
Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
## Part of Kreuzberg.dev
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
## License
{{ license }} License — see [LICENSE](../../LICENSE) for details.
## Support
- **Discord Community**: [Join our Discord](https://discord.gg/xt9WY3GnKR)
- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)


@@ -0,0 +1,86 @@
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
<a href="https://github.com/kreuzberg-dev/alef">
<img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
</a>
<!-- Language Bindings -->
<a href="https://crates.io/crates/kreuzberg">
<img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
</a>
<a href="https://pypi.org/project/kreuzberg/">
<img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
</a>
<a href="https://www.npmjs.com/package/@kreuzberg/node">
<img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
</a>
<a href="https://www.npmjs.com/package/@kreuzberg/wasm">
<img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
</a>
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5">
<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v5*" alt="Go">
</a>
<a href="https://www.nuget.org/packages/Kreuzberg/">
<img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
</a>
<a href="https://packagist.org/packages/kreuzberg/kreuzberg">
<img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
</a>
<a href="https://rubygems.org/gems/kreuzberg">
<img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
</a>
<a href="https://hex.pm/packages/kreuzberg">
<img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
</a>
<a href="https://kreuzberg-dev.r-universe.dev/kreuzberg">
<img src="https://img.shields.io/badge/R-kreuzberg-007ec6" alt="R">
</a>
<a href="https://pub.dev/packages/kreuzberg">
<img src="https://img.shields.io/pub/v/kreuzberg?label=Dart&color=007ec6" alt="Dart">
</a>
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg-android">
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/swift">
<img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/zig">
<img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
<img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/kreuzberg">
<img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/charts%2Fkreuzberg">
<img src="https://img.shields.io/badge/Helm-ghcr.io-007ec6?logo=helm&logoColor=white" alt="Helm">
</a>
<!-- Project Info -->
<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
<img src="https://img.shields.io/badge/License-Elastic--2.0-007ec6" alt="License">
</a>
<a href="https://docs.kreuzberg.dev">
<img src="https://img.shields.io/badge/Docs-kreuzberg-007ec6" alt="Documentation">
</a>
<a href="https://huggingface.co/Kreuzberg">
<img src="https://img.shields.io/badge/Hugging%20Face-Kreuzberg-007ec6" alt="Hugging Face">
</a>
</div>
<div align="center" style="margin: 24px 0 0;">
<a href="https://kreuzberg.dev">
<img alt="Kreuzberg" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
</a>
</div>
<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
<a href="https://discord.gg/xt9WY3GnKR">
<img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
</a>
<a href="https://docs.kreuzberg.dev/demo.html">
<img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
</a>
</div>


@@ -0,0 +1,95 @@
### Supported File Formats (90+)
Kreuzberg supports 90+ file formats across 8 major categories, with intelligent format detection and comprehensive metadata extraction.
#### Office Documents
| Category | Formats | Capabilities |
|----------|---------|--------------|
| **Word Processing** | `.docx`, `.docm`, `.dotx`, `.dotm`, `.dot`, `.odt` | Full text, tables, images, metadata, styles |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods` | Sheet data, formulas, cell metadata, charts |
| **Presentations** | `.pptx`, `.pptm`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.ppt` | Slides, speaker notes, images, metadata |
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
| **Database** | `.dbf` | Table data extraction, field type support |
| **Hangul** | `.hwp`, `.hwpx` | Korean document format, text extraction |
#### Images (OCR-Enabled)
| Category | Formats | Features |
|----------|---------|----------|
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
#### Web & Data
| Category | Formats | Features |
|----------|---------|----------|
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, Djot, reStructuredText, Org Mode |
#### Email & Archives
| Category | Formats | Features |
|----------|---------|----------|
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
#### Academic & Scientific
| Category | Formats | Features |
|----------|---------|----------|
| **Citations** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl` | Structured parsing of RIS, PubMed/MEDLINE, EndNote XML, BibTeX, and CSL JSON |
| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
#### Code Intelligence (300+ Languages)
| Feature | Description |
|---------|-------------|
| **Structure Extraction** | Functions, classes, methods, structs, interfaces, enums |
| **Import/Export Analysis** | Module dependencies, re-exports, wildcard imports |
| **Symbol Extraction** | Variables, constants, type aliases, properties |
| **Docstring Parsing** | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ other formats |
| **Diagnostics** | Parse errors with line/column positions |
| **Syntax-Aware Chunking** | Split code by semantic boundaries, not arbitrary byte offsets |
Powered by [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — [documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev).
**[Complete Format Reference](https://docs.kreuzberg.dev/reference/formats/)**
### Key Capabilities
- **Text Extraction** - Extract all text content with position and formatting information
- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
- **Table Extraction** - Parse tables with structure and cell content preservation
- **Image Extraction** - Extract embedded images and render page previews
- **OCR Support** - Integrate multiple OCR backends for scanned documents
{% if features.async %}
- **Async/Await** - Non-blocking document processing with concurrent operations
{% endif %}
{% if features.plugin_system %}
- **Plugin System** - Extensible post-processing for custom text transformation
{% endif %}
{% if features.embeddings %}
- **Embeddings** - Generate vector embeddings using ONNX Runtime models
{% endif %}
- **Batch Processing** - Efficiently process multiple documents in parallel
- **Memory Efficient** - Stream large files without loading entirely into memory
- **Language Detection** - Detect and support multiple languages in documents
{% if features.code_intelligence %}
- **Code Intelligence** - Extract structure, imports, exports, symbols, and docstrings from [300+ programming languages](https://docs.tree-sitter-language-pack.kreuzberg.dev) via tree-sitter
{% endif %}
- **Configuration** - Fine-grained control over extraction behavior
### Performance Characteristics
| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| **PDF (text)** | 10-100 MB/s | ~50 MB per doc | Fastest extraction |
| **Office docs** | 20-200 MB/s | ~100 MB per doc | DOCX, XLSX, PPTX |
| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
| **Archives** | 5-50 MB/s | ~200 MB per doc | ZIP, TAR, etc. |
| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
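
As a rough worked example, assuming the indicative ranges in the table hold for your hardware and documents: processing time is approximately total size divided by throughput, $t = S / v$. For a hypothetical 500 MB batch of scanned images at the 1-5 MB/s OCR range:

```math
t_{\max} = 500 / 1 = 500\ \text{s}, \qquad t_{\min} = 500 / 5 = 100\ \text{s}
```

The same 500 MB of text PDFs at 10-100 MB/s would take roughly 5 to 50 seconds, which illustrates why OCR usually dominates the cost of a mixed workload.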


@@ -0,0 +1,272 @@
### Package Installation
{% for pm in package_manager %}
{% if pm == "pip" %}
Install via pip:
```bash
pip install {{ package_name }}
```
For async support and additional features:
```bash
pip install "{{ package_name }}[async]"
```
{% elif pm == "npm" %}
```bash
npm install {{ package_name }}
```
{% elif pm == "pnpm" %}
```bash
pnpm add {{ package_name }}
```
{% elif pm == "yarn" %}
```bash
yarn add {{ package_name }}
```
{% elif pm == "go get" %}
```bash
go get {{ package_name }}
```
For more details on FFI setup and native library linking, see the [Go API Reference](https://docs.kreuzberg.dev/reference/api-go/).
{% elif pm == "maven" %}
{% set maven_parts = package_name | split(":") %}
Add to your `pom.xml`:
```xml
<dependency>
<groupId>{{ maven_parts[0] }}</groupId>
<artifactId>{{ maven_parts[1] }}</artifactId>
<version>{{ version }}</version>
</dependency>
```
{% elif pm == "gradle" %}
Kotlin DSL (`build.gradle.kts`):
```kotlin
implementation("{{ package_name }}:{{ version }}")
```
Groovy DSL (`build.gradle`):
```groovy
implementation '{{ package_name }}:{{ version }}'
```
{% elif pm == "rubygems" %}
Install via gem:
```bash
gem install {{ package_name }}
```
{% elif pm == "bundler" %}
Or add to your Gemfile:
```ruby
gem '{{ package_name }}'
```
{% elif pm == "composer" %}
Install via Composer:
```bash
composer require {{ package_name }}
```
{% elif pm == "mix" %}
Add to your `mix.exs` dependencies:
```elixir
def deps do
[
{:{{ package_name }}, "~> {{ version }}"}
]
end
```
Then run:
```bash
mix deps.get
```
{% elif pm == "nuget" %}
Install via NuGet:
```bash
dotnet add package {{ package_name }}
```
Or via NuGet Package Manager:
```
Install-Package {{ package_name }}
```
{% elif pm == "pub" %}
Install via pub:
```bash
dart pub add {{ package_name }}
```
For Flutter projects:
```bash
flutter pub add {{ package_name }}
```
{% elif pm == "spm" %}
Add to your `Package.swift` dependencies:
```swift
.package(url: "https://github.com/kreuzberg-dev/kreuzberg.git", from: "{{ version }}"),
```
Then add the product to the relevant target:
```swift
.target(
name: "YourTarget",
dependencies: [
.product(name: "{{ package_name }}", package: "kreuzberg"),
]
),
```
{% elif pm == "hex" %}
Install via Hex:
```elixir
def deps do
[
{:{{ package_name }}, "~> {{ version }}"}
]
end
```
{% elif pm == "zig" %}
Fetch the package and pin it in `build.zig.zon`:
```bash
zig fetch --save https://github.com/kreuzberg-dev/kreuzberg/archive/refs/tags/v{{ version }}.tar.gz
```
Then wire it into `build.zig`:
```zig
const kreuzberg_dep = b.dependency("{{ package_name }}", .{
.target = target,
.optimize = optimize,
});
exe.root_module.addImport("{{ package_name }}", kreuzberg_dep.module("{{ package_name }}"));
```
{% elif pm == "install.packages" %}
Install from the kreuzberg R-universe:
```r
install.packages("{{ package_name }}",
repos = c("https://kreuzberg-dev.r-universe.dev", getOption("repos")))
```
{% elif pm == "cargo" %}
Build the shared library from the workspace:
```bash
cargo build --release -p {{ package_name }}
```
The built artifacts are emitted under `target/release/` (`lib{{ package_name | replace("-", "_") }}.{so,dylib,a}`) along with the C header at `crates/{{ package_name }}/include/kreuzberg.h`.
{% endif %}
{% endfor %}
### System Requirements
{% if language == "python" %}
- **Python 3.10+** required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "typescript" %}
- **Node.js 22+** required (NAPI-RS native bindings)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
### Platform Support
Pre-built binaries available for:
- macOS (arm64, x64)
- Linux (x64)
- Windows (x64)
{% elif language == "go" %}
- **Go 1.19+** required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "java" %}
- **Java 25+** required (Foreign Function & Memory API; build and run with `--enable-preview` and `--enable-native-access=ALL-UNNAMED`)
- Native libraries bundled in the JAR for macOS (arm64, x64), Linux (x64, arm64), and Windows (x64)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "kotlin" %}
- **JDK 25+** required
- Native libraries bundled via the Java facade JAR
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "ruby" %}
- **Ruby 3.2.0 or higher** required (including Ruby 4.x)
- Ruby 4.0+ is fully supported with no code changes required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "php" %}
- **PHP 8.2+** required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "csharp" %}
- **.NET 10.0+** required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "elixir" %}
- **Elixir 1.14+** and **Erlang/OTP 26+** required
- Pre-compiled NIFs bundled via `rustler_precompiled` for macOS (arm64, x64), Linux (x64, arm64), and Windows (x64)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "wasm" %}
- Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
- Optional: [Tesseract WASM](https://github.com/naptha/tesseract.js) for OCR functionality
{% elif language == "r" %}
- **R 4.1+** required (extendr bindings)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "ffi" %}
- A C/C++ toolchain (clang, gcc, or MSVC) and a Rust toolchain (`rustup`) for building from source
- A `pkg-config` or CMake-aware build system that can locate `libkreuzberg_ffi` and `kreuzberg.h`
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "dart" %}
- **Dart SDK 3.0+** for pure-Dart consumers
- Flutter projects supported on macOS, iOS, Android, Linux, and Windows; Flutter Web is not supported
- Native runtime delivered via `flutter_rust_bridge` with bundled binaries for the supported platforms
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "swift" %}
- **Swift 6.0+** (`swift-tools-version: 6.0`) on macOS 13+ or iOS 16+
- Native runtime delivered through the C FFI surface from `kreuzberg-ffi`; published artifacts ship as a binary target
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "zig" %}
- **Zig 0.16.0+** required (`minimum_zig_version` declared in `build.zig.zon`)
- Links the C FFI surface from `kreuzberg-ffi`; the build resolves the library via `linkSystemLibrary` against the consumer-provided search path
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% else %}
- See [Installation Guide](https://docs.kreuzberg.dev/getting-started/installation/) for requirements
{% endif %}


@@ -0,0 +1,24 @@
## NAPI-RS Implementation Details
### Native Performance
This binding uses NAPI-RS to provide native Node.js bindings with:
- **Zero-copy data transfer** between JavaScript and Rust layers
- **Native thread pool** for concurrent document processing
- **Direct memory management** for efficient large document handling
- **Binary-compatible** pre-built native modules across platforms
### Threading Model
- Single documents are processed synchronously or asynchronously in a dedicated thread
- Batch operations distribute work across available CPU cores
- Thread count is configurable and defaults to the number of CPU cores
- Long-running extractions block the event loop unless you call the async APIs
### Memory Management
- Large documents (> 100 MB) are streamed to avoid loading entirely into memory
- Temporary files are created in the system temp directory for extraction
- Memory is automatically released after extraction completion
- ONNX models are cached in memory for repeated embeddings operations


@@ -0,0 +1,77 @@
### Basic Extraction
Extract text, metadata, and structure from any supported document format:
{{ snippets.basic_extraction | include_snippet(language) }}
### Common Use Cases
#### Extract with Custom Configuration
Most use cases benefit from configuration to control extraction behavior:
{% if snippets.ocr_configuration %}
**With OCR (for scanned documents):**
{{ snippets.ocr_configuration | include_snippet(language) }}
{% endif %}
#### Table Extraction
{% if snippets.table_extraction %}
{{ snippets.table_extraction | include_snippet(language) }}
{% else %}
See [Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/) for table extraction options.
{% endif %}
#### Processing Multiple Files
{% if snippets.batch_processing %}
{{ snippets.batch_processing | include_snippet(language) }}
{% endif %}
{% if snippets.async_extraction %}
#### Async Processing
For non-blocking document processing:
{{ snippets.async_extraction | include_snippet(language) }}
{% endif %}
{% if snippets.config_discovery %}
#### Configuration Discovery
{{ snippets.config_discovery | include_snippet(language) }}
{% endif %}
{% if snippets.worker_pool %}
#### Worker Thread Pool
{{ snippets.worker_pool | include_snippet(language) }}
**Performance Benefits:**
- **Parallel Processing**: Multiple documents extracted simultaneously
- **CPU Utilization**: Maximizes multi-core CPU usage for large batches
- **Queue Management**: Automatically distributes work across available workers
- **Resource Control**: Prevents thread exhaustion with configurable pool size
**Best Practices:**
- Use worker pools for batches of 10+ documents
- Set the pool size to the number of CPU cores (default behavior)
- Always close pools with `closeWorkerPool()` to prevent resource leaks
- Reuse pools across multiple batch operations for efficiency
{% endif %}
### Next Steps
- **[Installation Guide](https://docs.kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
- **[API Documentation](https://docs.kreuzberg.dev/reference/api-python/)** - Complete API reference
- **[Examples & Guides](https://docs.kreuzberg.dev/)** - Full code examples and usage guides
- **[Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/)** - Advanced configuration options

453
templates/readme/python.md Normal file

@@ -0,0 +1,453 @@
# Kreuzberg
{% include 'partials/badges.html.jinja' %}
{{ description }}
## What This Package Provides
- **Python-native extraction** — sync and async APIs for files, bytes, URLs, and batch ingestion.
- **Structured results** — text, tables, images, metadata, language detection, chunks, and warnings in typed Python objects.
- **OCR choices** — Tesseract, EasyOCR, PaddleOCR, and VLM OCR where configured.
- **Same Rust engine as every binding** — behavior matches the Node.js, Ruby, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.
## Installation
```bash
pip install kreuzberg
```
### With OCR Support
```bash
pip install "kreuzberg[easyocr]"
pip install "kreuzberg[paddleocr]"
```
### All Features
```bash
pip install "kreuzberg[all]"
```
## Quick Start
### Basic Usage
{{ 'getting-started/basic_usage.md' | include_snippet('python') }}
### Simple Extraction
{{ 'getting-started/extract_file.md' | include_snippet('python') }}
### Reading Content
{{ 'getting-started/read_content.md' | include_snippet('python') }}
## OCR Support
### Using OCR
{{ 'getting-started/extract_with_ocr.md' | include_snippet('python') }}
### EasyOCR (GPU-Accelerated)
```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="en")
)
result = extract_file_sync(
    "photo.jpg",
    config=config,
    easyocr_kwargs={"use_gpu": True}
)
```
### PaddleOCR (Complex Layouts)
```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="ch")
)
result = extract_file_sync(
    "invoice.pdf",
    config=config,
)
```
## Table Extraction
```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        tesseract_config=TesseractConfig(
            enable_table_detection=True
        )
    )
)
result = extract_file_sync("invoice.pdf", config=config)
for table in result.tables:
    print(table.markdown)
    print(table.cells)
```
## Configuration
### Complete Configuration Example
```python
from kreuzberg import (
    extract_file_sync,
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ChunkingConfig,
    ImageExtractionConfig,
    PdfConfig,
    TokenReductionConfig,
    LanguageDetectionConfig,
)

config = ExtractionConfig(
    use_cache=True,
    enable_quality_processing=True,
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(
            psm=6,
            enable_table_detection=True,
            min_confidence=50.0,
        ),
    ),
    force_ocr=False,
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200,
    ),
    images=ImageExtractionConfig(
        extract_images=True,
        target_dpi=300,
        max_image_dimension=4096,
        auto_adjust_dpi=True,
    ),
    pdf_options=PdfConfig(
        extract_images=True,
        passwords=["password1", "password2"],
        extract_metadata=True,
    ),
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_important_words=True,
    ),
    language_detection=LanguageDetectionConfig(
        enabled=True,
        min_confidence=0.8,
        detect_multiple=False,
    ),
)

result = extract_file_sync("document.pdf", config=config)
```
### HTML Conversion Options & Batch Concurrency
```python
from kreuzberg import ExtractionConfig

config = ExtractionConfig(
    max_concurrent_extractions=8,
    html_options={
        "extract_metadata": True,
        "wrap": True,
        "wrap_width": 100,
        "strip_tags": ["script", "style"],
        "preprocessing": {"enabled": True, "preset": "standard"},
    },
)
```
## Metadata Extraction
```python
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")

if result.images:
    print(f"Extracted {len(result.images)} inline images")
if result.chunks:
    print(f"First chunk tokens: {result.chunks[0]['metadata']['token_count']}")

print(result.metadata.get("pdf", {}))
print(result.metadata.get("language"))
print(result.metadata.get("format"))

if "pdf" in result.metadata:
    pdf_meta = result.metadata["pdf"]
    print(f"Title: {pdf_meta.get('title')}")
    print(f"Author: {pdf_meta.get('author')}")
    print(f"Pages: {pdf_meta.get('page_count')}")
    print(f"Created: {pdf_meta.get('creation_date')}")
```
## Password-Protected PDFs
```python
from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        passwords=["password1", "password2", "password3"]
    )
)
result = extract_file_sync("protected.pdf", config=config)
```
## Language Detection
```python
from kreuzberg import extract_file_sync, ExtractionConfig, LanguageDetectionConfig

config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(enabled=True)
)
result = extract_file_sync("multilingual.pdf", config=config)
print(result.detected_languages)
```
## Text Chunking
```python
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200,
    )
)
result = extract_file_sync("long_document.pdf", config=config)
for chunk in result.chunks:
    print(chunk)
```
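To make the two parameters concrete, the following is a minimal character-window sketch of what `max_chars` and `max_overlap` control. It is an illustration only and not Kreuzberg's chunking algorithm; `chunk_text` is a hypothetical helper that uses nothing but the standard library.

```python
def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Each window repeats the last max_overlap characters of the previous one.
    Hypothetical helper for illustration; not part of the kreuzberg package.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same input.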
## Extract from Bytes
```python
from kreuzberg import extract_bytes_sync

with open("document.pdf", "rb") as f:
    data = f.read()
result = extract_bytes_sync(data, "application/pdf")
print(result.content)
```
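Because the bytes API takes an explicit MIME type, it can help to derive one from the original file name when it is known. The sketch below uses only Python's standard `mimetypes` module; `guess_mime_type` is a hypothetical helper, and whether Kreuzberg accepts the generic fallback value is an assumption to verify against the documentation.

```python
import mimetypes


def guess_mime_type(filename: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from a file name, falling back to a generic default.

    Hypothetical helper for illustration; not part of the kreuzberg package.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default
```

The result could then be passed as the second argument, for example `extract_bytes_sync(data, guess_mime_type("document.pdf"))`.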
## API Reference
### Extraction Functions
- `extract_file(file_path, mime_type=None, config=None, **kwargs)` - Async extraction
- `extract_file_sync(file_path, mime_type=None, config=None, **kwargs)` - Sync extraction
- `extract_bytes(data, mime_type, config=None, **kwargs)` - Async extraction from bytes
- `extract_bytes_sync(data, mime_type, config=None, **kwargs)` - Sync extraction from bytes
- `batch_extract_files(paths, config=None, **kwargs)` - Async batch extraction
- `batch_extract_files_sync(paths, config=None, **kwargs)` - Sync batch extraction
- `batch_extract_bytes(data_list, mime_types, config=None, **kwargs)` - Async batch from bytes
- `batch_extract_bytes_sync(data_list, mime_types, config=None, **kwargs)` - Sync batch from bytes
### Configuration Classes
- `ExtractionConfig` - Main configuration
- `OcrConfig` - OCR settings
- `TesseractConfig` - Tesseract-specific options
- `ChunkingConfig` - Text chunking settings
- `ImageExtractionConfig` - Image extraction settings
- `PdfConfig` - PDF-specific options
- `TokenReductionConfig` - Token reduction settings
- `LanguageDetectionConfig` - Language detection settings
### Result Types
- `ExtractionResult` - Main result object with `content`, `metadata`, `tables`, `detected_languages`, `chunks`
- `ExtractedTable` - Table with `cells`, `markdown`, `page_number`
- `Metadata` - Typed metadata dictionary
### Exceptions
- `KreuzbergError` - Base exception
- `ValidationError` - Invalid configuration or input
- `ParsingError` - Document parsing failure
- `OCRError` - OCR processing failure
- `MissingDependencyError` - Missing optional dependency
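Since every exception listed above derives from one base class, a caller can catch that base class to handle any library failure in one place. The following is a minimal sketch of that pattern; `try_extract` is a hypothetical helper that takes the extraction function and the base exception as parameters, so it runs without assuming anything about the package beyond the names listed here.

```python
from typing import Callable, Optional, Tuple, Type


def try_extract(
    path: str,
    extract: Callable[[str], object],
    base_error: Type[Exception],
) -> Tuple[Optional[object], Optional[Exception]]:
    """Call extract(path) and return (result, None) on success.

    If the library's base exception (or a subclass) is raised, return
    (None, error) instead. Any other exception propagates unchanged.
    Hypothetical helper for illustration; not part of the kreuzberg package.
    """
    try:
        return extract(path), None
    except base_error as exc:
        return None, exc
```

With the names documented above, a call might look like `try_extract("document.pdf", extract_file_sync, KreuzbergError)`; the returned error can then be inspected with `isinstance` to tell an `OCRError` from a `ParsingError`.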
## Examples
### Custom Processing
```python
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
text = result.content
text = text.lower()
text = text.replace("old", "new")
print(text)
```
### Multiple Files with Progress
```python
from kreuzberg import extract_file_sync
from pathlib import Path

files = list(Path("documents").glob("*.pdf"))
results = []
for i, file in enumerate(files, 1):
    print(f"Processing {i}/{len(files)}: {file.name}")
    result = extract_file_sync(str(file))
    results.append((file.name, result))
for name, result in results:
    print(f"{name}: {len(result.content)} characters")
```
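The loop above handles one file at a time. When per-file progress is wanted alongside concurrency, a thread pool is one possible approach; the sketch below takes the extraction function as a parameter so it stays self-contained. `extract_concurrently` is a hypothetical helper, and how much speedup threads give depends on whether the underlying call releases the GIL, which is an assumption here. The `batch_extract_files_sync` function listed in the API reference is the built-in alternative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def extract_concurrently(
    paths: Iterable[str],
    extract: Callable[[str], object],
    max_workers: int = 4,
) -> dict:
    """Run extract(path) for every path on a thread pool, printing progress.

    Returns a dict mapping each path to its result. The first exception
    raised by extract propagates to the caller.
    Hypothetical helper for illustration; not part of the kreuzberg package.
    """
    path_list = list(paths)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it was submitted for.
        futures = {pool.submit(extract, p): p for p in path_list}
        for done, future in enumerate(as_completed(futures), 1):
            path = futures[future]
            print(f"Finished {done}/{len(path_list)}: {path}")
            results[path] = future.result()
    return results
```

A call such as `extract_concurrently([str(f) for f in files], extract_file_sync)` would mirror the sequential loop, with results keyed by path instead of ordered by position.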
### Filter by Language
```python
from kreuzberg import extract_file_sync, ExtractionConfig, LanguageDetectionConfig

config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(enabled=True)
)
result = extract_file_sync("document.pdf", config=config)
if result.detected_languages and "en" in result.detected_languages:
    print("English document detected")
    print(result.content)
```
## System Requirements
### ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime version 1.22.x must be installed:
```bash
# macOS
brew install onnxruntime
# Ubuntu/Debian (download from GitHub - Debian packages may have older versions)
# Download from https://github.com/microsoft/onnxruntime/releases
# Windows
# Download from https://github.com/microsoft/onnxruntime/releases
```
**Important:** Embeddings require ONNX Runtime 1.22.x. Without it, embeddings calls raise `MissingDependencyError` with installation instructions.
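If a deployment script needs to confirm the installed version before enabling embeddings, a small check along these lines is one option. It is a sketch only: `is_supported_onnxruntime` is a hypothetical helper, the `(1, 22)` default mirrors the 1.22.x requirement stated above, and how the version string is obtained is left to the caller.

```python
def is_supported_onnxruntime(version: str, required: tuple = (1, 22)) -> bool:
    """Return True if a version string such as '1.22.1' matches required major.minor.

    Hypothetical helper for illustration; not part of the kreuzberg package.
    """
    parts = version.strip().split(".")
    try:
        major_minor = (int(parts[0]), int(parts[1]))
    except (IndexError, ValueError):
        return False  # not a dotted numeric version
    return major_minor == required
```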
### Tesseract OCR (Required for OCR)
```bash
# macOS
brew install tesseract
```
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
```
### Pandoc (Optional, for some formats)
```bash
# macOS
brew install pandoc
```
```bash
# Ubuntu/Debian
sudo apt-get install pandoc
```
## Troubleshooting
### Import Error: No module named '\_kreuzberg'
This usually means the Rust extension wasn't built correctly. Try:
```bash
pip install --force-reinstall --no-cache-dir kreuzberg
```
### OCR Not Working
Make sure Tesseract is installed:
```bash
tesseract --version
```
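The same check can be made from Python, which is convenient in a startup script or test fixture. This sketch relies only on the standard library; `tesseract_version` is a hypothetical helper, and it only confirms that a `tesseract` executable is on `PATH`, not that Kreuzberg can use it.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if unavailable.

    Hypothetical helper for illustration; not part of the kreuzberg package.
    """
    exe = shutil.which("tesseract")
    if exe is None:
        return None  # no tesseract executable on PATH
    proc = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = (proc.stdout or proc.stderr).strip()
    return output.splitlines()[0] if output else None
```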
### Memory Issues with Large PDFs
Use streaming or enable chunking:
```python
from kreuzberg import ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=1000)
)
```
## PDFium Integration
PDF extraction is powered by PDFium, which is automatically bundled with this package. No system installation required.
### Platform Support
| Platform | Status | Notes |
| -------------- | ------ | ------- |
| Linux x86_64 | ✅ | Bundled |
| macOS ARM64 | ✅ | Bundled |
| macOS x86_64 | ✅ | Bundled |
| Windows x86_64 | ✅ | Bundled |
### Binary Size Impact
PDFium adds approximately 8-15 MB to the package size depending on platform. This ensures consistent PDF extraction across all environments without external dependencies.
## Documentation
For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev)
## Part of Kreuzberg.dev
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
## License
{{ license }} License - see [LICENSE](../../LICENSE) for details.

448
templates/readme/root.md Normal file

@@ -0,0 +1,448 @@
# Kreuzberg
{% include 'partials/badges.html.jinja' %}
Extract text, metadata, and code intelligence from 90+ file formats and 300+ programming languages at native speeds without needing a GPU.
## Key Features
- **Code intelligence** - Extract functions, classes, imports, symbols, and docstrings from [300+ programming languages](https://docs.tree-sitter-language-pack.kreuzberg.dev) via tree-sitter. Results in `ExtractionResult.code_intelligence` with semantic chunking
- **Extensible architecture** - Plugin system for custom OCR backends, validators, post-processors, document extractors, and renderers
- **Polyglot** - Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, and C
- **90+ file formats** - PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
- **LLM intelligence** - VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction with schema constraints, and provider-hosted embeddings via 143 LLM providers (including local engines: Ollama, LM Studio, vLLM, llama.cpp) through [liter-llm](https://github.com/kreuzberg-dev/liter-llm)
- **OCR support** - Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), VLM OCR (143 vision model providers including local engines), extensible via plugin API
- **High performance** - Rust core with pure-Rust PDF, SIMD optimizations and full parallelism
- **Flexible deployment** - Use as library, CLI tool, REST API server, or MCP server
- **TOON wire format** - Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
- **GFM-quality output** - Comrak-based rendering with proper fenced code blocks, table nodes, bracket escaping, and cross-format parity (Markdown, HTML, Djot, Plain)
- **HTML passthrough** - HTML-to-Markdown conversion uses html-to-markdown output directly, bypassing lossy intermediate round-trips
- **Memory efficient** - Streaming parsers for multi-GB files
**[Complete Documentation](https://kreuzberg.dev/)** | **[Live Demo](https://docs.kreuzberg.dev/demo.html)** | **[Installation Guides](#installation)**
## Installation
Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:
**Scripting Languages:**
- **[Python](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/python)** - PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
- **[Ruby](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/ruby)** - RubyGems package, idiomatic Ruby API, native bindings
- **[PHP](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/php)** - Composer package, modern PHP 8.2+ support, type-safe API, async extraction
- **[Elixir](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/elixir)** - Hex package, OTP integration, concurrent processing
- **[R](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/r)** - r-universe package, idiomatic R API, extendr bindings
- **[Dart / Flutter](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/dart)** - pub.dev package, flutter_rust_bridge runtime, native bindings for macOS/iOS/Android/Linux/Windows
**JavaScript/TypeScript:**
- **[@kreuzberg/node](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg-node)** - Native NAPI-RS bindings for Node.js/Bun, fastest performance
- **[@kreuzberg/wasm](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg-wasm)** - WebAssembly for browsers/Deno/Cloudflare Workers, comprehensive format and OCR support (PDF, Excel, archives, all office formats, real Tesseract via the WASI build); only ORT-dependent features (paddle-ocr, layout detection, embeddings, auto-rotate) and server modes (api/mcp/cli) are excluded
**Compiled Languages:**
- **[Go](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5)** - Go module with FFI bindings, context-aware async
- **[Java](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/java)** - Maven Central, Foreign Function & Memory API
- **[Kotlin](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/kotlin-android)** - Maven Central, Kotlin/JVM with idiomatic data classes, sealed enums, and coroutine-based async
- **[C#](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/csharp)** - NuGet package, .NET 6.0+, full async/await support
- **[Swift](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/swift)** - Swift Package Manager, macOS 13+/iOS 16+, native Swift types and async/await
**Native:**
- **[Rust](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg)** - Core library, flexible feature flags, zero-copy APIs
- **[Zig](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/zig)** - `zig fetch` + `build.zig.zon`, idiomatic error sets, optional types, slice-based memory
- **[C (FFI)](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg-ffi)** - C header + shared library, pkg-config/CMake support, cross-platform
**Containers:**
- **[Docker](https://docs.kreuzberg.dev/guides/docker/)** - Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)
**Command-Line:**
- **[CLI](https://docs.kreuzberg.dev/cli/usage/)** - Cross-platform binary, batch processing, MCP server mode
> Language bindings ship precompiled binaries for x86_64 and aarch64 on Linux and for Apple Silicon on macOS; see [Platform Support](#platform-support) for per-binding coverage.
## Platform Support
Complete architecture coverage across all language bindings:
| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
| -------- | :----------: | :-----------: | :---------: | :---------: |
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | - |
| R | ✅ | ✅ | ✅ | ✅ |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| Kotlin | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Swift | - | - | ✅ | - |
| Dart | ✅ | ✅ | ✅ | ✅ |
| Zig | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
| Docker | ✅ | ✅ | ✅ | - |
**Note**: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. macOS support is Apple Silicon only.
### Mobile (iOS, Android)
| Target | ORT-dependent features\* |
| -------------------------------------------------- | :----------------------: |
| iOS (`aarch64-apple-ios`, `aarch64-apple-ios-sim`) | ✅ |
| Android arm64 (`aarch64-linux-android`) | ✅ |
| Android x86_64 emulator (`x86_64-linux-android`) | ❌ |
\*ORT-dependent features: PaddleOCR, layout detection, embeddings, auto-rotate.
All non-ORT capabilities (Tesseract OCR, every document format, chunking, language detection, keywords, tree-sitter code intelligence, API/MCP, LLM) are available on all four mobile targets.
The `x86_64-linux-android` emulator triple lacks an upstream ORT prebuilt, so the `kreuzberg` core crate exposes an `android-target` aggregate feature that selects the same no-ORT feature set as WASM. The `kreuzberg-ffi` and `kreuzberg-dart` crates select that aggregate automatically for the emulator via target-conditional dependencies; host builds and arm64 devices get the full feature set.
### Browsers / Edge (WebAssembly)
WASM excludes the same ORT-dependent feature set as the Android x86_64 emulator. The shared no-ORT base lives behind the `no-ort-target` feature in the core crate; both `wasm-target` and `android-target` compose it.
### Embeddings Support (Optional)
To use embeddings functionality:
1. **Install ONNX Runtime 1.24+**:
   - Linux: Download from [ONNX Runtime releases](https://github.com/microsoft/onnxruntime/releases) (Debian packages may have older versions)
   - macOS: `brew install onnxruntime`
   - Windows: Download from [ONNX Runtime releases](https://github.com/microsoft/onnxruntime/releases)
2. Use embeddings in your code - see [Embeddings Guide](https://docs.kreuzberg.dev/features/#embeddings)
**Note:** Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.
## Supported Formats
90+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
### Office Documents
| Category | Formats | Capabilities |
| ------------------- | ------------------------------------------------------------------------------------------------ | -------------------------------------------------- |
| **Word Processing** | `.docx`, `.docm`, `.dotx`, `.dotm`, `.dot`, `.odt`, `.pages` | Full text, tables, lists, images, metadata, styles |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods`, `.numbers` | Sheet data, formulas, cell metadata, charts |
| **Presentations** | `.pptx`, `.pptm`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.key` | Slides, speaker notes, images, metadata |
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
| **Database** | `.dbf` | Table data extraction, field type support |
| **Hangul** | `.hwp`, `.hwpx` | Korean document format, text extraction |
### Images (OCR-Enabled)
| Category | Formats | Features |
| ------------ | -------------------------------------------------------------------------------- | ------------------------------------------------------------ |
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection |
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
### Web & Data
| Category | Formats | Features |
| ------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- |
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.djot`, `.mdx`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text |
### Email & Archives
| Category | Formats | Features |
| ------------ | ------------------------------------ | ------------------------------------------------------- |
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, UTF-16 support |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | Recursive extraction, nested archives, metadata |
### Academic & Scientific
| Category | Formats | Features |
| ----------------- | ----------------------------------------------------- | ----------------------------------------------------------- |
| **Citations** | `.bib`, `.ris`, `.nbib`, `.enw`, `.csl` | BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON |
| **Scientific** | `.tex`, `.latex`, `.typ`, `.typst`, `.jats`, `.ipynb` | LaTeX, Typst, JATS journal articles, Jupyter notebooks |
| **Publishing** | `.fb2`, `.docbook`, `.dbk`, `.opml` | FictionBook, DocBook XML, OPML outlines |
| **Documentation** | `.pod`, `.mdoc`, `.troff` | Perl POD, man pages, troff |
**[Complete Format Reference →](https://docs.kreuzberg.dev/reference/formats/)**
### Code Intelligence (300+ Languages)
| Feature | Description |
| -------------------------- | ------------------------------------------------------------- |
| **Structure Extraction** | Functions, classes, methods, structs, interfaces, enums |
| **Import/Export Analysis** | Module dependencies, re-exports, wildcard imports |
| **Symbol Extraction** | Variables, constants, type aliases, properties |
| **Docstring Parsing** | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| **Diagnostics** | Parse errors with line/column positions |
| **Syntax-Aware Chunking** | Split code by semantic boundaries, not arbitrary byte offsets |
Powered by [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) with dynamic grammar download. See [TSLP documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) for the full language list.
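To illustrate what splitting "by semantic boundaries" means, the sketch below chunks Python source at top-level function and class definitions. It deliberately uses Python's standard `ast` module as a stand-in for tree-sitter, so it covers only Python and is not how Kreuzberg implements the feature; `chunk_python_source` is a hypothetical helper.

```python
import ast


def chunk_python_source(source: str) -> list[tuple[str, str]]:
    """Return (name, code) pairs for each top-level function or class.

    Stand-in for syntax-aware chunking using the stdlib parser.
    Hypothetical helper for illustration; not part of the kreuzberg package.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text spanned by the node.
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks
```

Each chunk is a complete definition, so a downstream consumer never sees a function cut in half the way a fixed byte window could produce.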
## Key Features
<details>
<summary><strong>OCR with Table Extraction</strong></summary>
Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.
**[OCR Backend Documentation →](https://docs.kreuzberg.dev/guides/ocr/)**
</details>
<details>
<summary><strong>Batch Processing</strong></summary>
Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.
**[Batch Processing Guide →](https://docs.kreuzberg.dev/features/#batch-processing)**
</details>
<details>
<summary><strong>Password-Protected PDFs</strong></summary>
Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.
**[PDF Configuration →](https://docs.kreuzberg.dev/guides/configuration/)**
</details>
<details>
<summary><strong>Language Detection</strong></summary>
Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.
**[Language Detection Guide →](https://docs.kreuzberg.dev/features/#language-detection)**
</details>
<details>
<summary><strong>Metadata Extraction</strong></summary>
Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.
**[Metadata Guide →](https://docs.kreuzberg.dev/reference/types/#metadata)**
</details>
## AI Coding Assistants
Kreuzberg ships with an [Agent Skill](https://agentskills.io) that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.
Install the skill into any project using the [Vercel Skills CLI](https://github.com/vercel-labs/skills):
```bash
npx skills add kreuzberg-dev/kreuzberg
```
The skill is located at [`skills/kreuzberg/SKILL.md`](skills/kreuzberg/SKILL.md) and is automatically discovered by supported AI coding tools once installed.
## Documentation
- **[Installation Guide](https://docs.kreuzberg.dev/getting-started/installation/)** - Setup and dependencies
- **[User Guide](https://docs.kreuzberg.dev/guides/extraction/)** - Comprehensive usage guide
- **[API Reference](https://docs.kreuzberg.dev/reference/api-python/)** - Complete API documentation
- **[Format Support](https://docs.kreuzberg.dev/reference/formats/)** - Supported file formats
- **[OCR Backends](https://docs.kreuzberg.dev/guides/ocr/)** - OCR engine setup
- **[CLI Guide](https://docs.kreuzberg.dev/cli/usage/)** - Command-line usage
- **[Migration Guides](https://docs.kreuzberg.dev/migration/from-unstructured/)** - Upgrading from other libraries
## Contributing
Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
## Part of Kreuzberg.dev
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
## License
Elastic License 2.0 (ELv2) - see [LICENSE](LICENSE) for details. See [https://www.elastic.co/licensing/elastic-license](https://www.elastic.co/licensing/elastic-license) for the full license text.
## FAQ
### What is Kreuzberg?
Kreuzberg is a polyglot document intelligence framework with a Rust core. It extracts text, metadata, and code intelligence from 90+ file formats and 300+ programming languages at native speeds without needing a GPU. It provides native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, and C.
### How does Kreuzberg differ from other document extraction tools?
- **Kreuzberg**: Rust core, 90+ formats, 300+ languages, polyglot bindings, code intelligence via tree-sitter, VLM OCR, native speeds, no GPU needed
- **Apache Tika**: Java-based, broader format support, but slower, no code intelligence, no VLM OCR
- **pdfplumber**: Python-only, PDF focus, slower, no code intelligence
- **unstructured**: Python-based, good format coverage, but slower, requires more dependencies
Kreuzberg's Rust core with SIMD optimizations and parallelism delivers 10-100x faster extraction than Python alternatives.
### What are Kreuzberg's key features?
- **Code intelligence** — Extract functions, classes, imports, symbols, docstrings from 300+ languages via tree-sitter
- **Extensible architecture** — Plugin system for custom OCR backends, validators, post-processors, document extractors, renderers
- **Polyglot bindings** — Native bindings for 14+ languages (Rust, Python, Node.js, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, C)
- **90+ file formats** — PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
- **LLM intelligence** — VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction, embeddings via 143 LLM providers
- **OCR support** — Tesseract (all bindings including WASM for browsers), PaddleOCR, EasyOCR, VLM OCR, extensible via plugin API
- **High performance** — Rust core with pure-Rust PDF, SIMD optimizations, full parallelism
- **Flexible deployment** — Library, CLI tool, REST API server, or MCP server
- **TOON wire format** — Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
- **GFM-quality output** — Comrak-based Markdown rendering with proper fenced code blocks, table nodes
- **Memory efficient** — Streaming parsers for multi-GB files
### What file formats does Kreuzberg support?
8 categories covering 90+ formats:
- **Documents** — PDF, DOCX, DOC, ODT, RTF, Hangul
- **Office** — XLSX, XLS, PPTX, PPT, ODS, iWork
- **Images** — PNG, JPEG, TIFF, BMP, GIF, WebP
- **Web** — HTML, XML, XHTML, SVG
- **Emails** — MSG, EML, PST
- **Archives** — ZIP, TAR, GZ, TGZ, 7Z
- **Academic** — LaTeX, BibTeX, RIS
- **Code** — 300+ programming languages via tree-sitter
### How do I get started?
Choose your platform:
**Python:**
```bash
pip install kreuzberg
```
See [Python docs](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/python)
**Node.js:**
```bash
npm install @kreuzberg/node
```
See [Node.js docs](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg-node)
**Rust:**
```bash
cargo add kreuzberg
```
See [Rust docs](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg)
**Docker:**
```bash
docker pull ghcr.io/kreuzberg-dev/kreuzberg:latest
```
See [Docker docs](https://docs.kreuzberg.dev/guides/docker/)
### What LLM/VLM providers are supported?
143 providers including:
- **OpenAI** — GPT-4o (vision), text models
- **Anthropic** — Claude (vision), Claude 3.5 Sonnet
- **Google** — Gemini (vision), Gemini 2.0 Flash
- **Local engines** — Ollama, LM Studio, vLLM, llama.cpp
- **Cloud providers** — Fireworks, Together, Groq, OpenRouter
- **All OpenAI-compatible endpoints**
### What OCR backends are available?
- **Tesseract** — All bindings, including Tesseract-WASM for browsers
- **PaddleOCR** — All native bindings (Python, Node.js, etc.)
- **EasyOCR** — Python binding
- **VLM OCR** — 143 vision model providers (GPT-4o, Claude, Gemini, Ollama local)
- **Custom OCR** — Extensible via plugin API
### What is the TOON wire format?
TOON is Kreuzberg's token-efficient serialization format for LLM/RAG pipelines. It uses ~30-50% fewer tokens than JSON, making it ideal for:
- Large document processing
- RAG system integration
- LLM context window optimization
- Cost reduction in API calls
### What is code intelligence extraction?
Kreuzberg extracts semantic code information via tree-sitter:
- **Functions** — Names, parameters, return types, docstrings
- **Classes** — Names, methods, inheritance, properties
- **Imports** — Module names, import paths
- **Symbols** — Variables, constants, type definitions
- **Docstrings** — Documentation comments
Results are exposed on `ExtractionResult.code_intelligence`, together with semantic chunking.
### Does Kreuzberg work in browsers?
Yes! The WASM package (`@kreuzberg/wasm`) supports browsers, Deno, and Cloudflare Workers with:
- PDF, Excel, archives, all office formats
- Real Tesseract OCR via WASI build
- Only ORT-dependent features excluded (PaddleOCR, layout detection, embeddings, auto-rotate)
### What deployment options are available?
- **Library** — Use as a dependency in your application
- **CLI** — Cross-platform binary for batch processing
- **REST API server** — HTTP endpoint for document extraction
- **MCP server** — Model Context Protocol server for AI assistants
- **Docker** — Official images with API, CLI, and MCP modes
### What languages have native bindings?
| Language | Package Manager | Status |
|----------|----------------|--------|
| Rust | Cargo | ✅ Core library |
| Python | PyPI | ✅ Full support |
| Node.js | npm (NAPI-RS) | ✅ Fastest performance |
| WASM | npm | ✅ Browser/Deno/CF Workers |
| Ruby | RubyGems | ✅ Native bindings |
| Go | Go modules | ✅ FFI bindings |
| Java | Maven Central | ✅ Foreign Function API |
| Kotlin | Maven Central | ✅ Coroutine-based |
| C# | NuGet | ✅ .NET 6.0+ |
| PHP | Composer | ✅ PHP 8.2+ |
| Elixir | Hex | ✅ OTP integration |
| R | r-universe | ✅ extendr bindings |
| Dart/Flutter | pub.dev | ✅ flutter_rust_bridge |
| Swift | SPM | ✅ macOS 13+/iOS 16+ |
| Zig | build.zig.zon | ✅ Idiomatic API |
| C (FFI) | pkg-config/CMake | ✅ Header + shared lib |
### What platforms are supported?
Bindings target the following platforms:
- **Linux** — x86_64 and aarch64
- **macOS** — ARM64
- **Windows** — x64 (most bindings)
Precompiled binaries are included for each architecture listed above.
### What license does Kreuzberg use?
Elastic License 2.0 (ELv2): source-available, with restrictions on offering the software as a hosted or managed service. See [LICENSE](https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE) for details.
### Where can I get help?
- **Documentation**: [docs.kreuzberg.dev](https://docs.kreuzberg.dev)
- **Live Demo**: [docs.kreuzberg.dev/demo.html](https://docs.kreuzberg.dev/demo.html)
- **Discord**: [discord.gg/xt9WY3GnKR](https://discord.gg/xt9WY3GnKR)
- **Hugging Face**: [huggingface.co/Kreuzberg](https://huggingface.co/Kreuzberg)
- **GitHub Issues**: [github.com/kreuzberg-dev/kreuzberg/issues](https://github.com/kreuzberg-dev/kreuzberg/issues)

382
templates/readme/ruby.md Normal file

@@ -0,0 +1,382 @@
# Kreuzberg for Ruby
{% include 'partials/badges.html.jinja' %}
{{ description }}
## What This Package Provides
- **Ruby-native extraction** — idiomatic Ruby objects over the shared Rust document engine.
- **Structured results** — text, tables, images, metadata, language detection, chunks, and warnings.
- **OCR support** — Tesseract and PaddleOCR through the same configuration model as other bindings.
- **Cross-binding parity** — output matches the Python, Node.js, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.
## Installation
Add to your Gemfile:
```ruby
gem 'kreuzberg'
```
Then execute:
```bash
bundle install
```
Or install it directly:
```bash
gem install kreuzberg
```
## Quick Start
### Basic Usage
```ruby
require 'kreuzberg'
# Simple synchronous extraction
result = Kreuzberg.extract_file("document.pdf")
puts result.content
```
### Async Extraction
```ruby
require 'kreuzberg'
# Using Fiber for concurrency (Ruby 3.0+)
Fiber.new do
  result = Kreuzberg.extract_file_async("document.pdf")
  puts result.content
end.resume
```
### Batch Processing
```ruby
require 'kreuzberg'
files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]
results = files.map { |file| Kreuzberg.extract_file(file) }
results.each do |result|
  puts "Content length: #{result.content.length}"
end
```
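When a batch contains a corrupt or unsupported file, a plain `map` stops at the first exception. The sketch below shows one way to collect failures and continue. `extract_all` is a hypothetical helper, not part of the Kreuzberg API, and the stand-in extractor exists only so the example runs without the gem.

```ruby
# Hypothetical helper (not part of the Kreuzberg API): run an extractor over
# many paths and collect failures instead of aborting on the first one.
def extract_all(paths, extractor)
  successes = {}
  failures = {}

  paths.each do |path|
    successes[path] = extractor.call(path)
  rescue StandardError => e
    # Record the error and keep going with the remaining files.
    failures[path] = e
  end

  [successes, failures]
end

# Stand-in extractor so the sketch runs without the gem installed.
# With the gem, pass Kreuzberg.method(:extract_file) instead.
fake_extractor = lambda do |path|
  raise ArgumentError, "unsupported file: #{path}" if path.end_with?(".bad")

  "text of #{path}"
end

ok, failed = extract_all(["doc1.pdf", "broken.bad", "doc3.xlsx"], fake_extractor)
puts "Extracted: #{ok.keys.join(', ')}"
failed.each { |path, error| puts "Failed: #{path} (#{error.message})" }
```

Passing the extractor as a callable keeps the helper independent of the gem, which also makes it easy to unit-test.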
## Configuration
```ruby
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
  use_cache: true,
  enable_quality_processing: true,
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    language: 'eng'
  )
)
result = Kreuzberg.extract_file("document.pdf", config: config)
puts result.content
```
## OCR Support
### Tesseract Configuration
```ruby
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    language: 'eng',
    tesseract_config: Kreuzberg::TesseractConfig.new(
      psm: 6,
      enable_table_detection: true
    )
  )
)
result = Kreuzberg.extract_file("scanned.pdf", config: config)
puts result.content
```
## Table Extraction
```ruby
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    tesseract_config: Kreuzberg::TesseractConfig.new(
      enable_table_detection: true
    )
  )
)
result = Kreuzberg.extract_file("invoice.pdf", config: config)
result.tables.each_with_index do |table, index|
  puts "Table #{index}:"
  puts table.markdown
end
```
## Metadata Extraction
```ruby
require 'kreuzberg'
result = Kreuzberg.extract_file("document.pdf")
# PDF metadata
if result.metadata[:pdf]
  pdf_meta = result.metadata[:pdf]
  puts "Title: #{pdf_meta[:title]}"
  puts "Author: #{pdf_meta[:author]}"
  puts "Pages: #{pdf_meta[:page_count]}"
end
# Detected languages
puts "Languages: #{result.detected_languages}"
# Images
if result.images
  puts "Images found: #{result.images.count}"
end
```
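Because the metadata is a plain Hash, `Hash#dig` can read nested keys without a guard clause for each level. The following is a minimal sketch on a made-up sample hash shaped like the one above. The `:docx`/`:subject` keys are assumptions used only to show the missing-key case.

```ruby
# Sample hash shaped like the metadata used above; the values are made up.
metadata = { pdf: { title: "Quarterly Report", author: nil, page_count: 12 } }

# Hash#dig returns nil instead of raising when an intermediate key is missing,
# so a fallback can be supplied with ||.
title = metadata.dig(:pdf, :title) || "(untitled)"
author = metadata.dig(:pdf, :author) || "(unknown)"
subject = metadata.dig(:docx, :subject) || "(n/a)" # hypothetical key, absent here

puts "Title: #{title}"
puts "Author: #{author}"
puts "Subject: #{subject}"
```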
## Text Chunking
```ruby
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
  chunking: Kreuzberg::ChunkingConfig.new(
    max_chars: 1000,
    max_overlap: 200
  )
)
result = Kreuzberg.extract_file("long_document.pdf", config: config)
result.chunks.each_with_index do |chunk, index|
  puts "Chunk #{index}: #{chunk.length} characters"
end
```
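To build intuition for how a size limit and an overlap interact, here is a naive character-based sliding window in plain Ruby. It is an illustration only and not Kreuzberg's chunking algorithm, which may split on different boundaries. `sliding_chunks` is a hypothetical helper that borrows the `max_chars` and `max_overlap` names from the configuration above.

```ruby
# Illustration only: a naive character-based sliding window. This is NOT
# Kreuzberg's chunking algorithm; sliding_chunks is a hypothetical helper.
def sliding_chunks(text, max_chars:, max_overlap:)
  raise ArgumentError, "max_overlap must be smaller than max_chars" if max_overlap >= max_chars

  step = max_chars - max_overlap # how far the window advances each time
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, max_chars]
    break if start + max_chars >= text.length # the window reached the end

    start += step
  end
  chunks
end

chunks = sliding_chunks("abcdefghij", max_chars: 4, max_overlap: 1)
chunks.each_with_index { |chunk, index| puts "Chunk #{index}: #{chunk.inspect}" }
```

Each chunk starts `max_chars - max_overlap` characters after the previous one, so consecutive chunks share `max_overlap` characters of context.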
## Password-Protected PDFs
```ruby
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    passwords: ["password1", "password2"]
  )
)
result = Kreuzberg.extract_file("protected.pdf", config: config)
puts result.content
```
## Language Detection
```ruby
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
  language_detection: Kreuzberg::LanguageDetectionConfig.new(
    enabled: true
  )
)
result = Kreuzberg.extract_file("multilingual.pdf", config: config)
puts "Detected languages: #{result.detected_languages}"
```
## API Reference
### Main Methods
- `Kreuzberg.extract_file(path, config: nil)` - Extract from file
- `Kreuzberg.extract_file_async(path, config: nil)` - Async extraction
- `Kreuzberg.extract_bytes(data, mime_type, config: nil)` - Extract from bytes
- `Kreuzberg.batch_extract_files(paths, config: nil)` - Batch processing
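`extract_bytes` takes the MIME type as its second argument, so callers working from in-memory data need a way to supply it. The sketch below uses a small lookup table keyed by file extension. `mime_type_for` and its table are hypothetical, not part of Kreuzberg, and cover only a few common formats.

```ruby
# Hypothetical helper (not part of Kreuzberg): map a file extension to the
# MIME type string that extract_bytes expects as its second argument.
def mime_type_for(path)
  known = {
    ".pdf" => "application/pdf",
    ".docx" => "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx" => "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".html" => "text/html",
    ".txt" => "text/plain"
  }

  # fetch with a block raises a descriptive error for unknown extensions.
  known.fetch(File.extname(path).downcase) do |ext|
    raise ArgumentError, "no MIME type known for #{ext.inspect}"
  end
end

puts mime_type_for("report.PDF")

# With the gem installed, the call would then look like:
#   data = File.binread("report.pdf")
#   result = Kreuzberg.extract_bytes(data, mime_type_for("report.pdf"))
```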
### Configuration Classes
- `ExtractionConfig` - Main configuration
- `OcrConfig` - OCR settings
- `TesseractConfig` - Tesseract-specific options
- `ChunkingConfig` - Text chunking settings
- `PdfConfig` - PDF-specific options
- `LanguageDetectionConfig` - Language detection settings
### Result Object
- `content` - Extracted text
- `metadata` - File metadata as Hash
- `tables` - Array of ExtractedTable objects
- `detected_languages` - Array of language codes
- `chunks` - Array of text chunks
- `images` - Array of extracted images (if enabled)
## System Requirements
### Ruby Version
- **Ruby 3.2.0 or higher** (including Ruby 4.x)
- Ruby 4.0+ is fully supported with no code changes required
- Magnus bindings compile successfully on all supported Ruby versions
### Required
- Rust toolchain (for native extension compilation)
### Optional
```bash
# Tesseract OCR
brew install tesseract # macOS
sudo apt-get install tesseract-ocr # Ubuntu/Debian
```
### Ruby 4.0 Compatibility
Kreuzberg is fully compatible with Ruby 4.0 (released December 25, 2025) and later. Key Ruby 4.0 features that work seamlessly:
- **Ruby Box** - Experimental isolation of definitions
- **ZJIT Compiler** - Enhanced JIT compilation for faster execution
- **Ractor Improvements** - Better multi-threaded document processing
- **Set Promoted to Core** - No changes needed for Kreuzberg
All tests pass on Ruby 4.0.1, and the gem compiles without code changes.
## Development
Clone and set up:
```bash
git clone https://github.com/kreuzberg-dev/kreuzberg.git
cd kreuzberg
bundle install
```
Run tests:
```bash
rake test
```
## Troubleshooting
### Native extension compilation error
Ensure build tools are installed:
```bash
# macOS
xcode-select --install
# Ubuntu/Debian
sudo apt-get install build-essential ruby-dev
# Windows (via RubyInstaller)
ridk install
```
### "Could not find Kreuzberg"
Reinstall the gem:
```bash
gem uninstall kreuzberg
gem install kreuzberg --no-document
```
### OCR not working
Verify Tesseract is installed:
```bash
tesseract --version
```
## Examples
### Process Directory of PDFs
```ruby
require 'kreuzberg'
Dir.glob("documents/*.pdf").each do |file|
  puts "Processing: #{file}"
  result = Kreuzberg.extract_file(file)
  puts "  Content length: #{result.content.length}"
  puts "  Language: #{result.detected_languages}"
end
```
### Extract and Parse Structured Data
```ruby
require 'kreuzberg'
require 'json'
result = Kreuzberg.extract_file("data.pdf")
# Parse content as JSON (if applicable)
begin
  data = JSON.parse(result.content)
  puts "Parsed data: #{data}"
rescue JSON::ParserError
  puts "Content is not JSON"
end
```
### Save Extracted Images
```ruby
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
  images: Kreuzberg::ImageExtractionConfig.new(
    extract_images: true
  )
)
result = Kreuzberg.extract_file("document.pdf", config: config)
result.images&.each_with_index do |image, index|
  # binwrite avoids newline and encoding translation of the raw image bytes
  File.binwrite("image_#{index}.png", image.data)
end
```
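The example above hard-codes a `.png` extension, although embedded images may be stored in other formats. One option is to pick the extension from the leading magic bytes, as sketched below. `image_extension` is a hypothetical helper, not part of Kreuzberg, and it assumes that `image.data` is a raw binary String.

```ruby
# Hypothetical helper (not part of Kreuzberg): pick a file extension from the
# leading magic bytes, assuming the image data is a raw binary String.
def image_extension(data)
  bytes = data.b # binary copy, so comparisons ignore the source encoding
  if bytes.start_with?("\x89PNG\r\n\x1A\n".b)
    "png"
  elsif bytes.start_with?("\xFF\xD8\xFF".b)
    "jpg"
  elsif bytes.start_with?("GIF87a".b, "GIF89a".b)
    "gif"
  else
    "bin" # unknown format: keep the bytes, but do not guess an extension
  end
end

# Made-up headers standing in for real image data.
png_header = "\x89PNG\r\n\x1A\n".b + "rest-of-file".b
jpeg_header = "\xFF\xD8\xFF\xE0".b

puts image_extension(png_header)
puts image_extension(jpeg_header)
puts image_extension("plain text")
```

Under that assumption, the file name in the loop could become `"image_#{index}.#{image_extension(image.data)}"`.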
## Documentation
For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev)
## Part of Kreuzberg.dev
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
## License
{{ license }} License - see [LICENSE](../../LICENSE) for details.