273
templates/readme/go.md
Normal file
@@ -0,0 +1,273 @@
# Kreuzberg

{% include 'partials/badges.html.jinja' %}

High-performance document intelligence for Go backed by the Rust core that powers every Kreuzberg binding.

> **Version {{ version }}**
> Report issues at [github.com/kreuzberg-dev/kreuzberg](https://github.com/kreuzberg-dev/kreuzberg/issues).

## What This Package Provides

- **Go module over the Rust core** — context-aware extraction with Go structs and errors.
- **Structured results** — text, tables, images, metadata, language detection, chunks, and warnings.
- **Static-link workflow** — build against `kreuzberg-ffi` and ship a self-contained Go binary.
- **Cross-binding parity** — output matches the Python, Node.js, Ruby, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.

## Install

Kreuzberg Go binaries are **statically linked** — once built, they are self-contained and require no runtime library dependencies. Only the static library is needed at build time.

### Quick Start (Monorepo Development)

For development in the Kreuzberg monorepo:

```bash
# Build the static FFI library
cargo build -p kreuzberg-ffi --release

# Go build will automatically link against the static library
# (from target/release/libkreuzberg_ffi.a)
cd packages/go/v5
go build -v

# Run your binary (no library path needed - it's statically linked)
./v4
```

That's it! The resulting binary is self-contained and has no runtime dependencies on Kreuzberg libraries.

### Using Go Modules

To use this package via `go get`:

```bash
# Get the latest release
go get {{ package_name }}@latest

# Or a specific version
go get {{ package_name }}@v{{ version }}
```

You'll need to provide the static library at build time. See [Building with Static Libraries](#building-with-static-libraries) below.

### Building with Static Libraries

When building outside the Kreuzberg monorepo, you need to provide the static library (`.a` file on Unix, `.lib` on Windows).

#### Option 1: Download Pre-built Static Library

Download the static library for your platform from [GitHub Releases](https://github.com/kreuzberg-dev/kreuzberg/releases):

```bash
# Example: Linux x86_64
curl -LO https://github.com/kreuzberg-dev/kreuzberg/releases/download/v{{ version }}/go-ffi-linux-x86_64.tar.gz
tar -xzf go-ffi-linux-x86_64.tar.gz

# Copy to a permanent location
mkdir -p ~/kreuzberg/lib
cp kreuzberg-ffi/lib/libkreuzberg_ffi.a ~/kreuzberg/lib/
```

Then build with `CGO_LDFLAGS`:

```bash
# Linux/macOS
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build

# Windows (MSVC)
set CGO_LDFLAGS=-L%USERPROFILE%\kreuzberg\lib -lkreuzberg_ffi
go build
```
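
Because a wrong `-L` path only surfaces as a linker error, a small pre-flight check can save a build cycle. The following is a minimal sketch using only the Go standard library. The helpers `libDirs` and `findStaticLib` are hypothetical names for illustration and are not part of Kreuzberg, and the parsing assumes space-separated `-L<dir>` or `-L <dir>` flags without quoted paths.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// libDirs returns the directories named by -L flags in a CGO_LDFLAGS-style
// string. It handles "-L/dir" and "-L /dir", but not quoted paths with spaces.
func libDirs(flags string) []string {
	var dirs []string
	fields := strings.Fields(flags)
	for i := 0; i < len(fields); i++ {
		f := fields[i]
		switch {
		case f == "-L" && i+1 < len(fields):
			i++ // the directory is the next field
			dirs = append(dirs, fields[i])
		case strings.HasPrefix(f, "-L") && len(f) > 2:
			dirs = append(dirs, f[2:])
		}
	}
	return dirs
}

// findStaticLib reports the first directory that contains a regular file
// with the given name.
func findStaticLib(dirs []string, name string) (string, bool) {
	for _, d := range dirs {
		if info, err := os.Stat(filepath.Join(d, name)); err == nil && !info.IsDir() {
			return d, true
		}
	}
	return "", false
}

func main() {
	dirs := libDirs(os.Getenv("CGO_LDFLAGS"))
	if dir, ok := findStaticLib(dirs, "libkreuzberg_ffi.a"); ok {
		fmt.Println("found libkreuzberg_ffi.a in", dir)
		return
	}
	fmt.Println("libkreuzberg_ffi.a not found in", dirs)
}
```

Run it with the same `CGO_LDFLAGS` value you intend to pass to `go build`.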

#### Option 2: Build Static Library Yourself

If pre-built libraries aren't available for your platform:

```bash
# Clone the repository
git clone https://github.com/kreuzberg-dev/kreuzberg.git
cd kreuzberg

# Build the static library
cargo build -p kreuzberg-ffi --release

# The static library is now at: target/release/libkreuzberg_ffi.a
# Copy it to a permanent location
mkdir -p ~/kreuzberg/lib
cp target/release/libkreuzberg_ffi.a ~/kreuzberg/lib/

# Now you can build Go projects
cd ~/my-go-project
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build
```

### System Requirements

#### ONNX Runtime (for embeddings)

If you use the embeddings functionality, ONNX Runtime must be installed **at build time**:

```bash
# macOS
brew install onnxruntime

# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev

# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
```

The resulting binary links ONNX Runtime either statically or dynamically, depending on how the FFI library was built. Check the build configuration.

**Note:** Windows MinGW builds do not support embeddings (ONNX Runtime requires MSVC). Use Windows MSVC for embeddings support.

## Quickstart

```go
package main

import (
	"fmt"
	"log"

	"{{ package_name }}"
)

func main() {
	result, err := v4.ExtractFileSync("document.pdf", nil)
	if err != nil {
		log.Fatalf("extract failed: %v", err)
	}

	fmt.Println("MIME:", result.MimeType)
	fmt.Println("First 200 chars:")
	// Clamp the slice so short documents do not cause an out-of-range panic.
	preview := result.Content
	if len(preview) > 200 {
		preview = preview[:200]
	}
	fmt.Println(preview)
}
```
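
Slicing a Go string by index counts bytes, not characters, so a fixed-width preview can cut a multi-byte UTF-8 character in half. As a minimal sketch, assuming the extracted content is a UTF-8 string, the hypothetical helper `preview` below (not part of the Kreuzberg API) truncates on rune boundaries instead:

```go
package main

import "fmt"

// preview returns at most n runes of s, never cutting a multi-byte
// UTF-8 character in half.
func preview(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	// Ranging over a string yields the byte offset of each rune start.
	for i := range s {
		if count == n {
			return s[:i]
		}
		count++
	}
	return s // s has n runes or fewer
}

func main() {
	fmt.Println(preview("Grüße aus Kreuzberg", 5))
	fmt.Println(preview("short", 200))
}
```

The loop stops as soon as the limit is reached, so long documents are not scanned in full.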

Build and run:

```bash
# Build (make sure you have the static library available - see Install)
CGO_LDFLAGS="-L$HOME/kreuzberg/lib -lkreuzberg_ffi" go build

# Run - no library paths needed!
./myapp
```

The binary is self-contained and can be distributed without any Kreuzberg library dependencies.

## Examples

### Extract bytes

```go
data, err := os.ReadFile("slides.pptx")
if err != nil {
	log.Fatal(err)
}
result, err := v4.ExtractBytesSync(data, "application/vnd.openxmlformats-officedocument.presentationml.presentation", nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println(result.Metadata.FormatType())
```

### Use advanced configuration

```go
lang := "eng"
cfg := &v4.ExtractionConfig{
	UseCache:        true,
	ForceOCR:        false,
	ImageExtraction: &v4.ImageExtractionConfig{Enabled: true},
	OCR: &v4.OcrConfig{
		Backend:  "tesseract",
		Language: &lang,
	},
}
result, err := v4.ExtractFileSync("scanned.pdf", cfg)
```

### Async (context-aware) extraction

```go
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

result, err := v4.ExtractFile(ctx, "large.pdf", nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println("Content length:", len(result.Content))
```

### Batch extract

```go
paths := []string{"doc1.pdf", "doc2.docx", "report.xlsx"}
results, err := v4.BatchExtractFilesSync(paths, nil)
if err != nil {
	log.Fatal(err)
}
for i, res := range results {
	if res == nil {
		continue
	}
	fmt.Printf("[%d] %s => %d bytes\n", i, res.MimeType, len(res.Content))
}
```

### Register a validator

```go
//export customValidator
func customValidator(resultJSON *C.char) *C.char {
	// Validate JSON payload and return an error string (or NULL if ok)
	return nil
}

func init() {
	if err := v4.RegisterValidator("go-validator", 50, (C.ValidatorCallback)(C.customValidator)); err != nil {
		log.Fatalf("validator registration failed: %v", err)
	}
}
```

## API Reference

- **GoDoc**: [pkg.go.dev/{{ package_name }}](<https://pkg.go.dev/{{ package_name }}>)
- **Full documentation**: [kreuzberg.dev](https://kreuzberg.dev) (configuration, formats, OCR backends)

## Troubleshooting

| Issue | Fix |
| --- | --- |
| `ld returned 1 exit status` or `undefined reference to 'html_to_markdown_...'` | The static library wasn't found. Make sure `CGO_LDFLAGS` points to the directory containing `libkreuzberg_ffi.a`: `CGO_LDFLAGS="-L/path/to/lib -lkreuzberg_ffi" go build` |
| `cannot find -lkreuzberg_ffi` | The static library file is missing or in the wrong location. Download it from [GitHub Releases](https://github.com/kreuzberg-dev/kreuzberg/releases) or build it yourself: `cargo build -p kreuzberg-ffi --release` |
| `undefined: v4.ExtractFile` | This function was removed in v4.1.0. Use `ExtractFileSync` and wrap it in a goroutine if needed (see migration guide). |
| `Missing dependency: tesseract` | Install the OCR backend and ensure it is on `PATH`. Errors bubble up as `*v4.MissingDependencyError`. |
| `undefined: C.customValidator` during build | Export the callback with `//export` in a `*_cgo.go` file before using it in `Register*` helpers. |
| `Missing dependency: onnxruntime` | Install ONNX Runtime at build time: `brew install onnxruntime` (macOS), `apt install libonnxruntime libonnxruntime-dev` (Linux), `scoop install onnxruntime` (Windows). Required for embeddings functionality. |
| Embeddings not available on Windows MinGW | Windows MinGW builds cannot link ONNX Runtime (MSVC-only). Use a Windows MSVC build for embeddings support, or build without the embeddings feature. |

## Testing / Tooling

- `task go:lint` – runs `gofmt` and `golangci-lint` (`golangci-lint` pinned to v2.11.3).
- `task go:test` – executes `go test ./...` (after building the static FFI library).
- `task e2e:go:verify` – regenerates fixtures via the e2e generator and runs `go test ./...` inside `e2e/go`.

Need help? Join the [Discord](https://discord.gg/xt9WY3GnKR) or open an issue with logs, platform info, and the steps you tried.

## Part of Kreuzberg.dev

- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
156
templates/readme/language_package.md
Normal file
@@ -0,0 +1,156 @@
# {{ name }}

{% include 'partials/badges.html.jinja' %}

{{ description }}

## What This Package Provides

- **Document intelligence core** — extract text, tables, images, metadata, entities, keywords, and code intelligence from one API.
- **Format coverage** — PDF, Office, images, HTML/XML, email, archives, notebooks, citations, scientific formats, and plain text.
- **OCR choices** — Tesseract, PaddleOCR, EasyOCR where supported, VLM OCR through liter-llm, and plugin hooks for custom backends.
- **Same engine as every binding** — Rust, Python, Node.js, Go, Java, PHP, Ruby, .NET, Elixir, R, WASM, Kotlin Android, Swift, Dart, Zig, and C FFI share the same Rust implementation.
{% if language == "typescript" %}
- **Node-first TypeScript API** — NAPI-RS package with typed options/results and async extraction.
{% elif language == "python" %}
- **Python package** — sync and async APIs with typed results for ingestion, RAG, and data workflows.
{% elif language == "go" %}
- **Go module** — context-aware API over the shared native library.
{% elif language == "java" %}
- **Java package** — FFM binding for direct native document extraction.
{% elif language == "php" %}
- **PHP package** — PHP 8.2+ API with generated types.
{% elif language == "ruby" %}
- **Ruby package** — native extension with idiomatic Ruby objects.
{% elif language == "csharp" %}
- **.NET package** — async/await API with nullable-aware result types.
{% elif language == "elixir" %}
- **BEAM package** — Rustler NIF binding for OTP pipelines.
{% elif language == "wasm" %}
- **WASM package** — browser and edge-compatible extraction where native libraries are unavailable.
{% elif language == "r" %}
- **R package** — data workflow binding with data-frame-friendly extracted structures.
{% elif language == "ffi" %}
- **C ABI** — stable shared library surface for custom hosts and secondary bindings.
{% elif language == "kotlin_android" %}
- **Android AAR** — JNI-backed package for mobile extraction workloads.
{% elif language == "swift" %}
- **SwiftPM package** — Swift Concurrency API for Apple targets.
{% elif language == "dart" %}
- **Dart package** — Future/Stream API through flutter_rust_bridge.
{% elif language == "zig" %}
- **Zig package** — wrapper over the C FFI with explicit memory ownership.
{% endif %}

## Installation

{% include 'partials/installation.md.jinja' %}

## Quick Start

{% include 'partials/quick_start.md.jinja' %}

{% if language == "typescript" %}
{% include 'partials/napi_implementation.md.jinja' %}

{% endif %}

## Features

{% include 'partials/features.md.jinja' %}

{% if features.ocr %}

## OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:

{% for backend in ocr_backends %}

- **{{ backend | title }}**
{% endfor %}

### OCR Configuration Example

{{ snippets.ocr_configuration | include_snippet(language) }}

{% endif %}
{% if features.async %}

## Async Support

This binding provides full async/await support for non-blocking document processing:

{{ snippets.async_extraction | include_snippet(language) }}

{% endif %}
{% if features.plugin_system %}

## Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit [Plugin System Guide](https://docs.kreuzberg.dev/guides/plugins/).

{% if snippets.plugin_system %}

### Plugin Example

{{ snippets.plugin_system | include_snippet(language) }}

{% endif %}
{% endif %}
{% if features.embeddings %}

## Embeddings Support

Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.

**[Embeddings Guide](https://docs.kreuzberg.dev/features/#embeddings)**
{% endif %}

{% if snippets.batch_processing %}

## Batch Processing

Process multiple documents efficiently:

{{ snippets.batch_processing | include_snippet(language) }}

{% endif %}

## Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

**[Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/)**

## Documentation

- **[Official Documentation](https://docs.kreuzberg.dev/)**
- **[API Reference](https://docs.kreuzberg.dev/reference/api-python/)**
- **[Examples & Guides](https://docs.kreuzberg.dev/)**

## Contributing

Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).

## Part of Kreuzberg.dev

- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.

## License

{{ license }} License — see [LICENSE](../../LICENSE) for details.

## Support

- **Discord Community**: [Join our Discord](https://discord.gg/xt9WY3GnKR)
- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)
86
templates/readme/partials/badges.html.jinja
Normal file
@@ -0,0 +1,86 @@
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
  <a href="https://github.com/kreuzberg-dev/alef">
    <img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
  </a>
  <!-- Language Bindings -->
  <a href="https://crates.io/crates/kreuzberg">
    <img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
  </a>
  <a href="https://pypi.org/project/kreuzberg/">
    <img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
  </a>
  <a href="https://www.npmjs.com/package/@kreuzberg/node">
    <img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
  </a>
  <a href="https://www.npmjs.com/package/@kreuzberg/wasm">
    <img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
  </a>
  <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
    <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
  </a>
  <a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5">
    <img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v5*" alt="Go">
  </a>
  <a href="https://www.nuget.org/packages/Kreuzberg/">
    <img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
  </a>
  <a href="https://packagist.org/packages/kreuzberg/kreuzberg">
    <img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
  </a>
  <a href="https://rubygems.org/gems/kreuzberg">
    <img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
  </a>
  <a href="https://hex.pm/packages/kreuzberg">
    <img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
  </a>
  <a href="https://kreuzberg-dev.r-universe.dev/kreuzberg">
    <img src="https://img.shields.io/badge/R-kreuzberg-007ec6" alt="R">
  </a>
  <a href="https://pub.dev/packages/kreuzberg">
    <img src="https://img.shields.io/pub/v/kreuzberg?label=Dart&color=007ec6" alt="Dart">
  </a>
  <a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg-android">
    <img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
  </a>
  <a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/swift">
    <img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
  </a>
  <a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/zig">
    <img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
  </a>
  <a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
    <img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
  </a>
  <a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/kreuzberg">
    <img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
  </a>
  <a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/charts%2Fkreuzberg">
    <img src="https://img.shields.io/badge/Helm-ghcr.io-007ec6?logo=helm&logoColor=white" alt="Helm">
  </a>

  <!-- Project Info -->
  <a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
    <img src="https://img.shields.io/badge/License-Elastic--2.0-007ec6" alt="License">
  </a>
  <a href="https://docs.kreuzberg.dev">
    <img src="https://img.shields.io/badge/Docs-kreuzberg-007ec6" alt="Documentation">
  </a>
  <a href="https://huggingface.co/Kreuzberg">
    <img src="https://img.shields.io/badge/Hugging%20Face-Kreuzberg-007ec6" alt="Hugging Face">
  </a>
</div>

<div align="center" style="margin: 24px 0 0;">
  <a href="https://kreuzberg.dev">
    <img alt="Kreuzberg" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
  </a>
</div>

<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
  <a href="https://discord.gg/xt9WY3GnKR">
    <img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
  </a>
  <a href="https://docs.kreuzberg.dev/demo.html">
    <img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
  </a>
</div>
95
templates/readme/partials/features.md.jinja
Normal file
@@ -0,0 +1,95 @@
### Supported File Formats (90+)

90+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

#### Office Documents

| Category | Formats | Capabilities |
|----------|---------|--------------|
| **Word Processing** | `.docx`, `.docm`, `.dotx`, `.dotm`, `.dot`, `.odt` | Full text, tables, images, metadata, styles |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods` | Sheet data, formulas, cell metadata, charts |
| **Presentations** | `.pptx`, `.pptm`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.ppt` | Slides, speaker notes, images, metadata |
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
| **Database** | `.dbf` | Table data extraction, field type support |
| **Hangul** | `.hwp`, `.hwpx` | Korean document format, text extraction |

#### Images (OCR-Enabled)

| Category | Formats | Features |
|----------|---------|----------|
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |

#### Web & Data

| Category | Formats | Features |
|----------|---------|----------|
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, Djot, reStructuredText, Org Mode |

#### Email & Archives

| Category | Formats | Features |
|----------|---------|----------|
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |

#### Academic & Scientific

| Category | Formats | Features |
|----------|---------|----------|
| **Citations** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl` | Structured parsing: RIS (structured), PubMed/MEDLINE, EndNote XML (structured), BibTeX, CSL JSON |
| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |

#### Code Intelligence (300+ Languages)

| Feature | Description |
|---------|-------------|
| **Structure Extraction** | Functions, classes, methods, structs, interfaces, enums |
| **Import/Export Analysis** | Module dependencies, re-exports, wildcard imports |
| **Symbol Extraction** | Variables, constants, type aliases, properties |
| **Docstring Parsing** | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| **Diagnostics** | Parse errors with line/column positions |
| **Syntax-Aware Chunking** | Split code by semantic boundaries, not arbitrary byte offsets |

Powered by [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — [documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev).

**[Complete Format Reference](https://docs.kreuzberg.dev/reference/formats/)**

### Key Capabilities

- **Text Extraction** - Extract all text content with position and formatting information
- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
- **Table Extraction** - Parse tables with structure and cell content preservation
- **Image Extraction** - Extract embedded images and render page previews
- **OCR Support** - Integrate multiple OCR backends for scanned documents
{% if features.async %}
- **Async/Await** - Non-blocking document processing with concurrent operations
{% endif %}
{% if features.plugin_system %}
- **Plugin System** - Extensible post-processing for custom text transformation
{% endif %}
{% if features.embeddings %}
- **Embeddings** - Generate vector embeddings using ONNX Runtime models
{% endif %}
- **Batch Processing** - Efficiently process multiple documents in parallel
- **Memory Efficient** - Stream large files without loading entirely into memory
- **Language Detection** - Detect and support multiple languages in documents
{% if features.code_intelligence %}
- **Code Intelligence** - Extract structure, imports, exports, symbols, and docstrings from [300+ programming languages](https://docs.tree-sitter-language-pack.kreuzberg.dev) via tree-sitter
{% endif %}
- **Configuration** - Fine-grained control over extraction behavior

### Performance Characteristics

| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
272
templates/readme/partials/installation.md.jinja
Normal file
@@ -0,0 +1,272 @@
### Package Installation

{% for pm in package_manager %}
{% if pm == "pip" %}
Install via pip:

```bash
pip install {{ package_name }}
```

For async support and additional features:

```bash
pip install {{ package_name }}[async]
```

{% elif pm == "npm" %}
```bash
npm install {{ package_name }}
```

{% elif pm == "pnpm" %}
```bash
pnpm add {{ package_name }}
```

{% elif pm == "yarn" %}
```bash
yarn add {{ package_name }}
```

{% elif pm == "go get" %}
```bash
go get {{ package_name }}
```

For more details on FFI setup and native library linking, see the [Go API Reference](https://docs.kreuzberg.dev/reference/api-go/).

{% elif pm == "maven" %}
{% set maven_parts = package_name | split(":") %}
Add to your `pom.xml`:

```xml
<dependency>
    <groupId>{{ maven_parts[0] }}</groupId>
    <artifactId>{{ maven_parts[1] }}</artifactId>
    <version>{{ version }}</version>
</dependency>
```

{% elif pm == "gradle" %}
Kotlin DSL (`build.gradle.kts`):

```kotlin
implementation("{{ package_name }}:{{ version }}")
```

Groovy DSL (`build.gradle`):

```groovy
implementation '{{ package_name }}:{{ version }}'
```

{% elif pm == "rubygems" %}
Install via gem:

```bash
gem install {{ package_name }}
```

{% elif pm == "bundler" %}
Or add to your Gemfile:

```ruby
gem '{{ package_name }}'
```

{% elif pm == "composer" %}
Install via Composer:

```bash
composer require {{ package_name }}
```

{% elif pm == "mix" %}
Add to your `mix.exs` dependencies:

```elixir
def deps do
  [
    {:{{ package_name }}, "~> {{ version }}"}
  ]
end
```

Then run:

```bash
mix deps.get
```

{% elif pm == "nuget" %}
Install via NuGet:

```bash
dotnet add package {{ package_name }}
```

Or via NuGet Package Manager:

```
Install-Package {{ package_name }}
```

{% elif pm == "pub" %}
Install via pub:

```bash
dart pub add {{ package_name }}
```

For Flutter projects:

```bash
flutter pub add {{ package_name }}
```

{% elif pm == "spm" %}
Add to your `Package.swift` dependencies:

```swift
.package(url: "https://github.com/kreuzberg-dev/kreuzberg.git", from: "{{ version }}"),
```

Then add the product to the relevant target:

```swift
.target(
    name: "YourTarget",
    dependencies: [
        .product(name: "{{ package_name }}", package: "kreuzberg"),
    ]
),
```

{% elif pm == "hex" %}
Install via Hex:

```elixir
def deps do
  [
    {:{{ package_name }}, "~> {{ version }}"}
  ]
end
```
|
||||
|
||||
{% elif pm == "zig" %}
|
||||
Fetch the package and pin it in `build.zig.zon`:
|
||||
|
||||
```bash
|
||||
zig fetch --save https://github.com/kreuzberg-dev/kreuzberg/archive/refs/tags/v{{ version }}.tar.gz
|
||||
```
|
||||
|
||||
Then wire it into `build.zig`:
|
||||
|
||||
```zig
|
||||
const kreuzberg_dep = b.dependency("{{ package_name }}", .{
|
||||
.target = target,
|
||||
.optimize = optimize,
|
||||
});
|
||||
exe.root_module.addImport("{{ package_name }}", kreuzberg_dep.module("{{ package_name }}"));
|
||||
```
|
||||
|
||||
{% elif pm == "install.packages" %}
|
||||
Install from the kreuzberg R-universe:
|
||||
|
||||
```r
|
||||
install.packages("{{ package_name }}",
|
||||
repos = c("https://kreuzberg-dev.r-universe.dev", getOption("repos")))
|
||||
```
|
||||
|
||||
{% elif pm == "cargo" %}
|
||||
Build the shared library from the workspace:
|
||||
|
||||
```bash
|
||||
cargo build --release -p {{ package_name }}
|
||||
```
|
||||
|
||||
The built artifacts are emitted under `target/release/` (`lib{{ package_name | replace("-", "_") }}.{so,dylib,a}`) along with the C header at `crates/{{ package_name }}/include/kreuzberg.h`.
|
||||
|
||||
{% endif %}
|
||||
{% endfor %}
|
||||
|
||||
### System Requirements
{% if language == "python" %}
- **Python 3.10+** required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "typescript" %}
- **Node.js 22+** required (NAPI-RS native bindings)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality

### Platform Support

Pre-built binaries available for:
- macOS (arm64, x64)
- Linux (x64)
- Windows (x64)

{% elif language == "go" %}
- **Go 1.19+** required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "java" %}
- **Java 25+** required (Foreign Function & Memory API; build and run with `--enable-preview` and `--enable-native-access=ALL-UNNAMED`)
- Native libraries bundled in the JAR for macOS (arm64, x64), Linux (x64, arm64), and Windows (x64)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "kotlin" %}
- **JDK 25+** required
- Native libraries bundled via the Java facade JAR
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "ruby" %}
- **Ruby 3.2.0 or higher** required (including Ruby 4.x)
- Ruby 4.0+ is fully supported with no code changes required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "php" %}
- **PHP 8.2+** required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "csharp" %}
- **.NET 10.0+** required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "elixir" %}
- **Elixir 1.14+** and **Erlang/OTP 26+** required
- Pre-compiled NIFs bundled via `rustler_precompiled` for macOS (arm64, x64), Linux (x64, arm64), and Windows (x64)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "wasm" %}
- Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
- Optional: [Tesseract WASM](https://github.com/naptha/tesseract.js) for OCR functionality
{% elif language == "r" %}
- **R 4.1+** required (extendr bindings)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "ffi" %}
- A C/C++ toolchain (clang, gcc, or MSVC) and a Rust toolchain (`rustup`) for building from source
- A `pkg-config` or CMake-aware build system that can locate `libkreuzberg_ffi` and `kreuzberg.h`
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "dart" %}
- **Dart SDK 3.0+** for pure-Dart consumers
- Flutter projects supported on macOS, iOS, Android, Linux, and Windows; Flutter Web is not supported
- Native runtime delivered via `flutter_rust_bridge` with bundled binaries for the supported platforms
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "swift" %}
- **Swift 6.0+** (`swift-tools-version: 6.0`) on macOS 13+ or iOS 16+
- Native runtime delivered through the C FFI surface from `kreuzberg-ffi`; published artifacts ship as a binary target
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% elif language == "zig" %}
- **Zig 0.16.0+** required (`minimum_zig_version` declared in `build.zig.zon`)
- Links the C FFI surface from `kreuzberg-ffi`; the build resolves the library via `linkSystemLibrary` against the consumer-provided search path
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
{% else %}
- See [Installation Guide](https://docs.kreuzberg.dev/getting-started/installation/) for requirements
{% endif %}
24
templates/readme/partials/napi_implementation.md.jinja
Normal file
@@ -0,0 +1,24 @@
## NAPI-RS Implementation Details

### Native Performance

This binding uses NAPI-RS to provide native Node.js bindings with:

- **Zero-copy data transfer** between JavaScript and Rust layers
- **Native thread pool** for concurrent document processing
- **Direct memory management** for efficient large document handling
- **Binary-compatible** pre-built native modules across platforms

### Threading Model

- Single documents are processed synchronously or asynchronously in a dedicated thread
- Batch operations distribute work across available CPU cores
- Thread count is configurable but defaults to system CPU count
- Long-running extractions block the event loop unless using async APIs

### Memory Management

- Large documents (> 100 MB) are streamed to avoid loading them entirely into memory
- Temporary files are created in the system temp directory for extraction
- Memory is automatically released after extraction completes
- ONNX models are cached in memory for repeated embeddings operations
77
templates/readme/partials/quick_start.md.jinja
Normal file
@@ -0,0 +1,77 @@
### Basic Extraction

Extract text, metadata, and structure from any supported document format:

{{ snippets.basic_extraction | include_snippet(language) }}

### Common Use Cases

#### Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

{% if snippets.ocr_configuration %}
**With OCR (for scanned documents):**

{{ snippets.ocr_configuration | include_snippet(language) }}

{% endif %}

#### Table Extraction

{% if snippets.table_extraction %}
{{ snippets.table_extraction | include_snippet(language) }}

{% else %}
See [Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/) for table extraction options.

{% endif %}

#### Processing Multiple Files

{% if snippets.batch_processing %}
{{ snippets.batch_processing | include_snippet(language) }}

{% endif %}

{% if snippets.async_extraction %}
#### Async Processing

For non-blocking document processing:

{{ snippets.async_extraction | include_snippet(language) }}

{% endif %}
{% if snippets.config_discovery %}

#### Configuration Discovery

{{ snippets.config_discovery | include_snippet(language) }}

{% endif %}
{% if snippets.worker_pool %}

#### Worker Thread Pool

{{ snippets.worker_pool | include_snippet(language) }}

**Performance Benefits:**
- **Parallel Processing**: Multiple documents extracted simultaneously
- **CPU Utilization**: Maximizes multi-core CPU usage for large batches
- **Queue Management**: Automatically distributes work across available workers
- **Resource Control**: Prevents thread exhaustion with configurable pool size

**Best Practices:**
- Use worker pools for batches of 10+ documents
- Set pool size to number of CPU cores (default behavior)
- Always close pools with `closeWorkerPool()` to prevent resource leaks
- Reuse pools across multiple batch operations for efficiency

{% endif %}

### Next Steps

- **[Installation Guide](https://docs.kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
- **[API Documentation](https://docs.kreuzberg.dev/reference/api-python/)** - Complete API reference
- **[Examples & Guides](https://docs.kreuzberg.dev/)** - Full code examples and usage guides
- **[Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/)** - Advanced configuration options
453
templates/readme/python.md
Normal file
@@ -0,0 +1,453 @@
# Kreuzberg

{% include 'partials/badges.html.jinja' %}

{{ description }}

## What This Package Provides

- **Python-native extraction** — sync and async APIs for files, bytes, URLs, and batch ingestion.
- **Structured results** — text, tables, images, metadata, language detection, chunks, and warnings in typed Python objects.
- **OCR choices** — Tesseract, EasyOCR, PaddleOCR, and VLM OCR where configured.
- **Same Rust engine as every binding** — behavior matches the Node.js, Ruby, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.

## Installation

```bash
pip install kreuzberg
```

### With OCR Support

```bash
pip install "kreuzberg[easyocr]"
pip install "kreuzberg[paddleocr]"
```

### All Features

```bash
pip install "kreuzberg[all]"
```

## Quick Start

### Basic Usage

{{ 'getting-started/basic_usage.md' | include_snippet('python') }}

### Simple Extraction

{{ 'getting-started/extract_file.md' | include_snippet('python') }}

### Reading Content

{{ 'getting-started/read_content.md' | include_snippet('python') }}

## OCR Support

### Using OCR

{{ 'getting-started/extract_with_ocr.md' | include_snippet('python') }}

### EasyOCR (GPU-Accelerated)

```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="easyocr", language="en")
)

result = extract_file_sync(
    "photo.jpg",
    config=config,
    easyocr_kwargs={"use_gpu": True}
)
```

### PaddleOCR (Complex Layouts)

```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="ch")
)

result = extract_file_sync(
    "invoice.pdf",
    config=config,
)
```

## Table Extraction

```python
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        tesseract_config=TesseractConfig(
            enable_table_detection=True
        )
    )
)

result = extract_file_sync("invoice.pdf", config=config)

for table in result.tables:
    print(table.markdown)
    print(table.cells)
```
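
Extracted cells can be handed to other tools. The following is a minimal sketch, assuming `table.cells` is a list of rows and each row is a list of strings; this README does not spell out the layout, so treat that shape as an assumption. `cells_to_csv` is a hypothetical helper built on the standard library, not part of Kreuzberg.

```python
import csv
import io


def cells_to_csv(cells: list[list[str]]) -> str:
    """Serialize rows of cell strings to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output stable across platforms.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(cells)
    return buffer.getvalue()


# Sample data shaped like the assumed `table.cells` layout.
sample_cells = [["Item", "Qty"], ["Widget, large", "2"]]
print(cells_to_csv(sample_cells))
```

With a real result, the call would look like `cells_to_csv(table.cells)` inside the loop above, provided the assumption about the cell layout holds.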

## Configuration

### Complete Configuration Example

```python
from kreuzberg import (
    extract_file_sync,
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    ChunkingConfig,
    ImageExtractionConfig,
    PdfConfig,
    TokenReductionConfig,
    LanguageDetectionConfig,
)

config = ExtractionConfig(
    use_cache=True,
    enable_quality_processing=True,
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(
            psm=6,
            enable_table_detection=True,
            min_confidence=50.0,
        ),
    ),
    force_ocr=False,
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200,
    ),
    images=ImageExtractionConfig(
        extract_images=True,
        target_dpi=300,
        max_image_dimension=4096,
        auto_adjust_dpi=True,
    ),
    pdf_options=PdfConfig(
        extract_images=True,
        passwords=["password1", "password2"],
        extract_metadata=True,
    ),
    token_reduction=TokenReductionConfig(
        mode="moderate",
        preserve_important_words=True,
    ),
    language_detection=LanguageDetectionConfig(
        enabled=True,
        min_confidence=0.8,
        detect_multiple=False,
    ),
)

result = extract_file_sync("document.pdf", config=config)
```

### HTML Conversion Options & Batch Concurrency

```python
from kreuzberg import ExtractionConfig

config = ExtractionConfig(
    max_concurrent_extractions=8,
    html_options={
        "extract_metadata": True,
        "wrap": True,
        "wrap_width": 100,
        "strip_tags": ["script", "style"],
        "preprocessing": {"enabled": True, "preset": "standard"},
    },
)
```

## Metadata Extraction

```python
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")

if result.images:
    print(f"Extracted {len(result.images)} inline images")

if result.chunks:
    print(f"First chunk tokens: {result.chunks[0]['metadata']['token_count']}")

print(result.metadata.get("pdf", {}))
print(result.metadata.get("language"))
print(result.metadata.get("format"))

if "pdf" in result.metadata:
    pdf_meta = result.metadata["pdf"]
    print(f"Title: {pdf_meta.get('title')}")
    print(f"Author: {pdf_meta.get('author')}")
    print(f"Pages: {pdf_meta.get('page_count')}")
    print(f"Created: {pdf_meta.get('creation_date')}")
```

## Password-Protected PDFs

```python
from kreuzberg import extract_file_sync, ExtractionConfig, PdfConfig

config = ExtractionConfig(
    pdf_options=PdfConfig(
        passwords=["password1", "password2", "password3"]
    )
)

result = extract_file_sync("protected.pdf", config=config)
```

## Language Detection

```python
from kreuzberg import extract_file_sync, ExtractionConfig, LanguageDetectionConfig

config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(enabled=True)
)

result = extract_file_sync("multilingual.pdf", config=config)
print(result.detected_languages)
```

## Text Chunking

```python
from kreuzberg import extract_file_sync, ExtractionConfig, ChunkingConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(
        max_chars=1000,
        max_overlap=200,
    )
)

result = extract_file_sync("long_document.pdf", config=config)

for chunk in result.chunks:
    print(chunk)
```
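
To build intuition for how `max_chars` and `max_overlap` interact, here is a simplified sliding-window sketch. It is an illustration only and is not Kreuzberg's chunking algorithm, which may split on different boundaries; `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, max_chars: int = 1000, max_overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars, each sharing
    max_overlap characters with the previous window (illustrative only)."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= max_overlap < max_chars:
        raise ValueError("max_overlap must be in [0, max_chars)")

    step = max_chars - max_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", max_chars=4, max_overlap=2))
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks for the same input.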

## Extract from Bytes

```python
from kreuzberg import extract_bytes_sync

with open("document.pdf", "rb") as f:
    data = f.read()

result = extract_bytes_sync(data, "application/pdf")
print(result.content)
```
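
`extract_bytes_sync` needs a MIME type alongside the data. When only a file name is known, the standard library can often supply one. This is a small sketch using `mimetypes.guess_type`; `guess_mime_type` is a hypothetical helper, and whether Kreuzberg accepts every MIME string the standard library returns is an assumption to verify.

```python
import mimetypes
from typing import Optional


def guess_mime_type(filename: str) -> Optional[str]:
    """Return a MIME type inferred from the file extension, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


print(guess_mime_type("document.pdf"))
```

When the helper returns `None`, pass an explicit MIME type rather than guessing.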

## API Reference

### Extraction Functions

- `extract_file(file_path, mime_type=None, config=None, **kwargs)` – Async extraction
- `extract_file_sync(file_path, mime_type=None, config=None, **kwargs)` – Sync extraction
- `extract_bytes(data, mime_type, config=None, **kwargs)` – Async extraction from bytes
- `extract_bytes_sync(data, mime_type, config=None, **kwargs)` – Sync extraction from bytes
- `batch_extract_files(paths, config=None, **kwargs)` – Async batch extraction
- `batch_extract_files_sync(paths, config=None, **kwargs)` – Sync batch extraction
- `batch_extract_bytes(data_list, mime_types, config=None, **kwargs)` – Async batch from bytes
- `batch_extract_bytes_sync(data_list, mime_types, config=None, **kwargs)` – Sync batch from bytes

### Configuration Classes

- `ExtractionConfig` – Main configuration
- `OcrConfig` – OCR settings
- `TesseractConfig` – Tesseract-specific options
- `ChunkingConfig` – Text chunking settings
- `ImageExtractionConfig` – Image extraction settings
- `PdfConfig` – PDF-specific options
- `TokenReductionConfig` – Token reduction settings
- `LanguageDetectionConfig` – Language detection settings

### Result Types

- `ExtractionResult` – Main result object with `content`, `metadata`, `tables`, `detected_languages`, `chunks`
- `ExtractedTable` – Table with `cells`, `markdown`, `page_number`
- `Metadata` – Typed metadata dictionary

### Exceptions

- `KreuzbergError` – Base exception
- `ValidationError` – Invalid configuration or input
- `ParsingError` – Document parsing failure
- `OCRError` – OCR processing failure
- `MissingDependencyError` – Missing optional dependency

## Examples

### Custom Processing

```python
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")

text = result.content
text = text.lower()
text = text.replace("old", "new")

print(text)
```

### Multiple Files with Progress

```python
from kreuzberg import extract_file_sync
from pathlib import Path

files = list(Path("documents").glob("*.pdf"))
results = []

for i, file in enumerate(files, 1):
    print(f"Processing {i}/{len(files)}: {file.name}")
    result = extract_file_sync(str(file))
    results.append((file.name, result))

for name, result in results:
    print(f"{name}: {len(result.content)} characters")
```
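
The loop above stops at the first file that fails. A possible variation, sketched here with a hypothetical `extract_many` helper, records failures per file and keeps going. The extraction function is passed in as a parameter so the sketch runs without Kreuzberg installed; in real use you would pass `extract_file_sync`, and you may prefer to catch `KreuzbergError` instead of the broad `Exception` used here.

```python
from pathlib import Path
from typing import Any, Callable, Iterable


def extract_many(paths: Iterable, extract: Callable[[str], Any]):
    """Run `extract` on every path; return (results, failures) keyed by file name."""
    results: dict = {}
    failures: dict = {}
    for path in paths:
        name = Path(path).name
        try:
            results[name] = extract(str(path))
        except Exception as exc:  # broad on purpose: one bad file should not stop the batch
            failures[name] = exc
    return results, failures


def fake_extract(path: str) -> str:
    """Stand-in for a real extractor, used only to demonstrate the helper."""
    if path.endswith("bad.pdf"):
        raise ValueError("unreadable")
    return path.upper()


results, failures = extract_many(["docs/a.pdf", "docs/bad.pdf"], fake_extract)
print(sorted(results), sorted(failures))
```

Keeping failures in a separate mapping makes it easy to retry or report them after the batch finishes.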

### Filter by Language

```python
from kreuzberg import extract_file_sync, ExtractionConfig, LanguageDetectionConfig

config = ExtractionConfig(
    language_detection=LanguageDetectionConfig(enabled=True)
)

result = extract_file_sync("document.pdf", config=config)

if result.detected_languages and "en" in result.detected_languages:
    print("English document detected")
    print(result.content)
```

## System Requirements

### ONNX Runtime (for embeddings)

If using embeddings functionality, ONNX Runtime version 1.22.x must be installed:

```bash
# macOS
brew install onnxruntime

# Ubuntu/Debian (download from GitHub - Debian packages may have older versions)
# Download from https://github.com/microsoft/onnxruntime/releases

# Windows
# Download from https://github.com/microsoft/onnxruntime/releases
```

**Important:** Kreuzberg requires ONNX Runtime version 1.22.x for embeddings.

Without ONNX Runtime, embeddings will raise `MissingDependencyError` with installation instructions.
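
If you script your environment setup, a version string can be checked against the 1.22.x requirement before enabling embeddings. This is a minimal sketch; `is_supported_onnxruntime` is a hypothetical helper, and how you obtain the installed version string (for example from your package manager) is left to you.

```python
def is_supported_onnxruntime(version: str, required: tuple = (1, 22)) -> bool:
    """Return True when `version` has the required major.minor (any patch level)."""
    parts = version.strip().split(".")
    try:
        major, minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return False  # not a recognizable "major.minor[.patch]" string
    return (major, minor) == required


print(is_supported_onnxruntime("1.22.0"))
```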

### Tesseract OCR (Required for OCR)

```bash
# macOS (Homebrew)
brew install tesseract
```

```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
```

### Pandoc (Optional, for some formats)

```bash
# macOS (Homebrew)
brew install pandoc
```

```bash
# Ubuntu/Debian
sudo apt-get install pandoc
```

## Troubleshooting

### Import Error: No module named '\_kreuzberg'

This usually means the Rust extension wasn't built correctly. Try:

```bash
pip install --force-reinstall --no-cache-dir kreuzberg
```

### OCR Not Working

Make sure Tesseract is installed:

```bash
tesseract --version
```
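
The same check can be done from Python before running an OCR job. This sketch uses the standard library's `shutil.which`; `find_tesseract` is a hypothetical helper. Finding the binary on `PATH` is only a hint that Tesseract is installed, and it is an assumption that Kreuzberg locates Tesseract the same way.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path of the `tesseract` executable on PATH, or None if absent."""
    return shutil.which("tesseract")


location = find_tesseract()
if location is None:
    print("tesseract not found on PATH; install it before using the Tesseract backend")
else:
    print(f"tesseract found at {location}")
```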

### Memory Issues with Large PDFs

Use streaming or enable chunking:

```python
from kreuzberg import ChunkingConfig, ExtractionConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(max_chars=1000)
)
```

## PDFium Integration

PDF extraction is powered by PDFium, which is automatically bundled with this package. No system installation required.

### Platform Support

| Platform       | Status | Notes   |
| -------------- | ------ | ------- |
| Linux x86_64   | ✅     | Bundled |
| macOS ARM64    | ✅     | Bundled |
| macOS x86_64   | ✅     | Bundled |
| Windows x86_64 | ✅     | Bundled |

### Binary Size Impact

PDFium adds approximately 8-15 MB to the package size depending on platform. This ensures consistent PDF extraction across all environments without external dependencies.

## Documentation

For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev).

## Part of Kreuzberg.dev

- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.

## License

{{ license }} License - see [LICENSE](../../LICENSE) for details.
448
templates/readme/root.md
Normal file
448
templates/readme/root.md
Normal file
@@ -0,0 +1,448 @@
|
||||
# Kreuzberg
|
||||
|
||||
{% include 'partials/badges.html.jinja' %}
|
||||
|
||||
Extract text, metadata, and code intelligence from 90+ file formats and 300+ programming languages at native speeds without needing a GPU.
|
||||
|
||||
## Key Features
|
||||
|
||||
- **Code intelligence** – Extract functions, classes, imports, symbols, and docstrings from [300+ programming languages](https://docs.tree-sitter-language-pack.kreuzberg.dev) via tree-sitter. Results in `ExtractionResult.code_intelligence` with semantic chunking
|
||||
- **Extensible architecture** – Plugin system for custom OCR backends, validators, post-processors, document extractors, and renderers
|
||||
- **Polyglot** – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, and C
|
||||
- **90+ file formats** – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
|
||||
- **LLM intelligence** – VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction with schema constraints, and provider-hosted embeddings via 143 LLM providers (including local engines: Ollama, LM Studio, vLLM, llama.cpp) through [liter-llm](https://github.com/kreuzberg-dev/liter-llm)
|
||||
- **OCR support** – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), VLM OCR (143 vision model providers including local engines), extensible via plugin API
|
||||
- **High performance** – Rust core with pure-Rust PDF, SIMD optimizations and full parallelism
|
||||
- **Flexible deployment** – Use as library, CLI tool, REST API server, or MCP server
|
||||
- **TOON wire format** – Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
|
||||
- **GFM-quality output** – Comrak-based rendering with proper fenced code blocks, table nodes, bracket escaping, and cross-format parity (Markdown, HTML, Djot, Plain)
|
||||
- **HTML passthrough** – HTML-to-Markdown conversion uses html-to-markdown output directly, bypassing lossy intermediate round-trips
|
||||
- **Memory efficient** – Streaming parsers for multi-GB files
|
||||
|
||||
**[Complete Documentation](https://kreuzberg.dev/)** | **[Live Demo](https://docs.kreuzberg.dev/demo.html)** | **[Installation Guides](#installation)**
|
||||
|
||||
## Installation
|
||||
|
||||
Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:
|
||||
|
||||
**Scripting Languages:**
|
||||
|
||||
- **[Python](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/python)** – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
|
||||
- **[Ruby](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/ruby)** – RubyGems package, idiomatic Ruby API, native bindings
|
||||
- **[PHP](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/php)** – Composer package, modern PHP 8.2+ support, type-safe API, async extraction
|
||||
- **[Elixir](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/elixir)** – Hex package, OTP integration, concurrent processing
|
||||
- **[R](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/r)** – r-universe package, idiomatic R API, extendr bindings
|
||||
- **[Dart / Flutter](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/dart)** – pub.dev package, flutter_rust_bridge runtime, native bindings for macOS/iOS/Android/Linux/Windows
|
||||
|
||||
**JavaScript/TypeScript:**
|
||||
|
||||
- **[@kreuzberg/node](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg-node)** – Native NAPI-RS bindings for Node.js/Bun, fastest performance
|
||||
- **[@kreuzberg/wasm](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg-wasm)** – WebAssembly for browsers/Deno/Cloudflare Workers, comprehensive format and OCR support (PDF, Excel, archives, all office formats, real Tesseract via the WASI build) — only ORT-dependent features (paddle-ocr, layout detection, embeddings, auto-rotate) and server modes (api/mcp/cli) are excluded
|
||||
|
||||
**Compiled Languages:**
|
||||
|
||||
- **[Go](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5)** – Go module with FFI bindings, context-aware async
|
||||
- **[Java](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/java)** – Maven Central, Foreign Function & Memory API
|
||||
- **[Kotlin](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/kotlin-android)** – Maven Central, Kotlin/JVM with idiomatic data classes, sealed enums, and coroutine-based async
|
||||
- **[C#](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/csharp)** – NuGet package, .NET 6.0+, full async/await support
|
||||
- **[Swift](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/swift)** – Swift Package Manager, macOS 13+/iOS 16+, native Swift types and async/await
|
||||
|
||||
**Native:**
|
||||
|
||||
- **[Rust](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg)** – Core library, flexible feature flags, zero-copy APIs
|
||||
- **[Zig](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/zig)** – `zig fetch` + `build.zig.zon`, idiomatic error sets, optional types, slice-based memory
|
||||
- **[C (FFI)](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg-ffi)** – C header + shared library, pkg-config/CMake support, cross-platform
|
||||
|
||||
**Containers:**
|
||||
|
||||
- **[Docker](https://docs.kreuzberg.dev/guides/docker/)** – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)
|
||||
|
||||
**Command-Line:**
|
||||
|
||||
- **[CLI](https://docs.kreuzberg.dev/cli/usage/)** – Cross-platform binary, batch processing, MCP server mode
|
||||
|
||||
> Precompiled binaries cover x86_64 and aarch64 on Linux and ARM64 (Apple Silicon) on macOS. See Platform Support for per-binding availability.
|
||||
|
||||
## Platform Support
|
||||
|
||||
Complete architecture coverage across all language bindings:
|
||||
|
||||
| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
| -------- | :----------: | :-----------: | :---------: | :---------: |
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | - |
| R | ✅ | ✅ | ✅ | ✅ |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| Kotlin | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Swift | - | - | ✅ | - |
| Dart | ✅ | ✅ | ✅ | ✅ |
| Zig | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
| Docker | ✅ | ✅ | ✅ | - |
|
||||
|
||||
**Note**: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. macOS support is Apple Silicon only.
|
||||
|
||||
### Mobile (iOS, Android)
|
||||
|
||||
| Target | ORT-dependent features\* |
| -------------------------------------------------- | :----------------------: |
| iOS (`aarch64-apple-ios`, `aarch64-apple-ios-sim`) | ✅ |
| Android arm64 (`aarch64-linux-android`) | ✅ |
| Android x86_64 emulator (`x86_64-linux-android`) | ❌ |
|
||||
|
||||
\*ORT-dependent features: PaddleOCR, layout detection, embeddings, auto-rotate.
|
||||
All non-ORT capabilities (Tesseract OCR, every document format, chunking, language detection, keywords, tree-sitter code intelligence, API/MCP, LLM) are available on all four mobile targets.
|
||||
|
||||
The `x86_64-linux-android` emulator triple lacks an ORT prebuilt upstream; kreuzberg's `kreuzberg` crate exposes an `android-target` aggregate feature that selects the same no-ORT feature set as WASM. The `kreuzberg-ffi` and `kreuzberg-dart` crates auto-select that aggregate for the emulator via target-conditional dependencies — host and arm64 phones get full features automatically.
|
||||
|
||||
### Browsers / Edge (WebAssembly)
|
||||
|
||||
WASM excludes the same ORT-dependent feature set as the Android x86_64 emulator. The shared no-ORT base lives behind the `no-ort-target` feature in the core crate; both `wasm-target` and `android-target` compose it.
|
||||
|
||||
### Embeddings Support (Optional)
|
||||
|
||||
To use embeddings functionality:
|
||||
|
||||
1. **Install ONNX Runtime 1.24+**:
|
||||
- Linux: Download from [ONNX Runtime releases](https://github.com/microsoft/onnxruntime/releases) (Debian packages may have older versions)
|
||||
- macOS: `brew install onnxruntime`
|
||||
- Windows: Download from [ONNX Runtime releases](https://github.com/microsoft/onnxruntime/releases)
|
||||
|
||||
2. Use embeddings in your code - see [Embeddings Guide](https://docs.kreuzberg.dev/features/#embeddings)
|
||||
|
||||
**Note:** Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.
|
||||
|
||||
## Supported Formats
|
||||
|
||||
90+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
|
||||
|
||||
### Office Documents
|
||||
|
||||
| Category | Formats | Capabilities |
| ------------------- | ------------------------------------------------------------------------------------------------ | -------------------------------------------------- |
| **Word Processing** | `.docx`, `.docm`, `.dotx`, `.dotm`, `.dot`, `.odt`, `.pages` | Full text, tables, lists, images, metadata, styles |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods`, `.numbers` | Sheet data, formulas, cell metadata, charts |
| **Presentations** | `.pptx`, `.pptm`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.key` | Slides, speaker notes, images, metadata |
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
| **Database** | `.dbf` | Table data extraction, field type support |
| **Hangul** | `.hwp`, `.hwpx` | Korean document format, text extraction |
|
||||
|
||||
### Images (OCR-Enabled)
|
||||
|
||||
| Category | Formats | Features |
| ------------ | -------------------------------------------------------------------------------- | ------------------------------------------------------------ |
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection |
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
|
||||
|
||||
### Web & Data
|
||||
|
||||
| Category | Formats | Features |
| ------------------- | ------------------------------------------------------------------- | ----------------------------------------------------------------- |
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.djot`, `.mdx`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text |
|
||||
|
||||
### Email & Archives
|
||||
|
||||
| Category | Formats | Features |
| ------------ | ------------------------------------ | ------------------------------------------------------- |
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, UTF-16 support |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | Recursive extraction, nested archives, metadata |
|
||||
|
||||
### Academic & Scientific
|
||||
|
||||
| Category | Formats | Features |
| ----------------- | ----------------------------------------------------- | ----------------------------------------------------------- |
| **Citations** | `.bib`, `.ris`, `.nbib`, `.enw`, `.csl` | BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON |
| **Scientific** | `.tex`, `.latex`, `.typ`, `.typst`, `.jats`, `.ipynb` | LaTeX, Typst, JATS journal articles, Jupyter notebooks |
| **Publishing** | `.fb2`, `.docbook`, `.dbk`, `.opml` | FictionBook, DocBook XML, OPML outlines |
| **Documentation** | `.pod`, `.mdoc`, `.troff` | Perl POD, man pages, troff |
|
||||
|
||||
**[Complete Format Reference →](https://docs.kreuzberg.dev/reference/formats/)**
|
||||
|
||||
### Code Intelligence (300+ Languages)
|
||||
|
||||
| Feature | Description |
| -------------------------- | ------------------------------------------------------------- |
| **Structure Extraction** | Functions, classes, methods, structs, interfaces, enums |
| **Import/Export Analysis** | Module dependencies, re-exports, wildcard imports |
| **Symbol Extraction** | Variables, constants, type aliases, properties |
| **Docstring Parsing** | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| **Diagnostics** | Parse errors with line/column positions |
| **Syntax-Aware Chunking** | Split code by semantic boundaries, not arbitrary byte offsets |
|
||||
|
||||
Powered by [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) with dynamic grammar download. See [TSLP documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) for the full language list.
|
||||
|
||||
## Key Features
|
||||
|
||||
<details>
|
||||
<summary><strong>OCR with Table Extraction</strong></summary>
|
||||
|
||||
Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.
|
||||
|
||||
**[OCR Backend Documentation →](https://docs.kreuzberg.dev/guides/ocr/)**
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><strong>Batch Processing</strong></summary>
|
||||
|
||||
Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.
|
||||
|
||||
**[Batch Processing Guide →](https://docs.kreuzberg.dev/features/#batch-processing)**
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><strong>Password-Protected PDFs</strong></summary>
|
||||
|
||||
Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.
|
||||
|
||||
**[PDF Configuration →](https://docs.kreuzberg.dev/guides/configuration/)**
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><strong>Language Detection</strong></summary>
|
||||
|
||||
Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.
|
||||
|
||||
**[Language Detection Guide →](https://docs.kreuzberg.dev/features/#language-detection)**
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary><strong>Metadata Extraction</strong></summary>
|
||||
|
||||
Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.
|
||||
|
||||
**[Metadata Guide →](https://docs.kreuzberg.dev/reference/types/#metadata)**
|
||||
|
||||
</details>
|
||||
|
||||
## AI Coding Assistants
|
||||
|
||||
Kreuzberg ships with an [Agent Skill](https://agentskills.io) that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.
|
||||
|
||||
Install the skill into any project using the [Vercel Skills CLI](https://github.com/vercel-labs/skills):
|
||||
|
||||
```bash
npx skills add kreuzberg-dev/kreuzberg
```
|
||||
|
||||
The skill is located at [`skills/kreuzberg/SKILL.md`](skills/kreuzberg/SKILL.md) and is automatically discovered by supported AI coding tools once installed.
|
||||
|
||||
## Documentation
|
||||
|
||||
- **[Installation Guide](https://docs.kreuzberg.dev/getting-started/installation/)** – Setup and dependencies
|
||||
- **[User Guide](https://docs.kreuzberg.dev/guides/extraction/)** – Comprehensive usage guide
|
||||
- **[API Reference](https://docs.kreuzberg.dev/reference/api-python/)** – Complete API documentation
|
||||
- **[Format Support](https://docs.kreuzberg.dev/reference/formats/)** – Supported file formats
|
||||
- **[OCR Backends](https://docs.kreuzberg.dev/guides/ocr/)** – OCR engine setup
|
||||
- **[CLI Guide](https://docs.kreuzberg.dev/cli/usage/)** – Command-line usage
|
||||
- **[Migration Guides](https://docs.kreuzberg.dev/migration/from-unstructured/)** – Upgrading from other libraries
|
||||
|
||||
## Contributing
|
||||
|
||||
Contributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
|
||||
|
||||
## Part of Kreuzberg.dev
|
||||
|
||||
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
|
||||
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
|
||||
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
|
||||
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
|
||||
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
|
||||
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces all per-language bindings.
|
||||
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
|
||||
|
||||
## License
|
||||
|
||||
Elastic License 2.0 (ELv2) - see [LICENSE](LICENSE) for details. See [https://www.elastic.co/licensing/elastic-license](https://www.elastic.co/licensing/elastic-license) for the full license text.
|
||||
|
||||
## FAQ
|
||||
|
||||
### What is Kreuzberg?
|
||||
|
||||
Kreuzberg is a polyglot document intelligence framework with a Rust core. It extracts text, metadata, and code intelligence from 90+ file formats and 300+ programming languages at native speeds without needing a GPU. It provides native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, and C.
|
||||
|
||||
### How does Kreuzberg differ from other document extraction tools?
|
||||
|
||||
- **Kreuzberg**: Rust core, 90+ formats, 300+ languages, polyglot bindings, code intelligence via tree-sitter, VLM OCR, native speeds, no GPU needed
|
||||
- **Apache Tika**: Java-based, broader format support, but slower, no code intelligence, no VLM OCR
|
||||
- **pdfplumber**: Python-only, PDF focus, slower, no code intelligence
|
||||
- **unstructured**: Python-based, good format coverage, but slower, requires more dependencies
|
||||
|
||||
Kreuzberg's Rust core with SIMD optimizations and parallelism delivers 10-100x faster extraction than Python alternatives.
|
||||
|
||||
### What are Kreuzberg's key features?
|
||||
|
||||
- **Code intelligence** — Extract functions, classes, imports, symbols, docstrings from 300+ languages via tree-sitter
|
||||
- **Extensible architecture** — Plugin system for custom OCR backends, validators, post-processors, document extractors, renderers
|
||||
- **Polyglot bindings** — Native bindings for 14+ languages (Rust, Python, Node.js, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, C)
|
||||
- **90+ file formats** — PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
|
||||
- **LLM intelligence** — VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction, embeddings via 143 LLM providers
|
||||
- **OCR support** — Tesseract (all bindings including WASM for browsers), PaddleOCR, EasyOCR, VLM OCR, extensible via plugin API
|
||||
- **High performance** — Rust core with pure-Rust PDF, SIMD optimizations, full parallelism
|
||||
- **Flexible deployment** — Library, CLI tool, REST API server, or MCP server
|
||||
- **TOON wire format** — Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
|
||||
- **GFM-quality output** — Comrak-based Markdown rendering with proper fenced code blocks, table nodes
|
||||
- **Memory efficient** — Streaming parsers for multi-GB files
|
||||
|
||||
### What file formats does Kreuzberg support?
|
||||
|
||||
8 categories covering 90+ formats:
|
||||
|
||||
- **Documents** — PDF, DOCX, DOC, ODT, RTF, Hangul
|
||||
- **Office** — XLSX, XLS, PPTX, PPT, ODS, iWork
|
||||
- **Images** — PNG, JPEG, TIFF, BMP, GIF, WebP
|
||||
- **Web** — HTML, XML, XHTML, SVG
|
||||
- **Emails** — MSG, EML, PST
|
||||
- **Archives** — ZIP, TAR, GZ, TGZ, 7Z
|
||||
- **Academic** — LaTeX, BibTeX, RIS
|
||||
- **Code** — 300+ programming languages via tree-sitter
|
||||
|
||||
### How do I get started?
|
||||
|
||||
Choose your platform:
|
||||
|
||||
**Python:**
|
||||
|
||||
```bash
pip install kreuzberg
```
|
||||
|
||||
See [Python docs](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/python)
|
||||
|
||||
**Node.js:**
|
||||
|
||||
```bash
npm install @kreuzberg/node
```
|
||||
|
||||
See [Node.js docs](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg-node)
|
||||
|
||||
**Rust:**
|
||||
|
||||
```bash
cargo add kreuzberg
```
|
||||
|
||||
See [Rust docs](https://github.com/kreuzberg-dev/kreuzberg/tree/main/crates/kreuzberg)
|
||||
|
||||
**Docker:**
|
||||
|
||||
```bash
docker pull ghcr.io/kreuzberg-dev/kreuzberg:latest
```
|
||||
|
||||
See [Docker docs](https://docs.kreuzberg.dev/guides/docker/)
|
||||
|
||||
### What LLM/VLM providers are supported?
|
||||
|
||||
143 providers including:
|
||||
|
||||
- **OpenAI** — GPT-4o (vision), text models
|
||||
- **Anthropic** — Claude (vision), Claude 3.5 Sonnet
|
||||
- **Google** — Gemini (vision), Gemini 2.0 Flash
|
||||
- **Local engines** — Ollama, LM Studio, vLLM, llama.cpp
|
||||
- **Cloud providers** — Fireworks, Together, Groq, OpenRouter
|
||||
- **All OpenAI-compatible endpoints**
|
||||
|
||||
### What OCR backends are available?
|
||||
|
||||
- **Tesseract** — All bindings, including Tesseract-WASM for browsers
|
||||
- **PaddleOCR** — All native bindings (Python, Node.js, etc.)
|
||||
- **EasyOCR** — Python binding
|
||||
- **VLM OCR** — 143 vision model providers (GPT-4o, Claude, Gemini, Ollama local)
|
||||
- **Custom OCR** — Extensible via plugin API
|
||||
|
||||
### What is the TOON wire format?
|
||||
|
||||
TOON is Kreuzberg's token-efficient serialization format for LLM/RAG pipelines. It uses ~30-50% fewer tokens than JSON, making it ideal for:
|
||||
|
||||
- Large document processing
|
||||
- RAG system integration
|
||||
- LLM context window optimization
|
||||
- Cost reduction in API calls
|
||||
|
||||
### What is code intelligence extraction?
|
||||
|
||||
Kreuzberg extracts semantic code information via tree-sitter:
|
||||
|
||||
- **Functions** — Names, parameters, return types, docstrings
|
||||
- **Classes** — Names, methods, inheritance, properties
|
||||
- **Imports** — Module names, import paths
|
||||
- **Symbols** — Variables, constants, type definitions
|
||||
- **Docstrings** — Documentation comments
|
||||
|
||||
Results are returned in `ExtractionResult.code_intelligence`, together with semantically chunked code.
|
||||
|
||||
### Does Kreuzberg work in browsers?
|
||||
|
||||
Yes! The WASM package (`@kreuzberg/wasm`) supports browsers, Deno, and Cloudflare Workers with:
|
||||
|
||||
- PDF, Excel, archives, all office formats
|
||||
- Real Tesseract OCR via WASI build
|
||||
- Only ORT-dependent features excluded (PaddleOCR, layout detection, embeddings, auto-rotate)
|
||||
|
||||
### What deployment options are available?
|
||||
|
||||
- **Library** — Use as a dependency in your application
|
||||
- **CLI** — Cross-platform binary for batch processing
|
||||
- **REST API server** — HTTP endpoint for document extraction
|
||||
- **MCP server** — Model Context Protocol server for AI assistants
|
||||
- **Docker** — Official images with API, CLI, and MCP modes
|
||||
|
||||
### What languages have native bindings?
|
||||
|
||||
| Language | Package Manager | Status |
|----------|----------------|--------|
| Rust | Cargo | ✅ Core library |
| Python | PyPI | ✅ Full support |
| Node.js | npm (NAPI-RS) | ✅ Fastest performance |
| WASM | npm | ✅ Browser/Deno/CF Workers |
| Ruby | RubyGems | ✅ Native bindings |
| Go | Go modules | ✅ FFI bindings |
| Java | Maven Central | ✅ Foreign Function API |
| Kotlin | Maven Central | ✅ Coroutine-based |
| C# | NuGet | ✅ .NET 6.0+ |
| PHP | Composer | ✅ PHP 8.2+ |
| Elixir | Hex | ✅ OTP integration |
| R | r-universe | ✅ extendr bindings |
| Dart/Flutter | pub.dev | ✅ flutter_rust_bridge |
| Swift | SPM | ✅ macOS 13+/iOS 16+ |
| Zig | build.zig.zon | ✅ Idiomatic API |
| C (FFI) | pkg-config/CMake | ✅ Header + shared lib |
|
||||
|
||||
### What platforms are supported?
|
||||
|
||||
All bindings support:
|
||||
|
||||
- **Linux** — x86_64 and aarch64
|
||||
- **macOS** — ARM64
|
||||
- **Windows** — x64 (most bindings)
|
||||
|
||||
Precompiled binaries are included for every supported platform and architecture combination.
|
||||
|
||||
### What license does Kreuzberg use?
|
||||
|
||||
Elastic License 2.0 (ELv2), a source-available license that restricts offering the software to others as a hosted or managed service. See [LICENSE](https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE) for details.
|
||||
|
||||
### Where can I get help?
|
||||
|
||||
- **Documentation**: [docs.kreuzberg.dev](https://docs.kreuzberg.dev)
|
||||
- **Live Demo**: [docs.kreuzberg.dev/demo.html](https://docs.kreuzberg.dev/demo.html)
|
||||
- **Discord**: [discord.gg/xt9WY3GnKR](https://discord.gg/xt9WY3GnKR)
|
||||
- **Hugging Face**: [huggingface.co/Kreuzberg](https://huggingface.co/Kreuzberg)
|
||||
- **GitHub Issues**: [github.com/kreuzberg-dev/kreuzberg/issues](https://github.com/kreuzberg-dev/kreuzberg/issues)
|
||||
382
templates/readme/ruby.md
Normal file
@@ -0,0 +1,382 @@
|
||||
# Kreuzberg for Ruby
|
||||
|
||||
{% include 'partials/badges.html.jinja' %}
|
||||
|
||||
{{ description }}
|
||||
|
||||
## What This Package Provides
|
||||
|
||||
- **Ruby-native extraction** — idiomatic Ruby objects over the shared Rust document engine.
|
||||
- **Structured results** — text, tables, images, metadata, language detection, chunks, and warnings.
|
||||
- **OCR support** — Tesseract and PaddleOCR through the same configuration model as other bindings.
|
||||
- **Cross-binding parity** — output matches the Python, Node.js, Go, Java, .NET, PHP, Elixir, R, Dart, Swift, Zig, WASM, and C FFI packages.
|
||||
|
||||
## Installation
|
||||
|
||||
Add to your Gemfile:
|
||||
|
||||
```ruby
gem 'kreuzberg'
```
|
||||
|
||||
Then execute:
|
||||
|
||||
```bash
bundle install
```
|
||||
|
||||
Or install it directly:
|
||||
|
||||
```bash
gem install kreuzberg
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```ruby
require 'kreuzberg'

# Simple synchronous extraction
result = Kreuzberg.extract_file("document.pdf")
puts result.content
```
|
||||
|
||||
### Async Extraction
|
||||
|
||||
```ruby
require 'kreuzberg'

# Using Fiber for concurrency (Ruby 3.0+)
Fiber.new do
  result = Kreuzberg.extract_file_async("document.pdf")
  puts result.content
end.resume
```
|
||||
|
||||
### Batch Processing
|
||||
|
||||
```ruby
require 'kreuzberg'

files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]

results = files.map { |file| Kreuzberg.extract_file(file) }

results.each do |result|
  puts "Content length: #{result.content.length}"
end
```
|
||||
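When a batch contains a file that cannot be processed, it can help to collect failures instead of aborting the whole run. The following is a minimal sketch of that pattern, not part of the Kreuzberg API: `extract_all` and `fake_extractor` are hypothetical names, and the extractor is injected so the example runs without the gem installed.

```ruby
# Illustrative helper (not part of the Kreuzberg API): run an extractor over
# many paths and collect successes and failures separately.
def extract_all(paths, extractor:)
  results = {}
  errors = {}
  paths.each do |path|
    results[path] = extractor.call(path)
  rescue StandardError => e
    # Record the failure and continue with the remaining files.
    errors[path] = e.message
  end
  [results, errors]
end

# Stand-in extractor so the sketch runs without the gem installed.
fake_extractor = lambda do |path|
  raise ArgumentError, "unsupported file: #{path}" if path.end_with?(".bin")

  "text of #{path}"
end

results, errors = extract_all(["doc1.pdf", "blob.bin", "doc3.xlsx"], extractor: fake_extractor)
puts "extracted: #{results.keys.join(', ')}"
puts "failed: #{errors.keys.join(', ')}"
```

With the gem loaded, passing `extractor: Kreuzberg.method(:extract_file)` should apply the same pattern to real documents, assuming extraction failures surface as `StandardError` subclasses.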
|
||||
## Configuration
|
||||
|
||||
```ruby
require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  use_cache: true,
  enable_quality_processing: true,
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    language: 'eng'
  )
)

result = Kreuzberg.extract_file("document.pdf", config: config)
puts result.content
```
|
||||
|
||||
## OCR Support
|
||||
|
||||
### Tesseract Configuration
|
||||
|
||||
```ruby
require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    language: 'eng',
    tesseract_config: Kreuzberg::TesseractConfig.new(
      psm: 6,
      enable_table_detection: true
    )
  )
)

result = Kreuzberg.extract_file("scanned.pdf", config: config)
puts result.content
```
|
||||
|
||||
## Table Extraction
|
||||
|
||||
```ruby
require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  ocr: Kreuzberg::OcrConfig.new(
    backend: 'tesseract',
    tesseract_config: Kreuzberg::TesseractConfig.new(
      enable_table_detection: true
    )
  )
)

result = Kreuzberg.extract_file("invoice.pdf", config: config)

result.tables.each_with_index do |table, index|
  puts "Table #{index}:"
  puts table.markdown
end
```
|
||||
|
||||
## Metadata Extraction
|
||||
|
||||
```ruby
require 'kreuzberg'

result = Kreuzberg.extract_file("document.pdf")

# PDF metadata
if result.metadata[:pdf]
  pdf_meta = result.metadata[:pdf]
  puts "Title: #{pdf_meta[:title]}"
  puts "Author: #{pdf_meta[:author]}"
  puts "Pages: #{pdf_meta[:page_count]}"
end

# Detected languages
puts "Languages: #{result.detected_languages}"

# Images
if result.images
  puts "Images found: #{result.images.count}"
end
```
|
||||
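Metadata fields are often missing for a given document, so a small accessor with a fallback keeps output code tidy. This sketch uses a plain Hash as a stand-in for `result.metadata`; `pdf_field` is a hypothetical helper, and the nested symbol keys are assumed to match the example above.

```ruby
# Stand-in for result.metadata; the keys mirror the example above.
metadata = { pdf: { title: "Quarterly Report", page_count: 12 } }

# Hypothetical helper: read one PDF metadata field, falling back to a
# default when the :pdf section or the field itself is absent.
def pdf_field(metadata, key, default = "unknown")
  metadata.dig(:pdf, key) || default
end

puts "Title: #{pdf_field(metadata, :title)}"
puts "Author: #{pdf_field(metadata, :author)}"
```

`Hash#dig` returns `nil` as soon as any level is missing, which avoids the explicit `if result.metadata[:pdf]` guard.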
|
||||
## Text Chunking
|
||||
|
||||
```ruby
require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  chunking: Kreuzberg::ChunkingConfig.new(
    max_chars: 1000,
    max_overlap: 200
  )
)

result = Kreuzberg.extract_file("long_document.pdf", config: config)

result.chunks.each_with_index do |chunk, index|
  puts "Chunk #{index}: #{chunk.length} characters"
end
```
|
||||
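To build intuition for how `max_chars` and `max_overlap` interact, here is a simplified sketch of fixed-size windows with overlap. It is an illustration only: `chunk_text` is a hypothetical function, and Kreuzberg's own chunker may split differently (for example, on sentence or structural boundaries).

```ruby
# Illustrative only: fixed-size character windows with overlap.
# This is not Kreuzberg's chunking algorithm.
def chunk_text(text, max_chars:, max_overlap:)
  raise ArgumentError, "max_overlap must be smaller than max_chars" unless max_overlap < max_chars

  step = max_chars - max_overlap # how far each new window advances
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, max_chars]
    break if start + max_chars >= text.length # last window reached the end

    start += step
  end
  chunks
end

puts chunk_text("abcdefghij", max_chars: 4, max_overlap: 1).inspect
```

Each window starts `max_chars - max_overlap` characters after the previous one, so consecutive chunks share up to `max_overlap` characters of context.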
|
||||
## Password-Protected PDFs
|
||||
|
||||
```ruby
require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    passwords: ["password1", "password2"]
  )
)

result = Kreuzberg.extract_file("protected.pdf", config: config)
puts result.content
```
|
||||
|
||||
## Language Detection
|
||||
|
||||
```ruby
require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  language_detection: Kreuzberg::LanguageDetectionConfig.new(
    enabled: true
  )
)

result = Kreuzberg.extract_file("multilingual.pdf", config: config)
puts "Detected languages: #{result.detected_languages}"
```
|
||||
|
||||
## API Reference
|
||||
|
||||
### Main Methods
|
||||
|
||||
- `Kreuzberg.extract_file(path, config: nil)` – Extract from file
|
||||
- `Kreuzberg.extract_file_async(path, config: nil)` – Async extraction
|
||||
- `Kreuzberg.extract_bytes(data, mime_type, config: nil)` – Extract from bytes
|
||||
- `Kreuzberg.batch_extract_files(paths, config: nil)` – Batch processing
|
||||
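Because `extract_bytes` takes an explicit MIME type, callers typically need a way to derive one from a file name. The sketch below shows one possible approach; `MIME_TYPES` and `mime_type_for` are hypothetical names, the table is deliberately small, and which MIME strings Kreuzberg accepts should be checked against the format reference.

```ruby
# Hypothetical lookup table; extend it for the formats you handle.
MIME_TYPES = {
  ".pdf" => "application/pdf",
  ".docx" => "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  ".xlsx" => "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  ".html" => "text/html",
  ".txt" => "text/plain"
}.freeze

# Map a path to a MIME type by its (case-insensitive) extension.
def mime_type_for(path)
  ext = File.extname(path).downcase
  MIME_TYPES.fetch(ext) { raise ArgumentError, "no MIME type known for #{ext.inspect}" }
end

puts mime_type_for("Report.PDF")
```

With the gem loaded, a call along the lines of `Kreuzberg.extract_bytes(File.binread(path), mime_type_for(path))` would then follow the signature listed above.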
|
||||
### Configuration Classes
|
||||
|
||||
- `ExtractionConfig` – Main configuration
|
||||
- `OcrConfig` – OCR settings
|
||||
- `TesseractConfig` – Tesseract-specific options
|
||||
- `ChunkingConfig` – Text chunking settings
|
||||
- `PdfConfig` – PDF-specific options
|
||||
- `LanguageDetectionConfig` – Language detection settings
|
||||
|
||||
### Result Object
|
||||
|
||||
- `content` – Extracted text
|
||||
- `metadata` – File metadata as Hash
|
||||
- `tables` – Array of ExtractedTable objects
|
||||
- `detected_languages` – Array of language codes
|
||||
- `chunks` – Array of text chunks
|
||||
- `images` – Array of extracted images (if enabled)
|
||||
|
||||
## System Requirements
|
||||
|
||||
### Ruby Version
|
||||
|
||||
- **Ruby 3.2.0 or higher** (including Ruby 4.x)
|
||||
- Ruby 4.0+ is fully supported with no code changes required
|
||||
- Magnus bindings compile successfully on all supported Ruby versions
|
||||
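A script that depends on the gem can check the interpreter version up front and fail with a clear message. This is a small sketch using `Gem::Version` from RubyGems; `MINIMUM_RUBY` and `ruby_supported?` are hypothetical names, and the 3.2.0 floor is taken from the requirement above.

```ruby
require 'rubygems'

# Minimum interpreter version, per the requirement listed above.
MINIMUM_RUBY = Gem::Version.new("3.2.0")

# Returns true when the given version string meets the minimum.
def ruby_supported?(version = RUBY_VERSION)
  Gem::Version.new(version) >= MINIMUM_RUBY
end

puts "Ruby #{RUBY_VERSION} supported: #{ruby_supported?}"
```

`Gem::Version` compares version segments numerically, so `"3.10.0"` correctly sorts above `"3.2.0"`, which a plain string comparison would get wrong.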
|
||||
### Required
|
||||
|
||||
- Rust toolchain (for native extension compilation)
|
||||
|
||||
### Optional
|
||||
|
||||
```bash
# Tesseract OCR
brew install tesseract # macOS
sudo apt-get install tesseract-ocr # Ubuntu/Debian
```
|
||||
|
||||
### Ruby 4.0 Compatibility
|
||||
|
||||
Kreuzberg is fully compatible with Ruby 4.0 (released December 25, 2025) and later. Key Ruby 4.0 features that work seamlessly:
|
||||
|
||||
- **Ruby Box** - Improved memory efficiency and performance
|
||||
- **ZJIT Compiler** - Enhanced JIT compilation for faster execution
|
||||
- **Ractor Improvements** - Better multi-threaded document processing
|
||||
- **Set Promoted to Core** - No changes needed for Kreuzberg
|
||||
|
||||
All tests pass with Ruby 4.0.1 with 100% compatibility. The gem compiles without any breaking changes.
|
||||
|
||||
## Development
|
||||
|
||||
Clone and setup:
|
||||
|
||||
```bash
git clone https://github.com/kreuzberg-dev/kreuzberg.git
cd kreuzberg
bundle install
```
|
||||
|
||||
Run tests:
|
||||
|
||||
```bash
rake test
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Native extension compilation error
|
||||
|
||||
Ensure build tools are installed:
|
||||
|
||||
```bash
# macOS
xcode-select --install

# Ubuntu/Debian
sudo apt-get install build-essential ruby-dev

# Windows (via RubyInstaller)
ridk install
```
|
||||
|
||||
### "Could not find Kreuzberg"
|
||||
|
||||
Reinstall the gem:
|
||||
|
||||
```bash
gem uninstall kreuzberg
gem install kreuzberg --no-document
```
|
||||
|
||||
### OCR not working
|
||||
|
||||
Verify Tesseract is installed:
|
||||
|
||||
```bash
tesseract --version
```
|
||||
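The same check can be done from Ruby before running OCR, which gives a clearer error than a failed extraction. This is a sketch that only inspects `PATH`; `find_executable` is a hypothetical helper, and it assumes the OCR backend needs the `tesseract` binary to be discoverable that way.

```ruby
# Hypothetical helper: return the full path of an executable found on PATH,
# or nil when it is not present.
def find_executable(name, path_env = ENV.fetch("PATH", ""))
  # On Windows, PATHEXT lists the executable suffixes (.EXE, .BAT, ...).
  exts = ENV["PATHEXT"] ? ENV["PATHEXT"].split(";") : [""]
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    exts.each do |ext|
      candidate = File.join(dir, "#{name}#{ext}")
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

location = find_executable("tesseract")
puts location ? "tesseract found at #{location}" : "tesseract not found on PATH"
```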
|
||||
## Examples
|
||||
|
||||
### Process Directory of PDFs
|
||||
|
||||
```ruby
require 'kreuzberg'

Dir.glob("documents/*.pdf").each do |file|
  puts "Processing: #{file}"
  result = Kreuzberg.extract_file(file)
  puts "  Content length: #{result.content.length}"
  puts "  Language: #{result.detected_languages}"
end
```
|
||||
|
||||
### Extract and Parse Structured Data
|
||||
|
||||
```ruby
require 'kreuzberg'
require 'json'

result = Kreuzberg.extract_file("data.pdf")

# Parse content as JSON (if applicable)
begin
  data = JSON.parse(result.content)
  puts "Parsed data: #{data}"
rescue JSON::ParserError
  puts "Content is not JSON"
end
```
|
||||
|
||||
### Save Extracted Images
|
||||
|
||||
```ruby
require 'kreuzberg'

config = Kreuzberg::ExtractionConfig.new(
  images: Kreuzberg::ImageExtractionConfig.new(
    extract_images: true
  )
)

result = Kreuzberg.extract_file("document.pdf", config: config)

result.images&.each_with_index do |image, index|
  # Image data is binary, so write it in binary mode
  File.binwrite("image_#{index}.png", image.data)
end
```
|
||||
|
||||
## Documentation
|
||||
|
||||
For comprehensive documentation, visit [https://kreuzberg.dev](https://kreuzberg.dev)
|
||||
|
||||
## Part of Kreuzberg.dev
|
||||
|
||||
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
|
||||
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
|
||||
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
|
||||
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
|
||||
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
|
||||
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
|
||||
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
|
||||
|
||||
## License
|
||||
|
||||
{{ license }} License - see [LICENSE](../../LICENSE) for details.
|
||||