fil/crates/kreuzberg/README.md
Henrik Jess Nielsen b4c07d3693
Nomad changes
2026-06-01 23:40:55 +02:00

# Kreuzberg
[![Bindings](https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6)](https://github.com/kreuzberg-dev/alef)
[![Rust](https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6)](https://crates.io/crates/kreuzberg)
[![Python](https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6)](https://pypi.org/project/kreuzberg/)
[![TypeScript](https://img.shields.io/npm/v/@kreuzberg/node?label=TypeScript&color=007ec6)](https://www.npmjs.com/package/@kreuzberg/node)
[![WASM](https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6)](https://www.npmjs.com/package/@kreuzberg/wasm)
[![Ruby](https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6)](https://rubygems.org/gems/kreuzberg)
[![Java](https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6)](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
[![Go](https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6)](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5)
[![C#](https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6)](https://www.nuget.org/packages/Kreuzberg/)
[![License: Elastic-2.0](https://img.shields.io/badge/License-Elastic--2.0-007ec6)](https://www.elastic.co/licensing/elastic-license)
[![Documentation](https://img.shields.io/badge/Docs-kreuzberg-007ec6)](https://kreuzberg.dev/)
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Kreuzberg-007ec6)](https://huggingface.co/Kreuzberg)
[![Discord](https://img.shields.io/badge/Discord-Chat-007ec6)](https://discord.gg/xt9WY3GnKR)
High-performance document intelligence library for Rust. Extract text, metadata, and structured information from 75 formats, including PDFs, Office documents, and images.
This is the core Rust library that powers the Python, TypeScript, and Ruby bindings.
> **🚀 Version 4.9.5 Release**
> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
>
> **Note**: The Rust crate is not currently published to crates.io for this RC. Use git dependencies or language bindings (Python, TypeScript, Ruby) instead.
## Installation
```toml
[dependencies]
kreuzberg = "4.0"
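# "rt-multi-thread" is needed by the default `#[tokio::main]` used in the async examples
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }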
```
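Since the note above says this release is not on crates.io, a git dependency is one way to pull the crate in the meantime. The fragment below is a sketch: it assumes the crate can be resolved by name from the repository linked in the badges, and it deliberately leaves out a `tag` or `rev` pin because this README does not name one.

```toml
[dependencies]
# Assumption: Cargo finds the `kreuzberg` package inside this repository's workspace.
# Add `tag = "..."` or `rev = "..."` to pin a specific release.
kreuzberg = { git = "https://github.com/kreuzberg-dev/kreuzberg" }
```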
## PDFium Linking Options
Kreuzberg offers flexible PDFium linking strategies for different deployment scenarios. **Note:** Language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir) automatically bundle PDFium—no configuration needed. This section applies only to the Rust crate.
| Strategy | Feature Flag | Description | Use Case |
| --------------------- | ------------- | ------------------------------------- | --------------------------------------------------- |
| **Default (Dynamic)** | None | Links to system PDFium at runtime | Development, system package users |
| **Static** | `pdf-static` | Statically links PDFium into binary | Single binary distribution, no runtime dependencies |
| **Bundled** | `pdf-bundled` | Downloads and embeds PDFium in binary | CI/CD, hermetic builds, largest binary size |
| **System** | `pdf-system` | Uses system PDFium via pkg-config | Linux distributions with PDFium package |
**Example Cargo.toml configurations:**
```toml
# Pick ONE of the following entries; they are alternatives, not a single Cargo.toml.

# Default (dynamic linking)
[dependencies]
kreuzberg = "4.0"

# Static linking
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-static"] }

# Bundled in binary
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-bundled"] }

# System library (requires PDFium installed)
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-system"] }
```
For more details on feature flags and configuration options, see the [Kreuzberg documentation](https://docs.kreuzberg.dev).
## System Requirements
### ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime must be installed:
```bash
# macOS
brew install onnxruntime
# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev
# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
```
Without ONNX Runtime, embedding calls fail with `MissingDependencyError`, which includes installation instructions.
## Quick Start
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    // `None`: no explicit MIME type is passed for the file.
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```
### Async Extraction
```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
```
### Batch Processing
```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let files = vec!["doc1.pdf", "doc2.pdf", "doc3.pdf"];
    let results = batch_extract_file(&files, None, &config).await?;
    for result in results {
        println!("{}", result.content);
    }
    Ok(())
}
```
## OCR with Table Extraction
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig};

fn main() -> kreuzberg::Result<()> {
    // Enable OCR via Tesseract with table detection turned on.
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng".to_string(),
            tesseract_config: Some(TesseractConfig {
                enable_table_detection: true,
                ..Default::default()
            }),
        }),
        ..Default::default()
    };
    let result = extract_file_sync("invoice.pdf", None, &config)?;
    // Each detected table is available as Markdown.
    for table in &result.tables {
        println!("{}", table.markdown);
    }
    Ok(())
}
```
## Password-Protected PDFs
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig};

fn main() -> kreuzberg::Result<()> {
    // Supply one or more candidate passwords for the document.
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
            ..Default::default()
        }),
        ..Default::default()
    };
    let result = extract_file_sync("protected.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```
## Extract from Bytes
```rust
use kreuzberg::{extract_bytes_sync, ExtractionConfig};
use std::fs;

fn main() -> kreuzberg::Result<()> {
    let data = fs::read("document.pdf")?;
    let config = ExtractionConfig::default();
    // With raw bytes there is no file name, so the MIME type is passed explicitly.
    let result = extract_bytes_sync(&data, "application/pdf", &config)?;
    println!("{}", result.content);
    Ok(())
}
```
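Because the bytes API in the example above takes a MIME type string, callers who start from a file name need some way to produce one. The std-only sketch below shows one possible approach. `mime_for_path` is a hypothetical helper, not part of Kreuzberg, and it is an assumption that Kreuzberg accepts these standard IANA type strings for the corresponding formats.

```rust
use std::path::Path;

/// Hypothetical helper (not a Kreuzberg API): map a file extension to a
/// standard MIME type string. Returns `None` for unknown or missing extensions.
fn mime_for_path(path: &str) -> Option<&'static str> {
    // Compare extensions case-insensitively, so "report.PDF" matches "pdf".
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "docx" => Some("application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        "xlsx" => Some("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
        "html" | "htm" => Some("text/html"),
        "txt" => Some("text/plain"),
        "png" => Some("image/png"),
        "jpg" | "jpeg" => Some("image/jpeg"),
        _ => None,
    }
}

fn main() {
    for path in ["report.PDF", "notes.txt", "archive.xyz"] {
        match mime_for_path(path) {
            Some(mime) => println!("{}: {}", path, mime),
            None => println!("{}: unknown extension, pass the MIME type explicitly", path),
        }
    }
}
```

The returned string could then be passed as the second argument to `extract_bytes_sync`; returning `None` for unknown extensions leaves the decision to the caller instead of guessing.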
## Code Intelligence
Kreuzberg integrates [tree-sitter-language-pack](https://docs.tree-sitter-language-pack.kreuzberg.dev) to parse and analyze source code files across **300+ programming languages**. When you extract a source code file, Kreuzberg automatically detects the language and produces structured analysis including functions, classes, imports, exports, symbols, diagnostics, and semantic code chunks.
Code intelligence data is available via the `metadata.format` field as a `FormatMetadata::Code` variant containing a `ProcessResult`.
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        tree_sitter: Some(TreeSitterConfig {
            process: TreeSitterProcessConfig {
                structure: true,
                imports: true,
                exports: true,
                comments: true,
                docstrings: true,
                ..Default::default()
            },
            ..Default::default()
        }),
        ..Default::default()
    };
    let result = extract_file_sync("app.py", None, &config)?;

    // Access code intelligence from format metadata
    if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
        println!("Language: {}", code.language);
        println!("Functions/classes: {}", code.structure.len());
        println!("Imports: {}", code.imports.len());
        for item in &code.structure {
            println!("  {:?}: {:?} at line {}", item.kind, item.name, item.span.start_line);
        }
        for chunk in &code.chunks {
            // Take the first 50 characters; slicing by byte index could panic
            // in the middle of a multi-byte UTF-8 character.
            let preview: String = chunk.content.chars().take(50).collect();
            println!("Chunk ({} bytes): {}...", chunk.content.len(), preview);
        }
    }
    Ok(())
}
```
Requires the `tree-sitter` feature flag (included in `full`). See the [Kreuzberg docs](https://docs.kreuzberg.dev/) for configuration details and examples in all languages.
## Features
The crate uses feature flags for optional functionality:
```toml
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "excel", "ocr"] }
```
### Available Features
| Feature | Description | Binary Size |
| -------------------- | -------------------------- | ----------- |
| `pdf` | PDF extraction (pure Rust) | +2MB |
| `excel` | Excel/spreadsheet parsing | +3MB |
| `office` | DOCX, PPTX extraction | +1MB |
| `email` | EML, MSG extraction | +500KB |
| `html` | HTML to markdown | +1MB |
| `xml` | XML streaming parser | +500KB |
| `archives` | ZIP, TAR, 7Z extraction | +2MB |
| `ocr` | OCR with Tesseract | +5MB |
| `language-detection` | Language detection | +100KB |
| `chunking` | Text chunking | +200KB |
| `quality` | Text quality processing | +500KB |
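
The Binary Size column can be used for a rough estimate of what a given feature set adds. The sketch below sums the listed figures for the `["pdf", "excel", "ocr"]` selection shown earlier. It assumes the overheads are simply additive and that 1 MB is 1024 KB, neither of which the table states, and `estimated_overhead_kb` is a hypothetical helper rather than anything Kreuzberg provides.

```rust
/// Approximate overhead per feature in KB, copied from the table above.
/// Assumption: 1 MB is treated as 1024 KB.
const FEATURE_SIZES_KB: &[(&str, u32)] = &[
    ("pdf", 2 * 1024),
    ("excel", 3 * 1024),
    ("office", 1024),
    ("email", 500),
    ("html", 1024),
    ("xml", 500),
    ("archives", 2 * 1024),
    ("ocr", 5 * 1024),
    ("language-detection", 100),
    ("chunking", 200),
    ("quality", 500),
];

/// Hypothetical helper: sum the listed overheads for the selected features.
/// Returns `None` if any feature name is not in the table.
fn estimated_overhead_kb(features: &[&str]) -> Option<u32> {
    features
        .iter()
        .map(|f| {
            FEATURE_SIZES_KB
                .iter()
                .find(|(name, _)| name == f)
                .map(|(_, kb)| *kb)
        })
        .sum()
}

fn main() {
    match estimated_overhead_kb(&["pdf", "excel", "ocr"]) {
        Some(kb) => println!("approx. +{} KB", kb),
        None => println!("unknown feature in selection"),
    }
}
```

For that selection the listed figures add up to about 10 MB; treat the result as an upper-level estimate only, since shared dependencies between features may make the real increase smaller.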
### Feature Bundles
```toml
# Each line is a separate option; use one `kreuzberg` entry per Cargo.toml.
kreuzberg = { version = "4.0", features = ["full"] }
kreuzberg = { version = "4.0", features = ["server"] }
kreuzberg = { version = "4.0", features = ["cli"] }
```
## PDF Support
Kreuzberg uses **pdf_oxide** — a pure-Rust PDF library with no system dependencies.
Enable PDF extraction with the `pdf` feature:
```toml
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf"] }
```
No native libraries required. Works on all platforms including musl, Docker, and WASM.
## Documentation
- **[API Documentation](https://docs.rs/kreuzberg)**: Complete API reference with examples
- **[https://docs.kreuzberg.dev](https://docs.kreuzberg.dev)**: User guide and tutorials
## License
Elastic License 2.0 (ELv2) - see [LICENSE](../../LICENSE) for details.