# Kreuzberg

[](https://github.com/kreuzberg-dev/alef)

[](https://crates.io/crates/kreuzberg)
[](https://pypi.org/project/kreuzberg/)
[](https://www.npmjs.com/package/@kreuzberg/node)
[](https://www.npmjs.com/package/@kreuzberg/wasm)
[](https://rubygems.org/gems/kreuzberg)
[](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
[](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5)
[](https://www.nuget.org/packages/Kreuzberg/)

[](https://www.elastic.co/licensing/elastic-license)
[](https://kreuzberg.dev/)
[](https://huggingface.co/Kreuzberg)
[](https://discord.gg/xt9WY3GnKR)

High-performance document intelligence library for Rust. Extract text, metadata, and structured information from PDFs, Office documents, images, and more, across 75 formats.

This is the core Rust library that powers the language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, and Elixir).

> **🚀 Version 4.9.5 Release**
> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
>
> **Note**: The Rust crate is not currently published to crates.io for this RC. Use git dependencies or language bindings (Python, TypeScript, Ruby) instead.

## Installation

```toml
[dependencies]
kreuzberg = "4.0"
tokio = { version = "1", features = ["rt", "macros"] }
```

## PDFium Linking Options

Kreuzberg offers four PDFium linking strategies for different deployment scenarios. **Note:** Language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir) automatically bundle PDFium, so no configuration is needed. This section applies only to the Rust crate.

| Strategy              | Feature Flag  | Description                           | Use Case                                            |
| --------------------- | ------------- | ------------------------------------- | --------------------------------------------------- |
| **Default (Dynamic)** | None          | Links to system PDFium at runtime     | Development, system package users                   |
| **Static**            | `pdf-static`  | Statically links PDFium into binary   | Single binary distribution, no runtime dependencies |
| **Bundled**           | `pdf-bundled` | Downloads and embeds PDFium in binary | CI/CD, hermetic builds, largest binary size         |
| **System**            | `pdf-system`  | Uses system PDFium via pkg-config     | Linux distributions with PDFium package             |

**Example `Cargo.toml` configurations (choose one):**

```toml
# Default (dynamic linking)
[dependencies]
kreuzberg = "4.0"

# Static linking
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-static"] }

# Bundled in binary
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-bundled"] }

# System library (requires PDFium installed)
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-system"] }
```

For more details on feature flags and configuration options, see the [Kreuzberg documentation](https://docs.kreuzberg.dev).

## System Requirements

### ONNX Runtime (for embeddings)

To use the embeddings functionality, install ONNX Runtime:

```bash
# macOS
brew install onnxruntime

# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev

# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
```

Without ONNX Runtime, embedding calls fail with a `MissingDependencyError` that includes installation instructions.

## Quick Start

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

### Async Extraction

```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
```

### Batch Processing

```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let files = vec!["doc1.pdf", "doc2.pdf", "doc3.pdf"];
    let results = batch_extract_file(&files, None, &config).await?;

    for result in results {
        println!("{}", result.content);
    }
    Ok(())
}
```

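Batch inputs often come from a directory rather than a hard-coded list. As a minimal sketch using only the standard library, the hypothetical helper below (`collect_files_with_extension` is an illustration, not part of Kreuzberg) gathers the matching files directly inside a directory into a sorted `Vec<PathBuf>`:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of Kreuzberg): returns the files directly
/// inside `dir` whose extension equals `ext`, ignoring ASCII case.
/// Subdirectories are not searched.
pub fn collect_files_with_extension(dir: &Path, ext: &str) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let matches = path
            .extension()
            .and_then(|e| e.to_str())
            .is_some_and(|e| e.eq_ignore_ascii_case(ext));
        if path.is_file() && matches {
            files.push(path);
        }
    }
    // `read_dir` yields entries in a platform-dependent order, so sort
    // the paths to make each batch reproducible.
    files.sort();
    Ok(files)
}
```

The resulting paths could then be handed to `batch_extract_file`. Whether it accepts `PathBuf` values directly or needs them converted to `&str` is an assumption to check against the API documentation.
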
## OCR with Table Extraction

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng".to_string(),
            tesseract_config: Some(TesseractConfig {
                enable_table_detection: true,
                ..Default::default()
            }),
        }),
        ..Default::default()
    };

    let result = extract_file_sync("invoice.pdf", None, &config)?;

    for table in &result.tables {
        println!("{}", table.markdown);
    }
    Ok(())
}
```

## Password-Protected PDFs

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            // Candidate passwords for the protected document
            passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("protected.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

## Extract from Bytes

```rust
use kreuzberg::{extract_bytes_sync, ExtractionConfig};
use std::fs;

fn main() -> kreuzberg::Result<()> {
    let data = fs::read("document.pdf")?;
    let config = ExtractionConfig::default();
    let result = extract_bytes_sync(&data, "application/pdf", &config)?;
    println!("{}", result.content);
    Ok(())
}
```

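Extracting from bytes requires the caller to supply a MIME type, since there is no file path to inspect. As a minimal sketch, the hypothetical helper below (`mime_type_for_path` is an illustration, not part of Kreuzberg) maps a few common extensions to their standard MIME types. Only `"application/pdf"` appears in the example above; that Kreuzberg accepts the other strings in exactly this form is an assumption.

```rust
use std::path::Path;

/// Hypothetical helper (not part of Kreuzberg): maps a file extension to its
/// standard MIME type, ignoring ASCII case. Returns `None` for paths without
/// an extension or with an extension this sketch does not know.
pub fn mime_type_for_path(path: &Path) -> Option<&'static str> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "docx" => Some("application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        "xlsx" => Some("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
        "pptx" => Some("application/vnd.openxmlformats-officedocument.presentationml.presentation"),
        "html" | "htm" => Some("text/html"),
        "txt" => Some("text/plain"),
        "png" => Some("image/png"),
        "jpg" | "jpeg" => Some("image/jpeg"),
        _ => None,
    }
}
```

Returning `Option` rather than a default such as `application/octet-stream` leaves the decision about unknown extensions to the caller.
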
## Code Intelligence

Kreuzberg integrates [tree-sitter-language-pack](https://docs.tree-sitter-language-pack.kreuzberg.dev) to parse and analyze source code files across **300+ programming languages**. When you extract a source code file, Kreuzberg automatically detects the language and produces structured analysis including functions, classes, imports, exports, symbols, diagnostics, and semantic code chunks.

Code intelligence data is available via the `metadata.format` field as a `FormatMetadata::Code` variant containing a `ProcessResult`.

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        tree_sitter: Some(TreeSitterConfig {
            process: TreeSitterProcessConfig {
                structure: true,
                imports: true,
                exports: true,
                comments: true,
                docstrings: true,
                ..Default::default()
            },
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("app.py", None, &config)?;

    // Access code intelligence from format metadata
    if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
        println!("Language: {}", code.language);
        println!("Functions/classes: {}", code.structure.len());
        println!("Imports: {}", code.imports.len());

        for item in &code.structure {
            println!("  {:?}: {:?} at line {}", item.kind, item.name, item.span.start_line);
        }

        for chunk in &code.chunks {
            // Truncate by characters rather than bytes: slicing at a fixed byte
            // index panics if it lands inside a multi-byte UTF-8 character.
            let preview: String = chunk.content.chars().take(50).collect();
            println!("Chunk ({} bytes): {}...", chunk.content.len(), preview);
        }
    }

    Ok(())
}
```

Requires the `tree-sitter` feature flag (included in `full`). See the [Kreuzberg docs](https://docs.kreuzberg.dev/) for configuration details and examples in all languages.

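When printing a short preview of a code chunk, truncating a string at a fixed byte index can panic if the cut lands inside a multi-byte UTF-8 character. As a minimal sketch, the hypothetical helper below (`preview` is an illustration, not part of Kreuzberg) truncates on a character boundary and borrows from the input instead of allocating:

```rust
/// Hypothetical helper (not part of Kreuzberg): returns the longest prefix of
/// `text` that contains at most `max_chars` characters. The cut always falls
/// on a character boundary, so the slice cannot panic.
pub fn preview(text: &str, max_chars: usize) -> &str {
    // `char_indices` yields the byte offset of each character; the offset of
    // the character at position `max_chars` is where the prefix must end.
    match text.char_indices().nth(max_chars) {
        Some((byte_index, _)) => &text[..byte_index],
        // Fewer than or exactly `max_chars` characters: return everything.
        None => text,
    }
}
```

Assuming a chunk's content is a `String`, as the example above suggests, a call such as `preview(&chunk.content, 50)` would yield the preview text.
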
## Features

The crate uses feature flags for optional functionality:

```toml
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "excel", "ocr"] }
```

### Available Features

| Feature              | Description                | Binary Size |
| -------------------- | -------------------------- | ----------- |
| `pdf`                | PDF extraction (pure Rust) | +2MB        |
| `excel`              | Excel/spreadsheet parsing  | +3MB        |
| `office`             | DOCX, PPTX extraction      | +1MB        |
| `email`              | EML, MSG extraction        | +500KB      |
| `html`               | HTML to markdown           | +1MB        |
| `xml`                | XML streaming parser       | +500KB      |
| `archives`           | ZIP, TAR, 7Z extraction    | +2MB        |
| `ocr`                | OCR with Tesseract         | +5MB        |
| `language-detection` | Language detection         | +100KB      |
| `chunking`           | Text chunking              | +200KB      |
| `quality`            | Text quality processing    | +500KB      |

### Feature Bundles

```toml
# Choose one bundle:
kreuzberg = { version = "4.0", features = ["full"] }
kreuzberg = { version = "4.0", features = ["server"] }
kreuzberg = { version = "4.0", features = ["cli"] }
```

## PDF Support

Kreuzberg uses **pdf_oxide**, a pure-Rust PDF library with no system dependencies.
Enable PDF extraction with the `pdf` feature:

```toml
[dependencies]
kreuzberg = { version = "5.0", features = ["pdf"] }
```

No native libraries required. Works on all platforms including musl, Docker, and WASM.

## Documentation

**[API Documentation](https://docs.rs/kreuzberg)** – Complete API reference with examples

**[Kreuzberg documentation](https://docs.kreuzberg.dev)** – User guide and tutorials

## License

Elastic License 2.0 (ELv2) - see [LICENSE](../../LICENSE) for details.