# Kreuzberg

[alef](https://github.com/kreuzberg-dev/alef)

[crates.io](https://crates.io/crates/kreuzberg) · [PyPI](https://pypi.org/project/kreuzberg/) · [npm (@kreuzberg/node)](https://www.npmjs.com/package/@kreuzberg/node) · [npm (@kreuzberg/wasm)](https://www.npmjs.com/package/@kreuzberg/wasm) · [RubyGems](https://rubygems.org/gems/kreuzberg) · [Maven Central](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg) · [Go](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5) · [NuGet](https://www.nuget.org/packages/Kreuzberg/)

[Elastic License](https://www.elastic.co/licensing/elastic-license) · [kreuzberg.dev](https://kreuzberg.dev/) · [Hugging Face](https://huggingface.co/Kreuzberg) · [Discord](https://discord.gg/xt9WY3GnKR)

High-performance document intelligence library for Rust. Extract text, metadata, and structured information from PDFs, Office documents, images, and other files across 75 formats.

This is the core Rust library that powers the Python, TypeScript, and Ruby bindings.

> **🚀 Version 4.9.5 Release**
> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
>
> **Note**: The Rust crate is not currently published to crates.io for this RC. Use git dependencies or language bindings (Python, TypeScript, Ruby) instead.

## Installation

```toml
[dependencies]
kreuzberg = "4.0"
tokio = { version = "1", features = ["rt", "macros"] }
```

## PDFium Linking Options

Kreuzberg offers several PDFium linking strategies for different deployment scenarios. **Note:** Language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir) bundle PDFium automatically, so no configuration is needed. This section applies only to the Rust crate.

| Strategy              | Feature Flag  | Description                           | Use Case                                            |
| --------------------- | ------------- | ------------------------------------- | --------------------------------------------------- |
| **Default (Dynamic)** | None          | Links to system PDFium at runtime     | Development, system package users                   |
| **Static**            | `pdf-static`  | Statically links PDFium into binary   | Single binary distribution, no runtime dependencies |
| **Bundled**           | `pdf-bundled` | Downloads and embeds PDFium in binary | CI/CD, hermetic builds, largest binary size         |
| **System**            | `pdf-system`  | Uses system PDFium via pkg-config     | Linux distributions with PDFium package             |

**Example Cargo.toml configurations:**

```toml
# Choose ONE of the following configurations.

# Default (dynamic linking)
[dependencies]
kreuzberg = "4.0"

# Static linking
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-static"] }

# Bundled in binary
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-bundled"] }

# System library (requires PDFium installed)
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-system"] }
```

For more details on feature flags and configuration options, see the [Kreuzberg documentation](https://docs.kreuzberg.dev).

## System Requirements

### ONNX Runtime (for embeddings)

If using embeddings functionality, ONNX Runtime must be installed:

```bash
# macOS
brew install onnxruntime

# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev

# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
```

Without ONNX Runtime, embeddings will raise `MissingDependencyError` with installation instructions.

## Quick Start

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

### Async Extraction

```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
```

### Batch Processing

```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let files = vec!["doc1.pdf", "doc2.pdf", "doc3.pdf"];
    let results = batch_extract_file(&files, None, &config).await?;

    for result in results {
        println!("{}", result.content);
    }
    Ok(())
}
```

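Batch extraction takes a list of file paths, and in practice that list often comes from a directory rather than a hard-coded `vec!`. The following is a minimal sketch using only the standard library; `collect_by_extension` is a hypothetical helper and not part of Kreuzberg, and whether the resulting `Vec<PathBuf>` can be passed to `batch_extract_file` unchanged depends on that function's signature, which is an assumption here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not a Kreuzberg API): collect the regular files directly
/// inside `dir` whose extension equals `ext`, ignoring ASCII case, in sorted order.
fn collect_by_extension(dir: &Path, ext: &str) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let matches = path
            .extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| e.eq_ignore_ascii_case(ext));
        if matches && path.is_file() {
            files.push(path);
        }
    }
    // read_dir yields entries in platform-dependent order, so sort for stable output.
    files.sort();
    Ok(files)
}

fn main() -> io::Result<()> {
    // Build a throwaway directory so the sketch runs anywhere.
    let dir = std::env::temp_dir().join("kreuzberg_batch_demo");
    fs::create_dir_all(&dir)?;
    for name in ["a.pdf", "b.PDF", "notes.txt"] {
        fs::write(dir.join(name), b"placeholder")?;
    }

    for path in collect_by_extension(&dir, "pdf")? {
        println!("{}", path.display());
    }

    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

The helper does not recurse into subdirectories; that keeps the sketch short, and a recursive walk would be a straightforward extension.
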
## OCR with Table Extraction

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng".to_string(),
            tesseract_config: Some(TesseractConfig {
                enable_table_detection: true,
                ..Default::default()
            }),
        }),
        ..Default::default()
    };

    let result = extract_file_sync("invoice.pdf", None, &config)?;

    for table in &result.tables {
        println!("{}", table.markdown);
    }
    Ok(())
}
```

## Password-Protected PDFs

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            // Candidate passwords for the protected document
            passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("protected.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

## Extract from Bytes

```rust
use kreuzberg::{extract_bytes_sync, ExtractionConfig};
use std::fs;

fn main() -> kreuzberg::Result<()> {
    let data = fs::read("document.pdf")?;
    let config = ExtractionConfig::default();
    let result = extract_bytes_sync(&data, "application/pdf", &config)?;
    println!("{}", result.content);
    Ok(())
}
```

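Extracting from bytes requires a MIME type string alongside the data, since there is no file name to inspect. When the bytes originally came from a named file, one way to obtain that string is a small lookup on the extension. The sketch below is independent of Kreuzberg: `mime_for_path` is a hypothetical helper, its table covers only a handful of common types, and the library may well provide its own detection that should be preferred.

```rust
use std::path::Path;

/// Hypothetical helper (not a Kreuzberg API): map a file extension to a MIME type.
/// Returns `None` for missing or unrecognized extensions.
fn mime_for_path(path: &Path) -> Option<&'static str> {
    // Lowercase the extension so "report.PDF" and "report.pdf" behave the same.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "docx" => Some("application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        "xlsx" => Some("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
        "html" | "htm" => Some("text/html"),
        "txt" => Some("text/plain"),
        "png" => Some("image/png"),
        "jpg" | "jpeg" => Some("image/jpeg"),
        _ => None,
    }
}

fn main() {
    for name in ["report.PDF", "notes.txt", "archive.unknown"] {
        match mime_for_path(Path::new(name)) {
            Some(mime) => println!("{} -> {}", name, mime),
            None => println!("{} -> no known MIME type", name),
        }
    }
}
```

Returning `Option` rather than a default such as `application/octet-stream` leaves the fallback decision to the caller.
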
## Code Intelligence

Kreuzberg integrates [tree-sitter-language-pack](https://docs.tree-sitter-language-pack.kreuzberg.dev) to parse and analyze source code files across **300+ programming languages**. When you extract a source code file, Kreuzberg automatically detects the language and produces structured analysis including functions, classes, imports, exports, symbols, diagnostics, and semantic code chunks.

Code intelligence data is available via the `metadata.format` field as a `FormatMetadata::Code` variant containing a `ProcessResult`.

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        tree_sitter: Some(TreeSitterConfig {
            process: TreeSitterProcessConfig {
                structure: true,
                imports: true,
                exports: true,
                comments: true,
                docstrings: true,
                ..Default::default()
            },
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("app.py", None, &config)?;

    // Access code intelligence from format metadata
    if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
        println!("Language: {}", code.language);
        println!("Functions/classes: {}", code.structure.len());
        println!("Imports: {}", code.imports.len());

        for item in &code.structure {
            println!("  {:?}: {:?} at line {}", item.kind, item.name, item.span.start_line);
        }

        for chunk in &code.chunks {
            // Take the first 50 characters; slicing by byte offset could panic
            // if the offset falls inside a multi-byte UTF-8 character.
            let preview: String = chunk.content.chars().take(50).collect();
            println!("Chunk ({} bytes): {}...", chunk.content.len(), preview);
        }
    }

    Ok(())
}
```

Requires the `tree-sitter` feature flag (included in `full`). See the [Kreuzberg docs](https://docs.kreuzberg.dev/) for configuration details and examples in all languages.

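When printing a short preview of a code chunk, note that Rust string slices are indexed by byte, and slicing at an offset that falls inside a multi-byte UTF-8 character panics at runtime. Source files with non-ASCII identifiers, comments, or string literals can trigger this. A minimal standard-library sketch of a boundary-safe preview follows; `preview` is a hypothetical helper name, not part of Kreuzberg.

```rust
/// Return a prefix of `text` containing at most `max_chars` characters.
/// The cut always lands on a character boundary, so it cannot panic.
fn preview(text: &str, max_chars: usize) -> &str {
    // `char_indices` yields the byte offset at which each character starts;
    // the offset of character number `max_chars` is where the prefix ends.
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        // Fewer than `max_chars` characters: the whole string is the preview.
        None => text,
    }
}

fn main() {
    let chunk = "def grüße(): return 'héllo wörld'";
    println!("Chunk ({} bytes): {}...", chunk.len(), preview(chunk, 12));
}
```

Because it returns a borrowed `&str`, this variant avoids allocating a new `String` for each preview.
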
## Features

The crate uses feature flags for optional functionality:

```toml
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "excel", "ocr"] }
```

### Available Features

| Feature              | Description                | Binary Size |
| -------------------- | -------------------------- | ----------- |
| `pdf`                | PDF extraction (pure Rust) | +2MB        |
| `excel`              | Excel/spreadsheet parsing  | +3MB        |
| `office`             | DOCX, PPTX extraction      | +1MB        |
| `email`              | EML, MSG extraction        | +500KB      |
| `html`               | HTML to markdown           | +1MB        |
| `xml`                | XML streaming parser       | +500KB      |
| `archives`           | ZIP, TAR, 7Z extraction    | +2MB        |
| `ocr`                | OCR with Tesseract         | +5MB        |
| `language-detection` | Language detection         | +100KB      |
| `chunking`           | Text chunking              | +200KB      |
| `quality`            | Text quality processing    | +500KB      |

### Feature Bundles

```toml
kreuzberg = { version = "4.0", features = ["full"] }
kreuzberg = { version = "4.0", features = ["server"] }
kreuzberg = { version = "4.0", features = ["cli"] }
```

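Cargo features are resolved at compile time, so code behind a disabled feature is simply absent from the binary. An application that defines its own features and forwards them to Kreuzberg's could report what was compiled in, roughly as sketched below. This illustrates the general Cargo mechanism only: the feature names `pdf` and `ocr` here stand for hypothetical features of the downstream crate, and nothing in the sketch reflects how Kreuzberg itself is organized internally.

```rust
/// Report which optional capabilities were compiled into this binary.
/// `pdf` and `ocr` are hypothetical features of the downstream crate.
fn compiled_capabilities() -> Vec<&'static str> {
    // Plain-text handling is assumed to be always available in this sketch.
    let mut caps = vec!["text"];
    // `cfg!` evaluates to a compile-time boolean for the given feature.
    if cfg!(feature = "pdf") {
        caps.push("pdf");
    }
    if cfg!(feature = "ocr") {
        caps.push("ocr");
    }
    caps
}

fn main() {
    println!("compiled capabilities: {:?}", compiled_capabilities());
}
```

For code that must not compile at all without a feature (for example, calls into an optional dependency), the attribute form `#[cfg(feature = "...")]` on the item is the usual choice instead of the `cfg!` macro.
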
## PDF Support

Kreuzberg uses **pdf_oxide**, a pure-Rust PDF library with no system dependencies.
Enable PDF extraction with the `pdf` feature:

```toml
[dependencies]
kreuzberg = { version = "5.0", features = ["pdf"] }
```

No native libraries are required. It works on all platforms, including musl, Docker, and WASM.

## Documentation

**[API Documentation](https://docs.rs/kreuzberg)** – Complete API reference with examples

**[https://docs.kreuzberg.dev](https://docs.kreuzberg.dev)** – User guide and tutorials

## License

Elastic License 2.0 (ELv2) - see [LICENSE](../../LICENSE) for details.