# Kreuzberg

[](https://github.com/kreuzberg-dev/alef)

[](https://crates.io/crates/kreuzberg)
[](https://pypi.org/project/kreuzberg/)
[](https://www.npmjs.com/package/@kreuzberg/node)
[](https://www.npmjs.com/package/@kreuzberg/wasm)
[](https://rubygems.org/gems/kreuzberg)
[](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg)
[](https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5)
[](https://www.nuget.org/packages/Kreuzberg/)

[](https://www.elastic.co/licensing/elastic-license)
[](https://kreuzberg.dev/)
[](https://huggingface.co/Kreuzberg)
[](https://discord.gg/xt9WY3GnKR)

High-performance document intelligence library for Rust. Extract text, metadata, and structured information from PDFs, Office documents, images, and other files across 75 formats.

This is the core Rust library that powers the language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, and Elixir).

> **🚀 Version 4.9.5 Release**
> This is a pre-release version. We invite you to test the library and [report any issues](https://github.com/kreuzberg-dev/kreuzberg/issues) you encounter.
>
> **Note**: The Rust crate is not currently published to crates.io for this RC. Use git dependencies or language bindings (Python, TypeScript, Ruby) instead.

## Installation

```toml
[dependencies]
kreuzberg = "4.0"
# `#[tokio::main]` defaults to the multi-threaded runtime, which needs "rt-multi-thread".
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```
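
Since the note above says this release is not on crates.io, a git dependency is one way to pull the crate in. The following is a minimal sketch, assuming the crate is built from the project's GitHub repository (the one the issue tracker link points to). No branch, tag, or revision is pinned because this README does not name one; pin one yourself for reproducible builds.

```toml
[dependencies]
# Assumption: Cargo finds the `kreuzberg` package inside this repository's workspace.
kreuzberg = { git = "https://github.com/kreuzberg-dev/kreuzberg" }
# "rt-multi-thread" is what the default `#[tokio::main]` flavor requires.
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```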

## PDFium Linking Options

Kreuzberg offers flexible PDFium linking strategies for different deployment scenarios. **Note:** Language bindings (Python, TypeScript, Ruby, Java, Go, C#, PHP, Elixir) automatically bundle PDFium, so they need no configuration. This section applies only to the Rust crate.

| Strategy              | Feature Flag  | Description                           | Use Case                                            |
| --------------------- | ------------- | ------------------------------------- | --------------------------------------------------- |
| **Default (Dynamic)** | None          | Links to system PDFium at runtime     | Development, system package users                   |
| **Static**            | `pdf-static`  | Statically links PDFium into binary   | Single binary distribution, no runtime dependencies |
| **Bundled**           | `pdf-bundled` | Downloads and embeds PDFium in binary | CI/CD, hermetic builds, largest binary size         |
| **System**            | `pdf-system`  | Uses system PDFium via pkg-config     | Linux distributions with PDFium package             |

**Example Cargo.toml configurations:**

```toml
# Default (dynamic linking)
[dependencies]
kreuzberg = "4.0"

# Static linking
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-static"] }

# Bundled in binary
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-bundled"] }

# System library (requires PDFium installed)
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf-system"] }
```

For more details on feature flags and configuration options, see the [Kreuzberg documentation](https://docs.kreuzberg.dev).

## System Requirements

### ONNX Runtime (for embeddings)

If using embeddings functionality, ONNX Runtime must be installed:

```bash
# macOS
brew install onnxruntime

# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev

# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
```

Without ONNX Runtime, embedding calls fail with a `MissingDependencyError` that includes installation instructions.

## Quick Start

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

### Async Extraction

```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
```

### Batch Processing

```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let files = vec!["doc1.pdf", "doc2.pdf", "doc3.pdf"];
    let results = batch_extract_file(&files, None, &config).await?;

    for result in results {
        println!("{}", result.content);
    }
    Ok(())
}
```
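
In practice the file list for a batch usually comes from a directory rather than a hard-coded vector. The sketch below uses only the standard library to collect paths with a given extension. `filter_by_extension` and `has_extension` are hypothetical helpers, not part of Kreuzberg, and whether `batch_extract_file` accepts `PathBuf` values directly is not confirmed here, so you may need to convert them to strings first.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: keep only the paths whose extension matches `ext` (case-insensitive).
fn filter_by_extension(paths: &[PathBuf], ext: &str) -> Vec<PathBuf> {
    paths
        .iter()
        .filter(|p| has_extension(p, ext))
        .cloned()
        .collect()
}

/// True when `path` has an extension equal to `ext`, ignoring ASCII case.
fn has_extension(path: &Path, ext: &str) -> bool {
    path.extension()
        .and_then(|e| e.to_str())
        .map_or(false, |e| e.eq_ignore_ascii_case(ext))
}

fn main() -> std::io::Result<()> {
    // Directory to scan: first CLI argument, or the current directory.
    let dir = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());

    let mut entries = Vec::new();
    for entry in std::fs::read_dir(&dir)? {
        entries.push(entry?.path());
    }

    let pdfs = filter_by_extension(&entries, "pdf");
    println!("{} PDF file(s) found in {}", pdfs.len(), dir);
    Ok(())
}
```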

## OCR with Table Extraction

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, OcrConfig, TesseractConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "eng".to_string(),
            tesseract_config: Some(TesseractConfig {
                enable_table_detection: true,
                ..Default::default()
            }),
        }),
        ..Default::default()
    };

    let result = extract_file_sync("invoice.pdf", None, &config)?;

    for table in &result.tables {
        println!("{}", table.markdown);
    }
    Ok(())
}
```
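
If you need the cells rather than the rendered text, the Markdown string can be split back into rows. This is a minimal sketch, assuming `table.markdown` holds a conventional pipe-delimited Markdown table. `parse_markdown_table` is a hypothetical helper, not a Kreuzberg API, and it does not handle escaped pipes (`\|`) inside cells.

```rust
/// Hypothetical helper: split a pipe-delimited Markdown table into rows of trimmed cells.
/// The header separator row (for example `| --- | ---: |`) is skipped.
fn parse_markdown_table(markdown: &str) -> Vec<Vec<String>> {
    markdown
        .lines()
        .map(str::trim)
        .filter(|line| line.starts_with('|'))
        .filter(|line| !is_separator_row(line))
        .map(|line| {
            // Drop exactly one leading and one trailing pipe so empty edge cells survive.
            let inner = line.strip_prefix('|').unwrap_or(line);
            let inner = inner.strip_suffix('|').unwrap_or(inner);
            inner
                .split('|')
                .map(|cell| cell.trim().to_string())
                .collect::<Vec<String>>()
        })
        .collect()
}

/// A separator row contains only pipes, dashes, colons, and spaces, with at least one dash.
fn is_separator_row(line: &str) -> bool {
    line.contains('-') && line.chars().all(|c| matches!(c, '|' | '-' | ':' | ' '))
}

fn main() {
    let markdown = "| Item | Qty |\n| --- | ---: |\n| Widget | 2 |\n| Gadget | 10 |";
    for row in parse_markdown_table(markdown) {
        println!("{:?}", row);
    }
}
```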

## Password-Protected PDFs

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, PdfConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            // Candidate passwords for the document.
            passwords: Some(vec!["password1".to_string(), "password2".to_string()]),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("protected.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

## Extract from Bytes

```rust
use kreuzberg::{extract_bytes_sync, ExtractionConfig};
use std::fs;

fn main() -> kreuzberg::Result<()> {
    let data = fs::read("document.pdf")?;
    let config = ExtractionConfig::default();
    let result = extract_bytes_sync(&data, "application/pdf", &config)?;
    println!("{}", result.content);
    Ok(())
}
```
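
Because `extract_bytes_sync` takes the MIME type as a string, callers that only have a file name need some way to derive it. The sketch below shows one simple approach with a small extension lookup. `mime_for_path` is a hypothetical helper written for illustration. Kreuzberg may ship its own MIME detection, which this README does not describe, and the table covers only a handful of formats.

```rust
use std::path::Path;

/// Hypothetical helper (not part of Kreuzberg): map a file extension to a MIME type string.
fn mime_for_path(path: &Path) -> Option<&'static str> {
    // Extensions are compared case-insensitively.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "docx" => Some("application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        "xlsx" => Some("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
        "pptx" => Some("application/vnd.openxmlformats-officedocument.presentationml.presentation"),
        "html" | "htm" => Some("text/html"),
        "txt" => Some("text/plain"),
        "png" => Some("image/png"),
        "jpg" | "jpeg" => Some("image/jpeg"),
        _ => None,
    }
}

fn main() {
    for name in ["report.PDF", "slides.pptx", "notes.unknown"] {
        match mime_for_path(Path::new(name)) {
            Some(mime) => println!("{}: {}", name, mime),
            None => println!("{}: no MIME type known", name),
        }
    }
}
```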

## Code Intelligence

Kreuzberg integrates [tree-sitter-language-pack](https://docs.tree-sitter-language-pack.kreuzberg.dev) to parse and analyze source code files across **300+ programming languages**. When you extract a source code file, Kreuzberg automatically detects the language and produces structured analysis including functions, classes, imports, exports, symbols, diagnostics, and semantic code chunks.

Code intelligence data is available via the `metadata.format` field as a `FormatMetadata::Code` variant containing a `ProcessResult`.

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig {
        tree_sitter: Some(TreeSitterConfig {
            process: TreeSitterProcessConfig {
                structure: true,
                imports: true,
                exports: true,
                comments: true,
                docstrings: true,
                ..Default::default()
            },
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = extract_file_sync("app.py", None, &config)?;

    // Access code intelligence from format metadata
    if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
        println!("Language: {}", code.language);
        println!("Functions/classes: {}", code.structure.len());
        println!("Imports: {}", code.imports.len());

        for item in &code.structure {
            println!("  {:?}: {:?} at line {}", item.kind, item.name, item.span.start_line);
        }

        for chunk in &code.chunks {
            // Take the first 50 characters; slicing at a byte offset could panic
            // if the cut lands inside a multi-byte UTF-8 character.
            let preview: String = chunk.content.chars().take(50).collect();
            println!("Chunk ({} bytes): {}...", chunk.content.len(), preview);
        }
    }

    Ok(())
}
```

Requires the `tree-sitter` feature flag (included in `full`). See the [Kreuzberg docs](https://docs.kreuzberg.dev/) for configuration details and examples in all languages.
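
When printing short previews of code chunks, keep in mind that source files often contain non-ASCII text in comments and string literals, and cutting a `&str` at an arbitrary byte offset panics if the offset falls inside a multi-byte character. The sketch below shows one way to truncate by character count without allocating. `preview` is a hypothetical helper, not part of Kreuzberg.

```rust
/// Hypothetical helper: return at most `max_chars` characters of `text` as a borrowed slice.
/// The cut point comes from `char_indices`, which only yields valid character boundaries.
fn preview(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        // Byte index where character number `max_chars` starts: slice up to it.
        Some((byte_index, _)) => &text[..byte_index],
        // The text has `max_chars` characters or fewer: return all of it.
        None => text,
    }
}

fn main() {
    let chunk = "def grüße(name): return f\"Grüß dich, {name}\"";
    println!("{}...", preview(chunk, 12));
}
```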

## Features

The crate uses feature flags for optional functionality:

```toml
[dependencies]
kreuzberg = { version = "4.0", features = ["pdf", "excel", "ocr"] }
```

### Available Features

| Feature              | Description                | Binary Size |
| -------------------- | -------------------------- | ----------- |
| `pdf`                | PDF extraction (pure Rust) | +2MB        |
| `excel`              | Excel/spreadsheet parsing  | +3MB        |
| `office`             | DOCX, PPTX extraction      | +1MB        |
| `email`              | EML, MSG extraction        | +500KB      |
| `html`               | HTML to markdown           | +1MB        |
| `xml`                | XML streaming parser       | +500KB      |
| `archives`           | ZIP, TAR, 7Z extraction    | +2MB        |
| `ocr`                | OCR with Tesseract         | +5MB        |
| `language-detection` | Language detection         | +100KB      |
| `chunking`           | Text chunking              | +200KB      |
| `quality`            | Text quality processing    | +500KB      |
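
To get a rough idea of what a feature selection costs, the figures in the table can simply be added up. The sketch below does that, under two stated assumptions: the overheads are treated as additive (features that share dependencies may cost less in practice), and 1 MB is counted as 1000 KB. `estimated_overhead_kb` is a hypothetical helper for illustration only.

```rust
/// Approximate binary-size overhead per feature in KB, copied from the table above.
/// Assumption: 1 MB is treated as 1000 KB; the figures are rough to begin with.
const FEATURE_SIZES_KB: [(&str, u32); 11] = [
    ("pdf", 2000),
    ("excel", 3000),
    ("office", 1000),
    ("email", 500),
    ("html", 1000),
    ("xml", 500),
    ("archives", 2000),
    ("ocr", 5000),
    ("language-detection", 100),
    ("chunking", 200),
    ("quality", 500),
];

/// Sum the listed overhead for the selected features; unknown names contribute nothing.
fn estimated_overhead_kb(features: &[&str]) -> u32 {
    features
        .iter()
        .filter_map(|name| {
            FEATURE_SIZES_KB
                .iter()
                .find(|(feature, _)| feature == name)
                .map(|(_, kb)| *kb)
        })
        .sum()
}

fn main() {
    let selection = ["pdf", "excel", "ocr"];
    println!(
        "{:?} adds roughly {} KB",
        selection,
        estimated_overhead_kb(&selection)
    );
}
```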

### Feature Bundles

```toml
kreuzberg = { version = "4.0", features = ["full"] }
kreuzberg = { version = "4.0", features = ["server"] }
kreuzberg = { version = "4.0", features = ["cli"] }
```

## PDF Support

Kreuzberg uses **pdf_oxide**, a pure-Rust PDF library with no system dependencies.
Enable PDF extraction with the `pdf` feature:

```toml
[dependencies]
kreuzberg = { version = "5.0", features = ["pdf"] }
```

No native libraries are required. It works on all platforms, including musl, Docker, and WASM.

## Documentation

**[API Documentation](https://docs.rs/kreuzberg)** – Complete API reference with examples

**[https://docs.kreuzberg.dev](https://docs.kreuzberg.dev)** – User guide and tutorials

## License

Elastic License 2.0 (ELv2) - see [LICENSE](../../LICENSE) for details.