543
docs/guides/advanced.md
Normal file
@@ -0,0 +1,543 @@
|
||||
# Advanced Features
|
||||
|
||||
## Text Chunking
|
||||
|
||||
Split extracted text into chunks for RAG, vector databases, or LLM context windows. Four strategies:
|
||||
|
||||
- **Text** — splits on whitespace/punctuation boundaries
|
||||
- **Markdown** — structure-aware; preserves headings, lists, and code blocks
|
||||
- **YAML** — section-aware; preserves YAML document structure
|
||||
- **Semantic** — topic-aware; splits at natural document boundaries
|
||||
|
||||
### Semantic
|
||||
|
||||
Set `chunker_type` to `"semantic"`. Uses an embedding model for topic detection when one is configured; otherwise falls back to structural heuristics.
|
||||
|
||||
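```python
from kreuzberg import ChunkingConfig, ExtractionConfig

# Select the semantic chunker; all other chunking options keep their defaults
config = ExtractionConfig(
    chunking=ChunkingConfig(chunker_type="semantic")
)
```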
|
||||
|
||||
**Behavior:**
|
||||
|
||||
- **Without embeddings** — Uses structural heuristics: detects headers (ALL CAPS, numbered sections) and paragraph boundaries
|
||||
- **With embeddings** — Compares consecutive paragraphs via embeddings to detect topic shifts, merging paragraphs below the `topic_threshold` (default: 0.5)
|
||||
|
||||
Use `topic_threshold` to control sensitivity: higher values (0.7–0.9) preserve more fine-grained topics, while lower values (0.1–0.3) merge more aggressively. The threshold only applies when an embedding model is configured.
|
||||
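The effect of `topic_threshold` can be illustrated with a small standalone sketch. It assumes the threshold is the minimum cosine similarity at which two consecutive paragraphs are still treated as the same topic; the function names and the toy vectors are hypothetical, and Kreuzberg's internal logic may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_by_topic(
    paragraphs: list[str],
    vectors: list[list[float]],
    topic_threshold: float = 0.5,
) -> list[str]:
    """Merge consecutive paragraphs until similarity drops below the threshold."""
    if not paragraphs:
        return []
    groups = [[paragraphs[0]]]
    for prev_vec, vec, para in zip(vectors, vectors[1:], paragraphs[1:]):
        if cosine_similarity(prev_vec, vec) >= topic_threshold:
            groups[-1].append(para)  # same topic: extend the current chunk
        else:
            groups.append([para])    # topic shift: start a new chunk
    return ["\n\n".join(group) for group in groups]


# Toy 2-dimensional "embeddings" (assumed values, not real model output)
paragraphs = ["Intro to cats.", "More about cats.", "Tax law basics."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

print(len(merge_by_topic(paragraphs, vectors, topic_threshold=0.5)))
print(len(merge_by_topic(paragraphs, vectors, topic_threshold=0.999)))
```

Under this assumption a high threshold splits almost every paragraph into its own chunk, while a low threshold folds loosely related paragraphs together.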
|
||||
### Configuration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/chunking_config.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/config/chunking_config.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/chunking_config.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/chunking_config.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/config/chunking_config.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/chunking_config.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/chunking_config.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/config/chunking_config.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
--8<-- "snippets/wasm/config/chunking_config.md"
|
||||
|
||||
### Chunk Output
|
||||
|
||||
Each chunk in `result.chunks` contains:
|
||||
|
||||
| Field                                   | Description                                      |
| --------------------------------------- | ------------------------------------------------ |
| `content`                               | Chunk text                                       |
| `metadata.byte_start` / `byte_end`      | Byte offsets in the original text                |
| `metadata.chunk_index` / `total_chunks` | Position in sequence                             |
| `metadata.token_count`                  | Token count (when embeddings enabled)            |
| `metadata.heading_context`              | Active heading hierarchy (Markdown chunker only) |
| `embedding`                             | Embedding vector (when configured)               |
|
||||
|
||||
Chunks can be sized by token count instead of characters — enable the `chunking-tokenizers` feature and set `sizing` to `token`.
|
||||
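Because `byte_start` and `byte_end` are byte offsets rather than character offsets, slicing a Python `str` with them directly gives the wrong span as soon as the text contains non-ASCII characters. The sketch below shows one way to map the offsets back to text. It assumes the offsets refer to the UTF-8 encoding of the original text, and the helper name is hypothetical.

```python
def slice_by_byte_offsets(original: str, byte_start: int, byte_end: int) -> str:
    """Return the text covered by [byte_start, byte_end) in the UTF-8 encoding."""
    return original.encode("utf-8")[byte_start:byte_end].decode("utf-8")


text = "Überblick: naïve café"

# "Ü" occupies two bytes in UTF-8, so the first word spans bytes 0..10
print(slice_by_byte_offsets(text, 0, 10))

# Character slicing with the same numbers runs one character too far
print(text[0:10])
```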
|
||||
### RAG Pipeline Example
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/chunking_rag.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/utils/chunking_rag.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/chunking_rag.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/advanced/chunking_rag.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/advanced/chunking_rag.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/chunking_rag.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/advanced/chunking_rag.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/advanced/chunking_rag.md"
|
||||
|
||||
## Language Detection
|
||||
|
||||
Detect languages in extracted text using [`whatlang`](https://crates.io/crates/whatlang) — 60+ languages with ISO 639-3 codes. Set `detect_multiple: true` to chunk the text into 200-char segments and return all detected languages sorted by prevalence.
|
||||
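The `detect_multiple` behavior described above (fixed-size segments, then languages ranked by how often they occur) can be sketched in plain Python. This is only an illustration of the idea: `toy_detect` is a hypothetical stand-in for a real detector such as `whatlang`, and the helper names are not part of Kreuzberg.

```python
from collections import Counter
from typing import Callable, Optional


def split_segments(text: str, size: int = 200) -> list[str]:
    """Cut text into consecutive segments of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def languages_by_prevalence(
    text: str,
    detect: Callable[[str], Optional[str]],
    size: int = 200,
) -> list[str]:
    """Detect a language per segment and rank codes by segment count."""
    counts: Counter = Counter()
    for segment in split_segments(text, size):
        code = detect(segment)
        if code is not None:
            counts[code] += 1
    return [code for code, _ in counts.most_common()]


def toy_detect(segment: str) -> Optional[str]:
    """Hypothetical detector: keys on a single German word, nothing more."""
    if not segment.strip():
        return None
    return "deu" if "und" in segment else "eng"


sample = "the cat sat " * 50 + "und der Hund " * 10
print(languages_by_prevalence(sample, toy_detect))
```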
|
||||
### Configuration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/language_detection_config.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/config/language_detection_config.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/language_detection_config.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/language_detection_config.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/config/language_detection_config.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/language_detection_config.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/language_detection_config.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/config/language_detection_config.md"
|
||||
|
||||
### Multilingual Example
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/language_detection_multilingual.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/metadata/language_detection_multilingual.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/language_detection_multilingual.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/advanced/language_detection_multilingual.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/advanced/language_detection_multilingual.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/language_detection_multilingual.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/advanced/language_detection_multilingual.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/advanced/language_detection_multilingual.md"
|
||||
|
||||
## Embedding Generation
|
||||
|
||||
Local in-process embeddings via ONNX for semantic search and RAG — no external API calls. Requires the `embeddings` feature.
|
||||
|
||||
| Preset         | Model                        | Dimensions | Max Tokens | Use Case                                                |
| -------------- | ---------------------------- | ---------- | ---------- | ------------------------------------------------------- |
| `fast`         | all-MiniLM-L6-v2 (quantized) | 384        | 512        | Quick prototyping, development, resource-constrained    |
| `balanced`     | BGE-base-en-v1.5             | 768        | 1024       | General-purpose RAG, production deployments, English    |
| `quality`      | BGE-large-en-v1.5            | 1024       | 2000       | Complex documents, maximum accuracy, sufficient compute |
| `multilingual` | multilingual-e5-base         | 768        | 1024       | International documents, mixed-language content         |
|
||||
|
||||
### In-Process Embedding Backends (Plugin Variant)
|
||||
|
||||
Plug a caller-managed embedder (e.g. `llama-cpp-python`, `sentence-transformers`) into Kreuzberg via the `Plugin` variant of `EmbeddingModelType` — Kreuzberg calls back into the registered backend instead of running its own ONNX model.
|
||||
|
||||
1. Register the backend once at startup via `kreuzberg::plugins::register_embedding_backend(Arc::new(MyEmbedder))`. The backend implements `EmbeddingBackend` (a `Plugin`-inheriting async trait with `dimensions()` and `embed(texts) -> Vec<Vec<f32>>`).
|
||||
2. Reference it by name in `EmbeddingConfig`: `{ "model": { "type": "plugin", "name": "my-embedder" } }`.
|
||||
3. Optional: set `EmbeddingConfig.max_embed_duration_secs` (default 60) to bound the wait on a hung backend; `None` disables the timeout.
|
||||
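The contract in the steps above (a backend exposing `dimensions()` and `embed(texts)`, called with a bounded wait) can be modeled in a few lines of Python. This is a sketch only: `ToyEmbedder` and `embed_with_timeout` are hypothetical names, the placement of `max_embed_duration_secs` next to `model` in the config dictionary is an assumption, and the real registration call for each binding is not shown here.

```python
import asyncio
from typing import Optional

# Config shape from step 2, with the timeout from step 3 (placement assumed)
EMBEDDING_CONFIG = {
    "model": {"type": "plugin", "name": "my-embedder"},
    "max_embed_duration_secs": 60,
}


class ToyEmbedder:
    """Hypothetical backend producing cheap, deterministic 4-dimensional vectors."""

    def dimensions(self) -> int:
        return 4

    async def embed(self, texts: list[str]) -> list[list[float]]:
        return [
            [float(len(t)), float(t.count(" ")), float(sum(map(ord, t)) % 97), 1.0]
            for t in texts
        ]


async def embed_with_timeout(
    backend,
    texts: list[str],
    max_embed_duration_secs: Optional[float] = 60.0,
) -> list[list[float]]:
    """Call the backend, bounding the wait unless the timeout is None."""
    if max_embed_duration_secs is None:
        vectors = await backend.embed(texts)
    else:
        vectors = await asyncio.wait_for(
            backend.embed(texts), timeout=max_embed_duration_secs
        )
    # Reject vectors whose length disagrees with the advertised dimensions
    if any(len(v) != backend.dimensions() for v in vectors):
        raise ValueError("backend returned a vector of unexpected size")
    return vectors


vectors = asyncio.run(embed_with_timeout(ToyEmbedder(), ["hello world", "kreuzberg"]))
print(len(vectors), len(vectors[0]))
```

A backend that hangs raises `asyncio.TimeoutError` here instead of blocking the caller indefinitely, which is the behavior the timeout setting is meant to provide.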
|
||||
The CLI (`kreuzberg embed --provider plugin --plugin my-embedder`), MCP server (`embed_text` tool, `embedding_plugin` parameter), REST API, and env var `KREUZBERG_EMBEDDING_PLUGIN_NAME` all accept the Plugin variant once a backend is registered.
|
||||
|
||||
**Fork-safety**: Python callers running under `multiprocessing`, `gunicorn`'s prefork worker, or Celery prefork must re-register the backend in each child process — native-backed embedders (including `llama-cpp-python`) aren't fork-safe. Use `os.register_at_fork(after_in_child=reregister_fn)` to automate the re-registration.
|
||||
|
||||
### Configuration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/embedding_with_chunking.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/utils/embedding_with_chunking.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/embedding_with_chunking.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/advanced/embedding_with_chunking.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/advanced/embedding_with_chunking.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/embedding_with_chunking.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/advanced/embedding_with_chunking.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/advanced/embedding_with_chunking.md"
|
||||
|
||||
### Vector Database Integration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/vector_database_integration.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/utils/vector_database_integration.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/vector_database_integration.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/advanced/vector_database_integration.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/advanced/vector_database_integration.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/vector_database_integration.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/advanced/vector_database_integration.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/advanced/vector_database_integration.md"
|
||||
|
||||
## Token Reduction
|
||||
|
||||
Reduce token count while preserving meaning for LLM pipelines.
|
||||
|
||||
| Level        | Reduction | Effect                                   |
| ------------ | --------- | ---------------------------------------- |
| `off`        | 0%        | Pass-through                             |
| `moderate`   | 15–25%    | Stopwords + redundancy removal           |
| `aggressive` | 30–50%    | Semantic clustering + importance scoring |
|
||||
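To check what reduction a given level achieves on your own documents, compare token counts before and after. The sketch below uses whitespace-separated words as a rough proxy for tokens and a tiny hand-picked stopword list; both are assumptions for illustration and do not reflect Kreuzberg's actual reduction algorithm.

```python
# Tiny illustrative stopword list (assumed; real lists are far larger)
STOPWORDS = frozenset({"a", "an", "the", "of", "to", "and", "is", "in", "that", "it"})


def remove_stopwords(text: str) -> str:
    """Drop stopwords, comparing case-insensitively and ignoring edge punctuation."""
    kept = [
        word for word in text.split()
        if word.lower().strip(".,;:!?") not in STOPWORDS
    ]
    return " ".join(kept)


def reduction_ratio(original: str, reduced: str) -> float:
    """Fraction of whitespace-separated tokens removed (0.0 means unchanged)."""
    before = len(original.split())
    after = len(reduced.split())
    return 0.0 if before == 0 else 1 - after / before


original = "The cat sat in the hat and it is a happy cat"
reduced = remove_stopwords(original)
print(reduced)
print(f"{reduction_ratio(original, reduced):.0%}")
```

Short, stopword-heavy sentences like this one exaggerate the effect; real documents typically land much closer to the ranges in the table.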
|
||||
### Configuration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/token_reduction_config.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/config/token_reduction_config.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/token_reduction_config.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/token_reduction_config.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/config/token_reduction_config.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/token_reduction_config.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/token_reduction_config.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/config/token_reduction_config.md"
|
||||
|
||||
### Example
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/token_reduction_example.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/utils/token_reduction_example.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/token_reduction_example.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/advanced/token_reduction_example.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/advanced/token_reduction_example.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/token_reduction_example.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/advanced/token_reduction_example.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/advanced/token_reduction_example.md"
|
||||
|
||||
## Keyword Extraction
|
||||
|
||||
Extract keywords using YAKE or RAKE algorithms. Requires the `keywords` feature flag. See [Keyword Extraction](keywords.md) for algorithm details and parameter reference.
|
||||
|
||||
### Configuration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/keyword_extraction_config.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/config/keyword_extraction_config.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/keyword_extraction_config.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/keyword_extraction_config.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/config/keyword_extraction_config.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/keyword_extraction_config.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/keyword_extraction_config.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/config/keyword_extraction_config.md"
|
||||
|
||||
### Example
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/keyword_extraction_example.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/utils/keyword_extraction_example.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/keyword_extraction_example.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/advanced/keyword_extraction_example.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/advanced/keyword_extraction_example.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/keyword_extraction_example.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/advanced/keyword_extraction_example.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/advanced/keyword_extraction_example.md"
|
||||
|
||||
## Quality Processing
|
||||
|
||||
Score extracted text for quality issues (0.0–1.0, where 1.0 is highest quality). Detects OCR artifacts, script content, navigation elements, and structural issues.
|
||||
|
||||
| Factor              | Weight | Detects                                                |
| ------------------- | ------ | ------------------------------------------------------ |
| OCR Artifacts       | 30%    | Scattered chars, repeated punctuation, malformed words |
| Script Content      | 20%    | JavaScript, CSS, HTML tags                             |
| Navigation Elements | 10%    | Breadcrumbs, pagination, skip links                    |
| Document Structure  | 20%    | Sentence/paragraph length, punctuation distribution    |
| Metadata Quality    | 10%    | Presence of title, author, subject                     |
|
||||
|
||||
Score ranges: `0.0–0.3` very low, `0.3–0.6` low, `0.6–0.8` moderate, `0.8–1.0` high.
|
||||
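The score ranges above translate directly into a small lookup helper, for example to filter low-quality documents out of a pipeline. The listed ranges share their boundary values, so this sketch assumes each lower bound is inclusive; the function name is hypothetical.

```python
def quality_label(score: float) -> str:
    """Map a quality score in [0.0, 1.0] to the documented range label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within 0.0 and 1.0, got {score}")
    if score < 0.3:
        return "very low"
    if score < 0.6:
        return "low"
    if score < 0.8:
        return "moderate"
    return "high"


for score in (0.1, 0.45, 0.7, 0.95):
    print(score, quality_label(score))
```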
|
||||
### Configuration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/quality_processing_config.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/config/quality_processing_config.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/quality_processing_config.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/quality_processing_config.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/config/quality_processing_config.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/quality_processing_config.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/quality_processing_config.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/config/quality_processing_config.md"
|
||||
|
||||
### Example
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/quality_processing_example.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/utils/quality_processing_example.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/quality_processing_example.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/advanced/quality_processing_example.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/advanced/quality_processing_example.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/quality_processing_example.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/advanced/quality_processing_example.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/advanced/quality_processing_example.md"
|
||||
|
||||
## Combining Features
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/advanced/combining_all_features.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/getting-started/combining_all_features.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/combining_all_features.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/combining_all_features.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/combining_all_features.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/combining_all_features.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/combining_all_features.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/api/combining_all_features.md"
|
||||
101
docs/guides/agent-skills.md
Normal file
@@ -0,0 +1,101 @@
|
||||
# AI Coding Assistants <span class="version-badge new">v4.2.15</span>
|
||||
|
||||
Kreuzberg ships with an [Agent Skill](https://agentskills.io) that teaches AI coding assistants how to use the library correctly — covering extraction, configuration, OCR, chunking, embeddings, batch processing, error handling, and plugins across Python, Node.js/TypeScript, Rust, and CLI.
|
||||
|
||||
## Supported Assistants
|
||||
|
||||
Works with any tool supporting the [Agent Skills](https://agentskills.io) standard: Claude Code, Codex, Gemini CLI, Cursor, Visual Studio Code (with AI extensions), Amp, Goose, and Roo Code.
|
||||
|
||||
## Installing
|
||||
|
||||
```bash title="Terminal"
# Install into current project (recommended)
npx skills add kreuzberg-dev/kreuzberg

# Install globally
npx skills add kreuzberg-dev/kreuzberg -g
```

Or copy manually:

```bash title="Terminal"
cp -r path/to/kreuzberg/skills/kreuzberg .claude/skills/kreuzberg
```
|
||||
|
||||
## What the Skill Provides
|
||||
|
||||
When your AI coding assistant discovers the skill, it knows:
|
||||
|
||||
- All extraction functions and their correct signatures across languages
|
||||
- Configuration field names (for example, `max_chars` not `max_characters` in Python)
|
||||
- Rust feature gates (for example, `tokio-runtime` for sync functions)
|
||||
- Language-specific conventions (snake_case in Python/Rust, camelCase in Node.js)
|
||||
- Error handling patterns for each language
|
||||
|
||||
### Skill Structure
|
||||
|
||||
```text
skills/kreuzberg/
├── SKILL.md                    # Main skill (~400 lines)
└── references/
    ├── python-api.md           # Complete Python API
    ├── nodejs-api.md           # Complete Node.js API
    ├── rust-api.md             # Complete Rust API
    ├── cli-reference.md        # All CLI commands and flags
    ├── configuration.md        # Config file formats and schema
    ├── supported-formats.md    # All 90+ supported formats
    ├── advanced-features.md    # Plugins, embeddings, MCP, security
    └── other-bindings.md       # Go, Ruby, Java, C#, PHP, Elixir
```
|
||||
|
||||
The main file stays under 500 lines for efficient AI consumption. Reference files load on demand.
|
||||
|
||||
## Quick Examples
|
||||
|
||||
=== "Python"

    ```python
    from kreuzberg import extract_file, extract_file_sync, ExtractionConfig, OcrConfig

    result = await extract_file("document.pdf")
    print(result.content)

    config = ExtractionConfig(
        ocr=OcrConfig(backend="tesseract", language="eng"),
        output_format="markdown",
    )
    result = await extract_file("document.pdf", config=config)
    ```

=== "Node.js"

    ```typescript
    import { extractFile, extractFileSync } from '@kreuzberg/node';

    const result = await extractFile('document.pdf');
    console.log(result.content);
    ```

=== "Rust"

    ```rust
    use kreuzberg::{extract_file, ExtractionConfig};

    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    ```

=== "CLI"

    ```bash
    kreuzberg extract document.pdf
    kreuzberg extract document.pdf --format json --output-format markdown
    ```
|
||||
|
||||
## Further Reading
|
||||
|
||||
- [Agent Skills Standard](https://agentskills.io) — the open standard
|
||||
- [Extraction Basics](extraction.md) — core extraction API
|
||||
- [Configuration](configuration.md) — all configuration options
|
||||
- [Advanced Features](advanced.md) — chunking, embeddings, language detection
|
||||
- [Plugin System](plugins.md) — custom plugins
|
||||
391
docs/guides/api-server.md
Normal file
@@ -0,0 +1,391 @@
|
||||
# API Server <span class="version-badge">v4.0.0</span>
|
||||
|
||||
Kreuzberg runs as an HTTP REST API server (`kreuzberg serve`) or as an MCP server (`kreuzberg mcp`) for AI agent integration.
|
||||
|
||||
## HTTP REST API
|
||||
|
||||
### Start
|
||||
|
||||
=== "CLI"
|
||||
|
||||
--8<-- "snippets/api_server/cli.md"
|
||||
|
||||
=== "Docker"
|
||||
|
||||
--8<-- "snippets/api_server/docker.md"
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/api_server/python.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/api_server/rust.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/api_server/go.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/api_server/java.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/api_server/csharp.md"
|
||||
|
||||
### Endpoints
|
||||
|
||||
#### POST /extract
|
||||
|
||||
Extract text from uploaded files via multipart form data.
|
||||
|
||||
| Field           | Required         | Description                                      |
| --------------- | ---------------- | ------------------------------------------------ |
| `files`         | Yes (repeatable) | Files to extract                                 |
| `config`        | No               | JSON config overrides                            |
| `output_format` | No               | `plain` (default), `markdown`, `djot`, or `html` |

```bash title="Terminal"
# Single file
curl -F "files=@document.pdf" http://localhost:8000/extract

# Multiple files
curl -F "files=@doc1.pdf" -F "files=@doc2.docx" http://localhost:8000/extract

# With config overrides
curl -F "files=@scanned.pdf" \
     -F 'config={"ocr":{"language":"eng"},"force_ocr":true}' \
     http://localhost:8000/extract
```

```json title="Response"
[
  {
    "content": "Extracted text...",
    "mime_type": "application/pdf",
    "metadata": { "page_count": 10, "author": "John Doe" },
    "tables": [],
    "detected_languages": ["eng"],
    "chunks": null,
    "images": null
  }
]
```
|
||||
|
||||
#### POST /embed
|
||||
|
||||
Generate vector embeddings. Requires the `embeddings` feature.
|
||||
|
||||
| Field    | Required | Description                |
| -------- | -------- | -------------------------- |
| `texts`  | Yes      | Array of strings           |
| `config` | No       | Embedding config overrides |

```bash title="Terminal"
curl -X POST http://localhost:8000/embed \
     -H "Content-Type: application/json" \
     -d '{"texts":["Hello world","Second text"]}'
```

| Preset               | Dimensions | Model              |
| -------------------- | ---------- | ------------------ |
| `fast`               | 384        | AllMiniLML6V2Q     |
| `balanced` (default) | 768        | BGEBaseENV15       |
| `quality`            | 1024       | BGELargeENV15      |
| `multilingual`       | 768        | MultilingualE5Base |
|
||||
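The same `/embed` request can be assembled from Python with only the standard library. The sketch below builds the request object without sending it, so it runs without a server; the helper name and the `base_url` default are assumptions, and the response shape is not covered here.

```python
import json
import urllib.request
from typing import Optional


def build_embed_request(
    texts: list[str],
    base_url: str = "http://localhost:8000",
    config: Optional[dict] = None,
) -> urllib.request.Request:
    """Build a POST /embed request with a JSON body."""
    payload: dict = {"texts": list(texts)}
    if config is not None:
        payload["config"] = config  # optional embedding config overrides
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_embed_request(["Hello world", "Second text"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen(request)` sends it to a running server.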
|
||||
#### POST /chunk
|
||||
|
||||
Chunk text for RAG pipelines.
|
||||
|
||||
| Field                   | Required | Description                                                 |
| ----------------------- | -------- | ----------------------------------------------------------- |
| `text`                  | Yes      | Text to chunk                                               |
| `chunker_type`          | No       | `"text"` (default), `"markdown"`, `"yaml"`, or `"semantic"` |
| `config.max_characters` | No       | Max chars per chunk (default: 2000)                         |
| `config.overlap`        | No       | Overlap between chunks (default: 100)                       |

```bash title="Terminal"
curl -X POST http://localhost:8000/chunk \
     -H "Content-Type: application/json" \
     -d '{"text":"Long text...","chunker_type":"text","config":{"max_characters":1000,"overlap":50}}'
```
|
||||
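How `max_characters` and `overlap` interact is easiest to see in a naive fixed-window splitter. The sketch below is not the server's algorithm (the `text` chunker splits on whitespace and punctuation boundaries); it only illustrates that each chunk starts `max_characters - overlap` characters after the previous one. The function name is hypothetical.

```python
def sliding_chunks(text: str, max_characters: int = 2000, overlap: int = 100) -> list[str]:
    """Split text into windows of max_characters that share `overlap` characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be non-negative and smaller than max_characters")
    chunks = []
    step = max_characters - overlap
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", max_characters=4, overlap=1))
```

Larger overlaps preserve more context across chunk boundaries at the cost of more chunks and duplicated text.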
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/api/client_chunk_text.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/api/client_chunk_text.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/client_chunk_text.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/client_chunk_text.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/client_chunk_text.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/client_chunk_text.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/client_chunk_text.md"
|
||||
|
||||
#### POST /extract-structured <span class="version-badge">v4.8.0</span>
|
||||
|
||||
Extract typed JSON from a document by running an LLM against the extracted text with a JSON schema (requires `liter-llm` feature; `multipart/form-data` request).
|
||||
|
||||
| Field               | Required | Description                                                                                        |
| ------------------- | -------- | -------------------------------------------------------------------------------------------------- |
| `file` (or `files`) | Yes      | The document to extract from                                                                       |
| `schema`            | Yes      | JSON Schema string describing the structured output                                                |
| `model`             | Yes      | LLM model identifier, for example `openai/gpt-4o` or `anthropic/claude-sonnet-4-20250514`          |
| `api_key`           | No       | LLM provider API key. Falls back to provider env vars (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, ...) |
| `prompt`            | No       | Custom Jinja2 prompt template overriding the default                                               |
| `schema_name`       | No       | Schema identifier (default: `extraction`)                                                          |
| `strict`            | No       | `"true"` / `"false"` — enable OpenAI strict mode for exact schema matching                         |
| `config`            | No       | Extraction config overrides as a JSON string                                                       |

```bash title="Terminal"
curl -X POST http://localhost:8000/extract-structured \
     -F "file=@invoice.pdf" \
     -F 'schema={"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}},"required":["invoice_number","total"]}' \
     -F "model=openai/gpt-4o" \
     -F "api_key=$OPENAI_API_KEY" \
     -F "strict=true"
```

```json title="Response"
{
  "structured_output": {
    "invoice_number": "INV-2026-0142",
    "total": 1284.5
  },
  "content": "Invoice INV-2026-0142...",
  "mime_type": "application/pdf"
}
```
|
||||
|
||||
Errors follow the same shape as `/extract`. A `501` response indicates the server was built without `liter-llm`.
|
||||
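When `strict` mode is off, it can be worth sanity-checking `structured_output` against the schema on the client side. The sketch below performs a deliberately shallow check (required keys and top-level primitive types only) with the standard library; it is not a full JSON Schema validator, and the helper name is hypothetical.

```python
# Map JSON Schema primitive type names to Python types
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def check_top_level(schema: dict, output: dict) -> list[str]:
    """Return a list of problems; an empty list means the shallow check passed."""
    problems = []
    for name in schema.get("required", []):
        if name not in output:
            problems.append(f"missing: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in output:
            continue
        json_type = spec.get("type")
        expected = _JSON_TYPES.get(json_type)
        if expected is None:
            continue  # unknown or absent type: nothing to compare
        value = output[name]
        # bool is a subclass of int in Python, so reject it for numeric fields
        bool_as_number = isinstance(value, bool) and json_type in ("number", "integer")
        if bool_as_number or not isinstance(value, expected):
            problems.append(f"wrong type: {name}")
    return problems


schema = {
    "type": "object",
    "properties": {"invoice_number": {"type": "string"}, "total": {"type": "number"}},
    "required": ["invoice_number", "total"],
}
response = {"structured_output": {"invoice_number": "INV-2026-0142", "total": 1284.5}}
print(check_top_level(schema, response["structured_output"]))
```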
|
||||
#### Other Endpoints
|
||||
|
||||
| Endpoint          | Method | Description                                                               |
| ----------------- | ------ | ------------------------------------------------------------------------- |
| `/health`         | GET    | `{"status":"healthy","version":"4.6.3"}`                                  |
| `/version`        | GET    | `{"version":"4.6.3"}` <span class="version-badge">v4.5.2</span>           |
| `/detect`         | POST   | MIME type detection (multipart) <span class="version-badge">v4.5.2</span> |
| `/cache/stats`    | GET    | Cache statistics                                                          |
| `/cache/warm`     | POST   | Pre-download models <span class="version-badge">v4.5.2</span>             |
| `/cache/manifest` | GET    | Model manifest with checksums <span class="version-badge">v4.5.2</span>   |
| `/cache/clear`    | DELETE | Clear all cached files                                                    |
| `/info`           | GET    | `{"version":"...","rust_backend":true}`                                   |
| `/openapi.json`   | GET    | OpenAPI 3.0 schema                                                        |
|
||||
|
||||
### Client Examples
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/api/client_extract_single_file.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/getting-started/client_extract_single_file.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/client_extract_single_file.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/client_extract_single_file.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/client_extract_single_file.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/client_extract_single_file.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/client_extract_single_file.md"
|
||||
|
||||
### Error Handling
|
||||
|
||||
```json title="Error response"
{
  "error_type": "ValidationError",
  "message": "Invalid file format",
  "status_code": 400
}
```

| Status | Error type                 | Meaning           |
| ------ | -------------------------- | ----------------- |
| 400    | `ValidationError`          | Invalid input     |
| 422    | `ParsingError`, `OcrError` | Processing failed |
| 500    | Internal errors            | Server errors     |
|
||||
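A client can turn the error body into a readable message by combining the three fields with the status table above. The sketch below parses the documented JSON shape; treating every status of 500 and above as a server error is an assumption, and the function name is hypothetical.

```python
import json


def describe_error(body: str) -> str:
    """Summarize an error response body using the documented fields."""
    payload = json.loads(body)
    status = payload["status_code"]
    if status == 400:
        meaning = "Invalid input"
    elif status == 422:
        meaning = "Processing failed"
    elif status >= 500:
        meaning = "Server error"
    else:
        meaning = "Unexpected status"
    return f'{status} {payload["error_type"]}: {payload["message"]} ({meaning})'


body = '{"error_type": "ValidationError", "message": "Invalid file format", "status_code": 400}'
print(describe_error(body))
```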
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/error_handling_extract.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/api/error_handling_extract.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/error_handling_extract.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/error_handling_extract.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/error_handling_extract.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/error_handling_extract.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/error_handling_extract.md"
|
||||
|
||||
### Configuration
|
||||
|
||||
The server discovers `kreuzberg.toml` in the current and parent directories. Pass `--config path/to/file` to use a different file.
|
||||
|
||||
| Variable                       | Default | Description                     |
| ------------------------------ | ------- | ------------------------------- |
| `KREUZBERG_MAX_UPLOAD_SIZE_MB` | `100`   | Max upload size in MB           |
| `KREUZBERG_CORS_ORIGINS`       | `*`     | Comma-separated allowed origins |

!!! Warning

    Default CORS allows all origins. Set `KREUZBERG_CORS_ORIGINS` explicitly in production.
|
||||
|
||||
See [Configuration Guide](configuration.md) for all options.
|
||||
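For deployment checks, the two environment variables above can be read and interpreted as follows. This is an illustrative sketch rather than the server's own parsing: the helper name is hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
import os
from typing import Mapping


def read_server_env(environ: Mapping[str, str]) -> dict:
    """Interpret the upload-size and CORS variables with their documented defaults."""
    max_mb = int(environ.get("KREUZBERG_MAX_UPLOAD_SIZE_MB", "100"))
    raw_origins = environ.get("KREUZBERG_CORS_ORIGINS", "*")
    origins = [origin.strip() for origin in raw_origins.split(",") if origin.strip()]
    return {
        "max_upload_bytes": max_mb * 1024 * 1024,  # assumes binary megabytes
        "cors_origins": origins,
        "allows_any_origin": "*" in origins,
    }


settings = read_server_env(os.environ)
if settings["allows_any_origin"]:
    print("warning: CORS allows all origins")
```

A check like this in a startup script or CI job catches a production deployment that still relies on the permissive default.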
|
||||
---
|
||||
|
||||
## MCP Server
|
||||
|
||||
### Start
|
||||
|
||||
```bash title="Terminal"
kreuzberg mcp
kreuzberg mcp --config kreuzberg.toml
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/mcp/mcp_server_start.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/mcp/mcp_server_start.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/mcp/mcp_server_start.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/mcp/mcp_server_start.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/mcp/mcp_server_start.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/mcp_server_start.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/mcp/mcp_server_start.md"
|
||||
|
||||
### Tools
|
||||
|
||||
| Tool | Key parameters | Description |
| --- | --- | --- |
| `extract_file` | `path` | Extract from file path |
| `extract_bytes` | `data` (base64) | Extract from encoded bytes |
| `batch_extract_files` | `paths` | Extract multiple files |
| `detect_mime_type` | `path` | Detect file format |
| `list_formats` | — | List supported formats <span class="version-badge">v4.5.2</span> |
| `get_version` | — | Library version <span class="version-badge">v4.5.2</span> |
| `cache_stats` | — | Cache usage |
| `cache_clear` | — | Remove cached files |
| `cache_manifest` | — | Model checksums <span class="version-badge">v4.5.2</span> |
| `cache_warm` | — | Pre-download models <span class="version-badge">v4.5.2</span> |
| `embed_text` | `texts` | Generate embeddings <span class="version-badge">v4.5.2</span> |
| `chunk_text` | `text` | Split text <span class="version-badge">v4.5.2</span> |
| `extract_structured` | `path`, `schema`, `model`; optional `schema_name` (default `"extraction"`), `schema_description`, `prompt`, `api_key`, `strict` (default `false`) | Extract structured JSON via LLM <span class="version-badge">v4.8.0</span> |
|
||||
|
||||
All tools accept an optional `config` object. `extract_file` and `extract_bytes` also accept `pdf_password`. `extract_structured` requires the server to be built with the `liter-llm` feature; see the row above for optional fields and defaults.
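To make the table concrete, here is a minimal sketch of the JSON-RPC messages an MCP client would send to call two of these tools. The `tools/call` envelope follows the MCP specification; the helper name `build_tool_call`, the file name, and the sample bytes are illustrative assumptions, and a real call may need further arguments (for example `config` or `pdf_password`).

```python
import base64
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message.

    Hypothetical helper for illustration; real MCP client libraries
    build this envelope for you.
    """
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# extract_file takes a path on the server's filesystem.
file_call = build_tool_call(1, "extract_file", {"path": "report.pdf"})

# extract_bytes takes the document content as base64 text.
raw = b"Hello, Kreuzberg!"
bytes_call = build_tool_call(
    2, "extract_bytes", {"data": base64.b64encode(raw).decode("ascii")}
)

print(file_call)
print(bytes_call)
```

In practice you would rarely build these messages by hand; the client snippets in the next section show the supported route.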
|
||||
|
||||
### AI Agent Integration
|
||||
|
||||
=== "Claude Desktop"

    Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:

    ```json
    {
      "mcpServers": {
        "kreuzberg": {
          "command": "kreuzberg",
          "args": ["mcp"]
        }
      }
    }
    ```

=== "Python"

    --8<-- "snippets/python/mcp/mcp_custom_client.md"

=== "LangChain"

    --8<-- "snippets/python/mcp/mcp_langchain_integration.md"

=== "TypeScript"

    --8<-- "snippets/typescript/mcp/mcp_custom_client.md"

=== "Rust"

    --8<-- "snippets/rust/mcp/mcp_custom_client.md"

=== "Go"

    --8<-- "snippets/go/mcp/mcp_custom_client.md"

=== "Java"

    --8<-- "snippets/java/mcp/mcp_client.md"

=== "C#"

    --8<-- "snippets/csharp/mcp_custom_client.md"

=== "Ruby"

    --8<-- "snippets/ruby/mcp/mcp_custom_client.md"
|
||||
|
||||
---
|
||||
|
||||
For Docker and Kubernetes deployment, see [Docker Guide](docker.md) and [Kubernetes Guide](kubernetes.md).
|
||||
249
docs/guides/code-intelligence.md
Normal file
@@ -0,0 +1,249 @@
|
||||
# Code Intelligence
|
||||
|
||||
Kreuzberg integrates [tree-sitter-language-pack](https://docs.tree-sitter-language-pack.kreuzberg.dev) (TSLP) to parse and analyze source code files. When you extract a source code file, Kreuzberg automatically detects the programming language and produces structured analysis alongside the raw text content.
|
||||
|
||||
## What You Get
|
||||
|
||||
When extracting source code, the `metadata.format` field contains a `ProcessResult` (format type `"code"`) with:
|
||||
|
||||
- **Structure** -- functions, classes, structs, methods, modules, and their nesting hierarchy
- **Imports** -- import/include/require statements with source paths and imported items
- **Exports** -- exported symbols with their kinds (function, class, variable, type, default)
- **Comments** -- inline and block comments with their positions
- **Docstrings** -- documentation comments with parsed sections (params, returns, etc.)
- **Symbols** -- variable, constant, and type alias definitions
- **Diagnostics** -- parse errors and warnings from tree-sitter
- **Chunks** -- semantically meaningful code chunks for RAG and embedding pipelines
- **Metrics** -- file-level statistics (lines of code, comment lines, empty lines, node count)
|
||||
|
||||
Language support covers **300+ programming languages** via tree-sitter grammars. See the [TSLP documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) for the full language list.
|
||||
|
||||
## Getting Started
|
||||
|
||||
Code intelligence is enabled by default when the `tree-sitter` feature flag is active. Simply extract a source code file:
|
||||
|
||||
=== "Rust"

    ```rust title="basic.rs"
    use kreuzberg::{extract_file_sync, ExtractionConfig};

    let config = ExtractionConfig::default();
    let result = extract_file_sync("app.py", None, &config)?;

    // The content field has the raw source text
    println!("{}", result.content);

    // Code intelligence is in metadata.format
    if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
        println!("Language: {}", code.language);
        println!("Structures: {}", code.structure.len());
        println!("Imports: {}", code.imports.len());
    }
    ```

=== "Python"

    ```python title="basic.py"
    import kreuzberg

    config = kreuzberg.ExtractionConfig()
    result = kreuzberg.extract_file_sync("app.py", config=config)

    # The content field has the raw source text
    print(result.content)

    # Code intelligence is in metadata["format"]
    fmt = result.metadata.get("format")
    if fmt and fmt.get("format_type") == "code":
        print(f"Language: {fmt['language']}")
        print(f"Structures: {len(fmt['structure'])}")
        print(f"Imports: {len(fmt['imports'])}")
    ```

=== "TypeScript"

    ```typescript title="basic.ts"
    import { extractFileSync } from "@kreuzberg/node";

    const result = extractFileSync("app.ts");

    console.log(result.content);

    const fmt = result.metadata?.format;
    if (fmt?.formatType === "code") {
      console.log(`Language: ${fmt.language}`);
      console.log(`Structures: ${fmt.structure.length}`);
      console.log(`Imports: ${fmt.imports.length}`);
    }
    ```

=== "Go"

    ```go title="basic.go"
    result, err := kreuzberg.ExtractFileSync("app.py", nil)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(result.Content)
    // Code intelligence is available in result.Metadata.Format
    // when Format.Type == "code"
    ```
|
||||
|
||||
## Configuration
|
||||
|
||||
Use `TreeSitterConfig` to control which analysis features are enabled. Set `enabled: false` to disable code intelligence entirely. By default, `structure`, `imports`, and `exports` are enabled; `comments`, `docstrings`, `symbols`, and `diagnostics` are disabled.
|
||||
|
||||
=== "Rust"

    ```rust title="config.rs"
    use kreuzberg::{ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

    let config = ExtractionConfig {
        tree_sitter: Some(TreeSitterConfig {
            process: TreeSitterProcessConfig {
                structure: true,            // functions, classes, etc. (default: true)
                imports: true,              // import statements (default: true)
                exports: true,              // export statements (default: true)
                comments: true,             // comments (default: false)
                docstrings: true,           // docstrings (default: false)
                symbols: true,              // variables, constants (default: false)
                diagnostics: true,          // parse errors/warnings (default: false)
                chunk_max_size: Some(4096), // max chunk size in bytes
                ..Default::default()
            },
            ..Default::default()
        }),
        ..Default::default()
    };
    ```

=== "Python"

    ```python title="config.py"
    import kreuzberg

    config = kreuzberg.ExtractionConfig(
        tree_sitter={
            "process": {
                "structure": True,
                "imports": True,
                "exports": True,
                "comments": True,
                "docstrings": True,
                "symbols": True,
                "diagnostics": True,
                "chunk_max_size": 4096,
            }
        }
    )
    ```

=== "TypeScript"

    ```typescript title="config.ts"
    import { ExtractionConfig } from "@kreuzberg/node";

    const config: ExtractionConfig = {
      treeSitter: {
        process: {
          structure: true,
          imports: true,
          exports: true,
          comments: true,
          docstrings: true,
          symbols: true,
          diagnostics: true,
          chunkMaxSize: 4096,
        },
      },
    };
    ```

=== "TOML"

    ```toml title="kreuzberg.toml"
    [tree_sitter.process]
    structure = true
    imports = true
    exports = true
    comments = true
    docstrings = true
    symbols = true
    diagnostics = true
    chunk_max_size = 4096
    ```
|
||||
|
||||
### Configuration Fields
|
||||
|
||||
See [`TreeSitterConfig`](../reference/configuration.md#treesitterconfig) and [`TreeSitterProcessConfig`](../reference/configuration.md#treesitterprocessconfig) for all fields.
|
||||
|
||||
## ProcessResult Fields
|
||||
|
||||
Code intelligence results are returned as a `ProcessResult` from the upstream [`tree-sitter-language-pack`](https://docs.rs/tree-sitter-language-pack) crate. Top-level fields: `language`, `metrics`, `structure`, `imports`, `exports`, `chunks`, plus `comments` / `docstrings` / `symbols` / `diagnostics` (populated only when their `TreeSitterProcessConfig` flag is on). See the upstream crate docs for full field shapes.
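Because the optional lists may be absent when their flag is off, code that consumes the result should read them defensively. The sketch below assumes the Python binding exposes the result as a plain dict keyed by the field names listed above, as in the earlier Python examples; `summarize_code_metadata` and the sample data are hypothetical.

```python
from typing import Any, Dict

# Lists that the text above describes as always present vs. flag-dependent.
ALWAYS = ("structure", "imports", "exports", "chunks")
OPTIONAL = ("comments", "docstrings", "symbols", "diagnostics")


def summarize_code_metadata(fmt: Dict[str, Any]) -> Dict[str, Any]:
    """Count the entries in each top-level list of a code metadata dict.

    Missing or null optional fields are reported as 0 instead of raising.
    """
    summary: Dict[str, Any] = {"language": fmt.get("language")}
    for key in ALWAYS + OPTIONAL:
        summary[key] = len(fmt.get(key) or [])
    return summary


# Hand-written sample shaped like the fields named above (not real output).
sample = {
    "format_type": "code",
    "language": "python",
    "structure": [{"name": "main"}],
    "imports": [{"source": "os"}, {"source": "sys"}],
    "exports": [],
    "chunks": [{"content": "def main(): ..."}],
}
print(summarize_code_metadata(sample))
```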
|
||||
|
||||
## Semantic Chunking for RAG
|
||||
|
||||
Code chunks produced by tree-sitter are semantically aware -- they split at function, class, and module boundaries rather than fixed line counts. This makes them ideal for retrieval-augmented generation (RAG) pipelines:
|
||||
|
||||
```python title="rag_chunking.py"
import kreuzberg

# your_embedding_model and store_in_vector_db are placeholders
# for your own embedding and storage code.

config = kreuzberg.ExtractionConfig(
    tree_sitter={"process": {"chunk_max_size": 2048}}
)

result = kreuzberg.extract_file_sync("large_module.py", config=config)

fmt = result.metadata.get("format")
if fmt and fmt.get("format_type") == "code":
    for chunk in fmt.get("chunks", []):
        # Each chunk is a semantically coherent piece of code
        embedding = your_embedding_model(chunk["content"])
        store_in_vector_db(
            text=chunk["content"],
            embedding=embedding,
            metadata={
                "language": chunk["language"],
                "start_line": chunk["span"]["start_line"],
                "parent": chunk.get("context", {}).get("parent_name"),
            },
        )
```
|
||||
|
||||
## Language Detection
|
||||
|
||||
Kreuzberg detects the programming language in two ways:
|
||||
|
||||
1. **File extension** (fast path) -- when using `extract_file`, the extension is matched against 248 known language extensions
|
||||
2. **Shebang line** (fallback) -- when using `extract_bytes` or when the extension is ambiguous, the first line is checked for `#!/usr/bin/env python`, `#!/bin/bash`, and so on.
|
||||
|
||||
If neither method identifies the language, extraction returns an `UnsupportedFormat` error.
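The two-step lookup can be sketched as follows. This is an illustration of the idea only, not Kreuzberg's implementation: the lookup tables are tiny hypothetical stand-ins for the real extension and interpreter lists, and `detect_language` returns `None` where the library would raise `UnsupportedFormat`.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical, deliberately small lookup tables for illustration.
EXTENSIONS = {".py": "python", ".rs": "rust", ".ts": "typescript", ".sh": "bash"}
INTERPRETERS = {"python": "python", "python3": "python", "bash": "bash", "sh": "bash"}


def detect_language(filename: Optional[str], data: bytes) -> Optional[str]:
    """Try the file extension first, then fall back to the shebang line."""
    # Fast path: extension lookup.
    if filename:
        language = EXTENSIONS.get(PurePath(filename).suffix.lower())
        if language:
            return language

    # Fallback: inspect the first line for "#!<interpreter>".
    first_line = data.split(b"\n", 1)[0].decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    program = PurePath(parts[0]).name
    # "#!/usr/bin/env python" names the interpreter in the second token.
    if program == "env" and len(parts) > 1:
        program = parts[1]
    return INTERPRETERS.get(program)
```

Checking the extension first keeps the common case cheap, since the file contents are only inspected when the name gives no answer.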
|
||||
|
||||
## Language Support
|
||||
|
||||
Tree-sitter-language-pack supports 300+ programming languages. For the full list, see the [TSLP language reference](https://docs.tree-sitter-language-pack.kreuzberg.dev).
|
||||
|
||||
Common languages with full structural analysis:
|
||||
|
||||
| Language | Structure | Imports | Exports | Docstrings |
| ---------- | --------- | ------- | ------- | ---------- |
| Python | Yes | Yes | Yes | Yes |
| Rust | Yes | Yes | Yes | Yes |
| TypeScript | Yes | Yes | Yes | Yes |
| JavaScript | Yes | Yes | Yes | Yes |
| Go | Yes | Yes | Yes | Yes |
| Java | Yes | Yes | Yes | Yes |
| C/C++ | Yes | Yes | Yes | Yes |
| Ruby | Yes | Yes | Yes | Yes |
| PHP | Yes | Yes | Yes | Yes |
| C# | Yes | Yes | Yes | Yes |
| Swift | Yes | Yes | Yes | Yes |
| Kotlin | Yes | Yes | Yes | Yes |
| Elixir | Yes | Yes | Yes | Yes |
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Configuration Reference](../reference/configuration.md#treesitterconfig) -- TreeSitterConfig and TreeSitterProcessConfig fields
|
||||
- [Types Reference](../reference/types.md) -- ProcessResult, StructureItem, CodeChunk, and related type definitions
|
||||
- [tree-sitter-language-pack documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) -- Full language support reference
|
||||
214
docs/guides/configuration.md
Normal file
@@ -0,0 +1,214 @@
|
||||
# Configuration Guide <span class="version-badge">v4.0.0</span>
|
||||
|
||||
All extraction behavior is controlled through `ExtractionConfig`. Pass it directly in code or load it from a TOML/YAML/JSON file. Every field is optional. For per-field documentation, see the [Configuration Reference](../reference/configuration.md).
|
||||
|
||||
## Quick Start
|
||||
|
||||
=== "Python"

    --8<-- "snippets/python/config/config_basic.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/config_basic.md"

=== "Rust"

    --8<-- "snippets/rust/config/config_basic.md"

=== "Go"

    --8<-- "snippets/go/config/config_basic.md"

=== "Java"

    --8<-- "snippets/java/config/config_basic.md"

=== "C#"

    --8<-- "snippets/csharp/config_basic.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/config_basic.md"

=== "R"

    --8<-- "snippets/r/config/config_basic.md"
|
||||
|
||||
## Configuration Files
|
||||
|
||||
Three formats are supported. TOML is recommended.
|
||||
|
||||
=== "TOML (Recommended)"

    ```toml title="kreuzberg.toml"
    use_cache = true
    enable_quality_processing = true

    [ocr]
    backend = "tesseract"
    language = "eng"

    [ocr.tesseract_config]
    psm = 3
    ```

=== "YAML"

    ```yaml title="kreuzberg.yaml"
    use_cache: true
    enable_quality_processing: true

    ocr:
      backend: tesseract
      language: eng
      tesseract_config:
        psm: 3
    ```

=== "JSON"

    ```json title="kreuzberg.json"
    {
      "use_cache": true,
      "enable_quality_processing": true,
      "ocr": {
        "backend": "tesseract",
        "language": "eng",
        "tesseract_config": {
          "psm": 3
        }
      }
    }
    ```
|
||||
|
||||
### Automatic Discovery
|
||||
|
||||
When no `--config` path is supplied, Kreuzberg walks up from the current working directory looking for `kreuzberg.toml` and uses the first match. YAML and JSON files are supported only when passed explicitly via `--config`. If nothing is found, defaults are used.
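The walk-up search described above can be pictured with a few lines of standard-library Python. This is a sketch of the behavior, not the library's code; `discover_config` is a hypothetical name, and the language-specific discovery helpers in the tabs that follow are the supported way to do this.

```python
from pathlib import Path
from typing import Optional


def discover_config(start: Path, filename: str = "kreuzberg.toml") -> Optional[Path]:
    """Return the nearest `filename` in `start` or any of its parents."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None  # nothing found: the caller falls back to defaults


if __name__ == "__main__":
    print(discover_config(Path.cwd()))
```

The first match wins, so a `kreuzberg.toml` in a subproject takes precedence over one at the repository root.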
|
||||
|
||||
=== "Python"

    --8<-- "snippets/python/config/config_discover.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/config_discover.md"

=== "Rust"

    --8<-- "snippets/rust/config/config_discover.md"

=== "Go"

    --8<-- "snippets/go/config/config_discover.md"

=== "Java"

    --8<-- "snippets/java/config/config_discover.md"

=== "C#"

    --8<-- "snippets/csharp/config_discover.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/config_discover.md"

=== "R"

    --8<-- "snippets/r/config/config_discover.md"

=== "Wasm"

    --8<-- "snippets/wasm/config/config_discover.md"
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### Setting Up OCR
|
||||
|
||||
=== "Python"

    --8<-- "snippets/python/config/config_ocr.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/config_ocr.md"

=== "Rust"

    --8<-- "snippets/rust/ocr/config_ocr.md"

=== "Go"

    --8<-- "snippets/go/config/config_ocr.md"

=== "Java"

    --8<-- "snippets/java/config/config_ocr.md"

=== "C#"

    --8<-- "snippets/csharp/config_ocr.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/config_ocr.md"

=== "R"

    --8<-- "snippets/r/config/config_ocr.md"
|
||||
|
||||
For backend selection and language packs, see [OCR Guide](ocr.md). For fine-grained Tesseract tuning, see [TesseractConfig Reference](../reference/configuration.md#tesseractconfig).
|
||||
|
||||
### Chunking for RAG
|
||||
|
||||
=== "Python"

    --8<-- "snippets/python/utils/chunking.md"

=== "TypeScript"

    --8<-- "snippets/typescript/utils/chunking.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/chunking.md"

=== "Go"

    --8<-- "snippets/go/utils/chunking.md"

=== "Java"

    --8<-- "snippets/java/utils/chunking.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/embedding_with_chunking.md"

=== "Ruby"

    --8<-- "snippets/ruby/utils/chunking.md"

=== "R"

    --8<-- "snippets/r/utils/chunking.md"
|
||||
|
||||
## All Configuration Categories
|
||||
|
||||
- [ExtractionConfig](../reference/configuration.md#extractionconfig) — top-level options
|
||||
- [OcrConfig](../reference/configuration.md#ocrconfig) — OCR backend, language, acceleration
|
||||
- [TesseractConfig](../reference/configuration.md#tesseractconfig) — Tesseract PSM, confidence, table detection
|
||||
- [ChunkingConfig](../reference/configuration.md#chunkingconfig) — chunk size, overlap
|
||||
- [TokenReductionConfig](../reference/configuration.md#tokenreductionconfig) — LLM prompt token reduction
|
||||
- [ContentFilterConfig](../reference/configuration.md#contentfilterconfig) — header/footer/watermark filtering
|
||||
- [PageConfig](../reference/configuration.md#pageconfig) — page tracking and markers
|
||||
- [AccelerationConfig](../reference/configuration.md#accelerationconfig) — ONNX Runtime execution provider
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [Extraction Basics](extraction.md) — core extraction API and supported formats
|
||||
- [OCR Guide](ocr.md) — backend installation and language setup
|
||||
- [Advanced Features](advanced.md) — embeddings, language detection, page tracking
|
||||
- [Plugins Guide](plugins.md) — custom post-processors and validators
|
||||
354
docs/guides/development.md
Normal file
@@ -0,0 +1,354 @@
|
||||
# Development Workflow
|
||||
|
||||
Everything you need to build, test, and debug Kreuzberg locally. This guide assumes you've already followed the [Contributing Guide](../contributing.md) to fork and clone the repository.
|
||||
|
||||
---
|
||||
|
||||
## The Task Runner
|
||||
|
||||
Kreuzberg uses [Task](https://taskfile.dev/) for all build and test workflows. One command to bootstrap everything:
|
||||
|
||||
```bash title="Terminal"
task setup
```
|
||||
|
||||
That installs all toolchains and dependencies. Safe to re-run anytime — it's idempotent.
|
||||
|
||||
### The Pattern
|
||||
|
||||
Tasks follow `<language>:<action>`. Once you learn this pattern, the command for any task is predictable:
|
||||
|
||||
```bash title="Terminal"
task rust:build          # Build the Rust core
task rust:build:dev      # Debug build (faster compile, no optimizations)
task rust:build:release  # Release build (slow compile, fast binary)
task rust:test           # Run Rust tests
task rust:test:ci        # Same tests, with CI-level diagnostics

task python:build        # Build Python bindings via maturin
task python:test         # Run Python test suite
task node:build          # Build Node.js bindings via napi
task node:test           # Jest tests
```
|
||||
|
||||
The same pattern works for every language: `go:build`, `java:test`, `ruby:build`, `csharp:test`, and so on.
|
||||
|
||||
### Bulk Operations
|
||||
|
||||
```bash title="Terminal"
task build:all          # Build every binding
task test:all           # Test every binding (sequential)
task test:all:parallel  # Test every binding (parallel — faster, noisier output)
task check              # Lint + format check across the whole repo
```
|
||||
|
||||
---
|
||||
|
||||
## Testing Locally
|
||||
|
||||
### Rust
|
||||
|
||||
The core lives in `crates/kreuzberg/`. Most changes start here.
|
||||
|
||||
```bash title="Terminal"
task rust:test

cargo test -p kreuzberg test_pdf_extraction -- --nocapture

RUST_LOG=debug cargo test -p kreuzberg test_name -- --nocapture
```
|
||||
|
||||
### Python
|
||||
|
||||
Python bindings are in `packages/python/`. Build first, then test:
|
||||
|
||||
```bash title="Terminal"
task python:build:dev
task python:test

cd packages/python
uv run pytest tests/ -k "test_extract" -v
```
|
||||
|
||||
The `RUST_LOG` env var works here too — the Rust core logs through Python's stderr:
|
||||
|
||||
```bash title="Terminal"
RUST_LOG=debug uv run pytest tests/ -v
```
|
||||
|
||||
### Node.js
|
||||
|
||||
TypeScript bindings are in `packages/typescript/`:
|
||||
|
||||
```bash title="Terminal"
task node:build:dev
task node:test

cd packages/typescript
pnpm test -- --testPathPattern="extract"
```
|
||||
|
||||
### Everything Else
|
||||
|
||||
Same pattern. Build, then test:
|
||||
|
||||
```bash title="Terminal"
task go:build && task go:test
task java:build && task java:test
task csharp:build && task csharp:test
task ruby:build && task ruby:test
task php:build && task php:test
task elixir:build && task elixir:test
task r:build && task r:test
task c:build && task c:test
task wasm:build && task wasm:test
```
|
||||
|
||||
### Testing the live browser demo
|
||||
|
||||
The demo at `docs/demo.html` loads `@kreuzberg/wasm` from a CDN. To test local changes against it, use:
|
||||
|
||||
```bash title="Terminal"
task demo:dev
```
|
||||
|
||||
This builds the Wasm binary and TypeScript dist, patches the demo with local URLs, and starts two servers:
|
||||
|
||||
| Server | URL | Role |
| ------ | ----------------------- | ---------------------------------- |
| Docs | `http://localhost:8001` | Serves the patched `demo-dev.html` |
| Assets | `http://localhost:9000` | Serves the local Wasm package |
|
||||
|
||||
Open **`http://localhost:8001/demo-dev.html`** — no manual edits needed. The patched file (`docs/demo-dev.html`) is gitignored and regenerated on every run. The two different ports reproduce the cross-origin setup the CDN creates in production.
|
||||
|
||||
To skip the slow Rust build when you've only changed TypeScript:
|
||||
|
||||
```bash title="Terminal"
SKIP_WASM_BUILD=1 task demo:dev
```
|
||||
|
||||
---
|
||||
|
||||
## End-to-end Test Suites
|
||||
|
||||
End-to-end tests guarantee that every language binding produces identical results for the same document. They live in `e2e/` as shared fixtures — test inputs paired with expected outputs.
|
||||
|
||||
### Run end-to-end Tests
|
||||
|
||||
| Language | Directory | Run with |
| -------------------- | ----------------- | ---------------------- |
| Python | `e2e/python/` | `task python:e2e:test` |
| TypeScript / Node.js | `e2e/typescript/` | `task node:e2e:test` |
| Rust | `e2e/rust/` | `task rust:e2e:test` |
| Go | `e2e/go/` | `task go:e2e:test` |
| Java | `e2e/java/` | `task java:e2e:test` |
| .NET | `e2e/csharp/` | `task csharp:e2e:test` |
| Ruby | `e2e/ruby/` | `task ruby:e2e:test` |
| PHP | `e2e/php/` | `task php:e2e:test` |
| R | `e2e/r/` | `task r:e2e:test` |
|
||||
|
||||
### Regenerate end-to-end Tests
|
||||
|
||||
When you add a feature that changes extraction behavior, regenerate the affected end-to-end suites:
|
||||
|
||||
```bash title="Terminal"
task python:e2e:generate
task node:e2e:generate
task <lang>:e2e:generate
```
|
||||
|
||||
To regenerate and test all suites at once:
|
||||
|
||||
```bash title="Terminal"
task e2e:generate:all
task e2e:test:all
```
|
||||
|
||||
---
|
||||
|
||||
## Benchmarking
|
||||
|
||||
Measure extraction performance with the benchmark harness in `tools/benchmark-harness/`. Use it to track regressions, compare against alternatives, and identify bottlenecks with flamegraphs.
|
||||
|
||||
### Quick Start
|
||||
|
||||
```bash title="Terminal"
task benchmark:run FRAMEWORK=kreuzberg MODE=single-file
task benchmark:run FRAMEWORK=kreuzberg MODE=batch
```
|
||||
|
||||
### Common Modes
|
||||
|
||||
| Mode | What it measures |
| ------------- | --------------------------------------- |
| `single-file` | Latency — one file at a time |
| `batch` | Throughput — multiple files in parallel |
|
||||
|
||||
### With Profiling
|
||||
|
||||
Generate flamegraphs to see where time is spent:
|
||||
|
||||
```bash title="Terminal"
task benchmark:profile FRAMEWORK=kreuzberg MODE=single-file
```
|
||||
|
||||
Results appear in the `flamegraphs/` directory as interactive SVGs.
|
||||
|
||||
View live benchmark results at <https://kreuzberg.dev/benchmarks>.
|
||||
|
||||
---
|
||||
|
||||
## Linting and Pre-commit
|
||||
|
||||
```bash title="Terminal"
task check  # Full lint + format check (same as CI validate stage)
```
|
||||
|
||||
Language-specific:
|
||||
|
||||
```bash title="Terminal"
task rust:lint    # clippy + rustfmt
task python:lint  # ruff + mypy
task node:lint    # eslint + typecheck
```
|
||||
|
||||
The repository uses pre-commit hooks that enforce conventional commit messages, code formatting, and linter rules. If a commit is rejected, the hook output tells you exactly what to fix.
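If you want to sanity-check a commit subject before the hook runs, a rough approximation of the Conventional Commits shape (`type(scope)!: description`) looks like the sketch below. The list of accepted types is an assumption based on common convention; the repository's hook may accept a different set, so treat its output as the authority.

```python
import re

# Assumed set of commit types; the actual hook configuration may differ.
TYPES = (
    "build", "chore", "ci", "docs", "feat", "fix",
    "perf", "refactor", "revert", "style", "test",
)

# type, optional (scope), optional "!", then ": " and a non-empty description.
PATTERN = re.compile(
    r"^(?:" + "|".join(TYPES) + r")(?:\([\w./-]+\))?!?: \S.*$"
)


def is_conventional(subject: str) -> bool:
    """Return True if the commit subject line matches the assumed pattern."""
    return PATTERN.match(subject) is not None


if __name__ == "__main__":
    for subject in ("feat(python): add semantic chunker", "Update stuff"):
        print(subject, "->", is_conventional(subject))
```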
|
||||
|
||||
---
|
||||
|
||||
## Working with Documentation
|
||||
|
||||
### Building Locally
|
||||
|
||||
```bash title="Terminal"
uv sync --group doc
zensical build --clean
zensical serve
```
|
||||
|
||||
### How Snippets Work
|
||||
|
||||
Code examples in the docs aren't inline — they're pulled from `docs/snippets/` via the `--8<--` include directive. This keeps examples testable and reusable across pages.
|
||||
|
||||
```text
docs/snippets/
├── python/           # Python examples
│   ├── api/          # extract_file, batch_extract, etc.
│   ├── config/       # ExtractionConfig, OcrConfig, etc.
│   ├── ocr/          # OCR backends
│   ├── plugins/      # Plugin implementations
│   ├── mcp/          # MCP server and client
│   └── utils/        # Embeddings, chunking, errors
├── rust/             # Rust examples (same layout)
├── typescript/       # TypeScript examples
├── go/, java/, csharp/, ruby/, r/
├── docker/           # Docker commands
├── api_server/       # Server startup examples
└── cli/              # CLI usage
```
|
||||
|
||||
When you change a user-facing API, update the matching snippet. When you add a new feature, create a snippet and include it from the relevant doc page.
|
||||
|
||||
### Theme tokens (light mode)
|
||||
|
||||
Inline `code` and command-style monospace in light mode use the text token **`#26203A`**, defined in `docs/css/extra.css` as `--kb-text` (referenced as `var(--kb-text)`; brand backgrounds use the same value via `--kb-brand-ink`).
|
||||
|
||||
---
|
||||
|
||||
## Debugging
|
||||
|
||||
### Rust Panics
|
||||
|
||||
```bash title="Terminal"
RUST_BACKTRACE=1 cargo test -p kreuzberg test_name
RUST_BACKTRACE=full cargo test -p kreuzberg test_name
```
|
||||
|
||||
### Python FFI Problems
|
||||
|
||||
When something goes wrong in the Rust core during a Python call, the error introspection API gives you the details:
|
||||
|
||||
```python title="debug_ffi.py"
from kreuzberg import get_last_error_code, get_error_details, get_last_panic_context

details = get_error_details()
print(f"Error: {details['message']}")
print(f"Code: {details['error_code']}")

context = get_last_panic_context()
if context:
    print(f"Panic context: {context}")
```
|
||||
|
||||
### Verbose Logging
|
||||
|
||||
Crank up the log level to see what the Rust core is doing:
|
||||
|
||||
```bash title="Terminal"
RUST_LOG=debug task python:test
RUST_LOG=trace task rust:test
```
|
||||
|
||||
---
|
||||
|
||||
## CI/CD
|
||||
|
||||
CI runs on every push and PR to `main` via `.github/workflows/ci.yaml`. The pipeline has four stages:
|
||||
|
||||
1. **Validate** — conventional commits, formatting, clippy
|
||||
2. **Build** — FFI libraries, Python wheels, Node packages, all bindings
|
||||
3. **Test** — per-language test suites on Linux, macOS, and Windows
|
||||
4. **Integration** — Docker build, Docker smoke tests, CLI tests
|
||||
|
||||
### Smart Change Detection
|
||||
|
||||
CI doesn't rebuild everything on every PR. A `changes` job detects which paths were touched and only runs the relevant build/test jobs. Edit a Python file? Only Python builds and tests run. Touch the Rust core? Everything downstream rebuilds.
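The idea behind path-based change detection can be sketched in a few lines. The prefixes below reuse directories mentioned in this guide, but the rule table and job names are hypothetical; the actual filters live in the workflow file and may differ.

```python
from typing import Dict, Iterable, Set

# Hypothetical prefix-to-jobs rules for illustration only.
RULES: Dict[str, Set[str]] = {
    # Core changes trigger the core job and every downstream binding.
    "crates/": {"rust", "python", "node"},
    "packages/python/": {"python"},
    "packages/typescript/": {"node"},
}


def jobs_for(changed_paths: Iterable[str]) -> Set[str]:
    """Return the set of jobs to run for a list of changed file paths."""
    jobs: Set[str] = set()
    for path in changed_paths:
        for prefix, targets in RULES.items():
            if path.startswith(prefix):
                jobs |= targets
    return jobs


if __name__ == "__main__":
    print(sorted(jobs_for(["packages/python/tests/test_extract.py"])))
    print(sorted(jobs_for(["crates/kreuzberg/src/lib.rs"])))
```

A change that matches no rule, such as a top-level README edit, selects no build or test jobs at all.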
|
||||
|
||||
### Running CI Checks Locally
|
||||
|
||||
Before pushing, you can run the same checks CI runs:
|
||||
|
||||
```bash title="Terminal"
task check           # Matches the validate stage
task rust:test:ci    # Rust tests with CI diagnostics
task python:test:ci  # Python tests with CI diagnostics
task test:all:ci     # Everything
```
|
||||
|
||||
### Other Workflows
|
||||
|
||||
| Workflow | When it runs | What it does |
| --------------------- | ------------------------------------- | ---------------------------------- |
| `ci.yaml` | Every push/PR to `main` | The main pipeline |
| `docs.yaml` | Changes to `docs/` or `zensical.toml` | Builds and validates documentation |
| `benchmarks.yaml` | Manual trigger | Runs the full benchmark suite |
| `profiling.yaml` | Manual trigger | Generates flamegraphs |
| `publish.yaml` | Release events | Publishes packages to registries |
| `publish-docker.yaml` | Tags and releases | Builds and pushes Docker images |
|
||||
|
||||
---
|
||||
|
||||
## Performance
|
||||
|
||||
Kreuzberg's core is written in Rust, which enables zero-copy memory handling, SIMD acceleration, and true multi-core parallelism, all in ahead-of-time compiled code with no garbage collector.
|
||||
|
||||
### Why Rust Matters
|
||||
|
||||
- **Native compilation:** LLVM optimizes code ahead of time (inlining, vectorization, dead code elimination)
|
||||
- **Zero-copy strings:** Slicing uses borrowed references, not heap allocations
|
||||
- **SIMD acceleration:** Whitespace detection and character classification run 15-37x faster than scalar operations
|
||||
- **No GIL:** True multi-core parallelism across all CPU cores
|
||||
- **Deterministic memory:** Drop semantics free memory instantly, no GC pauses
|
||||
|
||||
### Key Optimizations
|
||||
|
||||
- **Batch processing:** 6-10x faster than sequential extraction through work-stealing scheduler
|
||||
- **Caching:** 85%+ hit rates for repeated files (SQLite-backed, automatic invalidation)
|
||||
- **Streaming:** Large files processed in 4KB chunks, constant memory regardless of file size
|
||||
- **Lazy initialization:** Expensive subsystems (Tokio, plugins) initialized on first use only
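To illustrate why fixed-size streaming keeps memory flat, the sketch below hashes a stream 4 KB at a time, so only one chunk is held in memory regardless of input size. It shows the general technique in Python and is not Kreuzberg's Rust implementation; `iter_chunks` and `sha256_streaming` are hypothetical names.

```python
import hashlib
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 4 * 1024  # 4 KB, matching the figure quoted in the list above


def iter_chunks(stream: BinaryIO, size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks of at most `size` bytes until the stream ends."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        yield chunk


def sha256_streaming(stream: BinaryIO) -> str:
    """Hash a stream without loading it fully into memory."""
    digest = hashlib.sha256()
    for chunk in iter_chunks(stream):
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    data = io.BytesIO(b"x" * 10_000)
    print(sha256_streaming(data))
```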
|
||||
|
||||
### Benchmarking Your Workload
|
||||
|
||||
Measure with your actual files using the benchmark harness (see [Benchmarking](#benchmarking) section for full instructions). For detailed analysis and live benchmark results, visit <https://kreuzberg.dev/benchmarks>.
|
||||
|
||||
---
|
||||
270
docs/guides/docker.md
Normal file
@@ -0,0 +1,270 @@
|
||||
# Docker Deployment <span class="version-badge">v4.0.0</span>
|
||||
|
||||
Official Docker images built on the Rust core with Debian 13 (Trixie). Each image supports three execution modes: API server (default), command-line tool, and MCP server.
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Pull and Run
|
||||
|
||||
=== "API Server"
|
||||
|
||||
--8<-- "snippets/docker/api_server_basic.md"
|
||||
|
||||
=== "CLI Mode"
|
||||
|
||||
--8<-- "snippets/docker/cli_mode_basic.md"
|
||||
|
||||
=== "MCP Server"
|
||||
|
||||
--8<-- "snippets/docker/mcp_basic.md"
|
||||
|
||||
### Pull Image
|
||||
|
||||
=== "Core"
|
||||
|
||||
--8<-- "snippets/docker/core_pull.md"
|
||||
|
||||
=== "Full"
|
||||
|
||||
--8<-- "snippets/docker/full_pull.md"
|
||||
|
||||
## Image Variants
|
||||
|
||||
| | **Core** | **Full** |
|
||||
| ----------------- | -------------------------------------- | ---------------------------------------- |
|
||||
| **Image** | `ghcr.io/kreuzberg-dev/kreuzberg:core` | `ghcr.io/kreuzberg-dev/kreuzberg:latest` |
|
||||
| **Size** | ~1.0–1.3 GB | ~1.5–2.1 GB |
|
||||
| **Tesseract OCR** | 12 languages | 12 languages |
|
||||
| **Modern Office** | DOCX, PPTX, XLSX | DOCX, PPTX, XLSX |
|
||||
| **Legacy Office** | DOC, PPT, XLS (native OLE/CFB) | DOC, PPT, XLS (native OLE/CFB) |
|
||||
| **Startup** | ~1s | ~1s |
|
||||
|
||||
**Core** is optimized for production deployments where image size matters. Both images support all major formats — choose based on deployment constraints.
|
||||
|
||||
All images include: Tesseract OCR (eng, spa, fra, deu, ita, por, chi_sim, chi_tra, jpn, ara, rus, hin), PDF (pdf_oxide), images, HTML, email, and archives.
|
||||
|
||||
## Execution Modes
|
||||
|
||||
### API Server (Default)
|
||||
|
||||
```bash title="Terminal"
|
||||
docker run -p 8000:8000 ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||||
|
||||
# Custom port and CORS
|
||||
docker run -p 9000:9000 \
|
||||
-e KREUZBERG_CORS_ORIGINS="https://myapp.com" \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
serve --host 0.0.0.0 --port 9000
|
||||
|
||||
# With config file
|
||||
docker run -p 8000:8000 \
|
||||
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
serve --config /config/kreuzberg.toml
|
||||
```
|
||||
|
||||
See [API Server Guide](api-server.md) for endpoint documentation.
|
||||
|
||||
### CLI Mode
|
||||
|
||||
```bash title="Terminal"
|
||||
# Extract a file
|
||||
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
extract /data/document.pdf
|
||||
|
||||
# Extract with OCR
|
||||
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
extract /data/scanned.pdf --ocr true
|
||||
|
||||
# Batch processing
|
||||
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
batch /data/*.pdf --format json
|
||||
|
||||
# MIME detection
|
||||
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
detect /data/unknown-file.bin
|
||||
```
|
||||
|
||||
### MCP Server
|
||||
|
||||
```bash title="Terminal"
|
||||
docker run ghcr.io/kreuzberg-dev/kreuzberg:latest mcp
|
||||
|
||||
# With config
|
||||
docker run \
|
||||
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
mcp --config /config/kreuzberg.toml
|
||||
```
|
||||
|
||||
See [API Server Guide - MCP Section](api-server.md#mcp-server) for integration details.
|
||||
|
||||
## Environment Variables
|
||||
|
||||
| Variable | Default | Description |
|
||||
| ------------------------------ | ----------------------------- | ------------------------------------------------------------------------------------------------ |
|
||||
| `KREUZBERG_MAX_UPLOAD_SIZE_MB` | `100` | Max upload size in MB |
|
||||
| `KREUZBERG_CORS_ORIGINS` | `*` | Comma-separated allowed origins |
|
||||
| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
|
||||
| `KREUZBERG_CACHE_DIR` | `/app/.kreuzberg` | Cache directory (set explicitly in Docker; outside containers defaults to platform global cache) |
|
||||
| `HF_HOME` | `/app/.kreuzberg/huggingface` | HuggingFace model cache |
|
||||
|
||||
Host and port are set via CLI args: `serve --host 0.0.0.0 --port 8000`.
|
||||
|
||||
## Volume Mounts
|
||||
|
||||
```bash title="Terminal"
|
||||
# Cache persistence (embedding models, OCR cache)
|
||||
docker run -p 8000:8000 \
|
||||
-v kreuzberg-cache:/app/.kreuzberg \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||||
|
||||
# Config file
|
||||
docker run -p 8000:8000 \
|
||||
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
serve --config /config/kreuzberg.toml
|
||||
|
||||
# Documents (read-only)
|
||||
docker run -v $(pwd)/documents:/data:ro \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
extract /data/document.pdf
|
||||
```
|
||||
|
||||
!!! Note "Model Downloads"

    Embedding models download on first use (~90 MB – 1.2 GB depending on preset). Use a persistent volume for `/app/.kreuzberg` in production to avoid re-downloading on container restart. Outside Docker, models are cached in the platform-specific global cache directory (for example, `~/.cache/kreuzberg/` on Linux, `~/Library/Caches/kreuzberg/` on macOS).
|
||||
|
||||
## Docker Compose
|
||||
|
||||
```yaml title="docker-compose.yaml"
|
||||
services:
  kreuzberg-api:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    ports:
      - "8000:8000"
    environment:
      - KREUZBERG_CORS_ORIGINS=https://myapp.com
      - KREUZBERG_MAX_UPLOAD_SIZE_MB=500
      - RUST_LOG=info
    volumes:
      - ./config:/config
      - cache-data:/app/.kreuzberg
    command: serve --host 0.0.0.0 --port 8000 --config /config/kreuzberg.toml
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "kreuzberg", "--version"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s

volumes:
  cache-data:
|
||||
```
|
||||
|
||||
## Security
|
||||
|
||||
Images run as non-root user `kreuzberg` (UID 1000). For hardened deployments:
|
||||
|
||||
```bash title="Terminal"
|
||||
docker run --security-opt no-new-privileges \
|
||||
--read-only \
|
||||
--tmpfs /tmp \
|
||||
-p 8000:8000 \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||||
```
|
||||
|
||||
Ensure mounted volumes have correct permissions:
|
||||
|
||||
```bash title="Terminal"
|
||||
chown -R 1000:1000 /path/to/mounted/directory
|
||||
```
|
||||
|
||||
## Resource Allocation
|
||||
|
||||
| Workload | Memory | CPU | Notes |
|
||||
| -------- | ------ | --------- | --------------------------------------- |
|
||||
| Light | 512 MB | 0.5 cores | Small documents, low concurrency |
|
||||
| Medium | 1 GB | 1 core | Typical documents, moderate concurrency |
|
||||
| Heavy | 2 GB+ | 2+ cores | Large documents, OCR, high concurrency |
|
||||
|
||||
```bash title="Terminal"
|
||||
docker run -p 8000:8000 --memory=1g --cpus=1 \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||||
```
|
||||
|
||||
## Building Custom Images
|
||||
|
||||
=== "Core Image"
|
||||
|
||||
--8<-- "snippets/docker/build_core.md"
|
||||
|
||||
=== "Full Image"
|
||||
|
||||
--8<-- "snippets/docker/build_full.md"
|
||||
|
||||
```dockerfile title="Custom Dockerfile"
|
||||
FROM ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||||
|
||||
USER root
|
||||
RUN apt-get update && \
|
||||
apt-get install -y --no-install-recommends your-package-here && \
|
||||
apt-get clean && rm -rf /var/lib/apt/lists/*
|
||||
|
||||
USER kreuzberg
|
||||
COPY kreuzberg.toml /app/kreuzberg.toml
|
||||
CMD ["serve", "--config", "/app/kreuzberg.toml"]
|
||||
```
|
||||
|
||||
## Other Image Variants
|
||||
|
||||
The published Core and Full images cover most use cases. For specialized needs, the `docker/` directory has additional Dockerfiles:
|
||||
|
||||
| Dockerfile | What it builds |
|
||||
| ------------------------- | ------------------------------------------------------------------------------------- |
|
||||
| `Dockerfile.cli` | Minimal image with just the `kreuzberg` binary — good for CI pipelines and batch jobs |
|
||||
| `Dockerfile.musl-build` | Fully static Linux binaries via MUSL — runs on any distro, no dynamic libs |
|
||||
| `Dockerfile.musl-ffi` | Static C FFI library for language bindings (Go, Ruby, R, PHP, Elixir) |
|
||||
| `Dockerfile.musl-rustler` | MUSL-based Rustler NIF for Elixir |
|
||||
|
||||
### CLI Image
|
||||
|
||||
A stripped-down image with only the CLI binary. No server, no API — just extraction:
|
||||
|
||||
```bash title="Terminal"
|
||||
docker build -f docker/Dockerfile.cli -t kreuzberg-cli .
|
||||
|
||||
docker run -v $(pwd):/data kreuzberg-cli extract /data/document.pdf
|
||||
docker run -v $(pwd):/data kreuzberg-cli batch /data/*.pdf --format json
|
||||
docker run -v $(pwd):/data kreuzberg-cli detect /data/unknown-file.bin
|
||||
```
|
||||
|
||||
### MUSL Static Builds
|
||||
|
||||
These produce binaries with zero dynamic library dependencies. Each binary is a single file that runs on any Linux system, including Alpine, scratch containers, and bare EC2 instances.
|
||||
|
||||
```bash title="Terminal"
|
||||
docker build -f docker/Dockerfile.musl-build -t kreuzberg-musl-build .
|
||||
docker build -f docker/Dockerfile.musl-ffi -t kreuzberg-musl-ffi .
|
||||
```
|
||||
|
||||
The FFI variant builds a static C library used by the Go, Ruby, R, PHP, and Elixir bindings for portable cross-platform distribution.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
??? Question "Container won't start"
|
||||
|
||||
Check logs with `docker logs <container-id>`. Common causes: port conflict (change `-p` mapping), insufficient memory (increase `--memory`), volume permission errors.
|
||||
|
||||
??? Question "Permission errors on mounted volumes"
|
||||
|
||||
Images run as UID 1000. Fix with: `chown -R 1000:1000 /path/to/mounted/directory`
|
||||
|
||||
??? Question "Large file processing fails"
|
||||
|
||||
Increase memory limit (`--memory=4g`) and upload size (`-e KREUZBERG_MAX_UPLOAD_SIZE_MB=1000`).
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [Kubernetes Deployment](kubernetes.md) — production K8s with OCR config and health checks
|
||||
- [API Server Guide](api-server.md) — endpoint documentation
|
||||
- [Configuration](configuration.md) — all configuration options
|
||||
554
docs/guides/extraction.md
Normal file
@@ -0,0 +1,554 @@
|
||||
# Extraction Basics
|
||||
|
||||
Eight core extraction functions are available, organized by input type (file path vs bytes), cardinality (single vs batch), and execution model (sync vs async).
|
||||
|
||||
| Input | Single sync | Single async | Batch sync | Batch async |
|
||||
| ------------- | -------------------- | --------------- | -------------------------- | --------------------- |
|
||||
| **File path** | `extract_file_sync` | `extract_file` | `batch_extract_files_sync` | `batch_extract_files` |
|
||||
| **Bytes** | `extract_bytes_sync` | `extract_bytes` | `batch_extract_bytes_sync` | `batch_extract_bytes` |
|
||||
|
||||
!!! Tip "Sync vs Async"

    Use async variants when you're already in an async context or processing multiple files concurrently. For scripts and simple pipelines, sync variants are simpler and just as fast for single files.
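As a rough illustration of the concurrent case, the sketch below fans out an async extractor over several paths with a concurrency cap. The `extract_all` helper, its `limit` parameter, and the `fake_extract` stand-in are hypothetical and not part of Kreuzberg; with the library installed, the async `extract_file` from the table above would presumably be passed as `extract`.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def extract_all(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[object]],
    limit: int = 4,
) -> List[object]:
    """Run `extract` over every path with at most `limit` calls in flight."""
    gate = asyncio.Semaphore(limit)

    async def one(path: str) -> object:
        async with gate:
            return await extract(path)

    # asyncio.gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(one(p) for p in paths)))


if __name__ == "__main__":
    async def fake_extract(path: str) -> str:
        # Stand-in for a real async extractor.
        await asyncio.sleep(0)
        return f"text of {path}"

    print(asyncio.run(extract_all(["a.pdf", "b.pdf"], fake_extract)))
```

For a single file in a plain script, none of this machinery is needed, which is why the sync variants are the simpler choice there.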
|
||||
|
||||
## Extract from Files
|
||||
|
||||
Pass a file path. Kreuzberg detects the MIME type from the extension and selects the right parser automatically.
|
||||
|
||||
### Synchronous
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/api/extract_file_sync.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/getting-started/extract_file_sync.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/extract_file_sync.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/extract_file_sync.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/extract_file_sync.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/extract_file_sync.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/extract_file_sync.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/api/extract_file_sync.md"
|
||||
|
||||
=== "C"
|
||||
|
||||
--8<-- "snippets/c/api/extract_file_sync.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
--8<-- "snippets/wasm/api/extract_file_sync.md"
|
||||
|
||||
### Asynchronous
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/api/extract_file_async.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/getting-started/extract_file_async.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/extract_file_async.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/extract_file_async.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/extract_file_async.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/extract_file_async.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/extract_file_async.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/api/extract_file_async.md"
|
||||
|
||||
=== "C"
|
||||
|
||||
--8<-- "snippets/c/api/extract_file_async.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
--8<-- "snippets/wasm/api/extract_file_async.md"
|
||||
|
||||
## Extract from Bytes
|
||||
|
||||
When the file is already loaded in memory (for example, from an upload or network response), pass the byte array with its MIME type. Unlike file extraction, the MIME type is required since there's no file extension to infer it from.
|
||||
|
||||
### Synchronous
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/api/extract_bytes_sync.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/getting-started/extract_bytes_sync.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/extract_bytes_sync.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/extract_bytes_sync.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/extract_bytes_sync.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/extract_bytes_sync.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/extract_bytes_sync.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/api/extract_bytes_sync.md"
|
||||
|
||||
=== "C"
|
||||
|
||||
--8<-- "snippets/c/api/extract_bytes_sync.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
--8<-- "snippets/wasm/api/extract_bytes_sync.md"
|
||||
|
||||
### Asynchronous
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/api/extract_bytes_async.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/getting-started/extract_bytes_async.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/extract_bytes_async.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/extract_bytes_async.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/extract_bytes_async.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/extract_bytes_async.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/extract_bytes_async.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/api/extract_bytes_async.md"
|
||||
|
||||
=== "C"
|
||||
|
||||
--8<-- "snippets/c/api/extract_bytes_async.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
--8<-- "snippets/wasm/api/extract_bytes_async.md"
|
||||
|
||||
## Batch Processing
|
||||
|
||||
Batch functions accept an array of file paths (or byte arrays) and process them concurrently. This is typically 2-5x faster than looping over single-file functions because Kreuzberg parallelizes internally.
|
||||
|
||||
### Batch Extract Files
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/api/batch_extract_files_sync.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/batch_extract_files_sync.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/batch_extract_files_sync.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/batch_extract_files_sync.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/batch_extract_files_sync.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/batch_extract_files_sync.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/api/batch_extract_files_sync.md"
|
||||
|
||||
=== "C"
|
||||
|
||||
--8<-- "snippets/c/api/batch_extract_files_sync.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
--8<-- "snippets/wasm/api/batch_extract_files_sync.md"
|
||||
|
||||
### Batch Extract Bytes
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/api/batch_extract_bytes_sync.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/getting-started/batch_extract_bytes_sync.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/batch_extract_bytes_sync.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/batch_extract_bytes_sync.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/batch_extract_bytes_sync.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/batch_extract_bytes_sync.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/batch_extract_bytes_sync.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/api/batch_extract_bytes_sync.md"
|
||||
|
||||
=== "C"
|
||||
|
||||
--8<-- "snippets/c/api/batch_extract_bytes_sync.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
--8<-- "snippets/wasm/api/batch_extract_bytes_sync.md"
|
||||
|
||||
### Per-File Configuration <span class="version-badge">v4.5.0</span>
|
||||
|
||||
When a batch contains a mix of document types that need different settings (for example, scanned images needing OCR alongside text-based PDFs), use `FileExtractionConfig` to override options per file while sharing a common batch config.
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python title="mixed_batch.py"
|
||||
from kreuzberg import (
|
||||
batch_extract_files_sync,
|
||||
ExtractionConfig,
|
||||
FileExtractionConfig,
|
||||
OcrConfig,
|
||||
)
|
||||
|
||||
config = ExtractionConfig(output_format="markdown")
|
||||
|
||||
paths = ["report.pdf", "scan.tiff", "notes.html"]
|
||||
file_configs = [
|
||||
None,
|
||||
FileExtractionConfig(
|
||||
force_ocr=True,
|
||||
ocr=OcrConfig(backend="tesseract", language="deu"),
|
||||
),
|
||||
FileExtractionConfig(output_format="plain"),
|
||||
]
|
||||
|
||||
results = batch_extract_files_sync(paths, config, file_configs=file_configs)
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript title="mixed_batch.ts"
|
||||
import { batchExtractFilesSync } from '@kreuzberg/node';
|
||||
|
||||
const results = batchExtractFilesSync(
|
||||
['report.pdf', 'scan.tiff', 'notes.html'],
|
||||
{ outputFormat: 'markdown' },
|
||||
[
|
||||
null,
|
||||
{ forceOcr: true, ocr: { backend: 'tesseract', language: 'deu' } },
|
||||
{ outputFormat: 'plain' },
|
||||
],
|
||||
);
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="mixed_batch.rs"
|
||||
use kreuzberg::{
|
||||
batch_extract_files, ExtractionConfig, FileExtractionConfig,
|
||||
OcrConfig, OutputFormat,
|
||||
};
|
||||
use std::path::PathBuf;
|
||||
|
||||
let config = ExtractionConfig {
|
||||
output_format: OutputFormat::Markdown,
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let paths = vec![
|
||||
PathBuf::from("report.pdf"),
|
||||
PathBuf::from("scan.tiff"),
|
||||
PathBuf::from("notes.html"),
|
||||
];
|
||||
|
||||
let file_configs = vec![
|
||||
None,
|
||||
Some(FileExtractionConfig {
|
||||
force_ocr: Some(true),
|
||||
ocr: Some(OcrConfig {
|
||||
backend: "tesseract".to_string(),
|
||||
language: "deu".to_string(),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
}),
|
||||
Some(FileExtractionConfig {
|
||||
output_format: Some(OutputFormat::Plain),
|
||||
..Default::default()
|
||||
}),
|
||||
];
|
||||
|
||||
let results = batch_extract_files(paths, &config, Some(&file_configs)).await?;
|
||||
```
|
||||
|
||||
Fields set to `None` in `FileExtractionConfig` inherit the batch default. Batch-level concerns like `max_concurrent_extractions`, `use_cache`, and `security_limits` cannot be overridden per file. See the [Configuration Reference](../reference/configuration.md#fileextractionconfig) for the full list of overridable fields.
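The inheritance rule can be pictured with a small, self-contained sketch. The `BatchDefaults`, `FileOverrides`, and `resolve` names are hypothetical stand-ins used only to show the merge semantics; they are not Kreuzberg's classes, and the real merge happens inside the library.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class BatchDefaults:
    """Hypothetical stand-in for the shared batch config."""
    output_format: str = "markdown"
    force_ocr: bool = False


@dataclass(frozen=True)
class FileOverrides:
    """Hypothetical stand-in for a per-file config; None means 'inherit'."""
    output_format: Optional[str] = None
    force_ocr: Optional[bool] = None


def resolve(defaults: BatchDefaults, overrides: Optional[FileOverrides]) -> BatchDefaults:
    """Return the effective settings for one file."""
    if overrides is None:
        return defaults  # no per-file config: use the batch default as-is
    changed = {
        f.name: getattr(overrides, f.name)
        for f in fields(overrides)
        if getattr(overrides, f.name) is not None
    }
    return replace(defaults, **changed)


if __name__ == "__main__":
    base = BatchDefaults()
    print(resolve(base, None))
    print(resolve(base, FileOverrides(force_ocr=True)))
    print(resolve(base, FileOverrides(output_format="plain")))
```

The three calls correspond to the three entries in the `file_configs` list above: no override, an OCR override, and an output-format override.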
|
||||
|
||||
## Content Filtering <span class="version-badge">v4.8.0</span>
|
||||
|
||||
Kreuzberg strips running headers, footers, watermarks, and cross-page repeating text by default so that downstream RAG and LLM pipelines see clean body content. `ContentFilterConfig` lets you opt back in to any of these when you need them, for example when extracting legal forms where the header carries the case number, or when running text analysis on a PDF whose brand name was being incorrectly removed by the repeating-text heuristic.
|
||||
|
||||
See [ContentFilterConfig](../reference/configuration.md#contentfilterconfig) for field-level defaults and per-format behavior.
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python title="keep_headers_footers.py"
|
||||
from kreuzberg import (
|
||||
extract_file_sync,
|
||||
ContentFilterConfig,
|
||||
ExtractionConfig,
|
||||
)
|
||||
|
||||
# Legal/forms work: keep header and footer text
|
||||
config = ExtractionConfig(
|
||||
content_filter=ContentFilterConfig(
|
||||
include_headers=True,
|
||||
include_footers=True,
|
||||
),
|
||||
)
|
||||
|
||||
result = extract_file_sync("contract.pdf", config=config)
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript title="disable_repeating_text.ts"
|
||||
import { extractFile } from "@kreuzberg/node";

// Disable cross-page deduplication so brand names aren't stripped
const result = await extractFile("brochure.pdf", {
|
||||
contentFilter: {
|
||||
stripRepeatingText: false,
|
||||
},
|
||||
});
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="content_filter.rs"
|
||||
use kreuzberg::{extract_file_sync, ContentFilterConfig, ExtractionConfig};
|
||||
|
||||
let config = ExtractionConfig {
|
||||
content_filter: Some(ContentFilterConfig {
|
||||
include_headers: true,
|
||||
include_footers: true,
|
||||
strip_repeating_text: true,
|
||||
include_watermarks: false,
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let result = extract_file_sync("contract.pdf", None, &config)?;
|
||||
```
|
||||
|
||||
When a layout-detection model is active, it can independently classify regions as page headers or footers and strip them per page. Setting `include_headers=True` / `include_footers=True` also disables that per-page stripping. See the [reference page](../reference/configuration.md#contentfilterconfig) for the full field semantics and per-format behavior.
|
||||
|
||||
## Supported Formats
|
||||
|
||||
Kreuzberg supports 90+ file formats across 8 categories:
|
||||
|
||||
| Category | Extensions | Notes |
|
||||
| ----------------- | -------------------------------------------------------- | ----------------------------------- |
|
||||
| **PDF** | `.pdf` | Native text + OCR for scanned pages |
|
||||
| **Images** | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp` | Requires OCR backend |
|
||||
| **Office** | `.docx`, `.pptx`, `.xlsx` | Modern formats via native parsers |
|
||||
| **Legacy Office** | `.doc`, `.ppt` | Native OLE/CFB parsing |
|
||||
| **Email** | `.eml`, `.msg` | Full support including attachments |
|
||||
| **Web** | `.html`, `.htm` | Converted to Markdown with metadata |
|
||||
| **Text** | `.md`, `.txt`, `.xml`, `.json`, `.yaml`, `.toml`, `.csv` | Direct extraction |
|
||||
| **Archives** | `.zip`, `.tar`, `.tar.gz`, `.tar.bz2` | Recursive extraction |
|
||||
|
||||
## Page Tracking
|
||||
|
||||
Kreuzberg can track page boundaries and extract per-page content. Page tracking availability depends on the format:
|
||||
|
||||
- **PDF** — Full byte-accurate page tracking with O(1) lookup
|
||||
- **PPTX** — Slide boundary tracking (each slide = one page)
|
||||
- **DOCX** — Best-effort detection using explicit `<w:br w:type="page"/>` tags
|
||||
- **Other formats** — No page tracking
|
||||
|
||||
Enable page extraction with `PageConfig`:
|
||||
|
||||
```python title="page_tracking.py"
|
||||
config = ExtractionConfig(
|
||||
pages=PageConfig(
|
||||
insert_page_markers=True,
|
||||
marker_format="\n\n<!-- PAGE {page_num} -->\n\n"
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
Page markers like `<!-- PAGE 1 -->` are inserted at boundaries in the `content` field — useful for LLMs that need to understand document layout. When both page tracking and chunking are enabled, chunks automatically include `first_page` and `last_page` metadata.
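Downstream code can recover per-page text from the markers with a short regular-expression split. This is a sketch under two assumptions: the marker format is the `<!-- PAGE {page_num} -->` template shown above, and each marker precedes the text of the page it names. The `split_pages` helper is hypothetical.

```python
import re
from typing import Dict

# Matches markers produced by the "<!-- PAGE {page_num} -->" template.
PAGE_MARKER = re.compile(r"<!-- PAGE (\d+) -->")


def split_pages(content: str) -> Dict[int, str]:
    """Map page number to the text that follows its marker."""
    # With one capture group, split() returns
    # [text before first marker, number, text, number, text, ...].
    parts = PAGE_MARKER.split(content)
    pages: Dict[int, str] = {}
    for number, text in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = text.strip()
    return pages


if __name__ == "__main__":
    sample = "\n\n<!-- PAGE 1 -->\n\nIntro\n\n<!-- PAGE 2 -->\n\nDetails"
    print(split_pages(sample))
```

If a custom `marker_format` is configured, the pattern would need to be adjusted to match it.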
|
||||
|
||||
See [PageConfig Reference](../reference/configuration.md#pageconfig) for all options and [Advanced Page Tracking](./advanced.md) for chunk-to-page mapping examples.
|
||||
|
||||
## Code File Extraction
|
||||
|
||||
Source code files (`.py`, `.rs`, `.ts`, `.go`, etc.) go through tree-sitter and produce a `ProcessResult` on `ExtractionResult.code_intelligence` (structure, imports/exports, symbols, docstrings, diagnostics, semantic chunks). Code files bypass text chunking — TSLP's function/class-aware `CodeChunks` map directly to Kreuzberg `Chunk`s with semantic `chunk_type` and heading context.
|
||||
|
||||
See [Code Intelligence](code-intelligence.md) for usage and [`TreeSitterProcessConfig`](../reference/configuration.md#treesitterprocessconfig) for fields.
|
||||
|
||||
## PDF Page Rendering
|
||||
|
||||
Render individual PDF pages as PNG images. Unlike the extraction pipeline (which parses text, tables, metadata), this API produces raw pixel data for thumbnails, vision model input, or custom OCR pipelines.
|
||||
|
||||
### Two Approaches
|
||||
|
||||
| API | When to use |
|
||||
| ----------------- | ---------------------------------------------------------------------- |
|
||||
| `render_pdf_page` | You know which page you need, or only need a few pages |
|
||||
| `PdfPageIterator` | Process every page sequentially without loading all images into memory |
|
||||
|
||||
### DPI Configuration
|
||||
|
||||
| DPI | Pixel size (US Letter) | Use case |
|
||||
| ------------- | ---------------------- | ------------------------------- |
|
||||
| 72 | 612 x 792 | Thumbnails, quick previews |
|
||||
| 150 (default) | 1275 x 1650 | General-purpose, screen display |
|
||||
| 300 | 2550 x 3300 | OCR input, print quality |
|
||||
|
||||
**Tip:** Use 300 DPI when rendering pages for OCR or vision models. The default 150 DPI may reduce recognition accuracy on small text.
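The pixel sizes in the table follow from multiplying the page size in inches by the DPI. A quick check, assuming US Letter is 8.5 x 11 inches (the `page_pixels` helper is illustrative, not a Kreuzberg API):

```python
from typing import Tuple


def page_pixels(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Convert a page size in inches to pixel dimensions at a given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


if __name__ == "__main__":
    for dpi in (72, 150, 300):
        # US Letter: 8.5 x 11 inches
        print(dpi, page_pixels(8.5, 11, dpi))
```

Doubling the DPI quadruples the pixel count, so rendering at 300 DPI costs roughly four times the memory of 150 DPI per page.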
|
||||
|
||||
## MIME Type Detection
|
||||
|
||||
When extracting from bytes, Kreuzberg requires an explicit MIME type since there's no file extension to infer it from. For file paths, the MIME type is detected from the extension automatically.
|
||||
|
||||
### Example: Override MIME Type
|
||||
|
||||
```python title="Python"
|
||||
from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig()

# File without an extension: provide the MIME type explicitly
result = extract_file_sync("document_copy", mime_type="application/pdf", config=config)
|
||||
```
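When bytes arrive with an original filename (for example, from an upload form), one way to obtain the required MIME type is Python's standard `mimetypes` module. The `mime_for_upload` helper and its fallback value are assumptions made for this sketch; only the filename is inspected, never the content.

```python
import mimetypes


def mime_for_upload(filename: str, fallback: str = "application/octet-stream") -> str:
    """Guess a MIME type from a filename, falling back when the extension is unknown."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or fallback


if __name__ == "__main__":
    print(mime_for_upload("report.pdf"))
    print(mime_for_upload("document_copy"))
```

The returned string could then be supplied as the MIME type argument of the bytes-based extraction functions. When the fallback is returned, it is usually better to ask the caller for the real type than to attempt extraction.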
|
||||
|
||||
## Error Handling
|
||||
|
||||
All extraction functions raise typed exceptions on failure. Catch specific exceptions to handle different failure modes:
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/error_handling.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/api/error_handling.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/api/error_handling.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/api/error_handling.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/api/error_handling.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/error_handling.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/api/error_handling.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/api/error_handling.md"
|
||||
|
||||
=== "C"
|
||||
|
||||
--8<-- "snippets/c/api/error_handling.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
--8<-- "snippets/wasm/api/error_handling_wasm.md"
|
||||
|
||||
!!! Warning "System Errors"
|
||||
`OSError` (Python), `std::io::Error` (Rust), and other system-level errors always propagate through. These indicate real system problems (permissions, disk space, etc.) that your application should handle.
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [Configuration](configuration.md) — all configuration options and file formats
|
||||
- [OCR Guide](ocr.md) — set up optical character recognition
|
||||
- [Advanced Features](advanced.md) — chunking, language detection, embeddings
|
||||
- [Element-Based Output](output-formats.md#element-based-output-v410) — structured element arrays for RAG
|
||||
- [Document Structure](output-formats.md#document-structure) — hierarchical tree output
|
||||
244
docs/guides/html-output.md
Normal file
@@ -0,0 +1,244 @@
|
||||
# HTML Output
|
||||
|
||||
!!! Info "Added in v4.8.1"
|
||||
|
||||
Render extracted document content as styled HTML with semantic `kb-*` CSS classes, configurable themes, and full CSS customization.
|
||||
|
||||
## Quick Start
|
||||
|
||||
=== "CLI"
|
||||
|
||||
```bash title="Terminal"
|
||||
kreuzberg extract doc.pdf --html-theme github
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python title="html_output.py"
|
||||
import asyncio

from kreuzberg import ExtractionConfig, HtmlOutputConfig, HtmlTheme, extract_file

config = ExtractionConfig(
    output_format="html",
    html_output=HtmlOutputConfig(theme=HtmlTheme.GitHub),
)


async def main() -> None:
    # extract_file is the async variant, so it must be awaited inside a coroutine
    result = await extract_file("doc.pdf", config=config)
    print(result.content)  # styled HTML string


asyncio.run(main())
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript title="html_output.ts"
|
||||
import { extractFile, HtmlTheme } from '@kreuzberg/node';
|
||||
|
||||
const result = await extractFile('doc.pdf', {
|
||||
outputFormat: 'html',
|
||||
htmlOutput: { theme: HtmlTheme.GitHub },
|
||||
});
|
||||
console.log(result.content);
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="html_output.rs"
|
||||
use kreuzberg::{extract_file, ExtractionConfig, HtmlOutputConfig, HtmlTheme, OutputFormat};

let config = ExtractionConfig {
    output_format: OutputFormat::Html,
|
||||
html_output: Some(HtmlOutputConfig {
|
||||
theme: HtmlTheme::GitHub,
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_file("doc.pdf", None, &config).await?;
|
||||
println!("{}", result.content);
|
||||
```
|
||||
|
||||
## Built-in Themes
|
||||
|
||||
| Theme | Description |
|
||||
| -------------------- | -------------------------------------------------------------------------------------- |
|
||||
| `unstyled` (default) | No built-in CSS. Only structural markup with `kb-*` classes. Use your own style sheet. |
|
||||
| `default` | System font stack, neutral colours, 72ch max width. All CSS custom properties defined. |
|
||||
| `github` | GitHub Markdown-inspired palette, border-bottom headings, 80ch max width. |
|
||||
| `dark` | Dark background (#0d1117), light text. Good for terminal/IDE integrations. |
|
||||
| `light` | Minimal light theme with generous spacing. |
|
||||
|
||||
## Configuration
|
||||
|
||||
See [HtmlOutputConfig](../reference/configuration.md#htmloutputconfig) for detailed field documentation.
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python title="html_config.py"
|
||||
from kreuzberg import ExtractionConfig, HtmlOutputConfig, HtmlTheme
|
||||
|
||||
config = ExtractionConfig(
|
||||
output_format="html",
|
||||
html_output=HtmlOutputConfig(
|
||||
theme=HtmlTheme.Dark,
|
||||
css="body { padding: 2rem; }",
|
||||
class_prefix="kb-",
|
||||
embed_css=True,
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript title="html_config.ts"
|
||||
import { HtmlTheme } from '@kreuzberg/node';
|
||||
|
||||
const config = {
|
||||
outputFormat: 'html',
|
||||
htmlOutput: {
|
||||
theme: HtmlTheme.Dark,
|
||||
css: 'body { padding: 2rem; }',
|
||||
classPrefix: 'kb-',
|
||||
embedCss: true,
|
||||
},
|
||||
};
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="html_config.rs"
|
||||
use kreuzberg::{ExtractionConfig, HtmlOutputConfig, HtmlTheme};
|
||||
|
||||
let config = ExtractionConfig {
|
||||
output_format: "html".to_string(),
|
||||
html_output: Some(HtmlOutputConfig {
|
||||
theme: HtmlTheme::Dark,
|
||||
css: Some("body { padding: 2rem; }".to_string()),
|
||||
class_prefix: "kb-".to_string(),
|
||||
embed_css: true,
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
```
|
||||
|
||||
## CLI Flags
|
||||
|
||||
| Flag                           | Description                                                                                        |
| ------------------------------ | -------------------------------------------------------------------------------------------------- |
| `--html-theme <THEME>`         | Built-in theme: `default`, `github`, `dark`, `light`, `unstyled`. Implies `--content-format html`. |
| `--html-css <CSS>`             | Inline CSS string appended after the theme stylesheet.                                             |
| `--html-css-file <PATH>`       | Path to CSS file loaded at render time (max 1 MiB).                                                |
| `--html-class-prefix <PREFIX>` | CSS class prefix; default: `"kb-"`. Alphanumeric, hyphens, underscores only.                       |
| `--html-no-embed-css`          | Suppress the `<style>` block; use external stylesheet instead.                                     |
|
||||
|
||||
## CSS Customization
|
||||
|
||||
All built-in themes (except `unstyled`) define CSS custom properties on `:root`. Override them to adjust the theme without replacing it entirely:
|
||||
|
||||
```css title="custom.css"
|
||||
:root {
|
||||
--kb-font-family: "Inter", sans-serif;
|
||||
--kb-text-color: #333;
|
||||
--kb-max-width: 60ch;
|
||||
}
|
||||
```
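When overrides are computed at runtime, it can be convenient to build the `:root` block from a mapping instead of hand-writing it. The sketch below is illustrative only: `build_root_overrides` is a hypothetical helper, not part of Kreuzberg, and it assumes the resulting string is handed to the `css` option shown in the configuration examples.

```python
def build_root_overrides(overrides: dict[str, str]) -> str:
    """Render a `:root { ... }` block from custom-property names and values.

    Hypothetical helper for illustration; not a Kreuzberg API.
    """
    for name in overrides:
        # CSS custom properties always start with two hyphens.
        if not name.startswith("--"):
            raise ValueError(f"not a CSS custom property: {name!r}")
    declarations = " ".join(f"{name}: {value};" for name, value in overrides.items())
    return f":root {{ {declarations} }}"


css = build_root_overrides({"--kb-max-width": "60ch", "--kb-text-color": "#333"})
print(css)
```

The property names used here are the ones from `custom.css` above; any other name would need to match a property the chosen theme actually defines.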
|
||||
|
||||
Pass custom CSS inline or from a file:
|
||||
|
||||
=== "CLI"
|
||||
|
||||
```bash title="Terminal"
|
||||
# Inline override
|
||||
kreuzberg extract doc.pdf --html-theme github \
|
||||
--html-css ':root { --kb-max-width: 60ch; }'
|
||||
|
||||
# From a file
|
||||
kreuzberg extract doc.pdf --html-theme github \
|
||||
--html-css-file custom.css
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python title="custom_css.py"
|
||||
from kreuzberg import ExtractionConfig, HtmlOutputConfig, HtmlTheme
|
||||
|
||||
config = ExtractionConfig(
|
||||
output_format="html",
|
||||
html_output=HtmlOutputConfig(
|
||||
theme=HtmlTheme.GitHub,
|
||||
css_file="custom.css",
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="custom_css.rs"
|
||||
use kreuzberg::{ExtractionConfig, HtmlOutputConfig, HtmlTheme};
|
||||
use std::path::PathBuf;
|
||||
|
||||
let config = ExtractionConfig {
|
||||
output_format: "html".to_string(),
|
||||
html_output: Some(HtmlOutputConfig {
|
||||
theme: HtmlTheme::GitHub,
|
||||
css_file: Some(PathBuf::from("custom.css")),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
```
|
||||
|
||||
To use your own style sheet, set the theme to `unstyled` and disable the embedded `<style>` block:
|
||||
|
||||
```python title="external_stylesheet.py"
|
||||
config = ExtractionConfig(
|
||||
output_format="html",
|
||||
html_output=HtmlOutputConfig(
|
||||
theme=HtmlTheme.Unstyled,
|
||||
embed_css=False,
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
## Class Reference
|
||||
|
||||
All generated HTML elements include semantic `kb-*` classes for targeted styling.
|
||||
|
||||
| Class                       | Element                | Description                   |
| --------------------------- | ---------------------- | ----------------------------- |
| `kb-doc`                    | `<div>`                | Root wrapper                  |
| `kb-content`                | `<main>`               | Content area                  |
| `kb-doc-title`              | `<h1>`                 | Document title                |
| `kb-h`, `kb-h1`..`kb-h6`    | `<h1>`..`<h6>`         | Headings                      |
| `kb-p`                      | `<p>`                  | Paragraphs                    |
| `kb-list`, `kb-ul`, `kb-ol` | `<ul>`, `<ol>`         | Lists                         |
| `kb-li`                     | `<li>`                 | List items                    |
| `kb-blockquote`             | `<blockquote>`         | Block quotes                  |
| `kb-pre`                    | `<pre>`                | Code blocks                   |
| `kb-code`                   | `<code>`               | Inline/block code             |
| `kb-table`                  | `<table>`              | Tables                        |
| `kb-thead`, `kb-tbody`      | `<thead>`, `<tbody>`   | Table sections                |
| `kb-th`, `kb-td`, `kb-tr`   | `<th>`, `<td>`, `<tr>` | Table cells/rows              |
| `kb-figure`                 | `<figure>`             | Image wrapper                 |
| `kb-img`                    | `<img>`                | Images                        |
| `kb-page-break`             | `<hr>`                 | Page breaks                   |
| `kb-footnote`               | `<aside>`              | Footnote definitions          |
| `kb-footnote-ref`           | `<sup>`                | Footnote references           |
| `kb-citation`               | `<cite>`               | Citations                     |
| `kb-link`                   | `<a>`                  | Hyperlinks                    |
| `kb-metadata`               | `<dl>`                 | Metadata blocks               |
| `kb-formula`                | `<pre>`                | Math formulas                 |
| `kb-slide`                  | `<section>`            | Slide sections                |
| `kb-dt`, `kb-dd`            | `<dt>`, `<dd>`         | Definition terms/descriptions |
| `kb-admonition`             | `<aside>`              | Admonitions                   |
| `kb-group`                  | `<div>`                | Grouped content               |
|
||||
|
||||
!!! Tip "Custom prefix"
    If you set `class_prefix` to `"my-"`, all classes become `my-doc`, `my-content`, `my-h1`, and so on.
|
||||
|
||||
## Security
|
||||
|
||||
!!! Warning "Security considerations"
    - `class_prefix` is validated to prevent HTML injection
    - `</style>` sequences are stripped from user CSS
    - `css_file` is limited to 1 MiB
    - When serving HTML to untrusted users, sanitize CSS at the application layer
|
||||
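The checks listed in this section can be approximated at the application layer as follows. This is a sketch of the idea, not Kreuzberg's implementation: the function names and the exact regular expressions are assumptions, derived only from the rules stated in this guide (prefix limited to alphanumerics, hyphens, and underscores; `</style>` sequences removed from user CSS).

```python
import re

# Assumed rule: one or more ASCII letters, digits, hyphens, or underscores.
_PREFIX_RE = re.compile(r"[A-Za-z0-9_-]+")
# Matches a closing style tag, tolerating case and inner whitespace.
_STYLE_CLOSE_RE = re.compile(r"</\s*style\s*>", re.IGNORECASE)


def is_valid_class_prefix(prefix: str) -> bool:
    """Return True if the prefix contains only the allowed characters."""
    return _PREFIX_RE.fullmatch(prefix) is not None


def strip_style_close(css: str) -> str:
    """Remove `</style>` sequences, repeating until none remain.

    Repeating matters: removing one match can splice a new closing tag
    together from the text on either side of it.
    """
    previous = None
    while previous != css:
        previous = css
        css = _STYLE_CLOSE_RE.sub("", css)
    return css
```

A pre-check like this does not replace a real CSS sanitizer when the stylesheet comes from untrusted users; it only mirrors the two rules named above.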
|
||||
## See Also
|
||||
|
||||
- [Configuration](configuration.md) -- all configuration options
|
||||
- [Extraction Basics](extraction.md) -- core extraction API and supported formats
|
||||
- [Element-Based Output](output-formats.md#element-based-output-v410) -- structured element output as an alternative to HTML
|
||||
- [Document Structure](output-formats.md#document-structure) -- how Kreuzberg models document structure
|
||||
105
docs/guides/keywords.md
Normal file
@@ -0,0 +1,105 @@
|
||||
# Keyword Extraction
|
||||
|
||||
Extract ranked keywords from document text using YAKE or RAKE algorithms.
|
||||
|
||||
| Algorithm | Scoring                                  | Best for                                      |
| --------- | ---------------------------------------- | --------------------------------------------- |
| **YAKE**  | Lower score = more relevant (0.0–1.0)    | General documents, single terms, multilingual |
| **RAKE**  | Higher score = more relevant (unbounded) | Multi-word phrases, technical docs            |
|
||||
|
||||
## Quick Start
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/utils/keyword_extraction_example.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/utils/keyword_extraction_example.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/advanced/keyword_extraction_example.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/utils/keyword_extraction_example.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/utils/keyword_extraction_example.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/keyword_extraction_example.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/utils/keyword_extraction_example.md"
|
||||
|
||||
Keywords are returned in `result.extracted_keywords` as objects with `text` and `score` fields.
|
||||
|
||||
## Configuration
|
||||
|
||||
See [KeywordConfig reference](../reference/configuration.md#keywordconfig) for all configuration options.
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/keyword_extraction_config.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/config/keyword_extraction_config.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/config/keyword_extraction_config.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/keyword_extraction_config.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/keyword_extraction_config.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/config/keyword_extraction_config.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/advanced/keyword_extraction_config.md"
|
||||
|
||||
## YAKE Score Tuning
|
||||
|
||||
Use `min_score` as an upper bound: lower YAKE scores mean higher relevance, so keywords scoring above the threshold are dropped:
|
||||
|
||||
| `min_score` | Effect              |
| ----------- | ------------------- |
| `0.5`       | Keeps most keywords |
| `0.3`       | Main topics only    |
| `0.1`       | Core concepts only  |
|
||||
|
||||
`yake_params.window_size` controls co-occurrence context: `1–2` for narrow domains, `2–3` for general (default: `2`), `3–4` for discussion-heavy content.
|
||||
|
||||
## RAKE Score Tuning
|
||||
|
||||
Use `min_score` as a lower bound: higher RAKE scores mean higher relevance, so keywords scoring below the threshold are dropped:
|
||||
|
||||
| `min_score` | Effect                       |
| ----------- | ---------------------------- |
| `0.1`       | Keeps most keywords          |
| `5.0`       | Main phrases only            |
| `20.0`      | Only highly specific phrases |
|
||||
|
||||
`rake_params.min_word_length` (default: `1`) and `rake_params.max_words_per_phrase` (default: `3`) control phrase boundaries.
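Because the two algorithms score in opposite directions, post-filtering code has to flip the comparison. The sketch below illustrates that with a stand-in `Keyword` dataclass carrying the `text` and `score` fields described earlier. `filter_keywords` is a hypothetical helper, and treating the threshold as inclusive is an assumption, not documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Keyword:
    """Stand-in for a returned keyword object (`text`, `score`)."""
    text: str
    score: float


def filter_keywords(keywords, algorithm, threshold):
    """Keep relevant keywords and order them from most to least relevant."""
    if algorithm == "yake":
        # YAKE: lower is better, so the threshold is an upper bound.
        kept = [k for k in keywords if k.score <= threshold]
        return sorted(kept, key=lambda k: k.score)
    if algorithm == "rake":
        # RAKE: higher is better, so the threshold is a lower bound.
        kept = [k for k in keywords if k.score >= threshold]
        return sorted(kept, key=lambda k: k.score, reverse=True)
    raise ValueError(f"unknown algorithm: {algorithm!r}")


yake = [Keyword("page", 0.4), Keyword("ocr", 0.05), Keyword("thing", 0.8)]
rake = [Keyword("file", 1.0), Keyword("document layout detection", 9.0)]
print([k.text for k in filter_keywords(yake, "yake", 0.3)])
print([k.text for k in filter_keywords(rake, "rake", 5.0)])
```

The sample scores are invented for the example; real scores depend on the document and the configured parameters.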
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
- **Too few keywords** — Lower `min_score`, check `result.content` is non-empty, set `language` to match the document or `None` to disable stopword filtering
|
||||
- **Too many irrelevant keywords** — Raise `min_score`, set `language` for stopword filtering, reduce `ngram_range` upper bound
|
||||
- **Multi-word phrases missing (YAKE)** — Switch to RAKE or confirm `ngram_range` upper bound is >= 2
|
||||
- **Keywords don't match content** — Verify text was extracted (`result.content`) and `language` matches the document
|
||||
|
||||
See the [KeywordConfig reference](../reference/configuration.md#keywordconfig) for the full parameter list.
|
||||
530
docs/guides/kubernetes.md
Normal file
@@ -0,0 +1,530 @@
|
||||
# Kubernetes Deployment <span class="version-badge new">v4.2.2</span>
|
||||
|
||||
Deploy Kreuzberg to Kubernetes with proper OCR configuration, permissions, and health checks.
|
||||
|
||||
## Helm Chart <span class="version-badge new">v4.8.4</span>
|
||||
|
||||
Deploy via the official Helm chart (OCI artifact on GHCR).
|
||||
|
||||
### Install
|
||||
|
||||
```bash title="Terminal"
|
||||
helm install kreuzberg oci://ghcr.io/kreuzberg-dev/charts/kreuzberg --version 4.8.4
|
||||
```
|
||||
|
||||
### Configure
|
||||
|
||||
Override defaults with a `values.yaml` file:
|
||||
|
||||
```yaml title="values.yaml"
# NOTE: cache.enabled=true uses ReadWriteOnce by default; keep replicaCount: 1
# with RWO storage or switch to ReadWriteMany before increasing replicas.
replicaCount: 1

image:
  tag: "4.8.4"

kreuzberg:
  logLevel: "info"
  ocrLanguage: "eng"

resources:
  requests:
    memory: "1Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"

cache:
  enabled: true
  size: 5Gi

ingress:
  enabled: true
  className: "nginx"
  hosts:
    - host: kreuzberg.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: kreuzberg-tls
      hosts:
        - kreuzberg.example.com

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80

podDisruptionBudget:
  enabled: true
  minAvailable: 1
```
|
||||
|
||||
```bash title="Terminal"
helm install kreuzberg oci://ghcr.io/kreuzberg-dev/charts/kreuzberg \
  --version 4.8.4 \
  -f values.yaml
```
|
||||
|
||||
### Upgrade
|
||||
|
||||
```bash title="Terminal"
|
||||
helm upgrade kreuzberg oci://ghcr.io/kreuzberg-dev/charts/kreuzberg --version 4.8.4
|
||||
```
|
||||
|
||||
### What's Included
|
||||
|
||||
The chart creates the following resources:
|
||||
|
||||
| Resource                | Description                                                | Conditional                   |
| ----------------------- | ---------------------------------------------------------- | ----------------------------- |
| Deployment              | Main application with health probes and security hardening | Always                        |
| Service                 | ClusterIP service on port 80 → 8000                        | Always                        |
| ServiceAccount          | Dedicated service account                                  | Always                        |
| PersistentVolumeClaim   | Cache for embedding models and assets                      | `cache.enabled`               |
| Ingress                 | HTTP(S) ingress with TLS                                   | `ingress.enabled`             |
| HorizontalPodAutoscaler | CPU/memory-based autoscaling                               | `autoscaling.enabled`         |
| PodDisruptionBudget     | Availability during disruptions                            | `podDisruptionBudget.enabled` |
|
||||
|
||||
All values are documented in the chart's [`values.yaml`](https://github.com/kreuzberg-dev/kreuzberg/blob/main/charts/kreuzberg/values.yaml).
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
```yaml title="minimal-deployment.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kreuzberg-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kreuzberg
  template:
    metadata:
      labels:
        app: kreuzberg
    spec:
      containers:
        - name: kreuzberg
          image: ghcr.io/kreuzberg-dev/kreuzberg:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: RUST_LOG
              value: "info"
            - name: TESSDATA_PREFIX
              value: "/usr/share/tesseract-ocr/5/tessdata"
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: kreuzberg-api
spec:
  selector:
    app: kreuzberg
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
```
|
||||
|
||||
```bash title="Terminal"
|
||||
kubectl apply -f minimal-deployment.yaml
|
||||
```
|
||||
|
||||
## Tesseract Configuration
|
||||
|
||||
### TESSDATA_PREFIX (Critical)
|
||||
|
||||
Without `TESSDATA_PREFIX`, OCR silently falls back to non-OCR extraction. Official images ship Tesseract 5.x with tessdata at `/usr/share/tesseract-ocr/5/tessdata/`.
|
||||
|
||||
```yaml
env:
  - name: TESSDATA_PREFIX
    value: "/usr/share/tesseract-ocr/5/tessdata"
  - name: KREUZBERG_OCR_LANGUAGE
    value: "eng"
  - name: KREUZBERG_CACHE_DIR
    value: "/app/.kreuzberg"
  - name: HF_HOME
    value: "/app/.kreuzberg/huggingface"
```
|
||||
|
||||
**Pre-installed languages:** `eng`, `spa`, `fra`, `deu`, `ita`, `por`, `chi_sim`, `chi_tra`, `jpn`, `ara`, `rus`, `hin`
|
||||
|
||||
!!! Note "Tesseract Version"
    The path varies by version. Verify yours with `tesseract --version` inside the container if using a custom base image.
|
||||
|
||||
### Custom Languages via ConfigMap
|
||||
|
||||
```bash title="Terminal"
kubectl create configmap tessdata \
  --from-file=/path/to/eng.traineddata \
  --from-file=/path/to/deu.traineddata
```
|
||||
|
||||
```yaml
spec:
  containers:
    - name: kreuzberg
      env:
        - name: TESSDATA_PREFIX
          value: "/etc/tessdata"
      volumeMounts:
        - name: tessdata
          mountPath: /etc/tessdata
  volumes:
    - name: tessdata
      configMap:
        name: tessdata
```
|
||||
|
||||
For large custom language sets, use a PVC instead of a ConfigMap.
|
||||
|
||||
### Verify Tesseract
|
||||
|
||||
```bash title="Terminal"
kubectl exec -it deployment/kreuzberg-api -- tesseract --version
kubectl exec -it deployment/kreuzberg-api -- tesseract --list-langs
kubectl exec -it deployment/kreuzberg-api -- printenv TESSDATA_PREFIX
```
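The same verification can be scripted, for example in a CI smoke test that runs with the same filesystem view as the container. The sketch below uses only the Python standard library. `missing_traineddata` is a hypothetical helper, and it assumes each language is stored as `<lang>.traineddata` directly under the tessdata directory, which is Tesseract's usual layout.

```python
import os
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the language codes that have no `<lang>.traineddata` file."""
    base = Path(tessdata_dir)
    return [lang for lang in languages if not (base / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        print("TESSDATA_PREFIX is not set")
    else:
        # Check only the languages the deployment actually needs.
        print("missing:", missing_traineddata(prefix, ["eng"]) or "none")
```

Whether Python is available inside the image is not stated in this guide; if it is not, run the check from a debug container or keep to the `kubectl exec` commands above.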
|
||||
|
||||
## Permissions
|
||||
|
||||
Kreuzberg runs as non-root (UID 1000, GID 1000). Fix PVC permissions with either approach:
|
||||
|
||||
=== "Init Container"

    ```yaml
    spec:
      initContainers:
        - name: init-permissions
          image: busybox:1.37-glibc
          command: ['sh', '-c', 'chown -R 1000:1000 /app/.kreuzberg']
          securityContext:
            runAsUser: 0
            allowPrivilegeEscalation: false
            capabilities:
              add: ["CHOWN"]
              drop: ["ALL"]
          volumeMounts:
            - name: cache
              mountPath: /app/.kreuzberg
      containers:
        - name: kreuzberg
          volumeMounts:
            - name: cache
              mountPath: /app/.kreuzberg
    ```

=== "fsGroup"

    ```yaml
    spec:
      securityContext:
        fsGroup: 1000
      containers:
        - name: kreuzberg
          securityContext:
            runAsUser: 1000
            runAsGroup: 1000
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
    ```
|
||||
|
||||
## Health Checks
|
||||
|
||||
```yaml
containers:
  - name: kreuzberg
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 2
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 10
      failureThreshold: 30
```
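The probe settings above translate into time budgets that are easy to misjudge. As a rough sketch, the window before Kubernetes acts on a failing probe is about `periodSeconds × failureThreshold`. The helper name below is hypothetical, and the calculation ignores `timeoutSeconds` and `initialDelaySeconds`, so treat the results as approximations.

```python
def probe_window_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time until a continuously failing probe trips."""
    return period_seconds * failure_threshold


# Startup probe above: how long the container may take to become healthy.
startup_budget = probe_window_seconds(period_seconds=10, failure_threshold=30)
# Liveness probe above: roughly how long a hung container survives.
liveness_window = probe_window_seconds(period_seconds=30, failure_threshold=3)

print(f"startup budget: {startup_budget} s")
print(f"liveness window: {liveness_window} s")
```

With these values the container has about five minutes to start. Raising the startup `failureThreshold` to `60` doubles that to ten minutes, which helps when large models are downloaded on first start.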
|
||||
|
||||
## Logging
|
||||
|
||||
```yaml
env:
  - name: RUST_LOG
    value: "kreuzberg=debug,warn"
```
|
||||
|
||||
Levels: `trace`, `debug`, `info`, `warn`, `error`
|
||||
|
||||
```bash title="Terminal"
|
||||
kubectl logs deployment/kreuzberg-api --tail=50
|
||||
kubectl logs deployment/kreuzberg-api -f
|
||||
kubectl logs deployment/kreuzberg-api --previous
|
||||
```
|
||||
|
||||
## Production Deployment
|
||||
|
||||
Full production manifest with namespace, PVC, security context, init container, PDB, and all probes:
|
||||
|
||||
```yaml title="production-deployment.yaml"
apiVersion: v1
kind: Namespace
metadata:
  name: kreuzberg
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kreuzberg-cache
  namespace: kreuzberg
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kreuzberg-api
  namespace: kreuzberg
# NOTE: PVC uses ReadWriteOnce; keep replicas: 1 with RWO storage.
# Increase replicas only when using ReadWriteMany storage.
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kreuzberg
  template:
    metadata:
      labels:
        app: kreuzberg
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      initContainers:
        - name: init-cache
          image: busybox:1.37-glibc
          command: ["sh", "-c", "mkdir -p /app/.kreuzberg && chown -R 1000:1000 /app/.kreuzberg"]
          securityContext:
            # Runs as root to fix ownership, so the pod-level
            # runAsNonRoot: true must be overridden for this container.
            runAsNonRoot: false
            runAsUser: 0
            allowPrivilegeEscalation: false
            capabilities:
              add: ["CHOWN"]
              drop: ["ALL"]
          volumeMounts:
            - name: cache
              mountPath: /app/.kreuzberg
      containers:
        - name: kreuzberg
          image: ghcr.io/kreuzberg-dev/kreuzberg:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: RUST_LOG
              value: "info"
            - name: TESSDATA_PREFIX
              value: "/usr/share/tesseract-ocr/5/tessdata"
            - name: KREUZBERG_CACHE_DIR
              value: "/app/.kreuzberg"
            - name: HF_HOME
              value: "/app/.kreuzberg/huggingface"
            - name: KREUZBERG_CORS_ORIGINS
              value: "https://app.example.com"
            - name: KREUZBERG_MAX_UPLOAD_SIZE_MB
              value: "500"
          args: ["serve", "--host", "0.0.0.0", "--port", "8000"]
          resources:
            requests:
              memory: "1Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 30
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: cache
              mountPath: /app/.kreuzberg
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: kreuzberg-cache
        - name: tmp
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: kreuzberg-api
  namespace: kreuzberg
spec:
  type: LoadBalancer
  selector:
    app: kreuzberg
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kreuzberg-pdb
  namespace: kreuzberg
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: kreuzberg
```
|
||||
|
||||
```bash title="Terminal"
|
||||
kubectl apply -f production-deployment.yaml
|
||||
```
|
||||
|
||||
!!! Note "Model Persistence"
    Embedding models download on first use (~90 MB – 1.2 GB). Use a PVC for `/app/.kreuzberg` to avoid re-downloading on pod restart. Outside containers, models are cached in the platform-specific global cache directory (for example, `~/.cache/kreuzberg/` on Linux, `~/Library/Caches/kreuzberg/` on macOS).
|
||||
|
||||
## High Availability
|
||||
|
||||
Add pod anti-affinity and rolling update strategy:
|
||||
|
||||
```yaml title="ha-additions.yaml"
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: [kreuzberg]
                topologyKey: kubernetes.io/hostname
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
??? Question "OCR silently failing"

    Verify `TESSDATA_PREFIX` is set and tessdata files exist:

    ```bash title="Terminal"
    kubectl exec -it deployment/kreuzberg-api -- printenv TESSDATA_PREFIX
    kubectl exec -it deployment/kreuzberg-api -- ls /usr/share/tesseract-ocr/5/tessdata/
    ```

??? Question "Permission denied on cache directory"

    Use an init container or `fsGroup` (see [Permissions](#permissions)).

??? Question "OOMKilled"

    Increase memory limits. Reduce OCR resource usage with `KREUZBERG_PDF_DPI=150` and single-language OCR.

??? Question "Startup probe timeout"

    Increase `failureThreshold` on the startup probe (e.g., `60` for 10-minute timeout).

??? Question "Language not found"

    Check installed languages with `kubectl exec -it deployment/kreuzberg-api -- tesseract --list-langs`. Mount custom tessdata via ConfigMap or PVC.
|
||||
|
||||
### Diagnostic Commands
|
||||
|
||||
```bash title="Terminal"
kubectl logs deployment/kreuzberg-api --tail=200
kubectl describe deployment kreuzberg-api
kubectl get events -n kreuzberg
kubectl exec -it deployment/kreuzberg-api -- env | sort
# The Service listens on port 80 (targetPort 8000). port-forward blocks,
# so run it in the background before calling the health endpoint.
kubectl port-forward service/kreuzberg-api 8000:80 &
curl http://localhost:8000/health
```
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [Docker Deployment](docker.md) — container configuration and image variants
|
||||
- [API Server Guide](api-server.md) — endpoint documentation
|
||||
- [OCR Guide](ocr.md) — backend installation and language setup
|
||||
- [Configuration](configuration.md) — all configuration options
|
||||
242
docs/guides/layout-detection.md
Normal file
@@ -0,0 +1,242 @@
|
||||
# Layout Detection <span class="version-badge">v4.5.0</span>
|
||||
|
||||
Detect document layout regions (tables, figures, headers, text blocks, etc.) in PDFs using ONNX-based deep learning models. Enables table extraction, figure isolation, reading-order reconstruction, and selective OCR.
|
||||
|
||||
!!! Note "Feature gate"
    Requires the `layout-detection` Cargo feature. Not included in the default feature set.
|
||||
|
||||
## Model
|
||||
|
||||
Layout detection uses the **RT-DETR v2** model, an ONNX-based deep learning model that detects 17 layout element classes: text blocks, tables, figures, headers, footers, captions, code, lists, sections, formulas, footnotes, page headers/footers, titles, checkboxes, key-value regions, and document indices.
|
||||
|
||||
### When to Enable
|
||||
|
||||
**Recommended for:** complex multi-column PDFs, scanned documents, academic papers, business forms, and any document where layout understanding improves extraction accuracy.
|
||||
|
||||
**Less beneficial for:** simple single-column text documents, high-throughput pipelines where latency is critical (consider GPU acceleration), or documents already well-handled by PDF structure trees.
|
||||
|
||||
### Performance Impact
|
||||
|
||||
| Pipeline | Structure F1 | Text F1 | Avg time/doc |
| -------- | ------------ | ------- | ------------ |
| Baseline | 33.9%        | 87.4%   | 447 ms       |
| Layout   | 41.1%        | 90.1%   | 1500 ms      |
|
||||
|
||||
_171-document PDF corpus, CPU only. GPU acceleration significantly reduces the time penalty._
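To weigh the trade-off, it helps to turn the table into deltas. The short calculation below uses only the four figures reported above; the variable names are illustrative.

```python
baseline = {"structure_f1": 33.9, "text_f1": 87.4, "ms_per_doc": 447}
layout = {"structure_f1": 41.1, "text_f1": 90.1, "ms_per_doc": 1500}

# Accuracy gains in percentage points.
structure_gain = round(layout["structure_f1"] - baseline["structure_f1"], 1)
text_gain = round(layout["text_f1"] - baseline["text_f1"], 1)
# Relative CPU time per document.
slowdown = round(layout["ms_per_doc"] / baseline["ms_per_doc"], 2)

print(f"Structure F1: +{structure_gain} points")
print(f"Text F1: +{text_gain} points")
print(f"Time per document: {slowdown}x")
```

On this corpus, layout detection buys about 7 points of Structure F1 and under 3 points of Text F1 for a little more than three times the CPU time per document.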
|
||||
|
||||
!!! Note "Layout Detection Model"
    Kreuzberg uses only the RT-DETR v2 model for layout detection. The `preset` field is not available in `LayoutDetectionConfig`. Configure table structure recognition separately via `table_model`; see "Table Structure Models" below.
|
||||
|
||||
## Configuration
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python
|
||||
from kreuzberg import ExtractionConfig, LayoutDetectionConfig, extract_file
|
||||
|
||||
config = ExtractionConfig(
|
||||
layout=LayoutDetectionConfig(
|
||||
confidence_threshold=0.5,
|
||||
apply_heuristics=True,
|
||||
table_model="tatr",
|
||||
)
|
||||
)
|
||||
result = await extract_file("document.pdf", config=config)
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript
|
||||
const result = await extract("document.pdf", {
|
||||
layout: {
|
||||
confidenceThreshold: 0.5,
|
||||
applyHeuristics: true,
|
||||
tableModel: "tatr",
|
||||
},
|
||||
});
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust
|
||||
use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig};
|
||||
|
||||
let config = ExtractionConfig {
|
||||
layout: Some(LayoutDetectionConfig {
|
||||
confidence_threshold: Some(0.5),
|
||||
apply_heuristics: true,
|
||||
table_model: Some("tatr".to_string()),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
```
|
||||
|
||||
=== "TOML"
|
||||
|
||||
```toml title="kreuzberg.toml"
|
||||
[layout]
|
||||
apply_heuristics = true
|
||||
# table_model = "tatr"
|
||||
```
|
||||
|
||||
=== "CLI"
|
||||
|
||||
```bash title="Terminal"
|
||||
# Enable layout detection with default settings
|
||||
kreuzberg extract document.pdf --layout --content-format markdown
|
||||
|
||||
# Custom confidence threshold
|
||||
kreuzberg extract document.pdf --layout-confidence 0.5 --content-format markdown
|
||||
|
||||
# Specific table model
|
||||
kreuzberg extract document.pdf --layout --layout-table-model slanet_wired
|
||||
|
||||
# Combined with GPU acceleration
|
||||
kreuzberg extract document.pdf --layout --acceleration coreml
|
||||
```
|
||||
|
||||
See [LayoutDetectionConfig](../reference/configuration.md#layoutdetectionconfig) for all fields.
|
||||
|
||||
## Table Structure Models <span class="version-badge">v4.5.3</span>
|
||||
|
||||
When layout detection identifies a table region, a table structure model analyzes rows, columns, headers, and spanning cells. Set `LayoutDetectionConfig.table_model` to one of:
|
||||
|
||||
| Value             | Notes                                                       |
| ----------------- | ----------------------------------------------------------- |
| `tatr`            | Default. Fast (~30 MB). General-purpose.                    |
| `slanet_wired`    | Higher accuracy for bordered/gridlined tables (~365 MB).    |
| `slanet_wireless` | Higher accuracy for borderless tables (~365 MB).            |
| `slanet_auto`     | Auto-classifies per page (~737 MB). Slowest.                |
| `slanet_plus`     | Smallest (~7.78 MB). For resource-constrained environments. |
| `disabled`        | Skip table structure recognition.                           |
|
||||
|
||||
!!! Note "Model Download"
    SLANeXT models are not downloaded by default. Use `cache warm --all-table-models` to pre-download, or they download automatically on first use.
|
||||
|
||||
## GPU Acceleration
|
||||
|
||||
Layout detection uses ONNX Runtime with automatic provider selection:
|
||||
|
||||
| Provider | Platform       | Notes                         |
| -------- | -------------- | ----------------------------- |
| CPU      | All            | Default, no setup needed      |
| CUDA     | Linux, Windows | Requires CUDA toolkit + cuDNN |
| CoreML   | macOS          | Automatic on Apple Silicon    |
| TensorRT | Linux          | Requires TensorRT             |
|
||||
|
||||
To override:
|
||||
|
||||
```python
|
||||
config = ExtractionConfig(
|
||||
layout=LayoutDetectionConfig(),
|
||||
acceleration=AccelerationConfig(provider="cuda", device_id=0)
|
||||
)
|
||||
```
|
||||
|
||||
See [AccelerationConfig reference](../reference/configuration.md#accelerationconfig) for details.
|
||||
|
||||
## Layout Classes
|
||||
|
||||
The RT-DETR v2 model detects 17 classes. Each `LayoutRegion.class_name` is one of:
|
||||
|
||||
`caption`, `footnote`, `formula`, `list_item`, `page_footer`, `page_header`, `picture`, `section_header`, `table`, `text`, `title`, `document_index`, `code`, `checkbox_selected`, `checkbox_unselected`, `form`, `key_value_region`.
|
||||
|
||||
See [`LayoutRegion`](../reference/types.md#layoutregion) in the types reference for the full field shape.
|
||||
|
||||
## Accessing Layout Regions
|
||||
|
||||
When layout detection is enabled AND page extraction is enabled, each page in the result includes `layout_regions` — a list of detected regions with class, confidence score, bounding box, and area fraction. This enables programmatic filtering and analysis of specific layout elements.
|
||||
|
||||
=== "Python"

    ```python
    from kreuzberg import extract_file, ExtractionConfig, LayoutDetectionConfig, PagesConfig

    result = await extract_file(
        "document.pdf",
        config=ExtractionConfig(
            layout=LayoutDetectionConfig(),
            pages=PagesConfig(extract_pages=True),
        ),
    )

    for page in result.pages:
        if page.layout_regions:
            for region in page.layout_regions:
                if region.class_name == "picture" and region.confidence > 0.9:
                    print(f"Page {page.page_number}: diagram detected "
                          f"(confidence={region.confidence:.2f}, "
                          f"area={region.area_fraction:.0%})")
    ```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
    ```typescript
    const result = await extract("document.pdf", {
      layout: {},
      pages: { extractPages: true },
    });

    for (const page of result.pages ?? []) {
      if (page.layoutRegions) {
        for (const region of page.layoutRegions) {
          if (region.className === "picture" && region.confidence > 0.9) {
            console.log(
              `Page ${page.pageNumber}: diagram detected ` +
                `(confidence=${region.confidence.toFixed(2)}, ` +
                `area=${(region.areaFraction * 100).toFixed(0)}%)`
            );
          }
        }
      }
    }
    ```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
    ```rust
    use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig, PagesConfig};

    let result = extract_file(
        "document.pdf",
        ExtractionConfig {
            layout: Some(LayoutDetectionConfig::default()),
            pages: Some(PagesConfig {
                extract_pages: true,
                ..Default::default()
            }),
            ..Default::default()
        },
    ).await?;

    for page in &result.pages {
        if let Some(regions) = &page.layout_regions {
            for region in regions {
                if region.class_name == "picture" && region.confidence > 0.9 {
                    println!(
                        "Page {}: diagram detected (confidence={:.2}, area={:.0}%)",
                        page.page_number,
                        region.confidence,
                        region.area_fraction * 100.0
                    );
                }
            }
        }
    }
    ```
|
||||
|
||||
### Tips
|
||||
|
||||
- Use `confidence` to filter low-confidence detections — typically ≥ 0.8–0.9 for downstream operations
|
||||
- Use `area_fraction` to distinguish between inline images and full-page diagrams (e.g., `area_fraction > 0.1` for significant figures)
|
||||
- Layout regions are attached to pages, so enable both layout detection and page extraction to get content and layout structure together
|
||||
- Available across all bindings (Python, TypeScript, Rust, Ruby, Java, Go, Elixir, C#, PHP)
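
The first two tips can be combined into a single filter. The sketch below uses a small `Region` dataclass as a stand-in for the real region type (its field names follow the examples above); the `significant_figures` helper and its default thresholds are illustrative choices, not library defaults.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Stand-in for a layout region; field names follow the examples above."""
    class_name: str
    confidence: float
    area_fraction: float


def significant_figures(regions, min_confidence=0.8, min_area=0.1):
    """Keep pictures that are both confidently detected and reasonably large."""
    return [
        r for r in regions
        if r.class_name == "picture"
        and r.confidence >= min_confidence
        and r.area_fraction > min_area
    ]


regions = [
    Region("picture", 0.95, 0.40),  # large and confident: kept
    Region("picture", 0.95, 0.02),  # small inline image: dropped
    Region("picture", 0.55, 0.50),  # low confidence: dropped
    Region("table", 0.99, 0.30),    # not a picture: dropped
]
print(len(significant_figures(regions)))
```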
|
||||
|
||||
## Acknowledgments
|
||||
|
||||
- **[Docling](https://github.com/DS4SD/docling)** — RT-DETR v2 model and layout classification approach
|
||||
- **[TATR](https://github.com/microsoft/table-transformer)** — Table structure recognition with ONNX
|
||||
- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)** — SLANeXT table structure and PP-LCNet classifier models
|
||||
|
||||
## Related
|
||||
|
||||
- [Configuration Reference](../reference/configuration.md#layoutdetectionconfig) — full field reference
|
||||
- [Element-Based Output](output-formats.md#element-based-output-v410) — using layout-aware results
|
||||
411
docs/guides/llm-integration.md
Normal file
@@ -0,0 +1,411 @@
|
||||
# LLM Integration <span class="version-badge">v4.8.0</span>
|
||||
|
||||
Kreuzberg integrates with 143 LLM providers (including local inference engines) via [liter-llm](https://github.com/kreuzberg-dev/liter-llm) for three capabilities: VLM OCR, structured extraction, and provider-hosted embeddings.
|
||||
|
||||
!!! Note "Feature gate"
    Requires the `liter-llm` Cargo feature. Not included in the default feature set.
|
||||
|
||||
## VLM OCR
|
||||
|
||||
Use vision-language models as an OCR backend by rendering document pages as images and sending them to the VLM for text extraction.
|
||||
|
||||
### When to Use
|
||||
|
||||
- Low-quality scanned documents where traditional OCR struggles
|
||||
- Handwritten text recognition
|
||||
- Arabic, Farsi, and other scripts with poor Tesseract/PaddleOCR support
|
||||
- Complex layouts where traditional OCR fails (mixed tables, forms, diagrams)
|
||||
- When you need higher accuracy and can accept higher latency and API costs
|
||||
|
||||
### Configuration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/llm/vlm_ocr.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/llm/vlm_ocr.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="Rust"
|
||||
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};
|
||||
|
||||
let config = ExtractionConfig {
|
||||
force_ocr: true,
|
||||
ocr: Some(OcrConfig {
|
||||
backend: "vlm".to_string(),
|
||||
vlm_config: Some(LlmConfig {
|
||||
model: "openai/gpt-4o-mini".to_string(),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_file("scan.pdf", None, &config).await?;
|
||||
```
|
||||
|
||||
=== "CLI"
|
||||
|
||||
```bash title="Terminal"
|
||||
kreuzberg extract scan.pdf --force-ocr true \
|
||||
--vlm-model openai/gpt-4o-mini
|
||||
```
|
||||
|
||||
=== "TOML"
|
||||
|
||||
```toml title="kreuzberg.toml"
|
||||
force_ocr = true
|
||||
|
||||
[ocr]
|
||||
backend = "vlm"
|
||||
|
||||
[ocr.vlm_config]
|
||||
model = "openai/gpt-4o-mini"
|
||||
```
|
||||
|
||||
=== "Environment Variables"
|
||||
|
||||
```bash title="Terminal"
|
||||
export KREUZBERG_VLM_OCR_MODEL=openai/gpt-4o-mini
|
||||
export OPENAI_API_KEY=sk-...
|
||||
```
|
||||
|
||||
### Custom VLM Prompt
|
||||
|
||||
Override the default prompt template for VLM OCR:
|
||||
|
||||
```python title="Python"
|
||||
from kreuzberg import ExtractionConfig, OcrConfig, LlmConfig
|
||||
|
||||
config = ExtractionConfig(
|
||||
force_ocr=True,
|
||||
ocr=OcrConfig(
|
||||
backend="vlm",
|
||||
vlm_config=LlmConfig(model="openai/gpt-4o-mini"),
|
||||
vlm_prompt="Extract all text from this document image. Preserve formatting.",
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
### Supported Providers
|
||||
|
||||
Any liter-llm vision-capable provider works as a VLM OCR backend:
|
||||
|
||||
| Provider | Example Model |
|
||||
| ----------------- | -------------------------------------- |
|
||||
| OpenAI | `openai/gpt-4o`, `openai/gpt-4o-mini` |
|
||||
| Anthropic | `anthropic/claude-3-5-sonnet-20241022` |
|
||||
| Google | `google/gemini-2.0-flash` |
|
||||
| Groq | `groq/llama-3.2-90b-vision-preview` |
|
||||
| Ollama (local) | `ollama/llama3.2-vision` |
|
||||
| LM Studio (local) | `lmstudio/llava-1.5` |
|
||||
| vLLM (local) | `vllm/llava-next` |
|
||||
|
||||
## Structured Extraction
|
||||
|
||||
Extract structured JSON data from documents by providing a JSON schema; the extracted document text is sent to an LLM, which returns JSON conforming to that schema.
|
||||
|
||||
### Basic Usage
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/llm/structured_extraction.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/llm/structured_extraction.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/llm/structured_extraction.md"
|
||||
|
||||
=== "CLI"
|
||||
|
||||
```bash title="Terminal"
|
||||
kreuzberg extract-structured paper.pdf \
|
||||
--schema schema.json \
|
||||
--model openai/gpt-4o-mini \
|
||||
--strict
|
||||
```
|
||||
|
||||
=== "TOML"
|
||||
|
||||
```toml title="kreuzberg.toml"
|
||||
[structured_extraction]
|
||||
schema_name = "paper_metadata"
|
||||
strict = true
|
||||
|
||||
[structured_extraction.schema]
|
||||
type = "object"
|
||||
|
||||
[structured_extraction.schema.properties.title]
|
||||
type = "string"
|
||||
|
||||
[structured_extraction.schema.properties.date]
|
||||
type = "string"
|
||||
|
||||
[structured_extraction.llm]
|
||||
model = "openai/gpt-4o-mini"
|
||||
```
|
||||
|
||||
### Custom Prompts (Jinja2)
|
||||
|
||||
Override the default extraction prompt with a Jinja2 template:
|
||||
|
||||
```python title="Python"
|
||||
from kreuzberg import ExtractionConfig, StructuredExtractionConfig, LlmConfig
|
||||
|
||||
config = ExtractionConfig(
|
||||
structured_extraction=StructuredExtractionConfig(
|
||||
schema={"type": "object", "properties": {"title": {"type": "string"}}},
|
||||
llm=LlmConfig(model="openai/gpt-4o-mini"),
|
||||
prompt=(
|
||||
"Analyze this document and extract key metadata.\n\n"
|
||||
"Document:\n{{ content }}\n\n"
|
||||
"Schema: {{ schema }}"
|
||||
),
|
||||
),
|
||||
)
|
||||
```
|
||||
|
||||
Available template variables:
|
||||
|
||||
| Variable | Description |
|
||||
| -------------------------- | ----------------------------------------- |
|
||||
| `{{ content }}` | The extracted document text |
|
||||
| `{{ schema }}` | The JSON schema as a formatted string |
|
||||
| `{{ schema_name }}` | The schema name (default: `"extraction"`) |
|
||||
| `{{ schema_description }}` | The schema description (may be empty) |
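
To preview how a custom prompt looks once the four variables are filled in, the sketch below substitutes them using only the standard library. It is a stand-in that handles plain `{{ name }}` placeholders, not a Jinja2 renderer, and `render_preview` is a hypothetical helper name.

```python
import json
import re


def render_preview(template: str, variables: dict) -> str:
    """Replace simple {{ name }} placeholders; unknown names are left untouched."""
    def substitute(match):
        name = match.group(1)
        return str(variables.get(name, match.group(0)))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


schema = {"type": "object", "properties": {"title": {"type": "string"}}}
prompt = render_preview(
    "Document:\n{{ content }}\n\nSchema ({{ schema_name }}): {{ schema }}",
    {
        "content": "Example document text",
        "schema": json.dumps(schema, indent=2),
        "schema_name": "extraction",
        "schema_description": "",
    },
)
print(prompt)
```

Templates that rely on Jinja2 features such as filters, loops, or conditionals need the real engine to preview.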
|
||||
|
||||
### Cross-Provider Compatibility
|
||||
|
||||
Structured extraction handles provider differences automatically:
|
||||
|
||||
- **OpenAI**: Full strict mode with `additionalProperties` enforcement
|
||||
- **Anthropic/Gemini**: `additionalProperties` automatically stripped (not supported by these providers)
|
||||
- **All providers**: Markdown code fence wrapping in responses is automatically handled
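
The two automatic adjustments above could look roughly like this. It is a sketch of the idea, not Kreuzberg's code: both helper names are hypothetical, and the real handling may cover more cases.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL
)


def strip_additional_properties(schema):
    """Return a copy of the schema without any `additionalProperties` keys."""
    if isinstance(schema, dict):
        return {
            key: strip_additional_properties(value)
            for key, value in schema.items()
            if key != "additionalProperties"
        }
    if isinstance(schema, list):
        return [strip_additional_properties(item) for item in schema]
    return schema


def unwrap_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence if the reply has one."""
    match = FENCED.match(text)
    return match.group(1) if match else text.strip()


reply = FENCE + 'json\n{"title": "Example"}\n' + FENCE
print(json.loads(unwrap_code_fence(reply)))
```

Note that this naive version would also drop a schema property that happens to be named `additionalProperties`; a production version would need to distinguish keywords from property names.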
|
||||
|
||||
### Strict Mode
|
||||
|
||||
When `strict=True`, the LLM is instructed to produce output that exactly matches the schema. This enables OpenAI's structured output mode and adds validation on the response.
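
As an illustration of what validating a response against the schema involves, here is a deliberately small checker. It is an assumption-laden sketch: it only handles flat object schemas with primitive types, and Kreuzberg's actual validation is not shown in this guide. A full validator such as the `jsonschema` package covers the complete specification.

```python
TYPE_MAP = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def type_matches(value, json_type):
    """Check a value against one primitive JSON Schema type."""
    expected = TYPE_MAP.get(json_type)
    if expected is None:
        return True  # type not covered by this sketch
    if isinstance(value, bool) and json_type != "boolean":
        return False  # bool is a subclass of int in Python
    return isinstance(value, expected)


def validate_flat(data, schema):
    """Return a list of problems; an empty list means the data passed."""
    properties = schema.get("properties", {})
    problems = [f"missing required field: {name}"
                for name in schema.get("required", []) if name not in data]
    for name, value in data.items():
        if name not in properties:
            if schema.get("additionalProperties") is False:
                problems.append(f"unexpected field: {name}")
        elif not type_matches(value, properties[name].get("type")):
            problems.append(f"wrong type for field: {name}")
    return problems


schema = {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "total": {"type": "number"}},
    "required": ["vendor", "total"],
    "additionalProperties": False,
}
print(validate_flat({"vendor": "ACME", "total": "12"}, schema))
```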
|
||||
|
||||
## VLM Embeddings
|
||||
|
||||
Use provider-hosted embedding models when your embeddings must match the model already used by your vector database, or when local ONNX models are unavailable.
|
||||
|
||||
### Configuration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/llm/vlm_embeddings.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript title="TypeScript"
|
||||
import { embedSync } from '@kreuzberg/node';
|
||||
|
||||
const embeddings = embedSync(['Hello world'], {
|
||||
model: {
|
||||
modelType: 'llm',
|
||||
value: 'openai/text-embedding-3-small',
|
||||
},
|
||||
normalize: true,
|
||||
});
|
||||
console.log(embeddings[0].length); // 1536
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="Rust"
|
||||
use kreuzberg::{embed_texts, EmbeddingConfig, EmbeddingModelType, LlmConfig};
|
||||
|
||||
let config = EmbeddingConfig {
|
||||
model: EmbeddingModelType::Llm {
|
||||
llm: LlmConfig {
|
||||
model: "openai/text-embedding-3-small".to_string(),
|
||||
..Default::default()
|
||||
},
|
||||
},
|
||||
normalize: true,
|
||||
..Default::default()
|
||||
};
|
||||
let embeddings = embed_texts(&["Hello world"], &config)?;
|
||||
```
|
||||
|
||||
=== "CLI"
|
||||
|
||||
```bash title="Terminal"
|
||||
kreuzberg embed \
|
||||
--provider llm \
|
||||
--model openai/text-embedding-3-small \
|
||||
--text "Hello world"
|
||||
```
|
||||
|
||||
### Available Models
|
||||
|
||||
| Model | Dimensions | Provider |
|
||||
| ---------------------------------------- | ---------- | -------- |
|
||||
| `openai/text-embedding-3-small` | 1536 | OpenAI |
|
||||
| `openai/text-embedding-3-large` | 3072 | OpenAI |
|
||||
| `mistral/mistral-embed` | 1024 | Mistral |
|
||||
| Any liter-llm embedding-capable provider | Varies | Various |
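
When storing vectors, it is worth checking that their width matches what the index expects. The sketch below uses the dimensions from the table and assumes that `normalize` means L2 (unit-length) normalization, which this guide does not spell out. `check_dimension` and `l2_normalize` are illustrative helpers.

```python
import math

# Dimensions from the table above.
DIMENSIONS = {
    "openai/text-embedding-3-small": 1536,
    "openai/text-embedding-3-large": 3072,
    "mistral/mistral-embed": 1024,
}


def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]


def check_dimension(model, vector):
    """Raise if a vector does not match the expected width for a known model."""
    expected = DIMENSIONS.get(model)
    if expected is not None and len(vector) != expected:
        raise ValueError(f"{model}: expected {expected} dims, got {len(vector)}")


unit = l2_normalize([3.0, 4.0])
print(math.sqrt(sum(x * x for x in unit)))
```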
|
||||
|
||||
## Local LLM Support
|
||||
|
||||
<span class="version-badge">v4.8.0</span>
|
||||
|
||||
Kreuzberg can use local LLM inference engines through [liter-llm](https://github.com/kreuzberg-dev/liter-llm)'s provider routing: point it at your local server, and no API key is needed.
|
||||
|
||||
### Supported Local Engines
|
||||
|
||||
| Engine | Prefix | Default URL | Install |
|
||||
| ------------------------------------------------------ | ------------ | --------------------------- | --------------------- |
|
||||
| [Ollama](https://ollama.com) | `ollama/` | `http://localhost:11434/v1` | `brew install ollama` |
|
||||
| [LM Studio](https://lmstudio.ai) | `lmstudio/` | `http://localhost:1234/v1` | Desktop app |
|
||||
| [vLLM](https://vllm.ai) | `vllm/` | `http://localhost:8000/v1` | `pip install vllm` |
|
||||
| [llama.cpp](https://github.com/ggerganov/llama.cpp) | `llamacpp/` | `http://localhost:8080/v1` | Build from source |
|
||||
| [LocalAI](https://localai.io) | `localai/` | `http://localhost:8080/v1` | Docker |
|
||||
| [llamafile](https://github.com/Mozilla-Ocho/llamafile) | `llamafile/` | `http://localhost:8080/v1` | Single binary |
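
The prefix-to-URL mapping in the table can be expressed as a lookup. This is a sketch for illustration: `split_model` and `default_base_url` are hypothetical helpers, and the actual routing happens inside liter-llm.

```python
# Default URLs copied from the table above.
LOCAL_ENGINES = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "vllm": "http://localhost:8000/v1",
    "llamacpp": "http://localhost:8080/v1",
    "localai": "http://localhost:8080/v1",
    "llamafile": "http://localhost:8080/v1",
}


def split_model(model):
    """Split 'provider/name' into its two parts."""
    provider, _, name = model.partition("/")
    if not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return provider, name


def default_base_url(model):
    """Return the table's default URL for local prefixes, else None."""
    provider, _ = split_model(model)
    return LOCAL_ENGINES.get(provider)


print(default_base_url("ollama/llama3.2-vision"))
```

llama.cpp, LocalAI, and llamafile share the same default port, so running more than one of them on the same machine means moving at least one to another port and setting `base_url` accordingly.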
|
||||
|
||||
### Example: Ollama
|
||||
|
||||
=== "CLI"

    ```bash
    # Start Ollama and pull a model
    ollama pull llama3.2-vision

    # Use it for VLM OCR (no API key needed)
    kreuzberg extract scan.pdf --force-ocr true \
      --vlm-model ollama/llama3.2-vision

    # Use it for structured extraction
    kreuzberg extract-structured doc.pdf \
      --schema schema.json \
      --model ollama/llama3.2

    # Use it for embeddings
    kreuzberg embed --provider llm \
      --model ollama/all-minilm \
      --text "Hello world"
    ```
|
||||
|
||||
=== "Python"

    ```python
    from kreuzberg import extract_file, ExtractionConfig, StructuredExtractionConfig, LlmConfig

    config = ExtractionConfig(
        structured_extraction=StructuredExtractionConfig(
            schema={"type": "object", "properties": {"title": {"type": "string"}}},
            llm=LlmConfig(model="ollama/llama3.2"),  # No api_key needed
        ),
    )
    result = await extract_file("doc.pdf", config=config)
    ```
|
||||
|
||||
=== "TOML Config"

    ```toml
    [structured_extraction.llm]
    model = "ollama/llama3.2"
    # No api_key needed for local providers
    ```
|
||||
|
||||
!!! Tip "Custom Base URL"
    If your local server runs on a non-default port, use `base_url`:

    ```python
    LlmConfig(model="ollama/llama3.2", base_url="http://localhost:11435/v1")
    ```
|
||||
|
||||
## LLM Usage Tracking
|
||||
|
||||
Every LLM call made during extraction is tracked in the `llm_usage` field of `ExtractionResult`. Each entry records the model used, token counts, estimated cost, and why the model stopped generating.
|
||||
|
||||
=== "Python"
|
||||
|
||||
    ```python
    result = await extract_file("document.pdf", config)
    if result.get("llm_usage"):
        for usage in result["llm_usage"]:
            print(f"{usage['source']}: {usage['input_tokens']} in, {usage['output_tokens']} out, ${usage['estimated_cost']:.4f}")
    ```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript
|
||||
const result = await extractFile("document.pdf", config);
|
||||
for (const usage of result.llmUsage ?? []) {
|
||||
console.log(`${usage.source}: ${usage.inputTokens} in, ${usage.outputTokens} out, $${usage.estimatedCost?.toFixed(4)}`);
|
||||
}
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust
|
||||
let result = extract_file("document.pdf", &config).await?;
|
||||
if let Some(usages) = &result.llm_usage {
|
||||
for usage in usages {
|
||||
println!("{}: {} in, {} out", usage.source, usage.input_tokens.unwrap_or(0), usage.output_tokens.unwrap_or(0));
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The `source` field indicates which pipeline stage triggered the call: `"vlm_ocr"`, `"structured_extraction"`, or `"embeddings"`.
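
Since each entry carries a `source`, usage can be totaled per pipeline stage. The sketch below assumes entries are dicts with the keys used in the Python example above and that token counts or cost may be missing; the sample numbers are made up, and `summarize_usage` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_usage(entries):
    """Sum tokens and cost per pipeline stage, treating missing values as zero."""
    totals = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "estimated_cost": 0.0}
    )
    for entry in entries:
        bucket = totals[entry["source"]]
        bucket["input_tokens"] += entry.get("input_tokens") or 0
        bucket["output_tokens"] += entry.get("output_tokens") or 0
        bucket["estimated_cost"] += entry.get("estimated_cost") or 0.0
    return dict(totals)


# Made-up entries for illustration only.
entries = [
    {"source": "vlm_ocr", "input_tokens": 1200, "output_tokens": 300, "estimated_cost": 0.001},
    {"source": "vlm_ocr", "input_tokens": 1000, "output_tokens": 250, "estimated_cost": None},
    {"source": "structured_extraction", "input_tokens": 800, "output_tokens": 60},
]
for source, totals in summarize_usage(entries).items():
    print(source, totals["input_tokens"], totals["output_tokens"])
```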
|
||||
|
||||
## API Key Configuration
|
||||
|
||||
API keys can be set via (in order of precedence):
|
||||
|
||||
1. `api_key` field in `LlmConfig` — highest priority, per-request
|
||||
2. Provider standard env vars (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc.)
|
||||
3. Kreuzberg-specific env var (`KREUZBERG_LLM_API_KEY`) — used as fallback for any provider
|
||||
|
||||
!!! Note "Local providers skip API key lookup"
    Local inference engines (Ollama, LM Studio, vLLM, llama.cpp, LocalAI, llamafile) do not require an API key. If you use a local provider prefix (for example, `ollama/`), the API key fields are ignored.
|
||||
|
||||
```python title="Python"
|
||||
from kreuzberg import LlmConfig
|
||||
|
||||
# Explicit API key
|
||||
config = LlmConfig(model="openai/gpt-4o", api_key="sk-...")
|
||||
|
||||
# Custom base URL (e.g., Azure OpenAI, local proxy)
|
||||
config = LlmConfig(
|
||||
model="openai/gpt-4o",
|
||||
base_url="https://my-proxy.example.com/v1",
|
||||
)
|
||||
```
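
The precedence order at the start of this section can be sketched as a small resolver. This is an illustration, not the library's lookup code: it assumes the provider variable is always named `<PROVIDER>_API_KEY`, which matches the examples listed but may not hold for every provider, and `resolve_api_key` is a hypothetical name.

```python
LOCAL_PREFIXES = {"ollama", "lmstudio", "vllm", "llamacpp", "localai", "llamafile"}


def resolve_api_key(model, explicit_key=None, env=None):
    """Apply the documented precedence; local providers never need a key."""
    env = {} if env is None else env
    provider = model.split("/", 1)[0]
    if provider in LOCAL_PREFIXES:
        return None  # key fields are ignored for local engines
    if explicit_key:
        return explicit_key  # 1. `api_key` on the config
    provider_var = f"{provider.upper()}_API_KEY"  # assumed naming pattern
    # 2. provider variable, then 3. the Kreuzberg-wide fallback
    return env.get(provider_var) or env.get("KREUZBERG_LLM_API_KEY")


env = {"OPENAI_API_KEY": "sk-env", "KREUZBERG_LLM_API_KEY": "sk-fallback"}
print(resolve_api_key("openai/gpt-4o", env=env))
```

In real use, `os.environ` would be passed as `env`.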
|
||||
|
||||
## LlmConfig Reference
|
||||
|
||||
| Field | Type | Default | Description |
|
||||
| -------------- | --------------- | ---------- | ------------------------------------------------------------------- |
|
||||
| `model` | `str` | _required_ | Provider/model in liter-llm format (for example, `"openai/gpt-4o"`) |
|
||||
| `api_key` | `str \| None` | `None` | API key (falls back to env vars) |
|
||||
| `base_url` | `str \| None` | `None` | Custom endpoint URL |
|
||||
| `timeout_secs` | `int \| None` | `60` | Request timeout in seconds |
|
||||
| `max_retries` | `int \| None` | `3` | Maximum retry attempts |
|
||||
| `temperature` | `float \| None` | `None` | Sampling temperature |
|
||||
| `max_tokens` | `int \| None` | `None` | Maximum tokens to generate |
|
||||
|
||||
## REST API
|
||||
|
||||
### Structured Extraction
|
||||
|
||||
`POST /extract-structured` — multipart form with file + schema + model configuration.
|
||||
|
||||
```bash title="Terminal"
|
||||
curl -X POST http://localhost:4000/extract-structured \
|
||||
-F "file=@invoice.pdf" \
|
||||
-F 'schema={"type":"object","properties":{"vendor":{"type":"string"},"total":{"type":"number"}}}' \
|
||||
-F "model=openai/gpt-4o-mini" \
|
||||
-F "strict=true"
|
||||
```
|
||||
|
||||
## MCP Tools
|
||||
|
||||
When running Kreuzberg as an MCP server, LLM features are available as tools:
|
||||
|
||||
- `extract_structured` — extract structured data from a document using a JSON schema
|
||||
- `embed_text` — extended with `model` parameter for LLM-hosted embeddings
|
||||
|
||||
## Related
|
||||
|
||||
- [OCR](ocr.md) — OCR backends including VLM OCR
|
||||
- [Configuration Reference](configuration.md) — full field reference for all config types
|
||||
- [Advanced Features](advanced.md) — chunking, language detection, local embeddings
|
||||
- [API Server](api-server.md) — REST API endpoints
|
||||
191
docs/guides/mcp-integration.md
Normal file
@@ -0,0 +1,191 @@
|
||||
# MCP Integration <span class="version-badge">v5.0.0</span>
|
||||
|
||||
Kreuzberg speaks [Model Context Protocol](https://modelcontextprotocol.io/). That means any AI agent — Claude, Cursor, a custom LangChain pipeline — can extract documents, generate embeddings, and manage caches through a standard tool interface without writing extraction code.
|
||||
|
||||
Two commands to get started:
|
||||
|
||||
```bash title="Terminal"
|
||||
pip install "kreuzberg[all]"
|
||||
kreuzberg mcp
|
||||
```
|
||||
|
||||
That's it. You now have an MCP server running over stdio, ready for any compatible client.
|
||||
|
||||
---
|
||||
|
||||
## How It Works
|
||||
|
||||
The MCP server wraps Kreuzberg's extraction engine behind standard tools, running as a child process over stdin/stdout with JSON-RPC messages — no HTTP ports or configuration needed.
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
A["AI Agent\n(Claude, Cursor, etc.)"] -->|"JSON-RPC\nover stdio"| B["kreuzberg mcp"]
|
||||
B --> C["Extraction Engine"]
|
||||
B --> D["Embedding Engine"]
|
||||
B --> E["Cache Layer"]
|
||||
```
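
To make the wire format concrete, the sketch below builds the kind of JSON-RPC 2.0 request an MCP client writes to the server's stdin when it calls a tool. The `tools/call` method and the one-message-per-line framing come from the MCP specification; the `tool_call` helper is illustrative, and the `path` argument name follows the Python client example in this guide.

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC requests need unique ids per session


def tool_call(name, arguments):
    """Serialize one JSON-RPC 2.0 `tools/call` request as a single line."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # json.dumps escapes newlines inside strings, so the line stays intact.
    return json.dumps(message) + "\n"


print(tool_call("extract_file", {"path": "document.pdf"}), end="")
```

A real client must complete the `initialize` handshake before calling tools, which is why the official SDKs are the practical choice.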
|
||||
|
||||
---
|
||||
|
||||
## Server Modes
|
||||
|
||||
### Stdio (Default)
|
||||
|
||||
The standard mode for local AI tools. The agent spawns `kreuzberg mcp` as a subprocess and communicates over pipes.
|
||||
|
||||
```bash title="Terminal"
|
||||
kreuzberg mcp
|
||||
kreuzberg mcp --config kreuzberg.toml
|
||||
```
|
||||
|
||||
This is what Claude Desktop, Cursor, and most MCP clients expect.
|
||||
|
||||
### HTTP Transport
|
||||
|
||||
!!! Info "Feature flag: `mcp-http`"
    HTTP transport requires the `mcp-http` feature flag at build time.
|
||||
|
||||
For remote deployments or multi-client setups where stdio doesn't work — shared servers, team environments, cloud-hosted agents — HTTP transport exposes the same tool interface over the network.
|
||||
|
||||
---
|
||||
|
||||
## Tools
|
||||
|
||||
Kreuzberg exposes 13 tools via MCP. All extraction tools accept an optional `config` object to override defaults:
|
||||
|
||||
**Extraction:** `extract_file`, `extract_bytes`, `batch_extract_files`, `detect_mime_type`, `extract_structured`
|
||||
**Embeddings:** `embed_text`
|
||||
**Chunking:** `chunk_text`
|
||||
**Cache:** `cache_stats`, `cache_clear`, `cache_manifest`, `cache_warm`
|
||||
**Metadata:** `list_formats`, `get_version`
|
||||
|
||||
`extract_structured` requires the server to be built with the `liter-llm` feature. Full parameter schemas are discoverable at runtime via the MCP client's `list_tools` call.
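
A client can compare the tools a server advertises against the list above, for example to notice a build without `liter-llm`. This sketch assumes such a build simply omits `extract_structured` from its listing, which this guide does not state; `missing_tools` is a hypothetical helper.

```python
# Tool names grouped as in the list above.
TOOLS = {
    "extraction": ["extract_file", "extract_bytes", "batch_extract_files",
                   "detect_mime_type", "extract_structured"],
    "embeddings": ["embed_text"],
    "chunking": ["chunk_text"],
    "cache": ["cache_stats", "cache_clear", "cache_manifest", "cache_warm"],
    "metadata": ["list_formats", "get_version"],
}


def missing_tools(advertised):
    """Return the documented tool names that the server did not advertise."""
    expected = {name for names in TOOLS.values() for name in names}
    return sorted(expected - set(advertised))


advertised = [n for names in TOOLS.values() for n in names if n != "extract_structured"]
print(missing_tools(advertised))
```

With the official Python SDK, the advertised names would come from `[t.name for t in (await session.list_tools()).tools]`.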
|
||||
|
||||
---
|
||||
|
||||
## Connecting AI Tools
|
||||
|
||||
### Claude Desktop
|
||||
|
||||
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
|
||||
|
||||
```json title="claude_desktop_config.json"
|
||||
{
|
||||
"mcpServers": {
|
||||
"kreuzberg": {
|
||||
"command": "kreuzberg",
|
||||
"args": ["mcp"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Restart Claude. Kreuzberg's tools appear automatically — ask Claude to "extract text from invoice.pdf" and it will call `extract_file` behind the scenes.
|
||||
|
||||
### Cursor
|
||||
|
||||
Add to `.cursor/mcp.json` in your project root:
|
||||
|
||||
```json title=".cursor/mcp.json"
|
||||
{
|
||||
"mcpServers": {
|
||||
"kreuzberg": {
|
||||
"command": "kreuzberg",
|
||||
"args": ["mcp"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Python MCP Client
|
||||
|
||||
For building custom agent pipelines, use the official `mcp` Python SDK:
|
||||
|
||||
```python title="mcp_client.py"
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server_params = StdioServerParameters(
        command="kreuzberg", args=["mcp"]
    )

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            tools = await session.list_tools()
            print(f"Available: {[t.name for t in tools.tools]}")

            result = await session.call_tool(
                "extract_file",
                arguments={"path": "document.pdf"},
            )
            print(result)

asyncio.run(main())
```
|
||||
|
||||
### Spawning from Python
|
||||
|
||||
If your application manages the server lifecycle directly:
|
||||
|
||||
```python title="spawn_server.py"
import subprocess

# The stdio transport reads JSON-RPC from stdin and writes to stdout,
# so the parent needs pipes for both directions.
process = subprocess.Popen(
    ["python", "-m", "kreuzberg", "mcp"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
print(f"MCP server running (PID {process.pid})")
```
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
Pass a TOML config file to set extraction defaults for all tools:
|
||||
|
||||
```bash title="Terminal"
|
||||
kreuzberg mcp --config kreuzberg.toml
|
||||
```
|
||||
|
||||
Individual tool calls override file defaults via a `config` parameter. See [ExtractionConfig Reference](../reference/configuration.md) for all available fields.
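
One way to picture per-call overrides is a recursive overlay of the call's `config` on the file defaults. Whether Kreuzberg merges nested sections key by key or replaces them wholesale is not stated here, so treat the merge semantics below, and the `merge_config` helper itself, as assumptions.

```python
def merge_config(defaults, override):
    """Recursively overlay `override` on `defaults` without mutating either."""
    merged = dict(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # merge nested section
        else:
            merged[key] = value  # scalar or new key: the override wins
    return merged


file_defaults = {"force_ocr": False, "ocr": {"backend": "tesseract", "language": "eng"}}
call_config = {"ocr": {"language": "deu"}}
print(merge_config(file_defaults, call_config))
```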
|
||||
|
||||
---
|
||||
|
||||
## Running in Docker
|
||||
|
||||
```bash title="Terminal"
|
||||
docker run ghcr.io/kreuzberg-dev/kreuzberg:latest mcp
|
||||
|
||||
docker run \
|
||||
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
mcp --config /config/kreuzberg.toml
|
||||
```
|
||||
|
||||
For production, use Compose with a persistent cache volume so embedding models don't re-download on restart:
|
||||
|
||||
```yaml title="docker-compose.yaml"
services:
  kreuzberg-mcp:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    command: mcp --config /config/kreuzberg.toml
    volumes:
      - ./kreuzberg.toml:/config/kreuzberg.toml:ro
      - cache-data:/app/.kreuzberg
    restart: unless-stopped

volumes:
  cache-data:
```
|
||||
|
||||
---
|
||||
|
||||
## What to Read Next
|
||||
|
||||
- [API Server Guide](api-server.md) — the HTTP REST API and detailed MCP tool reference
|
||||
- [Docker Deployment](docker.md) — container setup for all server modes
|
||||
- [Configuration Reference](../reference/configuration.md) — every config option explained
|
||||
492
docs/guides/ocr.md
Normal file
@@ -0,0 +1,492 @@
|
||||
# OCR (Optical Character Recognition)
|
||||
|
||||
Extract text from images and scanned PDFs. Kreuzberg automatically determines when OCR is needed — images always require it, scanned PDFs trigger it per-page, and hybrid PDFs only OCR the pages that lack a text layer. Set `force_ocr=True` to OCR all pages regardless.
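
The per-page rule can be sketched as follows. The input is one boolean per page saying whether that page has a text layer; how Kreuzberg detects this is not covered here, and `pages_to_ocr` is a hypothetical helper rather than a library function.

```python
def pages_to_ocr(has_text_layer, force_ocr=False):
    """Given one bool per page (True = text layer present), pick pages to OCR."""
    if force_ocr:
        return list(range(len(has_text_layer)))  # OCR every page
    return [i for i, has_text in enumerate(has_text_layer) if not has_text]


print(pages_to_ocr([True, False, True]))           # hybrid PDF
print(pages_to_ocr([True, False, True], True))     # force_ocr=True
print(pages_to_ocr([False]))                       # an image is one page without text
```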
|
||||
|
||||
## Backend Comparison
|
||||
|
||||
Four OCR backends — pick based on platform, accuracy needs, and language coverage.
|
||||
|
||||
| | **Tesseract** | **PaddleOCR** | **EasyOCR** | **VLM** |
|
||||
| ---------------- | -------------------- | ----------------------------------- | ------------------- | ------------------------ |
|
||||
| **Speed** | Fast | Very fast | Moderate | Slow (API latency) |
|
||||
| **Accuracy** | Good | Excellent | Excellent | Highest |
|
||||
| **Languages** | 100+ | 80+ (11 script families) | 80+ | All (provider-dependent) |
|
||||
| **Installation** | System package | Built-in (native) or Python package | Python package only | API key only |
|
||||
| **Model size** | ~10 MB | Mobile ~8 MB, Server ~120 MB | ~100 MB | None (cloud-hosted) |
|
||||
| **GPU support** | No | Yes | Yes | N/A (server-side) |
|
||||
| **Platform** | All (including Wasm) | All except Wasm | Python only | All |
|
||||
| **Cost** | Free | Free | Free | Per-token API cost |
|
||||
|
||||
**When to use which:**
|
||||
|
||||
- **Tesseract** — Default choice. Works everywhere, low overhead, broadest platform support.
|
||||
- **PaddleOCR** — Best speed-to-accuracy ratio. Preferred for CJK languages. Mobile tier is fast; server tier maximizes accuracy with GPU.
|
||||
- **EasyOCR** — Excellent accuracy from deep learning models. Python-only, with a heavier dependency footprint.
|
||||
- **VLM** — Best for handwritten text, poor scans, Arabic/Farsi, and complex layouts. Requires an API key and incurs per-token costs. See [LLM Integration](llm-integration.md) for full details.
|
||||
|
||||
## Installation
|
||||
|
||||
### Tesseract
|
||||
|
||||
=== "macOS"
|
||||
|
||||
```bash title="Terminal"
|
||||
brew install tesseract
|
||||
```
|
||||
|
||||
=== "Ubuntu / Debian"
|
||||
|
||||
```bash title="Terminal"
|
||||
sudo apt-get install tesseract-ocr
|
||||
```
|
||||
|
||||
=== "RHEL / Fedora"
|
||||
|
||||
```bash title="Terminal"
|
||||
sudo dnf install tesseract
|
||||
```
|
||||
|
||||
=== "Windows"
|
||||
|
||||
Download the installer from the [UB Mannheim Tesseract wiki](https://github.com/UB-Mannheim/tesseract/wiki).
|
||||
|
||||
**Additional language packs:**
|
||||
|
||||
```bash title="Terminal"
|
||||
# macOS — all languages
|
||||
brew install tesseract-lang
|
||||
|
||||
# Ubuntu/Debian — individual languages
|
||||
sudo apt-get install tesseract-ocr-deu # German
|
||||
sudo apt-get install tesseract-ocr-fra # French
|
||||
|
||||
# Verify installed languages
|
||||
tesseract --list-langs
|
||||
```
|
||||
|
||||
### PaddleOCR
|
||||
|
||||
=== "Native bindings (Rust, Go, TypeScript, Java, C#, Ruby, PHP, Elixir)"
|
||||
|
||||
Built in via the `paddle-ocr` feature flag. Models download automatically on first use — no extra installation needed.
|
||||
|
||||
```toml title="Cargo.toml (Rust example)"
|
||||
[dependencies]
|
||||
kreuzberg = { version = "4.0", features = ["paddle-ocr"] }
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
PaddleOCR is bundled via the native Rust bindings and works out of the box since 4.8.5 — no extra installation is needed. Models are downloaded automatically on first use.
|
||||
|
||||
### EasyOCR (Python only)
|
||||
|
||||
```bash title="Terminal"
|
||||
pip install "kreuzberg[easyocr]"
|
||||
```
|
||||
|
||||
!!! Info "Python 3.14"
    EasyOCR 1.7.3+ and PyTorch 2.9.1+ support Python 3.14. Install `kreuzberg[easyocr]` on any supported Python version (3.10–3.14).
|
||||
|
||||
!!! Tip "Tesseract marker extra"
|
||||
`pip install "kreuzberg[tesseract]"` is available as a metadata-only marker to document a dependency on the Tesseract system package. It installs no Python packages — Tesseract itself must still be installed via your OS package manager (see above).
|
||||
|
||||
## Configuration
|
||||
|
||||
### Basic OCR
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/ocr/ocr_extraction.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/ocr/ocr_extraction.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/ocr/ocr_extraction.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/ocr/ocr_extraction.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/ocr/ocr_extraction.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/ocr/ocr_extraction.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/ocr/ocr_extraction.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
--8<-- "snippets/wasm/ocr/ocr_extraction.md"
|
||||
|
||||
### Multiple Languages
|
||||
|
||||
Specify multiple language codes separated by `+` (Tesseract) or as a list (EasyOCR/PaddleOCR):
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/ocr/ocr_multi_language.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/ocr/ocr_multi_language.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/ocr/ocr_multi_language.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/ocr/ocr_multi_language.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/ocr/ocr_multi_language.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/ocr/ocr_multi_language.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/ocr/ocr_multi_language.md"
|
||||
|
||||
=== "Wasm"
|
||||
|
||||
```typescript
|
||||
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';
|
||||
|
||||
await initWasm();
|
||||
await enableOcr();
|
||||
|
||||
const file = fileInput.files?.[0];
|
||||
if (file) {
|
||||
const result = await extractFromFile(file, file.type, {
|
||||
ocr: { backend: 'tesseract-wasm', language: 'eng+deu' },
|
||||
});
|
||||
}
|
||||
```
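
If the language selection comes from user input, two small helpers can convert between the `+`-joined form and a list. These are illustrative helpers, not part of Kreuzberg, and they only change the separator.

```python
def tesseract_languages(codes):
    """Join language codes with '+', the form Tesseract expects (e.g. 'eng+deu')."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def language_list(spec):
    """Split a '+'-joined specification back into a list of codes."""
    return [code for code in spec.split("+") if code]


print(tesseract_languages(["eng", "deu"]))
print(language_list("eng+deu"))
```

The helpers do not translate codes between backends: Tesseract uses three-letter codes such as `eng`, while EasyOCR uses its own shorter codes such as `en`, so check each backend's language list.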
|
||||
|
||||
### Force OCR
|
||||
|
||||
Process PDFs with OCR even when they have a text layer:
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/ocr/ocr_force_all_pages.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/ocr/ocr_force_all_pages.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/ocr/ocr_force_all_pages.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/ocr/ocr_force_all_pages.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/ocr/ocr_force_all_pages.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/ocr/ocr_force_all_pages.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/ocr/ocr_force_all_pages.md"
|
||||
|
||||
### Using EasyOCR
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/ocr/ocr_easyocr.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/ocr/ocr_easyocr.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/ocr/ocr_easyocr.md"
|
||||
|
||||
### Disable OCR
|
||||
|
||||
!!! Info "Added in v4.7.0"
|
||||
|
||||
When `disable_ocr` is set, image files return empty content instead of raising `MissingDependencyError`:
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python title="disable_ocr.py"
|
||||
from kreuzberg import ExtractionConfig, extract_file_sync
|
||||
|
||||
config = ExtractionConfig(disable_ocr=True)
|
||||
result = extract_file_sync("scanned.png", config=config)
|
||||
# result.content will be empty — OCR was skipped
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript title="disable_ocr.ts"
|
||||
import { extractFileSync } from '@kreuzberg/node';
|
||||
|
||||
const result = extractFileSync('scanned.png', {
|
||||
disableOcr: true,
|
||||
});
|
||||
// result.content will be empty — OCR was skipped
|
||||
```
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="disable_ocr.rs"
|
||||
use kreuzberg::{ExtractionConfig, extract_file};
|
||||
|
||||
let config = ExtractionConfig {
|
||||
disable_ocr: true,
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_file("scanned.png", None, &config).await?;
|
||||
// result.content will be empty — OCR was skipped
|
||||
```
|
||||
|
||||
### Using PaddleOCR
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/ocr/ocr_paddleocr.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/ocr/ocr_paddleocr.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/ocr/ocr_paddleocr.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/ocr/ocr_paddleocr.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/ocr/ocr_paddleocr.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/ocr/ocr_paddleocr.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/ocr/ocr_paddleocr.md"
|
||||
|
||||
### Using VLM OCR <span class="version-badge">v4.8.0</span>
|
||||
|
||||
Use a vision-language model (e.g. GPT-4o, Claude) as the OCR backend — each page is rendered and sent to the VLM. Cloud providers need an API key; local engines (Ollama, etc.) use the `ollama/` prefix — see [Local LLM Support](llm-integration.md#local-llm-support).
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/llm/vlm_ocr.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/llm/vlm_ocr.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust title="Rust"
|
||||
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};
|
||||
|
||||
let config = ExtractionConfig {
|
||||
force_ocr: true,
|
||||
ocr: Some(OcrConfig {
|
||||
backend: "vlm".to_string(),
|
||||
vlm_config: Some(LlmConfig {
|
||||
model: "openai/gpt-4o-mini".to_string(),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
let result = extract_file("scan.pdf", None, &config).await?;
|
||||
```
|
||||
|
||||
=== "CLI"
|
||||
|
||||
```bash title="Terminal"
|
||||
kreuzberg extract scan.pdf --force-ocr true --vlm-model openai/gpt-4o-mini
|
||||
```
|
||||
|
||||
=== "TOML"
|
||||
|
||||
```toml title="kreuzberg.toml"
|
||||
force_ocr = true
|
||||
|
||||
[ocr]
|
||||
backend = "vlm"
|
||||
|
||||
[ocr.vlm_config]
|
||||
model = "openai/gpt-4o-mini"
|
||||
```
|
||||
|
||||
For more on VLM OCR, including custom prompts, supported providers, and API key configuration, see [LLM Integration](llm-integration.md#vlm-ocr).
|
||||
|
||||
!!! Tip "GPU Acceleration"

    EasyOCR and PaddleOCR support GPU acceleration. Set `use_gpu=True` in your OCR config. PaddleOCR's `model_tier="server"` gives the best accuracy with GPU.
|
||||
|
||||
## DPI Configuration
|
||||
|
||||
Higher DPI improves accuracy but increases processing time and memory.
|
||||
|
||||
| DPI | Trade-off |
|
||||
| ----------------- | ------------------------------------------ |
|
||||
| **150** | Fastest — lower accuracy, less memory |
|
||||
| **300** (default) | Balanced — good accuracy, reasonable speed |
|
||||
| **600** | Best accuracy — slower, more memory |
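The memory trade-off follows from simple arithmetic: both page dimensions scale linearly with DPI, so the pixel count grows with its square. The sketch below is only an illustration, assuming a US Letter page (8.5 x 11 in) held as an uncompressed 8-bit grayscale bitmap; the page size and pixel format an OCR backend actually uses may differ.

```python
def page_pixels(dpi: int, width_in: float = 8.5, height_in: float = 11.0) -> tuple[int, int]:
    """Return (width_px, height_px) for a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def grayscale_megabytes(dpi: int) -> float:
    """Approximate size of one uncompressed 8-bit grayscale page, in MB (10**6 bytes)."""
    width_px, height_px = page_pixels(dpi)
    return width_px * height_px / 1_000_000


for dpi in (150, 300, 600):
    width_px, height_px = page_pixels(dpi)
    print(f"{dpi} DPI: {width_px} x {height_px} px, ~{grayscale_megabytes(dpi):.1f} MB per page")
```

Doubling the DPI quadruples the pixel count, which is why dropping from 600 to 300 tends to help far more than a linear estimate would suggest.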
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/ocr_dpi_config.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/ocr/ocr_dpi_config.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/ocr/ocr_dpi_config.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/ocr_dpi_config.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/config/ocr_dpi_config.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/ocr_dpi_config.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/config/ocr_dpi_config.md"
|
||||
|
||||
## PaddleOCR Script Families
|
||||
|
||||
PaddleOCR (PP-OCRv5) covers 80+ languages across 11 script families. Recognition models are downloaded on demand from Hugging Face:
|
||||
|
||||
| Family | Languages |
|
||||
| -------------- | -------------------------------------------------------------------------------------------- |
|
||||
| **English** | English, numbers, punctuation |
|
||||
| **Chinese** | Simplified/Traditional Chinese, Japanese |
|
||||
| **Latin** | French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, and so on. |
|
||||
| **Korean** | Korean (Hangul) |
|
||||
| **Slavic** | Russian, Ukrainian, Belarusian, Bulgarian, Serbian, and so on. |
|
||||
| **Thai** | Thai script |
|
||||
| **Greek** | Greek script |
|
||||
| **Arabic** | Arabic, Persian, Urdu |
|
||||
| **Devanagari** | Hindi, Marathi, Sanskrit, Nepali |
|
||||
| **Tamil** | Tamil script |
|
||||
| **Telugu** | Telugu script |
|
||||
|
||||
Models are cached locally after first download, so subsequent runs start immediately.
|
||||
|
||||
## CLI Usage
|
||||
|
||||
```bash title="Terminal"
|
||||
# Basic OCR extraction
|
||||
kreuzberg extract scanned.pdf --ocr true
|
||||
|
||||
# Specific language
|
||||
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra
|
||||
|
||||
# Specific backend
|
||||
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch
|
||||
|
||||
# Force OCR on all pages
|
||||
kreuzberg extract document.pdf --force-ocr true
|
||||
|
||||
# VLM OCR backend
|
||||
kreuzberg extract handwritten.pdf --force-ocr true --vlm-model openai/gpt-4o-mini
|
||||
|
||||
# Use a config file
|
||||
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
|
||||
```
|
||||
|
||||
| Flag | Description |
|
||||
| ------------------------- | ---------------------------------------------------------------------------------- |
|
||||
| `--ocr true` | Enable OCR processing |
|
||||
| `--ocr-language <code>` | Language code (`eng`, `deu`, `fra`, `ch`, `ja`, `ru`, etc.) |
|
||||
| `--ocr-backend <backend>` | Engine: `tesseract`, `paddle-ocr`, `easyocr`, or `vlm` |
|
||||
| `--force-ocr true` | OCR all pages regardless of text layer |
|
||||
| `--vlm-model <model>` | VLM model for OCR (for example, `openai/gpt-4o-mini`). Implies `--ocr-backend vlm` |
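When the CLI is driven from a script, it can help to assemble the argument list programmatically. The helper below is hypothetical (`build_extract_command` is not part of Kreuzberg) and only uses the flags listed in the table above.

```python
import shlex
from typing import Optional


def build_extract_command(
    path: str,
    *,
    ocr: bool = False,
    language: Optional[str] = None,
    backend: Optional[str] = None,
    force_ocr: bool = False,
    vlm_model: Optional[str] = None,
) -> list[str]:
    """Assemble a `kreuzberg extract` argument list from the documented flags."""
    command = ["kreuzberg", "extract", path]
    if ocr:
        command += ["--ocr", "true"]
    if language:
        command += ["--ocr-language", language]
    if backend:
        command += ["--ocr-backend", backend]
    if force_ocr:
        command += ["--force-ocr", "true"]
    if vlm_model:
        command += ["--vlm-model", vlm_model]
    return command


command = build_extract_command(
    "chinese_doc.pdf", ocr=True, backend="paddle-ocr", language="ch"
)
print(shlex.join(command))  # shell-quoted form, handy for logging
```

The returned list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting problems with file names that contain spaces.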
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
??? Question "Tesseract not found"
|
||||
|
||||
Install Tesseract and verify it's on your PATH:
|
||||
|
||||
```bash title="Terminal"
|
||||
# macOS
|
||||
brew install tesseract
|
||||
|
||||
# Ubuntu/Debian
|
||||
sudo apt-get install tesseract-ocr
|
||||
|
||||
# Verify
|
||||
tesseract --version
|
||||
```
|
||||
|
||||
??? Question "Language not found"
|
||||
|
||||
Install the language data pack:
|
||||
|
||||
```bash title="Terminal"
|
||||
# macOS — all languages
|
||||
brew install tesseract-lang
|
||||
|
||||
# Ubuntu/Debian — individual language
|
||||
sudo apt-get install tesseract-ocr-deu
|
||||
|
||||
# Verify
|
||||
tesseract --list-langs
|
||||
```
|
||||
|
||||
??? Question "Poor accuracy"
|
||||
|
||||
- Increase DPI to 600 for better quality
|
||||
- Try a different backend — PaddleOCR and EasyOCR often outperform Tesseract on complex layouts
|
||||
- Specify the correct language code for your document
|
||||
- Use `force_ocr=True` if a PDF's embedded text layer is low quality
|
||||
- For handwritten text or very poor scans, try the VLM backend with a vision-capable model (see [LLM Integration](llm-integration.md#vlm-ocr))
|
||||
|
||||
??? Question "Slow processing"
|
||||
|
||||
- Reduce DPI to 150 for faster throughput
|
||||
- Enable GPU acceleration with EasyOCR or PaddleOCR (`use_gpu=True`)
|
||||
- Use batch extraction to process multiple files concurrently
|
||||
|
||||
??? Question "Out of memory on large PDFs"
|
||||
|
||||
- Reduce DPI — lower resolution uses significantly less memory
|
||||
- Process pages in smaller batches
|
||||
- Use PaddleOCR's mobile tier (`model_tier="mobile"`) for a smaller memory footprint
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [LLM Integration](llm-integration.md) — VLM OCR, structured extraction, and LLM embeddings
|
||||
- [Configuration](configuration.md) — all configuration options
|
||||
- [Extraction Basics](extraction.md) — core extraction API and supported formats
|
||||
- [Advanced Features](advanced.md) — chunking, language detection, embeddings
|
||||
398
docs/guides/output-formats.md
Normal file
@@ -0,0 +1,398 @@
|
||||
# Output Formats <span class="version-badge">v4.1.0</span>
|
||||
|
||||
Choose the format that matches your downstream processing:
|
||||
|
||||
- **Unified (default)** — Plain text/Markdown, for LLM prompts and full-text search
|
||||
- **Element-Based** — Flat array of typed elements with metadata, for RAG chunking and semantic search
|
||||
- **Document Structure** — Hierarchical tree with explicit parent-child references, for knowledge graphs and structured apps
|
||||
- **PDF Hierarchy** — Font-size classification into heading levels (H1–H6) for PDFs
|
||||
|
||||
## Unified Output (Default)
|
||||
|
||||
No configuration required. The result contains:
|
||||
|
||||
- `content` — Full document text with minimal formatting
|
||||
- `pages` — Per-page breakdown for PDFs, DOCX, and PPTX
|
||||
- `tables` — Extracted tables in structured format
|
||||
- `images` — Image metadata and paths
|
||||
|
||||
---
|
||||
|
||||
## Element-Based Output <span class="version-badge">v4.1.0</span>
|
||||
|
||||
A flat array of typed elements (titles, paragraphs, tables, list items, code blocks, images, etc.). Each carries a page number; PDF text elements also carry bounding boxes when hierarchy extraction is enabled.
|
||||
|
||||
Use for RAG chunking, semantic search, or Unstructured.io-compatible pipelines.
|
||||
|
||||
### Enable
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/element_based_output.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/config/element_based_output.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/config/element_based_output.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/element_based_output.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/element_based_output.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/config/element_based_output.md"
|
||||
|
||||
=== "PHP"
|
||||
|
||||
--8<-- "snippets/php/config/element_based_output.md"
|
||||
|
||||
Elements are in `result.elements`. Each element has `element_id`, `element_type`, `text`, and `metadata`.
|
||||
|
||||
### Element Types
|
||||
|
||||
| `element_type` | Description | Key `additional` fields |
|
||||
| ---------------- | ---------------------------------- | ------------------------------------------ |
|
||||
| `title` | Main title or top-level heading | `level` (h1–h6), `font_size`, `font_name` |
|
||||
| `heading` | Section/subsection heading | `level` (h1–h6) |
|
||||
| `narrative_text` | Body paragraph | — |
|
||||
| `list_item` | Bullet, numbered, or indented item | `list_type`, `list_marker`, `indent_level` |
|
||||
| `table` | Tabular data | `row_count`, `column_count`, `format` |
|
||||
| `image` | Embedded image | `format`, `width`, `height`, `alt_text` |
|
||||
| `code_block` | Code snippet | `language`, `line_count` |
|
||||
| `block_quote` | Quoted text | — |
|
||||
| `header` | Recurring page header | `position` |
|
||||
| `footer` | Recurring page footer | `position` |
|
||||
| `page_break` | Page boundary marker | `next_page` |
|
||||
|
||||
### Metadata
|
||||
|
||||
Every element's `metadata` contains:
|
||||
|
||||
| Field | Type | Description |
|
||||
| --------------- | --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `page_number` | `int \| None` | 1-indexed page number (PDF, DOCX, PPTX) |
|
||||
| `filename` | `str \| None` | Source filename |
|
||||
| `coordinates` | `BoundingBox \| None` | `x0`, `y0`, `x1`, `y1` in PDF points. Only populated for **text elements** when `pdf_options.hierarchy` is enabled with `include_bbox=True`. Table and image elements do not carry coordinates. |
|
||||
| `element_index` | `int` | Zero-based position in the elements array |
|
||||
| `additional` | `dict[str, str]` | Element-type-specific fields (see table above) |
|
||||
|
||||
PDF coordinates use bottom-left origin in points (1/72 inch).
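Image tooling usually expects a top-left origin in pixels, so a conversion step is often needed before drawing boxes on a rendered page. The function below is a minimal sketch, not part of Kreuzberg: `pdf_bbox_to_pixels` is a hypothetical name, and it assumes you know the page height in points (792 for US Letter) and the DPI at which the page image was rendered.

```python
def pdf_bbox_to_pixels(bbox: dict, page_height_pt: float, dpi: int = 300) -> dict:
    """Convert a bottom-left-origin box in PDF points to a top-left-origin pixel box."""
    scale = dpi / 72.0  # 72 points per inch
    return {
        "left": round(bbox["x0"] * scale),
        # y1 is the upper edge in PDF space, so it becomes the top edge after flipping
        "top": round((page_height_pt - bbox["y1"]) * scale),
        "right": round(bbox["x1"] * scale),
        "bottom": round((page_height_pt - bbox["y0"]) * scale),
    }


title_box = {"x0": 72.0, "y0": 700.0, "x1": 540.0, "y1": 730.0}
print(pdf_bbox_to_pixels(title_box, page_height_pt=792.0, dpi=144))
```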
|
||||
|
||||
### Example Output
|
||||
|
||||
```json
|
||||
{
|
||||
"element_id": "elem-a3f2b1c4",
|
||||
"element_type": "title",
|
||||
"text": "Introduction to Machine Learning",
|
||||
"metadata": {
|
||||
"page_number": 1,
|
||||
"element_index": 0,
|
||||
"coordinates": { "x0": 72.0, "y0": 700.0, "x1": 540.0, "y1": 730.0 },
|
||||
"additional": { "level": "h1", "font_size": "24" }
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Filtering Elements
|
||||
|
||||
```python
from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig(result_format="element_based")
result = extract_file_sync("document.pdf", config=config)

titles = [e for e in result.elements if e.element_type == "title"]
tables = [e for e in result.elements if e.element_type == "table"]

for title in titles:
    level = title.metadata.additional.get("level", "h1")
    print(f"[{level}] {title.text}")
```
|
||||
|
||||
### Migrating from Unstructured.io
|
||||
|
||||
If you're migrating from Unstructured.io, element-based output follows a similar structure with these key differences:
|
||||
|
||||
| Aspect | Unstructured.io | Kreuzberg |
|
||||
| ----------- | ------------------------------------- | ------------------------------------------- |
|
||||
| Type names | PascalCase (`Title`, `NarrativeText`) | snake_case (`title`, `narrative_text`) |
|
||||
| Element IDs | Not always present | Always present (deterministic hash) |
|
||||
| Metadata | Basic (`page_number`, `filename`) | Extended (coordinates, `additional` fields) |
|
||||
| Config key | — | `result_format="element_based"` |
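If existing code switches on Unstructured.io type names, a small translation layer can ease the move. The sketch below is an assumption-laden illustration: it converts PascalCase to snake_case mechanically and then checks the result against the element types listed earlier on this page. Not every Unstructured.io type necessarily has a same-named counterpart, so unmatched names return `None` and need a manual mapping.

```python
import re
from typing import Optional

# Element types documented in the "Element Types" table above
KREUZBERG_ELEMENT_TYPES = {
    "title", "heading", "narrative_text", "list_item", "table", "image",
    "code_block", "block_quote", "header", "footer", "page_break",
}


def to_snake_case(name: str) -> str:
    """Insert an underscore before each inner capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def map_element_type(unstructured_type: str) -> Optional[str]:
    """Return the matching documented element type, or None if there is no direct match."""
    candidate = to_snake_case(unstructured_type)
    return candidate if candidate in KREUZBERG_ELEMENT_TYPES else None


for name in ("Title", "NarrativeText", "ListItem", "CodeSnippet"):
    print(name, "->", map_element_type(name))
```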
|
||||
|
||||
---
|
||||
|
||||
## Document Structure
|
||||
|
||||
A flat list of nodes with explicit parent-child index references — a traversable tree with heading levels, content layers, inline annotations, and structured table grids.
|
||||
|
||||
Use when you need hierarchical relationships between sections.
|
||||
|
||||
### Comparison
|
||||
|
||||
| Aspect | Unified (default) | Element-based | Document structure |
|
||||
| ------------------ | ---------------------- | -------------------- | --------------------------------- |
|
||||
| Output shape | `content: string` | `elements: array` | `nodes: array` with index refs |
|
||||
| Hierarchy | None | Inferred from levels | Explicit parent/child indices |
|
||||
| Inline annotations | No | No | Bold, italic, links per node |
|
||||
| Tables | `result.tables` | Table elements | `TableGrid` with cell coords |
|
||||
| Content layers | Not classified | Not classified | body, header, footer, footnote |
|
||||
| Best for | LLM prompts, full-text | RAG chunking | Knowledge graphs, structured apps |
|
||||
|
||||
### Enable
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/document_structure_config.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/config/document_structure_config.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/config/document_structure_config.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/document_structure_config.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/config/document_structure_config.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/config/document_structure_config.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/document_structure_config.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/config/document_structure_config.md"
|
||||
|
||||
### Node Shape
|
||||
|
||||
Each node in `result.document.nodes`:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "node-a3f2b1c4",
|
||||
"content": { "node_type": "heading", "level": 2, "text": "Supervised Learning" },
|
||||
"parent": 0,
|
||||
"children": [4, 5, 6],
|
||||
"content_layer": "body",
|
||||
"page": 5,
|
||||
"page_end": null,
|
||||
"bbox": { "x0": 72.0, "y0": 600.0, "x1": 400.0, "y1": 620.0 },
|
||||
"annotations": []
|
||||
}
|
||||
```
|
||||
|
||||
- `parent` and `children` are integer indices into the `nodes` array (`null` if absent)
|
||||
- `bbox` is present when bounding box data is available
|
||||
- `annotations` contains inline formatting spans
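Because `parent` and `children` are plain indices, walking the tree needs no special API. The following sketch builds an indented outline from a hand-written node list that mimics the shape above; the sample data and the `heading_outline` helper are made up for illustration, and real output may nest sections differently.

```python
def heading_outline(nodes: list[dict]) -> list[str]:
    """Walk the tree depth-first and return one indented line per title or heading."""
    lines: list[str] = []

    def visit(index: int, depth: int) -> None:
        node = nodes[index]
        content = node["content"]
        if content["node_type"] in ("title", "heading"):
            lines.append("  " * depth + content["text"])
        for child_index in node.get("children") or []:
            visit(child_index, depth + 1)

    for index, node in enumerate(nodes):
        if node.get("parent") is None:  # root nodes have no parent
            visit(index, 0)
    return lines


sample_nodes = [
    {"content": {"node_type": "title", "text": "Introduction to Machine Learning"},
     "parent": None, "children": [1, 3]},
    {"content": {"node_type": "heading", "level": 2, "text": "Supervised Learning"},
     "parent": 0, "children": [2]},
    {"content": {"node_type": "paragraph", "text": "Models learn from labelled data."},
     "parent": 1, "children": []},
    {"content": {"node_type": "heading", "level": 2, "text": "Unsupervised Learning"},
     "parent": 0, "children": []},
]

print("\n".join(heading_outline(sample_nodes)))
```

The recursion depth equals the nesting depth of the document, which is normally small; for unusually deep trees an explicit stack would be the safer choice.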
|
||||
|
||||
### Node Types
|
||||
|
||||
| `node_type` | Key fields | Notes |
|
||||
| ------------ | ---------------------------------------- | ------------------------------------------- |
|
||||
| `title` | `text` | Document title |
|
||||
| `heading` | `level` (1–6), `text` | Section heading |
|
||||
| `paragraph` | `text` | Body paragraph; may have `annotations` |
|
||||
| `list` | `ordered` (bool) | Container; children are `list_item` nodes |
|
||||
| `list_item` | `text` | Child of `list` |
|
||||
| `table` | `grid` ([TableGrid](#table-grid)) | Grid with cell-level data |
|
||||
| `image` | `description`, `image_index` | `image_index` references `result.images` |
|
||||
| `code` | `text`, `language` | Code block |
|
||||
| `quote` | _(container)_ | Children are typically paragraphs |
|
||||
| `formula` | `text` | Math formula (plain text, LaTeX, or MathML) |
|
||||
| `footnote` | `text` | Usually `content_layer: "footnote"` |
|
||||
| `group` | `label`, `heading_level`, `heading_text` | Section grouping container |
|
||||
| `page_break` | _(marker)_ | Page boundary |
|
||||
|
||||
### Content Layers
|
||||
|
||||
| Layer | Description |
|
||||
| ---------- | ------------------------------------------ |
|
||||
| `body` | Main document content |
|
||||
| `header` | Page header area (repeated chapter titles) |
|
||||
| `footer` | Page footer area (page numbers, copyright) |
|
||||
| `footnote` | Footnotes and endnotes |
|
||||
|
||||
```python
for node in result.document["nodes"]:
    if node["content_layer"] == "body":
        process_main_content(node)
```
|
||||
|
||||
### Text Annotations
|
||||
|
||||
Paragraphs carry a list of `annotations` marking character spans:
|
||||
|
||||
```json
|
||||
{ "start": 0, "end": 16, "kind": { "annotation_type": "bold" } }
|
||||
```
|
||||
|
||||
| `annotation_type` | Extra fields |
|
||||
| ---------------------------------------------- | ------------------------- |
|
||||
| `bold`, `italic`, `underline`, `strikethrough` | — |
|
||||
| `code`, `subscript`, `superscript` | — |
|
||||
| `link` | `url`, `title` (optional) |
|
||||
|
||||
```python
for node in result.document["nodes"]:
    for ann in node.get("annotations", []):
        text = node["content"].get("text", "")
        span = text[ann["start"]:ann["end"]]
        kind = ann["kind"]["annotation_type"]
        if kind == "link":
            print(f"Link: {span} -> {ann['kind']['url']}")
        else:
            print(f"{kind}: {span}")
```
|
||||
|
||||
### Table Grid
|
||||
|
||||
Table nodes contain a `grid` with cell-level data:
|
||||
|
||||
```json
|
||||
{
|
||||
"rows": 3,
|
||||
"cols": 3,
|
||||
"cells": [
|
||||
{ "content": "Algorithm", "row": 0, "col": 0, "row_span": 1, "col_span": 1, "is_header": true },
|
||||
{
|
||||
"content": "Decision Tree",
|
||||
"row": 1,
|
||||
"col": 0,
|
||||
"row_span": 1,
|
||||
"col_span": 1,
|
||||
"is_header": false
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Each cell has `row`, `col`, `row_span`, `col_span`, `is_header`, and optionally `bbox`.
|
||||
|
||||
```python
for node in result.document["nodes"]:
    if node["content"]["node_type"] == "table":
        grid = node["content"]["grid"]
        rows, cols = grid["rows"], grid["cols"]
        table = [[None] * cols for _ in range(rows)]
        for cell in grid["cells"]:
            table[cell["row"]][cell["col"]] = cell["content"]
        for row in table:
            print(" | ".join(str(c or "") for c in row))
```
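The loop above writes each cell only at its anchor position, so merged cells leave gaps. One possible way to handle `row_span` and `col_span` is to repeat the content across every covered position, as sketched below. This is a design choice rather than documented behaviour, and the `expand_grid` helper and sample grid are hypothetical.

```python
def expand_grid(grid: dict) -> list[list[str]]:
    """Return a dense rows x cols matrix, repeating merged-cell content across its span."""
    rows, cols = grid["rows"], grid["cols"]
    table = [[""] * cols for _ in range(rows)]
    for cell in grid["cells"]:
        # Clamp spans to the grid so malformed input cannot index out of range
        row_end = min(cell["row"] + cell.get("row_span", 1), rows)
        col_end = min(cell["col"] + cell.get("col_span", 1), cols)
        for r in range(cell["row"], row_end):
            for c in range(cell["col"], col_end):
                table[r][c] = cell["content"]
    return table


sample_grid = {
    "rows": 2,
    "cols": 2,
    "cells": [
        {"content": "Metrics", "row": 0, "col": 0, "row_span": 1, "col_span": 2, "is_header": True},
        {"content": "Accuracy", "row": 1, "col": 0, "row_span": 1, "col_span": 1, "is_header": False},
        {"content": "0.93", "row": 1, "col": 1, "row_span": 1, "col_span": 1, "is_header": False},
    ],
}

for row in expand_grid(sample_grid):
    print(" | ".join(row))
```

Repeating the value keeps every row the same width, which suits CSV or DataFrame export; leaving spanned positions empty would be the better fit when the merged layout itself matters.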
|
||||
|
||||
---
|
||||
|
||||
## PDF Hierarchy Detection
|
||||
|
||||
Classifies PDF text blocks into heading levels (H1–H6) and body text using K-means clustering on font sizes. Clusters are ranked by font size: the cluster with the largest font becomes H1, the next largest H2, and so on.
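To make the ranking idea concrete, here is a toy 1-D K-means over font sizes. It is a simplified sketch and not Kreuzberg's implementation: the initialisation scheme, the iteration count, and the rule that the smallest cluster is treated as body text are all assumptions made for illustration.

```python
def cluster_font_sizes(sizes: list[float], k: int, iterations: int = 20) -> list[float]:
    """Tiny 1-D k-means; returns cluster centres sorted from largest to smallest."""
    distinct = sorted(set(sizes), reverse=True)
    k = min(k, len(distinct))
    # Spread the initial centres evenly over the distinct sizes
    centres = [distinct[i * (len(distinct) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        buckets: list[list[float]] = [[] for _ in centres]
        for size in sizes:
            nearest = min(range(len(centres)), key=lambda i: abs(size - centres[i]))
            buckets[nearest].append(size)
        # Move each centre to the mean of its bucket; keep it if the bucket is empty
        centres = [sum(b) / len(b) if b else c for b, c in zip(buckets, centres)]
    return sorted(centres, reverse=True)


def label_sizes(sizes: list[float], k: int) -> list[str]:
    """Label each size by the rank of its nearest centre (assumption: smallest cluster is body)."""
    centres = cluster_font_sizes(sizes, k)
    labels = []
    for size in sizes:
        rank = min(range(len(centres)), key=lambda i: abs(size - centres[i]))
        labels.append("body" if rank == len(centres) - 1 else f"h{rank + 1}")
    return labels


font_sizes = [24.0, 18.0, 12.0, 12.0, 12.0, 18.5]
print(list(zip(font_sizes, label_sizes(font_sizes, k=3))))
```

Note how 18.0 and 18.5 land in the same cluster: levels reflect relative rank among clusters, not absolute point sizes.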
|
||||
|
||||
### Quick Start
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/config/pdf_hierarchy_config.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/config/pdf_hierarchy_config.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/config/pdf_hierarchy_config.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/config/pdf_hierarchy_config.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/config/pdf_hierarchy_config.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/config/pdf_hierarchy_config.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/config/pdf_hierarchy_config.md"
|
||||
|
||||
### Output
|
||||
|
||||
Hierarchy data is in `result.pages[n].hierarchy`. Each page has a `blocks` list:
|
||||
|
||||
```json
|
||||
{
|
||||
"block_count": 4,
|
||||
"blocks": [
|
||||
{
|
||||
"text": "Chapter 1: Introduction",
|
||||
"level": "h1",
|
||||
"font_size": 24.0,
|
||||
"bbox": [50.0, 100.0, 400.0, 125.0]
|
||||
},
|
||||
{ "text": "Background", "level": "h2", "font_size": 18.0, "bbox": [50.0, 150.0, 300.0, 168.0] },
|
||||
{
|
||||
"text": "This chapter provides...",
|
||||
"level": "body",
|
||||
"font_size": 12.0,
|
||||
"bbox": [50.0, 200.0, 550.0, 450.0]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
- `bbox`: `[left, top, right, bottom]` in PDF points (present when `include_bbox=True`). Hierarchy extraction is the only way to obtain bounding box coordinates for text elements; without it, they are omitted.
|
||||
- `level`: `"h1"` – `"h6"` or `"body"`
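A common use of this output is a table of contents. The sketch below works on plain dictionaries shaped like the JSON above; the `table_of_contents` helper is hypothetical, and deriving the page number from list position is an assumption about how you collected the per-page data.

```python
def table_of_contents(page_hierarchies: list[dict]) -> list[str]:
    """Collect heading blocks from per-page hierarchy dicts into indented TOC lines."""
    toc = []
    for page_number, hierarchy in enumerate(page_hierarchies, start=1):
        for block in hierarchy["blocks"]:
            level = block["level"]
            if level == "body":
                continue  # skip body text
            depth = int(level[1:]) - 1  # "h1" -> 0, "h2" -> 1, ...
            toc.append(f"{'  ' * depth}{block['text']} (p. {page_number})")
    return toc


sample_pages = [
    {
        "block_count": 3,
        "blocks": [
            {"text": "Chapter 1: Introduction", "level": "h1", "font_size": 24.0},
            {"text": "Background", "level": "h2", "font_size": 18.0},
            {"text": "This chapter provides...", "level": "body", "font_size": 12.0},
        ],
    }
]

print("\n".join(table_of_contents(sample_pages)))
```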
|
||||
|
||||
### Configuration
|
||||
|
||||
| Parameter | Type | Default | Description |
|
||||
| ------------------------ | --------------- | ------- | --------------------------------------------------- |
|
||||
| `enabled` | `bool` | `true` | Enable hierarchy extraction |
|
||||
| `k_clusters` | `int` | `6` | Font size clusters (2–10), maps to heading levels |
|
||||
| `include_bbox` | `bool` | `true` | Include bounding box coordinates |
|
||||
| `ocr_coverage_threshold` | `float \| None` | `None` | Trigger OCR if text coverage is below this fraction |
|
||||
|
||||
#### Choosing k_clusters
|
||||
|
||||
| `k_clusters` | Heading levels | Use when |
|
||||
| ------------ | -------------- | --------------------------------------- |
|
||||
| 2–3 | H1–H2 | Simple documents with 1–2 heading sizes |
|
||||
| 4–5 | H1–H4 | Standard documents |
|
||||
| 6 (default) | H1–H6 | Most documents |
|
||||
| 7–8 | H1–H6+ | Books, specs with deep nesting |
|
||||
|
||||
#### ocr_coverage_threshold
|
||||
|
||||
| Threshold | Behavior |
|
||||
| --------- | ------------------------------- |
|
||||
| `None` | OCR never triggered by coverage |
|
||||
| `0.3` | OCR if < 30% of page has text |
|
||||
| `0.5` | OCR if < 50% of page has text |
|
||||
|
||||
Requires an OCR backend to be configured separately.
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
- **`hierarchy` is `None`** — Check that `hierarchy.enabled` is `True`. If the PDF is image-only, enable OCR. If there are fewer text blocks than `k_clusters`, reduce `k_clusters`.
|
||||
- **Most blocks classified as `body`** — Document may use uniform font sizes. Reduce `k_clusters` (try 3–4).
|
||||
- **Heading levels don't match visual inspection** — Levels are assigned by font size rank, not absolute size. Filter on `block.font_size` directly for absolute thresholds.
|
||||
|
||||
See the [HierarchyConfig reference](../reference/configuration.md#hierarchyconfig) for the full parameter list.
|
||||
348
docs/guides/plugins.md
Normal file
@@ -0,0 +1,348 @@
|
||||
# Creating Plugins <span class="version-badge">v4.0.0</span>
|
||||
|
||||
Extend Kreuzberg with custom extractors, post-processors, OCR backends, and validators registered globally for use across all extraction calls.
|
||||
|
||||
!!! Note "Wasm"

    Custom plugins are not supported in Wasm environments. Use Python, Rust, or other native bindings.
|
||||
|
||||
## Plugin Types
|
||||
|
||||
| Type | Purpose | Use case |
|
||||
| --------------------- | --------------------------------- | ---------------------------------------------------------- |
|
||||
| **DocumentExtractor** | Extract content from file formats | New format support, override built-in extractors |
|
||||
| **PostProcessor** | Transform extraction results | Metadata enrichment, content filtering, text normalization |
|
||||
| **OcrBackend** | Perform OCR on images | Cloud OCR services, custom OCR engines |
|
||||
| **Validator** | Validate extraction quality | Minimum content length, quality score thresholds |
|
||||
|
||||
All plugins must be thread-safe (`Send + Sync` in Rust, thread-safe in Python) and implement `initialize()` / `shutdown()` lifecycle methods.
|
||||
|
||||
## Document Extractors
|
||||
|
||||
### Implementation
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/plugin_extractor.md"
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/plugin_extractor.md"
|
||||
|
||||
### Registration
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/extractor_registration.md"
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
--8<-- "snippets/typescript/plugins/custom_extractor_plugin.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/extractor_registration.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/plugins/extractor_registration.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/extractor_registration.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/extractor_registration.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/plugins/extractor_registration.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/plugins/extractor_registration.md"
|
||||
|
||||
### Priority System
|
||||
|
||||
When multiple extractors support the same MIME type, the highest priority wins:
|
||||
|
||||
| Range | Level |
|
||||
| ------ | --------------------------- |
|
||||
| 0–25 | Fallback / low-quality |
|
||||
| 26–49 | Alternative |
|
||||
| **50** | **Default (built-in)** |
|
||||
| 51–75 | Enhanced / premium |
|
||||
| 76–100 | Specialized / high-priority |
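The selection rule can be pictured as a simple maximum over the candidates for a MIME type. The sketch below is illustrative only: `ExtractorInfo` and `select_extractor` are hypothetical names rather than Kreuzberg types, and tie-breaking (here, first registered wins) is an assumption the text above does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractorInfo:
    name: str
    mime_types: tuple[str, ...]
    priority: int = 50  # built-in extractors use the default of 50


def select_extractor(extractors: list[ExtractorInfo], mime_type: str) -> Optional[ExtractorInfo]:
    """Return the highest-priority extractor that supports the MIME type, if any."""
    candidates = [e for e in extractors if mime_type in e.mime_types]
    # max() keeps the first item on ties, i.e. the earliest registered extractor
    return max(candidates, key=lambda e: e.priority, default=None)


registry = [
    ExtractorInfo("builtin-pdf", ("application/pdf",)),
    ExtractorInfo("custom-pdf", ("application/pdf",), priority=80),
    ExtractorInfo("fallback-text", ("text/plain",), priority=10),
]

chosen = select_extractor(registry, "application/pdf")
print(chosen.name if chosen else "no extractor")
```

Registering a custom extractor above 50 is therefore enough to override a built-in one for the same MIME type.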
|
||||
|
||||
## Post-Processors
|
||||
|
||||
Processors execute in three stages:
|
||||
|
||||
- **Early** — Foundational: language detection, quality scoring, text normalization
|
||||
- **Middle** — Transformation: keyword extraction, token reduction, summarization
|
||||
- **Late** — Final: custom metadata, analytics, output formatting
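The staged ordering can be sketched as a stable sort by stage followed by a sequential run. The names below (`Stage`, `run_pipeline`) are hypothetical and only model the idea; the real plugin interfaces are shown in the Implementation snippets.

```python
from enum import IntEnum
from typing import Callable


class Stage(IntEnum):
    EARLY = 0
    MIDDLE = 1
    LATE = 2


Processor = tuple[Stage, Callable[[str], str]]


def run_pipeline(text: str, processors: list[Processor]) -> str:
    """Apply processors in stage order; registration order is kept within a stage."""
    for _stage, process in sorted(processors, key=lambda item: item[0]):
        text = process(text)
    return text


def add_footer(text: str) -> str:
    return text + " [processed]"


processors: list[Processor] = [
    (Stage.LATE, add_footer),   # registered first, but runs last
    (Stage.EARLY, str.strip),   # normalisation runs first
    (Stage.MIDDLE, str.lower),  # transformation runs in between
]

print(run_pipeline("  Hello World  ", processors))
```

Because `sorted` is stable, two processors in the same stage run in the order they were registered.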
|
||||
|
||||
### Implementation
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/word_count_processor.md"
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/word_count_processor.md"
|
||||
|
||||
### Conditional Processing
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/pdf_only_processor.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/metadata/pdf_only_processor.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/plugins/pdf_only_processor.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/pdf_only_processor.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/pdf_only_processor.md"
|
||||
|
||||
## OCR Backends
|
||||
|
||||
### Implementation
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/ocr/cloud_ocr_backend.md"
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/ocr/cloud_ocr_backend.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/ocr/cloud_ocr_backend.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/cloud_ocr_backend.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/ocr/cloud_ocr_backend.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/ocr/cloud_ocr_backend.md"
|
||||
|
||||
### Registration
|
||||
|
||||
Register the backend and set its name in `OcrConfig`:
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python title="Python"
from kreuzberg import (
    ExtractionConfig,
    OcrConfig,
    extract_file_sync,
    register_ocr_backend,
    unregister_ocr_backend,
)

# CloudOcrBackend is the custom backend class from the Implementation section above
backend = CloudOcrBackend(api_key="your-api-key")
register_ocr_backend(backend)

# Select the backend by its registered name
config = ExtractionConfig(ocr=OcrConfig(backend="cloud-ocr", language="eng"))
result = extract_file_sync("scanned.pdf", config=config)

# Remove the backend once it is no longer needed
unregister_ocr_backend("cloud-ocr")
```
|
||||
|
||||
### Using EasyOCR (Built-in)
|
||||
|
||||
The built-in EasyOCR backend supports 80+ languages and GPU acceleration — just point `OcrConfig` at it:
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/ocr/ocr_easyocr.md"
|
||||
|
||||
## Validators
|
||||
|
||||
!!! Warning

    Validation errors cause extraction to fail. Use validators for critical quality checks only.
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/min_length_validator.md"
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/min_length_validator.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/min_length_validator.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/min_length_validator.md"
|
||||
|
||||
### Quality Score Validator
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/quality_score_validator.md"
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/quality_score_validator.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/quality_score_validator.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/quality_score_validator.md"
|
||||
|
||||
## Plugin Management
|
||||
|
||||
### Listing
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/list_plugins.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/list_plugins.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/list_plugins.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/list_plugins.md"
|
||||
|
||||
### Unregistering
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/unregister_plugins.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/unregister_plugins.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/unregister_plugins.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/unregister_plugins.md"
|
||||
|
||||
### Clearing All
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/clear_plugins.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/clear_plugins.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/clear_plugins.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/clear_plugins.md"
|
||||
|
||||
## Thread Safety
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/stateful_plugin.md"
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/stateful_plugin.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/stateful_plugin.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/stateful_plugin.md"
|
||||
|
||||
## Best Practices
|
||||
|
||||
**Naming:** Use kebab-case (`my-custom-plugin`), lowercase only, no spaces or special characters.
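A quick check can catch naming mistakes before registration. The validator below is a sketch, not a Kreuzberg function; allowing digits in a name is an assumption, since the rule above only mentions lowercase letters and hyphens.

```python
import re

# Lowercase words separated by single hyphens; digits are allowed (assumption)
_PLUGIN_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_plugin_name(name: str) -> bool:
    """Return True if the name is kebab-case with no spaces or special characters."""
    return _PLUGIN_NAME.fullmatch(name) is not None


for candidate in ("my-custom-plugin", "My Plugin", "my_plugin", "cloud-ocr"):
    print(f"{candidate!r}: {is_valid_plugin_name(candidate)}")
```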
|
||||
|
||||
### Logging
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/plugin_logging.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/plugin_logging.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/plugin_logging.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/plugin_logging.md"
|
||||
|
||||
### Testing
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/plugins/plugin_testing.md"
|
||||
|
||||
=== "Rust"
|
||||
|
||||
--8<-- "snippets/rust/plugins/plugin_testing.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/plugin_testing.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/plugin_testing.md"
|
||||
|
||||
## Complete Example: PDF Metadata Extractor
|
||||
|
||||
=== "Python"
|
||||
|
||||
--8<-- "snippets/python/metadata/pdf_metadata_extractor.md"
|
||||
|
||||
=== "Go"
|
||||
|
||||
--8<-- "snippets/go/plugins/pdf_metadata_extractor.md"
|
||||
|
||||
=== "Java"
|
||||
|
||||
--8<-- "snippets/java/plugins/pdf_metadata_extractor.md"
|
||||
|
||||
=== "C#"
|
||||
|
||||
--8<-- "snippets/csharp/pdf_metadata_extractor.md"
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
--8<-- "snippets/ruby/plugins/pdf_metadata_extractor.md"
|
||||
|
||||
=== "R"
|
||||
|
||||
--8<-- "snippets/r/plugins/pdf_metadata_extractor.md"
|
||||