Nomad changes
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s

This commit is contained in:
Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

543
docs/guides/advanced.md Normal file

@@ -0,0 +1,543 @@
# Advanced Features
## Text Chunking
Split extracted text into chunks for RAG, vector databases, or LLM context windows. Four strategies:
- **Text** — splits on whitespace/punctuation boundaries
- **Markdown** — structure-aware; preserves headings, lists, and code blocks
- **YAML** — section-aware; preserves YAML document structure
- **Semantic** — topic-aware; splits at natural document boundaries
### Semantic
Set `chunker_type` to `"semantic"`. Uses an embedding model for topic detection when one is configured; otherwise falls back to structural heuristics.
```python
from kreuzberg import ChunkingConfig, ExtractionConfig

config = ExtractionConfig(
    chunking=ChunkingConfig(chunker_type="semantic")
)
```
**Behavior:**
- **Without embeddings** — Uses structural heuristics: detects headers (ALL CAPS, numbered sections) and paragraph boundaries
- **With embeddings** — Compares consecutive paragraphs via embeddings to detect topic shifts, merging paragraphs below the `topic_threshold` (default: 0.5)
Use `topic_threshold` to control sensitivity: higher values (0.7–0.9) preserve more fine-grained topics, while lower values (0.1–0.3) merge more aggressively. The threshold only applies when an embedding model is configured.
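
To make the effect of `topic_threshold` concrete, here is a minimal sketch in plain Python. It assumes a boundary is placed wherever the cosine similarity of two consecutive paragraph vectors falls below the threshold, which matches the sensitivity description above; the library's actual rule may differ. The helper names and the toy two-dimensional vectors are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_by_topic(paragraphs, vectors, topic_threshold=0.5):
    """Group consecutive paragraphs; start a new chunk when similarity
    to the previous paragraph drops below the threshold (assumed rule)."""
    if not paragraphs:
        return []
    chunks = [[paragraphs[0]]]
    for i in range(1, len(paragraphs)):
        if cosine(vectors[i - 1], vectors[i]) >= topic_threshold:
            chunks[-1].append(paragraphs[i])
        else:
            chunks.append([paragraphs[i]])
    return chunks


# Toy data: the first two paragraphs are close, the third is further away.
paragraphs = ["cats", "kittens", "tax law"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

for threshold in (0.5, 0.7, 0.9):
    print(threshold, split_by_topic(paragraphs, vectors, threshold))
```

With these toy vectors, raising the threshold yields more and smaller chunks, and lowering it merges everything into one.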
### Configuration
=== "Python"
--8<-- "snippets/python/config/chunking_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/chunking_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/chunking_config.md"
=== "Go"
--8<-- "snippets/go/config/chunking_config.md"
=== "Java"
--8<-- "snippets/java/config/chunking_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/chunking_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/chunking_config.md"
=== "R"
--8<-- "snippets/r/config/chunking_config.md"
=== "Wasm"
--8<-- "snippets/wasm/config/chunking_config.md"
### Chunk Output
Each chunk in `result.chunks` contains:
| Field | Description |
| --------------------------------------- | ------------------------------------------------ |
| `content` | Chunk text |
| `metadata.byte_start` / `byte_end` | Byte offsets in the original text |
| `metadata.chunk_index` / `total_chunks` | Position in sequence |
| `metadata.token_count` | Token count (when embeddings enabled) |
| `metadata.heading_context` | Active heading hierarchy (Markdown chunker only) |
| `embedding` | Embedding vector (when configured) |
Chunks can be sized by token count instead of characters — enable the `chunking-tokenizers` feature and set `sizing` to `token`.
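
Because `byte_start` and `byte_end` are byte offsets, they do not line up with Python string indices once the text contains non-ASCII characters. The sketch below shows one way to map them back, assuming the offsets refer to the UTF-8 encoding of the original text and that `byte_end` is exclusive; both are assumptions, and `slice_by_bytes` is a hypothetical helper.

```python
def slice_by_bytes(text, byte_start, byte_end):
    """Return the part of `text` covered by UTF-8 byte offsets [byte_start, byte_end)."""
    data = text.encode("utf-8")
    return data[byte_start:byte_end].decode("utf-8")


original = "Größe: 10 µm. Next sentence."

# The first sentence is 13 characters but 16 bytes, because
# 'ö', 'ß' and 'µ' each take two bytes in UTF-8.
print(slice_by_bytes(original, 0, 16))  # byte-based slice
print(original[0:16])                   # character-based slice overshoots
```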
### RAG Pipeline Example
=== "Python"
--8<-- "snippets/python/utils/chunking_rag.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/chunking_rag.md"
=== "Rust"
--8<-- "snippets/rust/advanced/chunking_rag.md"
=== "Go"
--8<-- "snippets/go/advanced/chunking_rag.md"
=== "Java"
--8<-- "snippets/java/advanced/chunking_rag.md"
=== "C#"
--8<-- "snippets/csharp/advanced/chunking_rag.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/chunking_rag.md"
=== "R"
--8<-- "snippets/r/advanced/chunking_rag.md"
## Language Detection
Detect languages in extracted text using [`whatlang`](https://crates.io/crates/whatlang) — 60+ languages with ISO 639-3 codes. Set `detect_multiple: true` to chunk the text into 200-char segments and return all detected languages sorted by prevalence.
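
The segment-and-tally idea behind `detect_multiple` can be sketched as follows. This is not the library's implementation: `detect_stub` is a hypothetical stand-in for a real detector such as `whatlang`, and ranking by segment count is an assumption about what "prevalence" means.

```python
from collections import Counter


def segment(text, size=200):
    """Split text into consecutive segments of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def detect_stub(segment_text):
    """Hypothetical detector: flags German by umlauts, otherwise English."""
    return "deu" if any(ch in segment_text for ch in "äöüß") else "eng"


def detect_languages(text, size=200):
    """Detect a language per segment and rank languages by segment count."""
    counts = Counter(detect_stub(part) for part in segment(text, size))
    return [language for language, _ in counts.most_common()]


sample = "a" * 400 + "ü" * 200  # two "English" segments, one "German" segment
print(detect_languages(sample))
```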
### Configuration
=== "Python"
--8<-- "snippets/python/config/language_detection_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/language_detection_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/language_detection_config.md"
=== "Go"
--8<-- "snippets/go/config/language_detection_config.md"
=== "Java"
--8<-- "snippets/java/config/language_detection_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/language_detection_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/language_detection_config.md"
=== "R"
--8<-- "snippets/r/config/language_detection_config.md"
### Multilingual Example
=== "Python"
--8<-- "snippets/python/utils/language_detection_multilingual.md"
=== "TypeScript"
--8<-- "snippets/typescript/metadata/language_detection_multilingual.md"
=== "Rust"
--8<-- "snippets/rust/advanced/language_detection_multilingual.md"
=== "Go"
--8<-- "snippets/go/advanced/language_detection_multilingual.md"
=== "Java"
--8<-- "snippets/java/advanced/language_detection_multilingual.md"
=== "C#"
--8<-- "snippets/csharp/advanced/language_detection_multilingual.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/language_detection_multilingual.md"
=== "R"
--8<-- "snippets/r/advanced/language_detection_multilingual.md"
## Embedding Generation
Local in-process embeddings via ONNX for semantic search and RAG — no external API calls. Requires the `embeddings` feature.
| Preset | Model | Dimensions | Max Tokens | Use Case |
| -------------- | ---------------------------- | ---------- | ---------- | ------------------------------------------------------- |
| `fast` | all-MiniLM-L6-v2 (quantized) | 384 | 512 | Quick prototyping, development, resource-constrained |
| `balanced` | BGE-base-en-v1.5 | 768 | 1024 | General-purpose RAG, production deployments, English |
| `quality` | BGE-large-en-v1.5 | 1024 | 2000 | Complex documents, maximum accuracy, sufficient compute |
| `multilingual` | multilingual-e5-base | 768 | 1024 | International documents, mixed-language content |
### In-Process Embedding Backends (Plugin Variant)
Plug a caller-managed embedder (e.g. `llama-cpp-python`, `sentence-transformers`) into Kreuzberg via the `Plugin` variant of `EmbeddingModelType` — Kreuzberg calls back into the registered backend instead of running its own ONNX model.
1. Register the backend once at startup via `kreuzberg::plugins::register_embedding_backend(Arc::new(MyEmbedder))`. The backend implements `EmbeddingBackend` (a `Plugin`-inheriting async trait with `dimensions()` and `embed(texts) -> Vec<Vec<f32>>`).
2. Reference it by name in `EmbeddingConfig`: `{ "model": { "type": "plugin", "name": "my-embedder" } }`.
3. Optional: set `EmbeddingConfig.max_embed_duration_secs` (default 60) to bound the wait on a hung backend; `None` disables the timeout.
The CLI (`kreuzberg embed --provider plugin --plugin my-embedder`), MCP server (`embed_text` tool, `embedding_plugin` parameter), REST API, and env var `KREUZBERG_EMBEDDING_PLUGIN_NAME` all accept the Plugin variant once a backend is registered.
**Fork-safety**: Python callers running under `multiprocessing`, `gunicorn`'s prefork worker, or Celery prefork must re-register the backend in each child process — native-backed embedders (including `llama-cpp-python`) aren't fork-safe. Use `os.register_at_fork(after_in_child=reregister_fn)` to automate the re-registration.
### Configuration
=== "Python"
--8<-- "snippets/python/utils/embedding_with_chunking.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/embedding_with_chunking.md"
=== "Rust"
--8<-- "snippets/rust/advanced/embedding_with_chunking.md"
=== "Go"
--8<-- "snippets/go/advanced/embedding_with_chunking.md"
=== "Java"
--8<-- "snippets/java/advanced/embedding_with_chunking.md"
=== "C#"
--8<-- "snippets/csharp/advanced/embedding_with_chunking.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/embedding_with_chunking.md"
=== "R"
--8<-- "snippets/r/advanced/embedding_with_chunking.md"
### Vector Database Integration
=== "Python"
--8<-- "snippets/python/utils/vector_database_integration.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/vector_database_integration.md"
=== "Rust"
--8<-- "snippets/rust/advanced/vector_database_integration.md"
=== "Go"
--8<-- "snippets/go/advanced/vector_database_integration.md"
=== "Java"
--8<-- "snippets/java/advanced/vector_database_integration.md"
=== "C#"
--8<-- "snippets/csharp/advanced/vector_database_integration.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/vector_database_integration.md"
=== "R"
--8<-- "snippets/r/advanced/vector_database_integration.md"
## Token Reduction
Reduce token count while preserving meaning for LLM pipelines.
| Level | Reduction | Effect |
| ------------ | --------- | ---------------------------------------- |
| `off` | 0% | Pass-through |
| `moderate`   | 15–25%    | Stopwords + redundancy removal           |
| `aggressive` | 30–50%    | Semantic clustering + importance scoring |
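
For budgeting, the reduction ranges translate into an expected remaining token count. The sketch below is plain arithmetic on the 15–25% and 30–50% ranges from the table; `expected_tokens` is a hypothetical helper, and actual savings depend on the input text.

```python
# Fraction of tokens removed per level, as (low, high).
REDUCTION_RANGES = {
    "off": (0.0, 0.0),
    "moderate": (0.15, 0.25),
    "aggressive": (0.30, 0.50),
}


def expected_tokens(token_count, level):
    """Return (fewest, most) tokens expected to remain after reduction."""
    low, high = REDUCTION_RANGES[level]
    return round(token_count * (1 - high)), round(token_count * (1 - low))


for level in REDUCTION_RANGES:
    print(level, expected_tokens(10_000, level))
```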
### Configuration
=== "Python"
--8<-- "snippets/python/config/token_reduction_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/token_reduction_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/token_reduction_config.md"
=== "Go"
--8<-- "snippets/go/config/token_reduction_config.md"
=== "Java"
--8<-- "snippets/java/config/token_reduction_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/token_reduction_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/token_reduction_config.md"
=== "R"
--8<-- "snippets/r/config/token_reduction_config.md"
### Example
=== "Python"
--8<-- "snippets/python/utils/token_reduction_example.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/token_reduction_example.md"
=== "Rust"
--8<-- "snippets/rust/advanced/token_reduction_example.md"
=== "Go"
--8<-- "snippets/go/advanced/token_reduction_example.md"
=== "Java"
--8<-- "snippets/java/advanced/token_reduction_example.md"
=== "C#"
--8<-- "snippets/csharp/advanced/token_reduction_example.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/token_reduction_example.md"
=== "R"
--8<-- "snippets/r/advanced/token_reduction_example.md"
## Keyword Extraction
Extract keywords using YAKE or RAKE algorithms. Requires the `keywords` feature flag. See [Keyword Extraction](keywords.md) for algorithm details and parameter reference.
### Configuration
=== "Python"
--8<-- "snippets/python/config/keyword_extraction_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/keyword_extraction_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/keyword_extraction_config.md"
=== "Go"
--8<-- "snippets/go/config/keyword_extraction_config.md"
=== "Java"
--8<-- "snippets/java/config/keyword_extraction_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/keyword_extraction_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/keyword_extraction_config.md"
=== "R"
--8<-- "snippets/r/config/keyword_extraction_config.md"
### Example
=== "Python"
--8<-- "snippets/python/utils/keyword_extraction_example.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/keyword_extraction_example.md"
=== "Rust"
--8<-- "snippets/rust/advanced/keyword_extraction_example.md"
=== "Go"
--8<-- "snippets/go/advanced/keyword_extraction_example.md"
=== "Java"
--8<-- "snippets/java/advanced/keyword_extraction_example.md"
=== "C#"
--8<-- "snippets/csharp/advanced/keyword_extraction_example.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/keyword_extraction_example.md"
=== "R"
--8<-- "snippets/r/advanced/keyword_extraction_example.md"
## Quality Processing
Score extracted text for quality issues (0.0–1.0, where 1.0 is highest quality). Detects OCR artifacts, script content, navigation elements, and structural issues.
| Factor | Weight | Detects |
| ------------------- | ------ | ------------------------------------------------------ |
| OCR Artifacts | 30% | Scattered chars, repeated punctuation, malformed words |
| Script Content | 20% | JavaScript, CSS, HTML tags |
| Navigation Elements | 10% | Breadcrumbs, pagination, skip links |
| Document Structure | 20% | Sentence/paragraph length, punctuation distribution |
| Metadata Quality | 10% | Presence of title, author, subject |
Score ranges: `0.0–0.3` very low, `0.3–0.6` low, `0.6–0.8` moderate, `0.8–1.0` high.
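
The ranges share their endpoints, so a small helper has to pick a side. This sketch assumes each range includes its lower bound and excludes its upper bound (except the last); that convention is an assumption, and `quality_label` is a hypothetical name.

```python
def quality_label(score):
    """Map a quality score in [0.0, 1.0] to the label of its range."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < 0.3:
        return "very low"
    if score < 0.6:
        return "low"
    if score < 0.8:
        return "moderate"
    return "high"


for value in (0.1, 0.3, 0.75, 1.0):
    print(value, quality_label(value))
```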
### Configuration
=== "Python"
--8<-- "snippets/python/config/quality_processing_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/quality_processing_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/quality_processing_config.md"
=== "Go"
--8<-- "snippets/go/config/quality_processing_config.md"
=== "Java"
--8<-- "snippets/java/config/quality_processing_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/quality_processing_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/quality_processing_config.md"
=== "R"
--8<-- "snippets/r/config/quality_processing_config.md"
### Example
=== "Python"
--8<-- "snippets/python/utils/quality_processing_example.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/quality_processing_example.md"
=== "Rust"
--8<-- "snippets/rust/advanced/quality_processing_example.md"
=== "Go"
--8<-- "snippets/go/advanced/quality_processing_example.md"
=== "Java"
--8<-- "snippets/java/advanced/quality_processing_example.md"
=== "C#"
--8<-- "snippets/csharp/advanced/quality_processing_example.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/quality_processing_example.md"
=== "R"
--8<-- "snippets/r/advanced/quality_processing_example.md"
## Combining Features
=== "Python"
--8<-- "snippets/python/advanced/combining_all_features.md"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/combining_all_features.md"
=== "Rust"
--8<-- "snippets/rust/api/combining_all_features.md"
=== "Go"
--8<-- "snippets/go/api/combining_all_features.md"
=== "Java"
--8<-- "snippets/java/api/combining_all_features.md"
=== "C#"
--8<-- "snippets/csharp/advanced/combining_all_features.md"
=== "Ruby"
--8<-- "snippets/ruby/api/combining_all_features.md"
=== "R"
--8<-- "snippets/r/api/combining_all_features.md"

101
docs/guides/agent-skills.md Normal file

@@ -0,0 +1,101 @@
# AI Coding Assistants <span class="version-badge new">v4.2.15</span>
Kreuzberg ships with an [Agent Skill](https://agentskills.io) that teaches AI coding assistants how to use the library correctly — covering extraction, configuration, OCR, chunking, embeddings, batch processing, error handling, and plugins across Python, Node.js/TypeScript, Rust, and CLI.
## Supported Assistants
Works with any tool supporting the [Agent Skills](https://agentskills.io) standard: Claude Code, Codex, Gemini CLI, Cursor, Visual Studio Code (with AI extensions), Amp, Goose, and Roo Code.
## Installing
```bash title="Terminal"
# Install into current project (recommended)
npx skills add kreuzberg-dev/kreuzberg
# Install globally
npx skills add kreuzberg-dev/kreuzberg -g
```
Or copy manually:
```bash title="Terminal"
cp -r path/to/kreuzberg/skills/kreuzberg .claude/skills/kreuzberg
```
## What the Skill Provides
When your AI coding assistant discovers the skill, it knows:
- All extraction functions and their correct signatures across languages
- Configuration field names (for example, `max_chars` not `max_characters` in Python)
- Rust feature gates (for example, `tokio-runtime` for sync functions)
- Language-specific conventions (snake_case in Python/Rust, camelCase in Node.js)
- Error handling patterns for each language
### Skill Structure
```text
skills/kreuzberg/
├── SKILL.md                    # Main skill (~400 lines)
└── references/
    ├── python-api.md           # Complete Python API
    ├── nodejs-api.md           # Complete Node.js API
    ├── rust-api.md             # Complete Rust API
    ├── cli-reference.md        # All CLI commands and flags
    ├── configuration.md        # Config file formats and schema
    ├── supported-formats.md    # All 90+ supported formats
    ├── advanced-features.md    # Plugins, embeddings, MCP, security
    └── other-bindings.md       # Go, Ruby, Java, C#, PHP, Elixir
```
The main file stays under 500 lines for efficient AI consumption. Reference files load on demand.
## Quick Examples
=== "Python"
```python
from kreuzberg import extract_file, extract_file_sync, ExtractionConfig, OcrConfig

result = await extract_file("document.pdf")
print(result.content)

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng"),
    output_format="markdown",
)
result = await extract_file("document.pdf", config=config)
```
=== "Node.js"
```typescript
import { extractFile, extractFileSync } from '@kreuzberg/node';
const result = await extractFile('document.pdf');
console.log(result.content);
```
=== "Rust"
```rust
use kreuzberg::{extract_file, ExtractionConfig};
let config = ExtractionConfig::default();
let result = extract_file("document.pdf", None, &config).await?;
```
=== "CLI"
```bash
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json --output-format markdown
```
## Further Reading
- [Agent Skills Standard](https://agentskills.io) — the open standard
- [Extraction Basics](extraction.md) — core extraction API
- [Configuration](configuration.md) — all configuration options
- [Advanced Features](advanced.md) — chunking, embeddings, language detection
- [Plugin System](plugins.md) — custom plugins

391
docs/guides/api-server.md Normal file

@@ -0,0 +1,391 @@
# API Server <span class="version-badge">v4.0.0</span>
Kreuzberg runs as an HTTP REST API server (`kreuzberg serve`) or as an MCP server (`kreuzberg mcp`) for AI agent integration.
## HTTP REST API
### Start
=== "CLI"
--8<-- "snippets/api_server/cli.md"
=== "Docker"
--8<-- "snippets/api_server/docker.md"
=== "Python"
--8<-- "snippets/api_server/python.md"
=== "Rust"
--8<-- "snippets/api_server/rust.md"
=== "Go"
--8<-- "snippets/api_server/go.md"
=== "Java"
--8<-- "snippets/api_server/java.md"
=== "C#"
--8<-- "snippets/api_server/csharp.md"
### Endpoints
#### POST /extract
Extract text from uploaded files via multipart form data.
| Field | Required | Description |
| --------------- | ---------------- | ------------------------------------------------ |
| `files` | Yes (repeatable) | Files to extract |
| `config` | No | JSON config overrides |
| `output_format` | No | `plain` (default), `markdown`, `djot`, or `html` |
```bash title="Terminal"
# Single file
curl -F "files=@document.pdf" http://localhost:8000/extract
# Multiple files
curl -F "files=@doc1.pdf" -F "files=@doc2.docx" http://localhost:8000/extract
# With config overrides
curl -F "files=@scanned.pdf" \
-F 'config={"ocr":{"language":"eng"},"force_ocr":true}' \
http://localhost:8000/extract
```
```json title="Response"
[
  {
    "content": "Extracted text...",
    "mime_type": "application/pdf",
    "metadata": { "page_count": 10, "author": "John Doe" },
    "tables": [],
    "detected_languages": ["eng"],
    "chunks": null,
    "images": null
  }
]
```
#### POST /embed
Generate vector embeddings. Requires the `embeddings` feature.
| Field | Required | Description |
| -------- | -------- | -------------------------- |
| `texts` | Yes | Array of strings |
| `config` | No | Embedding config overrides |
```bash title="Terminal"
curl -X POST http://localhost:8000/embed \
-H "Content-Type: application/json" \
-d '{"texts":["Hello world","Second text"]}'
```
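
The same request can be built with Python's standard library. This is a sketch rather than an official client: `build_embed_request` is a hypothetical helper, the base URL is the local default used in the `curl` example, and the response is left unparsed because its shape is not described here.

```python
import json
import urllib.request


def build_embed_request(texts, base_url="http://localhost:8000"):
    """Build (but do not send) a JSON POST request for the /embed endpoint."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_embed_request(["Hello world", "Second text"])

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     payload = json.load(response)
```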
| Preset | Dimensions | Model |
| -------------------- | ---------- | ------------------ |
| `fast` | 384 | AllMiniLML6V2Q |
| `balanced` (default) | 768 | BGEBaseENV15 |
| `quality` | 1024 | BGELargeENV15 |
| `multilingual` | 768 | MultilingualE5Base |
#### POST /chunk
Chunk text for RAG pipelines.
| Field | Required | Description |
| ----------------------- | -------- | ----------------------------------------------------------- |
| `text` | Yes | Text to chunk |
| `chunker_type` | No | `"text"` (default), `"markdown"`, `"yaml"`, or `"semantic"` |
| `config.max_characters` | No | Max chars per chunk (default: 2000) |
| `config.overlap` | No | Overlap between chunks (default: 100) |
```bash title="Terminal"
curl -X POST http://localhost:8000/chunk \
-H "Content-Type: application/json" \
-d '{"text":"Long text...","chunker_type":"text","config":{"max_characters":1000,"overlap":50}}'
```
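
To illustrate how `max_characters` and `overlap` interact, here is a deliberately naive fixed-window splitter. It is not the algorithm the server uses (the `text` chunker splits on whitespace and punctuation boundaries); it only shows how the two numbers determine window size and step.

```python
def naive_chunks(text, max_characters=2000, overlap=100):
    """Split text into windows of max_characters that share `overlap` characters."""
    if overlap >= max_characters:
        raise ValueError("overlap must be smaller than max_characters")
    step = max_characters - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


print(naive_chunks("abcdefghij", max_characters=4, overlap=1))
```

Each chunk starts `max_characters - overlap` characters after the previous one, so the last `overlap` characters of one chunk reappear at the start of the next.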
=== "Python"
--8<-- "snippets/python/api/client_chunk_text.md"
=== "TypeScript"
--8<-- "snippets/typescript/api/client_chunk_text.md"
=== "Rust"
--8<-- "snippets/rust/api/client_chunk_text.md"
=== "Go"
--8<-- "snippets/go/api/client_chunk_text.md"
=== "Java"
--8<-- "snippets/java/api/client_chunk_text.md"
=== "C#"
--8<-- "snippets/csharp/client_chunk_text.md"
=== "Ruby"
--8<-- "snippets/ruby/api/client_chunk_text.md"
#### POST /extract-structured <span class="version-badge">v4.8.0</span>
Extract typed JSON from a document by running an LLM against the extracted text with a JSON schema (requires `liter-llm` feature; `multipart/form-data` request).
| Field | Required | Description |
| ------------------- | -------- | -------------------------------------------------------------------------------------------------- |
| `file` (or `files`) | Yes | The document to extract from |
| `schema` | Yes | JSON Schema string describing the structured output |
| `model` | Yes | LLM model identifier, for example `openai/gpt-4o` or `anthropic/claude-sonnet-4-20250514` |
| `api_key` | No | LLM provider API key. Falls back to provider env vars (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, ...) |
| `prompt` | No | Custom Jinja2 prompt template overriding the default |
| `schema_name` | No | Schema identifier (default: `extraction`) |
| `strict` | No | `"true"` / `"false"` — enable OpenAI strict mode for exact schema matching |
| `config` | No | Extraction config overrides as a JSON string |
```bash title="Terminal"
curl -X POST http://localhost:8000/extract-structured \
-F "file=@invoice.pdf" \
-F 'schema={"type":"object","properties":{"invoice_number":{"type":"string"},"total":{"type":"number"}},"required":["invoice_number","total"]}' \
-F "model=openai/gpt-4o" \
-F "api_key=$OPENAI_API_KEY" \
-F "strict=true"
```
```json title="Response"
{
  "structured_output": {
    "invoice_number": "INV-2026-0142",
    "total": 1284.5
  },
  "content": "Invoice INV-2026-0142...",
  "mime_type": "application/pdf"
}
```
Errors follow the same shape as `/extract`. A `501` response indicates the server was built without `liter-llm`.
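
On the client side it can be useful to double-check that `structured_output` contains every key the schema marks as required, particularly when `strict` is off. The sketch below uses the schema and response from the example above; `missing_required` is a hypothetical helper and only checks top-level keys, not types or nested objects.

```python
import json

schema = json.loads(
    '{"type":"object","properties":{"invoice_number":{"type":"string"},'
    '"total":{"type":"number"}},"required":["invoice_number","total"]}'
)

response = {
    "structured_output": {"invoice_number": "INV-2026-0142", "total": 1284.5},
    "content": "Invoice INV-2026-0142...",
    "mime_type": "application/pdf",
}


def missing_required(schema, output):
    """Return the required top-level keys that are absent from the output."""
    return [key for key in schema.get("required", []) if key not in output]


print(missing_required(schema, response["structured_output"]))
```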
#### Other Endpoints
| Endpoint | Method | Description |
| ----------------- | ------ | ------------------------------------------------------------------------- |
| `/health` | GET | `{"status":"healthy","version":"4.6.3"}` |
| `/version` | GET | `{"version":"4.6.3"}` <span class="version-badge">v4.5.2</span> |
| `/detect` | POST | MIME type detection (multipart) <span class="version-badge">v4.5.2</span> |
| `/cache/stats` | GET | Cache statistics |
| `/cache/warm` | POST | Pre-download models <span class="version-badge">v4.5.2</span> |
| `/cache/manifest` | GET | Model manifest with checksums <span class="version-badge">v4.5.2</span> |
| `/cache/clear` | DELETE | Clear all cached files |
| `/info` | GET | `{"version":"...","rust_backend":true}` |
| `/openapi.json` | GET | OpenAPI 3.0 schema |
### Client Examples
=== "Python"
--8<-- "snippets/python/api/client_extract_single_file.md"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/client_extract_single_file.md"
=== "Rust"
--8<-- "snippets/rust/api/client_extract_single_file.md"
=== "Go"
--8<-- "snippets/go/api/client_extract_single_file.md"
=== "Java"
--8<-- "snippets/java/api/client_extract_single_file.md"
=== "C#"
--8<-- "snippets/csharp/client_extract_single_file.md"
=== "Ruby"
--8<-- "snippets/ruby/api/client_extract_single_file.md"
### Error Handling
```json title="Error response"
{
  "error_type": "ValidationError",
  "message": "Invalid file format",
  "status_code": 400
}
```
| Status | Error type | Meaning |
| ------ | -------------------------- | ----------------- |
| 400 | `ValidationError` | Invalid input |
| 422 | `ParsingError`, `OcrError` | Processing failed |
| 500 | Internal errors | Server errors |
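
A client can branch on `status_code` from the error body. The following sketch parses the documented error shape and maps each status class to a suggested next step; the suggested actions are illustrative assumptions, not behavior prescribed by the server.

```python
import json


def classify_error(body):
    """Parse an error response body and suggest how a client might react."""
    error = json.loads(body)
    status = error["status_code"]
    if status == 400:
        action = "fix the request"           # invalid input
    elif status == 422:
        action = "inspect the document"      # parsing or OCR failed
    elif status >= 500:
        action = "retry later"               # server-side problem
    else:
        action = "unexpected status"
    return error["error_type"], action


body = '{"error_type": "ValidationError", "message": "Invalid file format", "status_code": 400}'
print(classify_error(body))
```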
=== "Python"
--8<-- "snippets/python/utils/error_handling_extract.md"
=== "TypeScript"
--8<-- "snippets/typescript/api/error_handling_extract.md"
=== "Rust"
--8<-- "snippets/rust/api/error_handling_extract.md"
=== "Go"
--8<-- "snippets/go/api/error_handling_extract.md"
=== "Java"
--8<-- "snippets/java/api/error_handling_extract.md"
=== "C#"
--8<-- "snippets/csharp/error_handling_extract.md"
=== "Ruby"
--8<-- "snippets/ruby/api/error_handling_extract.md"
### Configuration
The server discovers `kreuzberg.toml` in the current and parent directories. Pass `--config path/to/file` to use a different file.
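
Discovery of this kind usually means walking from the working directory up through its ancestors. The sketch below shows that pattern with `pathlib`; it assumes the nearest file wins, and `find_config` is a hypothetical helper rather than the server's actual lookup code.

```python
from pathlib import Path


def find_config(start, filename="kreuzberg.toml"):
    """Return the nearest `filename` in `start` or any parent directory, else None."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None


print(find_config(Path.cwd()))
```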
| Variable | Default | Description |
| ------------------------------ | ------- | ------------------------------- |
| `KREUZBERG_MAX_UPLOAD_SIZE_MB` | `100` | Max upload size in MB |
| `KREUZBERG_CORS_ORIGINS` | `*` | Comma-separated allowed origins |
!!! Warning
    Default CORS allows all origins. Set `KREUZBERG_CORS_ORIGINS` explicitly in production.
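
As an illustration of how these two variables might be interpreted, the sketch below reads them from a mapping and applies the documented defaults. `load_server_settings` is a hypothetical helper, and treating one MB as 1024 × 1024 bytes is an assumption.

```python
import os


def load_server_settings(env=None):
    """Read upload and CORS settings from an environment-like mapping."""
    env = os.environ if env is None else env
    max_mb = int(env.get("KREUZBERG_MAX_UPLOAD_SIZE_MB", "100"))
    raw_origins = env.get("KREUZBERG_CORS_ORIGINS", "*")
    origins = [origin.strip() for origin in raw_origins.split(",") if origin.strip()]
    return {
        "max_upload_bytes": max_mb * 1024 * 1024,  # assumes binary megabytes
        "cors_origins": origins,
        "cors_open": origins == ["*"],  # True means every origin is allowed
    }


print(load_server_settings({"KREUZBERG_CORS_ORIGINS": "https://a.example, https://b.example"}))
```

A deployment check could refuse to start when `cors_open` is true outside development.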
See [Configuration Guide](configuration.md) for all options.
---
## MCP Server
### Start
```bash title="Terminal"
kreuzberg mcp
kreuzberg mcp --config kreuzberg.toml
```
=== "Python"
--8<-- "snippets/python/mcp/mcp_server_start.md"
=== "TypeScript"
--8<-- "snippets/typescript/mcp/mcp_server_start.md"
=== "Rust"
--8<-- "snippets/rust/mcp/mcp_server_start.md"
=== "Go"
--8<-- "snippets/go/mcp/mcp_server_start.md"
=== "Java"
--8<-- "snippets/java/mcp/mcp_server_start.md"
=== "C#"
--8<-- "snippets/csharp/mcp_server_start.md"
=== "Ruby"
--8<-- "snippets/ruby/mcp/mcp_server_start.md"
### Tools
| Tool | Key parameters | Description |
| --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| `extract_file` | `path` | Extract from file path |
| `extract_bytes` | `data` (base64) | Extract from encoded bytes |
| `batch_extract_files` | `paths` | Extract multiple files |
| `detect_mime_type` | `path` | Detect file format |
| `list_formats` | — | List supported formats <span class="version-badge">v4.5.2</span> |
| `get_version` | — | Library version <span class="version-badge">v4.5.2</span> |
| `cache_stats` | — | Cache usage |
| `cache_clear` | — | Remove cached files |
| `cache_manifest` | — | Model checksums <span class="version-badge">v4.5.2</span> |
| `cache_warm` | — | Pre-download models <span class="version-badge">v4.5.2</span> |
| `embed_text` | `texts` | Generate embeddings <span class="version-badge">v4.5.2</span> |
| `chunk_text` | `text` | Split text <span class="version-badge">v4.5.2</span> |
| `extract_structured` | `path`, `schema`, `model`; optional `schema_name` (default `"extraction"`), `schema_description`, `prompt`, `api_key`, `strict` (default `false`) | Extract structured JSON via LLM <span class="version-badge">v4.8.0</span> |
All tools accept an optional `config` object. `extract_file` and `extract_bytes` also accept `pdf_password`. `extract_structured` requires the server to be built with the `liter-llm` feature; see the row above for optional fields and defaults.
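
MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a message for `extract_file`; `tool_call` is a hypothetical helper, and a real client would normally use an MCP SDK that also handles the initialization handshake and transport.

```python
import json


def tool_call(request_id, tool, arguments):
    """Build a JSON-RPC 2.0 `tools/call` message for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


message = tool_call(1, "extract_file", {"path": "document.pdf"})
print(json.dumps(message, indent=2))
```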
### AI Agent Integration
=== "Claude Desktop"
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "kreuzberg": {
      "command": "kreuzberg",
      "args": ["mcp"]
    }
  }
}
```
=== "Python"
--8<-- "snippets/python/mcp/mcp_custom_client.md"
=== "LangChain"
--8<-- "snippets/python/mcp/mcp_langchain_integration.md"
=== "TypeScript"
--8<-- "snippets/typescript/mcp/mcp_custom_client.md"
=== "Rust"
--8<-- "snippets/rust/mcp/mcp_custom_client.md"
=== "Go"
--8<-- "snippets/go/mcp/mcp_custom_client.md"
=== "Java"
--8<-- "snippets/java/mcp/mcp_client.md"
=== "C#"
--8<-- "snippets/csharp/mcp_custom_client.md"
=== "Ruby"
--8<-- "snippets/ruby/mcp/mcp_custom_client.md"
---
For Docker and Kubernetes deployment, see [Docker Guide](docker.md) and [Kubernetes Guide](kubernetes.md).


@@ -0,0 +1,249 @@
# Code Intelligence
Kreuzberg integrates [tree-sitter-language-pack](https://docs.tree-sitter-language-pack.kreuzberg.dev) (TSLP) to parse and analyze source code files. When you extract a source code file, Kreuzberg automatically detects the programming language and produces structured analysis alongside the raw text content.
## What You Get
When extracting source code, the `metadata.format` field contains a `ProcessResult` (format type `"code"`) with:
- **Structure** -- functions, classes, structs, methods, modules, and their nesting hierarchy
- **Imports** -- import/include/require statements with source paths and imported items
- **Exports** -- exported symbols with their kinds (function, class, variable, type, default)
- **Comments** -- inline and block comments with their positions
- **Docstrings** -- documentation comments with parsed sections (params, returns, etc.)
- **Symbols** -- variable, constant, and type alias definitions
- **Diagnostics** -- parse errors and warnings from tree-sitter
- **Chunks** -- semantically meaningful code chunks for RAG and embedding pipelines
- **Metrics** -- file-level statistics (lines of code, comment lines, empty lines, node count)
Language support covers **300+ programming languages** via tree-sitter grammars. See the [TSLP documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) for the full language list.
## Getting Started
Code intelligence is enabled by default when the `tree-sitter` feature flag is active. Simply extract a source code file:
=== "Rust"
```rust title="basic.rs"
use kreuzberg::{extract_file_sync, ExtractionConfig};

let config = ExtractionConfig::default();
let result = extract_file_sync("app.py", None, &config)?;

// The content field has the raw source text
println!("{}", result.content);

// Code intelligence is in metadata.format
if let Some(kreuzberg::types::FormatMetadata::Code(ref code)) = result.metadata.format {
    println!("Language: {}", code.language);
    println!("Structures: {}", code.structure.len());
    println!("Imports: {}", code.imports.len());
}
```
=== "Python"
```python title="basic.py"
import kreuzberg

config = kreuzberg.ExtractionConfig()
result = kreuzberg.extract_file_sync("app.py", config=config)

# The content field has the raw source text
print(result.content)

# Code intelligence is in metadata["format"]
fmt = result.metadata.get("format")
if fmt and fmt.get("format_type") == "code":
    print(f"Language: {fmt['language']}")
    print(f"Structures: {len(fmt['structure'])}")
    print(f"Imports: {len(fmt['imports'])}")
```
=== "TypeScript"
```typescript title="basic.ts"
import { extractFileSync } from "@kreuzberg/node";

const result = extractFileSync("app.ts");
console.log(result.content);

const fmt = result.metadata?.format;
if (fmt?.formatType === "code") {
  console.log(`Language: ${fmt.language}`);
  console.log(`Structures: ${fmt.structure.length}`);
  console.log(`Imports: ${fmt.imports.length}`);
}
```
=== "Go"
```go title="basic.go"
result, err := kreuzberg.ExtractFileSync("app.py", nil)
if err != nil {
	log.Fatal(err)
}
fmt.Println(result.Content)

// Code intelligence is available in result.Metadata.Format
// when Format.Type == "code"
```
## Configuration
Use `TreeSitterConfig` to control which analysis features are enabled. Set `enabled: false` to disable code intelligence entirely. By default, `structure`, `imports`, and `exports` are enabled; `comments`, `docstrings`, `symbols`, and `diagnostics` are disabled.
=== "Rust"
```rust title="config.rs"
use kreuzberg::{ExtractionConfig, TreeSitterConfig, TreeSitterProcessConfig};

let config = ExtractionConfig {
    tree_sitter: Some(TreeSitterConfig {
        process: TreeSitterProcessConfig {
            structure: true,            // functions, classes, etc. (default: true)
            imports: true,              // import statements (default: true)
            exports: true,              // export statements (default: true)
            comments: true,             // comments (default: false)
            docstrings: true,           // docstrings (default: false)
            symbols: true,              // variables, constants (default: false)
            diagnostics: true,          // parse errors/warnings (default: false)
            chunk_max_size: Some(4096), // max chunk size in bytes
            ..Default::default()
        },
        ..Default::default()
    }),
    ..Default::default()
};
```
=== "Python"
```python title="config.py"
import kreuzberg

config = kreuzberg.ExtractionConfig(
    tree_sitter={
        "process": {
            "structure": True,
            "imports": True,
            "exports": True,
            "comments": True,
            "docstrings": True,
            "symbols": True,
            "diagnostics": True,
            "chunk_max_size": 4096,
        }
    }
)
```
=== "TypeScript"
```typescript title="config.ts"
import { ExtractionConfig } from "@kreuzberg/node";

const config: ExtractionConfig = {
  treeSitter: {
    process: {
      structure: true,
      imports: true,
      exports: true,
      comments: true,
      docstrings: true,
      symbols: true,
      diagnostics: true,
      chunkMaxSize: 4096,
    },
  },
};
```
=== "TOML"
```toml title="kreuzberg.toml"
[tree_sitter.process]
structure = true
imports = true
exports = true
comments = true
docstrings = true
symbols = true
diagnostics = true
chunk_max_size = 4096
```
### Configuration Fields
See [`TreeSitterConfig`](../reference/configuration.md#treesitterconfig) and [`TreeSitterProcessConfig`](../reference/configuration.md#treesitterprocessconfig) for all fields.
## ProcessResult Fields
Code intelligence results are returned as a `ProcessResult` from the upstream [`tree-sitter-language-pack`](https://docs.rs/tree-sitter-language-pack) crate. Top-level fields: `language`, `metrics`, `structure`, `imports`, `exports`, `chunks`, plus `comments` / `docstrings` / `symbols` / `diagnostics` (populated only when their `TreeSitterProcessConfig` flag is on). See the upstream crate docs for full field shapes.
## Semantic Chunking for RAG
Code chunks produced by tree-sitter are semantically aware -- they split at function, class, and module boundaries rather than fixed line counts. This makes them ideal for retrieval-augmented generation (RAG) pipelines:
```python title="rag_chunking.py"
import kreuzberg

config = kreuzberg.ExtractionConfig(
    tree_sitter={"process": {"chunk_max_size": 2048}}
)
result = kreuzberg.extract_file_sync("large_module.py", config=config)
fmt = result.metadata.get("format")
if fmt and fmt.get("format_type") == "code":
    for chunk in fmt.get("chunks", []):
        # Each chunk is a semantically coherent piece of code
        embedding = your_embedding_model(chunk["content"])
        store_in_vector_db(
            text=chunk["content"],
            embedding=embedding,
            metadata={
                "language": chunk["language"],
                "start_line": chunk["span"]["start_line"],
                "parent": chunk.get("context", {}).get("parent_name"),
            },
        )
```
## Language Detection
Kreuzberg detects the programming language in two ways:
1. **File extension** (fast path) -- when using `extract_file`, the extension is matched against 248 known language extensions
2. **Shebang line** (fallback) -- when using `extract_bytes` or when the extension is ambiguous, the first line is checked for `#!/usr/bin/env python`, `#!/bin/bash`, and so on.
If neither method identifies the language, extraction returns an `UnsupportedFormat` error.
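
The two-step lookup described above can be sketched in plain Python. This is an illustration of the order of checks only, not Kreuzberg's implementation: the lookup tables are hypothetical and heavily truncated, and `detect_language` is a made-up helper name.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical, truncated lookup tables for illustration only.
EXTENSIONS = {".py": "python", ".rs": "rust", ".sh": "bash", ".ts": "typescript"}
INTERPRETERS = {"python": "python", "python3": "python", "bash": "bash", "sh": "bash"}


def detect_language(filename: Optional[str], source: bytes) -> Optional[str]:
    """Return a language name, or None when neither method matches."""
    # Fast path: match the file extension.
    if filename:
        language = EXTENSIONS.get(PurePosixPath(filename).suffix.lower())
        if language:
            return language

    # Fallback: inspect a shebang on the first line.
    first_line = source.split(b"\n", 1)[0].decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    # "#!/usr/bin/env python" names the interpreter in the second token.
    program = PurePosixPath(parts[0]).name
    if program == "env" and len(parts) > 1:
        program = parts[1]
    return INTERPRETERS.get(program)
```

A `None` result corresponds to the case where extraction would report an unsupported format.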
## Language Support
Tree-sitter-language-pack supports 300+ programming languages. For the full list, see the [TSLP language reference](https://docs.tree-sitter-language-pack.kreuzberg.dev).
Common languages with full structural analysis:
| Language | Structure | Imports | Exports | Docstrings |
| ---------- | --------- | ------- | ------- | ---------- |
| Python | Yes | Yes | Yes | Yes |
| Rust | Yes | Yes | Yes | Yes |
| TypeScript | Yes | Yes | Yes | Yes |
| JavaScript | Yes | Yes | Yes | Yes |
| Go | Yes | Yes | Yes | Yes |
| Java | Yes | Yes | Yes | Yes |
| C/C++ | Yes | Yes | Yes | Yes |
| Ruby | Yes | Yes | Yes | Yes |
| PHP | Yes | Yes | Yes | Yes |
| C# | Yes | Yes | Yes | Yes |
| Swift | Yes | Yes | Yes | Yes |
| Kotlin | Yes | Yes | Yes | Yes |
| Elixir | Yes | Yes | Yes | Yes |
## Related Documentation
- [Configuration Reference](../reference/configuration.md#treesitterconfig) -- TreeSitterConfig and TreeSitterProcessConfig fields
- [Types Reference](../reference/types.md) -- ProcessResult, StructureItem, CodeChunk, and related type definitions
- [tree-sitter-language-pack documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) -- Full language support reference


@@ -0,0 +1,214 @@
# Configuration Guide <span class="version-badge">v4.0.0</span>
All extraction behavior is controlled through `ExtractionConfig`. Pass it directly in code or load it from a TOML/YAML/JSON file. Every field is optional. For per-field documentation, see the [Configuration Reference](../reference/configuration.md).
## Quick Start
=== "Python"
--8<-- "snippets/python/config/config_basic.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/config_basic.md"
=== "Rust"
--8<-- "snippets/rust/config/config_basic.md"
=== "Go"
--8<-- "snippets/go/config/config_basic.md"
=== "Java"
--8<-- "snippets/java/config/config_basic.md"
=== "C#"
--8<-- "snippets/csharp/config_basic.md"
=== "Ruby"
--8<-- "snippets/ruby/config/config_basic.md"
=== "R"
--8<-- "snippets/r/config/config_basic.md"
## Configuration Files
Three formats are supported. TOML is recommended.
=== "TOML (Recommended)"
```toml title="kreuzberg.toml"
use_cache = true
enable_quality_processing = true
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
psm = 3
```
=== "YAML"
```yaml title="kreuzberg.yaml"
use_cache: true
enable_quality_processing: true
ocr:
  backend: tesseract
  language: eng
  tesseract_config:
    psm: 3
```
=== "JSON"
```json title="kreuzberg.json"
{
  "use_cache": true,
  "enable_quality_processing": true,
  "ocr": {
    "backend": "tesseract",
    "language": "eng",
    "tesseract_config": {
      "psm": 3
    }
  }
}
```
### Automatic Discovery
When no `--config` path is supplied, Kreuzberg walks up from the current working directory looking for `kreuzberg.toml` and uses the first match. YAML and JSON files are supported only when passed explicitly via `--config`. If nothing is found, defaults are used.
=== "Python"
--8<-- "snippets/python/config/config_discover.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/config_discover.md"
=== "Rust"
--8<-- "snippets/rust/config/config_discover.md"
=== "Go"
--8<-- "snippets/go/config/config_discover.md"
=== "Java"
--8<-- "snippets/java/config/config_discover.md"
=== "C#"
--8<-- "snippets/csharp/config_discover.md"
=== "Ruby"
--8<-- "snippets/ruby/config/config_discover.md"
=== "R"
--8<-- "snippets/r/config/config_discover.md"
=== "Wasm"
--8<-- "snippets/wasm/config/config_discover.md"
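
The upward search used by automatic discovery can be sketched as follows. This is a minimal illustration of the walk-up technique, assuming the search starts at a given directory and stops at the filesystem root; `discover_config` is a hypothetical helper, not a Kreuzberg function.

```python
from pathlib import Path
from typing import Optional


def discover_config(start: Path, filename: str = "kreuzberg.toml") -> Optional[Path]:
    """Return the nearest `filename` in `start` or any parent directory."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None  # nothing found: the caller falls back to defaults


if __name__ == "__main__":
    print(discover_config(Path.cwd()))
```

Because the first match wins, a `kreuzberg.toml` in a project subdirectory takes precedence over one higher up the tree.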
## Common Use Cases
### Setting Up OCR
=== "Python"
--8<-- "snippets/python/config/config_ocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/config_ocr.md"
=== "Rust"
--8<-- "snippets/rust/ocr/config_ocr.md"
=== "Go"
--8<-- "snippets/go/config/config_ocr.md"
=== "Java"
--8<-- "snippets/java/config/config_ocr.md"
=== "C#"
--8<-- "snippets/csharp/config_ocr.md"
=== "Ruby"
--8<-- "snippets/ruby/config/config_ocr.md"
=== "R"
--8<-- "snippets/r/config/config_ocr.md"
For backend selection and language packs, see [OCR Guide](ocr.md). For fine-grained Tesseract tuning, see [TesseractConfig Reference](../reference/configuration.md#tesseractconfig).
### Chunking for RAG
=== "Python"
--8<-- "snippets/python/utils/chunking.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/chunking.md"
=== "Rust"
--8<-- "snippets/rust/advanced/chunking.md"
=== "Go"
--8<-- "snippets/go/utils/chunking.md"
=== "Java"
--8<-- "snippets/java/utils/chunking.md"
=== "C#"
--8<-- "snippets/csharp/advanced/embedding_with_chunking.md"
=== "Ruby"
--8<-- "snippets/ruby/utils/chunking.md"
=== "R"
--8<-- "snippets/r/utils/chunking.md"
## All Configuration Categories
- [ExtractionConfig](../reference/configuration.md#extractionconfig) — top-level options
- [OcrConfig](../reference/configuration.md#ocrconfig) — OCR backend, language, acceleration
- [TesseractConfig](../reference/configuration.md#tesseractconfig) — Tesseract PSM, confidence, table detection
- [ChunkingConfig](../reference/configuration.md#chunkingconfig) — chunk size, overlap
- [TokenReductionConfig](../reference/configuration.md#tokenreductionconfig) — LLM prompt token reduction
- [ContentFilterConfig](../reference/configuration.md#contentfilterconfig) — header/footer/watermark filtering
- [PageConfig](../reference/configuration.md#pageconfig) — page tracking and markers
- [AccelerationConfig](../reference/configuration.md#accelerationconfig) — ONNX Runtime execution provider
## Next Steps
- [Extraction Basics](extraction.md) — core extraction API and supported formats
- [OCR Guide](ocr.md) — backend installation and language setup
- [Advanced Features](advanced.md) — embeddings, language detection, page tracking
- [Plugins Guide](plugins.md) — custom post-processors and validators

354
docs/guides/development.md Normal file

@@ -0,0 +1,354 @@
# Development Workflow
Everything you need to build, test, and debug Kreuzberg locally. This guide assumes you've already followed the [Contributing Guide](../contributing.md) to fork and clone the repository.
---
## The Task Runner
Kreuzberg uses [Task](https://taskfile.dev/) for all build and test workflows. One command to bootstrap everything:
```bash title="Terminal"
task setup
```
That installs all toolchains and dependencies. Safe to re-run anytime — it's idempotent.
### The Pattern
Tasks follow `<language>:<action>`. Once you learn this pattern, the command for any task is predictable:
```bash title="Terminal"
task rust:build # Build the Rust core
task rust:build:dev # Debug build (faster compile, no optimizations)
task rust:build:release # Release build (slow compile, fast binary)
task rust:test # Run Rust tests
task rust:test:ci # Same tests, with CI-level diagnostics
task python:build # Build Python bindings via maturin
task python:test # Run Python test suite
task node:build # Build Node.js bindings via napi
task node:test # Jest tests
```
The same pattern works for every language: `go:build`, `java:test`, `ruby:build`, `csharp:test`, and so on.
### Bulk Operations
```bash title="Terminal"
task build:all # Build every binding
task test:all # Test every binding (sequential)
task test:all:parallel # Test every binding (parallel — faster, noisier output)
task check # Lint + format check across the whole repo
```
---
## Testing Locally
### Rust
The core lives in `crates/kreuzberg/`. Most changes start here.
```bash title="Terminal"
task rust:test
cargo test -p kreuzberg test_pdf_extraction -- --nocapture
RUST_LOG=debug cargo test -p kreuzberg test_name -- --nocapture
```
### Python
Python bindings are in `packages/python/`. Build first, then test:
```bash title="Terminal"
task python:build:dev
task python:test
cd packages/python
uv run pytest tests/ -k "test_extract" -v
```
The `RUST_LOG` env var works here too — the Rust core logs through Python's stderr:
```bash title="Terminal"
RUST_LOG=debug uv run pytest tests/ -v
```
### Node.js
TypeScript bindings are in `packages/typescript/`:
```bash title="Terminal"
task node:build:dev
task node:test
cd packages/typescript
pnpm test -- --testPathPattern="extract"
```
### Everything Else
Same pattern. Build, then test:
```bash title="Terminal"
task go:build && task go:test
task java:build && task java:test
task csharp:build && task csharp:test
task ruby:build && task ruby:test
task php:build && task php:test
task elixir:build && task elixir:test
task r:build && task r:test
task c:build && task c:test
task wasm:build && task wasm:test
```
### Testing the live browser demo
The demo at `docs/demo.html` loads `@kreuzberg/wasm` from a CDN. To test local changes against it, use:
```bash title="Terminal"
task demo:dev
```
This builds the Wasm binary and TypeScript dist, patches the demo with local URLs, and starts two servers:
| Server | URL | Role |
| ------ | ----------------------- | ---------------------------------- |
| Docs | `http://localhost:8001` | Serves the patched `demo-dev.html` |
| Assets | `http://localhost:9000` | Serves the local Wasm package |
Open **`http://localhost:8001/demo-dev.html`** — no manual edits needed. The patched file (`docs/demo-dev.html`) is gitignored and regenerated on every run. The two different ports reproduce the cross-origin setup the CDN creates in production.
To skip the slow Rust build when you've only changed TypeScript:
```bash title="Terminal"
SKIP_WASM_BUILD=1 task demo:dev
```
---
## End-to-end Test Suites
End-to-end tests guarantee that every language binding produces identical results for the same document. They live in `e2e/` as shared fixtures — test inputs paired with expected outputs.
### Run end-to-end Tests
| Language | Directory | Run with |
| -------------------- | ----------------- | ---------------------- |
| Python | `e2e/python/` | `task python:e2e:test` |
| TypeScript / Node.js | `e2e/typescript/` | `task node:e2e:test` |
| Rust | `e2e/rust/` | `task rust:e2e:test` |
| Go | `e2e/go/` | `task go:e2e:test` |
| Java | `e2e/java/` | `task java:e2e:test` |
| .NET | `e2e/csharp/` | `task csharp:e2e:test` |
| Ruby | `e2e/ruby/` | `task ruby:e2e:test` |
| PHP | `e2e/php/` | `task php:e2e:test` |
| R | `e2e/r/` | `task r:e2e:test` |
### Regenerate end-to-end Tests
When you add a feature that changes extraction behavior, regenerate the affected end-to-end suites:
```bash title="Terminal"
task python:e2e:generate
task node:e2e:generate
task <lang>:e2e:generate
```
To regenerate and test all suites at once:
```bash title="Terminal"
task e2e:generate:all
task e2e:test:all
```
---
## Benchmarking
Measure extraction performance with the benchmark harness in `tools/benchmark-harness/`. Use it to track regressions, compare against alternatives, and identify bottlenecks with flamegraphs.
### Quick Start
```bash title="Terminal"
task benchmark:run FRAMEWORK=kreuzberg MODE=single-file
task benchmark:run FRAMEWORK=kreuzberg MODE=batch
```
### Common Modes
| Mode | What it measures |
| ------------- | --------------------------------------- |
| `single-file` | Latency — one file at a time |
| `batch` | Throughput — multiple files in parallel |
### With Profiling
Generate flamegraphs to see where time is spent:
```bash title="Terminal"
task benchmark:profile FRAMEWORK=kreuzberg MODE=single-file
```
Results appear in the `flamegraphs/` directory as interactive SVGs.
View live benchmark results at <https://kreuzberg.dev/benchmarks>.
---
## Linting and Pre-commit
```bash title="Terminal"
task check # Full lint + format check (same as CI validate stage)
```
Language-specific:
```bash title="Terminal"
task rust:lint # clippy + rustfmt
task python:lint # ruff + mypy
task node:lint # eslint + typecheck
```
The repository uses pre-commit hooks that enforce conventional commit messages, code formatting, and linter rules. If a commit is rejected, the hook output tells you exactly what to fix.
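
To check a commit subject before the hook sees it, a rough approximation of the conventional commit format can be written as a regular expression. The accepted type list below is an assumption; the repository's hook may allow a different set or apply stricter rules.

```python
import re

# Assumed type list; the repository's hook may accept a different set.
CONVENTIONAL_COMMIT = re.compile(
    r"^(build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test)"
    r"(\([\w.\-/]+\))?"  # optional scope, e.g. (python)
    r"!?"                # optional breaking-change marker
    r": \S.*$"           # colon, space, non-empty description
)


def is_conventional(subject: str) -> bool:
    """Check only the first line (subject) of a commit message."""
    return CONVENTIONAL_COMMIT.match(subject) is not None


if __name__ == "__main__":
    for subject in ("feat(python): add per-file config", "updated stuff"):
        print(subject, "->", is_conventional(subject))
```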
---
## Working with Documentation
### Building Locally
```bash title="Terminal"
uv sync --group doc
zensical build --clean
zensical serve
```
### How Snippets Work
Code examples in the docs aren't inline — they're pulled from `docs/snippets/` via the `--8<--` include directive. This keeps examples testable and reusable across pages.
```text
docs/snippets/
├── python/          # Python examples
│   ├── api/         # extract_file, batch_extract, etc.
│   ├── config/      # ExtractionConfig, OcrConfig, etc.
│   ├── ocr/         # OCR backends
│   ├── plugins/     # Plugin implementations
│   ├── mcp/         # MCP server and client
│   └── utils/       # Embeddings, chunking, errors
├── rust/            # Rust examples (same layout)
├── typescript/      # TypeScript examples
├── go/, java/, csharp/, ruby/, r/
├── docker/          # Docker commands
├── api_server/      # Server startup examples
└── cli/             # CLI usage
```
When you change a user-facing API, update the matching snippet. When you add a new feature, create a snippet and include it from the relevant doc page.
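
When adding or moving snippets, a small script can catch include directives that point at files that no longer exist. The sketch below assumes include paths are resolved relative to the `docs/` directory, which matches the layout shown above but is not confirmed here; `missing_snippets` is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import List

# Matches lines such as: --8<-- "snippets/python/config/config_basic.md"
INCLUDE = re.compile(r'--8<--\s+"([^"]+)"')


def missing_snippets(markdown: str, docs_root: Path) -> List[str]:
    """Return included snippet paths that do not exist under `docs_root`."""
    return [
        path
        for path in INCLUDE.findall(markdown)
        if not (docs_root / path).is_file()
    ]


if __name__ == "__main__":
    docs = Path("docs")
    for page in sorted(docs.glob("guides/*.md")):
        for path in missing_snippets(page.read_text(encoding="utf-8"), docs):
            print(f"{page}: missing {path}")
```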
### Theme tokens (light mode)
Inline `code` and command-style monospace in light mode use the text token **`#26203A`**, defined in `docs/css/extra.css` as `--kb-text` (referenced as `var(--kb-text)`; brand backgrounds use the same value via `--kb-brand-ink`).
---
## Debugging
### Rust Panics
```bash title="Terminal"
RUST_BACKTRACE=1 cargo test -p kreuzberg test_name
RUST_BACKTRACE=full cargo test -p kreuzberg test_name
```
### Python FFI Problems
When something goes wrong in the Rust core during a Python call, the error introspection API gives you the details:
```python title="debug_ffi.py"
from kreuzberg import get_last_error_code, get_error_details, get_last_panic_context

details = get_error_details()
print(f"Error: {details['message']}")
print(f"Code: {details['error_code']}")
context = get_last_panic_context()
if context:
    print(f"Panic context: {context}")
```
### Verbose Logging
Crank up the log level to see what the Rust core is doing:
```bash title="Terminal"
RUST_LOG=debug task python:test
RUST_LOG=trace task rust:test
```
---
## CI/CD
CI runs on every push and PR to `main` via `.github/workflows/ci.yaml`. The pipeline has four stages:
1. **Validate** — conventional commits, formatting, clippy
2. **Build** — FFI libraries, Python wheels, Node packages, all bindings
3. **Test** — per-language test suites on Linux, macOS, and Windows
4. **Integration** — Docker build, Docker smoke tests, CLI tests
### Smart Change Detection
CI doesn't rebuild everything on every PR. A `changes` job detects which paths were touched and only runs the relevant build/test jobs. Edit a Python file? Only Python builds and tests run. Touch the Rust core? Everything downstream rebuilds.
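
The idea behind path-based change detection can be sketched in a few lines. The filter table below is hypothetical and much smaller than a real workflow's; the actual patterns live in the CI configuration. Listing `crates/*` under each binding models the "everything downstream rebuilds" behavior.

```python
from fnmatch import fnmatchcase
from typing import List, Set

# Hypothetical path filters; the real ones live in the CI workflow.
JOB_FILTERS = {
    "rust": ["crates/*"],
    "python": ["packages/python/*", "crates/*"],
    "node": ["packages/typescript/*", "crates/*"],
    "docs": ["docs/*"],
}


def jobs_to_run(changed_paths: List[str]) -> Set[str]:
    """Select every job with at least one filter matching a changed path."""
    return {
        job
        for job, patterns in JOB_FILTERS.items()
        if any(
            fnmatchcase(path, pattern)
            for path in changed_paths
            for pattern in patterns
        )
    }


if __name__ == "__main__":
    print(jobs_to_run(["packages/python/tests/test_extract.py"]))
    print(sorted(jobs_to_run(["crates/kreuzberg/src/lib.rs"])))
```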
### Running CI Checks Locally
Before pushing, you can run the same checks CI runs:
```bash title="Terminal"
task check # Matches the validate stage
task rust:test:ci # Rust tests with CI diagnostics
task python:test:ci # Python tests with CI diagnostics
task test:all:ci # Everything
```
### Other Workflows
| Workflow | When it runs | What it does |
| --------------------- | ------------------------------------- | ---------------------------------- |
| `ci.yaml` | Every push/PR to `main` | The main pipeline |
| `docs.yaml` | Changes to `docs/` or `zensical.toml` | Builds and validates documentation |
| `benchmarks.yaml` | Manual trigger | Runs the full benchmark suite |
| `profiling.yaml` | Manual trigger | Generates flamegraphs |
| `publish.yaml` | Release events | Publishes packages to registries |
| `publish-docker.yaml` | Tags and releases | Builds and pushes Docker images |
---
## Performance
Kreuzberg's core is written in Rust, which enables zero-copy memory handling, SIMD acceleration, and true multi-core parallelism in ahead-of-time compiled code with no garbage collector.
### Why Rust Matters
- **Native compilation:** LLVM optimizes code ahead of time (inlining, vectorization, dead code elimination)
- **Zero-copy strings:** Slicing uses borrowed references, not heap allocations
- **SIMD acceleration:** Whitespace detection and character classification run 15-37x faster than scalar operations
- **No GIL:** True multi-core parallelism across all CPU cores
- **Deterministic memory:** Drop semantics free memory instantly, no GC pauses
### Key Optimizations
- **Batch processing:** 6-10x faster than sequential extraction through a work-stealing scheduler
- **Caching:** 85%+ hit rates for repeated files (SQLite-backed, automatic invalidation)
- **Streaming:** Large files processed in 4KB chunks, constant memory regardless of file size
- **Lazy initialization:** Expensive subsystems (Tokio, plugins) initialized on first use only
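
The streaming point is a general technique: read a fixed-size block, process it, discard it. The sketch below shows it in Python with a hash as the stand-in workload. It is not Kreuzberg's Rust implementation, and the helper names are illustrative.

```python
import hashlib
from typing import BinaryIO, Iterator

CHUNK_SIZE = 4 * 1024  # 4 KB, matching the figure quoted above


def iter_chunks(stream: BinaryIO, size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive blocks so only one block is held in memory at a time."""
    while True:
        block = stream.read(size)
        if not block:
            return
        yield block


def sha256_streaming(stream: BinaryIO) -> str:
    """Hash a stream of any length using constant memory."""
    digest = hashlib.sha256()
    for block in iter_chunks(stream):
        digest.update(block)
    return digest.hexdigest()
```

Peak memory depends on `CHUNK_SIZE`, not on the size of the input, which is the property the bullet above describes.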
### Benchmarking Your Workload
Measure with your actual files using the benchmark harness (see [Benchmarking](#benchmarking) section for full instructions). For detailed analysis and live benchmark results, visit <https://kreuzberg.dev/benchmarks>.
---

270
docs/guides/docker.md Normal file

@@ -0,0 +1,270 @@
# Docker Deployment <span class="version-badge">v4.0.0</span>
Official Docker images built on the Rust core with Debian 13 (Trixie). Each image supports three execution modes: API server (default), command-line tool, and MCP server.
## Quick Start
### Pull and Run
=== "API Server"
--8<-- "snippets/docker/api_server_basic.md"
=== "CLI Mode"
--8<-- "snippets/docker/cli_mode_basic.md"
=== "MCP Server"
--8<-- "snippets/docker/mcp_basic.md"
### Pull Image
=== "Core"
--8<-- "snippets/docker/core_pull.md"
=== "Full"
--8<-- "snippets/docker/full_pull.md"
## Image Variants
| | **Core** | **Full** |
| ----------------- | -------------------------------------- | ---------------------------------------- |
| **Image** | `ghcr.io/kreuzberg-dev/kreuzberg:core` | `ghcr.io/kreuzberg-dev/kreuzberg:latest` |
| **Size** | ~1.0-1.3 GB | ~1.5-2.1 GB |
| **Tesseract OCR** | 12 languages | 12 languages |
| **Modern Office** | DOCX, PPTX, XLSX | DOCX, PPTX, XLSX |
| **Legacy Office** | DOC, PPT, XLS (native OLE/CFB) | DOC, PPT, XLS (native OLE/CFB) |
| **Startup** | ~1s | ~1s |
**Core** is optimized for production deployments where image size matters. Both images support all major formats — choose based on deployment constraints.
All images include: Tesseract OCR (eng, spa, fra, deu, ita, por, chi-sim, chi-tra, jpn, ara, rus, hin), PDF (pdf_oxide), images, HTML, email, and archives.
## Execution Modes
### API Server (Default)
```bash title="Terminal"
docker run -p 8000:8000 ghcr.io/kreuzberg-dev/kreuzberg:latest
# Custom port and CORS
docker run -p 9000:9000 \
-e KREUZBERG_CORS_ORIGINS="https://myapp.com" \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
serve --host 0.0.0.0 --port 9000
# With config file
docker run -p 8000:8000 \
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
serve --config /config/kreuzberg.toml
```
See [API Server Guide](api-server.md) for endpoint documentation.
### CLI Mode
```bash title="Terminal"
# Extract a file
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
extract /data/document.pdf
# Extract with OCR
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
extract /data/scanned.pdf --ocr true
# Batch processing
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
batch /data/*.pdf --format json
# MIME detection
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg:latest \
detect /data/unknown-file.bin
```
### MCP Server
```bash title="Terminal"
docker run ghcr.io/kreuzberg-dev/kreuzberg:latest mcp
# With config
docker run \
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
mcp --config /config/kreuzberg.toml
```
See [API Server Guide - MCP Section](api-server.md#mcp-server) for integration details.
## Environment Variables
| Variable | Default | Description |
| ------------------------------ | ----------------------------- | ------------------------------------------------------------------------------------------------ |
| `KREUZBERG_MAX_UPLOAD_SIZE_MB` | `100` | Max upload size in MB |
| `KREUZBERG_CORS_ORIGINS` | `*` | Comma-separated allowed origins |
| `RUST_LOG` | `info` | Log level: `error`, `warn`, `info`, `debug`, `trace` |
| `KREUZBERG_CACHE_DIR` | `/app/.kreuzberg` | Cache directory (set explicitly in Docker; outside containers defaults to platform global cache) |
| `HF_HOME` | `/app/.kreuzberg/huggingface` | HuggingFace model cache |
Host and port are set via CLI args: `serve --host 0.0.0.0 --port 8000`.
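
For deployment scripts, the split between environment variables and CLI arguments can be made explicit when assembling the `docker run` command. This is a sketch of one way to do that; `docker_run_command` is a hypothetical helper, and it uses only the flags and variables shown in this guide.

```python
import shlex
from typing import Dict, List

IMAGE = "ghcr.io/kreuzberg-dev/kreuzberg:latest"


def docker_run_command(
    env: Dict[str, str], host: str = "0.0.0.0", port: int = 8000
) -> List[str]:
    """Build an argument list: env vars become -e flags, host/port become serve args."""
    command = ["docker", "run", "-p", f"{port}:{port}"]
    for name, value in env.items():
        command += ["-e", f"{name}={value}"]
    command += [IMAGE, "serve", "--host", host, "--port", str(port)]
    return command


args = docker_run_command(
    {"KREUZBERG_MAX_UPLOAD_SIZE_MB": "500", "RUST_LOG": "debug"}, port=9000
)
print(shlex.join(args))  # shell-quoted form, suitable for logging
```

Passing the list to `subprocess.run` avoids shell quoting issues; `shlex.join` is only used here to display the command.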
## Volume Mounts
```bash title="Terminal"
# Cache persistence (embedding models, OCR cache)
docker run -p 8000:8000 \
-v kreuzberg-cache:/app/.kreuzberg \
ghcr.io/kreuzberg-dev/kreuzberg:latest
# Config file
docker run -p 8000:8000 \
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
serve --config /config/kreuzberg.toml
# Documents (read-only)
docker run -v $(pwd)/documents:/data:ro \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
extract /data/document.pdf
```
!!! Note "Model Downloads"
    Embedding models download on first use (~90 MB to 1.2 GB depending on preset). Use a persistent volume for `/app/.kreuzberg` in production to avoid re-downloading on container restart. Outside Docker, models are cached in the platform-specific global cache directory (for example, `~/.cache/kreuzberg/` on Linux, `~/Library/Caches/kreuzberg/` on macOS).
## Docker Compose
```yaml title="docker-compose.yaml"
services:
  kreuzberg-api:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    ports:
      - "8000:8000"
    environment:
      - KREUZBERG_CORS_ORIGINS=https://myapp.com
      - KREUZBERG_MAX_UPLOAD_SIZE_MB=500
      - RUST_LOG=info
    volumes:
      - ./config:/config
      - cache-data:/app/.kreuzberg
    command: serve --host 0.0.0.0 --port 8000 --config /config/kreuzberg.toml
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "kreuzberg", "--version"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s

volumes:
  cache-data:
```
## Security
Images run as non-root user `kreuzberg` (UID 1000). For hardened deployments:
```bash title="Terminal"
docker run --security-opt no-new-privileges \
--read-only \
--tmpfs /tmp \
-p 8000:8000 \
ghcr.io/kreuzberg-dev/kreuzberg:latest
```
Ensure mounted volumes have correct permissions:
```bash title="Terminal"
chown -R 1000:1000 /path/to/mounted/directory
```
## Resource Allocation
| Workload | Memory | CPU | Notes |
| -------- | ------ | --------- | --------------------------------------- |
| Light | 512 MB | 0.5 cores | Small documents, low concurrency |
| Medium | 1 GB | 1 core | Typical documents, moderate concurrency |
| Heavy | 2 GB+ | 2+ cores | Large documents, OCR, high concurrency |
```bash title="Terminal"
docker run -p 8000:8000 --memory=1g --cpus=1 \
ghcr.io/kreuzberg-dev/kreuzberg:latest
```
## Building Custom Images
=== "Core Image"
--8<-- "snippets/docker/build_core.md"
=== "Full Image"
--8<-- "snippets/docker/build_full.md"
```dockerfile title="Custom Dockerfile"
FROM ghcr.io/kreuzberg-dev/kreuzberg:latest
USER root
RUN apt-get update && \
apt-get install -y --no-install-recommends your-package-here && \
apt-get clean && rm -rf /var/lib/apt/lists/*
USER kreuzberg
COPY kreuzberg.toml /app/kreuzberg.toml
CMD ["serve", "--config", "/app/kreuzberg.toml"]
```
## Other Image Variants
The published Core and Full images cover most use cases. For specialized needs, the `docker/` directory has additional Dockerfiles:
| Dockerfile | What it builds |
| ------------------------- | ------------------------------------------------------------------------------------- |
| `Dockerfile.cli` | Minimal image with just the `kreuzberg` binary — good for CI pipelines and batch jobs |
| `Dockerfile.musl-build` | Fully static Linux binaries via MUSL — runs on any distro, no dynamic libs |
| `Dockerfile.musl-ffi` | Static C FFI library for language bindings (Go, Ruby, R, PHP, Elixir) |
| `Dockerfile.musl-rustler` | MUSL-based Rustler NIF for Elixir |
### CLI Image
A stripped-down image with only the CLI binary. No server, no API — just extraction:
```bash title="Terminal"
docker build -f docker/Dockerfile.cli -t kreuzberg-cli .
docker run -v $(pwd):/data kreuzberg-cli extract /data/document.pdf
docker run -v $(pwd):/data kreuzberg-cli batch /data/*.pdf --format json
docker run -v $(pwd):/data kreuzberg-cli detect /data/unknown-file.bin
```
### MUSL Static Builds
These produce binaries with zero dynamic library dependencies. A single file that runs on any Linux — Alpine, scratch containers, bare EC2 instances, whatever.
```bash title="Terminal"
docker build -f docker/Dockerfile.musl-build -t kreuzberg-musl-build .
docker build -f docker/Dockerfile.musl-ffi -t kreuzberg-musl-ffi .
```
The FFI variant builds the static C FFI library used by the Go, Ruby, R, PHP, and Elixir bindings for portable cross-platform distribution.
## Troubleshooting
??? Question "Container won't start"
Check logs with `docker logs <container-id>`. Common causes: port conflict (change `-p` mapping), insufficient memory (increase `--memory`), volume permission errors.
??? Question "Permission errors on mounted volumes"
Images run as UID 1000. Fix with: `chown -R 1000:1000 /path/to/mounted/directory`
??? Question "Large file processing fails"
Increase memory limit (`--memory=4g`) and upload size (`-e KREUZBERG_MAX_UPLOAD_SIZE_MB=1000`).
## Next Steps
- [Kubernetes Deployment](kubernetes.md) — production K8s with OCR config and health checks
- [API Server Guide](api-server.md) — endpoint documentation
- [Configuration](configuration.md) — all configuration options

554
docs/guides/extraction.md Normal file

@@ -0,0 +1,554 @@
# Extraction Basics
Eight core extraction functions are available, organized by input type (file path vs bytes), cardinality (single vs batch), and execution model (sync vs async).
| Input | Single sync | Single async | Batch sync | Batch async |
| ------------- | -------------------- | --------------- | -------------------------- | --------------------- |
| **File path** | `extract_file_sync` | `extract_file` | `batch_extract_files_sync` | `batch_extract_files` |
| **Bytes** | `extract_bytes_sync` | `extract_bytes` | `batch_extract_bytes_sync` | `batch_extract_bytes` |
!!! Tip "Sync vs Async"
    Use async variants when you're already in an async context or processing multiple files concurrently. For scripts and simple pipelines, sync variants are simpler and just as fast for single files.
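
To see why async helps with several files at once, the sketch below runs three awaitable calls concurrently with `asyncio.gather`. `fake_extract` is a stand-in that sleeps instead of extracting, so the example runs without Kreuzberg installed; in real code an async extraction function such as `extract_file` would take its place.

```python
import asyncio
import time
from typing import List


async def fake_extract(path: str) -> str:
    """Stand-in for an async extraction call; sleeps instead of doing I/O."""
    await asyncio.sleep(0.1)
    return f"content of {path}"


async def extract_all(paths: List[str]) -> List[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fake_extract(p) for p in paths))


start = time.perf_counter()
results = asyncio.run(extract_all(["a.pdf", "b.docx", "c.html"]))
elapsed = time.perf_counter() - start
print(results)
print(f"{elapsed:.2f}s")  # close to one sleep, not three in sequence
```

For a single file there is nothing to overlap, which is why the sync variants are just as fast in that case.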
## Extract from Files
Pass a file path. Kreuzberg detects the MIME type from the extension and selects the right parser automatically.
### Synchronous
=== "Python"
--8<-- "snippets/python/api/extract_file_sync.md"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/extract_file_sync.md"
=== "Rust"
--8<-- "snippets/rust/api/extract_file_sync.md"
=== "Go"
--8<-- "snippets/go/api/extract_file_sync.md"
=== "Java"
--8<-- "snippets/java/api/extract_file_sync.md"
=== "C#"
--8<-- "snippets/csharp/extract_file_sync.md"
=== "Ruby"
--8<-- "snippets/ruby/api/extract_file_sync.md"
=== "R"
--8<-- "snippets/r/api/extract_file_sync.md"
=== "C"
--8<-- "snippets/c/api/extract_file_sync.md"
=== "Wasm"
--8<-- "snippets/wasm/api/extract_file_sync.md"
### Asynchronous
=== "Python"
--8<-- "snippets/python/api/extract_file_async.md"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/extract_file_async.md"
=== "Rust"
--8<-- "snippets/rust/api/extract_file_async.md"
=== "Go"
--8<-- "snippets/go/api/extract_file_async.md"
=== "Java"
--8<-- "snippets/java/api/extract_file_async.md"
=== "C#"
--8<-- "snippets/csharp/extract_file_async.md"
=== "Ruby"
--8<-- "snippets/ruby/api/extract_file_async.md"
=== "R"
--8<-- "snippets/r/api/extract_file_async.md"
=== "C"
--8<-- "snippets/c/api/extract_file_async.md"
=== "Wasm"
--8<-- "snippets/wasm/api/extract_file_async.md"
## Extract from Bytes
When the file is already loaded in memory (for example, from an upload or network response), pass the byte array with its MIME type. Unlike file extraction, the MIME type is required since there's no file extension to infer it from.
### Synchronous
=== "Python"
--8<-- "snippets/python/api/extract_bytes_sync.md"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/extract_bytes_sync.md"
=== "Rust"
--8<-- "snippets/rust/api/extract_bytes_sync.md"
=== "Go"
--8<-- "snippets/go/api/extract_bytes_sync.md"
=== "Java"
--8<-- "snippets/java/api/extract_bytes_sync.md"
=== "C#"
--8<-- "snippets/csharp/extract_bytes_sync.md"
=== "Ruby"
--8<-- "snippets/ruby/api/extract_bytes_sync.md"
=== "R"
--8<-- "snippets/r/api/extract_bytes_sync.md"
=== "C"
--8<-- "snippets/c/api/extract_bytes_sync.md"
=== "Wasm"
--8<-- "snippets/wasm/api/extract_bytes_sync.md"
### Asynchronous
=== "Python"
--8<-- "snippets/python/api/extract_bytes_async.md"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/extract_bytes_async.md"
=== "Rust"
--8<-- "snippets/rust/api/extract_bytes_async.md"
=== "Go"
--8<-- "snippets/go/api/extract_bytes_async.md"
=== "Java"
--8<-- "snippets/java/api/extract_bytes_async.md"
=== "C#"
--8<-- "snippets/csharp/extract_bytes_async.md"
=== "Ruby"
--8<-- "snippets/ruby/api/extract_bytes_async.md"
=== "R"
--8<-- "snippets/r/api/extract_bytes_async.md"
=== "C"
--8<-- "snippets/c/api/extract_bytes_async.md"
=== "Wasm"
--8<-- "snippets/wasm/api/extract_bytes_async.md"
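
Since byte extraction needs an explicit MIME type, upload handlers often derive it from the original filename before calling the extraction function. The sketch below uses Python's standard `mimetypes` module; `mime_for_upload` is a hypothetical helper, and how to treat an unknown type is left to the caller.

```python
import mimetypes
from typing import Optional


def mime_for_upload(original_name: str) -> Optional[str]:
    """Guess a MIME type from the uploader's filename, or None if unknown."""
    guessed, _encoding = mimetypes.guess_type(original_name)
    return guessed


if __name__ == "__main__":
    for name in ("report.pdf", "notes.html", "blob"):
        print(name, "->", mime_for_upload(name))
```

When the result is `None`, rejecting the upload or asking the client for a content type is usually safer than guessing.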
## Batch Processing
Batch functions accept an array of file paths (or byte arrays) and process them concurrently. This is typically 2-5x faster than looping over single-file functions because Kreuzberg parallelizes internally.
### Batch Extract Files
=== "Python"
--8<-- "snippets/python/api/batch_extract_files_sync.md"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"
=== "Rust"
--8<-- "snippets/rust/api/batch_extract_files_sync.md"
=== "Go"
--8<-- "snippets/go/api/batch_extract_files_sync.md"
=== "Java"
--8<-- "snippets/java/api/batch_extract_files_sync.md"
=== "C#"
--8<-- "snippets/csharp/batch_extract_files_sync.md"
=== "Ruby"
--8<-- "snippets/ruby/api/batch_extract_files_sync.md"
=== "R"
--8<-- "snippets/r/api/batch_extract_files_sync.md"
=== "C"
--8<-- "snippets/c/api/batch_extract_files_sync.md"
=== "Wasm"
--8<-- "snippets/wasm/api/batch_extract_files_sync.md"
### Batch Extract Bytes
=== "Python"
--8<-- "snippets/python/api/batch_extract_bytes_sync.md"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/batch_extract_bytes_sync.md"
=== "Rust"
--8<-- "snippets/rust/api/batch_extract_bytes_sync.md"
=== "Go"
--8<-- "snippets/go/api/batch_extract_bytes_sync.md"
=== "Java"
--8<-- "snippets/java/api/batch_extract_bytes_sync.md"
=== "C#"
--8<-- "snippets/csharp/batch_extract_bytes_sync.md"
=== "Ruby"
--8<-- "snippets/ruby/api/batch_extract_bytes_sync.md"
=== "R"
--8<-- "snippets/r/api/batch_extract_bytes_sync.md"
=== "C"
--8<-- "snippets/c/api/batch_extract_bytes_sync.md"
=== "Wasm"
--8<-- "snippets/wasm/api/batch_extract_bytes_sync.md"
### Per-File Configuration <span class="version-badge">v4.5.0</span>
When a batch contains a mix of document types that need different settings (for example, scanned images needing OCR alongside text-based PDFs), use `FileExtractionConfig` to override options per file while sharing a common batch config.
=== "Python"
```python title="mixed_batch.py"
from kreuzberg import (
batch_extract_files_sync,
ExtractionConfig,
FileExtractionConfig,
OcrConfig,
)
config = ExtractionConfig(output_format="markdown")
paths = ["report.pdf", "scan.tiff", "notes.html"]
file_configs = [
None,
FileExtractionConfig(
force_ocr=True,
ocr=OcrConfig(backend="tesseract", language="deu"),
),
FileExtractionConfig(output_format="plain"),
]
results = batch_extract_files_sync(paths, config, file_configs=file_configs)
```
=== "TypeScript"
```typescript title="mixed_batch.ts"
import { batchExtractFilesSync } from '@kreuzberg/node';
const results = batchExtractFilesSync(
['report.pdf', 'scan.tiff', 'notes.html'],
{ outputFormat: 'markdown' },
[
null,
{ forceOcr: true, ocr: { backend: 'tesseract', language: 'deu' } },
{ outputFormat: 'plain' },
],
);
```
=== "Rust"
```rust title="mixed_batch.rs"
use kreuzberg::{
batch_extract_files, ExtractionConfig, FileExtractionConfig,
OcrConfig, OutputFormat,
};
use std::path::PathBuf;
let config = ExtractionConfig {
output_format: OutputFormat::Markdown,
..Default::default()
};
let paths = vec![
PathBuf::from("report.pdf"),
PathBuf::from("scan.tiff"),
PathBuf::from("notes.html"),
];
let file_configs = vec![
None,
Some(FileExtractionConfig {
force_ocr: Some(true),
ocr: Some(OcrConfig {
backend: "tesseract".to_string(),
language: "deu".to_string(),
..Default::default()
}),
..Default::default()
}),
Some(FileExtractionConfig {
output_format: Some(OutputFormat::Plain),
..Default::default()
}),
];
let results = batch_extract_files(paths, &config, Some(&file_configs)).await?;
```
Fields set to `None` in `FileExtractionConfig` inherit the batch default. Batch-level concerns like `max_concurrent_extractions`, `use_cache`, and `security_limits` cannot be overridden per file. See the [Configuration Reference](../reference/configuration.md#fileextractionconfig) for the full list of overridable fields.
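The inheritance rule can be pictured as a simple merge. The sketch below is not Kreuzberg's implementation: `BatchDefaults`, `FileOverrides`, and `resolve` are hypothetical names that model only two fields, under the assumption that any override left as `None` falls back to the batch value.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class BatchDefaults:
    """Hypothetical subset of batch-level settings."""
    output_format: str = "markdown"
    force_ocr: bool = False


@dataclass(frozen=True)
class FileOverrides:
    """Hypothetical per-file overrides; None means 'inherit the batch value'."""
    output_format: Optional[str] = None
    force_ocr: Optional[bool] = None


def resolve(defaults: BatchDefaults, overrides: Optional[FileOverrides]) -> BatchDefaults:
    """Return the effective settings for one file."""
    if overrides is None:
        return defaults
    # Keep only the fields the caller actually set.
    changes = {
        f.name: getattr(overrides, f.name)
        for f in fields(overrides)
        if getattr(overrides, f.name) is not None
    }
    return replace(defaults, **changes)


defaults = BatchDefaults()
per_file = [None, FileOverrides(force_ocr=True), FileOverrides(output_format="plain")]
resolved = [resolve(defaults, o) for o in per_file]
```

The three entries mirror the example above: the first file uses the batch config unchanged, the second only flips `force_ocr`, and the third only changes the output format.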
## Content Filtering <span class="version-badge">v4.8.0</span>
Kreuzberg strips running headers, footers, watermarks, and cross-page repeating text by default so that downstream RAG and LLM pipelines see clean body content. `ContentFilterConfig` lets you opt back in to any of these when you need them, for example when extracting legal forms where the header carries the case number, or when running text analysis on a PDF whose brand name was being incorrectly removed by the repeating-text heuristic.
See [ContentFilterConfig](../reference/configuration.md#contentfilterconfig) for field-level defaults and per-format behavior.
=== "Python"
```python title="keep_headers_footers.py"
from kreuzberg import (
extract_file_sync,
ContentFilterConfig,
ExtractionConfig,
)
# Legal/forms work: keep header and footer text
config = ExtractionConfig(
content_filter=ContentFilterConfig(
include_headers=True,
include_footers=True,
),
)
result = extract_file_sync("contract.pdf", config=config)
```
=== "TypeScript"
```typescript title="disable_repeating_text.ts"
import { extract } from "@kreuzberg/node";
// Disable cross-page deduplication so brand names aren't stripped
const result = await extract("brochure.pdf", {
contentFilter: {
stripRepeatingText: false,
},
});
```
=== "Rust"
```rust title="content_filter.rs"
use kreuzberg::{extract_file_sync, ContentFilterConfig, ExtractionConfig};
let config = ExtractionConfig {
content_filter: Some(ContentFilterConfig {
include_headers: true,
include_footers: true,
strip_repeating_text: true,
include_watermarks: false,
}),
..Default::default()
};
let result = extract_file_sync("contract.pdf", None, &config)?;
```
When a layout-detection model is active, it can independently classify regions as page headers or footers and strip them per page. Setting `include_headers=True` / `include_footers=True` also disables that per-page stripping. See the [reference page](../reference/configuration.md#contentfilterconfig) for the full field semantics and per-format behavior.
## Supported Formats
Kreuzberg supports 90+ file formats across 8 categories:
| Category | Extensions | Notes |
| ----------------- | -------------------------------------------------------- | ----------------------------------- |
| **PDF** | `.pdf` | Native text + OCR for scanned pages |
| **Images** | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp` | Requires OCR backend |
| **Office** | `.docx`, `.pptx`, `.xlsx` | Modern formats via native parsers |
| **Legacy Office** | `.doc`, `.ppt` | Native OLE/CFB parsing |
| **Email** | `.eml`, `.msg` | Full support including attachments |
| **Web** | `.html`, `.htm` | Converted to Markdown with metadata |
| **Text** | `.md`, `.txt`, `.xml`, `.json`, `.yaml`, `.toml`, `.csv` | Direct extraction |
| **Archives** | `.zip`, `.tar`, `.tar.gz`, `.tar.bz2` | Recursive extraction |
## Page Tracking
Kreuzberg can track page boundaries and extract per-page content. Page tracking availability depends on the format:
- **PDF** — Full byte-accurate page tracking with O(1) lookup
- **PPTX** — Slide boundary tracking (each slide = one page)
- **DOCX** — Best-effort detection using explicit `<w:br w:type="page"/>` tags
- **Other formats** — No page tracking
Enable page extraction with `PageConfig`:
```python title="page_tracking.py"
config = ExtractionConfig(
pages=PageConfig(
insert_page_markers=True,
marker_format="\n\n<!-- PAGE {page_num} -->\n\n"
)
)
```
Page markers like `<!-- PAGE 1 -->` are inserted at boundaries in the `content` field — useful for LLMs that need to understand document layout. When both page tracking and chunking are enabled, chunks automatically include `first_page` and `last_page` metadata.
See [PageConfig Reference](../reference/configuration.md#pageconfig) for all options and [Advanced Page Tracking](./advanced.md) for chunk-to-page mapping examples.
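A minimal sketch of consuming the markers downstream, assuming the `<!-- PAGE n -->` marker format configured above and assuming each marker precedes the text of its page. `split_pages` is a hypothetical helper, not part of the Kreuzberg API.

```python
import re

# Matches the marker produced by marker_format="\n\n<!-- PAGE {page_num} -->\n\n".
PAGE_MARKER = re.compile(r"<!-- PAGE (\d+) -->")


def split_pages(content: str) -> dict[int, str]:
    """Map page number -> text that follows that page's marker."""
    # With one capture group, re.split returns:
    # [text_before_first_marker, "1", page_1_text, "2", page_2_text, ...]
    parts = PAGE_MARKER.split(content)
    pages: dict[int, str] = {}
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1].strip()
    return pages


sample = "\n\n<!-- PAGE 1 -->\n\nIntro text.\n\n<!-- PAGE 2 -->\n\nDetails."
pages = split_pages(sample)
```

If you change `marker_format`, the regular expression has to change with it.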
## Code File Extraction
Source code files (`.py`, `.rs`, `.ts`, `.go`, etc.) go through tree-sitter and produce a `ProcessResult` on `ExtractionResult.code_intelligence` (structure, imports/exports, symbols, docstrings, diagnostics, semantic chunks). Code files bypass text chunking: the function- and class-aware `CodeChunks` from TSLP (the tree-sitter language pack) map directly to Kreuzberg `Chunk`s with a semantic `chunk_type` and heading context.
See [Code Intelligence](code-intelligence.md) for usage and [`TreeSitterProcessConfig`](../reference/configuration.md#treesitterprocessconfig) for fields.
## PDF Page Rendering
Render individual PDF pages as PNG images. Unlike the extraction pipeline (which parses text, tables, metadata), this API produces raw pixel data for thumbnails, vision model input, or custom OCR pipelines.
### Two Approaches
| API | When to use |
| ----------------- | ---------------------------------------------------------------------- |
| `render_pdf_page` | You know which page you need, or only need a few pages |
| `PdfPageIterator` | Process every page sequentially without loading all images into memory |
### DPI Configuration
| DPI | Pixel size (US Letter) | Use case |
| ------------- | ---------------------- | ------------------------------- |
| 72 | 612 x 792 | Thumbnails, quick previews |
| 150 (default) | 1275 x 1650 | General-purpose, screen display |
| 300 | 2550 x 3300 | OCR input, print quality |
**Tip:** Use 300 DPI when rendering pages for OCR or vision models. The default 150 DPI may reduce recognition accuracy on small text.
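The pixel sizes in the table are page size in inches multiplied by DPI. The sketch below reproduces them for US Letter (8.5 x 11 in). The memory estimate assumes an uncompressed 4-bytes-per-pixel (RGBA) buffer, which is an assumption about the raw bitmap, not a statement about how Kreuzberg stores rendered pages.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def rgba_bytes(width_px: int, height_px: int) -> int:
    """Size of an uncompressed bitmap, assuming 4 bytes per pixel."""
    return width_px * height_px * 4


US_LETTER = (8.5, 11.0)  # inches
sizes = {dpi: page_pixels(*US_LETTER, dpi) for dpi in (72, 150, 300)}
bytes_at_300 = rgba_bytes(*sizes[300])
```

Doubling the DPI quadruples the pixel count, which is worth keeping in mind when rendering many pages at 300 DPI.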
## MIME Type Detection
When extracting from bytes, Kreuzberg requires an explicit MIME type because there is no file extension to infer it from. For file paths, the MIME type is detected automatically from the extension.
### Example: Override MIME Type
```python title="Python"
from kreuzberg import ExtractionConfig, extract_file_sync
config = ExtractionConfig()
# File without extension: provide the MIME type explicitly
result = extract_file_sync("document_copy", mime_type="application/pdf", config=config)
```
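To show the idea of extension-based detection with an explicit override, here is a sketch built on Python's standard `mimetypes` module. It is an analogy only: `resolve_mime` is a hypothetical helper, and Kreuzberg's own detection logic may differ.

```python
import mimetypes
from typing import Optional


def resolve_mime(path: str, explicit: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit MIME type; otherwise guess from the file extension."""
    if explicit is not None:
        return explicit
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed  # None when the extension is missing or unknown


from_extension = resolve_mime("report.pdf")
no_extension = resolve_mime("document_copy")
overridden = resolve_mime("document_copy", explicit="application/pdf")
```

A path without an extension yields no guess, which is the case where an explicit MIME type is needed.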
## Error Handling
All extraction functions raise typed exceptions on failure. Catch specific exceptions to handle different failure modes:
=== "Python"
--8<-- "snippets/python/utils/error_handling.md"
=== "TypeScript"
--8<-- "snippets/typescript/api/error_handling.md"
=== "Rust"
--8<-- "snippets/rust/api/error_handling.md"
=== "Go"
--8<-- "snippets/go/api/error_handling.md"
=== "Java"
--8<-- "snippets/java/api/error_handling.md"
=== "C#"
--8<-- "snippets/csharp/error_handling.md"
=== "Ruby"
--8<-- "snippets/ruby/api/error_handling.md"
=== "R"
--8<-- "snippets/r/api/error_handling.md"
=== "C"
--8<-- "snippets/c/api/error_handling.md"
=== "Wasm"
--8<-- "snippets/wasm/api/error_handling_wasm.md"
!!! Warning "System Errors"
`OSError` (Python), `std::io::Error` (Rust), and other system-level errors always propagate. They indicate real system problems (permissions, disk space, etc.) that your application should handle.
## Next Steps
- [Configuration](configuration.md) — all configuration options and file formats
- [OCR Guide](ocr.md) — set up optical character recognition
- [Advanced Features](advanced.md) — chunking, language detection, embeddings
- [Element-Based Output](output-formats.md#element-based-output-v410) — structured element arrays for RAG
- [Document Structure](output-formats.md#document-structure) — hierarchical tree output

244
docs/guides/html-output.md Normal file

@@ -0,0 +1,244 @@
# HTML Output
!!! Info "Added in v4.8.1"
Render extracted document content as styled HTML with semantic `kb-*` CSS classes, configurable themes, and full CSS customization.
## Quick Start
=== "CLI"
```bash title="Terminal"
kreuzberg extract doc.pdf --html-theme github
```
=== "Python"
```python title="html_output.py"
from kreuzberg import ExtractionConfig, HtmlOutputConfig, HtmlTheme, extract_file
config = ExtractionConfig(
output_format="html",
html_output=HtmlOutputConfig(theme=HtmlTheme.GitHub),
)
result = await extract_file("doc.pdf", config=config)
print(result.content) # styled HTML string
```
=== "TypeScript"
```typescript title="html_output.ts"
import { extractFile, HtmlTheme } from '@kreuzberg/node';
const result = await extractFile('doc.pdf', {
outputFormat: 'html',
htmlOutput: { theme: HtmlTheme.GitHub },
});
console.log(result.content);
```
=== "Rust"
```rust title="html_output.rs"
use kreuzberg::{extract_file, ExtractionConfig, HtmlOutputConfig, HtmlTheme};
let config = ExtractionConfig {
output_format: "html".to_string(),
html_output: Some(HtmlOutputConfig {
theme: HtmlTheme::GitHub,
..Default::default()
}),
..Default::default()
};
let result = extract_file("doc.pdf", None, &config).await?;
println!("{}", result.content);
```
## Built-in Themes
| Theme | Description |
| -------------------- | -------------------------------------------------------------------------------------- |
| `unstyled` (default) | No built-in CSS. Only structural markup with `kb-*` classes. Use your own style sheet. |
| `default` | System font stack, neutral colours, 72ch max width. All CSS custom properties defined. |
| `github` | GitHub Markdown-inspired palette, border-bottom headings, 80ch max width. |
| `dark` | Dark background (#0d1117), light text. Good for terminal/IDE integrations. |
| `light` | Minimal light theme with generous spacing. |
## Configuration
See [HtmlOutputConfig](../reference/configuration.md#htmloutputconfig) for detailed field documentation.
=== "Python"
```python title="html_config.py"
from kreuzberg import ExtractionConfig, HtmlOutputConfig, HtmlTheme
config = ExtractionConfig(
output_format="html",
html_output=HtmlOutputConfig(
theme=HtmlTheme.Dark,
css="body { padding: 2rem; }",
class_prefix="kb-",
embed_css=True,
),
)
```
=== "TypeScript"
```typescript title="html_config.ts"
import { HtmlTheme } from '@kreuzberg/node';
const config = {
outputFormat: 'html',
htmlOutput: {
theme: HtmlTheme.Dark,
css: 'body { padding: 2rem; }',
classPrefix: 'kb-',
embedCss: true,
},
};
```
=== "Rust"
```rust title="html_config.rs"
use kreuzberg::{ExtractionConfig, HtmlOutputConfig, HtmlTheme};
let config = ExtractionConfig {
output_format: "html".to_string(),
html_output: Some(HtmlOutputConfig {
theme: HtmlTheme::Dark,
css: Some("body { padding: 2rem; }".to_string()),
class_prefix: "kb-".to_string(),
embed_css: true,
..Default::default()
}),
..Default::default()
};
```
## CLI Flags
| Flag | Description |
| ------------------------------ | -------------------------------------------------------------------------------------------------- |
| `--html-theme <THEME>` | Built-in theme: `default`, `github`, `dark`, `light`, `unstyled`. Implies `--content-format html`. |
| `--html-css <CSS>` | Inline CSS string appended after the theme stylesheet. |
| `--html-css-file <PATH>` | Path to CSS file loaded at render time (max 1 MiB). |
| `--html-class-prefix <PREFIX>` | CSS class prefix; default: `"kb-"`. Alphanumeric, hyphens, underscores only. |
| `--html-no-embed-css` | Suppress the `<style>` block; use external stylesheet instead. |
## CSS Customization
All built-in themes (except `unstyled`) define CSS custom properties on `:root`. Override them to adjust the theme without replacing it entirely:
```css title="custom.css"
:root {
--kb-font-family: "Inter", sans-serif;
--kb-text-color: #333;
--kb-max-width: 60ch;
}
```
Pass custom CSS inline or from a file:
=== "CLI"
```bash title="Terminal"
# Inline override
kreuzberg extract doc.pdf --html-theme github \
--html-css ':root { --kb-max-width: 60ch; }'
# From a file
kreuzberg extract doc.pdf --html-theme github \
--html-css-file custom.css
```
=== "Python"
```python title="custom_css.py"
from kreuzberg import ExtractionConfig, HtmlOutputConfig, HtmlTheme
config = ExtractionConfig(
output_format="html",
html_output=HtmlOutputConfig(
theme=HtmlTheme.GitHub,
css_file="custom.css",
),
)
```
=== "Rust"
```rust title="custom_css.rs"
use kreuzberg::{ExtractionConfig, HtmlOutputConfig, HtmlTheme};
use std::path::PathBuf;
let config = ExtractionConfig {
output_format: "html".to_string(),
html_output: Some(HtmlOutputConfig {
theme: HtmlTheme::GitHub,
css_file: Some(PathBuf::from("custom.css")),
..Default::default()
}),
..Default::default()
};
```
To use your own style sheet, set the theme to `unstyled` and disable the embedded `<style>` block:
```python title="external_stylesheet.py"
config = ExtractionConfig(
output_format="html",
html_output=HtmlOutputConfig(
theme=HtmlTheme.Unstyled,
embed_css=False,
),
)
```
## Class Reference
All generated HTML elements include semantic `kb-*` classes for targeted styling.
| Class | Element | Description |
| --------------------------- | ---------------------- | ----------------------------- |
| `kb-doc` | `<div>` | Root wrapper |
| `kb-content` | `<main>` | Content area |
| `kb-doc-title` | `<h1>` | Document title |
| `kb-h`, `kb-h1`..`kb-h6` | `<h1>`..`<h6>` | Headings |
| `kb-p` | `<p>` | Paragraphs |
| `kb-list`, `kb-ul`, `kb-ol` | `<ul>`, `<ol>` | Lists |
| `kb-li` | `<li>` | List items |
| `kb-blockquote` | `<blockquote>` | Block quotes |
| `kb-pre` | `<pre>` | Code blocks |
| `kb-code` | `<code>` | Inline/block code |
| `kb-table` | `<table>` | Tables |
| `kb-thead`, `kb-tbody` | `<thead>`, `<tbody>` | Table sections |
| `kb-th`, `kb-td`, `kb-tr` | `<th>`, `<td>`, `<tr>` | Table cells/rows |
| `kb-figure` | `<figure>` | Image wrapper |
| `kb-img` | `<img>` | Images |
| `kb-page-break` | `<hr>` | Page breaks |
| `kb-footnote` | `<aside>` | Footnote definitions |
| `kb-footnote-ref` | `<sup>` | Footnote references |
| `kb-citation` | `<cite>` | Citations |
| `kb-link` | `<a>` | Hyperlinks |
| `kb-metadata` | `<dl>` | Metadata blocks |
| `kb-formula` | `<pre>` | Math formulas |
| `kb-slide` | `<section>` | Slide sections |
| `kb-dt`, `kb-dd` | `<dt>`, `<dd>` | Definition terms/descriptions |
| `kb-admonition` | `<aside>` | Admonitions |
| `kb-group` | `<div>` | Grouped content |
!!! Tip "Custom prefix"
If you set `class_prefix` to `"my-"`, all classes become `my-doc`, `my-content`, `my-h1`, and so on.
## Security
!!! Warning "Security considerations"
- `class_prefix` is validated to prevent HTML injection
- `</style>` sequences are stripped from user CSS
- `css_file` is limited to 1 MiB
- When serving HTML to untrusted users, sanitize CSS at the application layer
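The checks listed in the warning can be sketched as follows. This is an illustration under stated assumptions, not Kreuzberg's code: `validate_prefix` and `sanitize_css` are hypothetical names, the prefix rule follows the CLI description (alphanumeric, hyphens, underscores), and rejecting an empty prefix is an assumption.

```python
import re

MAX_CSS_BYTES = 1024 * 1024  # 1 MiB, the documented css_file limit
PREFIX_RE = re.compile(r"[A-Za-z0-9_-]+")
STYLE_CLOSE_RE = re.compile(r"</style\s*>", re.IGNORECASE)


def validate_prefix(prefix: str) -> str:
    """Reject prefixes that could break out of a class attribute."""
    if not PREFIX_RE.fullmatch(prefix):
        raise ValueError(f"invalid class prefix: {prefix!r}")
    return prefix


def sanitize_css(css: str) -> str:
    """Enforce the size limit and strip </style> sequences from user CSS."""
    if len(css.encode("utf-8")) > MAX_CSS_BYTES:
        raise ValueError("CSS exceeds 1 MiB")
    # Repeat until stable so nested input like "</sty</style>le>" cannot
    # reassemble into a closing tag after a single pass.
    previous = None
    while previous != css:
        previous = css
        css = STYLE_CLOSE_RE.sub("", css)
    return css


clean = sanitize_css("p { color: red; }</style><script>")
nested = sanitize_css("a</sty</style>le>b")
```

Stripping `</style>` only keeps user CSS inside the `<style>` block; it does not make arbitrary CSS safe, which is why the last bullet still applies.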
## See Also
- [Configuration](configuration.md) -- all configuration options
- [Extraction Basics](extraction.md) -- core extraction API and supported formats
- [Element-Based Output](output-formats.md#element-based-output-v410) -- structured element output as an alternative to HTML
- [Document Structure](output-formats.md#document-structure) -- how Kreuzberg models document structure

105
docs/guides/keywords.md Normal file

@@ -0,0 +1,105 @@
# Keyword Extraction
Extract ranked keywords from document text using YAKE or RAKE algorithms.
| Algorithm | Scoring | Best for |
| --------- | ---------------------------------------- | --------------------------------------------- |
| **YAKE**  | Lower score = more relevant (0.0-1.0)    | General documents, single terms, multilingual |
| **RAKE** | Higher score = more relevant (unbounded) | Multi-word phrases, technical docs |
## Quick Start
=== "Python"
--8<-- "snippets/python/utils/keyword_extraction_example.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/keyword_extraction_example.md"
=== "Rust"
--8<-- "snippets/rust/advanced/keyword_extraction_example.md"
=== "Go"
--8<-- "snippets/go/utils/keyword_extraction_example.md"
=== "Java"
--8<-- "snippets/java/utils/keyword_extraction_example.md"
=== "C#"
--8<-- "snippets/csharp/advanced/keyword_extraction_example.md"
=== "Ruby"
--8<-- "snippets/ruby/utils/keyword_extraction_example.md"
Keywords are returned in `result.extracted_keywords` as objects with `text` and `score` fields.
## Configuration
See [KeywordConfig reference](../reference/configuration.md#keywordconfig) for all configuration options.
=== "Python"
--8<-- "snippets/python/config/keyword_extraction_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/keyword_extraction_config.md"
=== "Rust"
--8<-- "snippets/rust/config/keyword_extraction_config.md"
=== "Go"
--8<-- "snippets/go/config/keyword_extraction_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/keyword_extraction_config.md"
=== "R"
--8<-- "snippets/r/config/keyword_extraction_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/keyword_extraction_config.md"
## YAKE Score Tuning
Use `min_score` as an upper bound: keywords scoring above it are dropped. Lower YAKE scores mean higher relevance:
| `min_score` | Effect |
| ----------- | ------------------- |
| `0.5` | Keeps most keywords |
| `0.3` | Main topics only |
| `0.1` | Core concepts only |
`yake_params.window_size` controls co-occurrence context: `1-2` for narrow domains, `2-3` for general text (default: `2`), `3-4` for discussion-heavy content.
## RAKE Score Tuning
Use `min_score` as a lower bound: keywords scoring below it are dropped. Higher RAKE scores mean higher relevance:
| `min_score` | Effect |
| ----------- | ---------------------------- |
| `0.1` | Keeps most keywords |
| `5.0` | Main phrases only |
| `20.0` | Only highly specific phrases |
`rake_params.min_word_length` (default: `1`) and `rake_params.max_words_per_phrase` (default: `3`) control phrase boundaries.
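Because the two algorithms score in opposite directions, the same `min_score` value filters in opposite directions too. The sketch below models that rule on plain data. `Keyword` and `filter_keywords` are hypothetical names, and treating the bound as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    text: str
    score: float


def filter_keywords(keywords: list[Keyword], algorithm: str, min_score: float) -> list[Keyword]:
    """Apply min_score with the direction that matches the algorithm."""
    if algorithm == "yake":
        # Lower is better: min_score acts as an upper bound.
        kept = [k for k in keywords if k.score <= min_score]
        return sorted(kept, key=lambda k: k.score)
    if algorithm == "rake":
        # Higher is better: min_score acts as a lower bound.
        kept = [k for k in keywords if k.score >= min_score]
        return sorted(kept, key=lambda k: k.score, reverse=True)
    raise ValueError(f"unknown algorithm: {algorithm}")


yake_scored = [Keyword("page", 0.6), Keyword("extraction", 0.05), Keyword("document", 0.25)]
rake_scored = [Keyword("text", 1.0), Keyword("optical character recognition", 9.0)]

yake_kept = filter_keywords(yake_scored, "yake", min_score=0.3)
rake_kept = filter_keywords(rake_scored, "rake", min_score=5.0)
```

In both cases the most relevant keyword comes first in the result.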
## Troubleshooting
- **Too few keywords** — Loosen `min_score` (lower it for RAKE, raise it for YAKE), check that `result.content` is non-empty, and set `language` to match the document or to `None` to disable stopword filtering
- **Too many irrelevant keywords** — Tighten `min_score` (raise it for RAKE, lower it for YAKE), set `language` for stopword filtering, and reduce the `ngram_range` upper bound
- **Multi-word phrases missing (YAKE)** — Switch to RAKE or confirm `ngram_range` upper bound is >= 2
- **Keywords don't match content** — Verify text was extracted (`result.content`) and `language` matches the document
See the [KeywordConfig reference](../reference/configuration.md#keywordconfig) for the full parameter list.

530
docs/guides/kubernetes.md Normal file

@@ -0,0 +1,530 @@
# Kubernetes Deployment <span class="version-badge new">v4.2.2</span>
Deploy Kreuzberg to Kubernetes with proper OCR configuration, permissions, and health checks.
## Helm Chart <span class="version-badge new">v4.8.4</span>
Deploy via the official Helm chart (OCI artifact on GHCR).
### Install
```bash title="Terminal"
helm install kreuzberg oci://ghcr.io/kreuzberg-dev/charts/kreuzberg --version 4.8.4
```
### Configure
Override defaults with a `values.yaml` file:
```yaml title="values.yaml"
# NOTE: cache.enabled=true uses ReadWriteOnce by default; keep replicaCount: 1
# with RWO storage or switch to ReadWriteMany before increasing replicas.
replicaCount: 1
image:
  tag: "4.8.4"
kreuzberg:
  logLevel: "info"
  ocrLanguage: "eng"
resources:
  requests:
    memory: "1Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"
cache:
  enabled: true
  size: 5Gi
ingress:
  enabled: true
  className: "nginx"
  hosts:
    - host: kreuzberg.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: kreuzberg-tls
      hosts:
        - kreuzberg.example.com
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 80
podDisruptionBudget:
  enabled: true
  minAvailable: 1
```
```bash title="Terminal"
helm install kreuzberg oci://ghcr.io/kreuzberg-dev/charts/kreuzberg \
--version 4.8.4 \
-f values.yaml
```
### Upgrade
```bash title="Terminal"
helm upgrade kreuzberg oci://ghcr.io/kreuzberg-dev/charts/kreuzberg --version 4.8.4
```
### What's Included
The chart creates the following resources:
| Resource | Description | Conditional |
| ----------------------- | ---------------------------------------------------------- | ----------------------------- |
| Deployment | Main application with health probes and security hardening | Always |
| Service | ClusterIP service on port 80 → 8000 | Always |
| ServiceAccount | Dedicated service account | Always |
| PersistentVolumeClaim | Cache for embedding models and assets | `cache.enabled` |
| Ingress | HTTP(S) ingress with TLS | `ingress.enabled` |
| HorizontalPodAutoscaler | CPU/memory-based autoscaling | `autoscaling.enabled` |
| PodDisruptionBudget | Availability during disruptions | `podDisruptionBudget.enabled` |
All values are documented in the chart's [`values.yaml`](https://github.com/kreuzberg-dev/kreuzberg/blob/main/charts/kreuzberg/values.yaml).
---
## Quick Start
```yaml title="minimal-deployment.yaml"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kreuzberg-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kreuzberg
  template:
    metadata:
      labels:
        app: kreuzberg
    spec:
      containers:
        - name: kreuzberg
          image: ghcr.io/kreuzberg-dev/kreuzberg:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: RUST_LOG
              value: "info"
            - name: TESSDATA_PREFIX
              value: "/usr/share/tesseract-ocr/5/tessdata"
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: kreuzberg-api
spec:
  selector:
    app: kreuzberg
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
```
```bash title="Terminal"
kubectl apply -f minimal-deployment.yaml
```
## Tesseract Configuration
### TESSDATA_PREFIX (Critical)
Without `TESSDATA_PREFIX`, OCR silently falls back to non-OCR extraction. Official images ship Tesseract 5.x with tessdata at `/usr/share/tesseract-ocr/5/tessdata/`.
```yaml
env:
  - name: TESSDATA_PREFIX
    value: "/usr/share/tesseract-ocr/5/tessdata"
  - name: KREUZBERG_OCR_LANGUAGE
    value: "eng"
  - name: KREUZBERG_CACHE_DIR
    value: "/app/.kreuzberg"
  - name: HF_HOME
    value: "/app/.kreuzberg/huggingface"
```
**Pre-installed languages:** `eng`, `spa`, `fra`, `deu`, `ita`, `por`, `chi_sim`, `chi_tra`, `jpn`, `ara`, `rus`, `hin`
!!! Note "Tesseract Version"
The path varies by version. Verify yours with `tesseract --version` inside the container if using a custom base image.
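Since a missing or wrong `TESSDATA_PREFIX` fails silently, a small pre-flight check can make the problem visible. The sketch below is a hypothetical helper, not a Kreuzberg API; it assumes the standard Tesseract layout of one `<lang>.traineddata` file per language directly inside the tessdata directory.

```python
import os
from pathlib import Path


def missing_traineddata(tessdata_dir: str, languages: list[str]) -> list[str]:
    """Return the languages that have no <lang>.traineddata file in tessdata_dir."""
    root = Path(tessdata_dir)
    return [lang for lang in languages if not (root / f"{lang}.traineddata").is_file()]


def check_tessdata(languages: list[str]) -> list[str]:
    """Read TESSDATA_PREFIX from the environment and report missing languages."""
    prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        raise RuntimeError("TESSDATA_PREFIX is not set")
    return missing_traineddata(prefix, languages)
```

Running `check_tessdata(["eng", "deu"])` inside the container returns an empty list when both language files are present.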
### Custom Languages via ConfigMap
```bash title="Terminal"
kubectl create configmap tessdata \
--from-file=/path/to/eng.traineddata \
--from-file=/path/to/deu.traineddata
```
```yaml
spec:
  containers:
    - name: kreuzberg
      env:
        - name: TESSDATA_PREFIX
          value: "/etc/tessdata"
      volumeMounts:
        - name: tessdata
          mountPath: /etc/tessdata
  volumes:
    - name: tessdata
      configMap:
        name: tessdata
```
For large custom language sets, use a PVC instead of a ConfigMap.
### Verify Tesseract
```bash title="Terminal"
kubectl exec -it deployment/kreuzberg-api -- tesseract --version
kubectl exec -it deployment/kreuzberg-api -- tesseract --list-langs
kubectl exec -it deployment/kreuzberg-api -- printenv TESSDATA_PREFIX
```
## Permissions
Kreuzberg runs as non-root (UID 1000, GID 1000). Fix PVC permissions with either approach:
=== "Init Container"
```yaml
spec:
  initContainers:
    - name: init-permissions
      image: busybox:1.37-glibc
      command: ['sh', '-c', 'chown -R 1000:1000 /app/.kreuzberg']
      securityContext:
        runAsUser: 0
        allowPrivilegeEscalation: false
        capabilities:
          add: ["CHOWN"]
          drop: ["ALL"]
      volumeMounts:
        - name: cache
          mountPath: /app/.kreuzberg
  containers:
    - name: kreuzberg
      volumeMounts:
        - name: cache
          mountPath: /app/.kreuzberg
```
=== "fsGroup"
```yaml
spec:
  securityContext:
    fsGroup: 1000
  containers:
    - name: kreuzberg
      securityContext:
        runAsUser: 1000
        runAsGroup: 1000
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```
## Health Checks
```yaml
containers:
  - name: kreuzberg
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 10
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 3
      failureThreshold: 2
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      periodSeconds: 10
      failureThreshold: 30
```
## Logging
```yaml
env:
  - name: RUST_LOG
    value: "kreuzberg=debug,warn"
```
Levels: `trace`, `debug`, `info`, `warn`, `error`
```bash title="Terminal"
kubectl logs deployment/kreuzberg-api --tail=50
kubectl logs deployment/kreuzberg-api -f
kubectl logs deployment/kreuzberg-api --previous
```
## Production Deployment
Full production manifest with namespace, PVC, security context, init container, PDB, and all probes:
```yaml title="production-deployment.yaml"
apiVersion: v1
kind: Namespace
metadata:
  name: kreuzberg
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kreuzberg-cache
  namespace: kreuzberg
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 2Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kreuzberg-api
  namespace: kreuzberg
# NOTE: PVC uses ReadWriteOnce; keep replicas: 1 with RWO storage.
# Increase replicas only when using ReadWriteMany storage.
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kreuzberg
  template:
    metadata:
      labels:
        app: kreuzberg
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      initContainers:
        - name: init-cache
          image: busybox:1.37-glibc
          command: ["sh", "-c", "mkdir -p /app/.kreuzberg && chown -R 1000:1000 /app/.kreuzberg"]
          securityContext:
            # Override the pod-level runAsNonRoot so this container may run as root.
            runAsNonRoot: false
            runAsUser: 0
            allowPrivilegeEscalation: false
            capabilities:
              add: ["CHOWN"]
              drop: ["ALL"]
          volumeMounts:
            - name: cache
              mountPath: /app/.kreuzberg
      containers:
        - name: kreuzberg
          image: ghcr.io/kreuzberg-dev/kreuzberg:latest
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: RUST_LOG
              value: "info"
            - name: TESSDATA_PREFIX
              value: "/usr/share/tesseract-ocr/5/tessdata"
            - name: KREUZBERG_CACHE_DIR
              value: "/app/.kreuzberg"
            - name: HF_HOME
              value: "/app/.kreuzberg/huggingface"
            - name: KREUZBERG_CORS_ORIGINS
              value: "https://app.example.com"
            - name: KREUZBERG_MAX_UPLOAD_SIZE_MB
              value: "500"
          args: ["serve", "--host", "0.0.0.0", "--port", "8000"]
          resources:
            requests:
              memory: "1Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 15
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 30
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: cache
              mountPath: /app/.kreuzberg
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: kreuzberg-cache
        - name: tmp
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: kreuzberg-api
  namespace: kreuzberg
spec:
  type: LoadBalancer
  selector:
    app: kreuzberg
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kreuzberg-pdb
  namespace: kreuzberg
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: kreuzberg
```
```bash title="Terminal"
kubectl apply -f production-deployment.yaml
```
!!! Note "Model Persistence"
Embedding models download on first use (~90 MB to 1.2 GB). Use a PVC for `/app/.kreuzberg` to avoid re-downloading on pod restart. Outside containers, models are cached in the platform-specific global cache directory (for example, `~/.cache/kreuzberg/` on Linux, `~/Library/Caches/kreuzberg/` on macOS).
## High Availability
Add pod anti-affinity and rolling update strategy:
```yaml title="ha-additions.yaml"
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: [kreuzberg]
                topologyKey: kubernetes.io/hostname
```
## Troubleshooting
??? Question "OCR silently failing"
Verify `TESSDATA_PREFIX` is set and tessdata files exist:
```bash title="Terminal"
kubectl exec -it deployment/kreuzberg-api -- printenv TESSDATA_PREFIX
kubectl exec -it deployment/kreuzberg-api -- ls /usr/share/tesseract-ocr/5/tessdata/
```
??? Question "Permission denied on cache directory"
Use an init container or `fsGroup` (see [Permissions](#permissions)).
??? Question "OOMKilled"
Increase memory limits. Reduce OCR resource usage with `KREUZBERG_PDF_DPI=150` and single-language OCR.
??? Question "Startup probe timeout"
Increase `failureThreshold` on the startup probe (e.g., `60` for 10-minute timeout).
??? Question "Language not found"
Check installed languages with `kubectl exec -it deployment/kreuzberg-api -- tesseract --list-langs`. Mount custom tessdata via ConfigMap or PVC.
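The startup-probe numbers above follow from simple arithmetic: the container gets roughly `failureThreshold` times `periodSeconds` (plus any initial delay) before Kubernetes restarts it. The helper below is a hypothetical illustration of that calculation.

```python
def startup_budget_seconds(
    failure_threshold: int,
    period_seconds: int,
    initial_delay_seconds: int = 0,
) -> int:
    """Approximate time a container has to pass its startup probe."""
    return initial_delay_seconds + failure_threshold * period_seconds


current_budget = startup_budget_seconds(failure_threshold=30, period_seconds=10)
extended_budget = startup_budget_seconds(failure_threshold=60, period_seconds=10)
```

With the manifests in this guide, the default budget is 300 seconds (5 minutes), and raising `failureThreshold` to `60` extends it to 600 seconds (10 minutes).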
### Diagnostic Commands
```bash title="Terminal"
kubectl logs deployment/kreuzberg-api --tail=200
kubectl describe deployment kreuzberg-api
kubectl get events -n kreuzberg
kubectl exec -it deployment/kreuzberg-api -- env | sort
# port-forward blocks, so run it in the background; the Service listens on port 80
kubectl port-forward service/kreuzberg-api 8000:80 &
curl http://localhost:8000/health
```
## Next Steps
- [Docker Deployment](docker.md) — container configuration and image variants
- [API Server Guide](api-server.md) — endpoint documentation
- [OCR Guide](ocr.md) — backend installation and language setup
- [Configuration](configuration.md) — all configuration options


@@ -0,0 +1,242 @@
# Layout Detection <span class="version-badge">v4.5.0</span>
Detect document layout regions (tables, figures, headers, text blocks, etc.) in PDFs using ONNX-based deep learning models. Enables table extraction, figure isolation, reading-order reconstruction, and selective OCR.
!!! Note "Feature gate"
    Requires the `layout-detection` Cargo feature. Not included in the default feature set.
## Model
Layout detection uses the **RT-DETR v2** model, an ONNX-based deep learning model that detects 17 layout element classes: text blocks, tables, figures, headers, footers, captions, code, lists, sections, formulas, footnotes, page headers/footers, titles, checkboxes, key-value regions, and document indices.
### When to Enable
**Recommended for:** complex multi-column PDFs, scanned documents, academic papers, business forms, and any document where layout understanding improves extraction accuracy.
**Less beneficial for:** simple single-column text documents, high-throughput pipelines where latency is critical (consider GPU acceleration), or documents already well-handled by PDF structure trees.
### Performance Impact
| Pipeline | Structure F1 | Text F1 | Avg time/doc |
| -------- | ------------ | ------- | ------------ |
| Baseline | 33.9% | 87.4% | 447 ms |
| Layout | 41.1% | 90.1% | 1500 ms |
_171-document PDF corpus, CPU only. GPU acceleration significantly reduces the time penalty._
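To put the trade-off in the table above into relative terms, the figures can be reduced to a slowdown factor and absolute F1 gains. This is only arithmetic on the numbers reported above, not a new measurement:

```python
# Figures copied from the benchmark table (171-document PDF corpus, CPU only).
baseline = {"structure_f1": 33.9, "text_f1": 87.4, "ms_per_doc": 447}
layout = {"structure_f1": 41.1, "text_f1": 90.1, "ms_per_doc": 1500}

# Relative time cost of enabling layout detection.
slowdown = layout["ms_per_doc"] / baseline["ms_per_doc"]

# Absolute gains in percentage points.
structure_gain = layout["structure_f1"] - baseline["structure_f1"]
text_gain = layout["text_f1"] - baseline["text_f1"]

print(f"slowdown: {slowdown:.2f}x")
print(f"structure F1 gain: {structure_gain:.1f} points")
print(f"text F1 gain: {text_gain:.1f} points")
```

On this corpus, layout detection costs roughly 3.4x the per-document time in exchange for about 7 points of structure F1 and about 3 points of text F1.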
!!! Note "Layout Detection Model"
    Kreuzberg uses only the RT-DETR v2 model for layout detection. The `preset` field is not available in `LayoutDetectionConfig`. Configure table structure recognition separately via `table_model` (see "Table Structure Models" below).
## Configuration
=== "Python"
```python
from kreuzberg import ExtractionConfig, LayoutDetectionConfig, extract_file
config = ExtractionConfig(
layout=LayoutDetectionConfig(
confidence_threshold=0.5,
apply_heuristics=True,
table_model="tatr",
)
)
result = await extract_file("document.pdf", config=config)
```
=== "TypeScript"
```typescript
const result = await extract("document.pdf", {
layout: {
confidenceThreshold: 0.5,
applyHeuristics: true,
tableModel: "tatr",
},
});
```
=== "Rust"
```rust
use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig};
let config = ExtractionConfig {
layout: Some(LayoutDetectionConfig {
confidence_threshold: Some(0.5),
apply_heuristics: true,
table_model: Some("tatr".to_string()),
..Default::default()
}),
..Default::default()
};
```
=== "TOML"
```toml title="kreuzberg.toml"
[layout]
apply_heuristics = true
# table_model = "tatr"
```
=== "CLI"
```bash title="Terminal"
# Enable layout detection with default settings
kreuzberg extract document.pdf --layout --content-format markdown
# Custom confidence threshold
kreuzberg extract document.pdf --layout-confidence 0.5 --content-format markdown
# Specific table model
kreuzberg extract document.pdf --layout --layout-table-model slanet_wired
# Combined with GPU acceleration
kreuzberg extract document.pdf --layout --acceleration coreml
```
See [LayoutDetectionConfig](../reference/configuration.md#layoutdetectionconfig) for all fields.
## Table Structure Models <span class="version-badge">v4.5.3</span>
When layout detection identifies a table region, a table structure model analyzes rows, columns, headers, and spanning cells. Set `LayoutDetectionConfig.table_model` to one of:
| Value | Notes |
| ----------------- | ----------------------------------------------------------- |
| `tatr` | Default. Fast (~30 MB). General-purpose. |
| `slanet_wired` | Higher accuracy for bordered/gridlined tables (~365 MB). |
| `slanet_wireless` | Higher accuracy for borderless tables (~365 MB). |
| `slanet_auto` | Auto-classifies per page (~737 MB). Slowest. |
| `slanet_plus` | Smallest (~7.78 MB). For resource-constrained environments. |
| `disabled` | Skip table structure recognition. |
!!! Note "Model Download"
    SLANeXT models are not downloaded by default. Use `cache warm --all-table-models` to pre-download, or they download automatically on first use.
## GPU Acceleration
Layout detection uses ONNX Runtime with automatic provider selection:
| Provider | Platform | Notes |
| -------- | -------------- | ----------------------------- |
| CPU | All | Default, no setup needed |
| CUDA | Linux, Windows | Requires CUDA toolkit + cuDNN |
| CoreML | macOS | Automatic on Apple Silicon |
| TensorRT | Linux | Requires TensorRT |
To override:
```python
from kreuzberg import AccelerationConfig, ExtractionConfig, LayoutDetectionConfig

config = ExtractionConfig(
    layout=LayoutDetectionConfig(),
    acceleration=AccelerationConfig(provider="cuda", device_id=0),
)
```
See [AccelerationConfig reference](../reference/configuration.md#accelerationconfig) for details.
## Layout Classes
The RT-DETR v2 model detects 17 classes. Each `LayoutRegion.class_name` is one of:
`caption`, `footnote`, `formula`, `list_item`, `page_footer`, `page_header`, `picture`, `section_header`, `table`, `text`, `title`, `document_index`, `code`, `checkbox_selected`, `checkbox_unselected`, `form`, `key_value_region`.
See [`LayoutRegion`](../reference/types.md#layoutregion) in the types reference for the full field shape.
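As a rough illustration of working with these class names, the sketch below tallies regions per class and rejects names outside the list above. It uses plain dictionaries as a hypothetical stand-in for `LayoutRegion` objects; `count_by_class` is an illustrative helper, not part of the Kreuzberg API.

```python
from collections import Counter

# The 17 class names listed above.
LAYOUT_CLASSES = frozenset({
    "caption", "footnote", "formula", "list_item", "page_footer",
    "page_header", "picture", "section_header", "table", "text", "title",
    "document_index", "code", "checkbox_selected", "checkbox_unselected",
    "form", "key_value_region",
})


def count_by_class(regions):
    """Count regions per class name, rejecting names outside the known set."""
    counts = Counter()
    for region in regions:
        name = region["class_name"]
        if name not in LAYOUT_CLASSES:
            raise ValueError(f"unknown layout class: {name}")
        counts[name] += 1
    return counts


# Hypothetical sample data standing in for one page's detected regions.
sample = [{"class_name": "table"}, {"class_name": "text"}, {"class_name": "text"}]
counts = count_by_class(sample)
print(dict(counts))
```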
## Accessing Layout Regions
When layout detection is enabled AND page extraction is enabled, each page in the result includes `layout_regions` — a list of detected regions with class, confidence score, bounding box, and area fraction. This enables programmatic filtering and analysis of specific layout elements.
=== "Python"
```python
from kreuzberg import extract_file, ExtractionConfig, LayoutDetectionConfig, PagesConfig
result = await extract_file(
    "document.pdf",
    config=ExtractionConfig(
        layout=LayoutDetectionConfig(),
        pages=PagesConfig(extract_pages=True),
    ),
)
for page in result.pages:
    if page.layout_regions:
        for region in page.layout_regions:
            if region.class_name == "picture" and region.confidence > 0.9:
                print(f"Page {page.page_number}: diagram detected "
                      f"(confidence={region.confidence:.2f}, "
                      f"area={region.area_fraction:.0%})")
```
=== "TypeScript"
```typescript
const result = await extract("document.pdf", {
layout: {},
pages: { extractPages: true },
});
for (const page of result.pages ?? []) {
if (page.layoutRegions) {
for (const region of page.layoutRegions) {
if (region.className === "picture" && region.confidence > 0.9) {
console.log(
`Page ${page.pageNumber}: diagram detected ` +
`(confidence=${region.confidence.toFixed(2)}, ` +
`area=${(region.areaFraction * 100).toFixed(0)}%)`
);
}
}
}
}
```
=== "Rust"
```rust
use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig, PagesConfig};
let result = extract_file(
"document.pdf",
ExtractionConfig {
layout: Some(LayoutDetectionConfig::default()),
pages: Some(PagesConfig {
extract_pages: true,
..Default::default()
}),
..Default::default()
},
).await?;
for page in &result.pages {
if let Some(regions) = &page.layout_regions {
for region in regions {
if region.class_name == "picture" && region.confidence > 0.9 {
println!(
"Page {}: diagram detected (confidence={:.2}, area={:.0}%)",
page.page_number,
region.confidence,
region.area_fraction * 100.0
);
}
}
}
}
```
### Tips
- Use `confidence` to filter low-confidence detections; a threshold of 0.8 to 0.9 is typical for downstream operations
- Use `area_fraction` to distinguish between inline images and full-page diagrams (e.g., `area_fraction > 0.1` for significant figures)
- Regions are attached to page results, so enable page extraction together with layout detection to access both content and layout structure
- Available across all bindings (Python, TypeScript, Rust, Ruby, Java, Go, Elixir, C#, PHP)
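The first two tips can be combined into a single filter. The sketch below assumes a small `Region` dataclass as a hypothetical stand-in for `LayoutRegion` (only the three fields used here), and the default thresholds are the illustrative values from the tips rather than library defaults.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical stand-in for LayoutRegion with only the fields used here."""
    class_name: str
    confidence: float
    area_fraction: float


def significant_figures(regions, min_confidence=0.8, min_area=0.1):
    """Keep confident 'picture' regions that cover a meaningful share of the page."""
    return [
        r for r in regions
        if r.class_name == "picture"
        and r.confidence >= min_confidence
        and r.area_fraction > min_area
    ]


regions = [
    Region("picture", 0.95, 0.40),  # large, confident: kept
    Region("picture", 0.95, 0.02),  # inline icon: dropped by area
    Region("picture", 0.55, 0.40),  # low confidence: dropped
    Region("table", 0.99, 0.30),    # wrong class: dropped
]
kept = significant_figures(regions)
print(len(kept))
```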
## Acknowledgments
- **[Docling](https://github.com/DS4SD/docling)** — RT-DETR v2 model and layout classification approach
- **[TATR](https://github.com/microsoft/table-transformer)** — Table structure recognition with ONNX
- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)** — SLANeXT table structure and PP-LCNet classifier models
## Related
- [Configuration Reference](../reference/configuration.md#layoutdetectionconfig) — full field reference
- [Element-Based Output](output-formats.md#element-based-output-v410) — using layout-aware results


@@ -0,0 +1,411 @@
# LLM Integration <span class="version-badge">v4.8.0</span>
Kreuzberg integrates with 143 LLM providers (including local inference engines) via [liter-llm](https://github.com/kreuzberg-dev/liter-llm) for three capabilities: VLM OCR, structured extraction, and provider-hosted embeddings.
!!! Note "Feature gate"
    Requires the `liter-llm` Cargo feature. Not included in the default feature set.
## VLM OCR
Use vision-language models as an OCR backend by rendering document pages as images and sending them to the VLM for text extraction.
### When to Use
- Low-quality scanned documents where traditional OCR struggles
- Handwritten text recognition
- Arabic, Farsi, and other scripts with poor Tesseract/PaddleOCR support
- Complex layouts where traditional OCR fails (mixed tables, forms, diagrams)
- When you need higher accuracy and can accept higher latency and API costs
### Configuration
=== "Python"
--8<-- "snippets/python/llm/vlm_ocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/llm/vlm_ocr.md"
=== "Rust"
```rust title="Rust"
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};
let config = ExtractionConfig {
force_ocr: true,
ocr: Some(OcrConfig {
backend: "vlm".to_string(),
vlm_config: Some(LlmConfig {
model: "openai/gpt-4o-mini".to_string(),
..Default::default()
}),
..Default::default()
}),
..Default::default()
};
let result = extract_file("scan.pdf", None, &config).await?;
```
=== "CLI"
```bash title="Terminal"
kreuzberg extract scan.pdf --force-ocr true \
--vlm-model openai/gpt-4o-mini
```
=== "TOML"
```toml title="kreuzberg.toml"
force_ocr = true
[ocr]
backend = "vlm"
[ocr.vlm_config]
model = "openai/gpt-4o-mini"
```
=== "Environment Variables"
```bash title="Terminal"
export KREUZBERG_VLM_OCR_MODEL=openai/gpt-4o-mini
export OPENAI_API_KEY=sk-...
```
### Custom VLM Prompt
Override the default prompt template for VLM OCR:
```python title="Python"
from kreuzberg import ExtractionConfig, OcrConfig, LlmConfig
config = ExtractionConfig(
force_ocr=True,
ocr=OcrConfig(
backend="vlm",
vlm_config=LlmConfig(model="openai/gpt-4o-mini"),
vlm_prompt="Extract all text from this document image. Preserve formatting.",
),
)
```
### Supported Providers
Any liter-llm vision-capable provider works as a VLM OCR backend:
| Provider | Example Model |
| ----------------- | -------------------------------------- |
| OpenAI | `openai/gpt-4o`, `openai/gpt-4o-mini` |
| Anthropic | `anthropic/claude-3-5-sonnet-20241022` |
| Google | `google/gemini-2.0-flash` |
| Groq | `groq/llama-3.2-90b-vision-preview` |
| Ollama (local) | `ollama/llama3.2-vision` |
| LM Studio (local) | `lmstudio/llava-1.5` |
| vLLM (local) | `vllm/llava-next` |
## Structured Extraction
Extract structured JSON data from documents by providing a schema; the extracted document text is sent to an LLM, which returns JSON conforming to that schema.
### Basic Usage
=== "Python"
--8<-- "snippets/python/llm/structured_extraction.md"
=== "TypeScript"
--8<-- "snippets/typescript/llm/structured_extraction.md"
=== "Rust"
--8<-- "snippets/rust/llm/structured_extraction.md"
=== "CLI"
```bash title="Terminal"
kreuzberg extract-structured paper.pdf \
--schema schema.json \
--model openai/gpt-4o-mini \
--strict
```
=== "TOML"
```toml title="kreuzberg.toml"
[structured_extraction]
schema_name = "paper_metadata"
strict = true
[structured_extraction.schema]
type = "object"
[structured_extraction.schema.properties.title]
type = "string"
[structured_extraction.schema.properties.date]
type = "string"
[structured_extraction.llm]
model = "openai/gpt-4o-mini"
```
### Custom Prompts (Jinja2)
Override the default extraction prompt with a Jinja2 template:
```python title="Python"
from kreuzberg import ExtractionConfig, StructuredExtractionConfig, LlmConfig
config = ExtractionConfig(
structured_extraction=StructuredExtractionConfig(
schema={"type": "object", "properties": {"title": {"type": "string"}}},
llm=LlmConfig(model="openai/gpt-4o-mini"),
prompt=(
"Analyze this document and extract key metadata.\n\n"
"Document:\n{{ content }}\n\n"
"Schema: {{ schema }}"
),
),
)
```
Available template variables:
| Variable | Description |
| -------------------------- | ----------------------------------------- |
| `{{ content }}` | The extracted document text |
| `{{ schema }}` | The JSON schema as a formatted string |
| `{{ schema_name }}` | The schema name (default: `"extraction"`) |
| `{{ schema_description }}` | The schema description (may be empty) |
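To see how the variables in the table above end up in the final prompt, the sketch below substitutes them with a small regular expression. This is a simplified stand-in for illustration only: Kreuzberg renders the template with Jinja2, which supports far more than plain placeholders, and `render_prompt` is a hypothetical helper.

```python
import json
import re


def render_prompt(template, variables):
    """Replace {{ name }} placeholders; unknown names are left untouched."""
    def substitute(match):
        name = match.group(1)
        return str(variables.get(name, match.group(0)))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


schema = {"type": "object", "properties": {"title": {"type": "string"}}}
prompt = render_prompt(
    "Document:\n{{ content }}\n\nSchema ({{ schema_name }}): {{ schema }}",
    {
        "content": "Quarterly report",           # the extracted document text
        "schema": json.dumps(schema, indent=2),  # the schema as a formatted string
        "schema_name": "extraction",             # the documented default name
        "schema_description": "",                # may be empty
    },
)
print(prompt)
```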
### Cross-Provider Compatibility
Structured extraction handles provider differences automatically:
- **OpenAI**: Full strict mode with `additionalProperties` enforcement
- **Anthropic/Gemini**: `additionalProperties` automatically stripped (not supported by these providers)
- **All providers**: Markdown code fence wrapping in responses is automatically handled
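The two normalizations described above can be sketched as follows. This is an illustration of the idea under simple assumptions, not Kreuzberg's actual implementation; both helper names are hypothetical.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker, built here to avoid a literal fence


def strip_additional_properties(schema):
    """Recursively drop every 'additionalProperties' key from a JSON schema.

    Caveat of this sketch: a schema property that is itself named
    'additionalProperties' would also be dropped.
    """
    if isinstance(schema, dict):
        return {
            key: strip_additional_properties(value)
            for key, value in schema.items()
            if key != "additionalProperties"
        }
    if isinstance(schema, list):
        return [strip_additional_properties(item) for item in schema]
    return schema


def unwrap_code_fence(text):
    """Remove a surrounding Markdown code fence from a multi-line LLM response."""
    stripped = text.strip()
    lines = stripped.splitlines()
    if len(lines) >= 2 and lines[0].startswith(FENCE) and lines[-1] == FENCE:
        return "\n".join(lines[1:-1]).strip()
    return stripped


schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {"author": {"type": "object", "additionalProperties": False}},
}
cleaned = strip_additional_properties(schema)
payload = json.loads(unwrap_code_fence(FENCE + 'json\n{"title": "Report"}\n' + FENCE))
print(cleaned, payload)
```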
### Strict Mode
When `strict=True`, the LLM is instructed to produce output that exactly matches the schema. This enables OpenAI's structured output mode and adds validation on the response.
## VLM Embeddings
Use provider-hosted embedding models when you need to match the model your vector database already uses, or when local ONNX models are unavailable.
### Configuration
=== "Python"
--8<-- "snippets/python/llm/vlm_embeddings.md"
=== "TypeScript"
```typescript title="TypeScript"
import { embedSync } from '@kreuzberg/node';
const embeddings = embedSync(['Hello world'], {
model: {
modelType: 'llm',
value: 'openai/text-embedding-3-small',
},
normalize: true,
});
console.log(embeddings[0].length); // 1536
```
=== "Rust"
```rust title="Rust"
use kreuzberg::{embed_texts, EmbeddingConfig, EmbeddingModelType, LlmConfig};
let config = EmbeddingConfig {
model: EmbeddingModelType::Llm {
llm: LlmConfig {
model: "openai/text-embedding-3-small".to_string(),
..Default::default()
},
},
normalize: true,
..Default::default()
};
let embeddings = embed_texts(&["Hello world"], &config)?;
```
=== "CLI"
```bash title="Terminal"
kreuzberg embed \
--provider llm \
--model openai/text-embedding-3-small \
--text "Hello world"
```
### Available Models
| Model | Dimensions | Provider |
| ---------------------------------------- | ---------- | -------- |
| `openai/text-embedding-3-small` | 1536 | OpenAI |
| `openai/text-embedding-3-large` | 3072 | OpenAI |
| `mistral/mistral-embed` | 1024 | Mistral |
| Any liter-llm embedding-capable provider | Varies | Various |
## Local LLM Support
<span class="version-badge">v4.8.0</span>
Run local LLM inference engines via [liter-llm](https://github.com/kreuzberg-dev/liter-llm)'s provider routing; point to your local server without needing an API key.
### Supported Local Engines
| Engine | Prefix | Default URL | Install |
| ------------------------------------------------------ | ------------ | --------------------------- | --------------------- |
| [Ollama](https://ollama.com) | `ollama/` | `http://localhost:11434/v1` | `brew install ollama` |
| [LM Studio](https://lmstudio.ai) | `lmstudio/` | `http://localhost:1234/v1` | Desktop app |
| [vLLM](https://vllm.ai) | `vllm/` | `http://localhost:8000/v1` | `pip install vllm` |
| [llama.cpp](https://github.com/ggerganov/llama.cpp) | `llamacpp/` | `http://localhost:8080/v1` | Build from source |
| [LocalAI](https://localai.io) | `localai/` | `http://localhost:8080/v1` | Docker |
| [llamafile](https://github.com/Mozilla-Ocho/llamafile) | `llamafile/` | `http://localhost:8080/v1` | Single binary |
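The prefix-to-URL mapping in the table above can be expressed as a simple lookup. The sketch below only mirrors the documented defaults; the actual routing is done by liter-llm, and `default_base_url` is a hypothetical helper name.

```python
# Default endpoints copied from the table above.
LOCAL_ENGINE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "vllm": "http://localhost:8000/v1",
    "llamacpp": "http://localhost:8080/v1",
    "localai": "http://localhost:8080/v1",
    "llamafile": "http://localhost:8080/v1",
}


def default_base_url(model):
    """Return the documented default URL for a local engine, or None otherwise."""
    prefix, _, _ = model.partition("/")
    return LOCAL_ENGINE_URLS.get(prefix)


print(default_base_url("ollama/llama3.2-vision"))
print(default_base_url("openai/gpt-4o"))
```

Note that llama.cpp, LocalAI, and llamafile share port 8080 by default, so running more than one of them side by side requires a custom `base_url`.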
### Example: Ollama
=== "CLI"
```bash
# Start Ollama and pull a model
ollama pull llama3.2-vision
# Use it for VLM OCR (no API key needed)
kreuzberg extract scan.pdf --force-ocr true \
--vlm-model ollama/llama3.2-vision
# Use it for structured extraction
kreuzberg extract-structured doc.pdf \
--schema schema.json \
--model ollama/llama3.2
# Use it for embeddings
kreuzberg embed --provider llm \
--model ollama/all-minilm \
--text "Hello world"
```
=== "Python"
```python
from kreuzberg import extract_file, ExtractionConfig, StructuredExtractionConfig, LlmConfig
config = ExtractionConfig(
structured_extraction=StructuredExtractionConfig(
schema={"type": "object", "properties": {"title": {"type": "string"}}},
llm=LlmConfig(model="ollama/llama3.2"), # No api_key needed
),
)
result = await extract_file("doc.pdf", config=config)
```
=== "TOML Config"
```toml
[structured_extraction.llm]
model = "ollama/llama3.2"
# No api_key needed for local providers
```
!!! Tip "Custom Base URL"
    If your local server runs on a non-default port, use `base_url`:

    ```python
    LlmConfig(model="ollama/llama3.2", base_url="http://localhost:11435/v1")
    ```
## LLM Usage Tracking
Every LLM call made during extraction is tracked in the `llm_usage` field of `ExtractionResult`. Each entry records the model used, token counts, estimated cost, and why the model stopped generating.
=== "Python"
```python
result = await extract_file("document.pdf", config=config)
if result.get("llm_usage"):
for usage in result["llm_usage"]:
print(f"{usage['source']}: {usage['input_tokens']} in, {usage['output_tokens']} out, ${usage['estimated_cost']:.4f}")
```
=== "TypeScript"
```typescript
const result = await extractFile("document.pdf", config);
for (const usage of result.llmUsage ?? []) {
console.log(`${usage.source}: ${usage.inputTokens} in, ${usage.outputTokens} out, $${usage.estimatedCost?.toFixed(4)}`);
}
```
=== "Rust"
```rust
let result = extract_file("document.pdf", None, &config).await?;
if let Some(usages) = &result.llm_usage {
for usage in usages {
println!("{}: {} in, {} out", usage.source, usage.input_tokens.unwrap_or(0), usage.output_tokens.unwrap_or(0));
}
}
```
The `source` field indicates which pipeline stage triggered the call: `"vlm_ocr"`, `"structured_extraction"`, or `"embeddings"`.
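Because each entry carries its `source`, usage can be rolled up per pipeline stage. The sketch below works on plain dictionaries with the field names used in the Python example above; the sample values are invented for illustration and `summarize_usage` is a hypothetical helper.

```python
from collections import defaultdict


def summarize_usage(entries):
    """Sum token counts and estimated cost per pipeline stage."""
    totals = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "estimated_cost": 0.0}
    )
    for entry in entries:
        bucket = totals[entry["source"]]
        # Treat missing or None values as zero.
        bucket["input_tokens"] += entry.get("input_tokens") or 0
        bucket["output_tokens"] += entry.get("output_tokens") or 0
        bucket["estimated_cost"] += entry.get("estimated_cost") or 0.0
    return dict(totals)


# Invented sample entries shaped like the llm_usage records described above.
usage = [
    {"source": "vlm_ocr", "input_tokens": 1200, "output_tokens": 300, "estimated_cost": 0.0021},
    {"source": "vlm_ocr", "input_tokens": 900, "output_tokens": 250, "estimated_cost": 0.0004},
    {"source": "structured_extraction", "input_tokens": 500, "output_tokens": 80, "estimated_cost": None},
]
summary = summarize_usage(usage)
for source, totals in summary.items():
    print(source, totals)
```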
## API Key Configuration
API keys can be set via (in order of precedence):
1. `api_key` field in `LlmConfig` — highest priority, per-request
2. Provider standard env vars (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc.)
3. Kreuzberg-specific env var (`KREUZBERG_LLM_API_KEY`) — used as fallback for any provider
!!! Note "Local providers skip API key lookup"
    Local inference engines (Ollama, LM Studio, vLLM, llama.cpp, LocalAI, llamafile) do not require an API key. If you use a local provider prefix (for example, `ollama/`), the API key fields are ignored.
```python title="Python"
from kreuzberg import LlmConfig
# Explicit API key
config = LlmConfig(model="openai/gpt-4o", api_key="sk-...")
# Custom base URL (e.g., Azure OpenAI, local proxy)
config = LlmConfig(
model="openai/gpt-4o",
base_url="https://my-proxy.example.com/v1",
)
```
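The precedence order described in this section can be sketched as a small resolver. This is an illustration of the documented rules, not Kreuzberg's internal code: `resolve_api_key` is a hypothetical helper, and the provider-to-variable map covers only the three variables named above.

```python
import os

# Prefixes of local engines, which skip API key lookup entirely.
LOCAL_PREFIXES = {"ollama", "lmstudio", "vllm", "llamacpp", "localai", "llamafile"}

# Provider-standard environment variables named in this section.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def resolve_api_key(model, api_key=None, env=None):
    """Apply the documented precedence: explicit key, provider env var, fallback."""
    env = os.environ if env is None else env
    provider = model.partition("/")[0]
    if provider in LOCAL_PREFIXES:
        return None  # local engines ignore API key fields
    if api_key:
        return api_key  # 1. explicit LlmConfig.api_key
    provider_var = PROVIDER_ENV_VARS.get(provider)
    if provider_var and env.get(provider_var):
        return env[provider_var]  # 2. provider-standard env var
    return env.get("KREUZBERG_LLM_API_KEY")  # 3. Kreuzberg-wide fallback


env = {"OPENAI_API_KEY": "sk-provider", "KREUZBERG_LLM_API_KEY": "sk-fallback"}
print(resolve_api_key("openai/gpt-4o", env=env))
print(resolve_api_key("mistral/mistral-embed", env=env))
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.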
## LlmConfig Reference
| Field | Type | Default | Description |
| -------------- | --------------- | ---------- | ------------------------------------------------------------------- |
| `model` | `str` | _required_ | Provider/model in liter-llm format (for example, `"openai/gpt-4o"`) |
| `api_key` | `str \| None` | `None` | API key (falls back to env vars) |
| `base_url` | `str \| None` | `None` | Custom endpoint URL |
| `timeout_secs` | `int \| None` | `60` | Request timeout in seconds |
| `max_retries` | `int \| None` | `3` | Maximum retry attempts |
| `temperature` | `float \| None` | `None` | Sampling temperature |
| `max_tokens` | `int \| None` | `None` | Maximum tokens to generate |
## REST API
### Structured Extraction
`POST /extract-structured` — multipart form with file + schema + model configuration.
```bash title="Terminal"
curl -X POST http://localhost:4000/extract-structured \
-F "file=@invoice.pdf" \
-F 'schema={"type":"object","properties":{"vendor":{"type":"string"},"total":{"type":"number"}}}' \
-F "model=openai/gpt-4o-mini" \
-F "strict=true"
```
## MCP Tools
When running Kreuzberg as an MCP server, LLM features are available as tools:
- `extract_structured` — extract structured data from a document using a JSON schema
- `embed_text` — extended with `model` parameter for LLM-hosted embeddings
## Related
- [OCR](ocr.md) — OCR backends including VLM OCR
- [Configuration Reference](../reference/configuration.md) — full field reference for all config types
- [Advanced Features](advanced.md) — chunking, language detection, local embeddings
- [API Server](api-server.md) — REST API endpoints


@@ -0,0 +1,191 @@
# MCP Integration <span class="version-badge">v5.0.0</span>
Kreuzberg speaks [Model Context Protocol](https://modelcontextprotocol.io/). That means any AI agent — Claude, Cursor, a custom LangChain pipeline — can extract documents, generate embeddings, and manage caches through a standard tool interface without writing extraction code.
Two commands to get started:
```bash title="Terminal"
pip install "kreuzberg[all]"
kreuzberg mcp
```
That's it. You now have an MCP server running over stdio, ready for any compatible client.
---
## How It Works
The MCP server wraps Kreuzberg's extraction engine behind standard tools, running as a child process over stdin/stdout with JSON-RPC messages — no HTTP ports or configuration needed.
```mermaid
flowchart LR
A["AI Agent\n(Claude, Cursor, etc.)"] -->|"JSON-RPC\nover stdio"| B["kreuzberg mcp"]
B --> C["Extraction Engine"]
B --> D["Embedding Engine"]
B --> E["Cache Layer"]
```
---
## Server Modes
### Stdio (Default)
The standard mode for local AI tools. The agent spawns `kreuzberg mcp` as a subprocess and communicates over pipes.
```bash title="Terminal"
kreuzberg mcp
kreuzberg mcp --config kreuzberg.toml
```
This is what Claude Desktop, Cursor, and most MCP clients expect.
### HTTP Transport
!!! Info "Feature flag: `mcp-http`"
    HTTP transport requires the `mcp-http` feature flag at build time.
For remote deployments or multi-client setups where stdio doesn't work — shared servers, team environments, cloud-hosted agents — HTTP transport exposes the same tool interface over the network.
---
## Tools
Kreuzberg exposes 13 tools via MCP. All extraction tools accept an optional `config` object to override defaults:
**Extraction:** `extract_file`, `extract_bytes`, `batch_extract_files`, `detect_mime_type`, `extract_structured`
**Embeddings:** `embed_text`
**Chunking:** `chunk_text`
**Cache:** `cache_stats`, `cache_clear`, `cache_manifest`, `cache_warm`
**Metadata:** `list_formats`, `get_version`
`extract_structured` requires the server to be built with the `liter-llm` feature. Full parameter schemas are discoverable at runtime via the MCP client's `list_tools` call.
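For orientation, a tool invocation on the wire is a JSON-RPC 2.0 request with the method `tools/call`, as defined by the MCP specification. The sketch below builds such a message for `extract_file`; the `path` argument is taken from the Python client example in this guide, and `tools_call_request` is a hypothetical helper. In practice an MCP client library constructs these messages for you.

```python
import json


def tools_call_request(request_id, tool, arguments):
    """Build an MCP 'tools/call' JSON-RPC 2.0 request as a dictionary."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


message = tools_call_request(1, "extract_file", {"path": "document.pdf"})
# Over stdio, each message is sent as a single line of JSON.
line = json.dumps(message)
print(line)
```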
---
## Connecting AI Tools
### Claude Desktop
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
```json title="claude_desktop_config.json"
{
"mcpServers": {
"kreuzberg": {
"command": "kreuzberg",
"args": ["mcp"]
}
}
}
```
Restart Claude. Kreuzberg's tools appear automatically — ask Claude to "extract text from invoice.pdf" and it will call `extract_file` behind the scenes.
### Cursor
Add to `.cursor/mcp.json` in your project root:
```json title=".cursor/mcp.json"
{
"mcpServers": {
"kreuzberg": {
"command": "kreuzberg",
"args": ["mcp"]
}
}
}
```
### Python MCP Client
For building custom agent pipelines, use the official `mcp` Python SDK:
```python title="mcp_client.py"
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
async def main() -> None:
    server_params = StdioServerParameters(
        command="kreuzberg", args=["mcp"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print(f"Available: {[t.name for t in tools.tools]}")
            result = await session.call_tool(
                "extract_file",
                arguments={"path": "document.pdf"},
            )
            print(result)

asyncio.run(main())
```
### Spawning from Python
If your application manages the server lifecycle directly:
```python title="spawn_server.py"
import subprocess
process = subprocess.Popen(
    ["python", "-m", "kreuzberg", "mcp"],
    stdin=subprocess.PIPE,  # the stdio transport reads JSON-RPC requests from stdin
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
print(f"MCP server running (PID {process.pid})")
```
---
## Configuration
Pass a TOML config file to set extraction defaults for all tools:
```bash title="Terminal"
kreuzberg mcp --config kreuzberg.toml
```
Individual tool calls override file defaults via a `config` parameter. See [ExtractionConfig Reference](../reference/configuration.md) for all available fields.
---
## Running in Docker
```bash title="Terminal"
# -i keeps stdin open, which the stdio transport needs
docker run -i ghcr.io/kreuzberg-dev/kreuzberg:latest mcp
docker run -i \
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
ghcr.io/kreuzberg-dev/kreuzberg:latest \
mcp --config /config/kreuzberg.toml
```
For production, use Compose with a persistent cache volume so embedding models don't re-download on restart:
```yaml title="docker-compose.yaml"
services:
  kreuzberg-mcp:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest
    command: mcp --config /config/kreuzberg.toml
    volumes:
      - ./kreuzberg.toml:/config/kreuzberg.toml:ro
      - cache-data:/app/.kreuzberg
    restart: unless-stopped
volumes:
  cache-data:
```
---
## What to Read Next
- [API Server Guide](api-server.md) — the HTTP REST API and detailed MCP tool reference
- [Docker Deployment](docker.md) — container setup for all server modes
- [Configuration Reference](../reference/configuration.md) — every config option explained

492
docs/guides/ocr.md Normal file

@@ -0,0 +1,492 @@
# OCR (Optical Character Recognition)
Extract text from images and scanned PDFs. Kreuzberg automatically determines when OCR is needed — images always require it, scanned PDFs trigger it per-page, and hybrid PDFs only OCR the pages that lack a text layer. Set `force_ocr=True` to OCR all pages regardless.
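The decision rule in the paragraph above can be sketched as a small function. This is a simplified model of the described behavior, not Kreuzberg's internal logic; `pages_to_ocr` and its parameters are hypothetical names.

```python
def pages_to_ocr(is_image, page_has_text_layer, force_ocr=False):
    """Return the indices of pages that would be sent to OCR.

    is_image: True when the input is an image file rather than a PDF.
    page_has_text_layer: one boolean per page, True if the page already has text.
    """
    if force_ocr or is_image:
        # Images always need OCR; force_ocr overrides any existing text layer.
        return list(range(len(page_has_text_layer)))
    # Otherwise only pages lacking a text layer are OCRed.
    return [i for i, has_text in enumerate(page_has_text_layer) if not has_text]


hybrid_pdf = [True, False, True, False]  # pages 1 and 3 are scans
print(pages_to_ocr(False, hybrid_pdf))
print(pages_to_ocr(False, hybrid_pdf, force_ocr=True))
```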
## Backend Comparison
Four OCR backends — pick based on platform, accuracy needs, and language coverage.
| | **Tesseract** | **PaddleOCR** | **EasyOCR** | **VLM** |
| ---------------- | -------------------- | ----------------------------------- | ------------------- | ------------------------ |
| **Speed** | Fast | Very fast | Moderate | Slow (API latency) |
| **Accuracy** | Good | Excellent | Excellent | Highest |
| **Languages** | 100+ | 80+ (11 script families) | 80+ | All (provider-dependent) |
| **Installation** | System package | Built-in (native) or Python package | Python package only | API key only |
| **Model size** | ~10 MB | Mobile ~8 MB, Server ~120 MB | ~100 MB | None (cloud-hosted) |
| **GPU support** | No | Yes | Yes | N/A (server-side) |
| **Platform** | All (including Wasm) | All except Wasm | Python only | All |
| **Cost** | Free | Free | Free | Per-token API cost |
**When to use which:**
- **Tesseract** — Default choice. Works everywhere, low overhead, broadest platform support.
- **PaddleOCR** — Best speed-to-accuracy ratio. Preferred for CJK languages. Mobile tier is fast; server tier maximizes accuracy with GPU.
- **EasyOCR** — Excellent accuracy with deep learning models. Python-only, heavier dependency.
- **VLM** — Best for handwritten text, poor scans, Arabic/Farsi, and complex layouts. Requires an API key and incurs per-token costs. See [LLM Integration](llm-integration.md) for full details.
## Installation
### Tesseract
=== "macOS"
```bash title="Terminal"
brew install tesseract
```
=== "Ubuntu / Debian"
```bash title="Terminal"
sudo apt-get install tesseract-ocr
```
=== "RHEL / Fedora"
```bash title="Terminal"
sudo dnf install tesseract
```
=== "Windows"
Download the installer from the [UB Mannheim Tesseract wiki](https://github.com/UB-Mannheim/tesseract/wiki).
**Additional language packs:**
```bash title="Terminal"
# macOS — all languages
brew install tesseract-lang
# Ubuntu/Debian — individual languages
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-fra # French
# Verify installed languages
tesseract --list-langs
```
### PaddleOCR
=== "Native bindings (Rust, Go, TypeScript, Java, C#, Ruby, PHP, Elixir)"
Built in via the `paddle-ocr` feature flag. Models download automatically on first use — no extra installation needed.
```toml title="Cargo.toml (Rust example)"
[dependencies]
kreuzberg = { version = "4.0", features = ["paddle-ocr"] }
```
=== "Python"
PaddleOCR is bundled via the native Rust bindings and works out of the box since 4.8.5 — no extra installation is needed. Models are downloaded automatically on first use.
### EasyOCR (Python only)
```bash title="Terminal"
pip install "kreuzberg[easyocr]"
```
!!! Info "Python 3.14"
    EasyOCR 1.7.3+ and PyTorch 2.9.1+ support Python 3.14. Install `kreuzberg[easyocr]` on any supported Python version (3.10 to 3.14).
!!! Tip "Tesseract marker extra"
`pip install "kreuzberg[tesseract]"` is available as a metadata-only marker to document a dependency on the Tesseract system package. It installs no Python packages — Tesseract itself must still be installed via your OS package manager (see above).
## Configuration
### Basic OCR
=== "Python"
--8<-- "snippets/python/ocr/ocr_extraction.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_extraction.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_extraction.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_extraction.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_extraction.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_extraction.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_extraction.md"
=== "Wasm"
--8<-- "snippets/wasm/ocr/ocr_extraction.md"
### Multiple Languages
Specify multiple language codes separated by `+` (Tesseract) or as a list (EasyOCR/PaddleOCR):
=== "Python"
--8<-- "snippets/python/ocr/ocr_multi_language.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_multi_language.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_multi_language.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_multi_language.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_multi_language.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_multi_language.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_multi_language.md"
=== "Wasm"
```typescript
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';
await initWasm();
await enableOcr();
const file = fileInput.files?.[0];
if (file) {
const result = await extractFromFile(file, file.type, {
ocr: { backend: 'tesseract-wasm', language: 'eng+deu' },
});
}
```
### Force OCR
Process PDFs with OCR even when they have a text layer:
=== "Python"
--8<-- "snippets/python/ocr/ocr_force_all_pages.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_force_all_pages.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_force_all_pages.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_force_all_pages.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_force_all_pages.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_force_all_pages.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_force_all_pages.md"
### Using EasyOCR
=== "Python"
--8<-- "snippets/python/ocr/ocr_easyocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_easyocr.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_easyocr.md"
### Disable OCR
!!! Info "Added in v4.7.0"
When `disable_ocr` is set, image files return empty content instead of raising `MissingDependencyError`:
=== "Python"
```python title="disable_ocr.py"
from kreuzberg import ExtractionConfig, extract_file_sync
config = ExtractionConfig(disable_ocr=True)
result = extract_file_sync("scanned.png", config=config)
# result.content will be empty — OCR was skipped
```
=== "TypeScript"
```typescript title="disable_ocr.ts"
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('scanned.png', {
disableOcr: true,
});
// result.content will be empty — OCR was skipped
```
=== "Rust"
```rust title="disable_ocr.rs"
use kreuzberg::{ExtractionConfig, extract_file};
let config = ExtractionConfig {
disable_ocr: true,
..Default::default()
};
let result = extract_file("scanned.png", None, &config).await?;
// result.content will be empty — OCR was skipped
```
### Using PaddleOCR
=== "Python"
--8<-- "snippets/python/ocr/ocr_paddleocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_paddleocr.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_paddleocr.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_paddleocr.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_paddleocr.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_paddleocr.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_paddleocr.md"
### Using VLM OCR <span class="version-badge">v4.8.0</span>
Use a vision-language model (e.g. GPT-4o, Claude) as the OCR backend — each page is rendered and sent to the VLM. Cloud providers need an API key; local engines (Ollama, etc.) use the `ollama/` prefix — see [Local LLM Support](llm-integration.md#local-llm-support).
=== "Python"
--8<-- "snippets/python/llm/vlm_ocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/llm/vlm_ocr.md"
=== "Rust"
```rust title="Rust"
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};
let config = ExtractionConfig {
    force_ocr: true,
    ocr: Some(OcrConfig {
        backend: "vlm".to_string(),
        vlm_config: Some(LlmConfig {
            model: "openai/gpt-4o-mini".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    }),
    ..Default::default()
};
let result = extract_file("scan.pdf", None, &config).await?;
```
=== "CLI"
```bash title="Terminal"
kreuzberg extract scan.pdf --force-ocr true --vlm-model openai/gpt-4o-mini
```
=== "TOML"
```toml title="kreuzberg.toml"
force_ocr = true
[ocr]
backend = "vlm"
[ocr.vlm_config]
model = "openai/gpt-4o-mini"
```
For more on VLM OCR, including custom prompts, supported providers, and API key configuration, see [LLM Integration](llm-integration.md#vlm-ocr).
!!! Tip "GPU Acceleration"
    EasyOCR and PaddleOCR support GPU acceleration. Set `use_gpu=True` in your OCR config. PaddleOCR's `model_tier="server"` gives the best accuracy with GPU.
## DPI Configuration
Higher DPI improves accuracy but increases processing time and memory.
| DPI | Trade-off |
| ----------------- | ------------------------------------------ |
| **150** | Fastest — lower accuracy, less memory |
| **300** (default) | Balanced — good accuracy, reasonable speed |
| **600** | Best accuracy — slower, more memory |
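Memory grows with the square of the DPI, because both page dimensions scale linearly with it. A rough sketch of that arithmetic, assuming a US Letter page (8.5 × 11 in) rendered as uncompressed 8-bit RGB. `rendered_page_size` is a hypothetical helper, and the memory an OCR backend actually uses will differ:

```python
def rendered_page_size(dpi: int, width_in: float = 8.5, height_in: float = 11.0,
                       bytes_per_pixel: int = 3) -> tuple[int, int, int]:
    """Return (width_px, height_px, raw_bytes) for one page rendered at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


for dpi in (150, 300, 600):
    w, h, size = rendered_page_size(dpi)
    print(f"{dpi} DPI: {w} x {h} px, ~{size / 1_000_000:.1f} MB uncompressed")
```

Under these assumptions a page costs roughly 6 MB at 150 DPI, 25 MB at 300 DPI, and 101 MB at 600 DPI: each doubling of the DPI quadruples the pixel count.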
=== "Python"
--8<-- "snippets/python/config/ocr_dpi_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_dpi_config.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_dpi_config.md"
=== "Go"
--8<-- "snippets/go/config/ocr_dpi_config.md"
=== "Java"
--8<-- "snippets/java/config/ocr_dpi_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/ocr_dpi_config.md"
=== "R"
--8<-- "snippets/r/config/ocr_dpi_config.md"
## PaddleOCR Script Families
PaddleOCR (PP-OCRv5) covers 80+ languages across 11 script families. Recognition models are downloaded on demand from HuggingFace:
| Family | Languages |
| -------------- | -------------------------------------------------------------------------------------------- |
| **English** | English, numbers, punctuation |
| **Chinese** | Simplified/Traditional Chinese, Japanese |
| **Latin**      | French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, and others |
| **Korean**     | Korean (Hangul)                                                                              |
| **Slavic**     | Russian, Ukrainian, Belarusian, Bulgarian, Serbian, and others                               |
| **Thai** | Thai script |
| **Greek** | Greek script |
| **Arabic** | Arabic, Persian, Urdu |
| **Devanagari** | Hindi, Marathi, Sanskrit, Nepali |
| **Tamil** | Tamil script |
| **Telugu** | Telugu script |
Models are cached locally after the first download, so later runs skip the download step.
## CLI Usage
```bash title="Terminal"
# Basic OCR extraction
kreuzberg extract scanned.pdf --ocr true
# Specific language
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra
# Specific backend
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch
# Force OCR on all pages
kreuzberg extract document.pdf --force-ocr true
# VLM OCR backend
kreuzberg extract handwritten.pdf --force-ocr true --vlm-model openai/gpt-4o-mini
# Use a config file
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
```
| Flag | Description |
| ------------------------- | ---------------------------------------------------------------------------------- |
| `--ocr true` | Enable OCR processing |
| `--ocr-language <code>` | Language code (`eng`, `deu`, `fra`, `ch`, `ja`, `ru`, etc.) |
| `--ocr-backend <backend>` | Engine: `tesseract`, `paddle-ocr`, `easyocr`, or `vlm` |
| `--force-ocr true` | OCR all pages regardless of text layer |
| `--vlm-model <model>` | VLM model for OCR (for example, `openai/gpt-4o-mini`). Implies `--ocr-backend vlm` |
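When driving the CLI from a script, it can help to assemble the argument list programmatically. A minimal sketch that uses only the flags listed above; `build_extract_command` is a hypothetical helper, not part of Kreuzberg:

```python
from typing import Optional


def build_extract_command(
    path: str,
    language: Optional[str] = None,
    backend: Optional[str] = None,
    force_ocr: bool = False,
    vlm_model: Optional[str] = None,
) -> list[str]:
    """Assemble a `kreuzberg extract` invocation from the documented flags."""
    cmd = ["kreuzberg", "extract", path]
    # --force-ocr OCRs every page; otherwise just enable OCR.
    cmd += ["--force-ocr", "true"] if force_ocr else ["--ocr", "true"]
    if language:
        cmd += ["--ocr-language", language]
    if vlm_model:
        # --vlm-model implies --ocr-backend vlm, so no backend flag is added.
        cmd += ["--vlm-model", vlm_model]
    elif backend:
        cmd += ["--ocr-backend", backend]
    return cmd


cmd = build_extract_command("chinese_doc.pdf", language="ch", backend="paddle-ocr")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with file names that contain spaces.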
## Troubleshooting
??? Question "Tesseract not found"
Install Tesseract and verify it's on your PATH:
```bash title="Terminal"
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Verify
tesseract --version
```
??? Question "Language not found"
Install the language data pack:
```bash title="Terminal"
# macOS — all languages
brew install tesseract-lang
# Ubuntu/Debian — individual language
sudo apt-get install tesseract-ocr-deu
# Verify
tesseract --list-langs
```
??? Question "Poor accuracy"
- Increase DPI to 600 for better quality
- Try a different backend — PaddleOCR and EasyOCR often outperform Tesseract on complex layouts
- Specify the correct language code for your document
- Use `force_ocr=True` if a PDF's embedded text layer is low quality
- For handwritten text or very poor scans, try the VLM backend with a vision-capable model (see [LLM Integration](llm-integration.md#vlm-ocr))
??? Question "Slow processing"
- Reduce DPI to 150 for faster throughput
- Enable GPU acceleration with EasyOCR or PaddleOCR (`use_gpu=True`)
- Use batch extraction to process multiple files concurrently
??? Question "Out of memory on large PDFs"
- Reduce DPI — lower resolution uses significantly less memory
- Process pages in smaller batches
- Use PaddleOCR's mobile tier (`model_tier="mobile"`) for a smaller memory footprint
## Next Steps
- [LLM Integration](llm-integration.md) — VLM OCR, structured extraction, and LLM embeddings
- [Configuration](configuration.md) — all configuration options
- [Extraction Basics](extraction.md) — core extraction API and supported formats
- [Advanced Features](advanced.md) — chunking, language detection, embeddings

@@ -0,0 +1,398 @@
# Output Formats <span class="version-badge">v4.1.0</span>
Choose the format that matches your downstream processing:
- **Unified (default)** — Plain text/Markdown, for LLM prompts and full-text search
- **Element-Based** — Flat array of typed elements with metadata, for RAG chunking and semantic search
- **Document Structure** — Hierarchical tree with explicit parent-child references, for knowledge graphs and structured apps
- **PDF Hierarchy** — Font-size classification into heading levels (H1–H6) for PDFs
## Unified Output (Default)
No configuration required. The result contains:
- `content` — Full document text with minimal formatting
- `pages` — Per-page breakdown for PDFs, DOCX, and PPTX
- `tables` — Extracted tables in structured format
- `images` — Image metadata and paths
---
## Element-Based Output <span class="version-badge">v4.1.0</span>
A flat array of typed elements (titles, paragraphs, tables, list items, code blocks, images, etc.). Each carries a page number; PDF text elements also carry bounding boxes when hierarchy extraction is enabled.
Use for RAG chunking, semantic search, or Unstructured.io-compatible pipelines.
### Enable
=== "Python"
--8<-- "snippets/python/config/element_based_output.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/element_based_output.md"
=== "Rust"
--8<-- "snippets/rust/config/element_based_output.md"
=== "Go"
--8<-- "snippets/go/config/element_based_output.md"
=== "Ruby"
--8<-- "snippets/ruby/config/element_based_output.md"
=== "R"
--8<-- "snippets/r/config/element_based_output.md"
=== "PHP"
--8<-- "snippets/php/config/element_based_output.md"
Elements are in `result.elements`. Each element has `element_id`, `element_type`, `text`, and `metadata`.
### Element Types
| `element_type` | Description | Key `additional` fields |
| ---------------- | ---------------------------------- | ------------------------------------------ |
| `title`          | Main title or top-level heading    | `level` (h1–h6), `font_size`, `font_name`  |
| `heading`        | Section/subsection heading         | `level` (h1–h6)                            |
| `narrative_text` | Body paragraph | — |
| `list_item` | Bullet, numbered, or indented item | `list_type`, `list_marker`, `indent_level` |
| `table` | Tabular data | `row_count`, `column_count`, `format` |
| `image` | Embedded image | `format`, `width`, `height`, `alt_text` |
| `code_block` | Code snippet | `language`, `line_count` |
| `block_quote` | Quoted text | — |
| `header` | Recurring page header | `position` |
| `footer` | Recurring page footer | `position` |
| `page_break` | Page boundary marker | `next_page` |
### Metadata
Every element's `metadata` contains:
| Field | Type | Description |
| --------------- | --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `page_number` | `int \| None` | 1-indexed page number (PDF, DOCX, PPTX) |
| `filename` | `str \| None` | Source filename |
| `coordinates` | `BoundingBox \| None` | `x0`, `y0`, `x1`, `y1` in PDF points. Only populated for **text elements** when `pdf_options.hierarchy` is enabled with `include_bbox=True`. Table and image elements do not carry coordinates. |
| `element_index` | `int` | Zero-based position in the elements array |
| `additional` | `dict[str, str]` | Element-type-specific fields (see table above) |
PDF coordinates use bottom-left origin in points (1/72 inch).
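Image libraries usually place the origin at the top-left, so these coordinates have to be flipped vertically before drawing boxes on a rendered page. A minimal sketch, assuming `y1` is the upper edge of the box and that the page height in points is known (792 for US Letter); `to_pixel_box` is a hypothetical helper:

```python
def to_pixel_box(x0, y0, x1, y1, page_height_pt, dpi=300):
    """Convert a bottom-left-origin box in PDF points to a
    top-left-origin (left, top, right, bottom) box in pixels."""
    scale = dpi / 72.0  # 1 point = 1/72 inch
    left = x0 * scale
    right = x1 * scale
    top = (page_height_pt - y1) * scale      # flip the vertical axis
    bottom = (page_height_pt - y0) * scale
    return tuple(round(v) for v in (left, top, right, bottom))


# A title box near the top of a US Letter page (612 x 792 pt), rendered at 72 DPI.
print(to_pixel_box(72.0, 700.0, 540.0, 730.0, page_height_pt=792.0, dpi=72))
# → (72, 62, 540, 92)
```

At 72 DPI one point maps to one pixel, which makes the flip easy to check by hand: the top edge moves from 730 pt above the bottom to 62 px below the top.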
### Example Output
```json
{
  "element_id": "elem-a3f2b1c4",
  "element_type": "title",
  "text": "Introduction to Machine Learning",
  "metadata": {
    "page_number": 1,
    "element_index": 0,
    "coordinates": { "x0": 72.0, "y0": 700.0, "x1": 540.0, "y1": 730.0 },
    "additional": { "level": "h1", "font_size": "24" }
  }
}
```
### Filtering Elements
```python
from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig(result_format="element_based")
result = extract_file_sync("document.pdf", config=config)

# Select elements by type
titles = [e for e in result.elements if e.element_type == "title"]
tables = [e for e in result.elements if e.element_type == "table"]

# `additional` holds type-specific fields; fall back to "h1" if `level` is missing
for title in titles:
    level = title.metadata.additional.get("level", "h1")
    print(f"[{level}] {title.text}")
```
### Migrating from Unstructured.io
If you're migrating from Unstructured.io, element-based output follows a similar structure with these key differences:
| Aspect | Unstructured.io | Kreuzberg |
| ----------- | ------------------------------------- | ------------------------------------------- |
| Type names | PascalCase (`Title`, `NarrativeText`) | snake_case (`title`, `narrative_text`) |
| Element IDs | Not always present | Always present (deterministic hash) |
| Metadata | Basic (`page_number`, `filename`) | Extended (coordinates, `additional` fields) |
| Config key | — | `result_format="element_based"` |
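The type-name difference in the first row is mechanical, so existing code that switches on Unstructured.io type names can often be adapted with a small conversion. A sketch of the naming rule only; `to_kreuzberg_type` is a hypothetical helper, and a type without a counterpart in the Element Types table still needs manual handling:

```python
import re


def to_kreuzberg_type(unstructured_type: str) -> str:
    """Convert a PascalCase type name to snake_case."""
    # Insert "_" before every uppercase letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", unstructured_type).lower()


for name in ("Title", "NarrativeText"):
    print(name, "->", to_kreuzberg_type(name))
```

This maps `Title` to `title` and `NarrativeText` to `narrative_text`, matching the names in the table.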
---
## Document Structure
A flat list of nodes with explicit parent-child index references — a traversable tree with heading levels, content layers, inline annotations, and structured table grids.
Use when you need hierarchical relationships between sections.
### Comparison
| Aspect | Unified (default) | Element-based | Document structure |
| ------------------ | ---------------------- | -------------------- | --------------------------------- |
| Output shape | `content: string` | `elements: array` | `nodes: array` with index refs |
| Hierarchy | None | Inferred from levels | Explicit parent/child indices |
| Inline annotations | No | No | Bold, italic, links per node |
| Tables | `result.tables` | Table elements | `TableGrid` with cell coords |
| Content layers | Not classified | Not classified | body, header, footer, footnote |
| Best for | LLM prompts, full-text | RAG chunking | Knowledge graphs, structured apps |
### Enable
=== "Python"
--8<-- "snippets/python/config/document_structure_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/document_structure_config.md"
=== "Rust"
--8<-- "snippets/rust/config/document_structure_config.md"
=== "Go"
--8<-- "snippets/go/config/document_structure_config.md"
=== "Java"
--8<-- "snippets/java/config/document_structure_config.md"
=== "C#"
--8<-- "snippets/csharp/config/document_structure_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/document_structure_config.md"
=== "R"
--8<-- "snippets/r/config/document_structure_config.md"
### Node Shape
Each node in `result.document.nodes`:
```json
{
  "id": "node-a3f2b1c4",
  "content": { "node_type": "heading", "level": 2, "text": "Supervised Learning" },
  "parent": 0,
  "children": [4, 5, 6],
  "content_layer": "body",
  "page": 5,
  "page_end": null,
  "bbox": { "x0": 72.0, "y0": 600.0, "x1": 400.0, "y1": 620.0 },
  "annotations": []
}
```
- `parent` and `children` are integer indices into the `nodes` array (`null` if absent)
- `bbox` is present when bounding box data is available
- `annotations` contains inline formatting spans
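Because `parent` and `children` are plain indices, the tree can be walked with a short recursive function. A minimal sketch, assuming nodes are dictionaries in the shape shown above and that root nodes have `parent` set to `null`; the sample data is hand-written, not real extraction output:

```python
def outline(nodes, index=None, depth=0):
    """Return indented labels for the subtree rooted at `index`
    (or for every root node when `index` is None)."""
    if index is None:
        roots = [i for i, n in enumerate(nodes) if n.get("parent") is None]
        return [line for r in roots for line in outline(nodes, r, depth)]
    node = nodes[index]
    content = node["content"]
    label = content.get("text") or content["node_type"]
    lines = ["  " * depth + label]
    for child in node.get("children") or []:  # `children` may be null
        lines.extend(outline(nodes, child, depth + 1))
    return lines


nodes = [
    {"content": {"node_type": "title", "text": "ML Guide"}, "parent": None, "children": [1]},
    {"content": {"node_type": "heading", "level": 2, "text": "Supervised Learning"},
     "parent": 0, "children": [2]},
    {"content": {"node_type": "paragraph", "text": "Labels guide training."},
     "parent": 1, "children": None},
]
print("\n".join(outline(nodes)))
```

Nodes without text, such as containers and page breaks, fall back to their `node_type` as the label.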
### Node Types
| `node_type` | Key fields | Notes |
| ------------ | ---------------------------------------- | ------------------------------------------- |
| `title` | `text` | Document title |
| `heading`    | `level` (1–6), `text`                    | Section heading                             |
| `paragraph` | `text` | Body paragraph; may have `annotations` |
| `list` | `ordered` (bool) | Container; children are `list_item` nodes |
| `list_item` | `text` | Child of `list` |
| `table` | `grid` ([TableGrid](#table-grid)) | Grid with cell-level data |
| `image` | `description`, `image_index` | `image_index` references `result.images` |
| `code` | `text`, `language` | Code block |
| `quote` | _(container)_ | Children are typically paragraphs |
| `formula` | `text` | Math formula (plain text, LaTeX, or MathML) |
| `footnote` | `text` | Usually `content_layer: "footnote"` |
| `group` | `label`, `heading_level`, `heading_text` | Section grouping container |
| `page_break` | _(marker)_ | Page boundary |
### Content Layers
| Layer | Description |
| ---------- | ------------------------------------------ |
| `body` | Main document content |
| `header` | Page header area (repeated chapter titles) |
| `footer` | Page footer area (page numbers, copyright) |
| `footnote` | Footnotes and endnotes |
```python
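# Assumes `result` comes from an extraction with document structure enabled
for node in result.document["nodes"]:
    if node["content_layer"] == "body":
        process_main_content(node)  # placeholder for your own handler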
```
### Text Annotations
Paragraphs carry a list of `annotations` marking character spans:
```json
{ "start": 0, "end": 16, "kind": { "annotation_type": "bold" } }
```
| `annotation_type` | Extra fields |
| ---------------------------------------------- | ------------------------- |
| `bold`, `italic`, `underline`, `strikethrough` | — |
| `code`, `subscript`, `superscript` | — |
| `link` | `url`, `title` (optional) |
```python
for node in result.document["nodes"]:
    text = node["content"].get("text", "")
    for ann in node.get("annotations", []):
        span = text[ann["start"]:ann["end"]]  # characters covered by the annotation
        kind = ann["kind"]["annotation_type"]
        if kind == "link":
            print(f"Link: {span} -> {ann['kind']['url']}")
        else:
            print(f"{kind}: {span}")
```
### Table Grid
Table nodes contain a `grid` with cell-level data:
```json
{
  "rows": 3,
  "cols": 3,
  "cells": [
    { "content": "Algorithm", "row": 0, "col": 0, "row_span": 1, "col_span": 1, "is_header": true },
    {
      "content": "Decision Tree",
      "row": 1,
      "col": 0,
      "row_span": 1,
      "col_span": 1,
      "is_header": false
    }
  ]
}
```
Each cell has `row`, `col`, `row_span`, `col_span`, `is_header`, and optionally `bbox`.
```python
for node in result.document["nodes"]:
    if node["content"]["node_type"] == "table":
        grid = node["content"]["grid"]
        rows, cols = grid["rows"], grid["cols"]
        # Place each cell at its (row, col) position; unfilled slots stay None
        table = [[None] * cols for _ in range(rows)]
        for cell in grid["cells"]:
            table[cell["row"]][cell["col"]] = cell["content"]
        for row in table:
            print(" | ".join(str(c or "") for c in row))
```
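The loop above writes each cell only at its own `row`/`col` position, so a merged cell leaves the rest of its span empty. One way to handle spans is to repeat the content across the covered area. A sketch under that assumption (repeating is a design choice, not something the grid format requires), using a hand-written sample grid:

```python
def expand_grid(grid):
    """Build a rows x cols matrix, repeating each cell's content
    across the area covered by its row_span and col_span."""
    table = [[""] * grid["cols"] for _ in range(grid["rows"])]
    for cell in grid["cells"]:
        for r in range(cell["row"], cell["row"] + cell.get("row_span", 1)):
            for c in range(cell["col"], cell["col"] + cell.get("col_span", 1)):
                if r < grid["rows"] and c < grid["cols"]:  # ignore out-of-range spans
                    table[r][c] = cell["content"]
    return table


# A header cell spanning two columns above two ordinary cells.
grid = {
    "rows": 2,
    "cols": 2,
    "cells": [
        {"content": "Metrics", "row": 0, "col": 0, "row_span": 1, "col_span": 2, "is_header": True},
        {"content": "Accuracy", "row": 1, "col": 0, "row_span": 1, "col_span": 1, "is_header": False},
        {"content": "0.92", "row": 1, "col": 1, "row_span": 1, "col_span": 1, "is_header": False},
    ],
}
for row in expand_grid(grid):
    print(" | ".join(row))
```

If repeated values would distort downstream processing (for example, when summing a column), leave the extra positions empty instead and keep the span information alongside the matrix.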
---
## PDF Hierarchy Detection
Classifies PDF text blocks into heading levels (H1–H6) and body text via K-means clustering on font sizes: the cluster with the largest font size becomes H1, the next largest H2, and so on.
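The idea can be illustrated with a small one-dimensional k-means over font sizes. This is a simplified sketch, not Kreuzberg's implementation: the centroid seeding, the fixed iteration count, and the helper names `kmeans_1d` and `rank_of` are all assumptions made for the example.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into at most k groups; return centroids, largest first."""
    distinct = sorted(set(values), reverse=True)
    k = min(k, len(distinct))
    # Assumption: seed centroids by spreading them evenly over the distinct sizes.
    step = (len(distinct) - 1) / max(k - 1, 1)
    centroids = [distinct[round(i * step)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid if a group ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return sorted(centroids, reverse=True)


def rank_of(font_size, centroids):
    """1-based rank of the nearest centroid (1 = largest font size)."""
    return 1 + min(range(len(centroids)), key=lambda i: abs(font_size - centroids[i]))


sizes = [24.0, 18.0, 18.5, 12.0, 12.0, 11.5, 12.0]
centroids = kmeans_1d(sizes, k=3)
for size in (24.0, 18.0, 12.0):
    print(size, "-> rank", rank_of(size, centroids))
```

Rank 1 corresponds to H1, rank 2 to H2, and so on. Note that the ranks depend only on the relative order of the clusters: a document whose largest text is 14 pt still gets an H1.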
### Quick Start
=== "Python"
--8<-- "snippets/python/config/pdf_hierarchy_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/pdf_hierarchy_config.md"
=== "Rust"
--8<-- "snippets/rust/config/pdf_hierarchy_config.md"
=== "Go"
--8<-- "snippets/go/config/pdf_hierarchy_config.md"
=== "Java"
--8<-- "snippets/java/config/pdf_hierarchy_config.md"
=== "C#"
--8<-- "snippets/csharp/config/pdf_hierarchy_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/pdf_hierarchy_config.md"
### Output
Hierarchy data is in `result.pages[n].hierarchy`. Each page has a `blocks` list:
```json
{
  "block_count": 3,
  "blocks": [
    {
      "text": "Chapter 1: Introduction",
      "level": "h1",
      "font_size": 24.0,
      "bbox": [50.0, 100.0, 400.0, 125.0]
    },
    { "text": "Background", "level": "h2", "font_size": 18.0, "bbox": [50.0, 150.0, 300.0, 168.0] },
    {
      "text": "This chapter provides...",
      "level": "body",
      "font_size": 12.0,
      "bbox": [50.0, 200.0, 550.0, 450.0]
    }
  ]
}
```
- `bbox`: `[left, top, right, bottom]` in PDF points (present when `include_bbox=True`). This is the only way to obtain bounding box coordinates for text elements — they are not included by default.
- `level`: `"h1"` through `"h6"`, or `"body"`
### Configuration
| Parameter | Type | Default | Description |
| ------------------------ | --------------- | ------- | --------------------------------------------------- |
| `enabled` | `bool` | `true` | Enable hierarchy extraction |
| `k_clusters`             | `int`           | `6`     | Font size clusters (2–10), maps to heading levels   |
| `include_bbox` | `bool` | `true` | Include bounding box coordinates |
| `ocr_coverage_threshold` | `float \| None` | `None` | Trigger OCR if text coverage is below this fraction |
#### Choosing k_clusters
| `k_clusters` | Heading levels | Use when |
| ------------ | -------------- | --------------------------------------- |
| 2–3          | H1–H2          | Simple documents with 1–2 heading sizes |
| 4–5          | H1–H4          | Standard documents                      |
| 6 (default)  | H1–H6          | Most documents                          |
| 7–8          | H1–H6+         | Books, specs with deep nesting          |
#### ocr_coverage_threshold
| Threshold | Behavior |
| --------- | ------------------------------- |
| `None` | OCR never triggered by coverage |
| `0.3` | OCR if < 30% of page has text |
| `0.5` | OCR if < 50% of page has text |
Requires an OCR backend to be configured separately.
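The rule in the table reduces to a single comparison. A minimal sketch of that decision; `should_trigger_ocr` is a hypothetical helper, and how Kreuzberg measures text coverage is not specified here, so the function takes the fraction as given:

```python
from typing import Optional


def should_trigger_ocr(text_coverage: float, threshold: Optional[float]) -> bool:
    """Return True when the page's text coverage falls below the threshold."""
    if threshold is None:  # None disables the coverage check entirely
        return False
    return text_coverage < threshold


print(should_trigger_ocr(0.2, 0.3))   # True: 20% coverage is below 30%
print(should_trigger_ocr(0.4, 0.3))   # False
print(should_trigger_ocr(0.0, None))  # False: never triggered by coverage
```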
### Troubleshooting
- **`hierarchy` is `None`** — Check that `hierarchy.enabled` is `True`. If the PDF is image-only, enable OCR. If there are fewer text blocks than `k_clusters`, reduce `k_clusters`.
- **Most blocks classified as `body`** — The document may use uniform font sizes. Reduce `k_clusters` (try 3–4).
- **Heading levels don't match visual inspection** — Levels are assigned by font size rank, not absolute size. Filter on `block.font_size` directly for absolute thresholds.
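Filtering on absolute font size, as suggested in the last item, might look like the following. This is a sketch over hand-written blocks in the shape shown under Output; it assumes dictionary access, whereas a language binding may expose the same fields as attributes (`block.font_size`):

```python
def blocks_at_least(blocks, min_font_size):
    """Select blocks whose font size meets an absolute threshold,
    regardless of the level assigned by clustering."""
    return [b for b in blocks if b["font_size"] >= min_font_size]


blocks = [
    {"text": "Chapter 1: Introduction", "level": "h1", "font_size": 24.0},
    {"text": "Background", "level": "h2", "font_size": 18.0},
    {"text": "This chapter provides...", "level": "body", "font_size": 12.0},
]
for block in blocks_at_least(blocks, min_font_size=16.0):
    print(block["font_size"], block["text"])
```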
See the [HierarchyConfig reference](../reference/configuration.md#hierarchyconfig) for the full parameter list.

348
docs/guides/plugins.md Normal file
@@ -0,0 +1,348 @@
# Creating Plugins <span class="version-badge">v4.0.0</span>
Extend Kreuzberg with custom extractors, post-processors, OCR backends, and validators registered globally for use across all extraction calls.
!!! Note "Wasm"
    Custom plugins are not supported in Wasm environments. Use Python, Rust, or other native bindings.
## Plugin Types
| Type | Purpose | Use case |
| --------------------- | --------------------------------- | ---------------------------------------------------------- |
| **DocumentExtractor** | Extract content from file formats | New format support, override built-in extractors |
| **PostProcessor** | Transform extraction results | Metadata enrichment, content filtering, text normalization |
| **OcrBackend** | Perform OCR on images | Cloud OCR services, custom OCR engines |
| **Validator** | Validate extraction quality | Minimum content length, quality score thresholds |
All plugins must be thread-safe (`Send + Sync` in Rust, thread-safe in Python) and implement `initialize()` / `shutdown()` lifecycle methods.
## Document Extractors
### Implementation
=== "Rust"
--8<-- "snippets/rust/plugins/plugin_extractor.md"
=== "Python"
--8<-- "snippets/python/plugins/plugin_extractor.md"
### Registration
=== "Python"
--8<-- "snippets/python/plugins/extractor_registration.md"
=== "TypeScript"
--8<-- "snippets/typescript/plugins/custom_extractor_plugin.md"
=== "Rust"
--8<-- "snippets/rust/plugins/extractor_registration.md"
=== "Go"
--8<-- "snippets/go/plugins/extractor_registration.md"
=== "Java"
--8<-- "snippets/java/plugins/extractor_registration.md"
=== "C#"
--8<-- "snippets/csharp/extractor_registration.md"
=== "Ruby"
--8<-- "snippets/ruby/plugins/extractor_registration.md"
=== "R"
--8<-- "snippets/r/plugins/extractor_registration.md"
### Priority System
When multiple extractors support the same MIME type, the highest priority wins:
| Range | Level |
| ------ | --------------------------- |
| 0–25   | Fallback / low-quality      |
| 26–49  | Alternative                 |
| **50** | **Default (built-in)**      |
| 51–75  | Enhanced / premium          |
| 76–100 | Specialized / high-priority |
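The selection rule can be sketched as follows. `ExtractorInfo` and `select_extractor` are hypothetical stand-ins for illustration, not Kreuzberg types, and the tie-breaking behavior (first registered wins) is an assumption:

```python
from dataclasses import dataclass


@dataclass
class ExtractorInfo:  # hypothetical stand-in for a registered extractor
    name: str
    mime_types: tuple
    priority: int = 50  # 50 is the built-in default


def select_extractor(extractors, mime_type):
    """Return the highest-priority extractor that supports `mime_type`, or None."""
    candidates = [e for e in extractors if mime_type in e.mime_types]
    # Assumption: on a tie, the extractor registered first wins
    # (max() returns the first maximum it encounters).
    return max(candidates, key=lambda e: e.priority, default=None)


registry = [
    ExtractorInfo("builtin-pdf", ("application/pdf",)),
    ExtractorInfo("custom-pdf", ("application/pdf",), priority=80),
    ExtractorInfo("fallback-text", ("text/plain",), priority=10),
]
print(select_extractor(registry, "application/pdf").name)  # custom-pdf
```

A custom extractor therefore needs a priority above 50 to override a built-in one for the same MIME type.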
## Post-Processors
Processors execute in three stages:
- **Early** — Foundational: language detection, quality scoring, text normalization
- **Middle** — Transformation: keyword extraction, token reduction, summarization
- **Late** — Final: custom metadata, analytics, output formatting
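The staging matters because later processors can rely on the output of earlier ones. A minimal sketch of stage-ordered execution; the `Stage` enum, `run_pipeline`, and the dictionary-based result are hypothetical and only illustrate the ordering, and the within-stage order (registration order) is an assumption:

```python
from enum import IntEnum


class Stage(IntEnum):
    EARLY = 0
    MIDDLE = 1
    LATE = 2


def run_pipeline(result, processors):
    """Apply (stage, function) pairs in stage order.
    sorted() is stable, so processors in the same stage keep their list order."""
    for _, process in sorted(processors, key=lambda pair: pair[0]):
        result = process(result)
    return result


processors = [
    # Registered first, but runs last: counts words in the normalized text.
    (Stage.LATE, lambda r: {**r, "word_count": len(r["content"].split())}),
    # Runs first: collapses runs of whitespace.
    (Stage.EARLY, lambda r: {**r, "content": " ".join(r["content"].split())}),
]
print(run_pipeline({"content": "  hello   world  "}, processors))
# → {'content': 'hello world', 'word_count': 2}
```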
### Implementation
=== "Rust"
--8<-- "snippets/rust/plugins/word_count_processor.md"
=== "Python"
--8<-- "snippets/python/plugins/word_count_processor.md"
### Conditional Processing
=== "Python"
--8<-- "snippets/python/plugins/pdf_only_processor.md"
=== "Rust"
--8<-- "snippets/rust/metadata/pdf_only_processor.md"
=== "Go"
--8<-- "snippets/go/plugins/pdf_only_processor.md"
=== "Java"
--8<-- "snippets/java/plugins/pdf_only_processor.md"
=== "C#"
--8<-- "snippets/csharp/pdf_only_processor.md"
## OCR Backends
### Implementation
=== "Rust"
--8<-- "snippets/rust/ocr/cloud_ocr_backend.md"
=== "Python"
--8<-- "snippets/python/ocr/cloud_ocr_backend.md"
=== "Java"
--8<-- "snippets/java/ocr/cloud_ocr_backend.md"
=== "C#"
--8<-- "snippets/csharp/cloud_ocr_backend.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/cloud_ocr_backend.md"
=== "R"
--8<-- "snippets/r/ocr/cloud_ocr_backend.md"
### Registration
Register the backend and set its name in `OcrConfig`:
=== "Python"
```python title="Python"
from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync
from kreuzberg import register_ocr_backend, unregister_ocr_backend

# CloudOcrBackend is assumed to be the class from the Implementation snippet above
backend = CloudOcrBackend(api_key="your-api-key")
register_ocr_backend(backend)

# Select the backend by its registered name
config = ExtractionConfig(ocr=OcrConfig(backend="cloud-ocr", language="eng"))
result = extract_file_sync("scanned.pdf", config=config)

# Remove the backend when it is no longer needed
unregister_ocr_backend("cloud-ocr")
```
### Using EasyOCR (Built-in)
The built-in EasyOCR backend supports 80+ languages and GPU acceleration — just point `OcrConfig` at it:
=== "Python"
--8<-- "snippets/python/ocr/ocr_easyocr.md"
## Validators
!!! Warning
    Validation errors cause extraction to fail. Use validators for critical quality checks only.
=== "Rust"
--8<-- "snippets/rust/plugins/min_length_validator.md"
=== "Python"
--8<-- "snippets/python/plugins/min_length_validator.md"
=== "Java"
--8<-- "snippets/java/plugins/min_length_validator.md"
=== "C#"
--8<-- "snippets/csharp/min_length_validator.md"
### Quality Score Validator
=== "Rust"
--8<-- "snippets/rust/plugins/quality_score_validator.md"
=== "Python"
--8<-- "snippets/python/plugins/quality_score_validator.md"
=== "Java"
--8<-- "snippets/java/plugins/quality_score_validator.md"
=== "C#"
--8<-- "snippets/csharp/quality_score_validator.md"
## Plugin Management
### Listing
=== "Python"
--8<-- "snippets/python/plugins/list_plugins.md"
=== "Rust"
--8<-- "snippets/rust/plugins/list_plugins.md"
=== "Java"
--8<-- "snippets/java/plugins/list_plugins.md"
=== "C#"
--8<-- "snippets/csharp/list_plugins.md"
### Unregistering
=== "Python"
--8<-- "snippets/python/plugins/unregister_plugins.md"
=== "Rust"
--8<-- "snippets/rust/plugins/unregister_plugins.md"
=== "Java"
--8<-- "snippets/java/plugins/unregister_plugins.md"
=== "C#"
--8<-- "snippets/csharp/unregister_plugins.md"
### Clearing All
=== "Python"
--8<-- "snippets/python/plugins/clear_plugins.md"
=== "Rust"
--8<-- "snippets/rust/plugins/clear_plugins.md"
=== "Java"
--8<-- "snippets/java/plugins/clear_plugins.md"
=== "C#"
--8<-- "snippets/csharp/clear_plugins.md"
## Thread Safety
=== "Rust"
--8<-- "snippets/rust/plugins/stateful_plugin.md"
=== "Python"
--8<-- "snippets/python/plugins/stateful_plugin.md"
=== "Java"
--8<-- "snippets/java/plugins/stateful_plugin.md"
=== "C#"
--8<-- "snippets/csharp/stateful_plugin.md"
## Best Practices
**Naming:** Use kebab-case (`my-custom-plugin`), lowercase only, no spaces or special characters.
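A name check along these lines can catch mistakes before registration. This is a sketch: `is_valid_plugin_name` is a hypothetical helper, and allowing digits is an assumption, since the rule above mentions only lowercase and kebab-case:

```python
import re

# Lowercase letters and digits, with single hyphens between segments.
PLUGIN_NAME = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_plugin_name(name: str) -> bool:
    """Return True if `name` is lowercase kebab-case with no spaces or symbols."""
    return PLUGIN_NAME.fullmatch(name) is not None


for name in ("my-custom-plugin", "My_Plugin", "cloud ocr", "cloud-ocr"):
    print(f"{name!r}: {is_valid_plugin_name(name)}")
```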
### Logging
=== "Python"
--8<-- "snippets/python/plugins/plugin_logging.md"
=== "Rust"
--8<-- "snippets/rust/plugins/plugin_logging.md"
=== "Java"
--8<-- "snippets/java/plugins/plugin_logging.md"
=== "C#"
--8<-- "snippets/csharp/plugin_logging.md"
### Testing
=== "Python"
--8<-- "snippets/python/plugins/plugin_testing.md"
=== "Rust"
--8<-- "snippets/rust/plugins/plugin_testing.md"
=== "Java"
--8<-- "snippets/java/plugins/plugin_testing.md"
=== "C#"
--8<-- "snippets/csharp/plugin_testing.md"
## Complete Example: PDF Metadata Extractor
=== "Python"
--8<-- "snippets/python/metadata/pdf_metadata_extractor.md"
=== "Go"
--8<-- "snippets/go/plugins/pdf_metadata_extractor.md"
=== "Java"
--8<-- "snippets/java/plugins/pdf_metadata_extractor.md"
=== "C#"
--8<-- "snippets/csharp/pdf_metadata_extractor.md"
=== "Ruby"
--8<-- "snippets/ruby/plugins/pdf_metadata_extractor.md"
=== "R"
--8<-- "snippets/r/plugins/pdf_metadata_extractor.md"