# Features

A map of what Kreuzberg can do. Each section links to the guide or reference page with configuration details and code examples.

---

## Format Support

90+ file formats handled by native Rust extractors -- no LibreOffice or other external tools required.

=== "Documents"

    <div class="format-chips">
      <span class="format-chip">PDF <code>.pdf</code></span>
      <span class="format-chip">Word <code>.docx .doc</code></span>
      <span class="format-chip">Pages <code>.pages</code></span>
      <span class="format-chip">PowerPoint <code>.pptx .ppt</code></span>
      <span class="format-chip">Keynote <code>.key</code></span>
      <span class="format-chip">OpenDocument <code>.odt</code></span>
      <span class="format-chip">Plain text <code>.txt</code></span>
      <span class="format-chip">Markdown <code>.md</code></span>
      <span class="format-chip">Djot <code>.djot</code></span>
      <span class="format-chip">MDX <code>.mdx</code></span>
      <span class="format-chip">RTF <code>.rtf</code></span>
      <span class="format-chip">reStructuredText <code>.rst</code></span>
      <span class="format-chip">Org <code>.org</code></span>
      <span class="format-chip">Hangul <code>.hwp .hwpx</code></span>
    </div>

=== "Spreadsheets"

    <div class="format-chips">
      <span class="format-chip">Excel <code>.xlsx .xls .xlsm .xlsb</code></span>
      <span class="format-chip">Numbers <code>.numbers</code></span>
      <span class="format-chip">OpenDocument <code>.ods</code></span>
      <span class="format-chip">CSV <code>.csv</code></span>
      <span class="format-chip">TSV <code>.tsv</code></span>
      <span class="format-chip">dBASE <code>.dbf</code></span>
    </div>

=== "Images"

    <div class="format-chips">
      <span class="format-chip">JPEG <code>.jpg .jpeg</code></span>
      <span class="format-chip">PNG <code>.png</code></span>
      <span class="format-chip">GIF <code>.gif</code></span>
      <span class="format-chip">BMP <code>.bmp</code></span>
      <span class="format-chip">TIFF <code>.tiff .tif</code></span>
      <span class="format-chip">WebP <code>.webp</code></span>
      <span class="format-chip">JPEG 2000 <code>.jp2 .jpx .jpm .mj2</code></span>
      <span class="format-chip">JBIG2 <code>.jbig2</code></span>
      <span class="format-chip">PNM <code>.pnm .pbm .pgm .ppm</code></span>
    </div>

=== "Email"

    <div class="format-chips">
      <span class="format-chip">EML <code>.eml</code></span>
      <span class="format-chip">MSG <code>.msg</code></span>
    </div>

=== "Web and Markup"

    <div class="format-chips">
      <span class="format-chip">HTML <code>.html .htm</code></span>
      <span class="format-chip">XHTML <code>.xhtml</code></span>
      <span class="format-chip">XML <code>.xml</code></span>
      <span class="format-chip">SVG <code>.svg</code></span>
    </div>

=== "Structured Data"

    <div class="format-chips">
      <span class="format-chip">JSON <code>.json</code></span>
      <span class="format-chip">YAML <code>.yaml</code></span>
      <span class="format-chip">TOML <code>.toml</code></span>
    </div>

=== "Archives"

    <div class="format-chips">
      <span class="format-chip">ZIP <code>.zip</code></span>
      <span class="format-chip">TAR <code>.tar .tgz</code></span>
      <span class="format-chip">GZIP <code>.gz</code></span>
      <span class="format-chip">7-Zip <code>.7z</code></span>
    </div>

=== "Academic"

    <div class="format-chips">
      <span class="format-chip">EPUB <code>.epub</code></span>
      <span class="format-chip">BibTeX <code>.bib</code></span>
      <span class="format-chip">RIS <code>.ris</code></span>
      <span class="format-chip">CSL <code>.csl</code></span>
      <span class="format-chip">LaTeX <code>.tex</code></span>
      <span class="format-chip">Typst <code>.typ</code></span>
      <span class="format-chip">JATS <code>.jats</code></span>
      <span class="format-chip">DocBook <code>.docbook</code></span>
      <span class="format-chip">OPML <code>.opml</code></span>
    </div>

For the full format matrix with MIME types, extraction methods, and special capabilities, see the [Format Support Reference](reference/formats.md).

---

## Extraction Pipeline

Every file flows through the same multi-stage pipeline:

```mermaid
flowchart LR
    A[Input File] --> B[MIME Detection]
    B --> C[Format Extractor]
    C --> D{OCR Needed?}
    D -->|Yes| E[OCR Engine]
    D -->|No| F[Post-Processing]
    E --> F
    F --> G[ExtractionResult]
```

1. **MIME detection** -- Kreuzberg identifies the file type from magic bytes and extension, then selects the matching native extractor from the registry.
2. **Format extraction** -- The extractor pulls text, tables, metadata, and optionally images from the file. PDF extraction uses pdf_oxide (pure Rust); Office formats use streaming XML parsers; images pass directly to OCR.
3. **OCR** -- When the extractor finds no text layer (or `force_ocr` is set), the file is routed to the configured OCR backend. The OCR result replaces or supplements the extracted text.
4. **Post-processing** -- Validators, quality processing, chunking, embeddings, keyword extraction, and any registered post-processor plugins run in sequence.
5. **Caching** -- If caching is enabled, results are stored keyed by a content hash so repeated extractions skip the entire pipeline.

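To make step 1 concrete, here is a minimal Python sketch of magic-byte detection with an extension fallback. It is illustrative only and not Kreuzberg's implementation: `detect_mime`, the signature table, and the two extension tables are hypothetical names that cover only a handful of formats.

```python
from pathlib import Path

# Hypothetical signature table: leading bytes -> MIME type (tiny subset).
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),  # also the container for DOCX/XLSX
]

# ZIP-based formats can only be told apart by looking past the signature.
ZIP_BASED = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Text formats have no magic bytes, so the extension is the only hint.
TEXT_TYPES = {".md": "text/markdown", ".csv": "text/csv"}


def detect_mime(filename: str, head: bytes) -> str:
    """Guess a MIME type from a file's first bytes, then from its extension."""
    suffix = Path(filename).suffix.lower()
    for signature, mime in MAGIC_SIGNATURES:
        if head.startswith(signature):
            if mime == "application/zip":
                # Refine the generic container type with the extension.
                return ZIP_BASED.get(suffix, mime)
            return mime
    return TEXT_TYPES.get(suffix, "application/octet-stream")
```

A ZIP signature alone cannot distinguish a `.docx` from a plain archive, which is why the sketch consults the extension only in that case and for formats without magic bytes.
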

For a deep dive into each stage, see [Extraction Pipeline](concepts/extraction-pipeline.md).

### Output Formats

Kreuzberg supports five output formats: **Plain text**, **Markdown**, **Djot**, **HTML**, and **Structured (JSON)**. The HTML format includes a styled renderer with semantic `kb-*` CSS classes, five built-in themes, and CSS custom properties for full customization. See [HTML Output](guides/html-output.md) for details.

---

## OCR Engines

Three OCR backends, usable individually or chained into a quality-driven fallback pipeline.

### Backend Comparison

| | Tesseract | PaddleOCR | EasyOCR |
| ------------------ | ---------------------------------------- | ------------------------------------------------------------------ | -------------------------------- |
| **Languages** | 100+ | 80+ (11 script families) | 80+ |
| **Best for** | General purpose, broad language coverage | CJK, complex scripts, high accuracy | GPU-accelerated workloads |
| **Platform** | All bindings including WASM | All non-WASM bindings | Python only |
| **Install** | System package (`tesseract-ocr`) | Cargo feature `paddle-ocr` (bundled in Python package since 4.8.5) | `pip install kreuzberg[easyocr]` |
| **Runtime** | C library (Tesseract 4.0+) | ONNX Runtime (models downloaded on first use) | PyTorch (optional CUDA) |
| **Python version** | Any | Any | Any |

### Multi-Backend Pipeline

!!! Info "Added in v4.5.0"

When the `paddle-ocr` feature is enabled, Kreuzberg automatically constructs a fallback pipeline: Tesseract runs first, and if the output falls below configurable quality thresholds (16 tunable parameters), PaddleOCR takes over. You can also define a custom ordering across all three backends.

The pipeline supports auto-rotate for page orientation detection (0/90/180/270 degrees) and per-stage language and backend-specific settings.

```mermaid
flowchart TD
    A[Image / Scanned Page] --> B[Primary Backend]
    B --> C{Quality Above Threshold?}
    C -->|Yes| D[Return Result]
    C -->|No| E[Fallback Backend]
    E --> F{Quality Above Threshold?}
    F -->|Yes| D
    F -->|No| G[Return Best Result]
```

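The control flow in the diagram can be sketched in a few lines of Python. This is a simplified illustration, not Kreuzberg's code: `OcrResult`, `run_with_fallback`, the single `quality` score, and the `0.7` threshold are assumptions standing in for the real multi-parameter quality checks, and the two backends are stubs that return canned text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class OcrResult:
    backend: str
    text: str
    quality: float  # hypothetical score in [0, 1]


def run_with_fallback(
    image: bytes,
    backends: Sequence[Callable[[bytes], OcrResult]],
    threshold: float = 0.7,
) -> OcrResult:
    """Try backends in order and stop at the first result above the threshold."""
    if not backends:
        raise ValueError("at least one OCR backend is required")
    best: Optional[OcrResult] = None
    for backend in backends:
        result = backend(image)
        if result.quality >= threshold:
            return result
        # Remember the strongest result in case no backend clears the bar.
        if best is None or result.quality > best.quality:
            best = result
    return best


# Stand-in backends: they return canned results instead of running OCR.
def fake_tesseract(image: bytes) -> OcrResult:
    return OcrResult("tesseract", "l0rem 1psum", 0.4)


def fake_paddle(image: bytes) -> OcrResult:
    return OcrResult("paddleocr", "lorem ipsum", 0.9)
```

When every backend falls short, the sketch returns the highest-scoring attempt, which corresponds to the "Return Best Result" node.
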

### Document-Level Optimization

!!! Info "Added in v4.5.3"

Some OCR backends (including EasyOCR) now support **document-level processing**. When a file path is provided, the extractor can bypass the expensive page-by-page rendering stage and delegate the entire document to the OCR engine. This significantly reduces memory overhead and improves throughput for large PDFs and multi-page images.

For backend configuration, language selection, and PSM/OEM modes, see the [OCR Guide](guides/ocr.md).

---

## Processing Features

Optional post-extraction steps, each configured independently through `ExtractionConfig`.

### For RAG Pipelines

**Content Chunking** -- Split extracted text into size-bounded chunks for LLM consumption. Strategies include recursive (paragraph/sentence/word splitting), semantic, and Markdown-aware chunking that preserves heading hierarchy. Chunks can be sized by character count or by token count using any HuggingFace tokenizer.

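As an illustration of the recursive strategy, the sketch below tries paragraph breaks first, then sentence breaks, then spaces, and only hard-cuts when nothing else fits. It is a minimal stand-in sized by character count; `split_recursive` and its separator list are hypothetical and do not reflect Kreuzberg's actual chunker or its options.

```python
def split_recursive(text: str, max_chars: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split text into chunks of at most max_chars, coarse separators first."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the same chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(split_recursive(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks
```

Note that this simple version drops the separator at chunk boundaries (for example, a sentence-final period followed by a space), which a production chunker would normally preserve.
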

**Embeddings** -- Generate vector embeddings locally using FastEmbed. Choose from preset models (`"fast"`, `"balanced"`, `"quality"`) or any FastEmbed-compatible model. Embeddings are generated in-process with no external API calls.

**Page Tracking** -- Extract per-page content with byte-accurate offsets for O(1) page lookups. Chunks are automatically mapped to their source pages, enabling precise citations in retrieval systems. Supported for PDF (byte-accurate), PPTX (slide boundaries), and DOCX (best-effort page breaks). See [Extraction Basics](guides/extraction.md) for usage.

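One way to picture the offset table is the sketch below. It assumes pages are concatenated into a single UTF-8 string; the function names are hypothetical and the layout of Kreuzberg's real page structures may differ. Looking up a page by number is a direct index (O(1)), while mapping an arbitrary byte offset such as a chunk start back to its page uses a binary search.

```python
from bisect import bisect_right


def build_page_index(pages: list[str]) -> list[int]:
    """Return the byte offset at which each page starts in the joined text."""
    offsets, position = [], 0
    for page in pages:
        offsets.append(position)
        position += len(page.encode("utf-8"))  # count bytes, not characters
    return offsets


def page_span(offsets: list[int], total_bytes: int, page_number: int) -> tuple[int, int]:
    """O(1) lookup of the [start, end) byte range for a 1-based page number."""
    start = offsets[page_number - 1]
    end = offsets[page_number] if page_number < len(offsets) else total_bytes
    return start, end


def page_for_offset(offsets: list[int], byte_offset: int) -> int:
    """Map a byte offset (for example a chunk start) to its 1-based page."""
    return bisect_right(offsets, byte_offset)
```

Counting bytes rather than characters matters as soon as the text contains non-ASCII characters, since one character may occupy several bytes.
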

**PDF Hierarchy Detection** -- Detect document structure from PDFs using K-means clustering on block characteristics (font size, weight, indentation, position). Blocks are assigned to semantic levels (title, section, subsection, paragraph) without relying on explicit heading tags. See the [Output Formats Guide](guides/output-formats.md#pdf-hierarchy-detection).

**PDF Page Rendering** -- Render individual PDF pages as PNG images for thumbnails, vision model input, or custom processing pipelines. Memory-efficient iterator renders one page at a time. Configurable DPI (default 150). Available across all language bindings. See [Extraction Guide](guides/extraction.md#pdf-page-rendering).

!!! Info "Added in v4.6.2"

### LLM-Powered Intelligence

!!! Info "Added in v4.8.0"

Kreuzberg integrates with 143 LLM providers including local inference (Ollama, LM Studio, vLLM, llama.cpp) via [liter-llm](https://github.com/kreuzberg-dev/liter-llm) to unlock three new capabilities that complement the local extraction pipeline.

<details>
<summary><strong>VLM OCR</strong> -- Vision language models as an OCR backend</summary>

Use OpenAI GPT-4o, Anthropic Claude, Google Gemini, or any vision-capable model as an OCR engine. VLM OCR delivers superior accuracy on low-quality scans, handwriting, Arabic/Farsi scripts, and complex layouts where traditional OCR struggles. Configure via `ocr.backend = "vlm"` with `ocr.vlm_config` in your extraction config or `kreuzberg.toml`.

</details>

<details>
<summary><strong>Structured Extraction</strong> -- Extract typed JSON from documents using a schema</summary>

Provide a JSON schema and an optional Jinja2 prompt template; the LLM returns conforming structured data. Supports strict mode (OpenAI) with automatic `additionalProperties` sanitization for cross-provider compatibility. Available through the `kreuzberg extract-structured` CLI command, `POST /extract-structured` API endpoint, and `extract_structured` MCP tool.

```json
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "total": { "type": "number" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "amount": { "type": "number" }
        }
      }
    }
  }
}
```

</details>

<details>
<summary><strong>VLM Embeddings</strong> -- Provider-hosted embedding models</summary>

Use provider-hosted embedding models (for example, `openai/text-embedding-3-small`, `mistral/mistral-embed`) as an alternative to local ONNX models. Works through the existing `/embed` API endpoint, `embed_text` MCP tool, and `embed` CLI command with `--provider llm`.

</details>

<details>
<summary><strong>Custom Jinja2 Prompts</strong> -- Minijinja template engine for LLM prompts</summary>

Customize the prompts sent to LLMs with Minijinja templates. Available variables for structured extraction: `{{ content }}`, `{{ schema }}`, `{{ schema_name }}`, `{{ schema_description }}`. For VLM OCR prompts: `{{ language }}`. Override the default prompt per-request or in configuration.

</details>

`LlmConfig` and `StructuredExtractionConfig` types are exposed in Python, Node.js, and PHP bindings. Five new environment variables (`KREUZBERG_LLM_MODEL`, `KREUZBERG_LLM_API_KEY`, `KREUZBERG_LLM_BASE_URL`, `KREUZBERG_VLM_OCR_MODEL`, `KREUZBERG_VLM_EMBEDDING_MODEL`) provide zero-code configuration.

### For Search and Indexing

**Keyword Extraction** -- Extract key phrases using YAKE (unsupervised, language-independent) or RAKE (fast statistical method). Configurable n-gram ranges and language-specific stopword filtering. See the [Keyword Extraction Guide](guides/keywords.md).

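For readers unfamiliar with RAKE, the sketch below shows the textbook scoring idea: candidate phrases are the word runs between stopwords and punctuation, each word scores degree divided by frequency, and a phrase scores the sum of its words. The tiny stopword set and the `rake` function are illustrative assumptions, not Kreuzberg's implementation or its configuration surface.

```python
import re
from collections import defaultdict

# Toy stopword list for the example; real lists are language-specific.
STOPWORDS = {"a", "an", "and", "are", "for", "from", "in", "is", "of", "the", "to", "with"}


def rake(text: str, stopwords=STOPWORDS) -> list[tuple[str, float]]:
    """Return candidate phrases with RAKE scores, best first."""
    phrases: list[list[str]] = []
    # Punctuation always ends a candidate phrase.
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current: list[str] = []
        for word in re.findall(r"[a-z0-9'-]+", fragment):
            if word in stopwords:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    frequency: dict[str, int] = defaultdict(int)
    degree: dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including itself

    scores = {
        " ".join(phrase): sum(degree[w] / frequency[w] for w in phrase)
        for phrase in phrases
    }
    # Sort by descending score, then alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Longer phrases of rare words score highest, which is why RAKE tends to surface multi-word technical terms.
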

**Language Detection** -- Identify 60+ languages with confidence scoring using fast-langdetect. Supports multi-language detection for documents with mixed content.

**Metadata Extraction** -- Pull document properties (title, author, creation date), page/word/character counts, and format-specific metadata (Excel sheet names, PDF annotations).

### For Code

**Code Intelligence** -- Extract functions, classes, imports, exports, symbols, docstrings, and diagnostics from 300+ programming languages via tree-sitter. Results are available in `ExtractionResult.code_intelligence` as a `ProcessResult`. Code files produce semantic chunks (function/class-aware) that bypass the text-splitter entirely. Configure content mode with `CodeContentMode`: `chunks` (default, semantic TSLP chunks), `raw` (source as-is), or `structure` (headings + docstrings only).

### For Data Quality

**Quality Processing** -- Unicode normalization (NFC/NFD/NFKC/NFKD), whitespace and line break standardization, encoding detection, and mojibake correction.

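The following standard-library sketch shows what these steps amount to on small inputs. The helper names are hypothetical, and the mojibake fix handles only the most common case (UTF-8 text that was wrongly decoded as Latin-1); Kreuzberg's own processing is more thorough.

```python
import unicodedata


def normalize_text(text: str, form: str = "NFC") -> str:
    """Apply Unicode normalization, then standardize line breaks and spacing."""
    text = unicodedata.normalize(form, text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of whitespace inside each line and trim line ends.
    lines = [" ".join(line.split()) for line in text.split("\n")]
    return "\n".join(lines)


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 decoded as Latin-1, if that round trip succeeds."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake: leave the text untouched.
        return text
```

NFC composes a base letter and combining accent into one code point, while NFKC additionally folds compatibility characters such as ligatures into their plain equivalents.
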

**Token Reduction** -- Reduce token count while preserving meaning through TF-IDF-based extractive summarization. Three modes: light (~15% reduction), moderate (~30%), and aggressive (~50%).

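A rough sketch of TF-IDF-based extractive reduction follows. It treats each sentence as a document, scores sentences by the summed TF-IDF of their words, and keeps the top fraction in original order. The `reduce_tokens` function and its `keep_ratio` parameter are assumptions for illustration; the ratio is by sentence count, which only approximates the token-level percentages quoted above.

```python
import math
import re
from collections import Counter


def reduce_tokens(text: str, keep_ratio: float) -> str:
    """Keep the highest-scoring sentences (by TF-IDF), in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    total = len(sentences)
    # Document frequency: number of sentences in which each word appears.
    doc_freq = Counter(word for words in tokenized for word in set(words))

    def score(words: list[str]) -> float:
        if not words:
            return 0.0
        counts = Counter(words)
        return sum(
            (n / len(words)) * math.log(total / doc_freq[w])
            for w, n in counts.items()
        )

    keep = max(1, round(total * keep_ratio))
    ranked = sorted(range(total), key=lambda i: score(tokenized[i]), reverse=True)
    # Restore document order so the reduced text still reads naturally.
    return " ".join(sentences[i] for i in sorted(ranked[:keep]))
```

Sentences made of words that appear everywhere score near zero and are dropped first, so repeated boilerplate is what disappears under aggressive settings.
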

**Table Extraction** -- Structured table data from PDFs, spreadsheets, and Word documents with cell-level row/column indexing, merged cell support, and Markdown or JSON output.

---

## Layout Detection

!!! Info "Added in v4.5.0"

Detect and classify document regions using ONNX-based deep learning. Layout detection identifies 17 element types (text, tables, figures, headers, code, forms, captions, and more), enabling accurate region-aware extraction and structured table recovery.

**RT-DETR v2** -- The layout detection model that identifies document structure with high precision. Automatically selects and configures separate table structure models (TATR, SLANeXT variants, or SLANet-plus) for cell-level analysis within detected table regions.

**Table Structure Recognition** -- When layout detection identifies a table, a configurable table structure model analyzes rows, columns, headers, and spanning cells for HTML recovery with colspan/rowspan support. Choose from:

- **TATR** (30 MB) -- General-purpose, fast, default
- **SLANeXT Wired/Wireless/Auto** (365–737 MB) -- Optimized for bordered/borderless tables with auto-detection
- **SLANet-plus** (7.78 MB) -- Lightweight, resource-constrained environments

GPU acceleration via ONNX Runtime (CUDA, CoreML, TensorRT) significantly reduces inference time. Models are automatically downloaded and cached on first use.

**Availability:** All language bindings **except WebAssembly** -- WASM does not support layout detection because ONNX Runtime is unavailable in browser environments.

For configuration and usage, see the [Layout Detection Guide](guides/layout-detection.md).

---

## Plugin System

The extraction pipeline is extensible through four plugin types, each hooking into a different stage:

```mermaid
flowchart LR
    A[File Input] --> B[Document Extractor Plugin]
    B --> C[OCR Backend Plugin]
    C --> D[Validator Plugin]
    D --> E[Post-Processor Plugin]
    E --> F[Output]
```

| Plugin Type | Purpose | Example |
| ----------------------- | -------------------------------------------------------- | ------------------------------ |
| **Document Extractors** | Add support for custom file formats or override defaults | Proprietary format parser |
| **OCR Backends** | Integrate cloud OCR services or custom engines | AWS Textract, Google Vision |
| **Validators** | Enforce quality standards on extraction results | Minimum word count check |
| **Post-Processors** | Transform or enrich results after extraction | PII redaction, custom metadata |

Plugins are registered with a priority value that determines execution order. Discovery works through Python entry points, configuration files, or environment variables.

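To show how a priority value can drive execution order, here is a small self-contained sketch of a post-processor registry. Everything in it is hypothetical (`Registration`, `PostProcessorRegistry`, the dict-based result), and it assumes that a higher priority runs earlier, which the real plugin API may or may not do; see the plugin guides for the actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(order=True)
class Registration:
    sort_key: int
    name: str = field(compare=False)
    run: Callable[[dict], dict] = field(compare=False)


class PostProcessorRegistry:
    def __init__(self) -> None:
        self._items: list[Registration] = []

    def register(self, name: str, run: Callable[[dict], dict], priority: int = 0) -> None:
        # Assumption: higher priority runs first, so sort on the negated value.
        self._items.append(Registration(-priority, name, run))
        self._items.sort()  # stable: equal priorities keep registration order

    def names(self) -> list[str]:
        return [item.name for item in self._items]

    def apply(self, result: dict) -> dict:
        """Pass the result through every processor in priority order."""
        for item in self._items:
            result = item.run(result)
        return result
```

With this ordering, a redaction step registered at priority 50 would run before a word-count step registered at priority 10, so the count reflects the redacted text.
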

For the architecture overview, see [Plugin System](concepts/plugin-system.md). For implementation guidance, see [Creating Plugins](guides/plugins.md).

---

## Deployment Modes

| Mode | When to Use | Details |
| -------------- | ------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| **Library** | Embedding extraction into your application | Import the package in Python, TypeScript, Rust, Go, Ruby, C#, Java, PHP, Elixir, R, or C |
| **CLI** | One-off extractions, scripting, CI pipelines | `kreuzberg extract document.pdf --format json` -- see [CLI Usage](cli/usage.md) |
| **REST API** | Multi-service architectures, language-agnostic access | `kreuzberg serve --port 8000` -- see [API Server Guide](guides/api-server.md) |
| **MCP Server** | AI agent integration (Claude Desktop, Continue.dev) | `kreuzberg mcp` -- stdio transport with JSON-RPC 2.0 |
| **Docker** | Reproducible deployments with all dependencies bundled | `ghcr.io/kreuzberg-dev/kreuzberg:latest` -- see [Docker Guide](guides/docker.md) |

---

## Language Bindings

12 native bindings share the Rust core and produce identical results.

### Binding Tiers

**Full feature parity with async API** -- Python (PyO3), TypeScript/Node.js (NAPI-RS), Rust

**Full features, synchronous API** -- Go, Ruby, C#, Java

**Subset or constrained environments** -- PHP, Elixir, R, C (FFI)

**TypeScript: Two flavors**

- **Native** (`@kreuzberg/node`) -- Full speed, complete feature parity (servers, plugins, config file discovery)
- **WASM** (`@kreuzberg/wasm`) -- Browser/edge runtime, 60–80% of native speed, no native dependencies required. Supported: all formats, including email (.eml/.msg), PDF, office formats (DOCX/XLSX/PPTX/ODT/RTF/EPUB/iWork/HWP), and archives, plus Tesseract OCR (via the kreuzberg-tesseract WASI build), chunking, keywords, language detection, stopwords, tree-sitter, and liter-llm. Excluded: ONNX Runtime-dependent features (paddle-ocr, layout detection, embeddings, auto-rotate), server modes (api/mcp), the CLI binary, and filesystem-dependent paths.

Choose Native for server-side Node.js; choose WASM for browser or edge deployments.

### Rust Feature Flags

Rust builds are modular through Cargo features. Nothing is enabled by default:

| Category | Features |
| --------------------- | ------------------------------------------------------------------------------------------------------- |
| **Format extractors** | `pdf`, `excel`, `office`, `email`, `html`, `xml`, `archives`, `markdown`, `djot`, `mdx` |
| **Processing** | `ocr`, `paddle-ocr`, `language-detection`, `chunking`, `embeddings`, `quality`, `keywords`, `stopwords` |
| **Servers** | `api`, `mcp` |
| **Bundles** | `full` (all extractors + processing), `server`, `cli` |

### Package Installation

=== "Python"

    ```bash
    pip install kreuzberg            # Core + Tesseract + PaddleOCR
    pip install kreuzberg[easyocr]   # + EasyOCR
    pip install kreuzberg[all]       # Everything
    ```

=== "TypeScript"

    ```bash
    npm install @kreuzberg/node   # Native (Node.js/Bun)
    npm install @kreuzberg/wasm   # WASM (browser/edge)
    ```

=== "Rust"

    ```toml
    [dependencies]
    kreuzberg = { version = "4.0", features = ["pdf", "ocr", "chunking"] }
    ```

=== "Other"

    ```bash
    gem install kreuzberg                                       # Ruby
    go get github.com/kreuzberg-dev/kreuzberg/packages/go/v5    # Go
    dotnet add package kreuzberg.dev                            # C#
    ```

For API details per language, see the [API Reference](reference/api-python.md).

---

## Configuration

Four configuration methods, checked in this order:

1. **Programmatic** -- Construct `ExtractionConfig` objects in code (all bindings)
2. **TOML** -- `kreuzberg.toml`
3. **YAML** -- `kreuzberg.yaml`
4. **JSON** -- `kreuzberg.json`

Config files are auto-discovered from the current directory, `~/.config/kreuzberg/`, and `/etc/kreuzberg/`. Environment variables (`KREUZBERG_CONFIG_PATH`, `KREUZBERG_CACHE_DIR`, `KREUZBERG_OCR_BACKEND`, `KREUZBERG_OCR_LANGUAGE`) override file-based settings.

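The discovery rules can be pictured with the sketch below. It is a guess at the behaviour for illustration, not the actual lookup code: `discover_config` is a hypothetical helper, and it assumes that `KREUZBERG_CONFIG_PATH` short-circuits the search, that directories are tried in the order listed, and that within a directory TOML is preferred over YAML over JSON.

```python
from pathlib import Path
from typing import Mapping, Optional

# File names in the order the formats are listed above.
CONFIG_NAMES = ("kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json")


def discover_config(cwd: Path, home: Path, environ: Mapping[str, str]) -> Optional[Path]:
    """Return the first config file found, or None if there is none."""
    explicit = environ.get("KREUZBERG_CONFIG_PATH")
    if explicit:
        # Assumption: an explicit path wins over auto-discovery.
        return Path(explicit)

    search_dirs = [cwd, home / ".config" / "kreuzberg", Path("/etc/kreuzberg")]
    for directory in search_dirs:
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
    return None
```

In application code this would typically be called as `discover_config(Path.cwd(), Path.home(), os.environ)`; passing the locations in explicitly keeps the function easy to test.
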

For the full configuration schema and examples, see the [Configuration Guide](guides/configuration.md).

---

## AI Coding Assistants

!!! Info "Added in v4.2.15"

Kreuzberg ships with an [Agent Skill](https://agentskills.io) that teaches AI coding assistants the complete API across Python, TypeScript, Rust, and CLI. Install it with:

```bash
npx skills add kreuzberg-dev/kreuzberg
```

Compatible with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard. See the [AI Coding Assistants Guide](guides/agent-skills.md).

---

## Next Steps

- [Installation](getting-started/installation.md) -- Install Kreuzberg for your language
- [Quick Start](getting-started/quickstart.md) -- Extract your first document in 5 minutes
- [Architecture](concepts/architecture.md) -- Understand the Rust core and binding layers
- [Development Workflow](guides/development.md#performance) -- Performance benchmarks and optimization guidance