Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00


# Kreuzberg vs Unstructured
Both Kreuzberg and Unstructured are open-source tools for extracting text, tables, and metadata from documents. They solve similar problems but make very different architectural choices. This doc breaks down where each one shines so you can pick the right tool for your project.
## At a Glance
| | Kreuzberg | Unstructured |
| ---------------- | --------------------------------------------------------------- | ------------------------------------------------------------ |
| **Written in** | Rust | Python |
| **File formats** | 90+ | ~30 |
| **Use from**     | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, Wasm, and others | Python, or any language via REST API                         |
| **Run it as** | Library, CLI, self-hosted API, browser (Wasm) | Python library, managed cloud API, self-hosted API |
| **Pricing** | Free, Apache 2.0 | Free (open-source) + paid managed API |
| **Sweet spot** | High-throughput pipelines, polyglot stacks, on-prem | Managed service, ML-heavy layout analysis, quick prototyping |
---
## How They Differ
### Architecture and Performance
The core difference is what's under the hood.
- **Kreuzberg** -- Rust library with native bindings for each language. Your Python, TypeScript, or Go code calls directly into compiled Rust with no subprocess spawning or HTTP overhead.
- **Unstructured** -- Python library with an optional managed cloud API. Well-optimized for Python workflows, but other languages go through HTTP.

Bottom line: if you're processing thousands of documents in a pipeline, Kreuzberg's compiled Rust core and in-process bindings give it a throughput advantage. If you're in a Python-only stack, Unstructured is a natural fit.
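
One way to check the throughput claim against your own corpus is to time each tool over the same batch of files. The sketch below is a minimal harness using only the standard library. `measure_throughput` and `fake_extract` are hypothetical names introduced here for illustration; replace the stand-in with a real call into whichever library you are evaluating.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def measure_throughput(
    extract: Callable[[str], str],
    paths: Iterable[str],
    workers: int = 4,
) -> float:
    """Return documents processed per second for one extractor callable."""
    batch = list(paths)
    if not batch:
        return 0.0
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces every task to finish before the clock stops.
        results = list(pool.map(extract, batch))
    elapsed = time.perf_counter() - start
    assert len(results) == len(batch)
    return len(batch) / elapsed if elapsed > 0 else float("inf")


def fake_extract(path: str) -> str:
    """Stand-in extractor (hypothetical); swap in a real library call."""
    time.sleep(0.001)  # simulate a little work per document
    return f"text from {path}"


if __name__ == "__main__":
    docs = [f"doc_{i}.pdf" for i in range(200)]
    print(f"{measure_throughput(fake_extract, docs):.0f} docs/sec")
```

Threads only help when the extractor releases the GIL or waits on I/O. For an extractor that does its work in pure Python, a process pool is the fairer comparison.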
### Format Coverage
How much of your document zoo each tool can handle.
- **Kreuzberg (90+ formats)** -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus niche formats like LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, and OrgMode.
- **Unstructured (~30 formats)** -- PDFs, Office files, HTML, images, email, and the most common document types. Covers the essentials well.

If your pipeline only deals with PDFs and Word docs, both work. If you also need to ingest niche formats such as Jupyter notebooks or OrgMode files, Kreuzberg has you covered.
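
Before comparing format lists, it helps to know which formats your corpus actually contains. The following sketch counts files by extension and reports those missing from a supported set. `format_inventory` and `uncovered` are hypothetical helpers, and the `supported` set in the demo is a placeholder to fill in from each tool's documentation, not a statement of what either tool supports.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Set


def format_inventory(root: str) -> Counter:
    """Count files under root by lowercase extension ('' for no extension)."""
    counts: Counter = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


def uncovered(counts: Counter, supported: Set[str]) -> Dict[str, int]:
    """Return the extensions in counts that are missing from supported."""
    return {ext: n for ext, n in counts.items() if ext not in supported}


if __name__ == "__main__":
    counts = format_inventory(".")
    # Placeholder set: replace with the extensions listed in the tool's docs.
    supported = {".pdf", ".docx"}
    for ext, n in sorted(uncovered(counts, supported).items()):
        print(f"{ext or '(no extension)'}: {n} file(s) not covered")
```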
### Language Support and Deployment
How you integrate each tool into your stack.
- **Kreuzberg** -- Native bindings for **16 languages** (Python, TypeScript, Rust, Go, Java, Kotlin, C#, Ruby, PHP, Elixir, R, Dart, Swift, Zig, C, Wasm). Each binding calls directly into the Rust core -- same performance, same features. Also runs in the browser via Wasm.
- **Unstructured** -- Python-first. Other languages go through a **REST API**, either self-hosted or via their managed cloud. The managed API is a genuine advantage if you don't want to run infrastructure.
### OCR and Layout Analysis
Both tools handle OCR and document layout, but with different approaches.
- **OCR** -- Both integrate Tesseract. Kreuzberg adds a native **PaddleOCR** backend (ONNX-based, no Python needed) and a **multi-backend pipeline** that automatically falls back between engines based on output quality.
- **Layout detection** -- Kreuzberg ships an ONNX-based RT-DETR v2 model covering 17 element classes plus SLANet table structure recognition. Unstructured offers mature ML-based layout detection with strong complex table support.
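
The idea of falling back between OCR engines based on output quality can be sketched generically. This is not Kreuzberg's implementation: the `OcrBackend` type, the `quality_score` heuristic, and the `min_quality` threshold are all assumptions made for illustration, since the text above does not say how quality is measured.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical backend type: takes image bytes, returns recognized text.
OcrBackend = Callable[[bytes], str]


def quality_score(text: str) -> float:
    """Crude proxy (an assumption): share of alphanumeric or whitespace chars."""
    if not text:
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def ocr_with_fallback(
    image: bytes,
    backends: Sequence[Tuple[str, OcrBackend]],
    min_quality: float = 0.8,
) -> Tuple[str, str, float]:
    """Try backends in order and return (name, text, score).

    Stops at the first result scoring at least min_quality; otherwise
    returns the best-scoring result among the backends that ran.
    """
    best: Optional[Tuple[str, str, float]] = None
    for name, backend in backends:
        try:
            text = backend(image)
        except Exception:
            continue  # a failing engine just means "try the next one"
        score = quality_score(text)
        if best is None or score > best[2]:
            best = (name, text, score)
        if score >= min_quality:
            break
    if best is None:
        raise RuntimeError("all OCR backends failed")
    return best
```

Trying engines in order keeps the cheap or preferred engine first and only pays for the others when its output looks poor.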
### Embeddings and Chunking
How each tool prepares extracted text for RAG pipelines.
- **Kreuzberg** -- Generates embeddings **locally** with ONNX models (no API keys needed). Supports recursive, semantic, and markdown-aware chunking with optional token-based sizing via HuggingFace tokenizers.
- **Unstructured** -- Uses **external APIs** (OpenAI, Cohere, etc.) for embeddings. Offers its own chunking strategies including a `by_title` chunker that respects document structure. Integrates cleanly if you're already paying for an embedding API.
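
To make "recursive chunking" concrete, here is a minimal sketch of the general technique: split on the coarsest separator first, and only recurse to finer separators for pieces that are still too large. It is not the code of either library. `recursive_chunks` and `SEPARATORS` are hypothetical names, and sizes are measured in characters for simplicity.

```python
from typing import List, Sequence

# Coarse-to-fine separators: paragraphs, lines, sentences, words.
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_chunks(
    text: str,
    max_chars: int,
    separators: Sequence[str] = SEPARATORS,
) -> List[str]:
    """Split text into chunks of at most max_chars characters."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No separator left: hard-cut at the size limit.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too big: retry with the next finer separator.
            chunks.extend(recursive_chunks(piece, max_chars, finer))
    if current:
        chunks.append(current)
    # Drop whitespace-only leftovers and trim chunk edges.
    return [c.strip() for c in chunks if c.strip()]


if __name__ == "__main__":
    for chunk in recursive_chunks("alpha beta gamma delta", max_chars=11):
        print(repr(chunk))
```

Token-based sizing follows the same shape, with `len()` replaced by a tokenizer's token count.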
### Privacy and Cost
Where your documents go and what you pay.
- **Kreuzberg** -- Fully self-hosted. Documents never leave your infrastructure. No API fees -- you pay only for compute.
- **Unstructured** -- Self-host for free, or use their **managed API** (free tier: 1,000 pages/month, paid plans beyond). The managed option trades cost for convenience -- no servers to maintain, no OCR dependencies to install.
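
The cost trade-off can be worked through as simple arithmetic. The sketch below assumes per-page billing above the free tier, which the text above does not confirm, and takes the price as a parameter because no paid rate is stated here. `billable_pages`, `monthly_api_cost`, and `break_even_pages` are hypothetical helpers.

```python
FREE_TIER_PAGES = 1_000  # free-tier allowance quoted above


def billable_pages(pages_per_month: int, free_pages: int = FREE_TIER_PAGES) -> int:
    """Pages that exceed the free allowance in one month."""
    return max(0, pages_per_month - free_pages)


def monthly_api_cost(pages_per_month: int, price_per_page: float) -> float:
    """Managed-API cost, assuming per-page billing (an assumption).

    price_per_page is a placeholder: look up the current rate yourself.
    """
    return billable_pages(pages_per_month) * price_per_page


def break_even_pages(compute_cost_per_month: float, price_per_page: float) -> float:
    """Monthly volume above which self-hosting at a fixed compute cost is cheaper."""
    if price_per_page <= 0:
        raise ValueError("price_per_page must be positive")
    return FREE_TIER_PAGES + compute_cost_per_month / price_per_page
```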
---
## When to Use Kreuzberg
- You're processing **high volumes** of documents and need throughput
- Your stack isn't Python-only -- you need native support in **Go, TypeScript, Ruby, Java**, or other languages
- You need to keep documents **on-prem** for privacy, compliance, or air-gapped environments
- You want **local embeddings** without external API dependencies
- You need to handle **uncommon formats** like LaTeX, Typst, Jupyter notebooks, or archives
## When to Use Unstructured
- You want a **managed cloud service** so you don't run any infrastructure
- You're in a **Python-only** environment and want the simplest setup
- You need **mature ML models** for complex table extraction and layout analysis
- You're prototyping and want to **get started quickly** with their hosted API
- You're already using **OpenAI or Cohere** for embeddings and want a unified pipeline
---
!!! Tip "Switching over?"
    If you're currently using Unstructured and want to try Kreuzberg, check out the [Migration Guide](../migration/from-unstructured.md) for a step-by-step walkthrough.