docs/comparisons/index.md
@@ -0,0 +1,44 @@
# Comparisons

Kreuzberg sits in a crowded space of document extraction tools. Some are general-purpose libraries that handle dozens of formats; others are laser-focused on PDFs. This page maps the landscape so you can find the right tool for your project.

For performance and quality numbers across all of these tools, see the [live benchmarks](https://kreuzberg.dev/benchmarks).

---

## Full-Scope Extraction Libraries

These handle multiple document formats -- not just PDFs.

| Library                                               | Language | Formats        | License      | Focus                                                            | Deep Dive                                 |
| ----------------------------------------------------- | -------- | -------------- | ------------ | ---------------------------------------------------------------- | ----------------------------------------- |
| **Kreuzberg**                                         | Rust     | 90+            | Elastic-2.0  | High-throughput extraction with native bindings for 17 languages | --                                        |
| [Unstructured](https://unstructured.io)               | Python   | ~31            | Apache-2.0   | Element-based output, managed cloud API                          | [Read more](kreuzberg-vs-unstructured.md) |
| [Docling](https://github.com/docling-project/docling) | Python   | ~38            | MIT          | IBM-backed, ML-powered layout analysis                           | [Read more](kreuzberg-vs-docling.md)      |
| [Apache Tika](https://tika.apache.org)                | Java     | 1500+ detected | Apache-2.0   | Enterprise standard, broadest format detection                   | [Read more](kreuzberg-vs-tika.md)         |
| [MarkItDown](https://github.com/microsoft/markitdown) | Python   | ~25            | MIT          | Microsoft-backed, outputs Markdown for LLM prep                  | [Read more](kreuzberg-vs-markitdown.md)   |
| [MinerU](https://github.com/opendatalab/MinerU)       | Python   | PDF + images   | **AGPL-3.0** | Heavy ML models for scientific document layout                   | [Read more](kreuzberg-vs-mineru.md)       |
| [Pandoc](https://pandoc.org)                          | Haskell  | 45+ input      | **GPL-2.0**  | Universal document converter (cannot read PDFs)                  | --                                        |

## PDF-Specific Libraries

These focus on PDF extraction only. They're not direct competitors to Kreuzberg's full format coverage, but you'll often see them in PDF-heavy pipelines.

| Library                                                  | Language           | License      | Focus                                                              |
| -------------------------------------------------------- | ------------------ | ------------ | ------------------------------------------------------------------ |
| [PyMuPDF / PyMuPDF4LLM](https://pymupdf.readthedocs.io)  | Python (C core)    | **AGPL-3.0** | Fast PDF extraction via MuPDF. Dual-licensed: AGPL or commercial.  |
| [pdfplumber](https://github.com/jsvine/pdfplumber)       | Python             | MIT          | Good table extraction, built on pdfminer.six                       |
| [pdfminer.six](https://github.com/pdfminer/pdfminer.six) | Python             | MIT          | Fine-grained text positioning, pure Python                         |
| [pypdf](https://github.com/py-pdf/pypdf)                 | Python             | BSD-3        | Lightweight, pure Python, no C dependencies                        |
| [playa-pdf](https://github.com/dhdaines/playa)           | Python             | MIT          | Modern pure-Python PDF library                                     |
| [pdftotext](https://poppler.freedesktop.org)             | C (Python binding) | **GPL-2.0**  | Thin wrapper around poppler's pdftotext                            |

---

!!! Warning "License matters"

    Libraries marked **AGPL-3.0** (PyMuPDF, MinerU) are copyleft: if you distribute an application built on them, or offer it to users over a network, you must release its source under AGPL unless you purchase a commercial license. **GPL-2.0** tools (Pandoc, pdftotext/poppler) carry similar copyleft requirements when you distribute software that incorporates them. If you're building a commercial product, check the license before integrating.

!!! Info "Benchmarks"

    Kreuzberg benchmarks against all of the libraries listed above. For extraction speed, quality scores, and format-by-format comparisons, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).

docs/comparisons/kreuzberg-vs-docling.md
@@ -0,0 +1,81 @@
# Kreuzberg vs Docling

Kreuzberg and Docling are both open-source document extraction libraries, but they come at the problem from different angles. Kreuzberg is a Rust library focused on speed and broad format coverage across many languages. Docling is an IBM-backed Python library that leans heavily on deep learning models for document understanding. Here's how they compare.

## At a Glance

|                  | Kreuzberg                                                         | Docling                                              |
| ---------------- | ----------------------------------------------------------------- | ---------------------------------------------------- |
| **Written in**   | Rust                                                              | Python                                               |
| **File formats** | 90+                                                               | ~38 extensions (15+ types)                           |
| **Use from**     | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, Wasm   | Python                                               |
| **License**      | Apache-2.0                                                        | MIT                                                  |
| **OCR**          | Tesseract + PaddleOCR (local, multi-backend fallback)             | Tesseract + EasyOCR                                  |
| **Sweet spot**   | High-throughput pipelines, polyglot stacks, broad format coverage | ML-powered document understanding, scientific papers |

---

## How They Differ

### Architecture

Different foundations lead to different trade-offs.

- **Kreuzberg** -- Rust core with native bindings for each language. Your Python or TypeScript code calls directly into compiled Rust. No subprocess overhead, no model loading delays for basic extraction.
- **Docling** -- Python library built around deep learning models (a layout model trained on DocLayNet, TableFormer for tables). It produces a rich `DoclingDocument` object with full structural understanding, but it needs to load ML models on startup.

If you need raw extraction speed without ML overhead, Kreuzberg is faster out of the box. If you need deep structural understanding of complex layouts, Docling's ML pipeline is purpose-built for that.

### Format Coverage

What each tool can ingest.

- **Kreuzberg (90+ formats)** -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, OrgMode, and more.
- **Docling (~38 extensions)** -- PDFs, DOCX, PPTX, XLSX, HTML, Markdown, AsciiDoc, CSV, images, and JATS (scientific article XML). Focused on the formats that benefit most from layout analysis.

Docling covers the core document types well. Kreuzberg handles the long tail -- archives, email files, structured data, code, and niche markup formats.

### Output Model

How extracted content is structured.

- **Kreuzberg** -- Outputs unified text (default), element-based structures, or per-page JSON. You choose the level of detail you need. Markdown output is built in via HTML-to-Markdown conversion.
- **Docling** -- Outputs a `DoclingDocument` object with rich structural metadata: reading order, table cells, figure captions, section hierarchy. Can export to Markdown, JSON, or other formats. The structural model is deeper but Python-specific.

### OCR

Both handle image-based documents, with different engine choices.

- **Kreuzberg** -- Tesseract + native PaddleOCR (ONNX-based, no Python dependency). Supports a multi-backend OCR pipeline that auto-falls back between engines based on output quality.
- **Docling** -- Tesseract + EasyOCR. EasyOCR offers good accuracy on CJK and Arabic scripts but requires PyTorch.

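
To make the quality-based fallback idea concrete, here is a minimal Python sketch. It is not Kreuzberg's implementation: `quality_score`, `ocr_with_fallback`, the 0.8 threshold, and both stub backends are hypothetical names and values chosen for illustration. The idea is to try engines in order and keep the first result that clears a quality bar, or the best-scoring one if none does.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class OcrResult:
    text: str
    backend: str
    score: float


def quality_score(text: str) -> float:
    """Crude heuristic (an assumption): share of alphanumeric or whitespace characters."""
    if not text:
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def ocr_with_fallback(
    image: bytes,
    backends: Sequence[tuple[str, Callable[[bytes], str]]],
    threshold: float = 0.8,
) -> OcrResult:
    """Run backends in order; return the first result at or above the threshold,
    otherwise the highest-scoring result seen."""
    best: Optional[OcrResult] = None
    for name, run in backends:
        text = run(image)
        result = OcrResult(text=text, backend=name, score=quality_score(text))
        if result.score >= threshold:
            return result
        if best is None or result.score > best.score:
            best = result
    if best is None:
        raise ValueError("no OCR backends configured")
    return best


# Stub backends standing in for real OCR engines (hypothetical).
def noisy_backend(image: bytes) -> str:
    return "#@!% ~~ ^^"


def clean_backend(image: bytes) -> str:
    return "Invoice 42 total due"


result = ocr_with_fallback(
    b"fake-image-bytes",
    [("noisy", noisy_backend), ("clean", clean_backend)],
)
print(result.backend, round(result.score, 2))
```

A real pipeline would swap the character-ratio heuristic for something sturdier, such as engine confidence values or dictionary hit rates.
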

### Language Support

How you integrate each tool into your stack.

- **Kreuzberg** -- Native bindings for **16 languages** (Python, TypeScript, Rust, Go, Java, Kotlin, C#, Ruby, PHP, Elixir, R, Dart, Swift, Zig, C, Wasm). Same performance and features from every language.
- **Docling** -- **Python only**. If your backend is in Go, Java, or TypeScript, you'd need to wrap Docling in an HTTP service.

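
Wrapping a Python-only library for other languages can be as small as the sketch below. It uses only the standard library; `convert_to_text` is a hypothetical placeholder for the real extraction call, and the port and JSON response shape are assumptions. A Go or TypeScript client would POST the raw file bytes and read back `{"text": ...}`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def convert_to_text(data: bytes) -> str:
    """Hypothetical placeholder: call your Python-only extraction library here."""
    return data.decode("utf-8", errors="replace")


class ExtractHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read the uploaded document from the request body.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        payload = json.dumps({"text": convert_to_text(body)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format: str, *args) -> None:
        # Silence the default per-request logging.
        pass


def make_server(port: int = 8000) -> HTTPServer:
    """Bind the handler to localhost; call .serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), ExtractHandler)


# To run the service: make_server().serve_forever()
```
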

---

## When to Use Kreuzberg

- You're building a pipeline in **Go, TypeScript, Ruby, Java**, or any language beyond Python
- You need to process **high volumes** quickly without ML model loading overhead
- Your pipeline ingests **diverse formats** beyond PDFs and Office docs
- You want **local embeddings and chunking** built into the extraction step
- You need to run in the **browser or on edge runtimes** via Wasm

## When to Use Docling

- You need **deep structural understanding** of complex document layouts (reading order, nested tables, figure captions)
- You're working with **scientific papers or technical documents** where layout analysis matters
- Your stack is **Python-only** and you want a rich document object model
- You need **TableFormer-based table extraction** for complex tables with merged cells and spanning rows
- You value IBM's **ongoing investment** in document AI research

---

!!! Info "Benchmarks"

    For extraction speed and quality comparisons between Kreuzberg and Docling, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).

docs/comparisons/kreuzberg-vs-markitdown.md
@@ -0,0 +1,81 @@
# Kreuzberg vs MarkItDown

MarkItDown is a Microsoft-backed Python library that converts documents to Markdown -- purpose-built for feeding content into LLMs. Kreuzberg is a Rust-based extraction library with broader format support, native language bindings, and built-in RAG pipeline features. Both are permissively licensed and work well for AI-adjacent document processing.

## At a Glance

|                  | Kreuzberg                                                       | MarkItDown                                |
| ---------------- | --------------------------------------------------------------- | ----------------------------------------- |
| **Written in**   | Rust                                                            | Python                                    |
| **File formats** | 90+                                                             | ~25                                       |
| **Use from**     | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, Wasm | Python                                    |
| **Output**       | Unified text, element-based, per-page JSON, Markdown            | Markdown                                  |
| **License**      | Apache-2.0                                                      | MIT                                       |
| **Sweet spot**   | Full extraction pipelines with chunking and embeddings          | Quick Markdown conversion for LLM context |

---

## How They Differ

### Philosophy

Different tools for different stages of a pipeline.

- **Kreuzberg** -- A full extraction library. Extracts text, tables, metadata, and images. Offers multiple output formats, built-in chunking, local embeddings, and OCR. Designed to be the complete document-to-vectors pipeline.
- **MarkItDown** -- A converter. Takes documents in, outputs Markdown. Intentionally lightweight and focused on one job: turning files into clean Markdown that LLMs can consume. Downstream processing is left to you.

If you need a complete pipeline (extract, chunk, embed), Kreuzberg handles the full chain. If you just need Markdown for a prompt, MarkItDown does that with minimal setup.

### Format Coverage

Both cover common formats, with different long-tail reach.

- **Kreuzberg (90+ formats)** -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, OrgMode, and more.
- **MarkItDown (~25 formats)** -- PDFs, DOCX, PPTX, XLSX, HTML, XML, CSV, JSON, EPUB, Jupyter notebooks, MSG email, images, and ZIP archives. Covers the essentials.

MarkItDown handles the formats you'll encounter most often. Kreuzberg also handles the ones you won't expect -- until you do.

### OCR

Different approaches to image-based text extraction.

- **Kreuzberg** -- Tesseract + native PaddleOCR (ONNX-based, runs locally, no Python needed). Multi-backend pipeline with automatic quality-based fallback. All processing happens on your machine.
- **MarkItDown** -- Can use **Azure Document Intelligence** for image and PDF extraction. Powerful when enabled, but requires an Azure account and sends documents to Microsoft's cloud. Without it, image OCR is limited.

### Language Support

A significant difference in how you integrate each tool.

- **Kreuzberg** -- Native bindings for **16 languages**. Same performance and API from Python, TypeScript, Rust, Go, Java, Kotlin, C#, Ruby, PHP, Elixir, R, Dart, Swift, Zig, C, or Wasm in the browser.
- **MarkItDown** -- **Python only**. If your backend is in Go or TypeScript, you'd need to wrap MarkItDown in an HTTP service or call it as a subprocess.

### Downstream Processing

What happens after extraction.

- **Kreuzberg** -- Built-in **chunking** (recursive, semantic, markdown-aware), **local embeddings** (ONNX models, no API keys), token reduction, keyword extraction, and quality processing. Extraction output is ready for RAG pipelines.
- **MarkItDown** -- Outputs Markdown and stops. Chunking, embeddings, and vector storage are your responsibility. This is by design -- it's a converter, not a pipeline.

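
If you do end up chunking Markdown yourself, a recursive splitter is a common starting point. The sketch below is a generic illustration, not the algorithm either library uses; `recursive_chunks`, its separator order, and the size limit are assumptions. It splits on the coarsest separator first (blank lines), then falls back to finer ones only for pieces that are still too long.

```python
def recursive_chunks(
    text: str,
    max_chars: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text into chunks of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-split at the size limit.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # piece still fits in the running chunk
            continue
        if current:
            chunks.append(current)  # flush the running chunk
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece alone is too long: recurse with the finer separators.
            chunks.extend(recursive_chunks(piece, max_chars, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


markdown = "# Title\n\n" + "word " * 30 + "\n\nShort closing paragraph."
for chunk in recursive_chunks(markdown, max_chars=60):
    print(len(chunk), repr(chunk[:30]))
```
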

---

## When to Use Kreuzberg

- You need a **complete pipeline** from document to embeddings
- Your stack includes **Go, TypeScript, Ruby, Java**, or other languages beyond Python
- You want **local OCR** without cloud API dependencies
- You need to handle **niche formats** like LaTeX, Typst, email files, or archives
- You need **multiple output formats** (text, elements, per-page JSON), not just Markdown

## When to Use MarkItDown

- You just need **clean Markdown** to feed into an LLM prompt
- You're in a **Python-only** environment and want the simplest possible setup
- You're already using **Azure Document Intelligence** and want to leverage it for OCR
- Your use case is **document-to-prompt conversion** without further processing
- You value **minimal dependencies** and a small footprint

---

!!! Info "Benchmarks"

    For extraction speed and quality comparisons between Kreuzberg and MarkItDown, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).

docs/comparisons/kreuzberg-vs-mineru.md
@@ -0,0 +1,85 @@
# Kreuzberg vs MinerU

MinerU is an open-source tool from OpenDataLab designed for high-quality PDF extraction, especially for scientific and academic documents. Kreuzberg is a Rust-based general-purpose extraction library covering 90+ formats. They overlap on PDF extraction but differ significantly in scope, licensing, and architecture.

!!! Warning "License"

    MinerU is licensed under **AGPL-3.0**. If you distribute an application that uses MinerU, or offer it to users over a network, you must release it under AGPL or obtain a commercial license. Kreuzberg is **Apache-2.0** -- no copyleft restrictions.

## At a Glance

|                  | Kreuzberg                                                       | MinerU                                                    |
| ---------------- | --------------------------------------------------------------- | --------------------------------------------------------- |
| **Written in**   | Rust                                                            | Python                                                    |
| **File formats** | 90+                                                             | PDF + PNG/JPG only                                        |
| **Use from**     | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, Wasm | Python CLI / library                                      |
| **License**      | Apache-2.0                                                      | **AGPL-3.0**                                              |
| **GPU**          | Optional (ONNX Runtime -- CUDA, CoreML, TensorRT)               | Recommended for best results                              |
| **Sweet spot**   | Broad format extraction, high throughput, polyglot stacks       | Scientific PDFs, academic papers, complex layout analysis |

---

## How They Differ

### Scope

The biggest difference is what each tool is designed to handle.

- **Kreuzberg (90+ formats)** -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data, LaTeX, Typst, Jupyter notebooks, EPUB, and more. A general-purpose extraction library.
- **MinerU (PDF + images)** -- Handles PDFs and PNG/JPG images. Nothing else. It's a specialist tool, not a general-purpose library.

If your pipeline processes only PDFs, both work. If you also need to handle Word docs, email files, or spreadsheets, Kreuzberg is the only option of the two.

### ML Approach

Different levels of ML investment.

- **Kreuzberg** -- Offers **optional** ONNX-based layout detection (YOLO for speed, RT-DETR v2 for accuracy) covering 17 element classes. Also works without any ML models for pure extraction -- useful when speed matters more than layout understanding.
- **MinerU** -- **ML-first**. Uses heavy deep learning models for layout detection, formula recognition, and table structure extraction. This produces excellent results on complex scientific documents but comes with significant model loading time, memory usage, and a GPU recommendation.

If you're extracting from scientific papers with complex multi-column layouts, equations, and nested tables, MinerU's deep ML pipeline is purpose-built for that. For general document extraction where you don't need heavy layout analysis, Kreuzberg is faster and lighter.

### Architecture and Performance

Different runtime characteristics.

- **Kreuzberg** -- Rust core, runs on CPU by default. GPU acceleration available via ONNX Runtime when layout detection is enabled. Fast startup, low memory footprint for basic extraction.
- **MinerU** -- Python, with heavy PyTorch model loading on startup. GPU recommended for acceptable performance. First-run model downloads can be several gigabytes.

### Language Support

How you integrate each tool.

- **Kreuzberg** -- Native bindings for **16 languages**. Same performance from Python, TypeScript, Rust, Go, Java, Kotlin, C#, Ruby, PHP, Elixir, R, Dart, Swift, Zig, C, or Wasm in the browser.
- **MinerU** -- **Python CLI and library** only. No language bindings, no API server out of the box.

### Embeddings and Chunking

Downstream pipeline support.

- **Kreuzberg** -- Built-in chunking (recursive, semantic, markdown-aware), local embeddings via ONNX models, token reduction, and keyword extraction. Ready for RAG pipelines out of the box.
- **MinerU** -- Outputs extracted content (Markdown, JSON). Chunking and embeddings are left to downstream tools.

---

## When to Use Kreuzberg

- You need to process **more than just PDFs** -- Office docs, email, archives, code, structured data
- You're building a **commercial product** and need a permissive license (Apache-2.0)
- You want **native bindings** in Go, TypeScript, Ruby, Java, or other languages
- You need **fast extraction** without heavy ML model loading
- You want **built-in chunking and embeddings** for RAG pipelines

## When to Use MinerU

- You're working exclusively with **scientific papers and academic PDFs**
- You need **deep layout analysis** -- formulas, multi-column layouts, nested tables
- The **AGPL-3.0 license** is compatible with your project (open-source, research, or you'll purchase a commercial license)
- You have **GPU resources** available and can accept the startup cost of loading large models
- You need the highest possible extraction quality on **complex PDF layouts** and throughput is secondary

---

!!! Info "Benchmarks"

    For extraction speed and quality comparisons between Kreuzberg and MinerU, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).

docs/comparisons/kreuzberg-vs-tika.md
@@ -0,0 +1,90 @@
# Kreuzberg vs Apache Tika

Apache Tika is the longest-running open-source document extraction tool in the ecosystem. It's been the default answer for enterprise document processing since 2007. Kreuzberg is a newer Rust-based alternative that takes a different approach to the same problem. Both are Apache-2.0 licensed.

## At a Glance

|                  | Kreuzberg                                                       | Apache Tika                                                          |
| ---------------- | --------------------------------------------------------------- | -------------------------------------------------------------------- |
| **Written in**   | Rust                                                            | Java                                                                 |
| **File formats** | 90+ extracted                                                   | 1500+ detected, hundreds extracted                                   |
| **Use from**     | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, Wasm | Java, or any language via Tika Server (HTTP)                         |
| **Run it as**    | Library, CLI, self-hosted API, browser (Wasm)                   | Java library, Tika Server (HTTP), CLI                                |
| **License**      | Apache-2.0                                                      | Apache-2.0                                                           |
| **Sweet spot**   | High-throughput pipelines, polyglot native bindings, modern CLI | Enterprise document processing, metadata extraction, search indexing |

---

## How They Differ

### Architecture

Fundamentally different runtimes.

- **Kreuzberg** -- Rust library that compiles to a native binary or links directly into your application via language-specific bindings. No separate runtime required. A single `kreuzberg` binary gives you CLI, API server, and MCP server.
- **Tika** -- Java library that runs on the JVM. For non-Java languages, you deploy Tika Server (an HTTP service) and send documents over the network. JVM startup time and memory overhead are part of the deal.

If your stack is already JVM-based, Tika integrates naturally. For everything else, Kreuzberg avoids the overhead of running a separate Java service.

### Format Coverage

Both tools excel here, but in different ways.

- **Tika** -- Detects **1500+ MIME types** and extracts text from hundreds of formats. It's built to handle practically anything you throw at it, including exotic formats like CAD files and scientific data. Nearly two decades of format support.
- **Kreuzberg** -- Extracts from **90+ formats** with a focus on high-quality output. Covers PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data, and niche markup formats like Typst and Djot.

Tika's format detection is unmatched in breadth. Kreuzberg focuses on extraction quality for the formats most document pipelines actually encounter.

### Language Integration

How each tool fits into your codebase.

- **Kreuzberg** -- Native bindings for **17 languages**. Each binding calls directly into the Rust core -- same performance, same features, no network round-trips. Also runs in the browser via Wasm.
- **Tika** -- Native only in **Java**. Every other language goes through Tika Server over HTTP. This adds latency, requires running a separate service, and means your application depends on the JVM being available.

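
For a sense of what the HTTP route involves, here is a Python sketch of a Tika Server client. It assumes a server is already running on Tika's default port 9998 and uses the `PUT /tika` endpoint with an `Accept: text/plain` header; the helper names `build_tika_request` and `extract_text` are made up for illustration.

```python
import urllib.request

# Assumption: Tika Server running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks Tika Server for plain-text output."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(data: bytes, url: str = TIKA_URL, timeout: float = 30.0) -> str:
    """Send document bytes to Tika Server and return the extracted text."""
    request = build_tika_request(data, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (requires a running server):
#   with open("report.pdf", "rb") as f:
#       print(extract_text(f.read()))
```

Every call is a network round-trip plus serialization of the whole document, which is the latency cost described above.
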

### OCR

Different levels of OCR sophistication.

- **Kreuzberg** -- Tesseract + native PaddleOCR (ONNX-based, no Python needed). Multi-backend OCR pipeline with automatic quality-based fallback between engines. Image preprocessing built-in.
- **Tika** -- Delegates to Tesseract via its OCR parser. Functional but no multi-backend fallback or built-in quality scoring.

### Metadata

Both extract metadata, with different philosophies.

- **Kreuzberg** -- Format-specific discriminated unions. PDF metadata includes page count, version, encryption status, and permissions. Each format type has its own metadata shape.
- **Tika** -- Standardized metadata using Dublin Core, XMP, and other established schemas. Extremely rich metadata extraction, especially for media files. This is one of Tika's genuine strengths.

If metadata richness is your primary concern (especially for media, geospatial, or scientific formats), Tika is hard to beat.

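
A discriminated union can be sketched in Python as follows. The PDF fields mirror the ones named above; the class names, the `kind` tag, and the image variant are hypothetical and not Kreuzberg's actual types. Each variant carries a literal tag, and consumers branch on that tag to reach format-specific fields.

```python
from dataclasses import dataclass, field
from typing import Literal, Union


@dataclass
class PdfMetadata:
    page_count: int
    version: str
    encrypted: bool
    permissions: list[str] = field(default_factory=list)
    kind: Literal["pdf"] = "pdf"  # discriminator tag


@dataclass
class ImageMetadata:  # hypothetical second variant, for contrast
    width: int
    height: int
    kind: Literal["image"] = "image"  # discriminator tag


Metadata = Union[PdfMetadata, ImageMetadata]


def describe(meta: Metadata) -> str:
    """Branch on the tag, then use the fields that only that variant has."""
    if meta.kind == "pdf":
        lock = "encrypted" if meta.encrypted else "unencrypted"
        return f"PDF {meta.version}, {meta.page_count} pages, {lock}"
    return f"image {meta.width}x{meta.height}"


print(describe(PdfMetadata(page_count=3, version="1.7", encrypted=False)))
print(describe(ImageMetadata(width=640, height=480)))
```

The trade-off against a standardized schema like Dublin Core is that each format exposes exactly its own fields, at the cost of one code path per variant.
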

### Ecosystem

Where each tool fits in the broader stack.

- **Tika** -- Deep integration with Apache Solr, Elasticsearch, Apache Nutch, and enterprise content management systems. If you're building a search infrastructure on the Apache stack, Tika is the natural choice.
- **Kreuzberg** -- Standalone library with built-in chunking, embeddings, and RAG pipeline support. Designed for modern AI/ML document pipelines rather than traditional search indexing.

---

## When to Use Kreuzberg

- You need **native bindings** in languages beyond Java (Python, TypeScript, Go, Ruby, etc.)
- You want a **single binary** or library -- no JVM, no separate server process
- Your pipeline needs **built-in chunking, embeddings, and RAG support**
- You need to run extraction in the **browser or on edge runtimes** via Wasm
- You want **multi-backend OCR** with automatic quality-based fallback

## When to Use Tika

- Your stack is **JVM-based** and you want native Java integration
- You need to detect or extract from **exotic formats** (CAD, geospatial, media containers)
- You're building on a **search stack** such as Solr, Nutch, or Elasticsearch
- **Metadata extraction** is your primary use case, especially across media and scientific formats
- You need the **longest track record** and widest enterprise adoption

---

!!! Info "Benchmarks"

    For extraction speed and quality comparisons between Kreuzberg and Apache Tika, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).

docs/comparisons/kreuzberg-vs-unstructured.md
@@ -0,0 +1,88 @@
# Kreuzberg vs Unstructured

Both Kreuzberg and Unstructured are open-source tools for extracting text, tables, and metadata from documents. They solve similar problems but make very different architectural choices. This doc breaks down where each one shines so you can pick the right tool for your project.

## At a Glance

|                  | Kreuzberg                                                       | Unstructured                                                 |
| ---------------- | --------------------------------------------------------------- | ------------------------------------------------------------ |
| **Written in**   | Rust                                                            | Python                                                       |
| **File formats** | 90+                                                             | ~30                                                          |
| **Use from**     | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, Wasm | Python, or any language via REST API                         |
| **Run it as**    | Library, CLI, self-hosted API, browser (Wasm)                   | Python library, managed cloud API, self-hosted API           |
| **Pricing**      | Free, Apache-2.0                                                | Free (open-source) + paid managed API                        |
| **Sweet spot**   | High-throughput pipelines, polyglot stacks, on-prem             | Managed service, ML-heavy layout analysis, quick prototyping |

---

## How They Differ

### Architecture and Performance

The core difference is what's under the hood.

- **Kreuzberg** -- Rust library with native bindings for each language. Your Python, TypeScript, or Go code calls directly into compiled Rust with no subprocess spawning or HTTP overhead.
- **Unstructured** -- Python library with an optional managed cloud API. Well-optimized for Python workflows, but other languages go through HTTP.

Bottom line: if you're processing thousands of documents in a pipeline, the Rust core gives Kreuzberg a throughput advantage. If you're in a Python-only stack, Unstructured is a natural fit.

### Format Coverage

How much of your document zoo each tool can handle.

- **Kreuzberg (90+ formats)** -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus niche formats like LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, and OrgMode.
- **Unstructured (~30 formats)** -- PDFs, Office files, HTML, images, email, and the most common document types. Covers the essentials well.

If your pipeline only deals with PDFs and Word docs, both work. If you also need to ingest Jupyter notebooks or OrgMode files, Kreuzberg has you covered.

### Language Support and Deployment

How you integrate each tool into your stack.

- **Kreuzberg** -- Native bindings for **16 languages** (Python, TypeScript, Rust, Go, Java, Kotlin, C#, Ruby, PHP, Elixir, R, Dart, Swift, Zig, C, Wasm). Each binding calls directly into the Rust core -- same performance, same features. Also runs in the browser via Wasm.
- **Unstructured** -- Python-first. Other languages go through a **REST API**, either self-hosted or via their managed cloud. The managed API is a genuine advantage if you don't want to run infrastructure.

### OCR and Layout Analysis

Both tools handle OCR and document layout, but with different approaches.

- **OCR** -- Both integrate Tesseract. Kreuzberg adds a native **PaddleOCR** backend (ONNX-based, no Python needed) and a **multi-backend pipeline** that auto-falls back between engines based on output quality.
- **Layout detection** -- Kreuzberg ships an ONNX-based RT-DETR v2 model covering 17 element classes plus SLANet table structure recognition. Unstructured offers mature ML-based layout detection with strong complex table support.

### Embeddings and Chunking

How each tool prepares extracted text for RAG pipelines.

- **Kreuzberg** -- Generates embeddings **locally** with ONNX models (no API keys needed). Supports recursive, semantic, and markdown-aware chunking with optional token-based sizing via HuggingFace tokenizers.
- **Unstructured** -- Uses **external APIs** (OpenAI, Cohere, etc.) for embeddings. Offers its own chunking strategies including a `by_title` chunker that respects document structure. Integrates cleanly if you're already paying for an embedding API.

### Privacy and Cost

Where your documents go and what you pay.

- **Kreuzberg** -- Fully self-hosted. Documents never leave your infrastructure. No API fees -- you pay only for compute.
- **Unstructured** -- Self-host for free, or use their **managed API** (free tier: 1,000 pages/month, paid plans beyond). The managed option trades cost for convenience -- no servers to maintain, no OCR dependencies to install.

---

## When to Use Kreuzberg

- You're processing **high volumes** of documents and need throughput
- Your stack isn't Python-only -- you need native support in **Go, TypeScript, Ruby, Java**, or other languages
- You need to keep documents **on-prem** for privacy, compliance, or air-gapped environments
- You want **local embeddings** without external API dependencies
- You need to handle **uncommon formats** like LaTeX, Typst, Jupyter notebooks, or archives

## When to Use Unstructured

- You want a **managed cloud service** so you don't run any infrastructure
- You're in a **Python-only** environment and want the simplest setup
- You need **mature ML models** for complex table extraction and layout analysis
- You're prototyping and want to **get started quickly** with their hosted API
- You're already using **OpenAI or Cohere** for embeddings and want a unified pipeline

---

!!! Tip "Switching over?"

    If you're currently using Unstructured and want to try Kreuzberg, check out the [Migration Guide](../migration/from-unstructured.md) for a step-by-step walkthrough.
