Files
fil/docs/comparisons/kreuzberg-vs-docling.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

82 lines
5.2 KiB
Markdown

# Kreuzberg vs Docling
Kreuzberg and Docling are both open-source document extraction libraries, but they come at the problem from different angles. Kreuzberg is a Rust library focused on speed and broad format coverage across many languages. Docling is an IBM-backed Python library that leans heavily on deep learning models for document understanding. Here's how they compare.
## At a Glance
| | Kreuzberg | Docling |
| ---------------- | ----------------------------------------------------------------- | ---------------------------------------------------- |
| **Written in** | Rust | Python |
| **File formats** | 90+ | ~38 extensions (15+ types) |
| **Use from** | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, Wasm | Python |
| **License** | Apache-2.0 | Elastic-2.0 |
| **OCR** | Tesseract + PaddleOCR (local, multi-backend fallback) | Tesseract + EasyOCR |
| **Sweet spot** | High-throughput pipelines, polyglot stacks, broad format coverage | ML-powered document understanding, scientific papers |
---
## How They Differ
### Architecture
Different foundations lead to different trade-offs.
- **Kreuzberg** -- Rust core with native bindings for each language. Your Python or TypeScript code calls directly into compiled Rust. No subprocess overhead, no model loading delays for basic extraction.
- **Docling** -- Python library built around deep learning models (DocLayNet for layout, TableFormer for tables). It produces a rich `DoclingDocument` object with full structural understanding, but it needs to load ML models on startup.
If you need raw extraction speed without ML overhead, Kreuzberg is faster out of the box. If you need deep structural understanding of complex layouts, Docling's ML pipeline is purpose-built for that.
### Format Coverage
What each tool can ingest.
- **Kreuzberg (90+ formats)** -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, OrgMode, and more.
- **Docling (~38 extensions)** -- PDFs, DOCX, PPTX, XLSX, HTML, Markdown, AsciiDoc, CSV, images, and JATS (scientific article XML). Focused on the formats that benefit most from layout analysis.
Docling covers the core document types well. Kreuzberg handles the long tail -- archives, email files, structured data, code, and niche markup formats.
### Output Model
How extracted content is structured.
- **Kreuzberg** -- Outputs unified text (default), element-based structures, or per-page JSON. You choose the level of detail you need. Markdown output is built-in via HTML-to-Markdown conversion.
- **Docling** -- Outputs a `DoclingDocument` object with rich structural metadata: reading order, table cells, figure captions, section hierarchy. Can export to Markdown, JSON, or other formats. The structural model is deeper but Python-specific.
### OCR
Both handle image-based documents, with different engine choices.
- **Kreuzberg** -- Tesseract + native PaddleOCR (ONNX-based, no Python dependency). Supports a multi-backend OCR pipeline that auto-falls back between engines based on output quality.
- **Docling** -- Tesseract + EasyOCR. EasyOCR offers good accuracy on CJK and Arabic scripts but requires PyTorch.
### Language Support
How you integrate each tool into your stack.
- **Kreuzberg** -- Native bindings for **16 languages** (Python, TypeScript, Rust, Go, Java, Kotlin, C#, Ruby, PHP, Elixir, R, Dart, Swift, Zig, C, Wasm). Same performance and features from every language.
- **Docling** -- **Python only**. If your backend is in Go, Java, or TypeScript, you'd need to wrap Docling in an HTTP service.
---
## When to Use Kreuzberg
- You're building a pipeline in **Go, TypeScript, Ruby, Java**, or any language beyond Python
- You need to process **high volumes** quickly without ML model loading overhead
- Your pipeline ingests **diverse formats** beyond PDFs and Office docs
- You want **local embeddings and chunking** built into the extraction step
- You need to run in the **browser or on edge runtimes** via Wasm
## When to Use Docling
- You need **deep structural understanding** of complex document layouts (reading order, nested tables, figure captions)
- You're working with **scientific papers or technical documents** where layout analysis matters
- Your stack is **Python-only** and you want a rich document object model
- You need **TableFormer-based table extraction** for complex tables with merged cells and spanning rows
- You value IBM's **ongoing investment** in document AI research
---
!!! Info "Benchmarks"
For extraction speed and quality comparisons between Kreuzberg and Docling, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).