Nomad changes

Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

docs/comparisons/index.md Normal file

@@ -0,0 +1,44 @@
# Comparisons

Kreuzberg sits in a crowded space of document extraction tools. Some are general-purpose libraries that handle dozens of formats, others are laser-focused on PDFs. This page maps the landscape so you can find the right tool for your project.

For performance and quality numbers across all of these tools, see the [live benchmarks](https://kreuzberg.dev/benchmarks).

---

## Full-Scope Extraction Libraries

These handle multiple document formats -- not just PDFs.

| Library | Language | Formats | License | Focus | Deep Dive |
| ----------------------------------------------------- | -------- | -------------- | ------------ | ---------------------------------------------------------------- | ----------------------------------------- |
| **Kreuzberg** | Rust | 90+ | Elastic-2.0 | High-throughput extraction with native bindings for 17 languages | -- |
| [Unstructured](https://unstructured.io) | Python | ~31 | Apache-2.0 | Element-based output, managed cloud API | [Read more](kreuzberg-vs-unstructured.md) |
| [Docling](https://github.com/docling-project/docling) | Python | ~38 | MIT | IBM-backed, ML-powered layout analysis | [Read more](kreuzberg-vs-docling.md) |
| [Apache Tika](https://tika.apache.org) | Java | 1500+ detected | Apache-2.0 | Enterprise standard, broadest format detection | [Read more](kreuzberg-vs-tika.md) |
| [MarkItDown](https://github.com/microsoft/markitdown) | Python | ~25 | MIT | Microsoft-backed, outputs Markdown for LLM prep | [Read more](kreuzberg-vs-markitdown.md) |
| [MinerU](https://github.com/opendatalab/MinerU) | Python | PDF + images | **AGPL-3.0** | Heavy ML models for scientific document layout | [Read more](kreuzberg-vs-mineru.md) |
| [Pandoc](https://pandoc.org)                          | Haskell  | 45+ input      | **GPL-2.0**  | Universal document converter (cannot read PDFs)                  | --                                        |

## PDF-Specific Libraries

These focus on PDF extraction only. They're not direct competitors to Kreuzberg's full format coverage, but you'll often see them in PDF-heavy pipelines.

| Library | Language | License | Focus |
| -------------------------------------------------------- | ------------------ | ------------ | ------------------------------------------------------------------ |
| [PyMuPDF / PyMuPDF4LLM](https://pymupdf.readthedocs.io)  | Python (C core)    | **AGPL-3.0** | Fast PDF extraction via MuPDF; AGPL copyleft or commercial license |
| [pdfplumber](https://github.com/jsvine/pdfplumber) | Python | MIT | Good table extraction, built on pdfminer.six |
| [pdfminer.six](https://github.com/pdfminer/pdfminer.six) | Python | MIT | Fine-grained text positioning, pure Python |
| [pypdf](https://github.com/py-pdf/pypdf) | Python | BSD-3 | Lightweight, pure Python, no C dependencies |
| [playa-pdf](https://github.com/dhdaines/playa) | Python | MIT | Modern pure-Python PDF library |
| [pdftotext](https://poppler.freedesktop.org)             | C (Python binding) | **GPL-2.0**  | Thin wrapper around poppler's pdftotext                            |

---

!!! Warning "License matters"

    Libraries marked **AGPL-3.0** (PyMuPDF, MinerU) require that any application using them also be released under AGPL, unless you purchase a commercial license. **GPL-2.0** tools (Pandoc, pdftotext/poppler) have similar copyleft requirements. If you're building a commercial product, check the license before integrating.

!!! Info "Benchmarks"

    Kreuzberg benchmarks against all of the libraries listed above. For extraction speed, quality scores, and format-by-format comparisons, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).
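
As a rough illustration of the license check suggested above, the following Python sketch scans the metadata of installed packages for GPL-family licenses. It is a minimal sketch, not part of Kreuzberg: the helper names (`is_copyleft`, `declared_licenses`, `copyleft_report`) are hypothetical, the string matching is a simple heuristic, and package metadata is self-reported, so treat any result as a prompt to read the actual license text.

```python
from importlib import metadata


def is_copyleft(license_text: str) -> bool:
    """Heuristic: True if the text names a GPL or AGPL license (LGPL excluded)."""
    text = license_text.upper()
    # Drop LGPL mentions first, because "LGPL" contains "GPL".
    text = text.replace("LGPL", "").replace("LESSER GENERAL PUBLIC LICENSE", "")
    return "GPL" in text or "GENERAL PUBLIC LICENSE" in text


def declared_licenses(dist_name: str) -> list[str]:
    """Collect the license strings an installed distribution declares."""
    meta = metadata.metadata(dist_name)  # raises PackageNotFoundError if missing
    values = [meta.get(key) for key in ("License-Expression", "License")]
    classifiers = meta.get_all("Classifier") or []
    values.extend(c for c in classifiers if c.startswith("License ::"))
    return [v for v in values if v]


def copyleft_report(dist_names: list[str]) -> dict[str, list[str]]:
    """Map each installed distribution to the copyleft licenses it declares."""
    report = {}
    for name in dist_names:
        try:
            licenses = declared_licenses(name)
        except metadata.PackageNotFoundError:
            continue  # not installed in this environment, nothing to inspect
        flagged = [text for text in licenses if is_copyleft(text)]
        if flagged:
            report[name] = flagged
    return report
```

Calling `copyleft_report(["pymupdf", "pdfplumber", "pypdf"])` in a project's virtual environment would list whichever of those installed distributions declare a GPL-family license. The distribution names here are only examples, and tools invoked as external binaries (such as Pandoc) do not appear in Python package metadata at all.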