# Comparisons

Kreuzberg sits in a crowded space of document extraction tools. Some are general-purpose libraries that handle dozens of formats; others focus exclusively on PDFs. This page maps the landscape so you can find the right tool for your project.

For performance and quality numbers across all of these tools, see the [live benchmarks](https://kreuzberg.dev/benchmarks).

---

## Full-Scope Extraction Libraries

These handle multiple document formats -- not just PDFs.

| Library                                               | Language | Formats        | License      | Focus                                                            | Deep Dive                                 |
| ----------------------------------------------------- | -------- | -------------- | ------------ | ---------------------------------------------------------------- | ----------------------------------------- |
| **Kreuzberg**                                         | Rust     | 90+            | Elastic-2.0  | High-throughput extraction with native bindings for 17 languages | --                                        |
| [Unstructured](https://unstructured.io)               | Python   | ~31            | Apache-2.0   | Element-based output, managed cloud API                          | [Read more](kreuzberg-vs-unstructured.md) |
| [Docling](https://github.com/docling-project/docling) | Python   | ~38            | MIT          | IBM-backed, ML-powered layout analysis                           | [Read more](kreuzberg-vs-docling.md)      |
| [Apache Tika](https://tika.apache.org)                | Java     | 1500+ detected | Apache-2.0   | Enterprise standard, broadest format detection                   | [Read more](kreuzberg-vs-tika.md)         |
| [MarkItDown](https://github.com/microsoft/markitdown) | Python   | ~25            | MIT          | Microsoft-backed, outputs Markdown for LLM prep                  | [Read more](kreuzberg-vs-markitdown.md)   |
| [MinerU](https://github.com/opendatalab/MinerU)       | Python   | PDF + images   | **AGPL-3.0** | Heavy ML models for scientific document layout                   | [Read more](kreuzberg-vs-mineru.md)       |
| [Pandoc](https://pandoc.org)                          | Haskell  | 45+ input      | **GPL-2.0**  | Universal document converter (cannot read PDFs)                  | --                                        |

## PDF-Specific Libraries

These focus on PDF extraction only. They're not direct competitors to Kreuzberg's full format coverage, but you'll often see them in PDF-heavy pipelines.

| Library                                                  | Language           | License      | Focus                                                              |
| -------------------------------------------------------- | ------------------ | ------------ | ------------------------------------------------------------------ |
| [PyMuPDF / PyMuPDF4LLM](https://pymupdf.readthedocs.io)  | Python (C core)    | **AGPL-3.0** | Fast PDF extraction via MuPDF. AGPL unless commercially licensed.  |
| [pdfplumber](https://github.com/jsvine/pdfplumber)       | Python             | MIT          | Good table extraction, built on pdfminer.six                       |
| [pdfminer.six](https://github.com/pdfminer/pdfminer.six) | Python             | MIT          | Fine-grained text positioning, pure Python                         |
| [pypdf](https://github.com/py-pdf/pypdf)                 | Python             | BSD-3        | Lightweight, pure Python, no C dependencies                        |
| [playa-pdf](https://github.com/dhdaines/playa)           | Python             | MIT          | Modern pure-Python PDF library                                     |
| [pdftotext](https://poppler.freedesktop.org)             | C (Python binding) | **GPL-2.0**  | Thin wrapper around poppler's pdftotext                            |

---

!!! Warning "License matters"

    Libraries marked **AGPL-3.0** (PyMuPDF, MinerU) require that any application built on them, including one that is only offered to users over a network, also be released under AGPL, unless you purchase a commercial license. **GPL-2.0** tools (Pandoc, pdftotext/poppler) carry similar copyleft requirements for software you distribute. If you're building a commercial product, check the license before integrating.
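
One way to run that check on an existing Python environment is to read the license metadata each installed package declares. The sketch below is a rough first pass, not a legal review: the helper names (`declared_licenses`, `needs_review`) and the list of markers are illustrative assumptions, the distribution names in the loop are only examples, and the keyword match is deliberately broad (it also flags LGPL).

```python
from importlib import metadata

# Substrings that suggest a GPL-family license. A heuristic only:
# it also matches LGPL, and it cannot see licenses missing from metadata.
COPYLEFT_MARKERS = ("AGPL", "GPL", "AFFERO", "GENERAL PUBLIC LICENSE")


def declared_licenses(dist_name: str) -> list[str]:
    """Collect the license strings an installed distribution declares."""
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return []  # not installed in this environment
    found = []
    # Newer packages use License-Expression; older ones use License.
    for field in ("License-Expression", "License"):
        value = meta.get(field)
        if value:
            found.append(value.strip())
    # Trove classifiers such as "License :: OSI Approved :: MIT License".
    for classifier in meta.get_all("Classifier") or []:
        if classifier.startswith("License ::"):
            found.append(classifier)
    return found


def needs_review(license_strings: list[str]) -> bool:
    """Return True if any declared license looks like GPL-family copyleft."""
    return any(
        marker in text.upper()
        for text in license_strings
        for marker in COPYLEFT_MARKERS
    )


if __name__ == "__main__":
    # Example distribution names; replace with your own dependencies.
    for name in ("pymupdf", "pdfplumber", "pypdf"):
        licenses = declared_licenses(name)
        if not licenses:
            print(f"{name}: not installed or no license metadata")
        else:
            flag = "REVIEW" if needs_review(licenses) else "ok"
            print(f"{name}: {flag} {licenses}")
```

A flagged package is a prompt to read the actual license text, and a package that passes may still bundle or shell out to a copyleft tool such as poppler, which this metadata check cannot detect.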

!!! Info "Benchmarks"

    Kreuzberg benchmarks against all of the libraries listed above. For extraction speed, quality scores, and format-by-format comparisons, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).