Nomad changes

Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

@@ -0,0 +1,81 @@
# Kreuzberg vs MarkItDown
MarkItDown is a Microsoft-backed Python library that converts documents to Markdown -- purpose-built for feeding content into LLMs. Kreuzberg is a Rust-based extraction library with broader format support, native language bindings, and built-in RAG pipeline features. Both are permissively licensed and work well for AI-adjacent document processing.
## At a Glance
| | Kreuzberg | MarkItDown |
| ---------------- | --------------------------------------------------------------- | ----------------------------------------- |
| **Written in** | Rust | Python |
| **File formats** | 90+ | ~25 |
| **Use from** | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, Wasm | Python |
| **Output** | Unified text, element-based, per-page JSON, Markdown | Markdown |
| **License** | Apache-2.0 | MIT |
| **Sweet spot** | Full extraction pipelines with chunking and embeddings | Quick Markdown conversion for LLM context |
---
## How They Differ
### Philosophy
Different tools for different stages of a pipeline.
- **Kreuzberg** -- A full extraction library. Extracts text, tables, metadata, and images. Offers multiple output formats, built-in chunking, local embeddings, and OCR. Designed to be the complete document-to-vectors pipeline.
- **MarkItDown** -- A converter. Takes documents in, outputs Markdown. Intentionally lightweight and focused on one job: turning files into clean Markdown that LLMs can consume. Downstream processing is left to you.
If you need a complete pipeline (extract, chunk, embed), Kreuzberg handles the full chain. If you just need Markdown for a prompt, MarkItDown does that with minimal setup.
### Format Coverage
Both cover common formats, with different long-tail reach.
- **Kreuzberg (90+ formats)** -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, OrgMode, and more.
- **MarkItDown (~25 formats)** -- PDFs, DOCX, PPTX, XLSX, HTML, XML, CSV, JSON, EPUB, Jupyter notebooks, MSG email, images, and ZIP archives. Covers the essentials.
MarkItDown handles the formats you'll encounter most often. Kreuzberg also handles the ones you don't expect to need -- until you do.
### OCR
Different approaches to image-based text extraction.
- **Kreuzberg** -- Tesseract + native PaddleOCR (ONNX-based, runs locally, no Python needed). Multi-backend pipeline with automatic quality-based fallback. All processing happens on your machine.
- **MarkItDown** -- Can use **Azure Document Intelligence** for image and PDF extraction. Powerful when enabled, but requires an Azure account and sends documents to Microsoft's cloud. Without it, image OCR is limited.
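
The quality-based fallback idea can be sketched in a few lines. This illustrates the general technique only and is not Kreuzberg's implementation: the backend callables, the `text_quality` heuristic, and the `0.6` threshold are all hypothetical choices made for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: an OCR backend takes image bytes and returns text.
OcrBackend = Callable[[bytes], str]


def text_quality(text: str) -> float:
    """Crude heuristic (an assumption): share of non-space characters that are alphanumeric."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)


def ocr_with_fallback(
    image: bytes,
    backends: Iterable[Tuple[str, OcrBackend]],
    min_quality: float = 0.6,
) -> Tuple[Optional[str], str]:
    """Try backends in order; return the first result that clears min_quality,
    otherwise the best-scoring result seen. Returns (backend_name, text)."""
    best_name, best_text, best_score = None, "", -1.0
    for name, backend in backends:
        try:
            text = backend(image)
        except Exception:
            continue  # a failing backend falls through to the next one
        score = text_quality(text)
        if score >= min_quality:
            return name, text
        if score > best_score:
            best_name, best_text, best_score = name, text, score
    return best_name, best_text
```

Ordering the backends from cheapest to most expensive means the slower engine only runs when the first result looks poor.
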
### Language Support
A significant difference in how you integrate each tool.
- **Kreuzberg** -- Native bindings for **16 languages**. Same performance and API from Python, TypeScript, Rust, Go, Java, Kotlin, C#, Ruby, PHP, Elixir, R, Dart, Swift, Zig, C, or Wasm in the browser.
- **MarkItDown** -- **Python only**. If your backend is in Go or TypeScript, you'd need to wrap MarkItDown in an HTTP service or call it as a subprocess.
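
For the HTTP-service route, a minimal sketch using only the Python standard library might look like the following. The endpoint shape, the `ext` query parameter, and the port are assumptions made for illustration; `markitdown_convert` relies on MarkItDown's `MarkItDown().convert(path).text_content` call and imports the library lazily, so any other converter function can be swapped in.

```python
import os
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable
from urllib.parse import parse_qs, urlparse


def markitdown_convert(path: str) -> str:
    """Convert one file to Markdown with MarkItDown (imported lazily)."""
    from markitdown import MarkItDown
    return MarkItDown().convert(path).text_content


def make_handler(convert: Callable[[str], str]):
    """Build a handler class that converts POSTed file bytes using `convert`."""

    class ConvertHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Assumed contract: raw file bytes in the body, extension in ?ext=.docx
            ext = parse_qs(urlparse(self.path).query).get("ext", [""])[0]
            ext = "".join(c for c in ext if c.isalnum() or c == ".")[:10]
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            # Write to a temp file so the converter can use the file extension.
            fd, tmp_path = tempfile.mkstemp(suffix=ext)
            try:
                with os.fdopen(fd, "wb") as tmp:
                    tmp.write(body)
                status, payload = 200, convert(tmp_path).encode("utf-8")
            except Exception as exc:
                status, payload = 500, str(exc).encode("utf-8")
            finally:
                os.remove(tmp_path)
            self.send_response(status)
            self.send_header("Content-Type", "text/markdown; charset=utf-8")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return ConvertHandler


def serve(convert: Callable[[str], str] = markitdown_convert,
          host: str = "127.0.0.1", port: int = 8000) -> None:
    """Run the conversion service until interrupted."""
    HTTPServer((host, port), make_handler(convert)).serve_forever()
```

Calling `serve()` starts the service; a Go or TypeScript backend would then POST file bytes to `/convert?ext=.docx` and read Markdown from the response body. A production service would also need authentication, upload size limits, and request timeouts.
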
### Downstream Processing
What happens after extraction.
- **Kreuzberg** -- Built-in **chunking** (recursive, semantic, markdown-aware), **local embeddings** (ONNX models, no API keys), token reduction, keyword extraction, and quality processing. Extraction output is ready for RAG pipelines.
- **MarkItDown** -- Outputs Markdown and stops. Chunking, embeddings, and vector storage are your responsibility. This is by design -- it's a converter, not a pipeline.
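
To give a sense of what "your responsibility" means in practice, here is a minimal heading-aware chunker in plain Python. It is a sketch under simple assumptions (ATX `#` headings, a character budget instead of a token budget) and does not reflect how Kreuzberg or any other library chunks text; the function names and the `max_chars` default are hypothetical.

```python
import re
from typing import List

HEADING = re.compile(r"^#{1,6}\s")


def split_sections(markdown: str) -> List[str]:
    """Split Markdown into sections, starting a new one at every ATX heading."""
    sections, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def chunk_markdown(markdown: str, max_chars: int = 1000) -> List[str]:
    """Greedily pack whole sections into chunks of at most max_chars."""
    chunks, buf = [], ""
    for section in split_sections(markdown):
        candidate = f"{buf}\n\n{section}" if buf else section
        if len(candidate) <= max_chars:
            buf = candidate
            continue
        if buf:
            chunks.append(buf)
        # A single section longer than the limit is cut at fixed offsets.
        while len(section) > max_chars:
            chunks.append(section[:max_chars])
            section = section[max_chars:]
        buf = section
    if buf:
        chunks.append(buf)
    return chunks
```

One known limitation: a line starting with `#` inside a fenced code block is treated as a heading, so real pipelines usually track fence state or use a Markdown parser.
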
---
## When to Use Kreuzberg
- You need a **complete pipeline** from document to embeddings
- Your stack includes **Go, TypeScript, Ruby, Java**, or other languages beyond Python
- You want **local OCR** without cloud API dependencies
- You need to handle **niche formats** like LaTeX, Typst, email files, or archives
- You need **multiple output formats** (text, elements, per-page JSON), not just Markdown
## When to Use MarkItDown
- You just need **clean Markdown** to feed into an LLM prompt
- You're in a **Python-only** environment and want the simplest possible setup
- You're already using **Azure Document Intelligence** and want to leverage it for OCR
- Your use case is **document-to-prompt conversion** without further processing
- You value **minimal dependencies** and a small footprint
---
!!! Info "Benchmarks"
    For extraction speed and quality comparisons between Kreuzberg and MarkItDown, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).