# Kreuzberg vs MarkItDown

MarkItDown is a Microsoft-backed Python library that converts documents to Markdown -- purpose-built for feeding content into LLMs. Kreuzberg is a Rust-based extraction library with broader format support, native language bindings, and built-in RAG pipeline features. Both are permissively licensed and work well for AI-adjacent document processing.

## At a Glance

|                  | Kreuzberg                                                       | MarkItDown                                |
| ---------------- | --------------------------------------------------------------- | ----------------------------------------- |
| **Written in**   | Rust                                                            | Python                                    |
| **File formats** | 90+                                                             | ~25                                       |
| **Use from**     | Python, TypeScript, Go, Ruby, Java, C#, PHP, Elixir, Rust, Wasm | Python                                    |
| **Output**       | Unified text, element-based, per-page JSON, Markdown            | Markdown                                  |
| **License**      | Apache-2.0                                                      | MIT                                       |
| **Sweet spot**   | Full extraction pipelines with chunking and embeddings          | Quick Markdown conversion for LLM context |

---

## How They Differ

### Philosophy

Different tools for different stages of a pipeline.

- **Kreuzberg** -- A full extraction library. Extracts text, tables, metadata, and images. Offers multiple output formats, built-in chunking, local embeddings, and OCR. Designed to be the complete document-to-vectors pipeline.
- **MarkItDown** -- A converter. Takes documents in, outputs Markdown. Intentionally lightweight and focused on one job: turning files into clean Markdown that LLMs can consume. Downstream processing is left to you.

If you need a complete pipeline (extract, chunk, embed), Kreuzberg handles the full chain. If you just need Markdown for a prompt, MarkItDown does that with minimal setup.

### Format Coverage

Both cover common formats, with different long-tail reach.

- **Kreuzberg (90+ formats)** -- PDFs, Office docs, spreadsheets, HTML, images (via OCR), email, archives, source code, structured data (JSON/YAML/TOML), plus LaTeX, Typst, BibTeX, Jupyter notebooks, EPUB, OrgMode, and more.
- **MarkItDown (~25 formats)** -- PDFs, DOCX, PPTX, XLSX, HTML, XML, CSV, JSON, EPUB, Jupyter notebooks, MSG email, images, and ZIP archives. Covers the essentials.

MarkItDown handles the formats you'll encounter most often. Kreuzberg also handles the ones you won't expect -- until you do.

### OCR

Different approaches to image-based text extraction.

- **Kreuzberg** -- Tesseract + native PaddleOCR (ONNX-based, runs locally, no Python needed). Multi-backend pipeline with automatic quality-based fallback. All processing happens on your machine.
- **MarkItDown** -- Can use **Azure Document Intelligence** for image and PDF extraction. Powerful when enabled, but requires an Azure account and sends documents to Microsoft's cloud. Without it, image OCR is limited.

### Language Support

A significant difference in how you integrate each tool.

- **Kreuzberg** -- Native bindings for **16 languages**. Same performance and API from Python, TypeScript, Rust, Go, Java, Kotlin, C#, Ruby, PHP, Elixir, R, Dart, Swift, Zig, C, or Wasm in the browser.
- **MarkItDown** -- **Python only**. If your backend is in Go or TypeScript, you'd need to wrap MarkItDown in an HTTP service or call it as a subprocess.

### Downstream Processing

What happens after extraction.

- **Kreuzberg** -- Built-in **chunking** (recursive, semantic, markdown-aware), **local embeddings** (ONNX models, no API keys), token reduction, keyword extraction, and quality processing. Extraction output is ready for RAG pipelines.
- **MarkItDown** -- Outputs Markdown and stops. Chunking, embeddings, and vector storage are your responsibility. This is by design -- it's a converter, not a pipeline.
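
To make "chunking is your responsibility" concrete, here is a minimal sketch of a heading-aware chunker you might run over Markdown produced by a converter. It uses only the Python standard library and is not the API of Kreuzberg or MarkItDown; the function name `chunk_markdown` and the `max_chars` parameter are hypothetical, chosen for illustration. It splits at Markdown headings and, when a section is still too long, packs paragraphs greedily up to the size limit.

```python
import re
from typing import List

# ATX headings: one to six '#' characters followed by a space.
HEADING = re.compile(r"^#{1,6} ")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example easy to embed


def chunk_markdown(markdown: str, max_chars: int = 500) -> List[str]:
    """Split Markdown into chunks that begin at headings.

    Hypothetical helper for illustration only. Sections longer than
    max_chars are split on blank lines and packed greedily; a single
    paragraph longer than max_chars is kept whole rather than cut.
    """
    # Pass 1: group lines into sections, starting a new one at each heading.
    sections: List[List[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # '#' lines inside fenced code are comments, not headings.
        if not in_fence and HEADING.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)

    # Pass 2: emit each section, splitting oversized ones by paragraph.
    chunks: List[str] = []
    for section in sections:
        text = "\n".join(section).strip()
        if not text:
            continue
        if len(text) <= max_chars:
            chunks.append(text)
            continue
        current = ""
        for para in re.split(r"\n\s*\n", text):
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "# Title\n\nIntro paragraph.\n\n## Details\n\nMore text here."
    for i, chunk in enumerate(chunk_markdown(sample, max_chars=200)):
        print(f"--- chunk {i} ---")
        print(chunk)
```

Splitting at headings keeps each chunk topically coherent, which tends to matter more for retrieval than hitting an exact size. The sketch counts characters rather than tokens and may split a long fenced code block at a blank line; a production pipeline would likely want token-aware limits and overlap between chunks.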

---

## When to Use Kreuzberg

- You need a **complete pipeline** from document to embeddings
- Your stack includes **Go, TypeScript, Ruby, Java**, or other languages beyond Python
- You want **local OCR** without cloud API dependencies
- You need to handle **niche formats** like LaTeX, Typst, email files, or archives
- You need **multiple output formats** (text, elements, per-page JSON), not just Markdown

## When to Use MarkItDown

- You just need **clean Markdown** to feed into an LLM prompt
- You're in a **Python-only** environment and want the simplest possible setup
- You're already using **Azure Document Intelligence** and want to leverage it for OCR
- Your use case is **document-to-prompt conversion** without further processing
- You value **minimal dependencies** and a small footprint

---

!!! Info "Benchmarks"

    For extraction speed and quality comparisons between Kreuzberg and MarkItDown, see the [live benchmark dashboard](https://kreuzberg.dev/benchmarks).