# OCR (Optical Character Recognition) Extract text from images and scanned PDFs. Kreuzberg automatically determines when OCR is needed — images always require it, scanned PDFs trigger it per-page, and hybrid PDFs only OCR the pages that lack a text layer. Set `force_ocr=True` to OCR all pages regardless. ## Backend Comparison Four OCR backends — pick based on platform, accuracy needs, and language coverage. | | **Tesseract** | **PaddleOCR** | **EasyOCR** | **VLM** | | ---------------- | -------------------- | ----------------------------------- | ------------------- | ------------------------ | | **Speed** | Fast | Very fast | Moderate | Slow (API latency) | | **Accuracy** | Good | Excellent | Excellent | Highest | | **Languages** | 100+ | 80+ (11 script families) | 80+ | All (provider-dependent) | | **Installation** | System package | Built-in (native) or Python package | Python package only | API key only | | **Model size** | ~10 MB | Mobile ~8 MB, Server ~120 MB | ~100 MB | None (cloud-hosted) | | **GPU support** | No | Yes | Yes | N/A (server-side) | | **Platform** | All (including Wasm) | All except Wasm | Python only | All | | **Cost** | Free | Free | Free | Per-token API cost | **When to use which:** - **Tesseract** — Default choice. Works everywhere, low overhead, broadest platform support. - **PaddleOCR** — Best speed-to-accuracy ratio. Preferred for CJK languages. Mobile tier is fast; server tier maximizes accuracy with GPU. - **EasyOCR** — Highest accuracy with deep learning models. Python-only, heavier dependency. - **VLM** — Best for handwritten text, poor scans, Arabic/Farsi, and complex layouts. Requires an API key and incurs per-token costs. See [LLM Integration](llm-integration.md) for full details. ## Installation ### Tesseract === "macOS" ```bash title="Terminal" brew install tesseract ``` === "Ubuntu / Debian" ```bash title="Terminal" sudo apt-get install tesseract-ocr ``` === "RHEL / Fedora" ```bash title="Terminal" sudo dnf install tesseract ``` === "Windows" Download from [GitHub releases](https://github.com/UB-Mannheim/tesseract/wiki). **Additional language packs:** ```bash title="Terminal" # macOS — all languages brew install tesseract-lang # Ubuntu/Debian — individual languages sudo apt-get install tesseract-ocr-deu # German sudo apt-get install tesseract-ocr-fra # French # Verify installed languages tesseract --list-langs ``` ### PaddleOCR === "Native bindings (Rust, Go, TypeScript, Java, C#, Ruby, PHP, Elixir)" Built in via the `paddle-ocr` feature flag. Models download automatically on first use — no extra installation needed. ```toml title="Cargo.toml (Rust example)" [dependencies] kreuzberg = { version = "4.0", features = ["paddle-ocr"] } ``` === "Python" PaddleOCR is bundled via the native Rust bindings and works out of the box since 4.8.5 — no extra installation is needed. Models are downloaded automatically on first use. ### EasyOCR (Python only) ```bash title="Terminal" pip install "kreuzberg[easyocr]" ``` !!! Info "Python 3.14" EasyOCR 1.7.3+ and PyTorch 2.9.1+ support Python 3.14. Install `kreuzberg[easyocr]` on any supported Python version (3.10–3.14). !!! Tip "Tesseract marker extra" `pip install "kreuzberg[tesseract]"` is available as a metadata-only marker to document a dependency on the Tesseract system package. It installs no Python packages — Tesseract itself must still be installed via your OS package manager (see above). ## Configuration ### Basic OCR === "Python" --8<-- "snippets/python/ocr/ocr_extraction.md" === "TypeScript" --8<-- "snippets/typescript/ocr/ocr_extraction.md" === "Rust" --8<-- "snippets/rust/ocr/ocr_extraction.md" === "Go" --8<-- "snippets/go/ocr/ocr_extraction.md" === "Java" --8<-- "snippets/java/ocr/ocr_extraction.md" === "Ruby" --8<-- "snippets/ruby/ocr/ocr_extraction.md" === "R" --8<-- "snippets/r/ocr/ocr_extraction.md" === "Wasm" --8<-- "snippets/wasm/ocr/ocr_extraction.md" ### Multiple Languages Specify multiple language codes separated by `+` (Tesseract) or as a list (EasyOCR/PaddleOCR): === "Python" --8<-- "snippets/python/ocr/ocr_multi_language.md" === "TypeScript" --8<-- "snippets/typescript/ocr/ocr_multi_language.md" === "Rust" --8<-- "snippets/rust/ocr/ocr_multi_language.md" === "Go" --8<-- "snippets/go/ocr/ocr_multi_language.md" === "Java" --8<-- "snippets/java/ocr/ocr_multi_language.md" === "Ruby" --8<-- "snippets/ruby/ocr/ocr_multi_language.md" === "R" --8<-- "snippets/r/ocr/ocr_multi_language.md" === "Wasm" ```typescript import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm'; await initWasm(); await enableOcr(); const file = fileInput.files?.[0]; if (file) { const result = await extractFromFile(file, file.type, { ocr: { backend: 'tesseract-wasm', language: 'eng+deu' }, }); } ``` ### Force OCR Process PDFs with OCR even when they have a text layer: === "Python" --8<-- "snippets/python/ocr/ocr_force_all_pages.md" === "TypeScript" --8<-- "snippets/typescript/ocr/ocr_force_all_pages.md" === "Rust" --8<-- "snippets/rust/ocr/ocr_force_all_pages.md" === "Go" --8<-- "snippets/go/ocr/ocr_force_all_pages.md" === "Java" --8<-- "snippets/java/ocr/ocr_force_all_pages.md" === "Ruby" --8<-- "snippets/ruby/ocr/ocr_force_all_pages.md" === "R" --8<-- "snippets/r/ocr/ocr_force_all_pages.md" ### Using EasyOCR === "Python" --8<-- "snippets/python/ocr/ocr_easyocr.md" === "TypeScript" --8<-- "snippets/typescript/ocr/ocr_easyocr.md" === "Rust" --8<-- "snippets/rust/ocr/ocr_easyocr.md" ### Disable OCR !!! Info "Added in v4.7.0" When `disable_ocr` is set, image files return empty content instead of raising `MissingDependencyError`: === "Python" ```python title="disable_ocr.py" from kreuzberg import ExtractionConfig, extract_file_sync config = ExtractionConfig(disable_ocr=True) result = extract_file_sync("scanned.png", config=config) # result.content will be empty — OCR was skipped ``` === "TypeScript" ```typescript title="disable_ocr.ts" import { extractFileSync } from '@kreuzberg/node'; const result = extractFileSync('scanned.png', { disableOcr: true, }); // result.content will be empty — OCR was skipped ``` === "Rust" ```rust title="disable_ocr.rs" use kreuzberg::{ExtractionConfig, extract_file}; let config = ExtractionConfig { disable_ocr: true, ..Default::default() }; let result = extract_file("scanned.png", &config).await?; // result.content will be empty — OCR was skipped ``` ### Using PaddleOCR === "Python" --8<-- "snippets/python/ocr/ocr_paddleocr.md" === "TypeScript" --8<-- "snippets/typescript/ocr/ocr_paddleocr.md" === "Rust" --8<-- "snippets/rust/ocr/ocr_paddleocr.md" === "Go" --8<-- "snippets/go/ocr/ocr_paddleocr.md" === "Java" --8<-- "snippets/java/ocr/ocr_paddleocr.md" === "Ruby" --8<-- "snippets/ruby/ocr/ocr_paddleocr.md" === "R" --8<-- "snippets/r/ocr/ocr_paddleocr.md" ### Using VLM OCR v4.8.0 Use a vision-language model (e.g. GPT-4o, Claude) as the OCR backend — each page is rendered and sent to the VLM. Cloud providers need an API key; local engines (Ollama, etc.) use the `ollama/` prefix — see [Local LLM Support](llm-integration.md#local-llm-support). === "Python" --8<-- "snippets/python/llm/vlm_ocr.md" === "TypeScript" --8<-- "snippets/typescript/llm/vlm_ocr.md" === "Rust" ```rust title="Rust" use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig}; let config = ExtractionConfig { force_ocr: true, ocr: Some(OcrConfig { backend: "vlm".to_string(), vlm_config: Some(LlmConfig { model: "openai/gpt-4o-mini".to_string(), ..Default::default() }), ..Default::default() }), ..Default::default() }; let result = extract_file("scan.pdf", None, &config).await?; ``` === "CLI" ```bash title="Terminal" kreuzberg extract scan.pdf --force-ocr true --vlm-model openai/gpt-4o-mini ``` === "TOML" ```toml title="kreuzberg.toml" force_ocr = true [ocr] backend = "vlm" [ocr.vlm_config] model = "openai/gpt-4o-mini" ``` For more on VLM OCR, including custom prompts, supported providers, and API key configuration, see [LLM Integration](llm-integration.md#vlm-ocr). !!! Tip "GPU Acceleration" EasyOCR and PaddleOCR support GPU acceleration. Set `use_gpu=True` in your OCR config. PaddleOCR's `model_tier="server"` gives the best accuracy with GPU. ## DPI Configuration Higher DPI improves accuracy but increases processing time and memory. | DPI | Trade-off | | ----------------- | ------------------------------------------ | | **150** | Fastest — lower accuracy, less memory | | **300** (default) | Balanced — good accuracy, reasonable speed | | **600** | Best accuracy — slower, more memory | === "Python" --8<-- "snippets/python/config/ocr_dpi_config.md" === "TypeScript" --8<-- "snippets/typescript/ocr/ocr_dpi_config.md" === "Rust" --8<-- "snippets/rust/ocr/ocr_dpi_config.md" === "Go" --8<-- "snippets/go/config/ocr_dpi_config.md" === "Java" --8<-- "snippets/java/config/ocr_dpi_config.md" === "Ruby" --8<-- "snippets/ruby/config/ocr_dpi_config.md" === "R" --8<-- "snippets/r/config/ocr_dpi_config.md" ## PaddleOCR Script Families 80+ languages across 11 script families (PP-OCRv5). Recognition models are downloaded on demand from HuggingFace: | Family | Languages | | -------------- | -------------------------------------------------------------------------------------------- | | **English** | English, numbers, punctuation | | **Chinese** | Simplified/Traditional Chinese, Japanese | | **Latin** | French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, and so on. | | **Korean** | Korean (Hangul) | | **Slavic** | Russian, Ukrainian, Belarusian, Bulgarian, Serbian, and so on. | | **Thai** | Thai script | | **Greek** | Greek script | | **Arabic** | Arabic, Persian, Urdu | | **Devanagari** | Hindi, Marathi, Sanskrit, Nepali | | **Tamil** | Tamil script | | **Telugu** | Telugu script | Models are cached locally after first download, so subsequent runs start immediately. ## CLI Usage ```bash title="Terminal" # Basic OCR extraction kreuzberg extract scanned.pdf --ocr true # Specific language kreuzberg extract french_doc.pdf --ocr true --ocr-language fra # Specific backend kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch # Force OCR on all pages kreuzberg extract document.pdf --force-ocr true # VLM OCR backend kreuzberg extract handwritten.pdf --force-ocr true --vlm-model openai/gpt-4o-mini # Use a config file kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true ``` | Flag | Description | | ------------------------- | ---------------------------------------------------------------------------------- | | `--ocr true` | Enable OCR processing | | `--ocr-language ` | Language code (`eng`, `deu`, `fra`, `ch`, `ja`, `ru`, etc.) | | `--ocr-backend ` | Engine: `tesseract`, `paddle-ocr`, `easyocr`, or `vlm` | | `--force-ocr true` | OCR all pages regardless of text layer | | `--vlm-model ` | VLM model for OCR (for example, `openai/gpt-4o-mini`). Implies `--ocr-backend vlm` | ## Troubleshooting ??? Question "Tesseract not found" Install Tesseract and verify it's on your PATH: ```bash title="Terminal" # macOS brew install tesseract # Ubuntu/Debian sudo apt-get install tesseract-ocr # Verify tesseract --version ``` ??? Question "Language not found" Install the language data pack: ```bash title="Terminal" # macOS — all languages brew install tesseract-lang # Ubuntu/Debian — individual language sudo apt-get install tesseract-ocr-deu # Verify tesseract --list-langs ``` ??? Question "Poor accuracy" - Increase DPI to 600 for better quality - Try a different backend — PaddleOCR and EasyOCR often outperform Tesseract on complex layouts - Specify the correct language code for your document - Use `force_ocr=True` if a PDF's embedded text layer is low quality - For handwritten text or very poor scans, try the VLM backend with a vision-capable model (see [LLM Integration](llm-integration.md#vlm-ocr)) ??? Question "Slow processing" - Reduce DPI to 150 for faster throughput - Enable GPU acceleration with EasyOCR or PaddleOCR (`use_gpu=True`) - Use batch extraction to process multiple files concurrently ??? Question "Out of memory on large PDFs" - Reduce DPI — lower resolution uses significantly less memory - Process pages in smaller batches - Use PaddleOCR's mobile tier (`model_tier="mobile"`) for a smaller memory footprint ## Next Steps - [LLM Integration](llm-integration.md) — VLM OCR, structured extraction, and LLM embeddings - [Configuration](configuration.md) — all configuration options - [Extraction Basics](extraction.md) — core extraction API and supported formats - [Advanced Features](advanced.md) — chunking, language detection, embeddings