OCR (Optical Character Recognition)
Extract text from images and scanned PDFs. Kreuzberg automatically determines when OCR is needed — images always require it, scanned PDFs trigger it per-page, and hybrid PDFs only OCR the pages that lack a text layer. Set force_ocr=True to OCR all pages regardless.
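The per-page rule above can be pictured with a small sketch. It is not Kreuzberg's implementation: `PageInfo`, `has_text_layer`, and `pages_needing_ocr` are hypothetical names used only to illustrate which pages end up being sent to OCR.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical page descriptor, not a Kreuzberg type."""

    number: int
    has_text_layer: bool


def pages_needing_ocr(pages: list[PageInfo], force_ocr: bool = False) -> list[int]:
    """Return the page numbers that would be sent to OCR."""
    if force_ocr:
        # force_ocr ignores any existing text layer and OCRs every page.
        return [page.number for page in pages]
    # Otherwise only pages without a text layer need OCR.
    return [page.number for page in pages if not page.has_text_layer]


# A hybrid PDF: pages 1 and 3 carry text, page 2 is a scan.
hybrid = [PageInfo(1, True), PageInfo(2, False), PageInfo(3, True)]
print(pages_needing_ocr(hybrid))                  # [2]
print(pages_needing_ocr(hybrid, force_ocr=True))  # [1, 2, 3]
```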
Backend Comparison
Four OCR backends — pick based on platform, accuracy needs, and language coverage.
| | Tesseract | PaddleOCR | EasyOCR | VLM |
|---|---|---|---|---|
| Speed | Fast | Very fast | Moderate | Slow (API latency) |
| Accuracy | Good | Excellent | Excellent | Highest |
| Languages | 100+ | 80+ (11 script families) | 80+ | All (provider-dependent) |
| Installation | System package | Built-in (native) or Python package | Python package only | API key only |
| Model size | ~10 MB | Mobile ~8 MB, Server ~120 MB | ~100 MB | None (cloud-hosted) |
| GPU support | No | Yes | Yes | N/A (server-side) |
| Platform | All (including Wasm) | All except Wasm | Python only | All |
| Cost | Free | Free | Free | Per-token API cost |
When to use which:
- Tesseract: Default choice. Works everywhere, low overhead, broadest platform support.
- PaddleOCR: Best speed-to-accuracy ratio. Preferred for CJK languages. The mobile tier is fast; the server tier maximizes accuracy with a GPU.
- EasyOCR: High accuracy from deep learning models. Python-only, with heavier dependencies.
- VLM: Best for handwritten text, poor scans, Arabic/Farsi, and complex layouts. Cloud providers require an API key and incur per-token costs. See LLM Integration for full details.
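The rules of thumb above could be encoded roughly as follows. This is only a sketch: `suggest_backend` and its parameters are hypothetical and not part of Kreuzberg; the returned strings are the backend names accepted by `--ocr-backend`.

```python
def suggest_backend(
    *,
    platform: str = "native",
    handwriting_or_poor_scan: bool = False,
    cjk_text: bool = False,
    can_use_vlm: bool = False,
) -> str:
    """Hypothetical helper that mirrors the guidance in this section."""
    # Handwriting and poor scans favor a VLM, if one is available to call.
    if handwriting_or_poor_scan and can_use_vlm:
        return "vlm"
    # PaddleOCR is preferred for CJK text but is not available on Wasm.
    if cjk_text and platform != "wasm":
        return "paddle-ocr"
    # Tesseract is the default and runs on every platform.
    return "tesseract"


print(suggest_backend(cjk_text=True))                   # paddle-ocr
print(suggest_backend(cjk_text=True, platform="wasm"))  # tesseract
```

The sketch deliberately leaves EasyOCR out of the decision, since the guidance above gives no case where it is preferred over PaddleOCR.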
Installation
Tesseract
=== "macOS"
```bash title="Terminal"
brew install tesseract
```
=== "Ubuntu / Debian"
```bash title="Terminal"
sudo apt-get install tesseract-ocr
```
=== "RHEL / Fedora"
```bash title="Terminal"
sudo dnf install tesseract
```
=== "Windows"
Download the installer from the [UB Mannheim Tesseract wiki](https://github.com/UB-Mannheim/tesseract/wiki).
Additional language packs:
```bash title="Terminal"
# macOS (all languages)
brew install tesseract-lang

# Ubuntu/Debian (individual languages)
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-fra # French

# Verify installed languages
tesseract --list-langs
```
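To check language packs from a script rather than by eye, the output of `tesseract --list-langs` can be parsed. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the usual output shape of one header line followed by one language code per line.

```python
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the header line, e.g. 'List of available languages in "..." (3):'.
    return {line for line in lines if not line.lower().startswith("list of")}


def missing_languages(required: str, installed: set[str]) -> list[str]:
    """Return codes from an 'eng+deu' style spec that are not installed."""
    return [code for code in required.split("+") if code not in installed]


def installed_languages() -> set[str]:
    """Run Tesseract and return its installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\ndeu\nosd\n'
print(missing_languages("eng+deu+fra", parse_list_langs(sample)))  # ['fra']
```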
PaddleOCR
=== "Native bindings (Rust, Go, TypeScript, Java, C#, Ruby, PHP, Elixir)"
Built in via the `paddle-ocr` feature flag. Models download automatically on first use — no extra installation needed.
```toml title="Cargo.toml (Rust example)"
[dependencies]
kreuzberg = { version = "4.0", features = ["paddle-ocr"] }
```
=== "Python"
PaddleOCR is bundled via the native Rust bindings and works out of the box since 4.8.5 — no extra installation is needed. Models are downloaded automatically on first use.
EasyOCR (Python only)
```bash title="Terminal"
pip install "kreuzberg[easyocr]"
```
!!! Info "Python 3.14"
EasyOCR 1.7.3+ and PyTorch 2.9.1+ support Python 3.14. Install `kreuzberg[easyocr]` on any supported Python version (3.10–3.14).
!!! Tip "Tesseract marker extra"
pip install "kreuzberg[tesseract]" is available as a metadata-only marker to document a dependency on the Tesseract system package. It installs no Python packages — Tesseract itself must still be installed via your OS package manager (see above).
Configuration
Basic OCR
=== "Python"
--8<-- "snippets/python/ocr/ocr_extraction.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_extraction.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_extraction.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_extraction.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_extraction.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_extraction.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_extraction.md"
=== "Wasm"
--8<-- "snippets/wasm/ocr/ocr_extraction.md"
Multiple Languages
Specify multiple language codes separated by + (Tesseract) or as a list (EasyOCR/PaddleOCR):
=== "Python"
--8<-- "snippets/python/ocr/ocr_multi_language.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_multi_language.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_multi_language.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_multi_language.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_multi_language.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_multi_language.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_multi_language.md"
=== "Wasm"
```typescript
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';
await initWasm();
await enableOcr();
const file = fileInput.files?.[0];
if (file) {
const result = await extractFromFile(file, file.type, {
ocr: { backend: 'tesseract-wasm', language: 'eng+deu' },
});
}
```
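The two conventions for multiple languages (a `+`-joined string for Tesseract, a list for EasyOCR and PaddleOCR) can be wrapped in one helper. This is a sketch under that assumption only; `format_languages` is a hypothetical name, and the language codes themselves remain backend-specific.

```python
from typing import List, Union


def format_languages(backend: str, languages: List[str]) -> Union[str, List[str]]:
    """Shape a language selection the way each backend expects it."""
    if not languages:
        raise ValueError("at least one language code is required")
    if backend.startswith("tesseract"):
        # Tesseract (including 'tesseract-wasm') takes one '+'-joined string.
        return "+".join(languages)
    # EasyOCR and PaddleOCR take the codes as a list.
    return list(languages)


print(format_languages("tesseract", ["eng", "deu"]))  # eng+deu
print(format_languages("paddle-ocr", ["ch", "ja"]))   # ['ch', 'ja']
```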
Force OCR
Process PDFs with OCR even when they have a text layer:
=== "Python"
--8<-- "snippets/python/ocr/ocr_force_all_pages.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_force_all_pages.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_force_all_pages.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_force_all_pages.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_force_all_pages.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_force_all_pages.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_force_all_pages.md"
Using EasyOCR
=== "Python"
--8<-- "snippets/python/ocr/ocr_easyocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_easyocr.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_easyocr.md"
Disable OCR
!!! Info "Added in v4.7.0"
When disable_ocr is set, image files return empty content instead of raising MissingDependencyError:
=== "Python"
```python title="disable_ocr.py"
from kreuzberg import ExtractionConfig, extract_file_sync
config = ExtractionConfig(disable_ocr=True)
result = extract_file_sync("scanned.png", config=config)
# result.content will be empty — OCR was skipped
```
=== "TypeScript"
```typescript title="disable_ocr.ts"
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('scanned.png', {
disableOcr: true,
});
// result.content will be empty — OCR was skipped
```
=== "Rust"
```rust title="disable_ocr.rs"
use kreuzberg::{ExtractionConfig, extract_file};
let config = ExtractionConfig {
disable_ocr: true,
..Default::default()
};
let result = extract_file("scanned.png", None, &config).await?;
// result.content will be empty — OCR was skipped
```
Using PaddleOCR
=== "Python"
--8<-- "snippets/python/ocr/ocr_paddleocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_paddleocr.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_paddleocr.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_paddleocr.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_paddleocr.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_paddleocr.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_paddleocr.md"
Using VLM OCR (added in v4.8.0)
Use a vision-language model (e.g. GPT-4o, Claude) as the OCR backend — each page is rendered and sent to the VLM. Cloud providers need an API key; local engines (Ollama, etc.) use the ollama/ prefix — see Local LLM Support.
=== "Python"
--8<-- "snippets/python/llm/vlm_ocr.md"
=== "TypeScript"
--8<-- "snippets/typescript/llm/vlm_ocr.md"
=== "Rust"
```rust title="Rust"
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};
let config = ExtractionConfig {
force_ocr: true,
ocr: Some(OcrConfig {
backend: "vlm".to_string(),
vlm_config: Some(LlmConfig {
model: "openai/gpt-4o-mini".to_string(),
..Default::default()
}),
..Default::default()
}),
..Default::default()
};
let result = extract_file("scan.pdf", None, &config).await?;
```
=== "CLI"
```bash title="Terminal"
kreuzberg extract scan.pdf --force-ocr true --vlm-model openai/gpt-4o-mini
```
=== "TOML"
```toml title="kreuzberg.toml"
force_ocr = true
[ocr]
backend = "vlm"
[ocr.vlm_config]
model = "openai/gpt-4o-mini"
```
For more on VLM OCR, including custom prompts, supported providers, and API key configuration, see LLM Integration.
!!! Tip "GPU Acceleration"
EasyOCR and PaddleOCR support GPU acceleration. Set `use_gpu=True` in your OCR config. PaddleOCR's `model_tier="server"` gives the best accuracy with GPU.
DPI Configuration
Higher DPI improves accuracy but increases processing time and memory.
| DPI | Trade-off |
|---|---|
| 150 | Fastest — lower accuracy, less memory |
| 300 (default) | Balanced — good accuracy, reasonable speed |
| 600 | Best accuracy — slower, more memory |
=== "Python"
--8<-- "snippets/python/config/ocr_dpi_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_dpi_config.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_dpi_config.md"
=== "Go"
--8<-- "snippets/go/config/ocr_dpi_config.md"
=== "Java"
--8<-- "snippets/java/config/ocr_dpi_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/ocr_dpi_config.md"
=== "R"
--8<-- "snippets/r/config/ocr_dpi_config.md"
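To see why DPI has such a strong effect on memory, it helps to work out the size of one rendered page. The sketch below assumes a US Letter page (8.5 × 11 in) held as an uncompressed 8-bit RGB bitmap; Kreuzberg's actual memory use depends on its rendering pipeline and is not described here.

```python
def raster_size_mb(
    dpi: int,
    width_in: float = 8.5,
    height_in: float = 11.0,
    bytes_per_pixel: int = 3,
) -> float:
    """Approximate size in MB of one page rendered at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel / 1_000_000


for dpi in (150, 300, 600):
    print(f"{dpi} DPI: {raster_size_mb(dpi):.1f} MB per page")
# 150 DPI: 6.3 MB per page
# 300 DPI: 25.2 MB per page
# 600 DPI: 101.0 MB per page
```

Pixel count grows with the square of the DPI, so each doubling of DPI roughly quadruples the per-page bitmap.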
PaddleOCR Script Families
80+ languages across 11 script families (PP-OCRv5). Recognition models are downloaded on demand from HuggingFace:
| Family | Languages |
|---|---|
| English | English, numbers, punctuation |
| Chinese | Simplified/Traditional Chinese, Japanese |
| Latin | French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, and others |
| Korean | Korean (Hangul) |
| Slavic | Russian, Ukrainian, Belarusian, Bulgarian, Serbian, and others |
| Thai | Thai script |
| Greek | Greek script |
| Arabic | Arabic, Persian, Urdu |
| Devanagari | Hindi, Marathi, Sanskrit, Nepali |
| Tamil | Tamil script |
| Telugu | Telugu script |
Models are cached locally after first download, so subsequent runs start immediately.
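Because recognition models are fetched per script family, it can be useful to see which families a set of document languages spans. The lookup below is a sketch built only from the table above; `SCRIPT_FAMILIES` and `families_needed` are hypothetical names, and the Latin and Slavic entries list only the languages named in the table.

```python
from typing import Iterable, Optional, Set

SCRIPT_FAMILIES = {
    "English": ["English"],
    "Chinese": ["Simplified Chinese", "Traditional Chinese", "Japanese"],
    "Latin": ["French", "German", "Spanish", "Portuguese", "Italian",
              "Polish", "Dutch", "Turkish", "Vietnamese"],
    "Korean": ["Korean"],
    "Slavic": ["Russian", "Ukrainian", "Belarusian", "Bulgarian", "Serbian"],
    "Thai": ["Thai"],
    "Greek": ["Greek"],
    "Arabic": ["Arabic", "Persian", "Urdu"],
    "Devanagari": ["Hindi", "Marathi", "Sanskrit", "Nepali"],
    "Tamil": ["Tamil"],
    "Telugu": ["Telugu"],
}


def family_for(language: str) -> Optional[str]:
    """Return the script family listed for a language, or None if unlisted."""
    for family, languages in SCRIPT_FAMILIES.items():
        if language in languages:
            return family
    return None


def families_needed(languages: Iterable[str]) -> Set[str]:
    """Return the distinct script families covering the given languages."""
    return {fam for fam in map(family_for, languages) if fam is not None}


print(sorted(families_needed(["French", "German", "Russian"])))  # ['Latin', 'Slavic']
```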
CLI Usage
```bash title="Terminal"
# Basic OCR extraction
kreuzberg extract scanned.pdf --ocr true

# Specific language
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra

# Specific backend
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch

# Force OCR on all pages
kreuzberg extract document.pdf --force-ocr true

# VLM OCR backend
kreuzberg extract handwritten.pdf --force-ocr true --vlm-model openai/gpt-4o-mini

# Use a config file
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
```
| Flag | Description |
|---|---|
| `--ocr true` | Enable OCR processing |
| `--ocr-language <code>` | Language code (`eng`, `deu`, `fra`, `ch`, `ja`, `ru`, etc.) |
| `--ocr-backend <backend>` | Engine: `tesseract`, `paddle-ocr`, `easyocr`, or `vlm` |
| `--force-ocr true` | OCR all pages regardless of text layer |
| `--vlm-model <model>` | VLM model for OCR (for example, `openai/gpt-4o-mini`). Implies `--ocr-backend vlm` |
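When the CLI is driven from a script, building the argument list programmatically avoids quoting mistakes. The following sketch uses only the flags documented above; `build_extract_command` is a hypothetical helper, not something Kreuzberg ships.

```python
import shlex
from typing import List, Optional


def build_extract_command(
    path: str,
    *,
    config: Optional[str] = None,
    ocr: bool = False,
    backend: Optional[str] = None,
    language: Optional[str] = None,
    force_ocr: bool = False,
    vlm_model: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `kreuzberg extract` from the documented flags."""
    cmd = ["kreuzberg", "extract", path]
    if config:
        cmd += ["--config", config]
    if ocr:
        cmd += ["--ocr", "true"]
    if backend:
        cmd += ["--ocr-backend", backend]
    if language:
        cmd += ["--ocr-language", language]
    if force_ocr:
        cmd += ["--force-ocr", "true"]
    if vlm_model:
        cmd += ["--vlm-model", vlm_model]
    return cmd


argv = build_extract_command(
    "chinese_doc.pdf", ocr=True, backend="paddle-ocr", language="ch"
)
print(shlex.join(argv))
# kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch
```

The resulting list can be passed directly to `subprocess.run(argv, check=True)`, which sidesteps shell quoting entirely.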
Troubleshooting
??? Question "Tesseract not found"
Install Tesseract and verify it's on your PATH:
```bash title="Terminal"
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Verify
tesseract --version
```
??? Question "Language not found"
Install the language data pack:
```bash title="Terminal"
# macOS — all languages
brew install tesseract-lang
# Ubuntu/Debian — individual language
sudo apt-get install tesseract-ocr-deu
# Verify
tesseract --list-langs
```
??? Question "Poor accuracy"
- Increase DPI to 600 for better quality
- Try a different backend — PaddleOCR and EasyOCR often outperform Tesseract on complex layouts
- Specify the correct language code for your document
- Use `force_ocr=True` if a PDF's embedded text layer is low quality
- For handwritten text or very poor scans, try the VLM backend with a vision-capable model (see [LLM Integration](llm-integration.md#vlm-ocr))
??? Question "Slow processing"
- Reduce DPI to 150 for faster throughput
- Enable GPU acceleration with EasyOCR or PaddleOCR (`use_gpu=True`)
- Use batch extraction to process multiple files concurrently
??? Question "Out of memory on large PDFs"
- Reduce DPI — lower resolution uses significantly less memory
- Process pages in smaller batches
- Use PaddleOCR's mobile tier (`model_tier="mobile"`) for a smaller memory footprint
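For the "smaller batches" suggestion, the page ranges can be computed up front. This is a minimal sketch of the chunking step only; `page_batches` is a hypothetical helper, and how each range is then extracted (for example, by splitting the PDF first) is left open here.

```python
from collections.abc import Iterator


def page_batches(total_pages: int, batch_size: int) -> Iterator[range]:
    """Yield 1-based page ranges of at most `batch_size` pages each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        # The final batch may be shorter than batch_size.
        yield range(start, min(start + batch_size, total_pages + 1))


for batch in page_batches(7, 3):
    print(list(batch))
# [1, 2, 3]
# [4, 5, 6]
# [7]
```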
Next Steps
- LLM Integration — VLM OCR, structured extraction, and LLM embeddings
- Configuration — all configuration options
- Extraction Basics — core extraction API and supported formats
- Advanced Features — chunking, language detection, embeddings