493 lines
15 KiB
Markdown
493 lines
15 KiB
Markdown
# OCR (Optical Character Recognition)
|
||
|
||
Extract text from images and scanned PDFs. Kreuzberg automatically determines when OCR is needed — images always require it, scanned PDFs trigger it per-page, and hybrid PDFs only OCR the pages that lack a text layer. Set `force_ocr=True` to OCR all pages regardless.
|
||
|
||
## Backend Comparison
|
||
|
||
Four OCR backends — pick based on platform, accuracy needs, and language coverage.
|
||
|
||
| | **Tesseract** | **PaddleOCR** | **EasyOCR** | **VLM** |
|
||
| ---------------- | -------------------- | ----------------------------------- | ------------------- | ------------------------ |
|
||
| **Speed** | Fast | Very fast | Moderate | Slow (API latency) |
|
||
| **Accuracy** | Good | Excellent | Excellent | Highest |
|
||
| **Languages** | 100+ | 80+ (11 script families) | 80+ | All (provider-dependent) |
|
||
| **Installation** | System package | Built-in (native) or Python package | Python package only | API key only |
|
||
| **Model size** | ~10 MB | Mobile ~8 MB, Server ~120 MB | ~100 MB | None (cloud-hosted) |
|
||
| **GPU support** | No | Yes | Yes | N/A (server-side) |
|
||
| **Platform** | All (including Wasm) | All except Wasm | Python only | All |
|
||
| **Cost** | Free | Free | Free | Per-token API cost |
|
||
|
||
**When to use which:**
|
||
|
||
- **Tesseract** — Default choice. Works everywhere, low overhead, broadest platform support.
|
||
- **PaddleOCR** — Best speed-to-accuracy ratio. Preferred for CJK languages. Mobile tier is fast; server tier maximizes accuracy with GPU.
|
||
- **EasyOCR** — Highest accuracy with deep learning models. Python-only, heavier dependency.
|
||
- **VLM** — Best for handwritten text, poor scans, Arabic/Farsi, and complex layouts. Requires an API key and incurs per-token costs. See [LLM Integration](llm-integration.md) for full details.
|
||
|
||
## Installation
|
||
|
||
### Tesseract
|
||
|
||
=== "macOS"
|
||
|
||
```bash title="Terminal"
|
||
brew install tesseract
|
||
```
|
||
|
||
=== "Ubuntu / Debian"
|
||
|
||
```bash title="Terminal"
|
||
sudo apt-get install tesseract-ocr
|
||
```
|
||
|
||
=== "RHEL / Fedora"
|
||
|
||
```bash title="Terminal"
|
||
sudo dnf install tesseract
|
||
```
|
||
|
||
=== "Windows"
|
||
|
||
Download from [GitHub releases](https://github.com/UB-Mannheim/tesseract/wiki).
|
||
|
||
**Additional language packs:**
|
||
|
||
```bash title="Terminal"
|
||
# macOS — all languages
|
||
brew install tesseract-lang
|
||
|
||
# Ubuntu/Debian — individual languages
|
||
sudo apt-get install tesseract-ocr-deu # German
|
||
sudo apt-get install tesseract-ocr-fra # French
|
||
|
||
# Verify installed languages
|
||
tesseract --list-langs
|
||
```
|
||
|
||
### PaddleOCR
|
||
|
||
=== "Native bindings (Rust, Go, TypeScript, Java, C#, Ruby, PHP, Elixir)"
|
||
|
||
Built in via the `paddle-ocr` feature flag. Models download automatically on first use — no extra installation needed.
|
||
|
||
```toml title="Cargo.toml (Rust example)"
|
||
[dependencies]
|
||
kreuzberg = { version = "4.0", features = ["paddle-ocr"] }
|
||
```
|
||
|
||
=== "Python"
|
||
|
||
PaddleOCR is bundled via the native Rust bindings and works out of the box since 4.8.5 — no extra installation is needed. Models are downloaded automatically on first use.
|
||
|
||
### EasyOCR (Python only)
|
||
|
||
```bash title="Terminal"
|
||
pip install "kreuzberg[easyocr]"
|
||
```
|
||
|
||
!!! Info "Python 3.14" EasyOCR 1.7.3+ and PyTorch 2.9.1+ support Python 3.14. Install `kreuzberg[easyocr]` on any supported Python version (3.10–3.14).
|
||
|
||
!!! Tip "Tesseract marker extra"
|
||
`pip install "kreuzberg[tesseract]"` is available as a metadata-only marker to document a dependency on the Tesseract system package. It installs no Python packages — Tesseract itself must still be installed via your OS package manager (see above).
|
||
|
||
## Configuration
|
||
|
||
### Basic OCR
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/ocr/ocr_extraction.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/ocr/ocr_extraction.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/ocr/ocr_extraction.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/ocr/ocr_extraction.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/ocr/ocr_extraction.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/ocr/ocr_extraction.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/ocr/ocr_extraction.md"
|
||
|
||
=== "Wasm"
|
||
|
||
--8<-- "snippets/wasm/ocr/ocr_extraction.md"
|
||
|
||
### Multiple Languages
|
||
|
||
Specify multiple language codes separated by `+` (Tesseract) or as a list (EasyOCR/PaddleOCR):
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/ocr/ocr_multi_language.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/ocr/ocr_multi_language.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/ocr/ocr_multi_language.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/ocr/ocr_multi_language.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/ocr/ocr_multi_language.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/ocr/ocr_multi_language.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/ocr/ocr_multi_language.md"
|
||
|
||
=== "Wasm"
|
||
|
||
```typescript
|
||
import { enableOcr, extractFromFile, initWasm } from '@kreuzberg/wasm';
|
||
|
||
await initWasm();
|
||
await enableOcr();
|
||
|
||
const file = fileInput.files?.[0];
|
||
if (file) {
|
||
const result = await extractFromFile(file, file.type, {
|
||
ocr: { backend: 'tesseract-wasm', language: 'eng+deu' },
|
||
});
|
||
}
|
||
```
|
||
|
||
### Force OCR
|
||
|
||
Process PDFs with OCR even when they have a text layer:
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/ocr/ocr_force_all_pages.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/ocr/ocr_force_all_pages.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/ocr/ocr_force_all_pages.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/ocr/ocr_force_all_pages.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/ocr/ocr_force_all_pages.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/ocr/ocr_force_all_pages.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/ocr/ocr_force_all_pages.md"
|
||
|
||
### Using EasyOCR
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/ocr/ocr_easyocr.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/ocr/ocr_easyocr.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/ocr/ocr_easyocr.md"
|
||
|
||
### Disable OCR
|
||
|
||
!!! Info "Added in v4.7.0"
|
||
|
||
When `disable_ocr` is set, image files return empty content instead of raising `MissingDependencyError`:
|
||
|
||
=== "Python"
|
||
|
||
```python title="disable_ocr.py"
|
||
from kreuzberg import ExtractionConfig, extract_file_sync
|
||
|
||
config = ExtractionConfig(disable_ocr=True)
|
||
result = extract_file_sync("scanned.png", config=config)
|
||
# result.content will be empty — OCR was skipped
|
||
```
|
||
|
||
=== "TypeScript"
|
||
|
||
```typescript title="disable_ocr.ts"
|
||
import { extractFileSync } from '@kreuzberg/node';
|
||
|
||
const result = extractFileSync('scanned.png', {
|
||
disableOcr: true,
|
||
});
|
||
// result.content will be empty — OCR was skipped
|
||
```
|
||
|
||
=== "Rust"
|
||
|
||
```rust title="disable_ocr.rs"
|
||
use kreuzberg::{ExtractionConfig, extract_file};
|
||
|
||
let config = ExtractionConfig {
|
||
disable_ocr: true,
|
||
..Default::default()
|
||
};
|
||
let result = extract_file("scanned.png", &config).await?;
|
||
// result.content will be empty — OCR was skipped
|
||
```
|
||
|
||
### Using PaddleOCR
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/ocr/ocr_paddleocr.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/ocr/ocr_paddleocr.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/ocr/ocr_paddleocr.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/ocr/ocr_paddleocr.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/ocr/ocr_paddleocr.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/ocr/ocr_paddleocr.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/ocr/ocr_paddleocr.md"
|
||
|
||
### Using VLM OCR <span class="version-badge">v4.8.0</span>
|
||
|
||
Use a vision-language model (e.g. GPT-4o, Claude) as the OCR backend — each page is rendered and sent to the VLM. Cloud providers need an API key; local engines (Ollama, etc.) use the `ollama/` prefix — see [Local LLM Support](llm-integration.md#local-llm-support).
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/llm/vlm_ocr.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/llm/vlm_ocr.md"
|
||
|
||
=== "Rust"
|
||
|
||
```rust title="Rust"
|
||
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};
|
||
|
||
let config = ExtractionConfig {
|
||
force_ocr: true,
|
||
ocr: Some(OcrConfig {
|
||
backend: "vlm".to_string(),
|
||
vlm_config: Some(LlmConfig {
|
||
model: "openai/gpt-4o-mini".to_string(),
|
||
..Default::default()
|
||
}),
|
||
..Default::default()
|
||
}),
|
||
..Default::default()
|
||
};
|
||
let result = extract_file("scan.pdf", None, &config).await?;
|
||
```
|
||
|
||
=== "CLI"
|
||
|
||
```bash title="Terminal"
|
||
kreuzberg extract scan.pdf --force-ocr true --vlm-model openai/gpt-4o-mini
|
||
```
|
||
|
||
=== "TOML"
|
||
|
||
```toml title="kreuzberg.toml"
|
||
force_ocr = true
|
||
|
||
[ocr]
|
||
backend = "vlm"
|
||
|
||
[ocr.vlm_config]
|
||
model = "openai/gpt-4o-mini"
|
||
```
|
||
|
||
For more on VLM OCR, including custom prompts, supported providers, and API key configuration, see [LLM Integration](llm-integration.md#vlm-ocr).
|
||
|
||
!!! Tip "GPU Acceleration" EasyOCR and PaddleOCR support GPU acceleration. Set `use_gpu=True` in your OCR config. PaddleOCR's `model_tier="server"` gives the best accuracy with GPU.
|
||
|
||
## DPI Configuration
|
||
|
||
Higher DPI improves accuracy but increases processing time and memory.
|
||
|
||
| DPI | Trade-off |
|
||
| ----------------- | ------------------------------------------ |
|
||
| **150** | Fastest — lower accuracy, less memory |
|
||
| **300** (default) | Balanced — good accuracy, reasonable speed |
|
||
| **600** | Best accuracy — slower, more memory |
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/config/ocr_dpi_config.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/ocr/ocr_dpi_config.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/ocr/ocr_dpi_config.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/config/ocr_dpi_config.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/config/ocr_dpi_config.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/config/ocr_dpi_config.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/config/ocr_dpi_config.md"
|
||
|
||
## PaddleOCR Script Families
|
||
|
||
80+ languages across 11 script families (PP-OCRv5). Recognition models are downloaded on demand from HuggingFace:
|
||
|
||
| Family | Languages |
|
||
| -------------- | -------------------------------------------------------------------------------------------- |
|
||
| **English** | English, numbers, punctuation |
|
||
| **Chinese** | Simplified/Traditional Chinese, Japanese |
|
||
| **Latin** | French, German, Spanish, Portuguese, Italian, Polish, Dutch, Turkish, Vietnamese, and so on. |
|
||
| **Korean** | Korean (Hangul) |
|
||
| **Slavic** | Russian, Ukrainian, Belarusian, Bulgarian, Serbian, and so on. |
|
||
| **Thai** | Thai script |
|
||
| **Greek** | Greek script |
|
||
| **Arabic** | Arabic, Persian, Urdu |
|
||
| **Devanagari** | Hindi, Marathi, Sanskrit, Nepali |
|
||
| **Tamil** | Tamil script |
|
||
| **Telugu** | Telugu script |
|
||
|
||
Models are cached locally after first download, so subsequent runs start immediately.
|
||
|
||
## CLI Usage
|
||
|
||
```bash title="Terminal"
|
||
# Basic OCR extraction
|
||
kreuzberg extract scanned.pdf --ocr true
|
||
|
||
# Specific language
|
||
kreuzberg extract french_doc.pdf --ocr true --ocr-language fra
|
||
|
||
# Specific backend
|
||
kreuzberg extract chinese_doc.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch
|
||
|
||
# Force OCR on all pages
|
||
kreuzberg extract document.pdf --force-ocr true
|
||
|
||
# VLM OCR backend
|
||
kreuzberg extract handwritten.pdf --force-ocr true --vlm-model openai/gpt-4o-mini
|
||
|
||
# Use a config file
|
||
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
|
||
```
|
||
|
||
| Flag | Description |
|
||
| ------------------------- | ---------------------------------------------------------------------------------- |
|
||
| `--ocr true` | Enable OCR processing |
|
||
| `--ocr-language <code>` | Language code (`eng`, `deu`, `fra`, `ch`, `ja`, `ru`, etc.) |
|
||
| `--ocr-backend <backend>` | Engine: `tesseract`, `paddle-ocr`, `easyocr`, or `vlm` |
|
||
| `--force-ocr true` | OCR all pages regardless of text layer |
|
||
| `--vlm-model <model>` | VLM model for OCR (for example, `openai/gpt-4o-mini`). Implies `--ocr-backend vlm` |
|
||
|
||
## Troubleshooting
|
||
|
||
??? Question "Tesseract not found"
|
||
|
||
Install Tesseract and verify it's on your PATH:
|
||
|
||
```bash title="Terminal"
|
||
# macOS
|
||
brew install tesseract
|
||
|
||
# Ubuntu/Debian
|
||
sudo apt-get install tesseract-ocr
|
||
|
||
# Verify
|
||
tesseract --version
|
||
```
|
||
|
||
??? Question "Language not found"
|
||
|
||
Install the language data pack:
|
||
|
||
```bash title="Terminal"
|
||
# macOS — all languages
|
||
brew install tesseract-lang
|
||
|
||
# Ubuntu/Debian — individual language
|
||
sudo apt-get install tesseract-ocr-deu
|
||
|
||
# Verify
|
||
tesseract --list-langs
|
||
```
|
||
|
||
??? Question "Poor accuracy"
|
||
|
||
- Increase DPI to 600 for better quality
|
||
- Try a different backend — PaddleOCR and EasyOCR often outperform Tesseract on complex layouts
|
||
- Specify the correct language code for your document
|
||
- Use `force_ocr=True` if a PDF's embedded text layer is low quality
|
||
- For handwritten text or very poor scans, try the VLM backend with a vision-capable model (see [LLM Integration](llm-integration.md#vlm-ocr))
|
||
|
||
??? Question "Slow processing"
|
||
|
||
- Reduce DPI to 150 for faster throughput
|
||
- Enable GPU acceleration with EasyOCR or PaddleOCR (`use_gpu=True`)
|
||
- Use batch extraction to process multiple files concurrently
|
||
|
||
??? Question "Out of memory on large PDFs"
|
||
|
||
- Reduce DPI — lower resolution uses significantly less memory
|
||
- Process pages in smaller batches
|
||
- Use PaddleOCR's mobile tier (`model_tier="mobile"`) for a smaller memory footprint
|
||
|
||
## Next Steps
|
||
|
||
- [LLM Integration](llm-integration.md) — VLM OCR, structured extraction, and LLM embeddings
|
||
- [Configuration](configuration.md) — all configuration options
|
||
- [Extraction Basics](extraction.md) — core extraction API and supported formats
|
||
- [Advanced Features](advanced.md) — chunking, language detection, embeddings
|