fil/.ai-rulez/domains/ocr-integration/DOMAIN.md at b4c07d36934823e7b674ed498e966d1583a7b4bc

Henrik Jess Nielsen b4c07d3693

Nomad changes

2026-06-01 23:40:55 +02:00

description: OCR backend integration and image processing

Multiple backends: Tesseract (C FFI via leptonica/tesseract-sys), PaddleOCR (ONNX Runtime), Python backends (EasyOCR, Surya) via FFI
Backend selection: priority-based with fallback (Tesseract by default, PaddleOCR preferred for CJK, Python backends as the last resort)
Image preprocessing: deskew, binarization, noise removal, contrast enhancement — applied before OCR
PSM modes: configurable page segmentation (single block, single line, sparse text) per use case
Table detection: identify table regions → cell extraction → row/column reconstruction → Markdown table output
hOCR: parse Tesseract hOCR output for word-level bounding boxes, confidence scores, reading order
Language management: auto-detect document language, load appropriate Tesseract traineddata, support multi-language documents
Caching: cache OCR results by image hash + backend + language + PSM mode
Confidence tracking: per-word and per-page confidence scores, flag low-confidence regions for review
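
The selection, caching, and confidence rules above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's implementation. The function names, the backend identifiers, the exact fallback order for non-CJK text, the use of SHA-256 for the image hash, and the 60.0 threshold are all assumptions made for the example. Confidence is taken to be on a 0 to 100 scale, as in hOCR `x_wconf` values.

```python
import hashlib
from dataclasses import dataclass

# Tesseract language codes treated as CJK here (assumed set, extend as needed).
CJK_LANGUAGES = {"chi_sim", "chi_tra", "jpn", "kor"}


def backend_order(language: str) -> list[str]:
    """Return hypothetical backend names, highest priority first."""
    if language in CJK_LANGUAGES:
        return ["paddleocr", "tesseract", "easyocr", "surya"]
    # Assumption: PaddleOCR sits between Tesseract and the Python backends.
    return ["tesseract", "paddleocr", "easyocr", "surya"]


def run_with_fallback(image_bytes: bytes, language: str, backends: dict):
    """Try each available backend in priority order until one succeeds.

    `backends` maps a backend name to a callable(image_bytes, language)
    that returns a result or raises on failure.
    """
    errors = {}
    for name in backend_order(language):
        run = backends.get(name)
        if run is None:
            continue  # backend not installed or not configured
        try:
            return name, run(image_bytes, language)
        except Exception as exc:  # remember the failure, try the next backend
            errors[name] = exc
    raise RuntimeError(f"all OCR backends failed: {errors}")


def cache_key(image_bytes: bytes, backend: str, language: str, psm: int) -> str:
    """Build a cache key from image hash + backend + language + PSM mode."""
    image_hash = hashlib.sha256(image_bytes).hexdigest()
    return f"{image_hash}:{backend}:{language}:{psm}"


@dataclass
class Word:
    text: str
    confidence: float  # 0..100


def page_confidence(words: list[Word]) -> float:
    """Mean word confidence for a page; 0.0 for a page with no words."""
    if not words:
        return 0.0
    return sum(w.confidence for w in words) / len(words)


def low_confidence_words(words: list[Word], threshold: float = 60.0) -> list[Word]:
    """Return words whose confidence falls below the review threshold."""
    return [w for w in words if w.confidence < threshold]
```

Including the backend, language, and PSM mode in the key means the same image is re-processed when any of those settings change, rather than returning a stale result produced under a different configuration.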