13
.ai-rulez/domains/ocr-integration/DOMAIN.md
Normal file
@@ -0,0 +1,13 @@
---
description: OCR backend integration and image processing
---

- Multiple backends: Tesseract (C FFI via leptonica/tesseract-sys), PaddleOCR (ONNX Runtime), Python backends (EasyOCR, Surya) via FFI
- Backend selection: priority-based with fallback (Tesseract default, PaddleOCR for CJK, Python backends as fallback)
- Image preprocessing: deskew, binarization, noise removal, contrast enhancement, applied before OCR
- PSM modes: configurable page segmentation (single block, single line, sparse text) per use case
- Table detection: identify table regions → cell extraction → row/column reconstruction → Markdown table output
- hOCR: parse Tesseract hOCR output for word-level bounding boxes, confidence scores, reading order
- Language management: auto-detect document language, load appropriate Tesseract traineddata, support multi-language documents
- Caching: cache OCR results by image hash + backend + language + PSM mode
- Confidence tracking: per-word and per-page confidence scores, flag low-confidence regions for review
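The PSM bullet above can be illustrated with a small mapping from named modes to Tesseract's numeric `--psm` values. This is a minimal sketch: the `PageSegMode` enum, its method names, and the use-case strings are hypothetical and not taken from the crate; only the numeric codes (3, 6, 7, 11) are Tesseract's own.

```rust
/// Page segmentation modes named in the list above, plus Tesseract's automatic default.
/// Hypothetical type for illustration; not the crate's actual API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageSegMode {
    Auto,
    SingleBlock,
    SingleLine,
    SparseText,
}

impl PageSegMode {
    /// Numeric value Tesseract expects for `--psm`.
    pub fn as_tesseract_psm(self) -> u8 {
        match self {
            PageSegMode::Auto => 3,
            PageSegMode::SingleBlock => 6,
            PageSegMode::SingleLine => 7,
            PageSegMode::SparseText => 11,
        }
    }

    /// Pick a mode per use case. The use-case names are assumptions;
    /// anything unrecognized falls back to automatic segmentation.
    pub fn for_use_case(use_case: &str) -> PageSegMode {
        match use_case {
            "paragraph" => PageSegMode::SingleBlock,
            "form_field" => PageSegMode::SingleLine,
            "receipt" => PageSegMode::SparseText,
            _ => PageSegMode::Auto,
        }
    }
}
```

Keeping the mode as an enum rather than a raw integer means an invalid PSM value cannot reach the backend.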
16
.ai-rulez/domains/ocr-integration/agents/ocr-engineer.md
Normal file
@@ -0,0 +1,16 @@
---
name: ocr-engineer
description: OCR pipeline development, backend integration, and table reconstruction
model: haiku
---

When working on OCR code:

1. Key source paths: crates/kreuzberg/src/ocr/ (processor.rs, tesseract_backend.rs, hocr.rs, cache.rs, language_registry.rs, table/)
2. The OCR pipeline: Image Detection -> Preprocessing (denoise, deskew, binarize) -> Backend Selection -> OCR Execution -> hOCR Parsing -> Table Reconstruction -> Caching -> Return
3. Backends: Tesseract (default, native C FFI via leptess), PaddleOCR (ONNX via ort), EasyOCR (Python via PyO3)
4. For Python backends: use tokio::task::spawn_blocking, minimize GIL hold time with py.allow_threads(), cache Python data in Rust fields
5. For table detection: detect via line/cell boundary detection, validate grid structure, OCR each cell, output as markdown
6. For language management: validate against LanguageRegistry, check tessdata availability
7. Cache OCR results with key = hash(image_bytes + language + config)
8. hOCR parsing: use the hocr module to extract word-level bounding boxes and confidence scores
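For step 8, Tesseract's hOCR output carries each word's geometry and confidence in the element's `title` attribute, for example `bbox 36 92 96 116; x_wconf 95`. The sketch below parses only that attribute string. `WordBox` and `parse_hocr_title` are hypothetical names chosen for illustration and are not the API of the crate's `hocr` module.

```rust
/// Bounding box and confidence for one recognized word.
/// Hypothetical type for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct WordBox {
    pub x0: u32,
    pub y0: u32,
    pub x1: u32,
    pub y1: u32,
    /// Word confidence (0-100) from `x_wconf`, if present.
    pub confidence: Option<f32>,
}

/// Parse an hOCR `title` attribute such as "bbox 36 92 96 116; x_wconf 95".
/// Returns `None` when no well-formed `bbox` property is found.
pub fn parse_hocr_title(title: &str) -> Option<WordBox> {
    let mut bbox = None;
    let mut confidence = None;
    // Properties are separated by ';', each one is "name value value ...".
    for prop in title.split(';') {
        let mut parts = prop.split_whitespace();
        match parts.next() {
            Some("bbox") => {
                let nums: Vec<u32> = parts.filter_map(|p| p.parse().ok()).collect();
                if nums.len() == 4 {
                    bbox = Some((nums[0], nums[1], nums[2], nums[3]));
                }
            }
            Some("x_wconf") => {
                confidence = parts.next().and_then(|p| p.parse::<f32>().ok());
            }
            _ => {} // ignore properties this sketch does not use
        }
    }
    let (x0, y0, x1, y1) = bbox?;
    Some(WordBox { x0, y0, x1, y1, confidence })
}
```

Extracting the `title` attribute from the hOCR markup is left to an HTML/XML parser, which this sketch assumes has already run.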
@@ -0,0 +1,11 @@
---
priority: critical
---

- Pluggable backend architecture: all backends implement the OcrBackend trait
- Backend independence: switching backends must not require API changes
- Tesseract is the default backend (native C FFI via leptess)
- Python backends (EasyOCR, PaddleOCR): use tokio::task::spawn_blocking, release GIL for Rust work
- Graceful degradation: if preferred backend unavailable, fall back to next available
- All backends must return structured results with confidence scores
- Document installation requirements and troubleshooting for each backend
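The pluggable-backend and graceful-degradation rules above could look roughly like the following. The rules name an `OcrBackend` trait, but its methods are not shown in this commit, so the signatures here (`name`, `is_available`, `recognize`), the `OcrResult` fields, and `StubBackend` are all assumptions made for illustration.

```rust
/// Structured result with a confidence score (fields are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct OcrResult {
    pub text: String,
    pub confidence: f32,
    pub backend: String,
}

/// Assumed shape of the backend trait; the real trait may differ.
pub trait OcrBackend {
    fn name(&self) -> &str;
    fn is_available(&self) -> bool;
    fn recognize(&self, image: &[u8], language: &str) -> Result<OcrResult, String>;
}

/// Try backends in priority order and use the first one that is available.
pub fn run_with_fallback(
    backends: &[Box<dyn OcrBackend>],
    image: &[u8],
    language: &str,
) -> Result<OcrResult, String> {
    for backend in backends {
        if backend.is_available() {
            return backend.recognize(image, language);
        }
    }
    Err("no OCR backend available".to_string())
}

/// Stand-in backend used only to exercise the fallback logic.
pub struct StubBackend {
    pub name: &'static str,
    pub available: bool,
}

impl OcrBackend for StubBackend {
    fn name(&self) -> &str {
        self.name
    }
    fn is_available(&self) -> bool {
        self.available
    }
    fn recognize(&self, image: &[u8], _language: &str) -> Result<OcrResult, String> {
        Ok(OcrResult {
            text: format!("{} bytes", image.len()),
            confidence: 1.0,
            backend: self.name.to_string(),
        })
    }
}
```

Because callers only see `OcrBackend` and `OcrResult`, reordering or swapping backends in the priority list does not change the calling code.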
@@ -0,0 +1,9 @@
---
priority: medium
---

- Validate language packs exist before OCR execution; fail fast with helpful message
- Support ISO 639 language codes, map to backend-specific formats
- Configuration cascade: CLI args > environment > config file > defaults
- Provide troubleshooting guides for common issues (missing tessdata, backend not found)
- Language pack installation: document per-platform instructions
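The configuration cascade above (CLI args > environment > config file > defaults) can be sketched with one optional layer per source, resolved field by field. The struct names, the fields, and the default values other than the Tesseract backend are assumptions for illustration.

```rust
/// One configuration source; `None` means "not set at this level".
/// Hypothetical type for illustration.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct OcrConfigLayer {
    pub backend: Option<String>,
    pub language: Option<String>,
    pub psm: Option<u8>,
}

/// Fully resolved configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct OcrConfig {
    pub backend: String,
    pub language: String,
    pub psm: u8,
}

impl Default for OcrConfig {
    fn default() -> Self {
        OcrConfig {
            backend: "tesseract".to_string(), // default backend per the rules
            language: "eng".to_string(),      // assumed default
            psm: 3,                           // assumed default (automatic segmentation)
        }
    }
}

/// Resolve each field with precedence CLI > environment > config file > defaults.
pub fn resolve(cli: &OcrConfigLayer, env: &OcrConfigLayer, file: &OcrConfigLayer) -> OcrConfig {
    let defaults = OcrConfig::default();
    OcrConfig {
        backend: cli
            .backend
            .clone()
            .or_else(|| env.backend.clone())
            .or_else(|| file.backend.clone())
            .unwrap_or(defaults.backend),
        language: cli
            .language
            .clone()
            .or_else(|| env.language.clone())
            .or_else(|| file.language.clone())
            .unwrap_or(defaults.language),
        psm: cli.psm.or(env.psm).or(file.psm).unwrap_or(defaults.psm),
    }
}
```

Resolving per field rather than per layer lets a CLI flag override one setting while the rest still come from the environment or the config file.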
10
.ai-rulez/domains/ocr-integration/rules/ocr-performance.md
Normal file
@@ -0,0 +1,10 @@
---
priority: high
---

- Cache OCR results: key = hash(image_bytes + language + config)
- Invalidate cache when OCR config changes (backend, language, PSM mode)
- Batch processing: process multiple images concurrently with configurable parallelism
- Resource management: limit concurrent OCR operations to avoid memory exhaustion
- Performance targets: <2s for single page, <10s for 10-page document
- Monitor and log OCR processing times for regression detection
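One way to read the first two rules together: if the backend, language, and PSM mode are part of the key, a config change produces a different key and stale entries are never returned. The sketch below assumes that "config" reduces to those three fields; `cache_key` is a hypothetical helper, and it uses the standard library's `DefaultHasher`, which is not guaranteed to be stable across Rust releases.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a cache key from the image bytes and the OCR settings that affect output.
/// Hypothetical helper; a persistent cache would likely want a stable digest
/// (for example SHA-256) instead of `DefaultHasher`.
pub fn cache_key(image_bytes: &[u8], backend: &str, language: &str, psm: u8) -> u64 {
    let mut hasher = DefaultHasher::new();
    // `Hash` for slices and strings writes length/terminator markers,
    // so adjacent fields cannot run together ambiguously.
    image_bytes.hash(&mut hasher);
    backend.hash(&mut hasher);
    language.hash(&mut hasher);
    psm.hash(&mut hasher);
    hasher.finish()
}
```

Entries written under an old configuration are then simply never hit and can be evicted by age or size.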
10
.ai-rulez/domains/ocr-integration/rules/ocr-quality.md
Normal file
@@ -0,0 +1,10 @@
---
priority: high
---

- Track confidence scores on all OCR results and expose them in the API
- Image preprocessing (denoise, deskew, binarize) should improve accuracy by 10-30%
- PSM mode selection: auto-detect layout, allow user override (single block, single line, sparse text, etc.)
- Language detection: validate requested languages are available, provide install hints if not
- Multi-language support: allow multiple languages per OCR request
- Test OCR accuracy against ground-truth documents in CI
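The language-validation and multi-language rules above can be sketched as a single step that maps requested codes, checks that each pack is installed, and builds Tesseract's `+`-joined language string (for example `eng+deu`). The function names and the tiny mapping table are illustrative assumptions; the crate's LanguageRegistry is not shown in this commit, and the list of installed packs is passed in rather than read from disk.

```rust
/// Map a two-letter ISO 639-1 code to Tesseract's three-letter code.
/// Deliberately tiny table for illustration only.
pub fn to_tesseract_code(iso: &str) -> Option<&'static str> {
    match iso {
        "en" => Some("eng"),
        "de" => Some("deu"),
        "fr" => Some("fra"),
        "es" => Some("spa"),
        _ => None,
    }
}

/// Validate every requested language before OCR runs and build the
/// `+`-joined language argument. Fails fast with an install hint.
pub fn build_language_arg(requested: &[&str], installed: &[&str]) -> Result<String, String> {
    if requested.is_empty() {
        return Err("no language requested".to_string());
    }
    let mut codes = Vec::new();
    for iso in requested {
        let code = to_tesseract_code(iso)
            .ok_or_else(|| format!("unsupported language code '{}'", iso))?;
        if !installed.contains(&code) {
            return Err(format!(
                "language pack '{}' is not installed; add {}.traineddata to your tessdata directory",
                code, code
            ));
        }
        codes.push(code);
    }
    Ok(codes.join("+"))
}
```

Failing before any backend is invoked keeps a missing-pack error distinct from a recognition failure.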
@@ -0,0 +1,10 @@
---
priority: high
---

- hOCR parsing: extract word-level bounding boxes, confidence scores, and text content
- Preserve spatial relationships from hOCR output for layout reconstruction
- Table detection: use cell boundary detection (line detection + intersection analysis)
- Validate grid structure before treating detected regions as tables
- OCR each cell individually for better accuracy
- Convert tables to markdown format with proper column alignment
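The last three rules above can be sketched as one function that takes per-cell OCR text, rejects ragged grids, and emits a padded Markdown table. This is an illustrative sketch, not the crate's table module: `to_markdown_table` and `render_row` are hypothetical names, the first row is assumed to be the header, and "column alignment" is interpreted here as padding cells to a common width.

```rust
/// Render one table row from already-padded cells.
fn render_row(cells: &[String]) -> String {
    format!("| {} |", cells.join(" | "))
}

/// Convert a grid of cell texts into a Markdown table.
/// Returns `None` if the grid is empty or rows differ in length,
/// in which case the region should not be treated as a table.
pub fn to_markdown_table(rows: &[Vec<String>]) -> Option<String> {
    let cols = rows.first()?.len();
    if cols == 0 || rows.iter().any(|r| r.len() != cols) {
        return None;
    }
    // Escape pipes so cell text cannot break the table structure.
    let escaped: Vec<Vec<String>> = rows
        .iter()
        .map(|r| r.iter().map(|c| c.trim().replace('|', "\\|")).collect())
        .collect();
    // Column width = widest cell, with a minimum of 3 for the "---" separator.
    let mut widths = vec![3usize; cols];
    for row in &escaped {
        for (i, cell) in row.iter().enumerate() {
            widths[i] = widths[i].max(cell.chars().count());
        }
    }
    let mut lines = Vec::new();
    for (r, row) in escaped.iter().enumerate() {
        let cells: Vec<String> = row
            .iter()
            .enumerate()
            .map(|(i, c)| format!("{:<width$}", c, width = widths[i]))
            .collect();
        lines.push(render_row(&cells));
        if r == 0 {
            // Separator row after the header.
            let sep: Vec<String> = widths.iter().map(|w| "-".repeat(*w)).collect();
            lines.push(render_row(&sep));
        }
    }
    Some(lines.join("\n"))
}
```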