12
.ai-rulez/domains/document-extraction/DOMAIN.md
Normal file
@@ -0,0 +1,12 @@
---
description: Document extraction pipeline architecture
---

- Pipeline: file input → MIME detection (magic bytes + extension) → extractor routing → extraction → post-processing → ExtractionResult
- Extractors are plugins implementing the Extractor trait: extract(&self, source: &ExtractionSource) → ExtractionResult
- Fallback chains: if primary extractor fails, try next in priority order (e.g., native PDF → Tesseract OCR → error)
- Cache-first: check extraction cache before running extractors, cache results keyed by content hash
- ExtractionResult contains: text content, metadata (page count, language, confidence), optional structured data (tables, images)
- Async-first: all extraction paths are async, use spawn_blocking for CPU-bound work (OCR, image processing)
- Memory limits: streaming for large files, configurable max file size, depth limits for nested archives
- Format coverage: 91+ formats — PDF, DOCX, XLSX, PPTX, HTML, images, email (EML/MSG), archives, plain text
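The fallback-chain rule above can be sketched roughly as follows. This is a minimal synchronous illustration, not the crate's async trait: `SketchExtractor`, `Extracted`, `native_text`, and `fake_ocr` are hypothetical names invented for the example.

```rust
/// Hypothetical, simplified output; the real result type also carries metadata.
#[derive(Debug, PartialEq)]
pub struct Extracted {
    pub text: String,
    pub extractor: &'static str,
}

/// Hypothetical synchronous stand-in for an extractor plugin.
pub struct SketchExtractor {
    pub name: &'static str,
    pub priority: u8,
    pub run: fn(&[u8]) -> Result<String, String>,
}

/// Try extractors from highest to lowest priority and return the first success.
/// If every extractor fails, return all failures so the full chain can be reported.
pub fn extract_with_fallback(
    extractors: &[SketchExtractor],
    bytes: &[u8],
) -> Result<Extracted, Vec<String>> {
    let mut ordered: Vec<&SketchExtractor> = extractors.iter().collect();
    ordered.sort_by(|a, b| b.priority.cmp(&a.priority));

    let mut failures = Vec::new();
    for extractor in ordered {
        match (extractor.run)(bytes) {
            Ok(text) => return Ok(Extracted { text, extractor: extractor.name }),
            Err(cause) => failures.push(format!("{}: {}", extractor.name, cause)),
        }
    }
    Err(failures)
}

/// Toy "native" extractor: fails when the input has no readable text.
pub fn native_text(bytes: &[u8]) -> Result<String, String> {
    match std::str::from_utf8(bytes) {
        Ok(s) if !s.trim().is_empty() => Ok(s.to_string()),
        _ => Err("no text layer".to_string()),
    }
}

/// Toy "OCR" extractor that always succeeds with a placeholder.
pub fn fake_ocr(_bytes: &[u8]) -> Result<String, String> {
    Ok("<ocr text>".to_string())
}
```

Collecting every failure, instead of keeping only the last one, is one way to satisfy the "structured error info" requirement when the whole chain is exhausted.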
@@ -0,0 +1,16 @@
---
name: extraction-engineer
description: Document extraction pipeline development and maintenance
model: haiku
---

When working on document extraction code:

1. Key source paths: crates/kreuzberg/src/core/ (extractor.rs, mime.rs, config.rs), crates/kreuzberg/src/extraction/
2. The extraction pipeline: Input -> Cache Check -> MIME Detection -> Format Conversion -> Extractor Selection (priority-based) -> Extraction -> Fallback Chain -> Post-Processing -> Caching -> Output
3. For MIME detection: use EXT_TO_MIME map + magic bytes fallback via infer crate. Always validate_mime_type() before extraction.
4. For caching: keys based on content hash, invalidate on config changes
5. For errors: implement fallback chains (try next-priority extractor), preserve partial results, return structured error info
6. For new formats: add to EXT_TO_MIME, implement DocumentExtractor trait, register in register_default_extractors()
7. Always use SecurityLimits validators for user content (ZipBombValidator, DepthValidator, StringGrowthValidator)
8. Run `task test` after changes. Target 95% coverage on core extraction code.
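Step 3 above (extension map first, magic bytes as fallback) could look something like the sketch below. It is an assumption-laden illustration: the extension table is a tiny hypothetical subset rather than the real EXT_TO_MIME, and the hand-rolled signature check stands in for the `infer` crate so the example needs only the standard library.

```rust
use std::path::Path;

/// Illustrative subset of an extension table (hypothetical contents).
fn mime_from_extension(path: &Path) -> Option<&'static str> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "png" => Some("image/png"),
        "zip" => Some("application/zip"),
        "txt" => Some("text/plain"),
        _ => None,
    }
}

/// Hand-rolled magic-byte check, standing in for the `infer` crate.
fn mime_from_magic(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else if bytes.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some("image/png")
    } else if bytes.starts_with(b"PK\x03\x04") {
        Some("application/zip")
    } else {
        None
    }
}

/// Extension lookup first; fall back to content sniffing when the
/// extension is missing or unknown.
pub fn detect_mime(path: &Path, bytes: &[u8]) -> Option<&'static str> {
    mime_from_extension(path).or_else(|| mime_from_magic(bytes))
}
```

A `None` result is the case where a caller would reject the input before any extractor runs.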
@@ -0,0 +1,9 @@
---
priority: high
---

- Follow semantic versioning — breaking changes require major version bump
- Document all public API changes in CHANGELOG.md
- Maintain backward compatibility for at least one minor version before removing deprecated APIs
- All public types must be FFI-friendly or have FFI-compatible equivalents
- Version in Cargo.toml is the single source of truth for all binding packages
@@ -0,0 +1,9 @@
---
priority: high
---

- All extraction paths must be fully async using tokio
- Never block the async runtime — use spawn_blocking for CPU-intensive work
- All public types must be Send + Sync
- Use tokio::select! for timeout handling on extraction operations
- Cross-platform: test on Linux (amd64, arm64) and macOS at minimum
@@ -0,0 +1,10 @@
---
priority: high
---

- Cache keys: content-hash based (hash of file bytes + config), not path-based
- Invalidate cache when extraction config changes (output format, OCR settings, etc.)
- Check cache before any extraction — cache hits should skip all processing
- Concurrent batch processing: use configurable worker pool, default to CPU count
- Stream large files instead of loading into memory — use AsyncRead where possible
- Monitor cache hit rates — target >80% for repeated extractions
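One possible shape for a content-hash cache key that also covers the config, as the first two rules above require. `ExtractionConfig` here is a hypothetical three-field struct, and `DefaultHasher` is used only to keep the sketch dependency-free; its output is not guaranteed stable across Rust releases, so a persistent cache would presumably want a fixed algorithm instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical subset of settings that change extraction output.
#[derive(Hash)]
pub struct ExtractionConfig {
    pub output_format: String,
    pub ocr_language: String,
    pub force_ocr: bool,
}

/// Key derived from file bytes plus config, never from the file path.
/// Changing any config field changes the key, which invalidates old entries
/// implicitly: they are simply never looked up again.
pub fn cache_key(file_bytes: &[u8], config: &ExtractionConfig) -> u64 {
    let mut hasher = DefaultHasher::new();
    file_bytes.hash(&mut hasher);
    config.hash(&mut hasher);
    hasher.finish()
}
```

Folding the config into the key means the same file copied to two paths shares one entry, while the same file extracted with different OCR settings does not.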
@@ -0,0 +1,10 @@
---
priority: high
---

- 95% test coverage on core extraction code, 80% on bindings
- Test all format categories: text, office, PDF, images, archives, markup
- Test corrupted/malformed documents — extraction must fail gracefully, never panic
- Benchmark extraction speeds per format — track regressions in CI
- Test both success and error paths for every extractor
- Use property-based testing for parsers with wide input ranges
@@ -0,0 +1,10 @@
---
priority: critical
---

- Always use `SecurityLimits` to cap archive size, compression ratio, file count, and nesting depth for user content. Use `ZipBombValidator` for archive extraction.
- Validate MIME type before extraction — never trust file extensions alone
- Implement fallback chains: if primary extractor fails, try next-priority extractor
- Preserve partial results on failure — return what was extracted with error context
- All errors must include: operation name, input description, root cause, and suggestion
- Never expose internal file paths or system details in error messages returned to users
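To make the four caps in the first rule concrete, here is a rough sketch of the kind of check involved. `ArchiveLimits`, `ArchiveStats`, `LimitViolation`, and `check_archive` are hypothetical names for illustration; they are not the fields or API of the project's `SecurityLimits` or `ZipBombValidator`.

```rust
/// Hypothetical limits; field names are illustrative only.
pub struct ArchiveLimits {
    pub max_uncompressed_bytes: u64,
    pub max_compression_ratio: u64,
    pub max_entries: usize,
    pub max_depth: usize,
}

/// Hypothetical summary of an archive, gathered from its headers.
pub struct ArchiveStats {
    pub compressed_bytes: u64,
    pub uncompressed_bytes: u64,
    pub entries: usize,
    pub depth: usize,
}

#[derive(Debug, PartialEq)]
pub enum LimitViolation {
    TooDeep,
    TooManyEntries,
    TooLarge,
    SuspiciousRatio,
}

/// Reject an archive before unpacking if any cap is exceeded.
pub fn check_archive(stats: &ArchiveStats, limits: &ArchiveLimits) -> Result<(), LimitViolation> {
    if stats.depth > limits.max_depth {
        return Err(LimitViolation::TooDeep);
    }
    if stats.entries > limits.max_entries {
        return Err(LimitViolation::TooManyEntries);
    }
    if stats.uncompressed_bytes > limits.max_uncompressed_bytes {
        return Err(LimitViolation::TooLarge);
    }
    // Compare by multiplication so an empty archive never divides by zero.
    let allowed = stats.compressed_bytes.saturating_mul(limits.max_compression_ratio);
    if stats.uncompressed_bytes > allowed {
        return Err(LimitViolation::SuspiciousRatio);
    }
    Ok(())
}
```

Declared sizes in archive headers can be forged, so a real validator would likely also count bytes while streaming; this sketch covers only the up-front check.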
13
.ai-rulez/domains/ocr-integration/DOMAIN.md
Normal file
@@ -0,0 +1,13 @@
---
description: OCR backend integration and image processing
---

- Multiple backends: Tesseract (C FFI via leptonica/tesseract-sys), PaddleOCR (ONNX Runtime), Python backends (EasyOCR, Surya) via FFI
- Backend selection: priority-based with fallback — Tesseract default, PaddleOCR for CJK, Python backends as fallback
- Image preprocessing: deskew, binarization, noise removal, contrast enhancement — applied before OCR
- PSM modes: configurable page segmentation (single block, single line, sparse text) per use case
- Table detection: identify table regions → cell extraction → row/column reconstruction → Markdown table output
- hOCR: parse Tesseract hOCR output for word-level bounding boxes, confidence scores, reading order
- Language management: auto-detect document language, load appropriate Tesseract traineddata, support multi-language documents
- Caching: cache OCR results by image hash + backend + language + PSM mode
- Confidence tracking: per-word and per-page confidence scores, flag low-confidence regions for review
16
.ai-rulez/domains/ocr-integration/agents/ocr-engineer.md
Normal file
@@ -0,0 +1,16 @@
---
name: ocr-engineer
description: OCR pipeline development, backend integration, and table reconstruction
model: haiku
---

When working on OCR code:

1. Key source paths: crates/kreuzberg/src/ocr/ (processor.rs, tesseract_backend.rs, hocr.rs, cache.rs, language_registry.rs, table/)
2. The OCR pipeline: Image Detection -> Preprocessing (denoise, deskew, binarize) -> Backend Selection -> OCR Execution -> hOCR Parsing -> Table Reconstruction -> Caching -> Return
3. Backends: Tesseract (default, native C FFI via leptess), PaddleOCR (ONNX via ort), EasyOCR (Python via PyO3)
4. For Python backends: use tokio::task::spawn_blocking, minimize GIL hold time with py.allow_threads(), cache Python data in Rust fields
5. For table detection: detect via line/cell boundary detection, validate grid structure, OCR each cell, output as markdown
6. For language management: validate against LanguageRegistry, check tessdata availability
7. Cache OCR results with key = hash(image_bytes + language + config)
8. hOCR parsing: use the hocr module to extract word-level bounding boxes and confidence scores
@@ -0,0 +1,11 @@
---
priority: critical
---

- Pluggable backend architecture: all backends implement the OcrBackend trait
- Backend independence: switching backends must not require API changes
- Tesseract is the default backend (native C FFI via leptess)
- Python backends (EasyOCR, PaddleOCR): use tokio::task::spawn_blocking, release GIL for Rust work
- Graceful degradation: if preferred backend unavailable, fall back to next available
- All backends must return structured results with confidence scores
- Document installation requirements and troubleshooting for each backend
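The graceful-degradation rule above might be sketched as below. The `OcrBackend` trait name comes from the rules, but its methods are not shown there, so `name`, `is_available`, and `recognize` are assumed signatures; `OcrOutput` and `StubBackend` are likewise hypothetical.

```rust
/// Hypothetical structured result with a confidence score.
pub struct OcrOutput {
    pub text: String,
    pub confidence: f32,
}

/// Assumed trait shape; the real OcrBackend methods may differ.
pub trait OcrBackend {
    fn name(&self) -> &str;
    fn is_available(&self) -> bool;
    fn recognize(&self, image: &[u8]) -> Result<OcrOutput, String>;
}

/// Walk backends in preference order and return the first usable one.
pub fn first_available(backends: &[Box<dyn OcrBackend>]) -> Option<&dyn OcrBackend> {
    for backend in backends {
        if backend.is_available() {
            return Some(backend.as_ref());
        }
    }
    None
}

/// Test double used to exercise the selection logic.
pub struct StubBackend {
    pub name: &'static str,
    pub available: bool,
}

impl OcrBackend for StubBackend {
    fn name(&self) -> &str {
        self.name
    }
    fn is_available(&self) -> bool {
        self.available
    }
    fn recognize(&self, _image: &[u8]) -> Result<OcrOutput, String> {
        Ok(OcrOutput { text: format!("text from {}", self.name), confidence: 0.9 })
    }
}
```

Because callers only ever see `&dyn OcrBackend`, swapping the preferred backend changes the list order and nothing else, which is the "no API changes" property the rules ask for.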
@@ -0,0 +1,9 @@
---
priority: medium
---

- Validate language packs exist before OCR execution — fail fast with helpful message
- Support ISO 639 language codes, map to backend-specific formats
- Configuration cascade: CLI args > environment > config file > defaults
- Provide troubleshooting guides for common issues (missing tessdata, backend not found)
- Language pack installation: document per-platform instructions
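The configuration cascade above reduces to a chain of `Option::or` calls. A small sketch, assuming each layer has already been parsed into an `Option`; the `resolve` helper is hypothetical:

```rust
/// Resolve one setting using the precedence
/// CLI args > environment > config file > defaults.
pub fn resolve<T>(cli: Option<T>, env: Option<T>, file: Option<T>, default: T) -> T {
    cli.or(env).or(file).unwrap_or(default)
}
```

For example, `resolve(None, Some("fra"), Some("spa"), "eng")` picks the environment value because no CLI argument was given.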
10
.ai-rulez/domains/ocr-integration/rules/ocr-performance.md
Normal file
@@ -0,0 +1,10 @@
---
priority: high
---

- Cache OCR results: key = hash(image_bytes + language + config)
- Invalidate cache when OCR config changes (backend, language, PSM mode)
- Batch processing: process multiple images concurrently with configurable parallelism
- Resource management: limit concurrent OCR operations to avoid memory exhaustion
- Performance targets: <2s for single page, <10s for 10-page document
- Monitor and log OCR processing times for regression detection
10
.ai-rulez/domains/ocr-integration/rules/ocr-quality.md
Normal file
@@ -0,0 +1,10 @@
---
priority: high
---

- Track confidence scores on all OCR results — expose in API
- Image preprocessing (denoise, deskew, binarize) should improve accuracy by 10-30%
- PSM mode selection: auto-detect layout, allow user override (single block, single line, sparse text, etc.)
- Language detection: validate requested languages are available, provide install hints if not
- Multi-language support: allow multiple languages per OCR request
- Test OCR accuracy against ground-truth documents in CI
@@ -0,0 +1,10 @@
---
priority: high
---

- hOCR parsing: extract word-level bounding boxes, confidence scores, and text content
- Preserve spatial relationships from hOCR output for layout reconstruction
- Table detection: use cell boundary detection (line detection + intersection analysis)
- Validate grid structure before treating detected regions as tables
- OCR each cell individually for better accuracy
- Convert tables to markdown format with proper column alignment
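The last rule above (markdown output with aligned columns) can be illustrated with a small renderer. This sketch assumes cell text has already been recognized and grouped into rows, treats the first row as the header, and pads short rows with empty cells; `to_markdown_table` and `render_row` are hypothetical helpers.

```rust
/// Render one row, left-aligning each cell to its column width.
fn render_row(cells: &[&str], widths: &[usize]) -> String {
    let mut line = String::from("|");
    for (i, width) in widths.iter().enumerate() {
        // Rows shorter than the widest row get empty trailing cells.
        let cell = cells.get(i).copied().unwrap_or("");
        line.push_str(&format!(" {:<width$} |", cell, width = *width));
    }
    line
}

/// Convert rows of cell text into a Markdown table; the first row is the header.
pub fn to_markdown_table(rows: &[Vec<&str>]) -> String {
    let cols = rows.iter().map(|r| r.len()).max().unwrap_or(0);
    if cols == 0 {
        return String::new();
    }
    // Column width is the widest cell, with a minimum of 3 for the `---` separator.
    let mut widths = vec![3usize; cols];
    for row in rows {
        for (i, cell) in row.iter().enumerate() {
            widths[i] = widths[i].max(cell.chars().count());
        }
    }
    let dashes: Vec<String> = widths.iter().map(|w| "-".repeat(*w)).collect();
    let dash_refs: Vec<&str> = dashes.iter().map(|s| s.as_str()).collect();

    let mut lines = vec![render_row(&rows[0], &widths), render_row(&dash_refs, &widths)];
    for row in &rows[1..] {
        lines.push(render_row(row, &widths));
    }
    lines.join("\n")
}
```

Widths are counted in `char`s, so wide CJK glyphs would still look misaligned in a monospace view; handling display width is left out of this sketch.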
13
.ai-rulez/domains/plugin-system/DOMAIN.md
Normal file
@@ -0,0 +1,13 @@
---
description: Plugin trait system and Python FFI integration
---

- Core traits: Extractor, PostProcessor, MetadataExtractor — each with async extract/process methods returning Result
- Discovery: static registration (Rust plugins compiled in) + dynamic discovery (Python plugins via PyO3 FFI)
- Priority selection: plugins declare priority per MIME type, registry selects highest-priority match, fallback to next
- Registry: PluginRegistry holds all discovered plugins, provides lookup by MIME type, supports hot-reload for Python plugins
- Python FFI: Python plugins implement a Python class matching the trait interface, called via PyO3 with GIL management
- GIL management: acquire GIL only for Python calls, release immediately after, use py.allow_threads() for Rust-side work
- Plugin lifecycle: init → register → validate → ready. Plugins validate their dependencies (e.g., Tesseract binary, Python packages) at startup
- Error handling: plugin errors are wrapped in PluginError with source plugin name, converted to ExtractionError at boundary
- Testing: test plugins with real files (not mocks), test fallback chains, test Python plugin loading/unloading
16
.ai-rulez/domains/plugin-system/agents/plugin-engineer.md
Normal file
@@ -0,0 +1,16 @@
---
name: plugin-engineer
description: Plugin system architecture, registry management, and Python FFI
model: haiku
---

When working on the plugin system:

1. Key source paths: crates/kreuzberg/src/plugins/ (mod.rs, extractor.rs, ocr.rs, postprocessor.rs, validator.rs, registry.rs), crates/kreuzberg-py/src/plugins.rs
2. Plugin types: DocumentExtractor, OcrBackend, PostProcessor, Validator — all extend base Plugin trait (Send + Sync required)
3. Priority system: 0-255, default 50, custom override > 50, fallback < 50. Registry selects highest priority for MIME type.
4. Registries use Arc<RwLock<>> with MIME type indexing for O(log n) lookup
5. Python plugins: validate protocol compliance, use py.allow_threads() for expensive Rust ops, tokio::task::spawn_blocking for async calls
6. For new plugin types: define trait extending Plugin, create typed registry, add registration functions, implement priority-based selection
7. GIL optimization: cache frequently-accessed Python data in Rust fields, measure GIL overhead
8. All plugins must handle errors gracefully — return Result, never panic
@@ -0,0 +1,10 @@
---
priority: medium
---

- API stability: plugin interfaces are versioned, breaking changes require major version bump
- Plugin discovery: support both static (compile-time) and dynamic (runtime) registration
- Plugin validation: check capabilities, supported formats, and version compatibility before registration
- Plugin chaining: post-processors can be composed in sequence
- Configuration: plugins accept typed configuration, validated at registration time
- Documentation: every plugin type must have a development guide with examples
@@ -0,0 +1,10 @@
---
priority: critical
---

- All plugins must implement the base Plugin trait: Send + Sync + 'static required
- Plugin types: DocumentExtractor, OcrBackend, PostProcessor, Validator
- Async execution: use async trait methods for non-blocking operations
- Lifecycle: init() -> process() -> cleanup(). Init must validate all requirements.
- Never panic in plugin code — all errors must be returned as Result
- Consistent result format: all extractors return ExtractionResult with text, metadata, and confidence
@@ -0,0 +1,12 @@
---
priority: critical
---

- Separate typed registry per plugin type (ExtractorRegistry, OcrRegistry, etc.)
- Thread safety: Arc<RwLock<>> for all registries
- Priority system: 0-255, default 50, custom > 50, fallback < 50
- Selection: highest priority plugin matching the MIME type wins
- MIME type indexing for O(log n) lookup
- Conflict resolution: if equal priority, prefer Rust-native over FFI plugins
- Dynamic registration: plugins can be added/removed at runtime
- Validate plugin before registration (check trait compliance, supported formats)
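A compact sketch of how the registry rules above could fit together: `u8` priorities (0-255), a `BTreeMap` index for O(log n) MIME lookup, a native-over-FFI tie-break, and `Arc<RwLock<_>>` for sharing. `Registry`, `PluginEntry`, and `SharedRegistry` are hypothetical names, and the `BTreeMap` is an assumption since the rules give only the lookup bound, not the index structure.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

pub const DEFAULT_PRIORITY: u8 = 50;

/// Hypothetical registry entry; a real one would hold the plugin itself.
#[derive(Debug, Clone, PartialEq)]
pub struct PluginEntry {
    pub name: String,
    pub priority: u8,
    /// true for Rust-native plugins, false for FFI-backed ones.
    pub native: bool,
}

#[derive(Default)]
pub struct Registry {
    by_mime: BTreeMap<String, Vec<PluginEntry>>,
}

impl Registry {
    /// Add a plugin and keep each MIME bucket sorted best-first.
    pub fn register(&mut self, mime: &str, entry: PluginEntry) {
        let entries = self.by_mime.entry(mime.to_string()).or_default();
        entries.push(entry);
        // Highest priority first; on ties, native plugins sort before FFI ones.
        entries.sort_by(|a, b| {
            b.priority.cmp(&a.priority).then(b.native.cmp(&a.native))
        });
    }

    /// Remove a plugin by name at runtime.
    pub fn unregister(&mut self, mime: &str, name: &str) {
        if let Some(entries) = self.by_mime.get_mut(mime) {
            entries.retain(|e| e.name != name);
        }
    }

    /// Best plugin for a MIME type, if any is registered.
    pub fn select(&self, mime: &str) -> Option<&PluginEntry> {
        self.by_mime.get(mime)?.first()
    }
}

/// Many concurrent readers, exclusive writers.
pub type SharedRegistry = Arc<RwLock<Registry>>;
```

Because each bucket stays sorted, the entries after the first are already the fallback order for that MIME type.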
10
.ai-rulez/domains/plugin-system/rules/plugin-testing.md
Normal file
@@ -0,0 +1,10 @@
---
priority: high
---

- Mock plugin testing: create test doubles for unit tests
- Real plugin testing: integration tests with actual backends
- Thread safety tests: run concurrent plugin operations to detect race conditions
- Performance baselines: measure and track plugin overhead vs direct calls
- Test all error paths: invalid input, backend failure, timeout, resource exhaustion
- Test plugin lifecycle: register, use, unregister, verify cleanup
11
.ai-rulez/domains/plugin-system/rules/python-ffi-plugins.md
Normal file
@@ -0,0 +1,11 @@
---
priority: high
---

- GIL management: use py.allow_threads() for expensive Rust operations
- Cache frequently-accessed Python data in Rust fields to minimize GIL acquisitions
- Use tokio::task::spawn_blocking for async calls to Python backends
- Python exception translation: convert Python exceptions to Rust errors with full context
- Data type mapping: Python str <-> Rust String, Python bytes <-> Rust Vec<u8>, Python dict <-> Rust HashMap
- Validate Python plugin protocol compliance on registration
- Target GIL overhead: 5-55us per acquisition