fil/.ai-rulez/domains/document-extraction/agents/extraction-engineer.md at b4c07d36934823e7b674ed498e966d1583a7b4bc

Henrik Jess Nielsen b4c07d3693

Nomad changes

2026-06-01 23:40:55 +02:00

name	description	model
extraction-engineer	Document extraction pipeline development and maintenance	haiku

When working on document extraction code:

Key source paths: crates/kreuzberg/src/core/ (extractor.rs, mime.rs, config.rs), crates/kreuzberg/src/extraction/
The extraction pipeline: Input -> Cache Check -> MIME Detection -> Format Conversion -> Extractor Selection (priority-based) -> Extraction -> Fallback Chain -> Post-Processing -> Caching -> Output
For MIME detection: use EXT_TO_MIME map + magic bytes fallback via infer crate. Always validate_mime_type() before extraction.
For caching: derive keys from the content hash and invalidate entries when the config changes.
For errors: implement fallback chains (try next-priority extractor), preserve partial results, return structured error info
For new formats: add to EXT_TO_MIME, implement DocumentExtractor trait, register in register_default_extractors()
Always use SecurityLimits validators for user content (ZipBombValidator, DepthValidator, StringGrowthValidator)
After changes, run task test. Target 95% coverage on core extraction code.
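
The MIME detection and fallback-chain rules above can be sketched as follows. This is a minimal, std-only illustration and not the kreuzberg implementation: `ext_to_mime`, `detect_mime`, `Extractor`, `Outcome`, and `extract_with_fallback` are hypothetical names, the `Extractor` trait is a simplified stand-in for `DocumentExtractor`, and a single hand-written PDF signature check stands in for the `infer` crate's magic-bytes lookup.

```rust
use std::cmp::Reverse;
use std::collections::HashMap;

/// Hypothetical stand-in for the EXT_TO_MIME map.
fn ext_to_mime() -> HashMap<&'static str, &'static str> {
    HashMap::from([("pdf", "application/pdf"), ("txt", "text/plain")])
}

/// Extension lookup first, then a magic-bytes fallback.
/// Assumption: one PDF signature check replaces the `infer` crate here.
fn detect_mime(file_name: &str, bytes: &[u8]) -> Option<&'static str> {
    let table = ext_to_mime();
    if let Some((_, ext)) = file_name.rsplit_once('.') {
        if let Some(mime) = table.get(ext.to_ascii_lowercase().as_str()) {
            return Some(*mime);
        }
    }
    if bytes.starts_with(b"%PDF-") {
        return Some("application/pdf");
    }
    None
}

/// Hypothetical, simplified extractor interface (not the real trait).
trait Extractor {
    fn name(&self) -> &'static str;
    fn priority(&self) -> i32;
    fn supports(&self, mime: &str) -> bool;
    fn extract(&self, bytes: &[u8]) -> Result<String, String>;
}

/// High-priority extractor that rejects invalid UTF-8.
struct StrictUtf8;
/// Low-priority extractor that always succeeds by replacing bad bytes.
struct LossyText;

impl Extractor for StrictUtf8 {
    fn name(&self) -> &'static str { "strict-utf8" }
    fn priority(&self) -> i32 { 100 }
    fn supports(&self, mime: &str) -> bool { mime == "text/plain" }
    fn extract(&self, bytes: &[u8]) -> Result<String, String> {
        String::from_utf8(bytes.to_vec()).map_err(|e| e.to_string())
    }
}

impl Extractor for LossyText {
    fn name(&self) -> &'static str { "lossy-text" }
    fn priority(&self) -> i32 { 10 }
    fn supports(&self, mime: &str) -> bool { mime == "text/plain" }
    fn extract(&self, bytes: &[u8]) -> Result<String, String> {
        Ok(String::from_utf8_lossy(bytes).into_owned())
    }
}

/// Structured result: the text (if any), which extractor produced it,
/// and every failure recorded along the way.
#[derive(Debug)]
struct Outcome {
    text: Option<String>,
    used: Option<&'static str>,
    errors: Vec<(&'static str, String)>,
}

/// Try matching extractors from highest to lowest priority,
/// stopping at the first success and keeping earlier errors.
fn extract_with_fallback(extractors: &[&dyn Extractor], mime: &str, bytes: &[u8]) -> Outcome {
    let mut candidates: Vec<&dyn Extractor> =
        extractors.iter().copied().filter(|e| e.supports(mime)).collect();
    candidates.sort_by_key(|e| Reverse(e.priority()));

    let mut outcome = Outcome { text: None, used: None, errors: Vec::new() };
    for extractor in candidates {
        match extractor.extract(bytes) {
            Ok(text) => {
                outcome.text = Some(text);
                outcome.used = Some(extractor.name());
                break;
            }
            Err(err) => outcome.errors.push((extractor.name(), err)),
        }
    }
    outcome
}
```

Returning the collected errors together with the text keeps the "structured error info" requirement visible to callers even when a lower-priority extractor succeeds. A real pipeline would also run MIME validation and the SecurityLimits validators before calling any extractor.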