| name | description | model |
|---|---|---|
| extraction-engineer | Document extraction pipeline development and maintenance | haiku |
When working on document extraction code:
- Key source paths: crates/kreuzberg/src/core/ (extractor.rs, mime.rs, config.rs), crates/kreuzberg/src/extraction/
- The extraction pipeline: Input -> Cache Check -> MIME Detection -> Format Conversion -> Extractor Selection (priority-based) -> Extraction -> Fallback Chain -> Post-Processing -> Caching -> Output
- For MIME detection: use EXT_TO_MIME map + magic bytes fallback via infer crate. Always validate_mime_type() before extraction.
- For caching: keys based on content hash, invalidate on config changes
- For errors: implement fallback chains (try next-priority extractor), preserve partial results, return structured error info
- For new formats: add to EXT_TO_MIME, implement DocumentExtractor trait, register in register_default_extractors()
- Always use SecurityLimits validators for user content (ZipBombValidator, DepthValidator, StringGrowthValidator)
- Run `task test` after changes. Target 95% coverage on core extraction code.
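
The MIME detection and fallback-chain bullets above can be sketched roughly as follows. This is a minimal, std-only illustration, not the crate's actual code: the `DocumentExtractor` trait shape, the `StrictUtf8` and `LossyText` extractors, `ExtractionFailure`, `detect_mime`, and `extract_with_fallback` are all hypothetical names, and the hand-rolled `%PDF-` check only stands in for the magic-bytes lookup that the real pipeline delegates to the infer crate.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the crate's DocumentExtractor trait.
trait DocumentExtractor {
    fn name(&self) -> &str;
    fn priority(&self) -> i32;
    fn supports(&self, mime: &str) -> bool;
    fn extract(&self, bytes: &[u8]) -> Result<String, String>;
}

/// High-priority extractor that rejects invalid UTF-8.
struct StrictUtf8;
/// Low-priority extractor that always succeeds on text/* input.
struct LossyText;

impl DocumentExtractor for StrictUtf8 {
    fn name(&self) -> &str { "strict-utf8" }
    fn priority(&self) -> i32 { 100 }
    fn supports(&self, mime: &str) -> bool { mime == "text/plain" }
    fn extract(&self, bytes: &[u8]) -> Result<String, String> {
        std::str::from_utf8(bytes).map(str::to_owned).map_err(|e| e.to_string())
    }
}

impl DocumentExtractor for LossyText {
    fn name(&self) -> &str { "lossy-text" }
    fn priority(&self) -> i32 { 10 }
    fn supports(&self, mime: &str) -> bool { mime.starts_with("text/") }
    fn extract(&self, bytes: &[u8]) -> Result<String, String> {
        Ok(String::from_utf8_lossy(bytes).into_owned())
    }
}

/// Structured error info: which extractors were tried and why each failed.
#[derive(Debug)]
struct ExtractionFailure {
    mime: String,
    attempts: Vec<(String, String)>,
}

/// Extension lookup first, then a magic-bytes fallback.
/// (Assumption: the real code uses the infer crate for the fallback.)
fn detect_mime(path: &str, bytes: &[u8], ext_to_mime: &HashMap<&str, &str>) -> Option<String> {
    let ext = path.rsplit_once('.').map(|(_, e)| e.to_ascii_lowercase());
    if let Some(mime) = ext.as_deref().and_then(|e| ext_to_mime.get(e)) {
        return Some((*mime).to_string());
    }
    if bytes.starts_with(b"%PDF-") {
        return Some("application/pdf".to_string());
    }
    None
}

/// Try matching extractors from highest to lowest priority; return the
/// name of the extractor that succeeded together with its output.
fn extract_with_fallback(
    extractors: &[Box<dyn DocumentExtractor>],
    mime: &str,
    bytes: &[u8],
) -> Result<(String, String), ExtractionFailure> {
    let mut candidates: Vec<_> = extractors.iter().filter(|e| e.supports(mime)).collect();
    candidates.sort_by_key(|e| std::cmp::Reverse(e.priority()));

    let mut attempts = Vec::new();
    for extractor in candidates {
        match extractor.extract(bytes) {
            Ok(text) => return Ok((extractor.name().to_string(), text)),
            Err(reason) => attempts.push((extractor.name().to_string(), reason)),
        }
    }
    Err(ExtractionFailure { mime: mime.to_string(), attempts })
}

fn main() {
    let ext_to_mime = HashMap::from([("txt", "text/plain"), ("pdf", "application/pdf")]);
    let extractors: Vec<Box<dyn DocumentExtractor>> =
        vec![Box::new(LossyText), Box::new(StrictUtf8)];

    // Invalid UTF-8 makes the high-priority extractor fail, so the chain falls back.
    let bytes = [b'o', b'k', 0xFF];
    let mime = detect_mime("notes.txt", &bytes, &ext_to_mime).expect("known extension");
    match extract_with_fallback(&extractors, &mime, &bytes) {
        Ok((used, text)) => println!("{used}: {text}"),
        Err(failure) => eprintln!("{failure:?}"),
    }
}
```

Recording every failed attempt in the error, rather than only the last one, keeps the structured error useful when several extractors reject the same input.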