| summary |
|---|
| MIME type detection and extractor routing logic |
MIME Detection & Routing
Detection Flow
Extension -> EXT_TO_MIME map -> validate -> Registry lookup -> Extractor
Key Functions
| Function | Location | Purpose |
|---|---|---|
| detect_mime_type(path, inspect) | core/mime.rs | Extension + optional content inspection |
| detect_mime_type_from_bytes(bytes) | core/mime.rs | Magic number detection (infer crate) |
| validate_mime_type(mime) | core/mime.rs | Check if any extractor supports it |
Extension Mapping
118+ extensions mapped in EXT_TO_MIME (core/mime.rs). Case-insensitive.
Key mappings: .pdf -> application/pdf, .docx -> application/vnd.openxmlformats-officedocument.wordprocessingml.document, .xlsx -> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, .png -> image/png, .jpg -> image/jpeg
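A case-insensitive extension lookup can be sketched as follows. This is a minimal illustration, not the contents of core/mime.rs: the helper names `ext_to_mime` and `mime_for_path` are hypothetical, and the table holds only a handful of the mapped extensions.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Builds a tiny illustrative subset of an extension-to-MIME table.
/// (Hypothetical helper; the real table is EXT_TO_MIME.)
fn ext_to_mime() -> HashMap<&'static str, &'static str> {
    let mut m = HashMap::new();
    m.insert("pdf", "application/pdf");
    m.insert(
        "docx",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    );
    m.insert("png", "image/png");
    m.insert("jpg", "image/jpeg");
    m
}

/// Looks up the MIME type for a path. The extension is lowercased before
/// the lookup, so "Report.PDF" and "report.pdf" resolve identically.
fn mime_for_path(path: &str) -> Option<&'static str> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    ext_to_mime().get(ext.as_str()).copied()
}
```

Keys are stored in lowercase and the input is normalized once at lookup time, which avoids registering every case variant of an extension.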
Registry Selection
// In core/extractor/bytes.rs
fn select_extractor_for_mime(mime_type: &str) -> Result<Arc<dyn DocumentExtractor>> {
    let registry = get_document_extractor_registry();
    let registry_guard = registry.read()?;
    registry_guard
        .get_for_mime_type(mime_type)
        .ok_or_else(|| KreuzbergError::UnsupportedFormat(mime_type.into()))
}
Selects the highest-priority extractor registered for that MIME type. If none is registered, the lookup fails with KreuzbergError::UnsupportedFormat.
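To make "highest-priority" concrete, here is a self-contained sketch of priority-based selection behind a read lock. Everything in it is an assumption made for illustration: the `Extractor` trait, the `Simple` struct, the `Registry` alias, and the `select` function are hypothetical stand-ins, and the real registry may store and rank extractors differently.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical, simplified extractor trait (not the real DocumentExtractor).
trait Extractor: Send + Sync {
    fn name(&self) -> &str;
    fn priority(&self) -> i32;
    fn supported_mime_types(&self) -> &[&str];
}

/// Hypothetical extractor used only to populate the example registry.
struct Simple {
    name: &'static str,
    priority: i32,
    mimes: &'static [&'static str],
}

impl Extractor for Simple {
    fn name(&self) -> &str { self.name }
    fn priority(&self) -> i32 { self.priority }
    fn supported_mime_types(&self) -> &[&str] { self.mimes }
}

/// Hypothetical registry: a lock-protected list of shared extractors.
type Registry = RwLock<Vec<Arc<dyn Extractor>>>;

/// Returns the extractor with the largest priority among those that list
/// `mime` exactly, or an error string when none does.
fn select(registry: &Registry, mime: &str) -> Result<Arc<dyn Extractor>, String> {
    // A poisoned lock is reported as an error instead of panicking.
    let guard = registry.read().map_err(|e| e.to_string())?;
    guard
        .iter()
        .filter(|e| e.supported_mime_types().contains(&mime))
        .max_by_key(|e| e.priority())
        .cloned()
        .ok_or_else(|| format!("unsupported format: {mime}"))
}
```

Returning an `Arc` clone lets the caller keep using the extractor after the read guard is dropped, so the lock is held only for the duration of the lookup.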
Adding New MIME Types
- Add extension mapping: m.insert("ext", "application/x-new"); in core/mime.rs
- Implement DocumentExtractor with supported_mime_types() returning the MIME
- Register in register_default_extractors()
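The three steps above might look roughly like this in miniature. The trait below is a trimmed-down, hypothetical version of DocumentExtractor, and the signatures of `extension_table` and `register_default_extractors` are assumptions made so the example compiles on its own; the real definitions in the codebase will differ.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down trait; the real DocumentExtractor has more methods.
trait DocumentExtractor {
    fn supported_mime_types(&self) -> &[&str];
}

/// Step 1: the extension table gains an entry for the new format.
/// (Hypothetical helper standing in for the EXT_TO_MIME initializer.)
fn extension_table() -> HashMap<&'static str, &'static str> {
    let mut m = HashMap::new();
    m.insert("pdf", "application/pdf");
    m.insert("ext", "application/x-new"); // the new mapping
    m
}

/// Step 2: an extractor that declares support for the new MIME type.
struct NewFormatExtractor;

impl DocumentExtractor for NewFormatExtractor {
    fn supported_mime_types(&self) -> &[&str] {
        &["application/x-new"]
    }
}

/// Step 3: the default registration list includes the new extractor.
/// (Assumed signature; shown here as returning a plain list.)
fn register_default_extractors() -> Vec<Box<dyn DocumentExtractor>> {
    vec![Box::new(NewFormatExtractor)]
}
```

The MIME string must be identical in the mapping and in `supported_mime_types()`; otherwise detection succeeds but no extractor is found for the result.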
Wildcard Support
Extractors can register for MIME type families: "image/*" matches image/png, image/jpeg, etc.
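One plausible way to implement family matching is shown below. The function name `mime_matches` is hypothetical, and the sketch assumes a wildcard is only ever written as a trailing "/*"; the registry's actual matching rules may be stricter or looser.

```rust
/// Returns true when `pattern` matches `mime`. A pattern ending in "/*"
/// matches every subtype of that family; any other pattern must be equal
/// to the MIME type.
fn mime_matches(pattern: &str, mime: &str) -> bool {
    match pattern.strip_suffix("/*") {
        // Compare only the part before the slash, e.g. "image".
        Some(family) => mime
            .split_once('/')
            .map_or(false, |(ty, _subtype)| ty == family),
        None => pattern == mime,
    }
}
```

Comparing the top-level type exactly, instead of using a string prefix check, keeps "image/*" from accidentally matching a type such as "imagex/png".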
Critical Rules
- Always validate_mime_type() before extraction
- Extension mapping is case-insensitive
- Content inspection (infer crate) is fallback for extension-less files
- Registry validation is final authority on supported types
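Taken together, these rules suggest a detection order along the following lines. This is a sketch under stated assumptions: `from_extension`, `from_bytes`, `detect`, and the `SUPPORTED` list are hypothetical, the signature check is hand-written where the real code uses the infer crate, and the hard-coded list stands in for the registry.

```rust
use std::path::Path;

/// Hypothetical stand-in for "MIME types that have a registered extractor".
const SUPPORTED: &[&str] = &["application/pdf", "image/png"];

/// Extension lookup, case-insensitive (illustrative subset only).
fn from_extension(path: &str) -> Option<&'static str> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "png" => Some("image/png"),
        "zip" => Some("application/zip"),
        _ => None,
    }
}

/// Content inspection, reduced to a single PDF signature check.
fn from_bytes(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else {
        None
    }
}

/// Extension first, content inspection as the fallback, then validation
/// against the supported list, which has the final say.
fn detect(path: &str, bytes: &[u8]) -> Result<&'static str, String> {
    let mime = from_extension(path)
        .or_else(|| from_bytes(bytes))
        .ok_or_else(|| format!("could not detect MIME type for {path}"))?;
    if SUPPORTED.contains(&mime) {
        Ok(mime)
    } else {
        Err(format!("unsupported format: {mime}"))
    }
}
```

Note that a known extension such as .zip is still rejected here: detection succeeding does not imply that an extractor exists, which is why validation runs last.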