Henrik Jess Nielsen b4c07d3693

Nomad changes

2026-06-01 23:40:55 +02:00

name	description	priority
format-specific-extraction	Format-specific document extraction workflows	high

Format-Specific Extraction Workflows

Office XML (DOCX/PPTX/ODT)

ZIP archive → Security validation → XML parsing → Text + tables + metadata

1. Validate the archive: ZipBombValidator::new(limits).validate(&mut archive)?
2. Extract the XML files from the archive (word/document.xml, ppt/slides/*.xml, content.xml).
3. Parse with quick_xml::Reader (streaming) plus DepthValidator and StringGrowthValidator.
4. Extract metadata via crate::extraction::office_metadata::extract_metadata().

See: extractors/docx.rs, extractors/pptx.rs, extractors/odt.rs
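
The security validation step can be illustrated with a minimal, std-only sketch of the kind of checks a zip-bomb validator typically performs. This is not the crate's `ZipBombValidator`: the names `ArchiveLimits`, `EntrySizes`, `LimitError`, and `check_entries` are hypothetical, and the specific limits (entry count, total size, compression ratio) are assumptions.

```rust
/// Hypothetical limits; the real validator's limits are not shown in this document.
#[derive(Debug, Clone, Copy)]
pub struct ArchiveLimits {
    pub max_entries: usize,
    pub max_total_uncompressed: u64,
    pub max_ratio: u64,
}

/// Sizes as declared in one archive entry's header.
#[derive(Debug, Clone, Copy)]
pub struct EntrySizes {
    pub compressed: u64,
    pub uncompressed: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    TooManyEntries(usize),
    TotalTooLarge(u64),
    RatioTooHigh { index: usize },
}

/// Rejects an archive before any entry is decompressed.
pub fn check_entries(entries: &[EntrySizes], limits: ArchiveLimits) -> Result<(), LimitError> {
    if entries.len() > limits.max_entries {
        return Err(LimitError::TooManyEntries(entries.len()));
    }
    let mut total: u64 = 0;
    for (index, entry) in entries.iter().enumerate() {
        // Treat a zero-byte compressed size as 1 to avoid dividing by zero.
        let ratio = entry.uncompressed / entry.compressed.max(1);
        if ratio > limits.max_ratio {
            return Err(LimitError::RatioTooHigh { index });
        }
        // Saturating add so a hostile header cannot overflow the running total.
        total = total.saturating_add(entry.uncompressed);
        if total > limits.max_total_uncompressed {
            return Err(LimitError::TotalTooLarge(total));
        }
    }
    Ok(())
}
```

Checking declared sizes up front is cheap, but declared sizes can be forged, so a real validator would presumably also enforce the limits while decompressing.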

PDF

Bytes → pdf_oxide → Per-page text + OCR fallback → Tables → Metadata
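
The "per-page text + OCR fallback" step could look roughly like the sketch below. It does not call pdf_oxide or any OCR engine: the embedded text of each page is passed in as a string, and OCR is a caller-supplied closure. `PageText`, `text_with_ocr_fallback`, and the `min_chars` threshold are hypothetical names, and the "too little text means scanned page" heuristic is an assumption about how the fallback is triggered.

```rust
/// Text for one page, plus a flag recording whether OCR produced it.
#[derive(Debug, PartialEq, Eq)]
pub struct PageText {
    pub text: String,
    pub used_ocr: bool,
}

/// Keeps a page's embedded text when it has at least `min_chars`
/// non-whitespace characters; otherwise calls `ocr` with the page index.
pub fn text_with_ocr_fallback<F>(
    native_pages: &[String],
    min_chars: usize,
    mut ocr: F,
) -> Vec<PageText>
where
    F: FnMut(usize) -> String,
{
    native_pages
        .iter()
        .enumerate()
        .map(|(index, native)| {
            let visible = native.chars().filter(|c| !c.is_whitespace()).count();
            if visible >= min_chars {
                PageText { text: native.clone(), used_ocr: false }
            } else {
                // Likely a scanned page: the text layer is empty or near-empty.
                PageText { text: ocr(index), used_ocr: true }
            }
        })
        .collect()
}
```

Deciding per page rather than per document means a mostly digital PDF with a few scanned pages only pays the OCR cost for those pages.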

Archives

Validate → Extract metadata → Extract plaintext files only
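
A minimal sketch of the "plaintext files only" filter, assuming entries are selected by file extension. The extension list and the names `TEXT_EXTENSIONS`, `is_plaintext`, and `plaintext_entries` are hypothetical; the actual extractor may use MIME detection or a different list.

```rust
use std::path::Path;

/// Extensions treated as plaintext in this sketch (an assumption).
const TEXT_EXTENSIONS: &[&str] = &["txt", "md", "csv", "json", "xml"];

/// Returns true when the entry name ends in a known text extension,
/// compared case-insensitively.
pub fn is_plaintext(name: &str) -> bool {
    Path::new(name)
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| TEXT_EXTENSIONS.iter().any(|known| known.eq_ignore_ascii_case(ext)))
        .unwrap_or(false)
}

/// Keeps only the entry names that look like plaintext, preserving order.
pub fn plaintext_entries<'a>(names: &[&'a str]) -> Vec<&'a str> {
    names.iter().copied().filter(|name| is_plaintext(name)).collect()
}
```

Entries without an extension (for example `Makefile`) are skipped in this sketch, which errs on the side of not reading binary data as text.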

Structured Data

Detect format from MIME → Parse → Pretty-print → Metadata

A single StructuredExtractor handles multiple MIME types: it parses the input with the format-specific library, then pretty-prints the result to text. See: extractors/structured.rs
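
The MIME-based dispatch might be sketched as follows. The set of formats (JSON, YAML, TOML), the MIME strings, and the names `StructuredFormat` and `detect_format` are assumptions for illustration; the document does not list which types the extractor actually registers.

```rust
/// Formats assumed for this sketch; the real extractor's list may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StructuredFormat {
    Json,
    Yaml,
    Toml,
}

/// Maps a MIME type to a format, ignoring parameters and letter case.
pub fn detect_format(mime: &str) -> Option<StructuredFormat> {
    // Drop parameters such as "; charset=utf-8" before matching.
    let essence = mime.split(';').next().unwrap_or("").trim().to_ascii_lowercase();
    match essence.as_str() {
        "application/json" | "text/json" => Some(StructuredFormat::Json),
        "application/yaml" | "application/x-yaml" | "text/yaml" | "text/x-yaml" => {
            Some(StructuredFormat::Yaml)
        }
        "application/toml" | "text/toml" => Some(StructuredFormat::Toml),
        _ => None,
    }
}
```

Returning `None` for unknown types lets the caller fall through to another extractor instead of failing inside this one.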

Email

Parse headers → Extract body (text/html) → Process attachments

See: extraction/email.rs, extractors/email.rs
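
As a rough illustration of the "parse headers" step, the std-only sketch below splits a raw message at the first blank line and unfolds continuation lines. `split_headers` is a hypothetical helper, not a function from extraction/email.rs, and it deliberately skips MIME multipart handling, encoded-word decoding, and attachments.

```rust
/// Splits a raw message into (unfolded headers, body).
/// Header order and duplicate header names are preserved.
pub fn split_headers(raw: &str) -> (Vec<(String, String)>, String) {
    // Normalise CRLF so the blank-line separator is always "\n\n".
    let normalized = raw.replace("\r\n", "\n");
    let (head, body) = match normalized.split_once("\n\n") {
        Some((head, body)) => (head, body),
        None => (normalized.as_str(), ""),
    };

    let mut headers: Vec<(String, String)> = Vec::new();
    for line in head.lines() {
        if line.starts_with(' ') || line.starts_with('\t') {
            // Folded continuation line: append to the previous header's value.
            if let Some((_, value)) = headers.last_mut() {
                value.push(' ');
                value.push_str(line.trim());
            }
        } else if let Some((name, value)) = line.split_once(':') {
            headers.push((name.trim().to_string(), value.trim().to_string()));
        }
    }
    (headers, body.to_string())
}
```

Lines in the header block that are neither continuations nor `name: value` pairs are ignored here; a production parser would more likely report them.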

Helper	Location	Purpose
`office_metadata::extract_metadata()`	`extraction/office.rs`	Office XML metadata
`cells_to_markdown()`	`extraction/mod.rs`	Convert cell grid to GFM table
`build_archive_result()`	`extraction/archive/mod.rs`	Standard archive result
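
To make the "cell grid to GFM table" conversion concrete, here is a minimal sketch. It is not the crate's `cells_to_markdown()`, whose signature is not shown in this document: `grid_to_gfm` and `render_row` are hypothetical names, and treating the first row as the header is an assumption.

```rust
/// Renders one row as a GFM table line, padding short rows with empty cells.
fn render_row(row: &[String], columns: usize) -> String {
    let mut line = String::from("|");
    for i in 0..columns {
        let cell = row.get(i).map(String::as_str).unwrap_or("");
        // Escape pipes and flatten newlines so a cell cannot break the table.
        let cell = cell.replace('|', "\\|").replace('\n', " ");
        line.push(' ');
        line.push_str(cell.trim());
        line.push_str(" |");
    }
    line
}

/// Converts a grid of cells into a GFM table; the first row is the header.
/// Returns an empty string for an empty grid.
pub fn grid_to_gfm(cells: &[Vec<String>]) -> String {
    let Some(header) = cells.first() else {
        return String::new();
    };
    // Use the widest row so ragged grids still produce a rectangular table.
    let columns = cells.iter().map(|row| row.len()).max().unwrap_or(0);
    if columns == 0 {
        return String::new();
    }
    let mut lines = vec![render_row(header, columns)];
    lines.push(format!("|{}", " --- |".repeat(columns)));
    for row in &cells[1..] {
        lines.push(render_row(row, columns));
    }
    lines.join("\n")
}
```

For a two-by-two grid with header cells `a` and `b` and data cells `1` and `2`, this yields a header line, a `| --- | --- |` delimiter line, and one data line.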