This commit is contained in:
78
.ai-rulez/skills/format-specific-extraction/SKILL.md
Normal file
78
.ai-rulez/skills/format-specific-extraction/SKILL.md
Normal file
@@ -0,0 +1,78 @@
|
||||
---
|
||||
name: format-specific-extraction
|
||||
description: "Format-specific document extraction workflows"
|
||||
priority: high
|
||||
---
|
||||
|
||||
# Format-Specific Extraction Workflows
|
||||
|
||||
## Office XML (DOCX/PPTX/ODT)
|
||||
|
||||
```text
|
||||
ZIP archive → Security validation → XML parsing → Text + tables + metadata
|
||||
```
|
||||
|
||||
1. `ZipBombValidator::new(limits).validate(&mut archive)?`
|
||||
2. Extract XML files from archive (`word/document.xml`, `ppt/slides/*.xml`, `content.xml`)
|
||||
3. Parse with `quick-xml::Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
|
||||
4. Extract metadata via `crate::extraction::office_metadata::extract_metadata()`
|
||||
5. See: `extractors/docx.rs`, `extractors/pptx.rs`, `extractors/odt.rs`
|
||||
|
||||
## PDF
|
||||
|
||||
```text
|
||||
Bytes → pdf_oxide → Per-page text + OCR fallback → Tables → Metadata
|
||||
```
|
||||
|
||||
1. `pdf_oxide::PdfDocument::from_bytes(content)?`
|
||||
2. Check if needs OCR: `config.force_ocr || !has_searchable_text()`
|
||||
3. Extract text per page, tables if `config.pages` enabled
|
||||
4. Feature-gated: `#[cfg(feature = "pdf")]`
|
||||
5. See: `extractors/pdf/mod.rs`
|
||||
|
||||
## Archives (ZIP/TAR/7z/GZIP)
|
||||
|
||||
```text
|
||||
Validate → Extract metadata → Extract plaintext files only
|
||||
```
|
||||
|
||||
1. `ZipBombValidator` BEFORE any extraction
|
||||
2. Extract metadata (file list, sizes)
|
||||
3. Extract text content from plaintext files
|
||||
4. Use `build_archive_result()` helper
|
||||
5. See: `extractors/archive.rs`, `extraction/archive/*.rs`
|
||||
|
||||
## Structured Text (JSON/YAML/TOML/XML)
|
||||
|
||||
```text
|
||||
Detect format from MIME → Parse → Pretty-print → Metadata
|
||||
```
|
||||
|
||||
Single `StructuredExtractor` handles multiple MIME types. Parse with format-specific library, pretty-print to text.
|
||||
See: `extractors/structured.rs`
|
||||
|
||||
## Email (EML/MSG)
|
||||
|
||||
```text
|
||||
Parse headers → Extract body (text/html) → Process attachments
|
||||
```
|
||||
|
||||
See: `extraction/email.rs`, `extractors/email.rs`
|
||||
|
||||
## Common Helpers
|
||||
|
||||
| Helper | Location | Purpose |
|
||||
| ------------------------------------- | --------------------------- | ------------------------------ |
|
||||
| `office_metadata::extract_metadata()` | `extraction/office.rs` | Office XML metadata |
|
||||
| `cells_to_markdown()` | `extraction/mod.rs` | Convert cell grid to GFM table |
|
||||
| `build_archive_result()` | `extraction/archive/mod.rs` | Standard archive result |
|
||||
|
||||
## Adding a New Format
|
||||
|
||||
1. Add MIME type to `EXT_TO_MIME` in `core/mime.rs`
|
||||
2. Create extractor implementing `DocumentExtractor` trait
|
||||
3. Set `supported_mime_types()` and `priority()` (default: 50)
|
||||
4. Register in `extractors/mod.rs` → `register_default_extractors()`
|
||||
5. Feature-gate if optional: `#[cfg(feature = "my-format")]`
|
||||
6. Apply security validators for user content
|
||||
7. Add tests with fixture files
|
||||
Reference in New Issue
Block a user