Nomad changes

2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions
--- a/.ai-rulez/skills/api-server-mcp/SKILL.md
+++ b/.ai-rulez/skills/api-server-mcp/SKILL.md
@@ -0,0 +1,212 @@
+---
+description: "REST API server and MCP protocol integration"
+name: api-server-mcp
+priority: critical
+---
+
+# API Server & MCP Protocol
+
+**Axum server design for document extraction endpoints, middleware, async processing, and Model Context Protocol integration for AI agents**
+
+## Kreuzberg API Architecture
+
+**Location**: `crates/kreuzberg/src/api/`, `crates/kreuzberg-cli/`
+
+Kreuzberg provides a dual REST API + MCP server built with Axum + Tokio.
+
+```text
+Request Flow:
+HTTP Client / AI Agent (Claude)
+    |
+[Transport Layer]
+├── REST API (Axum HTTP)
+└── MCP Protocol (HTTP or Stdio)
+    |
+[Middleware Layer]
+├── CORS, Request Logging (TraceLayer)
+├── Request/Response size limits
+└── Rate limiting (optional)
+    |
+[Router]
+├── REST Endpoints
+│   ├── POST /extract - File upload extraction
+│   ├── POST /extract-url - URL-based extraction
+│   ├── GET /formats - List supported formats
+│   ├── GET /health - Server health check
+│   ├── POST /batch - Batch document processing
+│   ├── GET /cache/stats - Cache statistics
+│   └── DELETE /cache - Clear extraction cache
+└── MCP Endpoints
+    ├── POST /mcp/tools - List available tools
+    ├── POST /mcp/tools/call - Call a tool
+    ├── GET /mcp/resources - List resources
+    ├── GET /mcp/resources/:uri - Read resource
+    ├── GET /mcp/prompts - List prompts
+    └── GET /mcp/prompts/:name - Get prompt
+    |
+[Handler / Tool Layer]
+├── extract_handler / extract_file tool
+├── batch_handler / batch_extract tool
+├── health_handler / get_capabilities tool
+└── format_handler
+    |
+[Extraction Core]
+├── Format detection
+├── Extraction pipeline
+├── Post-processing (chunking, embeddings)
+└── Result formatting
+    |
+JSON Response / MCP ToolResult
+```
+
+## Server Setup & Configuration
+
+**Location**: `crates/kreuzberg/src/api/server.rs`
+
+Server initialization pattern: Create `ApiState` (holds `ExtractionConfig` + `ExtractionCache`), build Axum `Router` with all REST + MCP routes, apply middleware layers (body limits, CORS, tracing), serve via `tokio::net::TcpListener`.
+
+Key middleware layers applied in order:
+
+- `DefaultBodyLimit::max(100MB)` + `RequestBodyLimitLayer` -- configurable via env vars
+- `CorsLayer::permissive()` -- restrict in production via `CORS_ALLOWED_ORIGINS`
+- `TraceLayer::new_for_http()` -- request/response logging
+
+## Core REST Handlers
+
+**Location**: `crates/kreuzberg/src/api/handlers.rs`
+
+| Handler               | Method            | Description                                                                                            |
+| --------------------- | ----------------- | ------------------------------------------------------------------------------------------------------ |
+| `extract_handler`     | POST /extract     | Multipart upload: parse file + optional config JSON, check cache, call `extract_bytes()`, cache result |
+| `extract_url_handler` | POST /extract-url | Fetch URL via reqwest, extract bytes                                                                   |
+| `batch_handler`       | POST /batch       | Parallel extraction with `Semaphore`-limited concurrency (default: CPU count)                          |
+| `health_handler`      | GET /health       | Report status, version, uptime, feature availability (OCR, embeddings), cache stats                    |
+| `formats_handler`     | GET /formats      | Return supported format categories (office, pdf, images, web, email, archives, academic)               |
+| `cache_stats_handler` | GET /cache/stats  | Hit/miss counts and hit rate                                                                           |
+| `cache_clear_handler` | DELETE /cache     | Clear LRU cache                                                                                        |
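The concurrency limit described for `batch_handler` can be approximated without Tokio. The following is a minimal sketch, not Kreuzberg's handler: it swaps the async `Semaphore` for a fixed pool of scoped standard-library threads, and `bounded_parallel_map` and `default_workers` are hypothetical names introduced only for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Default concurrency: the CPU count reported by the OS (falls back to 1).
pub fn default_workers() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Applies `work` to every item using at most `max_workers` threads and
/// returns the results in input order. (Hypothetical helper, std only.)
pub fn bounded_parallel_map<T, R, F>(items: &[T], max_workers: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let next = AtomicUsize::new(0); // index of the next unclaimed item
    let slots: Mutex<Vec<Option<R>>> = Mutex::new((0..items.len()).map(|_| None).collect());
    let workers = max_workers.clamp(1, items.len().max(1));

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= items.len() {
                    break;
                }
                let out = work(&items[i]); // the work itself runs outside the lock
                let mut guard = slots.lock().unwrap();
                guard[i] = Some(out);
            });
        }
    });

    slots
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every index was claimed by a worker"))
        .collect()
}
```

Calling `bounded_parallel_map(&paths, default_workers(), extract_one)` with a per-file function would cap the number of extractions in flight at the CPU count, which is the same effect the semaphore is described as providing.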
+
+## Caching Strategy
+
+**Location**: `crates/kreuzberg/src/cache/mod.rs`
+
+LRU cache keyed by `SHA256(file_content)`, stores `Arc<ExtractionResult>`. Default 1000 entries. Thread-safe via `RwLock`. Tracks hit/miss counters with `AtomicU64` for stats endpoint.
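A rough sketch of this cache shape follows. It is an illustration under stated assumptions, not the code in `cache/mod.rs`: the standard library has no SHA-256, so `DefaultHasher` stands in for the content hash, and `ContentCache`, `content_key`, and `hit_rate` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, VecDeque};
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Stand-in for SHA256(file_content). `DefaultHasher` is NOT
/// collision-resistant; a real cache would use a cryptographic hash.
fn content_key(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

struct Inner<V> {
    map: HashMap<u64, Arc<V>>,
    order: VecDeque<u64>, // front = least recently used
}

pub struct ContentCache<V> {
    inner: RwLock<Inner<V>>,
    capacity: usize,
    hits: AtomicU64,
    misses: AtomicU64,
}

impl<V> ContentCache<V> {
    pub fn new(capacity: usize) -> Self {
        Self {
            inner: RwLock::new(Inner { map: HashMap::new(), order: VecDeque::new() }),
            capacity: capacity.max(1),
            hits: AtomicU64::new(0),
            misses: AtomicU64::new(0),
        }
    }

    /// Looks up by content; a hit moves the entry to the most-recent end.
    pub fn get(&self, bytes: &[u8]) -> Option<Arc<V>> {
        let key = content_key(bytes);
        let mut inner = self.inner.write().unwrap(); // hits reorder, so take the write lock
        let found = inner.map.get(&key).cloned();
        if found.is_some() {
            inner.order.retain(|k| *k != key);
            inner.order.push_back(key);
            self.hits.fetch_add(1, Ordering::Relaxed);
        } else {
            self.misses.fetch_add(1, Ordering::Relaxed);
        }
        found
    }

    /// Stores a result, evicting the least recently used entry when full.
    pub fn insert(&self, bytes: &[u8], value: V) -> Arc<V> {
        let key = content_key(bytes);
        let value = Arc::new(value);
        let mut inner = self.inner.write().unwrap();
        if inner.map.insert(key, Arc::clone(&value)).is_some() {
            inner.order.retain(|k| *k != key); // re-insert: refresh its position
        } else if inner.map.len() > self.capacity {
            if let Some(oldest) = inner.order.pop_front() {
                inner.map.remove(&oldest);
            }
        }
        inner.order.push_back(key);
        value
    }

    /// Hit rate as reported by a stats endpoint (0.0 before any lookup).
    pub fn hit_rate(&self) -> f64 {
        let hits = self.hits.load(Ordering::Relaxed) as f64;
        let total = hits + self.misses.load(Ordering::Relaxed) as f64;
        if total == 0.0 { 0.0 } else { hits / total }
    }
}
```

Storing `Arc<V>` lets a hit hand out a cheap shared pointer instead of cloning the whole extraction result. The `VecDeque` bookkeeping is linear in the number of entries, which is acceptable for a sketch but would normally be replaced by a dedicated LRU structure.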
+
+## Error Handling
+
+**Location**: `crates/kreuzberg/src/api/error.rs`
+
+`ApiError` enum maps to HTTP status codes:
+
+- `MissingFile` -> 400, `FileNotFound` -> 404
+- `OnnxRuntimeMissing` / `TesseractMissing` -> 503 (with remediation message)
+- `PayloadTooLarge` -> 413
+- `ExtractionFailed` / `InvalidConfig` / `UnsupportedFormat` -> 500
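The mapping above can be written as a single `match`. This is a simplified sketch: the variants are shown without payloads (the real enum presumably carries messages or paths), and the remediation strings are hypothetical placeholders.

```rust
/// Simplified mirror of the listed variants (payloads omitted in this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ApiError {
    MissingFile,
    FileNotFound,
    OnnxRuntimeMissing,
    TesseractMissing,
    PayloadTooLarge,
    ExtractionFailed,
    InvalidConfig,
    UnsupportedFormat,
}

impl ApiError {
    /// HTTP status code for each variant, following the list above.
    pub fn status_code(self) -> u16 {
        match self {
            ApiError::MissingFile => 400,
            ApiError::FileNotFound => 404,
            ApiError::PayloadTooLarge => 413,
            ApiError::OnnxRuntimeMissing | ApiError::TesseractMissing => 503,
            ApiError::ExtractionFailed
            | ApiError::InvalidConfig
            | ApiError::UnsupportedFormat => 500,
        }
    }

    /// Hypothetical remediation hints for the missing-dependency cases.
    pub fn remediation(self) -> Option<&'static str> {
        match self {
            ApiError::OnnxRuntimeMissing => Some("Install ONNX Runtime to enable embeddings"),
            ApiError::TesseractMissing => Some("Install Tesseract to enable OCR"),
            _ => None,
        }
    }
}
```

Because the `match` has no wildcard arm in `status_code`, adding a variant later fails to compile until a status code is chosen for it.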
+
+## MCP Server Implementation
+
+**Location**: `crates/kreuzberg/src/mcp/server.rs`
+
+The MCP server allows Claude and other AI agents to call Kreuzberg extraction functions through the Model Context Protocol.
+
+### MCP Tools (Callable Functions)
+
+Three tools are registered:
+
+| Tool               | Purpose                                                   | Required Params |
+| ------------------ | --------------------------------------------------------- | --------------- |
+| `extract_file`     | Extract text/tables/metadata from documents (75+ formats) | `file_path`     |
+| `batch_extract`    | Extract from multiple documents in parallel               | `file_paths[]`  |
+| `get_capabilities` | List supported formats, features, backends                | (none)          |
+
+**Tool registration pattern** (example: `extract_file`):
+
+```rust
+// Define Tool with name, description, JSON Schema inputSchema
+// Register with server.register_tool(tool, handler_fn)
+// Handler: parse params -> build ExtractionConfig -> call extract_file() -> return ToolResult as JSON
+```
+
+`extract_file` optional params: `format`, `extract_tables`, `extract_images`, `ocr_enabled`, `extract_metadata`, `chunking_preset`, `generate_embeddings`.
+
+### MCP Resources (Static Knowledge)
+
+Three resources provide static information to agents:
+
+- `kreuzberg://formats` -- Supported format list as JSON
+- `kreuzberg://features` -- Cross-binding feature matrix (from `FEATURE_MATRIX.md`)
+- `kreuzberg://api-reference` -- Generated API documentation
+
+### MCP Prompts (Agent Templates)
+
+Two prompts guide agent extraction workflows:
+
+- `extract_for_rag` -- Document type-specific RAG extraction guidance (research paper, contract, report). Recommends chunking preset and embedding config.
+- `batch_document_processing` -- Optimal concurrency, grouping, and error handling for batch workflows.
+
+### MCP Transport Protocols
+
+- **HTTP/REST**: MCP routes mounted alongside REST API on separate `/mcp/` prefix
+- **Stdio**: JSON-RPC 2.0 over stdin/stdout for local CLI integration (e.g., Claude Desktop)
+
+### Integration with Claude Desktop
+
+```json
+{
+  "mcpServers": {
+    "kreuzberg": {
+      "command": "kreuzberg-mcp",
+      "env": {
+        "KREUZBERG_API_BASE": "http://localhost:8000",
+        "KREUZBERG_MCP_TRANSPORT": "stdio"
+      }
+    }
+  }
+}
+```
+
+### MCP Error Handling
+
+`ToolError` variants: `FileNotFound`, `UnsupportedFormat`, `ExtractionFailed`, `OnnxRuntimeMissing`, `TesseractMissing`, `Timeout`. Each maps to an MCP `ToolResultError` with descriptive code and message.
+
+## Environment Configuration
+
+See `.env.example` for all configurable variables. Key categories:
+
+- **Server**: `KREUZBERG_HOST`, `KREUZBERG_PORT`
+- **Size limits**: `KREUZBERG_MAX_REQUEST_BODY_BYTES` (default 100MB), `KREUZBERG_MAX_MULTIPART_FIELD_BYTES`
+- **Features**: `KREUZBERG_ENABLE_OCR`, `KREUZBERG_ENABLE_EMBEDDINGS`, `KREUZBERG_ENABLE_KEYWORDS`
+- **Cache**: `KREUZBERG_CACHE_ENABLED`, `KREUZBERG_CACHE_SIZE`
+- **CORS**: `CORS_ALLOWED_ORIGINS` (comma-separated)
+- **MCP**: `KREUZBERG_MCP_HOST`, `KREUZBERG_MCP_PORT`, `KREUZBERG_MCP_TRANSPORT` (stdio/http)
+- **Logging**: `RUST_LOG=kreuzberg=info,tower_http=debug`
+
+## Critical Rules
+
+### REST API Rules
+
+1. **Always validate multipart file uploads** - Check MIME type, size, magic bytes
+2. **Timeout long-running extractions** - Set per-handler timeout (5 min default)
+3. **Stream large files** - Never buffer entire multi-GB file in memory
+4. **Cache aggressively** - Identical files should return from cache in <1ms
+5. **Parallel extraction is CPU-bound** - Limit workers to CPU count + 1
+6. **Error responses must be actionable** - Include error code and remediation suggestion
+7. **Health checks must verify features** - Report missing dependencies (ONNX, Tesseract)
+8. **Size limits are configurable** - Allow override via env var for large deployments
+9. **CORS is permissive by default** - Restrict in production via env var
+10. **Log all requests** - Track extraction metrics for observability
+
+### MCP Rules
+
+1. **All tools must have a timeout** - Prevent hanging on large files (default 5 min)
+2. **Error responses must be detailed** - Include suggestions for missing dependencies
+3. **Feature gates must be checked** - Return helpful message if feature unavailable (embeddings, OCR)
+4. **Resources should be static** - Don't query external services in resource handlers
+5. **Prompts guide agents** - Provide clear examples and best practices
+6. **Batch tools must support cancellation** - Allow agent to stop long-running batch operations
+7. **Log all tool calls** - Track usage for analytics and debugging
+
+## Related Skills
+
+- **extraction-pipeline-patterns** - Core extraction called by handlers and MCP tools
+- **chunking-embeddings** - Optional chunking/embedding parameters in extraction
+- **ocr-backend-management** - OCR engine selection and image preprocessing
--- a/.ai-rulez/skills/chunking-embeddings/SKILL.md
+++ b/.ai-rulez/skills/chunking-embeddings/SKILL.md
@@ -0,0 +1,120 @@
+---
+description: "Chunking, embeddings, and RAG pipeline integration"
+name: chunking-embeddings
+priority: critical
+---
+
+# Chunking & Embeddings
+
+**Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration**
+
+## Chunking Architecture Overview
+
+**Location**: `crates/kreuzberg/src/chunking/`, `crates/kreuzberg/src/embeddings.rs`
+
+```text
+Extracted Text
+    |
+[1. Normalization] -> Clean whitespace, remove control chars
+    |
+[2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
+    |
+[3. Overlap Management] -> Control context window overlap
+    |
+[4. Optional Embedding] -> Generate vectors with FastEmbed
+    |
+Output: Vec<Chunk> with text, vectors, metadata
+```
+
+## Chunking Strategies
+
+**Location**: `crates/kreuzberg/src/chunking/mod.rs`
+
+| Strategy                          | Pattern                                                 | Best For                                                           |
+| --------------------------------- | ------------------------------------------------------- | ------------------------------------------------------------------ |
+| **Fixed-Size**                    | Sliding window with configurable overlap                | Uniform chunks for embedding models with fixed token limits        |
+| **Semantic**                      | Split by sentences, merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
+| **Syntax-Aware**                  | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG       |
+| **Recursive** (LangChain pattern) | Try separators in order: `\n\n`, `\n`, `,`              | Best general-purpose chunking; auto-finds optimal split points     |
+
+Key config fields per strategy (see struct definitions in `chunking/mod.rs`):
+
+- Fixed-Size: `chunk_size`, `overlap`, `trim_whitespace`
+- Semantic: `target_chunk_size`, `min/max_chunk_size`, `semantic_threshold`, `use_sentence_boundaries`
+- Syntax-Aware: `chunk_by` (Paragraph/Section/Heading/Sentence/CodeBlock), `max_chunk_size`, `respect_code_blocks`
+- Recursive: `separators[]`, `chunk_size`, `overlap`
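To make the Fixed-Size fields concrete, here is a minimal sliding-window sketch. It is not the implementation in `chunking/mod.rs`: whitespace-separated words stand in for tokens, and `fixed_size_chunks` is a hypothetical name.

```rust
/// Splits `text` into windows of `chunk_size` words, where consecutive
/// windows share `overlap` words. Words approximate tokens in this sketch.
pub fn fixed_size_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(
        chunk_size > 0 && overlap < chunk_size,
        "overlap must be smaller than chunk_size"
    );
    let words: Vec<&str> = text.split_whitespace().collect();
    let step = chunk_size - overlap; // how far the window advances each time
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With `chunk_size = 3` and `overlap = 1`, the text `a b c d e f g` yields `a b c`, `c d e`, and `e f g`: each chunk repeats the last word of the previous one, so a sentence cut at a boundary still appears with some context in the next chunk.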
+
+## Chunking Configuration Presets
+
+**Location**: `crates/kreuzberg/src/chunking/mod.rs`
+
+| Preset       | Chunk Size  | Overlap | Strategy   | Use Case               |
+| ------------ | ----------- | ------- | ---------- | ---------------------- |
+| **Balanced** | 512 tokens  | 50      | Semantic   | RAG sweet spot         |
+| **Compact**  | 256 tokens  | 32      | Fixed-Size | Dense vectors          |
+| **Extended** | 1024 tokens | 100     | Recursive  | Full context           |
+| **Minimal**  | 128 tokens  | 16      | (default)  | Lightweight embeddings |
+
+Usage: set `config.chunking.preset = Some("balanced")` in `ExtractionConfig`.
+
+## Embedding Generation with FastEmbed
+
+**Location**: `crates/kreuzberg/src/embeddings.rs`
+
+### Model Selection
+
+| Model                               | Dimensions | Notes                            |
+| ----------------------------------- | ---------- | -------------------------------- |
+| `BAAI/bge-small-en-v1.5` (default)  | 384        | Fast, excellent for RAG          |
+| `BAAI/bge-small-zh-v1.5`            | 384        | Chinese optimized                |
+| `BAAI/bge-base-en-v1.5`             | 768        | Better quality, slower           |
+| `jinaai/jina-embeddings-v2-base-en` | 768        | Long context (up to 8192 tokens) |
+| `Custom(path)`                      | varies     | Custom ONNX model path           |
+
+### Embedding Pattern
+
+`TextEmbeddingManager` provides singleton-cached models per config. Pattern:
+
+1. `get_or_init_model()` -- lazy-loads ONNX model (downloads if needed), caches in `Arc<RwLock<HashMap>>`
+2. `embed_chunks()` -- collects chunk texts, calls `model.embed(texts, batch_size)`, zips results back to `ChunkWithEmbedding`
+
+Default config: `batch_size=256`, `device=CPU`, `parallel_requests=4`.
+
+### ONNX Runtime Requirement
+
+Embeddings require ONNX Runtime. Feature-gated via:
+
+```toml
+[features]
+embeddings = ["dep:fastembed", "dep:ort"]
+```
+
+Install: `brew install onnxruntime` (macOS) / `apt install libonnxruntime libonnxruntime-dev` (Linux). Verify: `echo $ORT_DYLIB_PATH`.
+
+## RAG Integration Pattern
+
+The full extraction-to-RAG pipeline:
+
+1. **Extract**: `extract_file(path, config)` -> `ExtractionResult`
+2. **Chunk**: Apply preset strategy to `result.content` -> `Vec<Chunk>`
+3. **Embed**: If embedding config present, `TextEmbeddingManager::embed_chunks()` -> `Vec<ChunkWithEmbedding>`
+4. **Output**: `RagDocument { file_path, metadata, chunks }` ready for vector DB ingestion
+
+See `ChunkWithEmbedding` struct in `types.rs`: contains `text`, `embedding: Vec<f32>`, `dimensions`, `norm`, `metadata`.
+
+## Critical Rules
+
+1. **Chunking is preprocessing** - Always chunk before embedding so that each input fits within the embedding model's token limit
+2. **Overlap prevents information loss** - Set overlap to 15-20% of chunk size
+3. **Embedding models are stateful** - Lazy load and cache to avoid repeated initialization
+4. **ONNX Runtime is required** - Gracefully degrade if not available (skip embeddings)
+5. **Batch embedding for performance** - Never embed single chunks; batch 50-1000 chunks
+6. **Normalize embeddings for search** - L2-normalize vectors so that the dot product equals cosine similarity
+7. **Cache embedding results** - Don't re-embed identical text chunks
+8. **Model selection impacts quality** - bge-small (384) for speed, bge-base (768) for quality
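Rule 6 can be illustrated with a few lines of arithmetic. This is a generic sketch of L2 normalization, independent of FastEmbed; `l2_normalize` and `dot` are hypothetical helper names.

```rust
/// Scales `v` to unit length in place and returns the original L2 norm.
/// A zero vector is left unchanged.
pub fn l2_normalize(v: &mut [f32]) -> f32 {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    norm
}

/// Dot product; for unit-length inputs this equals cosine similarity.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

For example, `[3.0, 4.0]` has norm 5 and normalizes to `[0.6, 0.8]`; its dot product with the normalized `[4.0, 3.0]` is 0.96, which is exactly the cosine of the angle between the two original vectors (24 / 25). Normalizing once at indexing time lets a vector store rank by plain dot product.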
+
+## Related Skills
+
+- **extraction-pipeline-patterns** - Text extraction preceding chunking
+- **api-server-mcp** - Endpoint for chunking + embedding operations
+- **ocr-backend-management** - OCR text quality affects chunking success
--- a/.ai-rulez/skills/extraction-pipeline-patterns/SKILL.md
+++ b/.ai-rulez/skills/extraction-pipeline-patterns/SKILL.md
@@ -0,0 +1,126 @@
+---
+description: "Document extraction pipeline architecture and patterns"
+name: extraction-pipeline-patterns
+priority: critical
+---
+
+# Extraction Pipeline Patterns
+
+**Kreuzberg's format detection -> extraction -> fallback orchestration for 75+ file formats**
+
+## Core Pipeline Architecture
+
+The extraction pipeline (`crates/kreuzberg/src/core/pipeline.rs`, `crates/kreuzberg/src/extraction/`) orchestrates:
+
+1. **Format Detection** - MIME type inference + extension validation -> select appropriate extractor
+2. **Intelligent Extraction** - Route to format-specific extractors (PDF, DOCX, Excel, HTML, images, archives, etc.)
+3. **Fallback Strategies** - Password-protected PDFs, OCR for images, nested archive handling, corrupted file recovery
+4. **Post-Processing Pipeline** - Validators, quality processing, chunking, custom hooks (see `core/pipeline.rs`)
+
+## Format Detection Strategy
+
+**Location**: `crates/kreuzberg/src/core/mime.rs`, `crates/kreuzberg/src/core/formats.rs`
+
+Pattern: detect via magic bytes, validate extension alignment (prevent spoofing), route to extractor. Multiple extractors for same format -> choose highest confidence/specificity.
+
+```rust
+// Sketch of core/mime.rs; `is_aligned` is an illustrative helper name
+match (magic_bytes(content), extension) {
+    (Some(fmt), Some(ext)) if is_aligned(fmt, ext) => Ok(fmt),
+    (Some(_), Some(_)) => Err(FormatMismatch), // extension contradicts magic bytes
+    (Some(fmt), None) => Ok(fmt),              // magic bytes only
+    (None, Some(ext)) => Ok(from_extension(ext)),
+    (None, None) => Err(UnknownFormat),
+}
+```
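A self-contained version of the detection logic might look like the following. It is a sketch covering only three formats with well-known signatures; the `Format` and `DetectError` types and the contents of `from_extension` are assumptions made for the example, not the definitions in `core/mime.rs`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Zip,
    Png,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DetectError {
    FormatMismatch,
    UnknownFormat,
}

/// Recognizes a format from its leading signature bytes.
fn magic_bytes(content: &[u8]) -> Option<Format> {
    if content.starts_with(b"%PDF-") {
        Some(Format::Pdf)
    } else if content.starts_with(b"PK\x03\x04") {
        Some(Format::Zip)
    } else if content.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(Format::Png)
    } else {
        None
    }
}

/// Maps a file extension (case-insensitive) to a format.
fn from_extension(ext: &str) -> Option<Format> {
    match ext.to_ascii_lowercase().as_str() {
        "pdf" => Some(Format::Pdf),
        "zip" => Some(Format::Zip),
        "png" => Some(Format::Png),
        _ => None,
    }
}

/// Magic bytes win; a contradicting extension is rejected as a mismatch.
pub fn detect_format(content: &[u8], extension: Option<&str>) -> Result<Format, DetectError> {
    match (magic_bytes(content), extension) {
        (Some(fmt), Some(ext)) if from_extension(ext) == Some(fmt) => Ok(fmt),
        (Some(_), Some(_)) => Err(DetectError::FormatMismatch),
        (Some(fmt), None) => Ok(fmt),
        (None, Some(ext)) => from_extension(ext).ok_or(DetectError::UnknownFormat),
        (None, None) => Err(DetectError::UnknownFormat),
    }
}
```

In this sketch a file whose bytes start with `%PDF-` but whose name ends in `.png` is rejected rather than routed to either extractor, which is the spoofing protection the pattern describes. Container formats such as DOCX (a ZIP archive) would need a looser alignment check than strict equality.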
+
+## Extraction Modules (75+ Formats)
+
+| Category     | Extractors                                       | Key Modules                                          |
+| ------------ | ------------------------------------------------ | ---------------------------------------------------- |
+| **Office**   | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS      | `extraction/{docx,excel,pptx}.rs`                    |
+| **PDF**      | Standard + encrypted, password attempts          | `pdf/` subdirectory (13 files)                       |
+| **Images**   | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled)     | `extraction/image.rs` + `ocr/`                       |
+| **Web**      | HTML, XHTML, XML, SVG (DOM parsing)              | `extraction/html.rs` (67KB - complex table handling) |
+| **Email**    | EML, MSG (headers, body, attachments, threading) | `extraction/email.rs`                                |
+| **Archives** | ZIP, TAR, GZ, 7Z (recursive extraction)          | `extraction/archive.rs` (31KB)                       |
+| **Markdown** | MD, TXT, RST, Org Mode, RTF                      | `extraction/markdown.rs`                             |
+| **Academic** | LaTeX, BibTeX, JATS, Jupyter, DocBook            | `extraction/{structured,xml}.rs`                     |
+
+## Extraction Dispatcher
+
+```rust
+// Sketch of extraction/mod.rs (simplified)
+let format = detect_format(source.bytes, source.extension);
+let result = match format {
+    Pdf => extract_pdf(source, config),
+    Docx => extract_docx(source, config),
+    Image => extract_image_with_ocr_fallback(source, config),
+    Archive => extract_archive_recursive(source, config),
+    _ => extract_with_plugin(format, source, config),
+};
+run_pipeline(result, config) // post-processing always runs
+```
+
+## Fallback Strategies
+
+- **Password-Protected PDFs**: Try primary password -> secondary password list -> return `is_encrypted=true` in metadata on failure
+- **OCR Fallback**: If image text extraction confidence < threshold, trigger OCR backend; return both results with scores
+- **Nested Archives**: Recursive extraction with configurable depth limit; flatten or preserve hierarchy
+- **Corrupted File Recovery**: Stream-based parsing, emit content up to error point, include error location in metadata
+
+## Configuration Integration
+
+**Location**: `crates/kreuzberg/src/core/config.rs`, `crates/kreuzberg/src/core/config_validation.rs`
+
+`ExtractionConfig` holds format-specific configs (`pdf`, `image`, `html`, `office`), fallback orchestration (`fallback`), and post-processing (`postprocessor`, `chunking`, `keywords`). See struct definition in `config.rs`.
+
+## Plugin System Integration
+
+**Location**: `crates/kreuzberg/src/plugins/`
+
+- **CustomExtractor**: Override built-in format extractors
+- **PostProcessor**: Modify results after extraction (Early/Middle/Late stages)
+- **Validator**: Fail-fast validation (e.g., minimum text length)
+- **OCRBackend**: Swap OCR engine
+
+The plugin registry is populated once at startup and cached, so lookups during extraction do no registration work.
+
+## Feature Flag Strategy
+
+**Location**: `Cargo.toml` (workspace), `crates/kreuzberg/Cargo.toml`, `FEATURE_MATRIX.md`
+
+20+ features across 9 language bindings. Key feature groups:
+
+| Group    | Features                                                                             | Notes                             |
+| -------- | ------------------------------------------------------------------------------------ | --------------------------------- |
+| OCR      | `tesseract` (default), `tesseract-static`, `ocr-minimal`                             | Mutually exclusive; enable one    |
+| Formats  | `pdf`, `pdf-minimal`, `office`, `office-minimal`                                     |                                   |
+| AI/ML    | `embeddings` (requires ONNX), `keywords-yake`, `keywords-rake`, `language-detection` |                                   |
+| Server   | `api` (Axum), `mcp`, `tokio-runtime`, `lite-runtime`                                 |                                   |
+| Bindings | `python-bindings`, `ruby-bindings`, `php-bindings`, `node-bindings`, `wasm`          |                                   |
+
+Conditional compilation: modules gated with `#[cfg(feature = "...")]`. Runtime `validate_config()` warns if requested feature not compiled in.
+
+### Feature Flag Critical Rules
+
+1. **Never mix conflicting features** - e.g., `ocr-minimal` + `tesseract` should error at compile time
+2. **Always provide feature diagnostics** - Config validation must warn if feature unavailable
+3. **Default to maximum feature set** - Unless embedded/minimal specifically requested
+4. **Test all feature combinations** - Matrix testing in CI catches regressions
+5. **WASM is incompatible** with embeddings, keywords, and OCR
+
+## Critical Rules
+
+1. **Always use format detection** before routing to extractors (prevent confusion attacks)
+2. **Stream-based parsing** for PDFs/archives to handle multi-GB files
+3. **Post-pipeline is mandatory**: All extraction results flow through `run_pipeline()` for validators/hooks
+4. **Plugin overrides are order-dependent**: Plugins registered first take priority
+5. **Fallback timeouts**: Set reasonable OCR/archive extraction timeouts (config-driven)
+6. **Metadata preservation**: Include format detection confidence, extraction method used, any fallbacks applied
+
+## Related Skills
+
+- **ocr-backend-management** - OCR engine selection and image preprocessing
+- **chunking-embeddings** - Post-extraction text splitting with FastEmbed
+- **api-server-mcp** - Axum endpoint for extraction pipeline exposure and MCP server
--- a/.ai-rulez/skills/format-specific-extraction/SKILL.md
+++ b/.ai-rulez/skills/format-specific-extraction/SKILL.md
@@ -0,0 +1,78 @@
+---
+name: format-specific-extraction
+description: "Format-specific document extraction workflows"
+priority: high
+---
+
+# Format-Specific Extraction Workflows
+
+## Office XML (DOCX/PPTX/ODT)
+
+```text
+ZIP archive → Security validation → XML parsing → Text + tables + metadata
+```
+
+1. `ZipBombValidator::new(limits).validate(&mut archive)?`
+2. Extract XML files from archive (`word/document.xml`, `ppt/slides/*.xml`, `content.xml`)
+3. Parse with `quick_xml::Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
+4. Extract metadata via `crate::extraction::office_metadata::extract_metadata()`
+5. See: `extractors/docx.rs`, `extractors/pptx.rs`, `extractors/odt.rs`
+
+## PDF
+
+```text
+Bytes → pdf_oxide → Per-page text + OCR fallback → Tables → Metadata
+```
+
+1. `pdf_oxide::PdfDocument::from_bytes(content)?`
+2. Check if needs OCR: `config.force_ocr || !has_searchable_text()`
+3. Extract text per page, tables if `config.pages` enabled
+4. Feature-gated: `#[cfg(feature = "pdf")]`
+5. See: `extractors/pdf/mod.rs`
+
+## Archives (ZIP/TAR/7z/GZIP)
+
+```text
+Validate → Extract metadata → Extract plaintext files only
+```
+
+1. `ZipBombValidator` BEFORE any extraction
+2. Extract metadata (file list, sizes)
+3. Extract text content from plaintext files
+4. Use `build_archive_result()` helper
+5. See: `extractors/archive.rs`, `extraction/archive/*.rs`
+
+## Structured Text (JSON/YAML/TOML/XML)
+
+```text
+Detect format from MIME → Parse → Pretty-print → Metadata
+```
+
+Single `StructuredExtractor` handles multiple MIME types. Parse with format-specific library, pretty-print to text.
+See: `extractors/structured.rs`
+
+## Email (EML/MSG)
+
+```text
+Parse headers → Extract body (text/html) → Process attachments
+```
+
+See: `extraction/email.rs`, `extractors/email.rs`
+
+## Common Helpers
+
+| Helper                                | Location                    | Purpose                        |
+| ------------------------------------- | --------------------------- | ------------------------------ |
+| `office_metadata::extract_metadata()` | `extraction/office.rs`      | Office XML metadata            |
+| `cells_to_markdown()`                 | `extraction/mod.rs`         | Convert cell grid to GFM table |
+| `build_archive_result()`              | `extraction/archive/mod.rs` | Standard archive result        |
+
+## Adding a New Format
+
+1. Add MIME type to `EXT_TO_MIME` in `core/mime.rs`
+2. Create extractor implementing `DocumentExtractor` trait
+3. Set `supported_mime_types()` and `priority()` (default: 50)
+4. Register in `extractors/mod.rs` → `register_default_extractors()`
+5. Feature-gate if optional: `#[cfg(feature = "my-format")]`
+6. Apply security validators for user content
+7. Add tests with fixture files
--- a/.ai-rulez/skills/plugin-architecture-patterns/SKILL.md
+++ b/.ai-rulez/skills/plugin-architecture-patterns/SKILL.md
@@ -0,0 +1,97 @@
+---
+name: plugin-architecture-patterns
+description: "Plugin architecture, registration, and trait patterns"
+priority: critical
+---
+
+# Plugin Architecture & Registration
+
+## Plugin Types
+
+| Type               | Trait                       | Location                     |
+| ------------------ | --------------------------- | ---------------------------- |
+| Document Extractor | `DocumentExtractor: Plugin` | `plugins/extractor/trait.rs` |
+| OCR Backend        | `OcrBackend: Plugin`        | `plugins/ocr/trait.rs`       |
+| Post Processor     | `PostProcessor: Plugin`     | `plugins/processor/trait.rs` |
+| Validator          | `Validator: Plugin`         | `plugins/validator/trait.rs` |
+
+## DocumentExtractor Implementation
+
+```rust
+use crate::plugins::{DocumentExtractor, Plugin};
+use async_trait::async_trait;
+
+pub struct MyExtractor;
+
+impl Plugin for MyExtractor {
+    fn name(&self) -> &str { "my-extractor" }
+    fn version(&self) -> String { env!("CARGO_PKG_VERSION").to_string() }
+}
+
+#[async_trait]
+impl DocumentExtractor for MyExtractor {
+    async fn extract_bytes(&self, content: &[u8], mime_type: &str, config: &ExtractionConfig)
+        -> Result<ExtractionResult> { /* ... */ }
+
+    fn supported_mime_types(&self) -> &[&str] { &["application/x-custom"] }
+    fn priority(&self) -> i32 { 50 }
+
+    // WASM support (optional)
+    fn as_sync_extractor(&self) -> Option<&dyn SyncExtractor> { None }
+}
+```
+
+## Priority System
+
+| Range  | Use                       |
+| ------ | ------------------------- |
+| 0-25   | Fallback/low-quality      |
+| 26-49  | Alternative extractors    |
+| **50** | **Default (built-in)**    |
+| 51-75  | Premium/enhanced          |
+| 76-100 | Specialized/high-priority |
+
+Registry selects **highest priority** extractor for each MIME type. Override built-ins with priority > 50.
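The selection rule can be sketched with a small registry. This is an illustration only: `Extractor`, `Registry`, and `Fixed` are simplified stand-ins for the real traits and registry types, and breaking ties by registration order is an assumption of the sketch.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for the `DocumentExtractor` trait.
pub trait Extractor: Send + Sync {
    fn name(&self) -> &str;
    fn supported_mime_types(&self) -> &[&str];
    fn priority(&self) -> i32 {
        50 // default priority of built-in extractors
    }
}

/// Maps each MIME type to its extractors, best candidate first.
#[derive(Default)]
pub struct Registry {
    by_mime: HashMap<String, Vec<Arc<dyn Extractor>>>,
}

impl Registry {
    pub fn register(&mut self, extractor: Arc<dyn Extractor>) {
        for mime in extractor.supported_mime_types() {
            let slot = self.by_mime.entry((*mime).to_string()).or_default();
            slot.push(Arc::clone(&extractor));
            // Stable sort: equal priorities keep registration order (assumed).
            slot.sort_by_key(|e| std::cmp::Reverse(e.priority()));
        }
    }

    /// Returns the highest-priority extractor for `mime`, if any.
    pub fn get(&self, mime: &str) -> Option<Arc<dyn Extractor>> {
        self.by_mime.get(mime).and_then(|slot| slot.first().cloned())
    }
}

/// Example extractor with a configurable priority (hypothetical).
pub struct Fixed {
    pub name: &'static str,
    pub priority: i32,
}

impl Extractor for Fixed {
    fn name(&self) -> &str {
        self.name
    }
    fn supported_mime_types(&self) -> &[&str] {
        &["application/x-custom"]
    }
    fn priority(&self) -> i32 {
        self.priority
    }
}
```

Registering a `Fixed` at priority 50 and another at 60 for the same MIME type makes `get` return the second one, which is how a plugin would override a built-in extractor without removing it.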
+
+## Registration
+
+```rust
+// In extractors/mod.rs → register_default_extractors()
+let registry = get_document_extractor_registry();
+let mut registry = registry.write()
+    .map_err(|e| KreuzbergError::Other(format!("Registry lock poisoned: {}", e)))?;
+registry.register(Arc::new(MyExtractor::new()))?;
+```
+
+## Feature-Gated Registration
+
+```rust
+#[cfg(feature = "office")]
+{
+    registry.register(Arc::new(DocxExtractor::new()))?;
+    registry.register(Arc::new(PptxExtractor::new()))?;
+}
+```
+
+## PostProcessor Pattern
+
+```rust
+#[async_trait] // assumed: the trait is async, like `DocumentExtractor` above
+impl PostProcessor for MyProcessor {
+    async fn process(&self, result: &mut ExtractionResult, _config: &ExtractionConfig)
+        -> Result<()> {
+        // Rewrite the extracted text in place
+        result.content = process_content(&result.content);
+        Ok(())
+    }
+    fn stage(&self) -> ProcessorStage { ProcessorStage::Middle }
+}
+```
+
+Stages run in order: `Early` → `Middle` → `Late`. Failures are isolated: one failing processor does not block the others.
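Stage ordering and failure isolation can be shown with a synchronous sketch. It is not the plugin runtime: `Stage`, `Processor`, and `run_processors` are hypothetical simplifications, and processors are plain function pointers operating on a `String` instead of async trait objects operating on an `ExtractionResult`.

```rust
/// Variant order defines execution order via the derived `Ord`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Stage {
    Early,
    Middle,
    Late,
}

pub struct Processor {
    pub name: &'static str,
    pub stage: Stage,
    pub run: fn(&mut String) -> Result<(), String>,
}

/// Runs processors in stage order. A failure is recorded and the
/// remaining processors still run; the collected failures are returned.
pub fn run_processors(content: &mut String, processors: &mut [Processor]) -> Vec<String> {
    processors.sort_by_key(|p| p.stage); // stable: keeps order within a stage
    let mut failures = Vec::new();
    for p in processors.iter() {
        if let Err(reason) = (p.run)(content) {
            failures.push(format!("{}: {}", p.name, reason));
        }
    }
    failures
}

// Example processors used to exercise the runner.
pub fn always_fails(_content: &mut String) -> Result<(), String> {
    Err("boom".to_string())
}

pub fn trim(content: &mut String) -> Result<(), String> {
    *content = content.trim().to_string();
    Ok(())
}

pub fn uppercase(content: &mut String) -> Result<(), String> {
    *content = content.to_uppercase();
    Ok(())
}
```

Given a `Late` uppercase processor, an `Early` processor that always fails, and a `Middle` trim processor, the runner records the early failure and still trims and uppercases the content. A processor that fails halfway through may leave partial changes behind, so a stricter design would run each processor on a copy.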
+
+## Critical Rules
+
+1. All plugins **MUST be `Send + Sync`**
+2. Feature gate with `#[cfg(feature = "...")]` for optional formats
+3. Use `#[async_trait]` for `DocumentExtractor`
+4. Initialization via `ensure_initialized()` (lazy, called before first extraction)
+5. Plugin names: kebab-case (e.g., `"pdf-extractor"`)