Nomad changes

2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions
--- a/.ai-rulez/domains/document-extraction/DOMAIN.md
+++ b/.ai-rulez/domains/document-extraction/DOMAIN.md
@@ -0,0 +1,12 @@
+---
+description: Document extraction pipeline architecture
+---
+
+- Pipeline: file input → MIME detection (magic bytes + extension) → extractor routing → extraction → post-processing → ExtractionResult
+- Extractors are plugins implementing the Extractor trait: extract(&self, source: &ExtractionSource) → ExtractionResult
+- Fallback chains: if primary extractor fails, try next in priority order (e.g., native PDF → Tesseract OCR → error)
+- Cache-first: check extraction cache before running extractors, cache results keyed by content hash
+- ExtractionResult contains: text content, metadata (page count, language, confidence), optional structured data (tables, images)
+- Async-first: all extraction paths are async, use spawn_blocking for CPU-bound work (OCR, image processing)
+- Memory limits: streaming for large files, configurable max file size, depth limits for nested archives
+- Format coverage: 91+ formats — PDF, DOCX, XLSX, PPTX, HTML, images, email (EML/MSG), archives, plain text
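The fallback-chain bullet above can be illustrated with a minimal sketch. It is not the crate's implementation: the trait shape, the `priority()` method, the stub extractors, and the error type are all assumptions made for illustration, and the sketch is synchronous to stay dependency-free even though the rules call for async extraction.

```rust
/// Hypothetical, simplified result; the ExtractionResult described above also
/// carries metadata and optional structured data.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractionResult {
    pub text: String,
    pub extractor: &'static str,
}

/// Hypothetical error that records every failed attempt in the chain.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractionError {
    pub attempts: Vec<(&'static str, String)>,
}

/// Assumed trait shape for this sketch only (synchronous, byte-slice input).
pub trait Extractor {
    fn name(&self) -> &'static str;
    /// Higher values are tried first.
    fn priority(&self) -> u32;
    fn extract(&self, source: &[u8]) -> Result<String, String>;
}

/// Stub standing in for a native PDF extractor: accepts only `%PDF-` input.
pub struct NativePdfStub;
impl Extractor for NativePdfStub {
    fn name(&self) -> &'static str { "native-pdf" }
    fn priority(&self) -> u32 { 100 }
    fn extract(&self, source: &[u8]) -> Result<String, String> {
        if source.starts_with(b"%PDF-") {
            Ok(String::from_utf8_lossy(source).into_owned())
        } else {
            Err("missing %PDF- header".to_string())
        }
    }
}

/// Stub standing in for an OCR fallback: always returns placeholder text.
pub struct OcrStub;
impl Extractor for OcrStub {
    fn name(&self) -> &'static str { "ocr-stub" }
    fn priority(&self) -> u32 { 10 }
    fn extract(&self, _source: &[u8]) -> Result<String, String> {
        Ok("<ocr text>".to_string())
    }
}

/// Try extractors from highest to lowest priority; return the first success,
/// or an error listing every failed attempt.
pub fn extract_with_fallback(
    extractors: &[Box<dyn Extractor>],
    source: &[u8],
) -> Result<ExtractionResult, ExtractionError> {
    let mut ordered: Vec<&Box<dyn Extractor>> = extractors.iter().collect();
    ordered.sort_by_key(|e| std::cmp::Reverse(e.priority()));

    let mut attempts = Vec::new();
    for extractor in ordered {
        match extractor.extract(source) {
            Ok(text) => {
                return Ok(ExtractionResult { text, extractor: extractor.name() });
            }
            Err(cause) => attempts.push((extractor.name(), cause)),
        }
    }
    Err(ExtractionError { attempts })
}
```

Registration order does not matter in this sketch because the chain is sorted by priority on each call; a real registry would more likely keep extractors pre-sorted.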
--- a/.ai-rulez/domains/document-extraction/agents/extraction-engineer.md
+++ b/.ai-rulez/domains/document-extraction/agents/extraction-engineer.md
@@ -0,0 +1,16 @@
+---
+name: extraction-engineer
+description: Document extraction pipeline development and maintenance
+model: haiku
+---
+
+When working on document extraction code:
+
+1. Key source paths: crates/kreuzberg/src/core/ (extractor.rs, mime.rs, config.rs), crates/kreuzberg/src/extraction/
+2. The extraction pipeline: Input -> Cache Check -> MIME Detection -> Format Conversion -> Extractor Selection (priority-based) -> Extraction -> Fallback Chain -> Post-Processing -> Caching -> Output
+3. For MIME detection: use the EXT_TO_MIME map, with a magic-bytes fallback via the infer crate. Always call validate_mime_type() before extraction.
+4. For caching: key on a hash of file bytes plus config, and invalidate on config changes
+5. For errors: implement fallback chains (try next-priority extractor), preserve partial results, return structured error info
+6. For new formats: add to EXT_TO_MIME, implement DocumentExtractor trait, register in register_default_extractors()
+7. Always use SecurityLimits validators for user content (ZipBombValidator, DepthValidator, StringGrowthValidator)
+8. Run `task test` after changes. Target 95% coverage on core extraction code.
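Step 3 of the agent instructions (extension map first, magic bytes as fallback, then validation) might look roughly like the sketch below. The `match` table is a hypothetical stand-in for EXT_TO_MIME, the hand-written signature checks stand in for the infer crate, and the `validate_mime_type` signature is assumed; only the names come from the file above.

```rust
use std::path::Path;

/// Hypothetical stand-in for the EXT_TO_MIME map (a few entries only).
pub fn mime_from_extension(path: &Path) -> Option<&'static str> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "png" => Some("image/png"),
        "zip" => Some("application/zip"),
        "txt" => Some("text/plain"),
        _ => None,
    }
}

/// Hand-rolled magic-byte checks, used here instead of the infer crate so the
/// sketch has no external dependencies.
pub fn mime_from_magic(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else if bytes.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some("image/png")
    } else if bytes.starts_with(b"PK\x03\x04") {
        Some("application/zip")
    } else {
        None
    }
}

/// Extension lookup first, magic bytes as the fallback.
pub fn detect_mime(path: &Path, bytes: &[u8]) -> Option<&'static str> {
    mime_from_extension(path).or_else(|| mime_from_magic(bytes))
}

/// Assumed signature: accept only MIME types this sketch knows how to route.
pub fn validate_mime_type(mime: &str) -> Result<&str, String> {
    const SUPPORTED: [&str; 4] =
        ["application/pdf", "image/png", "application/zip", "text/plain"];
    if SUPPORTED.contains(&mime) {
        Ok(mime)
    } else {
        Err(format!("unsupported MIME type: {mime}"))
    }
}
```

Note that this ordering trusts a recognised extension without inspecting the bytes. The extraction-safety rule about never trusting extensions alone suggests a stricter variant would also compare the two results when both are available.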
--- a/.ai-rulez/domains/document-extraction/rules/api-compatibility.md
+++ b/.ai-rulez/domains/document-extraction/rules/api-compatibility.md
@@ -0,0 +1,9 @@
+---
+priority: high
+---
+
+- Follow semantic versioning — breaking changes require major version bump
+- Document all public API changes in CHANGELOG.md
+- Maintain backward compatibility for at least one minor version before removing deprecated APIs
+- All public types must be FFI-friendly or have FFI-compatible equivalents
+- Version in Cargo.toml is the single source of truth for all binding packages
--- a/.ai-rulez/domains/document-extraction/rules/async-and-concurrency.md
+++ b/.ai-rulez/domains/document-extraction/rules/async-and-concurrency.md
@@ -0,0 +1,9 @@
+---
+priority: high
+---
+
+- All extraction paths must be fully async using tokio
+- Never block the async runtime — use spawn_blocking for CPU-intensive work
+- All public types must be Send + Sync
+- Use tokio::select! for timeout handling on extraction operations
+- Cross-platform: test on Linux (amd64, arm64) and macOS at minimum
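The "Send + Sync" rule above can be enforced at compile time with a small helper. The sketch below is illustrative only: `ExtractionConfig` and its fields are hypothetical, and a plain `std::thread` is used in place of tokio's `spawn_blocking` so the example needs no external crates.

```rust
use std::sync::Arc;
use std::thread;

/// Compiles only if `T` is both `Send` and `Sync`; does nothing at runtime.
pub fn assert_send_sync<T: Send + Sync>() {}

/// Hypothetical public config type; field names are illustrative.
#[derive(Debug, Clone, Default)]
pub struct ExtractionConfig {
    pub enable_ocr: bool,
    pub max_file_size: Option<u64>,
}

/// If a non-thread-safe field (for example an `Rc`) were added to the config,
/// this function would stop compiling.
pub fn check_public_types() -> bool {
    assert_send_sync::<ExtractionConfig>();
    assert_send_sync::<Arc<ExtractionConfig>>();
    true
}

/// Share the config with a worker thread, standing in for CPU-bound work that
/// an async runtime would move off its executor threads.
pub fn read_on_worker(config: Arc<ExtractionConfig>) -> bool {
    thread::spawn(move || config.enable_ocr)
        .join()
        .expect("worker thread panicked")
}
```

Calling `check_public_types()` from a unit test is one way to turn an accidental `!Send` field into a build failure rather than a runtime surprise.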
--- a/.ai-rulez/domains/document-extraction/rules/cache-and-performance.md
+++ b/.ai-rulez/domains/document-extraction/rules/cache-and-performance.md
@@ -0,0 +1,10 @@
+---
+priority: high
+---
+
+- Cache keys: content-hash based (hash of file bytes + config), not path-based
+- Invalidate cache when extraction config changes (output format, OCR settings, etc.)
+- Check cache before any extraction — cache hits should skip all processing
+- Concurrent batch processing: use configurable worker pool, default to CPU count
+- Stream large files instead of loading into memory — use AsyncRead where possible
+- Monitor cache hit rates — target >80% for repeated extractions
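A content-hash cache key (file bytes plus config, never the path) and a hit-rate counter can be sketched as follows. All type and field names here are hypothetical. `DefaultHasher` is used only to keep the example dependency-free: its output is not guaranteed stable across Rust releases, and a 64-bit key can collide, so a persistent cache would presumably use a stable, wider hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical subset of the config fields that affect extraction output.
#[derive(Debug, Clone, Hash, PartialEq, Eq)]
pub struct CacheConfig {
    pub output_format: String,
    pub ocr_enabled: bool,
}

/// Key = hash(file bytes + config). A config change yields a different key,
/// which acts as implicit invalidation.
pub fn cache_key(bytes: &[u8], config: &CacheConfig) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    config.hash(&mut hasher);
    hasher.finish()
}

/// In-memory cache that also tracks its hit rate.
#[derive(Debug, Default)]
pub struct ExtractionCache {
    entries: HashMap<u64, String>,
    hits: u64,
    lookups: u64,
}

impl ExtractionCache {
    /// Return the cached text if present; otherwise run `extract` and store it.
    pub fn get_or_extract<F>(&mut self, bytes: &[u8], config: &CacheConfig, extract: F) -> String
    where
        F: FnOnce(&[u8]) -> String,
    {
        self.lookups += 1;
        let key = cache_key(bytes, config);
        if let Some(text) = self.entries.get(&key) {
            self.hits += 1;
            return text.clone();
        }
        let text = extract(bytes);
        self.entries.insert(key, text.clone());
        text
    }

    /// Fraction of lookups served from the cache (0.0 when nothing was looked up).
    pub fn hit_rate(&self) -> f64 {
        if self.lookups == 0 {
            0.0
        } else {
            self.hits as f64 / self.lookups as f64
        }
    }
}
```

Because the path never enters the key, two copies of the same file under different names share one cache entry.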
--- a/.ai-rulez/domains/document-extraction/rules/extraction-quality.md
+++ b/.ai-rulez/domains/document-extraction/rules/extraction-quality.md
@@ -0,0 +1,10 @@
+---
+priority: high
+---
+
+- 95% test coverage on core extraction code, 80% on bindings
+- Test all format categories: text, office, PDF, images, archives, markup
+- Test corrupted/malformed documents — extraction must fail gracefully, never panic
+- Benchmark extraction speeds per format — track regressions in CI
+- Test both success and error paths for every extractor
+- Use property-based testing for parsers with wide input ranges
--- a/.ai-rulez/domains/document-extraction/rules/extraction-safety.md
+++ b/.ai-rulez/domains/document-extraction/rules/extraction-safety.md
@@ -0,0 +1,10 @@
+---
+priority: critical
+---
+
+- Always use `SecurityLimits` to cap archive size, compression ratio, file count, and nesting depth for user content. Use `ZipBombValidator` for archive extraction.
+- Validate MIME type before extraction — never trust file extensions alone
+- Implement fallback chains: if primary extractor fails, try next-priority extractor
+- Preserve partial results on failure — return what was extracted with error context
+- All errors must include: operation name, input description, root cause, and suggestion
+- Never expose internal file paths or system details in error messages returned to users
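The safety rules above (limit checks for archives, plus errors that carry operation, input description, root cause, and suggestion without leaking paths) can be sketched together. Only the name `SecurityLimits` comes from the rules; its fields, the `ArchiveSummary` and `SafeError` types, and `validate_archive` are assumptions for illustration.

```rust
use std::fmt;

/// Field names are assumed; the rules only say which quantities are capped.
#[derive(Debug, Clone, Copy)]
pub struct SecurityLimits {
    pub max_archive_size: u64,
    pub max_compression_ratio: u64,
    pub max_file_count: usize,
    pub max_nesting_depth: usize,
}

/// Hypothetical summary gathered from archive headers before unpacking.
#[derive(Debug, Clone, Copy)]
pub struct ArchiveSummary {
    pub compressed_size: u64,
    pub uncompressed_size: u64,
    pub file_count: usize,
    pub nesting_depth: usize,
}

/// Error carrying the four required parts. `input` describes the content
/// (kind and size), never a filesystem path.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SafeError {
    pub operation: &'static str,
    pub input: String,
    pub cause: String,
    pub suggestion: &'static str,
}

impl fmt::Display for SafeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} failed for {}: {}. {}", self.operation, self.input, self.cause, self.suggestion)
    }
}

/// Reject archives that exceed any configured limit.
pub fn validate_archive(summary: &ArchiveSummary, limits: &SecurityLimits) -> Result<(), SafeError> {
    let fail = |cause: String| SafeError {
        operation: "validate_archive",
        input: format!("archive ({} bytes compressed)", summary.compressed_size),
        cause,
        suggestion: "Reject the upload or raise the limit for trusted sources.",
    };
    // Guard against division by zero for empty archives.
    let ratio = summary.uncompressed_size / summary.compressed_size.max(1);

    if summary.uncompressed_size > limits.max_archive_size {
        return Err(fail(format!("uncompressed size {} exceeds limit {}", summary.uncompressed_size, limits.max_archive_size)));
    }
    if ratio > limits.max_compression_ratio {
        return Err(fail(format!("compression ratio {} exceeds limit {}", ratio, limits.max_compression_ratio)));
    }
    if summary.file_count > limits.max_file_count {
        return Err(fail(format!("file count {} exceeds limit {}", summary.file_count, limits.max_file_count)));
    }
    if summary.nesting_depth > limits.max_nesting_depth {
        return Err(fail(format!("nesting depth {} exceeds limit {}", summary.nesting_depth, limits.max_nesting_depth)));
    }
    Ok(())
}
```

Running the checks on header-declared sizes is cheap but not sufficient on its own, since headers can lie; a real validator would presumably also count bytes while streaming the decompressed output.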