Nomad changes

2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -0,0 +1,91 @@
+# Security Policy
+
+## Supported Versions
+
+Security fixes are applied to the latest release on the `main` branch.
+Fixes are back-ported to the current minor series as patch releases when
+the vulnerability is rated High or Critical. Older minor series receive
+no security back-ports.
+
+| Version | Supported |
+|---------|-----------|
+| 5.x     | Yes       |
+| < 5.0   | No        |
+
+## Threat Model
+
+Kreuzberg is a document-extraction library. Its principal threat is
+**hostile input documents** — files crafted to exhaust memory, CPU, or
+disk, or to exfiltrate data from the calling process.
+
+### Protected attack surfaces
+
+| Threat | Mitigation |
+|--------|------------|
+| Decompression bombs (ZIP/OOXML/PDF) | `ZipBombValidator` enforces `SecurityLimits.max_compression_ratio` (default 100×) and `max_archive_size` (default 500 MiB) across all archive and OOXML paths. PDF embedded-file streams are checked for ratio and absolute size before recursive processing. |
+| Oversized embedded files | `ExtractionConfig.max_embedded_file_bytes` (default 50 MiB) caps any single embedded attachment before recursive extraction is attempted. Applies to OOXML (DOCX/PPTX), email attachments, and PDF embedded files. |
+| Runaway recursive extraction | `ExtractionConfig.max_archive_depth` (default 3) limits archive nesting depth to prevent infinite recursion on mutually-embedded documents. |
+| Pathologically slow documents | `ExtractionConfig.extraction_timeout_secs` (default 60 s) wraps the entire extraction future in `tokio::time::timeout`. Documents that exceed the timeout are cancelled with `KreuzbergError::Timeout`. |
+| Content-size bombs (repeated paragraphs) | `SecurityBudget` (`StringGrowthValidator`) enforces `SecurityLimits.max_content_size` (default 100 MiB) on accumulated element text for XML-class formats, email, and PDF. |
+| XML / HTML entity expansion (billion laughs) | `EntityValidator` (per-token) and `StringGrowthValidator` (cumulative) are wired into every XML/HTML parser path. |
+| Deeply nested XML / DOM depth bombs | `DepthValidator` enforces `SecurityLimits.max_xml_depth` and `max_nesting_depth` (both default 1024). |
+| Table cell bombs (CSV / XLSX / HTML tables) | `TableValidator` enforces `SecurityLimits.max_table_cells` (default 100 000). |
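+| Path traversal in ZIP archives | `has_path_traversal()` in `extractors::security` checks each path component against `std::path::Component::ParentDir` rather than searching the raw string, so `..` is matched as a whole component even when the path contains redundant separators or `.` segments. |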
+| DDE / external-call formula injection (Excel) | The Excel extractor scans all string cells against a regex matching `=DDE(`, `=WEBSERVICE(`, `=HYPERLINK(`, and `=cmd|`, emitting `ProcessingWarning` per match (capped at 100 per document). This is a **warning only** — it does not prevent extraction, but gives callers the information needed to reject or quarantine the file. |
+| OLE compound file execution | OLE binary streams inside OOXML archives (recognised by the `D0 CF 11 E0` magic) are skipped with a `ProcessingWarning` because kreuzberg has no safe OLE execution path. |
+
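The path-traversal row in the table above describes a component-wise check rather than a string search. A minimal sketch of that idea follows, using only the standard library. The function name `contains_parent_dir` is hypothetical; the signature and exact behaviour of the real `has_path_traversal()` are not shown in this diff.

```rust
use std::path::{Component, Path};

/// Returns true if any component of `entry_name` is `..`.
/// Hypothetical stand-in for illustration, not the library's function.
fn contains_parent_dir(entry_name: &str) -> bool {
    Path::new(entry_name)
        .components()
        // `components()` splits on the platform separator and skips `.`
        // segments and repeated separators, so only a whole `..`
        // component matches here.
        .any(|c| matches!(c, Component::ParentDir))
}
```

A name such as `notes..txt` is not flagged, because `..` appears only inside a normal component. Note that on Unix a backslash is not a path separator, so an entry named `..\evil` is a single normal component in this sketch; absolute paths are also not rejected here and would need a separate check.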
+### Out of scope
+
+- **Network requests**: kreuzberg never makes outbound network requests.
+  `=WEBSERVICE(...)` formulas and `=HYPERLINK(...)` cells generate
+  warnings but the URLs are never resolved.
+- **Macro execution**: no VBA, JavaScript, or other macro runtime exists
+  inside kreuzberg. Formula strings are read as data, not evaluated.
+- **Password-protected documents**: encryption is not stripped; extracting
+  a protected file returns an error.
+- **Supply-chain / dependency vulnerabilities**: report these directly to
+  the dependency maintainer and open a GitHub advisory in this repo so
+  we can update the pinned version.
+
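Because formula strings are treated as data, the warning pass described in the threat table amounts to pattern matching over cell text. The sketch below illustrates one way such a scan could look. It is an assumption-laden stand-in: the document says the extractor uses a regex, whereas this version uses a case-insensitive prefix match to stay within the standard library, and `scan_cells` and its message format are hypothetical names.

```rust
/// Formula prefixes listed in the threat table, upper-cased for comparison.
const SUSPICIOUS_PREFIXES: [&str; 4] = ["=DDE(", "=WEBSERVICE(", "=HYPERLINK(", "=CMD|"];

/// Per-document warning cap stated in the threat table.
const MAX_WARNINGS: usize = 100;

/// Returns one warning message per suspicious cell, up to MAX_WARNINGS.
/// Assumption: matching is a case-insensitive prefix test, not the
/// library's actual regex.
fn scan_cells(cells: &[&str]) -> Vec<String> {
    cells
        .iter()
        .filter(|cell| {
            let upper = cell.trim_start().to_ascii_uppercase();
            SUSPICIOUS_PREFIXES.iter().any(|p| upper.starts_with(p))
        })
        // Stop collecting once the cap is reached.
        .take(MAX_WARNINGS)
        .map(|cell| format!("suspicious formula: {cell}"))
        .collect()
}
```

The cells are only inspected and reported; nothing is evaluated or fetched, which matches the out-of-scope statements above.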
+### Configuring limits
+
+Limits live on `ExtractionConfig.security_limits` (a `SecurityLimits` struct)
+and directly on `ExtractionConfig` (`max_embedded_file_bytes`,
+`max_archive_depth`, `extraction_timeout_secs`). The defaults are permissive
+enough for legitimate documents while blocking the most common DoS payloads.
+Set limits to `None` or very large values only for input you fully trust.
+
+```rust
+use kreuzberg::{ExtractionConfig, extractors::security::SecurityLimits};
+
+let config = ExtractionConfig {
+    // Tighten limits for untrusted input from an upload endpoint.
+    security_limits: Some(SecurityLimits {
+        max_content_size: 10 * 1024 * 1024,   // 10 MiB output cap
+        max_compression_ratio: 50,             // 50× ratio cap
+        max_table_cells: 10_000,
+        ..SecurityLimits::default()
+    }),
+    max_embedded_file_bytes: Some(5 * 1024 * 1024), // 5 MiB per embedded file
+    extraction_timeout_secs: Some(10),              // 10 s timeout
+    ..ExtractionConfig::default()
+};
+```
+
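To make the effect of `max_compression_ratio` and `max_archive_size` concrete, here is a rough sketch of the arithmetic such a cap implies. Everything in it is illustrative: `ArchiveCheck` and `check_entry` are hypothetical names, and it assumes the size limit applies to uncompressed bytes and that the ratio is evaluated per entry, neither of which this diff confirms for `ZipBombValidator`.

```rust
/// Hypothetical result type for this sketch only.
#[derive(Debug, PartialEq)]
enum ArchiveCheck {
    Ok,
    RatioExceeded,
    SizeExceeded,
}

/// Checks one archive entry against an absolute size cap and a ratio cap.
/// Assumption: `max_size` bounds the uncompressed size in bytes.
fn check_entry(compressed: u64, uncompressed: u64, max_ratio: u64, max_size: u64) -> ArchiveCheck {
    if uncompressed > max_size {
        return ArchiveCheck::SizeExceeded;
    }
    // Compare by multiplication so a zero-byte compressed size cannot
    // cause a division by zero; saturating_mul avoids overflow.
    if uncompressed > compressed.saturating_mul(max_ratio) {
        return ArchiveCheck::RatioExceeded;
    }
    ArchiveCheck::Ok
}
```

With the tightened ratio of 50 from the configuration above, a 1 MB entry that inflates to 40 MB passes, while one that inflates to 60 MB is rejected.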
+## Reporting a Vulnerability
+
+**Do not open a public GitHub issue for security vulnerabilities.**
+
+Send a report to **security@kreuzberg.dev** with:
+
+1. A description of the vulnerability and affected versions.
+2. A minimal reproducer (if possible, a file that triggers the issue).
+3. Your assessment of severity (CVSS score or plain description).
+4. Whether you want public credit when the advisory is published.
+
+We aim to acknowledge reports within **2 business days** and to publish a
+fix within **14 calendar days** for Critical/High issues and **30 days**
+for Medium/Low. We will coordinate disclosure timing with you.
+
+Researchers who follow responsible disclosure will be credited in the
+GitHub advisory unless they prefer to remain anonymous.