# Security Policy ## Supported Versions Security fixes are applied to the latest release on the `main` branch. Patch releases are back-ported to the current minor series when the vulnerability is rated High or Critical. Older minor series receive no security back-ports. | Version | Supported | |---------|-----------| | 5.x | Yes | | < 5.0 | No | ## Threat Model Kreuzberg is a document-extraction library. Its principal threat is **hostile input documents** — files crafted to exhaust memory, CPU, or disk, or to exfiltrate data from the calling process. ### Protected attack surfaces | Threat | Mitigation | |--------|------------| | Decompression bombs (ZIP/OOXML/PDF) | `ZipBombValidator` enforces `SecurityLimits.max_compression_ratio` (default 100×) and `max_archive_size` (default 500 MiB) across all archive and OOXML paths. PDF embedded-file streams are checked for ratio and absolute size before recursive processing. | | Oversized embedded files | `ExtractionConfig.max_embedded_file_bytes` (default 50 MiB) caps any single embedded attachment before recursive extraction is attempted. Applies to OOXML (DOCX/PPTX), email attachments, and PDF embedded files. | | Runaway recursive extraction | `ExtractionConfig.max_archive_depth` (default 3) limits archive nesting depth to prevent infinite recursion on mutually-embedded documents. | | Extraction timeout | `ExtractionConfig.extraction_timeout_secs` (default 60 s) wraps the entire extraction future in `tokio::time::timeout`. Pathological documents that take longer are cancelled with `KreuzbergError::Timeout`. | | Content-size bombs (repeated paragraphs) | `SecurityBudget` (`StringGrowthValidator`) enforces `SecurityLimits.max_content_size` (default 100 MiB) on accumulated element text for XML-class formats, email, and PDF. | | XML / HTML entity expansion (billion laughs) | `EntityValidator` (per-token) and `StringGrowthValidator` (cumulative) are wired into every XML/HTML parser path. | | Deeply nested XML / DOM depth bombs | `DepthValidator` enforces `SecurityLimits.max_xml_depth` and `max_nesting_depth` (both default 1024). | | Table cell bombs (CSV / XLSX / HTML tables) | `TableValidator` enforces `SecurityLimits.max_table_cells` (default 100 000). | | Path traversal in ZIP archives | `has_path_traversal()` in `extractors::security` uses `std::path::Component::ParentDir` rather than a string search, catching normalised traversal patterns. | | DDE / external-call formula injection (Excel) | The Excel extractor scans all string cells against a regex matching `=DDE(`, `=WEBSERVICE(`, `=HYPERLINK(`, and `=cmd|`, emitting `ProcessingWarning` per match (capped at 100 per document). This is a **warning only** — it does not prevent extraction, but gives callers the information needed to reject or quarantine the file. | | OLE compound file execution | OLE binary streams inside OOXML archives (recognised by the `D0 CF 11 E0` magic) are skipped with a `ProcessingWarning` because kreuzberg has no safe OLE execution path. | ### Out of scope - **Network requests**: kreuzberg never makes outbound network requests. `=WEBSERVICE(...)` formulas and `=HYPERLINK(...)` cells generate warnings but the URLs are never resolved. - **Macro execution**: no VBA, JavaScript, or other macro runtime exists inside kreuzberg. Formula strings are read as data, not evaluated. - **Password-protected documents**: encryption is not stripped; protected files are returned with an extraction error. - **Supply-chain / dependency vulnerabilities**: report these directly to the dependency maintainer and open a GitHub advisory in this repo so we can update the pinned version. ### Configuring limits All limits are on `ExtractionConfig.security_limits` (`SecurityLimits` struct) and `ExtractionConfig.max_embedded_file_bytes`. The defaults are chosen to be permissive enough for legitimate real-world documents while blocking the most common DoS payloads. Set limits to `None` or very large values only for input you fully trust. ```rust use kreuzberg::{ExtractionConfig, extractors::security::SecurityLimits}; let config = ExtractionConfig { // Tighten limits for untrusted input from an upload endpoint. security_limits: Some(SecurityLimits { max_content_size: 10 * 1024 * 1024, // 10 MiB output cap max_compression_ratio: 50, // 50× ratio cap max_table_cells: 10_000, ..SecurityLimits::default() }), max_embedded_file_bytes: Some(5 * 1024 * 1024), // 5 MiB per embedded file extraction_timeout_secs: Some(10), // 10 s timeout ..ExtractionConfig::default() }; ``` ## Reporting a Vulnerability **Do not open a public GitHub issue for security vulnerabilities.** Send a report to **security@kreuzberg.dev** with: 1. A description of the vulnerability and affected versions. 2. A minimal reproducer (if possible, a file that triggers the issue). 3. Your assessment of severity (CVSS score or plain description). 4. Whether you want public credit when the advisory is published. We aim to acknowledge reports within **2 business days** and to publish a fix within **14 calendar days** for Critical/High issues and **30 days** for Medium/Low. We will coordinate disclosure timing with you. Researchers who follow responsible disclosure will be credited in the GitHub advisory unless they prefer to remain anonymous.