# Security Policy

## Supported Versions

Security fixes are applied to the latest release on the `main` branch.
Fixes for vulnerabilities rated High or Critical are back-ported to the
current minor series as patch releases. Older minor series receive no
security back-ports.

| Version | Supported |
|---------|-----------|
| 5.x | Yes |
| < 5.0 | No |

## Threat Model

Kreuzberg is a document-extraction library. Its principal threat is
**hostile input documents**: files crafted to exhaust memory, CPU, or
disk, or to exfiltrate data from the calling process.

### Protected attack surfaces

| Threat | Mitigation |
|--------|------------|
| Decompression bombs (ZIP/OOXML/PDF) | `ZipBombValidator` enforces `SecurityLimits.max_compression_ratio` (default 100×) and `max_archive_size` (default 500 MiB) across all archive and OOXML paths. PDF embedded-file streams are checked for ratio and absolute size before recursive processing. |
| Oversized embedded files | `ExtractionConfig.max_embedded_file_bytes` (default 50 MiB) caps any single embedded attachment before recursive extraction is attempted. Applies to OOXML (DOCX/PPTX), email attachments, and PDF embedded files. |
| Runaway recursive extraction | `ExtractionConfig.max_archive_depth` (default 3) limits archive nesting depth to prevent infinite recursion on mutually-embedded documents. |
| Pathologically slow documents (CPU exhaustion) | `ExtractionConfig.extraction_timeout_secs` (default 60 s) bounds the whole extraction: the extraction future is wrapped in `tokio::time::timeout`, and documents that take longer are cancelled with `KreuzbergError::Timeout`. |
| Content-size bombs (repeated paragraphs) | `SecurityBudget` (`StringGrowthValidator`) enforces `SecurityLimits.max_content_size` (default 100 MiB) on accumulated element text for XML-class formats, email, and PDF. |
| XML / HTML entity expansion (billion laughs) | `EntityValidator` (per-token) and `StringGrowthValidator` (cumulative) are wired into every XML/HTML parser path. |
| Deeply nested XML / DOM depth bombs | `DepthValidator` enforces `SecurityLimits.max_xml_depth` and `max_nesting_depth` (both default 1024). |
| Table cell bombs (CSV / XLSX / HTML tables) | `TableValidator` enforces `SecurityLimits.max_table_cells` (default 100 000). |
| Path traversal in ZIP archives | `has_path_traversal()` in `extractors::security` uses `std::path::Component::ParentDir` rather than a string search, catching normalised traversal patterns. |
| DDE / external-call formula injection (Excel) | The Excel extractor scans all string cells against a regex matching `=DDE(`, `=WEBSERVICE(`, `=HYPERLINK(`, and `=cmd\|`, emitting a `ProcessingWarning` per match (capped at 100 per document). This is a **warning only**: it does not prevent extraction, but gives callers the information needed to reject or quarantine the file. |
| OLE compound file execution | OLE binary streams inside OOXML archives (recognised by the `D0 CF 11 E0` magic) are skipped with a `ProcessingWarning` because kreuzberg has no safe OLE execution path. |
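
The component-based path-traversal check named in the table can be illustrated with a minimal standard-library sketch. The function name `contains_parent_dir` is hypothetical; this is not kreuzberg's `has_path_traversal()`, whose signature and exact behaviour are not shown in this document.

```rust
use std::path::{Component, Path};

/// Hypothetical helper (not part of kreuzberg): returns true if any
/// component of `entry_name` is `..`.
/// Inspecting parsed components avoids the false positives a substring
/// search for ".." would produce on names such as `notes..txt`.
fn contains_parent_dir(entry_name: &str) -> bool {
    Path::new(entry_name)
        .components()
        .any(|component| matches!(component, Component::ParentDir))
}
```

With this sketch, `contains_parent_dir("docs/../../secret.txt")` is true, while `contains_parent_dir("docs/notes..txt")` is false. One caveat: on Unix, `Path` treats only `/` as a separator, so an entry name written with backslashes is parsed as a single component and would need separate handling.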

### Out of scope

- **Network requests**: kreuzberg never makes outbound network requests.
  `=WEBSERVICE(...)` formulas and `=HYPERLINK(...)` cells generate
  warnings, but the URLs are never resolved.
- **Macro execution**: no VBA, JavaScript, or other macro runtime exists
  inside kreuzberg. Formula strings are read as data, not evaluated.
- **Password-protected documents**: encryption is not stripped; extraction
  of a protected file fails with an error.
- **Supply-chain / dependency vulnerabilities**: report these directly to
  the dependency maintainer and open a GitHub advisory in this repo so
we can update the pinned version.
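
The "read as data, warn only" handling of formulas described above can be sketched as follows. The document says kreuzberg uses a regex; to stay dependency-free, this sketch swaps in plain prefix matching. The names `flag_suspicious_cells`, `SUSPICIOUS_PREFIXES`, and `MAX_WARNINGS` are hypothetical, and both the case-insensitive comparison and the match-at-start-of-cell rule are assumptions.

```rust
/// Formula prefixes listed in the threat table (upper-cased for comparison).
const SUSPICIOUS_PREFIXES: [&str; 4] = ["=DDE(", "=WEBSERVICE(", "=HYPERLINK(", "=CMD|"];

/// Cap on reported matches, mirroring the "100 per document" limit.
const MAX_WARNINGS: usize = 100;

/// Hypothetical helper (not part of kreuzberg): returns the indices of cells
/// whose text starts with a suspicious prefix, up to MAX_WARNINGS entries.
/// Cells are only inspected as strings; nothing is evaluated or fetched.
fn flag_suspicious_cells(cells: &[&str]) -> Vec<usize> {
    cells
        .iter()
        .enumerate()
        .filter(|(_, cell)| {
            // Assumption: ignore leading whitespace and letter case.
            let upper = cell.trim_start().to_ascii_uppercase();
            SUSPICIOUS_PREFIXES.iter().any(|prefix| upper.starts_with(*prefix))
        })
        .map(|(index, _)| index)
        .take(MAX_WARNINGS)
        .collect()
}
```

A caller could use the returned indices to decide whether to reject or quarantine a spreadsheet, which matches the warning-only design: extraction still completes, and the policy decision stays with the caller.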

### Configuring limits

Limits are set on `ExtractionConfig.security_limits` (a `SecurityLimits`
struct) and on the `ExtractionConfig` fields `max_embedded_file_bytes`,
`max_archive_depth`, and `extraction_timeout_secs`. The defaults are
chosen to be permissive enough for legitimate real-world documents while
blocking the most common DoS payloads. Set limits to `None` or very large
values only for input you fully trust.

```rust
use kreuzberg::{ExtractionConfig, extractors::security::SecurityLimits};

let config = ExtractionConfig {
    // Tighten limits for untrusted input from an upload endpoint.
    security_limits: Some(SecurityLimits {
        max_content_size: 10 * 1024 * 1024, // 10 MiB output cap
        max_compression_ratio: 50,          // 50× ratio cap
        max_table_cells: 10_000,
        ..SecurityLimits::default()
    }),
    max_embedded_file_bytes: Some(5 * 1024 * 1024), // 5 MiB per embedded file
    extraction_timeout_secs: Some(10),              // 10 s timeout
    ..ExtractionConfig::default()
};
```
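
To make the meaning of a ratio cap and a size cap concrete, here is a minimal sketch of how two such limits might be evaluated together. It is an illustration under stated assumptions, not kreuzberg's `ZipBombValidator`: the `ArchiveLimits` struct, the `check_archive` function, and the error strings are all hypothetical.

```rust
/// Hypothetical stand-in for the two archive-related limits discussed above.
struct ArchiveLimits {
    /// Maximum allowed uncompressed / compressed ratio, e.g. 100.
    max_compression_ratio: u64,
    /// Maximum allowed total uncompressed size in bytes, e.g. 500 MiB.
    max_archive_size: u64,
}

/// Hypothetical check: returns Err with a short reason if the given sizes
/// exceed either limit.
fn check_archive(
    compressed: u64,
    uncompressed: u64,
    limits: &ArchiveLimits,
) -> Result<(), &'static str> {
    if uncompressed > limits.max_archive_size {
        return Err("archive too large");
    }
    // Compare by multiplication so a zero compressed size cannot cause a
    // division by zero; saturating_mul guards against overflow.
    if uncompressed > compressed.saturating_mul(limits.max_compression_ratio) {
        return Err("compression ratio too high");
    }
    Ok(())
}
```

With a ratio cap of 50, a 1 000-byte entry that expands to 40 000 bytes passes, while one that expands to 60 000 bytes is rejected. Sizes declared in archive headers can be forged, so a real validator would also need to count bytes while decompressing rather than rely on declared values alone.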

## Reporting a Vulnerability

**Do not open a public GitHub issue for security vulnerabilities.**

Send a report to **security@kreuzberg.dev** with:

1. A description of the vulnerability and affected versions.
2. A minimal reproducer (if possible, a file that triggers the issue).
3. Your assessment of severity (CVSS score or plain description).
4. Whether you want public credit when the advisory is published.

We aim to acknowledge reports within **2 business days** and to publish a
fix within **14 calendar days** for Critical/High issues and **30 days**
for Medium/Low. We will coordinate disclosure timing with you.

Researchers who follow responsible disclosure will be credited in the
GitHub advisory unless they prefer to remain anonymous.