fil/SECURITY.md
Henrik Jess Nielsen b4c07d3693
Nomad changes
2026-06-01 23:40:55 +02:00

# Security Policy
## Supported Versions
Security fixes land on the `main` branch and ship in the latest release.
Fixes for vulnerabilities rated High or Critical are also back-ported, as
patch releases, to the current minor series. Older minor series receive no
security back-ports.
| Version | Supported |
|---------|-----------|
| 5.x | Yes |
| < 5.0 | No |
## Threat Model
Kreuzberg is a document-extraction library. Its principal threat is
**hostile input documents** — files crafted to exhaust memory, CPU, or
disk, or to exfiltrate data from the calling process.
### Protected attack surfaces
| Threat | Mitigation |
|--------|------------|
| Decompression bombs (ZIP/OOXML/PDF) | `ZipBombValidator` enforces `SecurityLimits.max_compression_ratio` (default 100×) and `max_archive_size` (default 500 MiB) across all archive and OOXML paths. PDF embedded-file streams are checked for ratio and absolute size before recursive processing. |
| Oversized embedded files | `ExtractionConfig.max_embedded_file_bytes` (default 50 MiB) caps any single embedded attachment before recursive extraction is attempted. Applies to OOXML (DOCX/PPTX), email attachments, and PDF embedded files. |
| Runaway recursive extraction | `ExtractionConfig.max_archive_depth` (default 3) limits archive nesting depth to prevent infinite recursion on mutually-embedded documents. |
| Unbounded extraction time | `ExtractionConfig.extraction_timeout_secs` (default 60 s) wraps the entire extraction future in `tokio::time::timeout`. Pathological documents that take longer are cancelled with `KreuzbergError::Timeout`. |
| Content-size bombs (repeated paragraphs) | `SecurityBudget` (`StringGrowthValidator`) enforces `SecurityLimits.max_content_size` (default 100 MiB) on accumulated element text for XML-class formats, email, and PDF. |
| XML / HTML entity expansion (billion laughs) | `EntityValidator` (per-token) and `StringGrowthValidator` (cumulative) are wired into every XML/HTML parser path. |
| Deeply nested XML / DOM depth bombs | `DepthValidator` enforces `SecurityLimits.max_xml_depth` and `max_nesting_depth` (both default 1024). |
| Table cell bombs (CSV / XLSX / HTML tables) | `TableValidator` enforces `SecurityLimits.max_table_cells` (default 100 000). |
| Path traversal in ZIP archives | `has_path_traversal()` in `extractors::security` uses `std::path::Component::ParentDir` rather than a string search, catching normalised traversal patterns. |
| DDE / external-call formula injection (Excel) | The Excel extractor scans all string cells against a regex matching `=DDE(`, `=WEBSERVICE(`, `=HYPERLINK(`, and `=cmd|`, emitting one `ProcessingWarning` per match (capped at 100 per document). This is a **warning only**: it does not prevent extraction, but it gives callers the information needed to reject or quarantine the file. |
| OLE compound file execution | OLE binary streams inside OOXML archives (recognised by the `D0 CF 11 E0` magic) are skipped with a `ProcessingWarning` because kreuzberg has no safe OLE execution path. |
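
The path-traversal row above relies on component-wise inspection rather than substring search. A minimal sketch of that idea, using only the standard library, might look like the following. `contains_parent_dir` is a hypothetical name, not kreuzberg's `has_path_traversal()`, whose exact behaviour may differ.

```rust
use std::path::{Component, Path};

/// Returns true if any path component of `entry_name` is `..`.
/// Hypothetical stand-in for illustration; not kreuzberg's implementation.
fn contains_parent_dir(entry_name: &str) -> bool {
    Path::new(entry_name)
        .components()
        // `Component::ParentDir` is produced only for a literal `..` component,
        // so file names that merely contain dots are not flagged.
        .any(|component| matches!(component, Component::ParentDir))
}
```

A substring search for `../` would miss a trailing `..` such as `a/..`, whereas this check flags it and leaves names like `report..final.txt` alone. The sketch does not reject absolute paths, and on Unix it treats backslashes as ordinary characters, so a real validator would likely need additional checks.
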
### Out of scope
- **Network requests**: kreuzberg never makes outbound network requests.
`=WEBSERVICE(...)` formulas and `=HYPERLINK(...)` cells generate
warnings but the URLs are never resolved.
- **Macro execution**: no VBA, JavaScript, or other macro runtime exists
inside kreuzberg. Formula strings are read as data, not evaluated.
- **Password-protected documents**: encryption is not stripped; extracting
a protected file fails with an extraction error.
- **Supply-chain / dependency vulnerabilities**: report these directly to
the dependency maintainer and open a GitHub advisory in this repo so
we can update the pinned version.
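
To illustrate what "read as data, not evaluated" means for formula cells, here is a rough sketch of a warning-only scan. It assumes a case-insensitive prefix test in place of the extractor's regex, and the `FormulaWarning` type, `scan_cells` function, and constants are hypothetical names rather than kreuzberg's API.

```rust
/// Prefixes listed in the threat table. Matching below is a plain
/// case-insensitive prefix test, an assumption standing in for the real regex.
const SUSPICIOUS_PREFIXES: [&str; 4] = ["=DDE(", "=WEBSERVICE(", "=HYPERLINK(", "=cmd|"];

/// Mirrors the documented cap of 100 warnings per document.
const MAX_WARNINGS: usize = 100;

/// Hypothetical warning record; kreuzberg's `ProcessingWarning` is not modelled.
#[derive(Debug, PartialEq)]
struct FormulaWarning {
    cell_index: usize,
    prefix: &'static str,
}

/// Scans cell strings and records a warning per suspicious cell.
/// Nothing is evaluated or fetched; the cells are only compared as text.
fn scan_cells(cells: &[&str]) -> Vec<FormulaWarning> {
    let mut warnings = Vec::new();
    for (cell_index, cell) in cells.iter().enumerate() {
        if warnings.len() >= MAX_WARNINGS {
            break; // stop once the per-document cap is reached
        }
        let upper = cell.trim_start().to_ascii_uppercase();
        let hit = SUSPICIOUS_PREFIXES
            .iter()
            .copied()
            .find(|p| upper.starts_with(p.to_ascii_uppercase().as_str()));
        if let Some(prefix) = hit {
            warnings.push(FormulaWarning { cell_index, prefix });
        }
    }
    warnings
}
```

A caller would typically treat a non-empty result as a signal to reject or quarantine the file, since the scan itself never blocks extraction.
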
### Configuring limits
Most limits live on `ExtractionConfig.security_limits` (a `SecurityLimits`
struct); `max_embedded_file_bytes`, `max_archive_depth`, and
`extraction_timeout_secs` are set directly on `ExtractionConfig`. The
defaults are permissive enough for legitimate real-world documents while
blocking the most common DoS payloads. Set limits to `None` or to very large
values only for input you fully trust.
```rust
use kreuzberg::{ExtractionConfig, extractors::security::SecurityLimits};

let config = ExtractionConfig {
    // Tighten limits for untrusted input from an upload endpoint.
    security_limits: Some(SecurityLimits {
        max_content_size: 10 * 1024 * 1024, // 10 MiB output cap
        max_compression_ratio: 50,          // 50× ratio cap
        max_table_cells: 10_000,
        ..SecurityLimits::default()
    }),
    max_embedded_file_bytes: Some(5 * 1024 * 1024), // 5 MiB per embedded file
    extraction_timeout_secs: Some(10),              // 10 s timeout
    ..ExtractionConfig::default()
};
```
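
To make the effect of a ratio cap such as `max_compression_ratio: 50` concrete, the sketch below shows one way such a check could be expressed. It is not `ZipBombValidator`'s code: `ArchiveLimits`, `ArchiveCheck`, and `check_entry` are hypothetical, and it assumes `max_archive_size` bounds the declared uncompressed size in bytes.

```rust
/// Hypothetical stand-in for two of the `SecurityLimits` fields named above.
struct ArchiveLimits {
    max_compression_ratio: u64,
    max_archive_size: u64, // assumed to be an uncompressed-size cap in bytes
}

#[derive(Debug, PartialEq)]
enum ArchiveCheck {
    Accepted,
    TooLarge,
    RatioExceeded,
}

/// Checks one archive entry's declared sizes against the limits.
fn check_entry(compressed: u64, uncompressed: u64, limits: &ArchiveLimits) -> ArchiveCheck {
    if uncompressed > limits.max_archive_size {
        return ArchiveCheck::TooLarge;
    }
    // Compare by multiplication rather than division so that a zero-byte
    // compressed size cannot cause a divide-by-zero; saturate on overflow.
    if uncompressed > compressed.saturating_mul(limits.max_compression_ratio) {
        return ArchiveCheck::RatioExceeded;
    }
    ArchiveCheck::Accepted
}
```

With a 50× cap, a 1 KiB entry that claims to expand to 1 MiB (a ratio of 1024×) would be rejected, while a 1 MiB entry expanding to 40 MiB would pass. Declared sizes in archive headers can be forged, so a production check would also need to count bytes as they are actually decompressed.
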
## Reporting a Vulnerability
**Do not open a public GitHub issue for security vulnerabilities.**
Send a report to **security@kreuzberg.dev** with:
1. A description of the vulnerability and affected versions.
2. A minimal reproducer (if possible, a file that triggers the issue).
3. Your assessment of severity (CVSS score or plain description).
4. Whether you want public credit when the advisory is published.
We aim to acknowledge reports within **2 business days** and to publish a
fix within **14 calendar days** for Critical/High issues and **30 days**
for Medium/Low. We will coordinate disclosure timing with you.
Researchers who follow responsible disclosure will be credited in the
GitHub advisory unless they prefer to remain anonymous.