# Security Policy

## Supported Versions

Security fixes are applied to the latest release on the `main` branch.
Fixes are back-ported as patch releases to the current minor series when the
vulnerability is rated High or Critical. Older minor series receive no
security back-ports.

| Version | Supported |
|---------|-----------|
| 5.x     | Yes       |
| < 5.0   | No        |

## Threat Model

Kreuzberg is a document-extraction library. Its principal threat is
**hostile input documents**: files crafted to exhaust memory, CPU, or
disk, or to exfiltrate data from the calling process.

### Protected attack surfaces

| Threat | Mitigation |
|--------|------------|
| Decompression bombs (ZIP/OOXML/PDF) | `ZipBombValidator` enforces `SecurityLimits.max_compression_ratio` (default 100×) and `max_archive_size` (default 500 MiB) across all archive and OOXML paths. PDF embedded-file streams are checked for ratio and absolute size before recursive processing. |
| Oversized embedded files | `ExtractionConfig.max_embedded_file_bytes` (default 50 MiB) caps any single embedded attachment before recursive extraction is attempted. Applies to OOXML (DOCX/PPTX), email attachments, and PDF embedded files. |
| Runaway recursive extraction | `ExtractionConfig.max_archive_depth` (default 3) limits archive nesting depth to prevent infinite recursion on mutually-embedded documents. |
| Extraction timeout | `ExtractionConfig.extraction_timeout_secs` (default 60 s) wraps the entire extraction future in `tokio::time::timeout`. Pathological documents that take longer are cancelled with `KreuzbergError::Timeout`. |
| Content-size bombs (repeated paragraphs) | `SecurityBudget` (`StringGrowthValidator`) enforces `SecurityLimits.max_content_size` (default 100 MiB) on accumulated element text for XML-class formats, email, and PDF. |
| XML / HTML entity expansion (billion laughs) | `EntityValidator` (per-token) and `StringGrowthValidator` (cumulative) are wired into every XML/HTML parser path. |
| Deeply nested XML / DOM depth bombs | `DepthValidator` enforces `SecurityLimits.max_xml_depth` and `max_nesting_depth` (both default 1024). |
| Table cell bombs (CSV / XLSX / HTML tables) | `TableValidator` enforces `SecurityLimits.max_table_cells` (default 100 000). |
| Path traversal in ZIP archives | `has_path_traversal()` in `extractors::security` uses `std::path::Component::ParentDir` rather than a string search, catching normalised traversal patterns. |
| DDE / external-call formula injection (Excel) | The Excel extractor scans all string cells against a regex matching `=DDE(`, `=WEBSERVICE(`, `=HYPERLINK(`, and `=cmd\|`, emitting `ProcessingWarning` per match (capped at 100 per document). This is a **warning only**: it does not prevent extraction, but gives callers the information needed to reject or quarantine the file. |
| OLE compound file execution | OLE binary streams inside OOXML archives (recognised by the `D0 CF 11 E0` magic) are skipped with a `ProcessingWarning` because kreuzberg has no safe OLE execution path. |

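The path-traversal row contrasts a component-based check with a string search. The following is a minimal sketch of that idea using only the standard library. The function name `has_parent_dir_component` is hypothetical; the actual `has_path_traversal()` in `extractors::security` may have a different signature and may reject more than this sketch does.

```rust
use std::path::{Component, Path};

/// Hypothetical stand-in for the check described in the table above.
/// Returns true when any path component is `..`.
///
/// Illustrative behaviour:
/// - "../etc/passwd"          -> true
/// - "docs/./../x"            -> true  (`.` is dropped by `components()`)
/// - "notes/v1..2/readme.txt" -> false (a naive search for ".." would flag it)
fn has_parent_dir_component(entry_name: &str) -> bool {
    Path::new(entry_name)
        .components()
        .any(|component| matches!(component, Component::ParentDir))
}
```

Working on parsed components avoids false positives on file names that merely contain two dots, while still catching `..` wherever it appears in the entry name.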
### Out of scope

- **Network requests**: kreuzberg never makes outbound network requests.
  `=WEBSERVICE(...)` formulas and `=HYPERLINK(...)` cells generate
  warnings but the URLs are never resolved.
- **Macro execution**: no VBA, JavaScript, or other macro runtime exists
  inside kreuzberg. Formula strings are read as data, not evaluated.
- **Password-protected documents**: encryption is not stripped; protected
  files are returned with an extraction error.
- **Supply-chain / dependency vulnerabilities**: report these directly to
  the dependency maintainer and open a GitHub advisory in this repo so
  we can update the pinned version.

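To illustrate what "read as data, not evaluated" can look like in practice, here is a dependency-free sketch that flags suspicious formula cells without resolving or executing anything. It is not kreuzberg's implementation: the policy says the extractor uses a regex, whereas this sketch uses `starts_with`, and the case-insensitive matching, the constant names, and the function `flag_suspicious_cells` are all assumptions made for illustration.

```rust
/// Prefixes taken from the policy text. Stored upper-case because this
/// sketch compares case-insensitively (an assumption, not documented).
const SUSPICIOUS_PREFIXES: [&str; 4] = ["=DDE(", "=WEBSERVICE(", "=HYPERLINK(", "=CMD|"];

/// Cap stated in the policy: at most 100 warnings per document.
const MAX_WARNINGS: usize = 100;

/// Returns the indices of cells that look like external-call formulas.
/// Cells are inspected purely as text; nothing is evaluated or fetched.
fn flag_suspicious_cells(cells: &[&str]) -> Vec<usize> {
    cells
        .iter()
        .enumerate()
        .filter(|(_, cell)| {
            let upper = cell.trim_start().to_ascii_uppercase();
            SUSPICIOUS_PREFIXES
                .iter()
                .any(|prefix| upper.starts_with(*prefix))
        })
        .map(|(index, _)| index)
        // Stop collecting once the per-document cap is reached.
        .take(MAX_WARNINGS)
        .collect()
}
```

A caller could treat a non-empty result as a signal to reject or quarantine the file, which matches the warning-only behaviour described for the Excel extractor.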
### Configuring limits

Most limits live on `ExtractionConfig.security_limits` (the `SecurityLimits`
struct). `max_embedded_file_bytes`, `max_archive_depth`, and
`extraction_timeout_secs` are set directly on `ExtractionConfig`. The defaults
are permissive enough for legitimate real-world documents while
blocking the most common DoS payloads. Set limits to `None` or very large
values only for input you fully trust.

```rust
use kreuzberg::{ExtractionConfig, extractors::security::SecurityLimits};

let config = ExtractionConfig {
    // Tighten limits for untrusted input from an upload endpoint.
    security_limits: Some(SecurityLimits {
        max_content_size: 10 * 1024 * 1024, // 10 MiB output cap
        max_compression_ratio: 50,          // 50× ratio cap
        max_table_cells: 10_000,
        ..SecurityLimits::default()
    }),
    max_embedded_file_bytes: Some(5 * 1024 * 1024), // 5 MiB per embedded file
    extraction_timeout_secs: Some(10),              // 10 s timeout
    ..ExtractionConfig::default()
};
```

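To make a ratio cap and an absolute size cap concrete, the sketch below shows one way such a pair of checks can be combined. It is a hypothetical helper, not part of kreuzberg: the name `within_limits` and its parameters are assumptions, and the policy does not state whether the library measures the ratio per entry or per archive.

```rust
/// Hypothetical helper: decides whether decompressed data stays within
/// both a compression-ratio cap and an absolute size cap.
///
/// Illustrative behaviour with `max_ratio = 50` and `max_size = 500 MiB`:
/// - 1 MiB compressed, 40 MiB uncompressed   -> within limits
/// - 1 MiB compressed, 60 MiB uncompressed   -> rejected (ratio 60 > 50)
/// - 20 MiB compressed, 600 MiB uncompressed -> rejected (size > 500 MiB)
fn within_limits(compressed: u64, uncompressed: u64, max_ratio: u64, max_size: u64) -> bool {
    if uncompressed > max_size {
        return false;
    }
    // Compare by multiplication so a zero-byte compressed size cannot
    // trigger a division by zero; saturate instead of overflowing.
    uncompressed <= compressed.saturating_mul(max_ratio)
}
```

Checking the absolute size first means an archive that compresses modestly but is simply enormous is still rejected.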
## Reporting a Vulnerability

**Do not open a public GitHub issue for security vulnerabilities.**

Send a report to **security@kreuzberg.dev** with:

1. A description of the vulnerability and affected versions.
2. A minimal reproducer (if possible, a file that triggers the issue).
3. Your assessment of severity (CVSS score or plain description).
4. Whether you want public credit when the advisory is published.

We aim to acknowledge reports within **2 business days** and to publish a
fix within **14 calendar days** for Critical/High issues and **30 days**
for Medium/Low. We will coordinate disclosure timing with you.

Researchers who follow responsible disclosure will be credited in the
GitHub advisory unless they prefer to remain anonymous.