2026-06-01 23:40:55 +02:00
# generate-test-fixtures
Deterministic fixture-generation toolkit for kreuzberg integration tests.
Produces real on-disk DOCX / ODT / XLSX / PPTX / PDF documents that exercise
track-changes / revisions / comments / incremental-update / diff / security
code paths in `kreuzberg::extract` and `kreuzberg::diff::compare`. Every
binary fixture is paired with a `<stem>.gt.json` ground-truth sidecar that
integration tests load to assert structured expectations.
The generated fixtures fill the gap left by `test_documents/`, whose existing
corpus of ~200 real-world documents contains no track-changes, comments,
incremental xref chains, or paired diff inputs.
## Layout
```text
tools/generate_test_fixtures/
  pyproject.toml
  src/generate_test_fixtures/
    __init__.py
    __main__.py            argparse entry point
    gt_schema.py           GroundTruth dataclass + JSON writer
    docx_revisions.py      DOCX w:ins / w:del / w:rPrChange fixtures
    odt_revisions.py       ODT text:tracked-changes fixtures
    xlsx_revisions.py      XLSX xl/revisions/revisionHeaders.xml fixtures
    pptx_comments.py       PPTX ppt/comments/comment{N}.xml fixtures
    pdf_incremental.py     PDF base + incremental xref chain fixtures
    diff_pairs.py          paired v1/v2 inputs for kreuzberg::diff::compare
    security_fixtures.py   DDE / oversized embed / zip-bomb fixtures
  tests/
    test_generation.py     smoke test: each generator runs + GT JSON parses
```
## Usage
From the kreuzberg repo root:
```bash
uv run --directory tools/generate_test_fixtures \
python -m generate_test_fixtures all
```
Or per format:
```bash
uv run --directory tools/generate_test_fixtures \
python -m generate_test_fixtures docx odt xlsx pptx pdf diff-pairs security
```
Default output: `test_documents/generated/<format>/<stem>.{ext,gt.json}`.
Override with `--output-dir <PATH>` (resolved relative to the cwd).
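
For orientation, here is a minimal sketch of how an argparse entry point could accept `all`, a list of formats, and `--output-dir`. It is an illustration, not the contents of `__main__.py`: the `FORMATS` tuple and the `parse_args` helper are hypothetical names, and the real entry point may resolve its default output path differently.

```python
import argparse
from pathlib import Path

# Target names taken from the usage examples above; the tuple itself is hypothetical.
FORMATS = ("docx", "odt", "xlsx", "pptx", "pdf", "diff-pairs", "security")


def parse_args(argv: list[str]) -> tuple[tuple[str, ...], Path]:
    """Return the selected targets (deduplicated, in order) and the output dir."""
    parser = argparse.ArgumentParser(prog="generate_test_fixtures")
    parser.add_argument("targets", nargs="+", choices=(*FORMATS, "all"))
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("test_documents/generated"),
    )
    args = parser.parse_args(argv)

    # "all" expands to every format; otherwise keep first-seen order without repeats.
    if "all" in args.targets:
        selected = FORMATS
    else:
        selected = tuple(dict.fromkeys(args.targets))

    # Path.resolve() makes a relative path absolute against the current working directory.
    return selected, args.output_dir.resolve()
```

Calling `parse_args(["docx", "pdf"])` in this sketch selects only those two generators and returns an absolute output directory.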
## Ground-truth schema
See `src/generate_test_fixtures/gt_schema.py`. Every sidecar is a JSON object
of the shape:
```json
{
"fixture_path": "test_documents/generated/docx/docx_track_changes_basic.docx",
"format": "docx",
"feature": "revisions",
"expectations": { ... feature-specific shape ... },
"generated_by": "generate-test-fixtures 0.1.0"
}
```
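
A dataclass matching that shape might look like the sketch below. The field names come from the JSON above; everything else (the `frozen=True` choice, the `to_json` and `write_sidecar` helpers, sorted keys) is an assumption for illustration and may differ from what `gt_schema.py` actually does.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class GroundTruth:
    """Hypothetical mirror of the sidecar shape shown above."""

    fixture_path: str
    format: str
    feature: str
    expectations: dict[str, Any] = field(default_factory=dict)
    generated_by: str = "generate-test-fixtures 0.1.0"

    def to_json(self) -> str:
        # sort_keys plus a trailing newline keeps the serialized text stable across runs.
        return json.dumps(asdict(self), indent=2, sort_keys=True) + "\n"


def write_sidecar(gt: GroundTruth, fixture: Path) -> Path:
    """Write `<stem>.gt.json` next to the fixture and return its path."""
    sidecar = fixture.with_name(fixture.stem + ".gt.json")
    sidecar.write_text(gt.to_json(), encoding="utf-8")
    return sidecar
```

With this sketch, a fixture at `docx/docx_track_changes_basic.docx` gets a sidecar named `docx_track_changes_basic.gt.json` in the same directory.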
## Determinism
Every generator pins timestamps to fixed ISO-8601 strings (no `now()`), uses
hardcoded author names, and seeds any randomness with `random.Random(42)`.
Re-running the generator on the same source code produces byte-identical
outputs. The one field that would otherwise vary, the mtime stored in the ZIP
container, is overridden to `2024-01-01T00:00:00Z` via `zipfile.ZipInfo`.
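
The ZIP part of this can be sketched as follows. The helper name `write_deterministic_zip`, the sorted member order, and the fixed permission bits are assumptions for illustration, not necessarily what the generators do; only the use of `zipfile.ZipInfo` with a pinned timestamp comes from the text above.

```python
import io
import zipfile

# ZIP entries store a local date/time tuple with no timezone; this pins it to 2024-01-01 00:00:00.
FIXED_DATE_TIME = (2024, 1, 1, 0, 0, 0)


def write_deterministic_zip(members: dict[str, bytes]) -> bytes:
    """Build a ZIP archive whose bytes depend only on member names and contents."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Sorted names give a stable entry order regardless of dict insertion order.
        for name in sorted(members):
            info = zipfile.ZipInfo(filename=name, date_time=FIXED_DATE_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            # Fixed permission bits (assumed value) so the entry metadata is explicit.
            info.external_attr = 0o644 << 16
            zf.writestr(info, members[name])
    return buf.getvalue()
```

Passing a `ZipInfo` to `writestr` is what prevents the current time from being recorded. A real ODT generator would additionally need to keep the `mimetype` entry first and uncompressed, which this sketch does not handle.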
## Why not check binaries in?
Whether the generated binaries belong in the `test_documents/` git submodule
is the user's decision. For now the generator scripts are committed and the
binary outputs are not. The integration test scaffold
(`crates/kreuzberg/tests/`) stays marked `#[ignore]` until the binaries land.