Henrik Jess Nielsen b4c07d3693
Nomad changes
2026-06-01 23:40:55 +02:00

generate-test-fixtures

Deterministic fixture-generation toolkit for kreuzberg integration tests.

Produces real on-disk DOCX, ODT, XLSX, PPTX, and PDF documents that exercise the track-changes, revisions, comments, incremental-update, diff, and security code paths in kreuzberg::extract and kreuzberg::diff::compare. Every binary fixture is paired with a <stem>.gt.json ground-truth sidecar that integration tests load to assert structured expectations.

The generated fixtures fill the gap left by test_documents/, whose existing corpus of ~200 real-world documents does not contain track-changes, comments, incremental xref chains, or paired diff inputs.

Layout

tools/generate_test_fixtures/
  pyproject.toml
  src/generate_test_fixtures/
    __init__.py
    __main__.py            argparse entry point
    gt_schema.py           GroundTruth dataclass + JSON writer
    docx_revisions.py      DOCX w:ins / w:del / w:rPrChange fixtures
    odt_revisions.py       ODT text:tracked-changes fixtures
    xlsx_revisions.py      XLSX xl/revisions/revisionHeaders.xml fixtures
    pptx_comments.py       PPTX ppt/comments/comment{N}.xml fixtures
    pdf_incremental.py     PDF base + incremental xref chain fixtures
    diff_pairs.py          paired v1/v2 inputs for kreuzberg::diff::compare
    security_fixtures.py   DDE / oversized embed / zip-bomb fixtures
  tests/
    test_generation.py     smoke test: each generator runs + GT JSON parses

Usage

From the kreuzberg repo root:

uv run --directory tools/generate_test_fixtures \
    python -m generate_test_fixtures all

Or per format:

uv run --directory tools/generate_test_fixtures \
    python -m generate_test_fixtures docx odt xlsx pptx pdf diff-pairs security

Default output: test_documents/generated/<format>/<stem>.{ext,gt.json}. Override with --output-dir <PATH> (resolved relative to the current working directory).
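
As a rough illustration of how the command line above could map onto output paths, here is a minimal sketch. It is not the contents of __main__.py: build_parser, expand_targets, fixture_paths, TARGETS, and DEFAULT_OUTPUT_DIR are hypothetical names, and the real default may be anchored differently (for example at the repo root).

```python
import argparse
from pathlib import Path

# Hypothetical names throughout; the real entry point lives in __main__.py.
TARGETS = ["docx", "odt", "xlsx", "pptx", "pdf", "diff-pairs", "security"]
DEFAULT_OUTPUT_DIR = Path("test_documents/generated")


def build_parser() -> argparse.ArgumentParser:
    """Accept one or more targets (or 'all') plus an optional --output-dir."""
    parser = argparse.ArgumentParser(prog="generate_test_fixtures")
    parser.add_argument("targets", nargs="+", choices=["all", *TARGETS])
    parser.add_argument("--output-dir", type=Path, default=DEFAULT_OUTPUT_DIR)
    return parser


def expand_targets(targets: list[str]) -> list[str]:
    """Expand 'all' to every target; otherwise drop duplicates, keeping order."""
    if "all" in targets:
        return list(TARGETS)
    return list(dict.fromkeys(targets))


def fixture_paths(output_dir: Path, fmt: str, stem: str, ext: str) -> tuple[Path, Path]:
    """Return (binary fixture path, ground-truth sidecar path) for one fixture."""
    base = output_dir / fmt
    return base / f"{stem}.{ext}", base / f"{stem}.gt.json"
```

With this layout, fixture_paths(Path("out"), "docx", "docx_track_changes_basic", "docx") yields out/docx/docx_track_changes_basic.docx and its .gt.json sibling in the same directory.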

Ground-truth schema

See src/generate_test_fixtures/gt_schema.py. Every sidecar is a JSON object of the shape:

{
  "fixture_path": "test_documents/generated/docx/docx_track_changes_basic.docx",
  "format": "docx",
  "feature": "revisions",
  "expectations": { ... feature-specific shape ... },
  "generated_by": "generate-test-fixtures 0.1.0"
}
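
A dataclass producing that shape might look like the sketch below. This is an assumption about gt_schema.py, not a copy of it: only the five top-level keys come from the example above, while the method names, the sidecar_path helper, and the "insertions" key used in the usage note are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any


@dataclass(frozen=True)
class GroundTruth:
    # Field order mirrors the key order of the documented JSON shape.
    fixture_path: str
    format: str
    feature: str
    expectations: dict[str, Any] = field(default_factory=dict)
    generated_by: str = "generate-test-fixtures 0.1.0"

    def to_json(self) -> str:
        # asdict preserves field order, and a fixed indent keeps the text stable.
        return json.dumps(asdict(self), indent=2) + "\n"


def sidecar_path(fixture: Path) -> Path:
    """Map <stem>.<ext> to <stem>.gt.json in the same directory."""
    return fixture.with_name(fixture.stem + ".gt.json")


def write_sidecar(gt: GroundTruth, fixture: Path) -> Path:
    """Write the ground truth next to the binary fixture and return its path."""
    target = sidecar_path(fixture)
    target.write_text(gt.to_json(), encoding="utf-8")
    return target
```

A generator would then build something like GroundTruth(fixture_path=..., format="docx", feature="revisions", expectations={"insertions": 1}) and call write_sidecar after writing the binary.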

Determinism

Every generator pins timestamps to fixed ISO-8601 strings (no now()), uses hardcoded author names, and seeds any randomness with random.Random(42). The mtime that ZIP containers record for each archive member would otherwise vary between runs, so the generators override it to 2024-01-01T00:00:00Z via zipfile.ZipInfo. Re-running the generator on the same source code therefore produces byte-identical outputs.
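
The pinning described above can be sketched as follows. This is an illustrative example rather than the generators' actual code: build_archive, sample_members, and the member names are hypothetical, and pinning create_system and external_attr is an extra precaution assumed here, not something the text above states.

```python
import io
import random
import zipfile

# ZIP members store a zone-less timestamp as a 6-tuple.
FIXED_DATE_TIME = (2024, 1, 1, 0, 0, 0)


def build_archive(members: dict[str, bytes]) -> bytes:
    """Write members in sorted order with pinned metadata; return the ZIP bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(members):
            info = zipfile.ZipInfo(filename=name, date_time=FIXED_DATE_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.create_system = 3  # pin the "created on" field (3 = Unix)
            info.external_attr = 0o644 << 16  # pin the permission bits
            archive.writestr(info, members[name])
    return buffer.getvalue()


def sample_members() -> dict[str, bytes]:
    """Build example content, using a seeded RNG for the pseudo-random payload."""
    rng = random.Random(42)
    payload = bytes(rng.randrange(256) for _ in range(64))
    return {"word/media/blob.bin": payload, "word/document.xml": b"<w:document/>"}


first = build_archive(sample_members())
second = build_archive(sample_members())
assert first == second  # two independent runs produce identical bytes
```

Passing a ZipInfo to writestr, rather than a plain member name, is what prevents zipfile from stamping the current time on the entry.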

Why not check binaries in?

Whether the generated binaries belong in the test_documents/ git submodule is the user's call. Until that is decided, the generator scripts are committed and the binary outputs are not. The integration test scaffold (crates/kreuzberg/tests/) is marked #[ignore] until the binaries land.