Benchmark Harness
Rust CLI tool for comparative benchmarking of document extraction across 13 Kreuzberg language bindings and 12 reference frameworks. Measures performance (latency, throughput, memory) and quality (TF1, SF1) against ground truth.
Overview
The benchmark harness serves two distinct workflows:
- CI benchmarking -- automated cross-framework comparison triggered via GitHub Actions, producing aggregated results published as GitHub Releases.
- Local quality assessment -- developer-facing pipeline comparison against ground truth for extraction quality triage and regression detection.
Architecture
CLI (clap)
|
+-- run --> AdapterRegistry --> BenchmarkRunner --> results.json
| |
| +-- NativeAdapter (in-process Kreuzberg)
| +-- SubprocessAdapter (persistent child process)
| +-- BatchSubprocessAdapter (batch API)
|
+-- compare --> ComparisonConfig --> Pipeline extraction --> Quality scoring
+-- pipeline-benchmark --> 6-path matrix --> TF1/SF1 scoring --> Triage tables
+-- consolidate --> Load multi-job results --> Aggregate percentiles
+-- validate-gt --> Fixture scan --> HTML cleanup --> Integrity report
+-- survey --> Corpus-wide extraction stats
+-- model-benchmark --> Layout model A/B comparison
+-- embed-benchmark --> Embedding throughput measurement
Module Structure
| Module | Purpose |
|---|---|
main.rs |
CLI entry point (clap subcommands) |
adapter.rs |
FrameworkAdapter trait definition |
adapters/ |
Adapter implementations: subprocess (persistent/batch), native (in-process), kreuzberg factory functions for all languages |
runner.rs |
Benchmark orchestration, iteration control, resource monitoring |
quality.rs |
TF1: token-level bag-of-words F1 scoring |
markdown_quality.rs |
SF1: structural block-level F1 scoring |
comparison.rs |
Multi-pipeline extraction with quality guardrails |
pipeline_benchmark.rs |
6-path extraction matrix benchmark |
corpus.rs, fixture.rs |
Fixture loading, filtering, validation |
aggregate.rs, consolidate.rs |
Multi-job result merging and percentile aggregation |
output.rs, stats.rs |
Result serialization and statistical analysis |
validate_gt.rs |
Ground truth integrity checks and HTML-to-GFM cleanup |
monitoring.rs |
CPU and memory sampling during benchmarks |
profiling.rs, profile_report.rs |
Flamegraph generation (requires profiling feature) |
survey.rs |
Corpus-wide extraction statistics |
model_benchmark.rs |
Layout model A/B comparison |
embed_benchmark.rs |
Embedding throughput benchmarks |
sizes.rs |
Framework installation footprint measurement |
Quality Scoring
TF1 (Text F1)
Token-level bag-of-words F1 between extracted text and ground truth.
- Tokenization: lowercase, split on whitespace, keep alphanumeric tokens plus
.and, - Separate numeric-token F1 for number-heavy documents (financial, scientific)
- Combined score:
quality_score = 0.6 * f1_text + 0.4 * f1_numeric
SF1 (Structural F1)
Block-level matching between extracted markdown and ground truth markdown.
- Block types: Heading1-6, Paragraph, CodeBlock, Formula, Table, ListItem, Image
- Type weights: Headings = 2.0, Code/Formula/Table = 1.5, ListItem = 1.0, Paragraph/Image = 0.5
- Matching: Greedy 1:1 with fuzzy cross-type compatibility (e.g., bold paragraph matched to heading gets 0.4 compatibility score)
- Adjacent concatenation: Consecutive blocks of the same type are merged before matching
- Order score: Longest Increasing Subsequence (LIS) on matched block indices
Combined Score
When markdown ground truth is available, both metrics are combined:
quality_score = 0.5 * f1_text + 0.2 * f1_numeric + 0.3 * f1_layout
Fixture Format
Fixtures are JSON files organized by format directory under fixtures/:
{
"document": "relative/path/to/file.pdf",
"file_type": "pdf",
"file_size": 123456,
"expected_frameworks": ["kreuzberg", "docling"],
"metadata": {},
"ground_truth": {
"text_file": "relative/path/to/gt.txt",
"markdown_file": "relative/path/to/gt.md",
"source": "manual|vision|pdf_text_layer|pandoc|python-docx|..."
}
}
Ground Truth Coverage
| Format | Fixtures | With Markdown GT |
|---|---|---|
| 159 | 158 | |
| HTML | 36 | 36 |
| DOCX | 26 | 26 |
| ODT | 19 | 19 |
| RTF | 17 | 17 |
| XLSX | 12 | 11 |
| CSV | 11 | 11 |
| EPUB | 8 | 8 |
| PPTX | 8 | 8 |
| Org | 6 | 6 |
| DOC | 5 | 5 |
| OPML | 4 | 4 |
| RST | 3 | 3 |
| XLS | 3 | 3 |
| IPynb | 1 | 1 |
| JATS | 1 | 1 |
| LaTeX | 1 | 1 |
Total: 318 fixtures with markdown ground truth across 17 formats.
Frameworks
Kreuzberg Bindings (13)
Each binding is benchmarked in both single-file (sequential, fair latency) and batch (concurrent, throughput) modes:
Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, WASM, C, Rust+PaddleOCR
Reference Frameworks (12)
External document extraction tools benchmarked in single-file mode:
Docling, MarkItDown, Pandoc, Unstructured, Tika, PyMuPDF4LLM, PDFPlumber, MinerU, PyPDF, PDFMiner, PDFtoText, Playa-PDF
Extraction Pipelines
The compare and pipeline-benchmark commands support these extraction paths:
| Pipeline | Description |
|---|---|
baseline |
Native PDF text extraction (no OCR, no layout) |
layout |
Native PDF with layout detection |
tesseract |
Tesseract OCR with force_ocr |
tesseract+layout |
Tesseract OCR with layout detection |
paddle |
PaddleOCR mobile tier with force_ocr |
paddle+layout |
PaddleOCR mobile tier with layout detection |
paddle-server |
PaddleOCR server tier |
docling |
Vendored Docling reference extraction |
paddleocr-python |
Vendored PaddleOCR Python extraction |
rapidocr |
Vendored RapidOCR extraction |
CLI Reference
run -- CI benchmark execution
Runs benchmarks using framework adapters with configurable iterations, warmup, and sharding.
benchmark-harness run \
-f fixtures/ \
-F kreuzberg-rust,kreuzberg-python \
-m batch \
-o results/ \
-i 3 -w 1
| Flag | Description | Default |
|---|---|---|
-f, --fixtures |
Fixture directory or file | required |
-F, --frameworks |
Comma-separated framework names | all available |
-o, --output |
Output directory | results |
-m, --mode |
single-file or batch |
batch |
-i, --iterations |
Benchmark iterations | 3 |
-w, --warmup |
Warmup iterations (discarded) | 1 |
-c, --max-concurrent |
Max concurrent extractions | CPU count |
-t, --timeout |
Timeout in seconds | 1800 |
--ocr |
Enable OCR | false |
--measure-quality |
Enable quality assessment | false |
--shard |
Run fixture subset (INDEX/TOTAL, e.g. 1/3) |
none |
consolidate -- Merge multi-job results
Combines benchmark results from parallel CI jobs into a single aggregated report with percentiles.
benchmark-harness consolidate \
--inputs dir1,dir2,dir3 \
--output consolidated/
compare -- Local pipeline comparison
Compares extraction pipelines on the document corpus with quality scoring and optional guardrails.
benchmark-harness compare \
-f fixtures/ \
--pipelines baseline,layout,paddle \
--dump-outputs \
--guardrails
| Flag | Description |
|---|---|
--pipelines |
Comma-separated pipeline names |
--dump-outputs |
Write extraction outputs to /tmp/kreuzberg_compare/ |
--guardrails |
Fail on quality regressions (non-zero exit) |
--filter |
Only run documents matching this substring |
pipeline-benchmark -- 6-path extraction matrix
Runs all pipelines across the corpus and produces a ranked triage table.
benchmark-harness pipeline-benchmark \
-f fixtures/ \
--group tables \
--sort-by sf1 \
--bottom-n 10 \
--triage-blocks
| Flag | Description | Default |
|---|---|---|
--paths |
Comma-separated pipeline names | all 6 default paths |
--doc |
Filter by document name substrings | none |
--group |
Named benchmark group (tables, structure, multicolumn, text-quality, ocr-fallback) |
none |
--sort-by |
Sort metric: sf1, tf1, time |
sf1 |
--bottom-n |
Show only the N worst-performing documents | none |
--triage-blocks |
Print per-block-type F1 breakdown | false |
--dump-outputs |
Write outputs to /tmp/kreuzberg_pipeline/ |
false |
--json-output |
Write JSON results to file | none |
--profile-dir |
Generate per-pipeline flamegraph SVGs | none |
validate-gt -- Ground truth validation
Checks ground truth file integrity and optionally fixes HTML artifacts in markdown files.
benchmark-harness validate-gt -f fixtures/ --fix
survey -- Corpus extraction statistics
Produces corpus-wide extraction statistics grouped by file type.
benchmark-harness survey -f fixtures/ --types pdf,docx
model-benchmark -- Layout model A/B comparison
Compares two layout model presets across the fixture corpus.
benchmark-harness model-benchmark -f fixtures/ --model-a fast --model-b accurate
embed-benchmark -- Embedding throughput
Benchmarks embedding throughput across all presets.
benchmark-harness embed-benchmark
list-fixtures -- List loaded fixtures
benchmark-harness list-fixtures -f fixtures/
validate -- Validate fixture JSON
benchmark-harness validate -f fixtures/
measure-framework-sizes -- Installation footprints
Measures disk usage of all framework installations.
benchmark-harness measure-framework-sizes --output sizes.json
CI Integration
The benchmark suite runs via .github/workflows/benchmarks.yaml, triggered by manual workflow_dispatch.
Execution DAG
setup
Build harness + FFI library + validate ground truth
|
v
bench-{language} x {single-file, batch} (13 Kreuzberg binding jobs)
|
v
kreuzberg-gate (wait for all Kreuzberg benchmarks)
|
v
bench-{external} (12 reference framework jobs, some sharded)
|
v
aggregate-and-release (consolidate all results -> GitHub Release)
Platform
- Primary:
ubuntu-24.04-arm - Exception: WASM uses
ubuntu-24.04(x86) due to V8 ARM compatibility issues
Timeouts and Artifacts
- Per-job timeout: 6 hours (configurable per-document timeout)
- Build artifacts retained: 7 days
- Result artifacts retained: 30 days
- Final output: aggregated JSON published as a GitHub Release
Vendored Baselines
Pre-generated extraction outputs from reference tools are stored in vendored/ for offline comparison:
| Directory | Source |
|---|---|
vendored/docling/ |
Docling extraction outputs |
vendored/paddleocr-python/ |
PaddleOCR Python outputs with timing (.ms files) |
vendored/rapidocr/ |
RapidOCR extraction outputs |
Regenerate with:
python tools/benchmark-harness/scripts/generate_vendored_baselines.py
Development
# Build
cargo build -p benchmark-harness
# Run tests
cargo test -p benchmark-harness
# Lint
cargo clippy -p benchmark-harness -- -D warnings
# Local pipeline comparison
cargo run -p benchmark-harness -- compare \
-f tools/benchmark-harness/fixtures/ \
--pipelines baseline,layout \
--dump-outputs
# Validate ground truth
cargo run -p benchmark-harness -- validate-gt \
-f tools/benchmark-harness/fixtures/
# Full pipeline benchmark with triage
cargo run -p benchmark-harness -- pipeline-benchmark \
-f tools/benchmark-harness/fixtures/ \
--sort-by sf1 --bottom-n 20 --triage-blocks
# Corpus survey
cargo run -p benchmark-harness -- survey \
-f tools/benchmark-harness/fixtures/ --types pdf
Optional Features
| Feature | Description |
|---|---|
profiling |
Enables flamegraph generation via pprof |
memory-profiling |
Enables jemalloc-based memory profiling |
Build with features:
cargo build -p benchmark-harness --features profiling,memory-profiling
Tracing
The harness uses tracing with RUST_LOG env-filter support. For quality scoring diagnostics:
RUST_LOG=benchmark_harness::markdown_quality=debug cargo run -p benchmark-harness -- compare ...