
# Benchmark Harness

Rust CLI tool for comparative benchmarking of document extraction across 13 Kreuzberg language bindings and 12 reference frameworks. Measures performance (latency, throughput, memory) and quality (TF1, SF1) against ground truth.

## Overview

The benchmark harness serves two distinct workflows:

- **CI benchmarking** -- automated cross-framework comparison triggered via GitHub Actions, producing aggregated results published as GitHub Releases.
- **Local quality assessment** -- developer-facing pipeline comparison against ground truth for extraction quality triage and regression detection.

## Architecture

```text
CLI (clap)
 |
 +-- run              --> AdapterRegistry --> BenchmarkRunner --> results.json
 |                         |
 |                         +-- NativeAdapter (in-process Kreuzberg)
 |                         +-- SubprocessAdapter (persistent child process)
 |                         +-- BatchSubprocessAdapter (batch API)
 |
 +-- compare          --> ComparisonConfig --> Pipeline extraction --> Quality scoring
 +-- pipeline-benchmark --> 6-path matrix --> TF1/SF1 scoring --> Triage tables
 +-- consolidate      --> Load multi-job results --> Aggregate percentiles
 +-- validate-gt      --> Fixture scan --> HTML cleanup --> Integrity report
 +-- survey           --> Corpus-wide extraction stats
 +-- model-benchmark  --> Layout model A/B comparison
 +-- embed-benchmark  --> Embedding throughput measurement
```

### Module Structure

| Module                              | Purpose                                                                                                                    |
| ----------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `main.rs`                           | CLI entry point (clap subcommands)                                                                                         |
| `adapter.rs`                        | `FrameworkAdapter` trait definition                                                                                        |
| `adapters/`                         | Adapter implementations: subprocess (persistent/batch), native (in-process), kreuzberg factory functions for all languages |
| `runner.rs`                         | Benchmark orchestration, iteration control, resource monitoring                                                            |
| `quality.rs`                        | TF1: token-level bag-of-words F1 scoring                                                                                   |
| `markdown_quality.rs`               | SF1: structural block-level F1 scoring                                                                                     |
| `comparison.rs`                     | Multi-pipeline extraction with quality guardrails                                                                          |
| `pipeline_benchmark.rs`             | 6-path extraction matrix benchmark                                                                                         |
| `corpus.rs`, `fixture.rs`           | Fixture loading, filtering, validation                                                                                     |
| `aggregate.rs`, `consolidate.rs`    | Multi-job result merging and percentile aggregation                                                                        |
| `output.rs`, `stats.rs`             | Result serialization and statistical analysis                                                                              |
| `validate_gt.rs`                    | Ground truth integrity checks and HTML-to-GFM cleanup                                                                      |
| `monitoring.rs`                     | CPU and memory sampling during benchmarks                                                                                  |
| `profiling.rs`, `profile_report.rs` | Flamegraph generation (requires `profiling` feature)                                                                       |
| `survey.rs`                         | Corpus-wide extraction statistics                                                                                          |
| `model_benchmark.rs`                | Layout model A/B comparison                                                                                                |
| `embed_benchmark.rs`                | Embedding throughput benchmarks                                                                                            |
| `sizes.rs`                          | Framework installation footprint measurement                                                                               |

## Quality Scoring

### TF1 (Text F1)

Token-level bag-of-words F1 between extracted text and ground truth.

- Tokenization: lowercase, split on whitespace, keep alphanumeric tokens plus `.` and `,`
- Separate numeric-token F1 for number-heavy documents (financial, scientific)
- Combined score when only text ground truth is available: `quality_score = 0.6 * f1_text + 0.4 * f1_numeric`
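
The rules above can be sketched roughly as follows. This is a minimal illustration and not the code in `quality.rs`: the function names (`tokenize`, `bag_f1`, `text_quality`), the character-level filtering, the treatment of any token containing a digit as numeric, and the handling of empty inputs are all assumptions.

```rust
use std::collections::HashMap;

/// Lowercase, split on whitespace, keep alphanumeric characters plus `.` and `,`.
/// (Assumed interpretation of the tokenization rule.)
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split_whitespace()
        .map(|tok| {
            tok.chars()
                .filter(|c| c.is_alphanumeric() || *c == '.' || *c == ',')
                .collect::<String>()
        })
        .filter(|tok| !tok.is_empty())
        .collect()
}

/// Count token occurrences (the "bag" in bag-of-words).
fn bag(tokens: &[String]) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for tok in tokens {
        *counts.entry(tok.as_str()).or_insert(0) += 1;
    }
    counts
}

/// F1 over two token multisets; overlap is the sum of per-token minimum counts.
fn bag_f1(extracted: &[String], truth: &[String]) -> f64 {
    if extracted.is_empty() || truth.is_empty() {
        // Assumption: two empty inputs agree perfectly, otherwise the score is 0.
        return if extracted.is_empty() && truth.is_empty() { 1.0 } else { 0.0 };
    }
    let (e, t) = (bag(extracted), bag(truth));
    let overlap: usize = e
        .iter()
        .map(|(tok, n)| (*n).min(*t.get(*tok).unwrap_or(&0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / extracted.len() as f64;
    let recall = overlap as f64 / truth.len() as f64;
    2.0 * precision * recall / (precision + recall)
}

/// Assumption: a token counts as numeric if it contains at least one ASCII digit.
fn is_numeric(tok: &str) -> bool {
    tok.chars().any(|c| c.is_ascii_digit())
}

/// Text-only combined score: 0.6 * f1_text + 0.4 * f1_numeric.
fn text_quality(extracted: &str, truth: &str) -> f64 {
    let (e, t) = (tokenize(extracted), tokenize(truth));
    let numeric = |toks: &[String]| -> Vec<String> {
        toks.iter().filter(|tok| is_numeric(tok.as_str())).cloned().collect()
    };
    let (e_num, t_num) = (numeric(&e), numeric(&t));
    0.6 * bag_f1(&e, &t) + 0.4 * bag_f1(&e_num, &t_num)
}
```

Because the comparison is a bag of words, reordering the extracted text does not change the score; ordering is only captured by the structural metric.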

### SF1 (Structural F1)

Block-level matching between extracted markdown and ground truth markdown.

- **Block types:** Heading1-6, Paragraph, CodeBlock, Formula, Table, ListItem, Image
- **Type weights:** Headings = 2.0, Code/Formula/Table = 1.5, ListItem = 1.0, Paragraph/Image = 0.5
- **Matching:** Greedy 1:1 with fuzzy cross-type compatibility (e.g., bold paragraph matched to heading gets 0.4 compatibility score)
- **Adjacent concatenation:** Consecutive blocks of the same type are merged before matching
- **Order score:** Longest Increasing Subsequence (LIS) on matched block indices
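
As a rough sketch of two of these ingredients, the snippet below encodes the type weights and computes an order score from the LIS length. The names (`BlockType`, `type_weight`, `lis_len`, `order_score`) and the normalization (LIS length divided by the number of matched blocks) are assumptions for illustration; the actual logic lives in `markdown_quality.rs` and may differ.

```rust
/// Block types listed above; the heading level is carried as a payload.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BlockType {
    Heading(u8),
    Paragraph,
    CodeBlock,
    Formula,
    Table,
    ListItem,
    Image,
}

/// Type weights as documented: structure-bearing blocks count more.
fn type_weight(block: BlockType) -> f64 {
    match block {
        BlockType::Heading(_) => 2.0,
        BlockType::CodeBlock | BlockType::Formula | BlockType::Table => 1.5,
        BlockType::ListItem => 1.0,
        BlockType::Paragraph | BlockType::Image => 0.5,
    }
}

/// Length of the longest strictly increasing subsequence in O(n log n).
/// `tails[k]` holds the smallest possible tail of an increasing run of length k + 1.
fn lis_len(seq: &[usize]) -> usize {
    let mut tails: Vec<usize> = Vec::new();
    for &x in seq {
        let pos = tails.partition_point(|&t| t < x);
        if pos == tails.len() {
            tails.push(x);
        } else {
            tails[pos] = x;
        }
    }
    tails.len()
}

/// `matched_truth_indices[i]` is the ground-truth index matched to the i-th
/// matched extracted block. Assumed normalization: LIS length / match count.
fn order_score(matched_truth_indices: &[usize]) -> f64 {
    if matched_truth_indices.is_empty() {
        return 1.0; // assumption: nothing matched means nothing is out of order
    }
    lis_len(matched_truth_indices) as f64 / matched_truth_indices.len() as f64
}
```

With this normalization, matches in ground-truth order score 1.0, and each block that breaks the increasing run lowers the score proportionally.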

### Combined Score

When markdown ground truth is available, TF1 and SF1 are combined, with the SF1 result entering as `f1_layout`:

```text
quality_score = 0.5 * f1_text + 0.2 * f1_numeric + 0.3 * f1_layout
```
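
A minimal sketch of how the two weightings might be selected, assuming the structural score is simply absent when a fixture has no markdown ground truth. The function name and the `Option`-based signature are hypothetical.

```rust
/// Combine the component scores.
/// `f1_layout` is `None` when no markdown ground truth exists (assumed convention).
fn quality_score(f1_text: f64, f1_numeric: f64, f1_layout: Option<f64>) -> f64 {
    match f1_layout {
        // Markdown ground truth available: text, numeric, and layout all contribute.
        Some(layout) => 0.5 * f1_text + 0.2 * f1_numeric + 0.3 * layout,
        // Text-only ground truth: fall back to the two-term TF1 weighting.
        None => 0.6 * f1_text + 0.4 * f1_numeric,
    }
}
```

Both weightings sum to 1.0, so the combined score stays in the same 0 to 1 range as its inputs.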

## Fixture Format

Fixtures are JSON files organized by format directory under `fixtures/`:

```json
{
  "document": "relative/path/to/file.pdf",
  "file_type": "pdf",
  "file_size": 123456,
  "expected_frameworks": ["kreuzberg", "docling"],
  "metadata": {},
  "ground_truth": {
    "text_file": "relative/path/to/gt.txt",
    "markdown_file": "relative/path/to/gt.md",
    "source": "manual|vision|pdf_text_layer|pandoc|python-docx|..."
  }
}
```
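
For orientation, the JSON above could map onto Rust types along these lines. This is a sketch only: the field types, which fields are optional, and the helper methods are assumptions, and the harness's own definitions in `fixture.rs` may look different (a real implementation would likely derive a deserializer rather than build values by hand).

```rust
use std::collections::HashMap;

/// Ground truth references for one fixture (optionality is an assumption).
#[derive(Debug, Clone)]
struct GroundTruth {
    text_file: String,
    markdown_file: Option<String>,
    source: String,
}

/// One fixture entry, mirroring the JSON keys shown above.
#[derive(Debug, Clone)]
struct Fixture {
    document: String,
    file_type: String,
    file_size: u64,
    expected_frameworks: Vec<String>,
    metadata: HashMap<String, String>,
    ground_truth: Option<GroundTruth>,
}

impl Fixture {
    /// True when the fixture can be scored structurally (SF1).
    fn has_markdown_gt(&self) -> bool {
        self.ground_truth
            .as_ref()
            .and_then(|gt| gt.markdown_file.as_ref())
            .is_some()
    }

    /// True when the named framework is expected to handle this document.
    fn expects(&self, framework: &str) -> bool {
        self.expected_frameworks.iter().any(|f| f == framework)
    }
}
```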

### Ground Truth Coverage

| Format | Fixtures | With Markdown GT |
| ------ | -------- | ---------------- |
| PDF    | 159      | 158              |
| HTML   | 36       | 36               |
| DOCX   | 26       | 26               |
| ODT    | 19       | 19               |
| RTF    | 17       | 17               |
| XLSX   | 12       | 11               |
| CSV    | 11       | 11               |
| EPUB   | 8        | 8                |
| PPTX   | 8        | 8                |
| Org    | 6        | 6                |
| DOC    | 5        | 5                |
| OPML   | 4        | 4                |
| RST    | 3        | 3                |
| XLS    | 3        | 3                |
| IPYNB  | 1        | 1                |
| JATS   | 1        | 1                |
| LaTeX  | 1        | 1                |

**Total:** 320 fixtures across 17 formats, 318 of them with markdown ground truth.

## Frameworks

### Kreuzberg Bindings (13)

Each binding is benchmarked in both single-file (sequential, fair latency) and batch (concurrent, throughput) modes:

Rust, Python, Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, WASM, C, Rust+PaddleOCR

### Reference Frameworks (12)

External document extraction tools benchmarked in single-file mode:

Docling, MarkItDown, Pandoc, Unstructured, Tika, PyMuPDF4LLM, PDFPlumber, MinerU, PyPDF, PDFMiner, PDFtoText, Playa-PDF

## Extraction Pipelines

The `compare` and `pipeline-benchmark` commands support these extraction paths:

| Pipeline           | Description                                    |
| ------------------ | ---------------------------------------------- |
| `baseline`         | Native PDF text extraction (no OCR, no layout) |
| `layout`           | Native PDF with layout detection               |
| `tesseract`        | Tesseract OCR with `force_ocr`                 |
| `tesseract+layout` | Tesseract OCR with layout detection            |
| `paddle`           | PaddleOCR mobile tier with `force_ocr`         |
| `paddle+layout`    | PaddleOCR mobile tier with layout detection    |
| `paddle-server`    | PaddleOCR server tier                          |
| `docling`          | Vendored Docling reference extraction          |
| `paddleocr-python` | Vendored PaddleOCR Python extraction           |
| `rapidocr`         | Vendored RapidOCR extraction                   |

## CLI Reference

### `run` -- CI benchmark execution

Runs benchmarks using framework adapters with configurable iterations, warmup, and sharding.

```bash
benchmark-harness run \
  -f fixtures/ \
  -F kreuzberg-rust,kreuzberg-python \
  -m batch \
  -o results/ \
  -i 3 -w 1
```

| Flag                   | Description                                    | Default       |
| ---------------------- | ---------------------------------------------- | ------------- |
| `-f, --fixtures`       | Fixture directory or file                      | required      |
| `-F, --frameworks`     | Comma-separated framework names                | all available |
| `-o, --output`         | Output directory                               | `results`     |
| `-m, --mode`           | `single-file` or `batch`                       | `batch`       |
| `-i, --iterations`     | Benchmark iterations                           | `3`           |
| `-w, --warmup`         | Warmup iterations (discarded)                  | `1`           |
| `-c, --max-concurrent` | Max concurrent extractions                     | CPU count     |
| `-t, --timeout`        | Timeout in seconds                             | `1800`        |
| `--ocr`                | Enable OCR                                     | `false`       |
| `--measure-quality`    | Enable quality assessment                      | `false`       |
| `--shard`              | Run fixture subset (`INDEX/TOTAL`, e.g. `1/3`) | none          |
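
To illustrate what an `INDEX/TOTAL` shard spec implies, here is a small sketch of parsing and fixture selection. It assumes a 1-based index and round-robin assignment; neither is confirmed by this README, and `Shard`, `parse_shard`, and `select_shard` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
struct Shard {
    index: usize, // 1-based (assumption)
    total: usize,
}

/// Parse a spec such as "1/3" into a validated `Shard`.
fn parse_shard(spec: &str) -> Result<Shard, String> {
    let (index, total) = spec
        .split_once('/')
        .ok_or_else(|| format!("expected INDEX/TOTAL, got `{spec}`"))?;
    let index: usize = index
        .trim()
        .parse()
        .map_err(|e| format!("invalid shard index: {e}"))?;
    let total: usize = total
        .trim()
        .parse()
        .map_err(|e| format!("invalid shard total: {e}"))?;
    if total == 0 || index == 0 || index > total {
        return Err(format!("shard index must be in 1..={total}, got {index}"));
    }
    Ok(Shard { index, total })
}

/// Round-robin selection: item i goes to shard (i mod total) + 1.
fn select_shard<T: Clone>(items: &[T], shard: &Shard) -> Vec<T> {
    items
        .iter()
        .enumerate()
        .filter(|(i, _)| i % shard.total == shard.index - 1)
        .map(|(_, item)| item.clone())
        .collect()
}
```

Round-robin assignment keeps shards within one fixture of each other in size, and every fixture lands in exactly one shard as long as all jobs see the same fixture ordering.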

### `consolidate` -- Merge multi-job results

Combines benchmark results from parallel CI jobs into a single aggregated report with percentiles.

```bash
benchmark-harness consolidate \
  --inputs dir1,dir2,dir3 \
  --output consolidated/
```
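
Conceptually, consolidation pools the per-job measurements and then computes percentiles over the pooled set. The sketch below uses the nearest-rank method over latency samples; the percentile method, the choice of p50/p95/p99, and the function names are assumptions, not a description of `consolidate.rs`.

```rust
/// Nearest-rank percentile over an ascending-sorted slice.
/// Returns `None` for empty input or `p` outside 0..=100.
fn percentile(sorted: &[f64], p: f64) -> Option<f64> {
    if sorted.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    // rank = ceil(p / 100 * n), clamped into 1..=n
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Pool samples from several jobs and report (p50, p95, p99).
fn merge_and_summarize(jobs: &[Vec<f64>]) -> Option<(f64, f64, f64)> {
    let mut all: Vec<f64> = jobs.iter().flatten().copied().collect();
    all.sort_by(|a, b| a.total_cmp(b));
    Some((
        percentile(&all, 50.0)?,
        percentile(&all, 95.0)?,
        percentile(&all, 99.0)?,
    ))
}
```

Percentiles are computed on the pooled samples rather than averaged per job, since averaging per-job percentiles generally does not give the percentile of the combined distribution.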

### `compare` -- Local pipeline comparison

Compares extraction pipelines on the document corpus with quality scoring and optional guardrails.

```bash
benchmark-harness compare \
  -f fixtures/ \
  --pipelines baseline,layout,paddle \
  --dump-outputs \
  --guardrails
```

| Flag             | Description                                           |
| ---------------- | ----------------------------------------------------- |
| `--pipelines`    | Comma-separated pipeline names                        |
| `--dump-outputs` | Write extraction outputs to `/tmp/kreuzberg_compare/` |
| `--guardrails`   | Fail on quality regressions (non-zero exit)           |
| `--filter`       | Only run documents matching this substring            |

### `pipeline-benchmark` -- 6-path extraction matrix

Runs the selected pipelines (the 6 default paths unless `--paths` is given) across the corpus and produces a ranked triage table.

```bash
benchmark-harness pipeline-benchmark \
  -f fixtures/ \
  --group tables \
  --sort-by sf1 \
  --bottom-n 10 \
  --triage-blocks
```

| Flag              | Description                                                                                  | Default             |
| ----------------- | -------------------------------------------------------------------------------------------- | ------------------- |
| `--paths`         | Comma-separated pipeline names                                                               | all 6 default paths |
| `--doc`           | Filter by document name substrings                                                           | none                |
| `--group`         | Named benchmark group (`tables`, `structure`, `multicolumn`, `text-quality`, `ocr-fallback`) | none                |
| `--sort-by`       | Sort metric: `sf1`, `tf1`, `time`                                                            | `sf1`               |
| `--bottom-n`      | Show only the N worst-performing documents                                                   | none                |
| `--triage-blocks` | Print per-block-type F1 breakdown                                                            | `false`             |
| `--dump-outputs`  | Write outputs to `/tmp/kreuzberg_pipeline/`                                                  | `false`             |
| `--json-output`   | Write JSON results to file                                                                   | none                |
| `--profile-dir`   | Generate per-pipeline flamegraph SVGs                                                        | none                |
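
The interaction between `--sort-by` and `--bottom-n` can be pictured with the sketch below. It assumes "worst" means lowest score for `sf1`/`tf1` and longest duration for `time`; the types and the `triage` function are hypothetical and only illustrate the idea.

```rust
#[derive(Debug, Clone, PartialEq)]
struct DocResult {
    name: String,
    sf1: f64,
    tf1: f64,
    time_ms: f64,
}

#[derive(Debug, Clone, Copy)]
enum SortBy {
    Sf1,
    Tf1,
    Time,
}

/// Order results worst-first by the chosen metric, then keep at most `bottom_n`.
fn triage(mut results: Vec<DocResult>, sort_by: SortBy, bottom_n: Option<usize>) -> Vec<DocResult> {
    results.sort_by(|a, b| match sort_by {
        SortBy::Sf1 => a.sf1.total_cmp(&b.sf1), // lowest score first
        SortBy::Tf1 => a.tf1.total_cmp(&b.tf1), // lowest score first
        SortBy::Time => b.time_ms.total_cmp(&a.time_ms), // slowest first
    });
    if let Some(n) = bottom_n {
        results.truncate(n);
    }
    results
}
```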

### `validate-gt` -- Ground truth validation

Checks ground truth file integrity and optionally fixes HTML artifacts in markdown files.

```bash
benchmark-harness validate-gt -f fixtures/ --fix
```

### `survey` -- Corpus extraction statistics

Produces corpus-wide extraction statistics grouped by file type.

```bash
benchmark-harness survey -f fixtures/ --types pdf,docx
```

### `model-benchmark` -- Layout model A/B comparison

Compares two layout model presets across the fixture corpus.

```bash
benchmark-harness model-benchmark -f fixtures/ --model-a fast --model-b accurate
```

### `embed-benchmark` -- Embedding throughput

Benchmarks embedding throughput across all presets.

```bash
benchmark-harness embed-benchmark
```

### `list-fixtures` -- List loaded fixtures

```bash
benchmark-harness list-fixtures -f fixtures/
```

### `validate` -- Validate fixture JSON

```bash
benchmark-harness validate -f fixtures/
```

### `measure-framework-sizes` -- Installation footprints

Measures disk usage of all framework installations.

```bash
benchmark-harness measure-framework-sizes --output sizes.json
```

## CI Integration

The benchmark suite runs via `.github/workflows/benchmarks.yaml`, triggered by manual `workflow_dispatch`.

### Execution DAG

```text
setup
  Build harness + FFI library + validate ground truth
    |
    v
bench-{language} x {single-file, batch}     (13 Kreuzberg binding jobs)
    |
    v
kreuzberg-gate                                (wait for all Kreuzberg benchmarks)
    |
    v
bench-{external}                              (12 reference framework jobs, some sharded)
    |
    v
aggregate-and-release                         (consolidate all results -> GitHub Release)
```

### Platform

- Primary: `ubuntu-24.04-arm`
- Exception: WASM uses `ubuntu-24.04` (x86) due to V8 ARM compatibility issues

### Timeouts and Artifacts

- Per-job timeout: 6 hours (the per-document timeout is configurable separately)
- Build artifacts retained: 7 days
- Result artifacts retained: 30 days
- Final output: aggregated JSON published as a GitHub Release

## Vendored Baselines

Pre-generated extraction outputs from reference tools are stored in `vendored/` for offline comparison:

| Directory                    | Source                                             |
| ---------------------------- | -------------------------------------------------- |
| `vendored/docling/`          | Docling extraction outputs                         |
| `vendored/paddleocr-python/` | PaddleOCR Python outputs with timing (`.ms` files) |
| `vendored/rapidocr/`         | RapidOCR extraction outputs                        |

Regenerate with:

```bash
python tools/benchmark-harness/scripts/generate_vendored_baselines.py
```

## Development

```bash
# Build
cargo build -p benchmark-harness

# Run tests
cargo test -p benchmark-harness

# Lint
cargo clippy -p benchmark-harness -- -D warnings

# Local pipeline comparison
cargo run -p benchmark-harness -- compare \
  -f tools/benchmark-harness/fixtures/ \
  --pipelines baseline,layout \
  --dump-outputs

# Validate ground truth
cargo run -p benchmark-harness -- validate-gt \
  -f tools/benchmark-harness/fixtures/

# Full pipeline benchmark with triage
cargo run -p benchmark-harness -- pipeline-benchmark \
  -f tools/benchmark-harness/fixtures/ \
  --sort-by sf1 --bottom-n 20 --triage-blocks

# Corpus survey
cargo run -p benchmark-harness -- survey \
  -f tools/benchmark-harness/fixtures/ --types pdf
```

### Optional Features

| Feature            | Description                               |
| ------------------ | ----------------------------------------- |
| `profiling`        | Enables flamegraph generation via `pprof` |
| `memory-profiling` | Enables jemalloc-based memory profiling   |

Build with features:

```bash
cargo build -p benchmark-harness --features profiling,memory-profiling
```

### Tracing

The harness uses `tracing` with `RUST_LOG` env-filter support. For quality scoring diagnostics:

```bash
RUST_LOG=benchmark_harness::markdown_quality=debug cargo run -p benchmark-harness -- compare ...
```