Files
fil/tools/benchmark-harness/SCHEMA.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

7.5 KiB

Aggregation Schema v2.4.0

This document describes the structure of aggregated.json produced by benchmark-harness consolidate.

Top-level Shape

{
  "schema_version": "2.4.0",
  "by_framework_mode": {
    "<aggregate_key>": {
      /* FrameworkModeAggregation */
    }
  },
  "disk_sizes": {
    "framework": {
      /* DiskSizeInfo */
    }
  },
  "comparison": {
    /* ComparisonData */
  },
  "per_fixture_results": [
    /* PerFixtureRow[] */
  ],
  "metadata": {
    /* ConsolidationMetadata */
  }
}

Output Format Discriminator

The output_format field determines:

  • markdown: Supports all metrics including SF1 (structural F1), layout percentiles, and all ranking tables
  • plaintext: Text-only extraction; SF1 and layout percentiles are null; plaintext frameworks never appear in SF1 rankings

by_framework_mode

Key format differs by framework family:

  • kreuzberg (kreuzberg-*): {framework_name}:{mode} — the output format is already encoded in the framework name (e.g. kreuzberg-markdown-baseline), so repeating it in the key is redundant.
  • competitors (all other frameworks): {framework}:{output_format}:{mode} — format is not encoded in the name, so the key carries it explicitly.

Examples:

  • kreuzberg-markdown-baseline:single
  • kreuzberg-plaintext-paddle-ocr:batch
  • pdfplumber:plaintext:single
  • docling:markdown:single

Each entry contains:

{
  "framework": "string", // Framework name without mode suffix
  "output_format": "markdown|plaintext", // Output format used
  "mode": "single|batch|...", // Execution mode
  "cold_start": {
    /* DurationPercentiles */
  }, // Optional, if cold start data available
  "by_file_type": {
    "pdf": {
      "file_type": "pdf",
      "no_ocr": {
        /* PerformancePercentiles */
      },
      "with_ocr": {
        /* PerformancePercentiles */
      }
    }
  }
}

PerformancePercentiles

Contains p50, p95, p99 for all metrics:

{
  "successful_sample_count": 42,
  "total_sample_count": 50,
  "framework_errors": 0,
  "harness_errors": 5,
  "timeouts": 3,
  "empty_content": 0,
  "error_details": {
    "error message": 2
  },
  "duration": { "p50": 100.5, "p95": 150.2, "p99": 199.9 },
  "throughput": { "p50": 5.2, "p95": 4.8, "p99": 3.1 },
  "memory": { "p50": 150.0, "p95": 200.0, "p99": 250.0 },
  "cpu": { "p50": 50.0, "p95": 75.0, "p99": 90.0 }, // Optional
  "extraction_duration": { "p50": 80.0, "p95": 120.0, "p99": 160.0 }, // Optional
  "quality": {
    /* QualityPercentiles */
  }, // Optional, if quality data available
  "success_rate_percent": 84.0
}

QualityPercentiles

Includes p50, p95, p99 for all F1 metrics. Layout percentiles are null for plaintext-only frameworks:

{
  "f1_text_p50": 0.92,
  "f1_text_p95": 0.88,
  "f1_text_p99": 0.75,
  "f1_numeric_p50": 0.85,
  "f1_numeric_p95": 0.8,
  "f1_numeric_p99": 0.7,
  "f1_layout_p50": 0.78, // null for plaintext output format
  "f1_layout_p95": 0.72, // null for plaintext output format
  "f1_layout_p99": 0.65, // null for plaintext output format
  "quality_score_p50": 0.85,
  "quality_score_p95": 0.8,
  "quality_score_p99": 0.7
}

PerFixtureRow

One row per unique combination of (framework, output_format, execution_mode, fixture_id, ocr):

{
  "framework": "kreuzberg-markdown-baseline",
  "output_format": "markdown",
  "execution_mode": "single",
  "ocr": false,
  "fixture_id": "sample_doc_1",
  "file_type": "pdf",
  "duration_ms": 125.4,
  "peak_memory_mb": 180.5,
  "f1_text": 0.92,
  "f1_layout": 0.78, // null for plaintext mode
  "f1_numeric": 0.85,
  "quality_score": 0.85,
  "correct": true,
  "success": true,
  "error_kind": null // "FrameworkError", "HarnessError", "Timeout", etc. if !success
}

ComparisonData

Contains all cross-framework rankings split by output format for quality metrics:

{
  "performance_ranking": [
    /* RankedFramework[] */
  ],
  "throughput_ranking": [
    /* RankedFramework[] */
  ],
  "memory_ranking": [
    /* RankedFramework[] */
  ],
  "cpu_ranking": [
    /* RankedFramework[] */
  ],
  "quality_ranking": [
    /* RankedFramework[] */
  ],
  "pdf_quality_ranking": [
    /* RankedFramework[] */
  ],
  "pdf_tf1_ranking_markdown": [
    /* RankedFramework[]  markdown-only */
  ],
  "pdf_tf1_ranking_plaintext": [
    /* RankedFramework[]  plaintext-only */
  ],
  "pdf_sf1_ranking_markdown": [
    /* RankedFramework[]  markdown-only, never plaintext */
  ],
  "deltas_vs_baseline": {
    "<aggregate_key>": {
      /* DeltaMetrics */
    }
  }
}

RankedFramework

{
  "framework_mode": "kreuzberg-markdown-baseline:single",
  "rank": 1,
  "value": 95.5, // The metric value (duration, throughput, etc.)
  "relative": 1.0 // Ratio relative to best (1.0 = best)
}

Migration from v2.3.0 to v2.4.0

Breaking Changes

  1. Schema version: Bumped to "2.4.0"
  2. Kreuzberg aggregate key format: Changed from framework:output_format:mode to framework_name:mode for all kreuzberg-* frameworks. Competitor key format (framework:output_format:mode) is unchanged.

Kreuzberg Consolidation

Language-binding frameworks (kreuzberg-py, kreuzberg-node, kreuzberg-rb, kreuzberg-go, kreuzberg-java, kreuzberg-csharp, kreuzberg-elixir, kreuzberg-php, kreuzberg-rust, etc.) have been removed. They are replaced by three native pipelines run directly via the kreuzberg CLI:

Pipeline Markdown name Plaintext name
Baseline kreuzberg-markdown-baseline kreuzberg-plaintext-baseline
Layout kreuzberg-markdown-layout kreuzberg-plaintext-layout
PaddleOCR kreuzberg-markdown-paddle-ocr kreuzberg-plaintext-paddle-ocr

Batch variants append -batch to the framework name (e.g. kreuzberg-markdown-baseline-batch), which the harness normalises to aggregate key kreuzberg-markdown-baseline:batch.

Key Format Rationale

The format component is implicit in the kreuzberg framework name itself. Duplicating it in the aggregate key (kreuzberg-markdown-baseline:markdown:single) would be redundant and confusing. Competitor names carry no format information, so they continue to need it in the key (docling:markdown:single).

Migration from v2.2.0 to v2.3.0

Breaking Changes

  1. Schema version: Bumped to "2.3.0"
  2. Framework key format: Changed from framework:mode to framework:output_format:mode
  3. QualityPercentiles: Added p95 and p99 percentiles for all F1 metrics; f1_layout_* fields are now optional (null for plaintext)
  4. FrameworkModeAggregation: Added output_format field
  5. ComparisonData: Replaced pdf_tf1_ranking with pdf_tf1_ranking_markdown and pdf_tf1_ranking_plaintext; pdf_sf1_ranking renamed to pdf_sf1_ranking_markdown (now markdown-only)

New Fields

  • per_fixture_results: Array of individual fixture results preserving per-file measurements
  • PerFixtureRow: New struct capturing individual extraction outcomes

Plaintext-only Behavior

  • Plaintext frameworks NEVER appear in pdf_sf1_ranking_markdown
  • Plaintext frameworks NEVER appear in pdf_tf1_ranking_markdown (they get their own pdf_tf1_ranking_plaintext)
  • SF1 and layout percentiles are null for plaintext output format
  • All performance rankings (speed, memory, throughput) include both formats without discrimination

ConsolidationMetadata

{
  "total_results": 500,
  "framework_count": 5,
  "file_type_count": 8,
  "timestamp": "2025-05-09T10:15:30Z"
}