7.5 KiB
Aggregation Schema v2.4.0
This document describes the structure of aggregated.json produced by benchmark-harness consolidate.
Top-level Shape
{
"schema_version": "2.4.0",
"by_framework_mode": {
"<aggregate_key>": {
/* FrameworkModeAggregation */
}
},
"disk_sizes": {
"framework": {
/* DiskSizeInfo */
}
},
"comparison": {
/* ComparisonData */
},
"per_fixture_results": [
/* PerFixtureRow[] */
],
"metadata": {
/* ConsolidationMetadata */
}
}
Output Format Discriminator
The output_format field determines:
markdown: Supports all metrics including SF1 (structural F1), layout percentiles, and all ranking tablesplaintext: Text-only extraction; SF1 and layout percentiles arenull; plaintext frameworks never appear in SF1 rankings
by_framework_mode
Key format differs by framework family:
- kreuzberg (
kreuzberg-*):{framework_name}:{mode}— the output format is already encoded in the framework name (e.g.kreuzberg-markdown-baseline), so repeating it in the key is redundant. - competitors (all other frameworks):
{framework}:{output_format}:{mode}— format is not encoded in the name, so the key carries it explicitly.
Examples:
kreuzberg-markdown-baseline:singlekreuzberg-plaintext-paddle-ocr:batchpdfplumber:plaintext:singledocling:markdown:single
Each entry contains:
{
"framework": "string", // Framework name without mode suffix
"output_format": "markdown|plaintext", // Output format used
"mode": "single|batch|...", // Execution mode
"cold_start": {
/* DurationPercentiles */
}, // Optional, if cold start data available
"by_file_type": {
"pdf": {
"file_type": "pdf",
"no_ocr": {
/* PerformancePercentiles */
},
"with_ocr": {
/* PerformancePercentiles */
}
}
}
}
PerformancePercentiles
Contains p50, p95, p99 for all metrics:
{
"successful_sample_count": 42,
"total_sample_count": 50,
"framework_errors": 0,
"harness_errors": 5,
"timeouts": 3,
"empty_content": 0,
"error_details": {
"error message": 2
},
"duration": { "p50": 100.5, "p95": 150.2, "p99": 199.9 },
"throughput": { "p50": 5.2, "p95": 4.8, "p99": 3.1 },
"memory": { "p50": 150.0, "p95": 200.0, "p99": 250.0 },
"cpu": { "p50": 50.0, "p95": 75.0, "p99": 90.0 }, // Optional
"extraction_duration": { "p50": 80.0, "p95": 120.0, "p99": 160.0 }, // Optional
"quality": {
/* QualityPercentiles */
}, // Optional, if quality data available
"success_rate_percent": 84.0
}
QualityPercentiles
Includes p50, p95, p99 for all F1 metrics. Layout percentiles are null for plaintext-only frameworks:
{
"f1_text_p50": 0.92,
"f1_text_p95": 0.88,
"f1_text_p99": 0.75,
"f1_numeric_p50": 0.85,
"f1_numeric_p95": 0.8,
"f1_numeric_p99": 0.7,
"f1_layout_p50": 0.78, // null for plaintext output format
"f1_layout_p95": 0.72, // null for plaintext output format
"f1_layout_p99": 0.65, // null for plaintext output format
"quality_score_p50": 0.85,
"quality_score_p95": 0.8,
"quality_score_p99": 0.7
}
PerFixtureRow
One row per unique combination of (framework, output_format, execution_mode, fixture_id, ocr):
{
"framework": "kreuzberg-markdown-baseline",
"output_format": "markdown",
"execution_mode": "single",
"ocr": false,
"fixture_id": "sample_doc_1",
"file_type": "pdf",
"duration_ms": 125.4,
"peak_memory_mb": 180.5,
"f1_text": 0.92,
"f1_layout": 0.78, // null for plaintext mode
"f1_numeric": 0.85,
"quality_score": 0.85,
"correct": true,
"success": true,
"error_kind": null // "FrameworkError", "HarnessError", "Timeout", etc. if !success
}
ComparisonData
Contains all cross-framework rankings split by output format for quality metrics:
{
"performance_ranking": [
/* RankedFramework[] */
],
"throughput_ranking": [
/* RankedFramework[] */
],
"memory_ranking": [
/* RankedFramework[] */
],
"cpu_ranking": [
/* RankedFramework[] */
],
"quality_ranking": [
/* RankedFramework[] */
],
"pdf_quality_ranking": [
/* RankedFramework[] */
],
"pdf_tf1_ranking_markdown": [
/* RankedFramework[] — markdown-only */
],
"pdf_tf1_ranking_plaintext": [
/* RankedFramework[] — plaintext-only */
],
"pdf_sf1_ranking_markdown": [
/* RankedFramework[] — markdown-only, never plaintext */
],
"deltas_vs_baseline": {
"<aggregate_key>": {
/* DeltaMetrics */
}
}
}
RankedFramework
{
"framework_mode": "kreuzberg-markdown-baseline:single",
"rank": 1,
"value": 95.5, // The metric value (duration, throughput, etc.)
"relative": 1.0 // Ratio relative to best (1.0 = best)
}
Migration from v2.3.0 to v2.4.0
Breaking Changes
- Schema version: Bumped to
"2.4.0" - Kreuzberg aggregate key format: Changed from
framework:output_format:modetoframework_name:modefor allkreuzberg-*frameworks. Competitor key format (framework:output_format:mode) is unchanged.
Kreuzberg Consolidation
Language-binding frameworks (kreuzberg-py, kreuzberg-node, kreuzberg-rb, kreuzberg-go,
kreuzberg-java, kreuzberg-csharp, kreuzberg-elixir, kreuzberg-php, kreuzberg-rust, etc.)
have been removed. They are replaced by three native pipelines run directly via the kreuzberg CLI:
| Pipeline | Markdown name | Plaintext name |
|---|---|---|
| Baseline | kreuzberg-markdown-baseline |
kreuzberg-plaintext-baseline |
| Layout | kreuzberg-markdown-layout |
kreuzberg-plaintext-layout |
| PaddleOCR | kreuzberg-markdown-paddle-ocr |
kreuzberg-plaintext-paddle-ocr |
Batch variants append -batch to the framework name (e.g. kreuzberg-markdown-baseline-batch),
which the harness normalises to aggregate key kreuzberg-markdown-baseline:batch.
Key Format Rationale
The format component is implicit in the kreuzberg framework name itself. Duplicating it in the
aggregate key (kreuzberg-markdown-baseline:markdown:single) would be redundant and confusing.
Competitor names carry no format information, so they continue to need it in the key
(docling:markdown:single).
Migration from v2.2.0 to v2.3.0
Breaking Changes
- Schema version: Bumped to
"2.3.0" - Framework key format: Changed from
framework:modetoframework:output_format:mode - QualityPercentiles: Added p95 and p99 percentiles for all F1 metrics;
f1_layout_*fields are now optional (null for plaintext) - FrameworkModeAggregation: Added
output_formatfield - ComparisonData: Replaced
pdf_tf1_rankingwithpdf_tf1_ranking_markdownandpdf_tf1_ranking_plaintext;pdf_sf1_rankingrenamed topdf_sf1_ranking_markdown(now markdown-only)
New Fields
per_fixture_results: Array of individual fixture results preserving per-file measurementsPerFixtureRow: New struct capturing individual extraction outcomes
Plaintext-only Behavior
- Plaintext frameworks NEVER appear in
pdf_sf1_ranking_markdown - Plaintext frameworks NEVER appear in
pdf_tf1_ranking_markdown(they get their ownpdf_tf1_ranking_plaintext) - SF1 and layout percentiles are
nullfor plaintext output format - All performance rankings (speed, memory, throughput) include both formats without discrimination
ConsolidationMetadata
{
"total_results": 500,
"framework_count": 5,
"file_type_count": 8,
"timestamp": "2025-05-09T10:15:30Z"
}