docs/perf/iterations.md
# Performance Iteration Log

Per-iteration tracker for Kreuzberg perf optimization rounds. Append one row per accepted or rejected candidate, following the protocol in `profiling.md`. This log is the institutional-memory replacement for the `feedback_perf_subagent_verification.md` failure-mode warning.

## Format

| commit | candidate | hotspot self-time | p50 Δ | p95 Δ | SF1 Δ | verdict | notes |
| ------ | --------- | ----------------- | ----- | ----- | ----- | ------- | ----- |

- **commit** — short SHA of the optimization commit (or REVERTED if rejected).
- **candidate** — file:function being optimized.
- **hotspot self-time** — pre-fix percentage from the flamegraph.
- **p50 Δ / p95 Δ** — change in median/tail extraction time vs prior baseline JSON.
- **SF1 Δ** — change in aggregate SF1 score. Must be ≥ −0.1pt or revert.
- **verdict** — ACCEPT, MARGINAL (lift < 3%, no regression), or REJECT.
- **notes** — one sentence on why.
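
A minimal Python sketch of rendering a row in the column order above. It is not part of the repository: the `Iteration` dataclass and `format_row` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, astuple

VERDICTS = {"ACCEPT", "MARGINAL", "REJECT"}


@dataclass
class Iteration:
    """One log row; field order matches the table columns (hypothetical type)."""
    commit: str             # short SHA, or "REVERTED" if the candidate was rejected
    candidate: str          # file:function being optimized
    hotspot_self_time: str  # pre-fix percentage from the flamegraph
    p50_delta: str
    p95_delta: str
    sf1_delta: str
    verdict: str
    notes: str


def format_row(it: Iteration) -> str:
    """Render one iteration as a Markdown table row."""
    if it.verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {it.verdict!r}")
    # Escape pipes so free-text notes cannot break the table layout.
    cells = [str(cell).replace("|", "\\|") for cell in astuple(it)]
    return "| " + " | ".join(cells) + " |"


row = format_row(Iteration(
    "REVERTED", "normalize_whitespace rewrite", "not measured",
    "n/a", "n/a", "n/a", "REJECT", "accepted without measurement",
))
print(row)
```

Rejecting unknown verdicts at formatting time keeps the log restricted to the three values defined above.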

## History

| commit | candidate | hotspot self-time | p50 Δ | p95 Δ | SF1 Δ | verdict | notes |
| ------ | --------- | ----------------- | ----- | ----- | ----- | ------- | ----- |
| REVERTED | normalize_whitespace rewrite | not measured | n/a | n/a | n/a | REJECT | perf-engineer agent ACCEPT'd without measurement; correctness regression on leading/trailing spaces. See `feedback_perf_subagent_verification.md`. |
| REVERTED | split_embedded bullet-count fast path | not measured | +5% | +5% | 0 | REJECT | speculation-driven; ~5% wall-time _regression_. Don't optimize without a flamegraph. |
| b51472c1c | layout_runner: stream DynamicImage→RgbImage + drop redundant clones in table_recognition/layout_validation | n/a (memory, not CPU) | n/a | n/a | 0 | ACCEPT | No pre-M baseline; post-M anchor: 292 MB peak RSS on 60 MB PDF (plain, no layout). Q gates: 143/143 regression, 3/3 smoke, 18/18 guardrail failures identical to pre-M (all pre-existing pdf_oxide upstream). |
| 86a706959 | rendering::markdown: Cow single-pass scans replacing 6 eager .replace() chains | 0.02% self-time post-M (flamegraph fa356cb7e) | n/a | n/a | 0 | ACCEPT | M.2 confirmed effective — render_markdown dropped to 0.02% in post-M flamegraph; was queue candidate #2. |

## Queue — CLEARED (stopping condition met)

**Flamegraph `flamegraphs/fa356cb7e/baseline.svg`** (2026-05-11, 88,524 samples, `--profile profiling`, `--features all`):

| rank | self-time | function |
| ---- | --------- | -------- |
| 1 | 0.50% | `kreuzberg::pdf::oxide::table::extract_tables_native` |
| 2 | 0.33% | `kreuzberg::pdf::oxide::text::extract_text_fast_path` |
| 3 | 0.14% | `kreuzberg::pdf::oxide::hierarchy::extract_all_segments` |
| 4 | 0.12% | `kreuzberg::cache::core::GenericCache::set` |
| 5 | 0.11% | `kreuzberg::pdf::oxide::images::extract_image_positions` |
| 6 | 0.11% | `kreuzberg::pdf::structure::pipeline::extract_document_structure_from_segments` |
| 7 | 0.09% | `kreuzberg::cache::cleanup::scan_cache_directory` |
| 8 | 0.08% | `kreuzberg::pdf::structure::classify::mark_arxiv_noise` |
| 9 | 0.02% | `kreuzberg::rendering::markdown::render_markdown` |

**Breakdown by crate (aggregate, 88,524 total samples):**

- System/other: 48.4%
- Std/core/alloc: 25.3%
- benchmark_harness (quality scorer): 9.9%
- pdf_oxide: 9.3%
- Rayon: 3.7%
- **Kreuzberg: 3.4%**
- Tokio: 0.1%

**Stopping condition:** The Kreuzberg layer accounts for only 3.4% of total sampled time. The previous queue candidates (fuse_paragraphs, text_repair, normalize_key, classify::merge_consecutive_pages) do not appear in the top-25 Kreuzberg frames, which confirms they are not hot on the baseline pipeline path. The dominant costs are pdf_oxide text/table extraction (9.3%) and system allocator plus OS overhead (48.4%); both are outside Kreuzberg's optimization surface.

**Further kreuzberg-layer CPU gains require upstream pdf_oxide work** (table extraction at 0.50% is the single largest kreuzberg-visible hotspot; it delegates to pdf_oxide). Cache I/O (scan_cache_directory) at 0.09% is the next actionable target if cache efficiency becomes a priority, but it's below the noise floor for extraction pipelines.
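
As an illustration of how a per-crate breakdown like the one above can be derived, here is a small Python sketch that groups self-time sample counts by the first path segment of each symbol. It is an assumption about the method, not the tooling used for this log; the sample counts and the `pdf_oxide::parser::parse` symbol are made up.

```python
from __future__ import annotations


def crate_breakdown(self_samples: dict[str, int]) -> dict[str, float]:
    """Return the percentage of all samples attributed to each top-level crate.

    Symbols without a `::` path (e.g. libc functions) go to "system/other".
    """
    total = sum(self_samples.values())
    if total == 0:
        return {}
    per_crate: dict[str, int] = {}
    for symbol, count in self_samples.items():
        crate = symbol.split("::", 1)[0] if "::" in symbol else "system/other"
        per_crate[crate] = per_crate.get(crate, 0) + count
    return {crate: round(100.0 * n / total, 1) for crate, n in per_crate.items()}


# Hypothetical self-time sample counts, chosen to sum to 100 for readability.
samples = {
    "kreuzberg::pdf::oxide::table::extract_tables_native": 5,
    "kreuzberg::cache::core::GenericCache::set": 1,
    "pdf_oxide::parser::parse": 14,
    "std::vec::Vec::push": 30,
    "malloc": 50,
}
breakdown = crate_breakdown(samples)
print(breakdown)
```

Because each bucket is rounded independently, the printed percentages of a real profile may sum to slightly more or less than 100%.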

### Previous blockers resolved

- Symbol-strip blocker from `flamegraphs/61170f7f6/baseline.svg` is fixed: `.task/workflows/benchmark.yml` patched from `--features full` → `--features all` (kreuzberg-cli has no `full` feature). The `fa356cb7e` flamegraph has 87 Kreuzberg symbols resolved.

### Post-M RSS anchor

Measured 2026-05-11 on `target/profiling/kreuzberg` (post-M, `--features all`, `--profile profiling`):

- **Fixture**: `test_documents/pdf/proof_of_concept_or_gtfo_v13_october_18th_2016.pdf` (60 MB, plain extraction, no layout detection)
- **Peak RSS**: 292 MB (`maximum resident set size: 305,954,816 bytes`)
- **Wall time**: 1.09 real seconds
- **Note**: No pre-M baseline captured; this is the forward anchor. M.1+M.3 impact on layout pipeline RSS requires a separate run with `use_layout_for_markdown=true` on a multi-page PDF.
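
The arithmetic behind these anchor figures can be checked with a short Python sketch. It assumes the 292 MB figure is the byte count divided by 1024², and it treats the reported 60 MB fixture size as approximate.

```python
PEAK_RSS_BYTES = 305_954_816   # "maximum resident set size" reported above
FIXTURE_MB = 60                # fixture size as reported (assumed approximate)
WALL_SECONDS = 1.09


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


peak_mib = bytes_to_mib(PEAK_RSS_BYTES)
ms_per_mb = WALL_SECONDS * 1000 / FIXTURE_MB
rss_per_input_mb = peak_mib / FIXTURE_MB

print(f"peak RSS:    {peak_mib:.2f} MiB")          # 291.78, reported as 292 MB
print(f"time:        {ms_per_mb:.1f} ms/MB")       # 18.2
print(f"RSS / input: {rss_per_input_mb:.1f}x")     # 4.9
```

The ms/MB value uses the same unit as the p50/p95 metrics in the iteration protocol, so a later run on the same fixture can be compared against it directly.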

## Stopping conditions

Per `profiling.md` § Iteration protocol:

- **Three consecutive REJECT or MARGINAL** → optimization curve flattened; stop.
- **Aggregate p95 plaintext ms/MB within 20% of pandoc** → competitive ceiling reached; stop.
- **SF1 regression > 0.1pt on any iteration** → immediate revert; reset the streak.
docs/perf/profiling.md
# Performance Profiling Workflow

Reproducible flamegraph-driven workflow for Kreuzberg PDF performance work. The infrastructure (pprof, the `--profile-dir` harness flag, the `task benchmark:profile` command) already exists; this page codifies how to use it as the entry gate for any optimization.

## When to use

**Mandatory** before any code change you intend to call a "performance optimization." A flamegraph proves the function you're about to touch actually consumes meaningful CPU time. Speculative optimization without a flamegraph is forbidden — two iterations have been rejected on this repo for that exact reason.

## Generate a flamegraph

```bash
# 1. Build with debug symbols. The default release profile strips them,
# so the flamegraph fills with raw addresses and `__mh_execute_header`.
# Note: kreuzberg-cli has no `full` feature; use `all`.
cargo build --profile profiling -p kreuzberg-cli --features all
cargo build --profile profiling -p benchmark-harness --features profiling

# 2. Run the harness with --profile-dir (note: pipeline-benchmark, not compare).
SHA=$(git rev-parse --short HEAD)
mkdir -p "flamegraphs/$SHA"
target/profiling/benchmark-harness pipeline-benchmark \
  --fixtures tools/benchmark-harness/fixtures \
  --paths baseline \
  --doc pdf \
  --profile-dir "flamegraphs/$SHA" \
  --json-output "bench/profile-baseline-$SHA.json"
```

The Taskfile wrapper is:

```bash
task benchmark:profile FRAMEWORK=kreuzberg PIPELINE=baseline OUTPUT_FORMAT=plaintext MODE=batch
```

It works for the SF1 portion, but it currently builds with `--release` rather than `--profile profiling`, so the resulting SVGs only have system-symbol resolution. Until the task definition is fixed, drive flamegraph generation manually with the explicit commands above. (Tracked as a follow-up; don't optimize against a release-stripped flamegraph.)

What the task does:

- Builds `kreuzberg-cli` in release mode with `--features all`.
- Runs the pipeline against the corpus at `tools/benchmark-harness/fixtures/`.
- Captures CPU samples at 1000 Hz via the pprof wrapper in `tools/benchmark-harness/src/profiling.rs`.
- Writes `flamegraphs/<short-sha>/<pipeline>-<format>-<mode>.svg`.

Open the SVG in any browser; the flamegraph is interactive (click to zoom, search by symbol).

### Pipeline + format choices

| Pipeline | Use it for |
| ------------ | -------------------------------------------------------- |
| `baseline` | Pure PDF text path — no layout, no OCR. Cleanest signal. |
| `layout` | RT-DETR + layout-for-markdown overhead. |
| `paddle-ocr` | Full OCR path including PaddleOCR. |

`OUTPUT_FORMAT=plaintext` skips the markdown classify/assembly pass (`use_layout_for_markdown=false`, no font-clustering hierarchy). Use it as the most stripped-down benchmark — what's hot here is the floor of "raw extraction" cost.

`OUTPUT_FORMAT=markdown` exercises the full structure pipeline. Use it when investigating heading/table classification or the rendering pass.

`MODE=batch` runs multiple PDFs concurrently — better for steady-state CPU measurement. `MODE=single-file` measures latency one document at a time; useful when investigating tail latency.

## Reading flamegraphs

- **Width = total time** (self + children). Wide functions at the bottom of a stack are not necessarily hotspots — they're often just the entry point. Look at _self time_ (the visible non-child width).
- **Tall stacks** mean deep call chains; they're not problems unless the leaf is hot.
- **Repeated narrow towers** in different stacks are good candidates — the same function called from many places, each contributing a thin slice.
- Filter out `pdf_oxide`, `image`, `tokio`, and system libraries (`libc`, `libpthread`) — those are dependencies. Focus on `kreuzberg::*` symbols.
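
To make the self-time versus total-time distinction concrete, the Python sketch below computes both from folded stacks (one `frame;frame;leaf count` line per unique stack, the text format commonly fed to flamegraph renderers). Having folded output available is an assumption, since the harness is only described as writing SVGs, and the symbol names are invented.

```python
from __future__ import annotations


def self_and_total(folded_lines: list[str]) -> tuple[dict[str, int], dict[str, int]]:
    """Return (self, total) sample counts per frame from folded stack lines."""
    self_t: dict[str, int] = {}
    total_t: dict[str, int] = {}
    for line in folded_lines:
        # The count follows the last space; symbols themselves may contain spaces.
        stack, _, count_str = line.rpartition(" ")
        count = int(count_str)
        frames = stack.split(";")
        leaf = frames[-1]
        self_t[leaf] = self_t.get(leaf, 0) + count   # only the leaf was on-CPU
        for frame in set(frames):                    # count recursive frames once
            total_t[frame] = total_t.get(frame, 0) + count
    return self_t, total_t


def top_self(self_t: dict[str, int], prefix: str = "kreuzberg::", n: int = 15) -> list[tuple[str, int]]:
    """Top-n frames by self time, restricted to symbols starting with `prefix`."""
    own = [(frame, count) for frame, count in self_t.items() if frame.startswith(prefix)]
    return sorted(own, key=lambda fc: (-fc[1], fc[0]))[:n]


# Hypothetical folded stacks.
folded = [
    "main;kreuzberg::extract;pdf_oxide::parse 70",
    "main;kreuzberg::extract;kreuzberg::render 20",
    "main;kreuzberg::extract 10",
]
self_t, total_t = self_and_total(folded)
print(total_t["kreuzberg::extract"], self_t["kreuzberg::extract"])  # 100 10
print(top_self(self_t))
```

In this made-up profile `kreuzberg::extract` spans the full width of the graph (100 samples) yet has only 10 samples of self time, which is the entry-point pattern described in the first bullet.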

## Memory profiling

Build with `--features memory-profiling` to enable jemalloc heap dumps. The harness's `dump_heap_profile()` writes a `.heap` file alongside the flamegraph. Use sparingly — CPU is almost always the bottleneck.

## Per-iteration commit policy

For each accepted optimization, commit:

- `flamegraphs/<commit>/before.svg` — flamegraph showing the hotspot **before** the change.
- `flamegraphs/<commit>/after.svg` — flamegraph showing the hotspot reduced or moved **after** the change.

This lets reviewers verify the change actually touched the function it claimed to touch.

---

## Iteration protocol — accept/reject gate

The protocol below is enforced for every perf candidate. Skipping a step has cost the project two reverted iterations.

### Before any code change

1. Generate a flamegraph on current HEAD (`task benchmark:profile`, or the manual commands under "Generate a flamegraph" while the task still builds with `--release`).
2. Identify the top-15 self-time `kreuzberg::*` functions. Filter out `pdf_oxide` / `image` / `tokio` / system libs.
3. Pick **one** candidate. Document:
   - File:line of the function.
   - Approximate self-time percentage.
   - The shape of the optimization (e.g., "replace `Vec::clone` with `Cow::Borrowed`").

### Implement (subagent — `performance-engineer`)

4. Make the smallest change that addresses the hotspot. No surrounding cleanup, no helper extractions, no scope creep (`minimal-changes` rule).

### Verify (main agent — never trust subagent ACCEPT verdicts without independent measurement)

5. `cargo build -p kreuzberg --features full` — zero warnings.
6. `cargo clippy -p kreuzberg --tests -- -D warnings` — clean.
7. `cargo test -p kreuzberg` — green.
8. Re-run the harness:

   ```bash
   target/release/benchmark-harness compare \
     --fixtures tools/benchmark-harness/fixtures \
     --pipelines baseline \
     --json-output bench/iter-N.json
   ```

9. Compare `bench/iter-N.json` vs the most recent committed baseline JSON in `bench/`:
   - **SF1 must NOT regress** (any drop > 0.1pt → revert).
   - Lift on at least one of `{p50 ms/MB, p95 ms/MB, peak RSS}` ≥ 3% → **accept**.
   - Lift < 3%, no regression → mark commit body `marginal`, accept.
   - Regression on perf metrics → **revert**.
10. Generate post-fix flamegraph; commit `before.svg` + `after.svg` + the iter-N JSON.
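
The gate in step 9 can be sketched as a small Python function. This is an illustration only: the JSON key names are hypothetical because the harness output schema is not shown here, and checking regressions before lifts is an assumption, since the protocol does not say which rule wins when one metric improves and another regresses.

```python
from __future__ import annotations

# Hypothetical metric keys; all three are lower-is-better.
PERF_KEYS = ("p50_ms_per_mb", "p95_ms_per_mb", "peak_rss_mb")
SF1_MAX_DROP = 0.1    # points
ACCEPT_LIFT = 3.0     # percent


def lift_pct(baseline: float, new: float) -> float:
    """Percentage improvement for a lower-is-better metric (negative = regression)."""
    return (baseline - new) / baseline * 100.0


def verdict(baseline: dict[str, float], new: dict[str, float]) -> str:
    """Classify one iteration as ACCEPT, MARGINAL, or REJECT."""
    if new["sf1"] - baseline["sf1"] < -SF1_MAX_DROP:
        return "REJECT"                       # quality regression: revert
    lifts = [lift_pct(baseline[key], new[key]) for key in PERF_KEYS]
    if any(lift < 0 for lift in lifts):
        return "REJECT"                       # assumption: any perf regression reverts
    if any(lift >= ACCEPT_LIFT for lift in lifts):
        return "ACCEPT"
    return "MARGINAL"                         # lift < 3%, no regression


base = {"sf1": 90.0, "p50_ms_per_mb": 20.0, "p95_ms_per_mb": 40.0, "peak_rss_mb": 300.0}
print(verdict(base, {**base, "p50_ms_per_mb": 19.0}))   # ACCEPT (5% lift on p50)
print(verdict(base, {**base, "p50_ms_per_mb": 19.8}))   # MARGINAL (1% lift)
print(verdict(base, {**base, "p95_ms_per_mb": 42.0}))   # REJECT (p95 regressed)
```

A real gate would probably need a noise tolerance before treating a small negative lift as a regression; the protocol does not define one, so none is assumed here.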

### Stopping condition (curve flattens)

- **Three consecutive REJECT or "marginal" accepts** → stop. The optimization curve has flattened; further work is not worth the engineering cost.
- Aggregate p95 plaintext ms/MB within 20% of `pandoc` (the best-tuned competitor on plaintext) → stop.
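
Both stopping rules can be expressed as a short check over the verdict history. The Python sketch below is hypothetical and reads "within 20% of `pandoc`" as "our p95 ms/MB is at most 1.2 times pandoc's", which is one interpretation of the rule rather than a definition taken from the protocol.

```python
from __future__ import annotations

FLAT_VERDICTS = {"REJECT", "MARGINAL"}


def curve_flattened(verdicts: list[str], streak: int = 3) -> bool:
    """True when the most recent `streak` verdicts are all REJECT or MARGINAL."""
    tail = verdicts[-streak:]
    return len(tail) == streak and all(v in FLAT_VERDICTS for v in tail)


def within_ceiling(ours_p95: float, pandoc_p95: float, tolerance: float = 0.20) -> bool:
    """True when our p95 ms/MB is no more than `tolerance` above pandoc's."""
    return ours_p95 <= pandoc_p95 * (1.0 + tolerance)


def should_stop(verdicts: list[str], ours_p95: float, pandoc_p95: float) -> bool:
    """Stop when either stopping rule holds."""
    return curve_flattened(verdicts) or within_ceiling(ours_p95, pandoc_p95)


history = ["ACCEPT", "REJECT", "MARGINAL", "REJECT"]
print(curve_flattened(history))                  # True: last three are flat
print(within_ceiling(12.5, 10.0))                # False: 25% slower than pandoc
print(should_stop(["ACCEPT"], 11.9, 10.0))       # True: within 20% of pandoc
```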

## Failure-mode reminders

- **Don't trust subagent ACCEPT verdicts without measuring.** A `performance-engineer` agent has previously declared an optimization accepted with a correctness regression baked in. Always rerun the test suite locally after the agent reports done.
- **Behavior probes catch what F1 doesn't.** F1 metrics aggregate; a small regression on a corner case can wash out. When the optimization touches text-shape code (whitespace, escape, punctuation), write a 4-input mini-test asserting exact byte equality vs the unoptimized version.
- **Cache invalidation.** When you change the Kreuzberg crate, rebuild _both_ `kreuzberg-cli` and `benchmark-harness` (the harness links the crate in-process for `compare`). A build that "finished in 1.10s" without recompiling the Kreuzberg crate is a sign the change wasn't picked up.
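
The behavior-probe idea can be sketched as follows. In the repository this would be a Rust test against the real functions; the Python version below only shows the shape of the check, and both `normalize_whitespace_*` functions are hypothetical stand-ins, not Kreuzberg code.

```python
import re


def normalize_whitespace_reference(text: str) -> str:
    """Unoptimized reference: collapse each run of spaces/tabs to one space."""
    return re.sub(r"[ \t]+", " ", text)


def normalize_whitespace_fast(text: str) -> str:
    """Candidate single-pass rewrite that must match the reference exactly."""
    out = []
    prev_space = False
    for ch in text:
        if ch in " \t":
            if not prev_space:
                out.append(" ")
            prev_space = True
        else:
            out.append(ch)
            prev_space = False
    return "".join(out)


# Four probes aimed at the corner cases an aggregate score can hide:
# empty input, leading run, trailing run, mixed whitespace across lines.
PROBES = ["", "  leading", "trailing  ", "a \t b\n  c"]


def check_probes() -> None:
    """Assert byte-for-byte equality between candidate and reference."""
    for probe in PROBES:
        expected = normalize_whitespace_reference(probe).encode("utf-8")
        actual = normalize_whitespace_fast(probe).encode("utf-8")
        assert actual == expected, f"mismatch on {probe!r}: {actual!r} != {expected!r}"


check_probes()
print("all probes match")
```

Comparing encoded bytes instead of a score means a single dropped leading or trailing space fails the probe immediately.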