Files
fil/docs/perf/profiling.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

7.1 KiB

Performance Profiling Workflow

Reproducible flamegraph-driven workflow for Kreuzberg PDF performance work. The infrastructure (pprof, the --profile-dir harness flag, the task benchmark:profile command) already exists; this page codifies how to use it as the entry gate for any optimization.

When to use

Mandatory before any code change you intend to call a "performance optimization." A flamegraph proves the function you're about to touch actually consumes meaningful CPU time. Speculative optimization without a flamegraph is forbidden — two iterations have been rejected on this repo for that exact reason.

Generate a flamegraph

# 1. Build with debug symbols. The default release profile strips them,
#    so the flamegraph fills with raw addresses and `__mh_execute_header`.
cargo build --profile profiling -p kreuzberg-cli --features full
cargo build --profile profiling -p benchmark-harness --features profiling

# 2. Run the harness with --profile-dir (note: pipeline-benchmark, not compare).
SHA=$(git rev-parse --short HEAD)
mkdir -p flamegraphs/$SHA
target/profiling/benchmark-harness pipeline-benchmark \
  --fixtures tools/benchmark-harness/fixtures \
  --paths baseline \
  --doc pdf \
  --profile-dir flamegraphs/$SHA \
  --json-output bench/profile-baseline-$SHA.json

The Taskfile wrapper:

task benchmark:profile FRAMEWORK=kreuzberg PIPELINE=baseline OUTPUT_FORMAT=plaintext MODE=batch

works for the SF1 portion but currently builds --release not --profile profiling, so the resulting SVGs only have system-symbol resolution. Until the task definition is fixed, drive flamegraph generation manually with the explicit commands above. (Tracked as a follow-up; don't optimize against a release-stripped flamegraph.)

What this does:

  • Builds kreuzberg-cli in release mode with --features all.
  • Runs the pipeline against the corpus at tools/benchmark-harness/fixtures/.
  • Captures CPU samples at 1000 Hz via the pprof wrapper in tools/benchmark-harness/src/profiling.rs.
  • Writes flamegraphs/<short-sha>/<pipeline>-<format>-<mode>.svg.

Open the SVG in any browser; the flamegraph is interactive (click to zoom, search by symbol).

Pipeline + format choices

Pipeline Use it for
baseline Pure PDF text path — no layout, no OCR. Cleanest signal.
layout RT-DETR + layout-for-markdown overhead.
paddle-ocr Full OCR path including PaddleOCR.

OUTPUT_FORMAT=plaintext skips the markdown classify/assembly pass (use_layout_for_markdown=false, no font-clustering hierarchy). Use it as the most stripped-down benchmark — what's hot here is the floor of "raw extraction" cost.

OUTPUT_FORMAT=markdown exercises the full structure pipeline. Use it when investigating heading/table classification or the rendering pass.

MODE=batch runs multiple PDFs concurrently — better for steady-state CPU measurement. MODE=single-file measures latency one document at a time; useful when investigating tail latency.

Reading flamegraphs

  • Width = total time (self + children). Wide functions at the bottom of a stack are not necessarily hotspots — they're often just the entry point. Look at self time (the visible non-child width).
  • Tall stacks mean deep call chains; they're not problems unless the leaf is hot.
  • Repeated narrow towers in different stacks are good candidates — the same function called from many places, each contributing a thin slice.
  • Filter out pdf_oxide, image, tokio, and system libraries (libc, libpthread) — those are dependencies. Focus on kreuzberg::* symbols.

Memory profiling

Build with --features memory-profiling to enable jemalloc heap dumps. The harness's dump_heap_profile() writes a .heap file alongside the flamegraph. Use sparingly — CPU is almost always the bottleneck.

Per-iteration commit policy

For each accepted optimization, commit:

  • flamegraphs/<commit>/before.svg — flamegraph showing the hotspot before the change.
  • flamegraphs/<commit>/after.svg — flamegraph showing the hotspot reduced or moved after the change.

This lets reviewers verify the change actually touched the function it claimed to touch.


Iteration protocol — accept/reject gate

The protocol below is enforced for every perf candidate. Skipping a step has cost the project two reverted iterations.

Before any code change

  1. Generate flamegraph on current HEAD (task benchmark:profile).
  2. Identify the top-15 self-time kreuzberg::* functions. Filter out pdf_oxide / image / tokio / system libs.
  3. Pick one candidate. Document:
    • File:line of the function.
    • Approximate self-time percentage.
    • The shape of the optimization (e.g., "replace Vec::clone with Cow::Borrowed").

Implement (subagent — performance-engineer)

  1. Make the smallest change that addresses the hotspot. No surrounding cleanup, no helper extractions, no scope creep (minimal-changes rule).

Verify (main agent — never trust subagent ACCEPT verdicts without independent measurement)

  1. cargo build -p kreuzberg --features full — zero warnings.

  2. cargo clippy -p kreuzberg --tests -- -D warnings — clean.

  3. cargo test -p kreuzberg — green.

  4. Re-run the harness:

    target/release/benchmark-harness compare \
      --fixtures tools/benchmark-harness/fixtures \
      --pipelines baseline \
      --json-output bench/iter-N.json
    
  5. Compare bench/iter-N.json vs the most recent committed baseline JSON in bench/:

    • SF1 must NOT regress (any drop > 0.1pt → revert).
    • Lift on at least one of {p50 ms/MB, p95 ms/MB, peak RSS} ≥ 3% → accept.
    • Lift < 3%, no regression → mark commit body marginal, accept.
    • Regression on perf metrics → revert.
  6. Generate post-fix flamegraph; commit before.svg + after.svg + the iter-N JSON.

Stopping condition (curve flattens)

  • Three consecutive REJECT or "marginal" accepts → stop. The optimization curve has flattened; further work is not worth the engineering cost.
  • Aggregate p95 plaintext ms/MB within 20% of pandoc (the best-tuned competitor on plaintext) → stop.

Failure-mode reminders

  • Don't trust subagent ACCEPT verdicts without measuring. A performance-engineer agent has previously declared an optimization accepted with a correctness regression baked in. Always rerun the test suite locally after the agent reports done.
  • Behavior probes catch what F1 doesn't. F1 metrics aggregate; a small regression on a corner case can wash out. When the optimization touches text-shape code (whitespace, escape, punctuation), write a 4-input mini-test asserting exact byte equality vs the unoptimized version.
  • Cache invalidation. When you change the Kreuzberg crate, rebuild both kreuzberg-cli and benchmark-harness (the harness links the crate in-process for compare). A build that "finished in 1.10s" without recompiling the Kreuzberg crate is a sign the change wasn't picked up.