Nomad changes

Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

docs/guides/development.md Normal file

@@ -0,0 +1,354 @@
# Development Workflow
Everything you need to build, test, and debug Kreuzberg locally. This guide assumes you've already followed the [Contributing Guide](../contributing.md) to fork and clone the repository.
---
## The Task Runner
Kreuzberg uses [Task](https://taskfile.dev/) for all build and test workflows. One command to bootstrap everything:
```bash title="Terminal"
task setup
```
That installs all toolchains and dependencies. Safe to re-run anytime — it's idempotent.
### The Pattern
Tasks follow `<language>:<action>`. Once you learn this pattern, the command for any task is predictable:
```bash title="Terminal"
task rust:build # Build the Rust core
task rust:build:dev # Debug build (faster compile, no optimizations)
task rust:build:release # Release build (slow compile, fast binary)
task rust:test # Run Rust tests
task rust:test:ci # Same tests, with CI-level diagnostics
task python:build # Build Python bindings via maturin
task python:test # Run Python test suite
task node:build # Build Node.js bindings via napi
task node:test # Jest tests
```
The same pattern works for every language: `go:build`, `java:test`, `ruby:build`, `csharp:test`, and so on.
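Because the naming is this regular, task names are easy to generate from a script. The following is a minimal sketch that assumes only the `<language>:<action>` convention described above. `task_name` and `build_then_test` are hypothetical helpers, not part of the repository, and the code only constructs command lines without running them.

```python
from __future__ import annotations


def task_name(language: str, action: str, variant: str | None = None) -> str:
    """Compose a Task target such as 'rust:build' or 'rust:build:dev'."""
    parts = [language, action]
    if variant:
        parts.append(variant)
    return ":".join(parts)


def build_then_test(languages: list[str]) -> list[list[str]]:
    """Return the `task` invocations that build and then test each language."""
    commands = []
    for language in languages:
        commands.append(["task", task_name(language, "build")])
        commands.append(["task", task_name(language, "test")])
    return commands


if __name__ == "__main__":
    # Print the commands instead of executing them (hypothetical usage).
    for command in build_then_test(["go", "java", "ruby"]):
        print(" ".join(command))
```

To actually run the commands, each list could be passed to `subprocess.run(command, check=True)`.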
### Bulk Operations
```bash title="Terminal"
task build:all # Build every binding
task test:all # Test every binding (sequential)
task test:all:parallel # Test every binding (parallel — faster, noisier output)
task check # Lint + format check across the whole repo
```
---
## Testing Locally
### Rust
The core lives in `crates/kreuzberg/`. Most changes start here.
```bash title="Terminal"
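task rust:test # Run the full Rust test suite
cargo test -p kreuzberg test_pdf_extraction -- --nocapture # Run one test and show its stdout
RUST_LOG=debug cargo test -p kreuzberg test_name -- --nocapture # Same, with debug logging from the core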
```
### Python
Python bindings are in `packages/python/`. Build first, then test:
```bash title="Terminal"
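task python:build:dev # Debug build of the Python bindings
task python:test # Run the full Python suite
cd packages/python
uv run pytest tests/ -k "test_extract" -v # Run only tests matching "test_extract"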
```
The `RUST_LOG` env var works here too — the Rust core logs through Python's stderr:
```bash title="Terminal"
RUST_LOG=debug uv run pytest tests/ -v
```
### Node.js
TypeScript bindings are in `packages/typescript/`:
```bash title="Terminal"
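task node:build:dev # Debug build of the Node.js bindings
task node:test # Run the full Jest suite
cd packages/typescript
pnpm test -- --testPathPattern="extract" # Run only test files whose path matches "extract"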
```
### Everything Else
Same pattern. Build, then test:
```bash title="Terminal"
task go:build && task go:test
task java:build && task java:test
task csharp:build && task csharp:test
task ruby:build && task ruby:test
task php:build && task php:test
task elixir:build && task elixir:test
task r:build && task r:test
task c:build && task c:test
task wasm:build && task wasm:test
```
### Testing the live browser demo
The demo at `docs/demo.html` loads `@kreuzberg/wasm` from a CDN. To test local changes against it, use:
```bash title="Terminal"
task demo:dev
```
This builds the Wasm binary and TypeScript dist, patches the demo with local URLs, and starts two servers:
| Server | URL | Role |
| ------ | ----------------------- | ---------------------------------- |
| Docs | `http://localhost:8001` | Serves the patched `demo-dev.html` |
| Assets | `http://localhost:9000` | Serves the local Wasm package |
Open **`http://localhost:8001/demo-dev.html`** — no manual edits needed. The patched file (`docs/demo-dev.html`) is gitignored and regenerated on every run. The two different ports reproduce the cross-origin setup the CDN creates in production.
To skip the slow Rust build when you've only changed TypeScript:
```bash title="Terminal"
SKIP_WASM_BUILD=1 task demo:dev
```
---
## End-to-end Test Suites
End-to-end tests verify that every language binding produces identical results for the same document. They live in `e2e/` and are built from shared fixtures: test inputs paired with expected outputs.
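To make the idea of a shared fixture concrete, here is a minimal sketch of a fixture-driven check. The fixture shape (`input` and `expected` keys), the `check_fixture` helper, and `fake_extract` are all assumptions made for illustration; the real fixture format in `e2e/` may differ.

```python
from __future__ import annotations

import json
from typing import Callable


def check_fixture(fixture_json: str, extract: Callable[[str], dict]) -> list[str]:
    """Compare one binding's output with a fixture's expected output.

    Returns a list of human-readable mismatches; an empty list means the
    binding agrees with the fixture.
    """
    fixture = json.loads(fixture_json)
    actual = extract(fixture["input"])
    mismatches = []
    for key, expected in fixture["expected"].items():
        if actual.get(key) != expected:
            mismatches.append(f"{key}: expected {expected!r}, got {actual.get(key)!r}")
    return mismatches


# Hypothetical fixture: one input document and the fields every binding must return.
FIXTURE = json.dumps(
    {
        "input": "hello.txt",
        "expected": {"content": "hello", "mime_type": "text/plain"},
    }
)


def fake_extract(path: str) -> dict:
    """Stand-in for a real binding's extraction call."""
    return {"content": "hello", "mime_type": "text/plain"}
```

Running the same fixture through each binding's extraction call and requiring an empty mismatch list everywhere is what makes the results comparable across languages.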
### Run end-to-end Tests
| Language | Directory | Run with |
| -------------------- | ----------------- | ---------------------- |
| Python | `e2e/python/` | `task python:e2e:test` |
| TypeScript / Node.js | `e2e/typescript/` | `task node:e2e:test` |
| Rust | `e2e/rust/` | `task rust:e2e:test` |
| Go | `e2e/go/` | `task go:e2e:test` |
| Java | `e2e/java/` | `task java:e2e:test` |
| .NET | `e2e/csharp/` | `task csharp:e2e:test` |
| Ruby | `e2e/ruby/` | `task ruby:e2e:test` |
| PHP | `e2e/php/` | `task php:e2e:test` |
| R | `e2e/r/` | `task r:e2e:test` |
### Regenerate end-to-end Tests
When you add a feature that changes extraction behavior, regenerate the affected end-to-end suites:
```bash title="Terminal"
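task python:e2e:generate # Regenerate the Python suite
task node:e2e:generate # Regenerate the TypeScript / Node.js suite
task <lang>:e2e:generate # Placeholder: substitute any other language prefix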
```
To regenerate and test all suites at once:
```bash title="Terminal"
task e2e:generate:all
task e2e:test:all
```
---
## Benchmarking
Measure extraction performance with the benchmark harness in `tools/benchmark-harness/`. Use it to track regressions, compare against alternatives, and identify bottlenecks with flamegraphs.
### Quick Start
```bash title="Terminal"
task benchmark:run FRAMEWORK=kreuzberg MODE=single-file
task benchmark:run FRAMEWORK=kreuzberg MODE=batch
```
### Common Modes
| Mode | What it measures |
| ------------- | --------------------------------------- |
| `single-file` | Latency — one file at a time |
| `batch` | Throughput — multiple files in parallel |
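The difference between the two modes can be sketched in a few lines. This is an illustration of the two measurements, not the harness in `tools/benchmark-harness/`: `fake_extract` is a stand-in that sleeps instead of extracting, and the helper names and the thread pool are assumptions chosen to keep the example self-contained.

```python
from __future__ import annotations

import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def fake_extract(path: str) -> str:
    """Stand-in for a real extraction call; sleeps briefly to simulate work."""
    time.sleep(0.005)
    return path


def measure_single_file(extract: Callable[[str], object], paths: Sequence[str]) -> list[float]:
    """Latency: time each file on its own, one after another (seconds per file)."""
    latencies = []
    for path in paths:
        start = time.perf_counter()
        extract(path)
        latencies.append(time.perf_counter() - start)
    return latencies


def measure_batch(extract: Callable[[str], object], paths: Sequence[str], workers: int = 4) -> float:
    """Throughput: files per second when the whole set is processed in parallel."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(extract, paths))  # Drain the iterator so every file finishes
    elapsed = time.perf_counter() - start
    return len(paths) / elapsed
```

Latency answers "how long does one document take?", while throughput answers "how many documents per second can the machine sustain?"; a change can improve one and hurt the other, which is why the harness reports both.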
### With Profiling
Generate flamegraphs to see where time is spent:
```bash title="Terminal"
task benchmark:profile FRAMEWORK=kreuzberg MODE=single-file
```
Results appear in the `flamegraphs/` directory as interactive SVGs.
View live benchmark results at <https://kreuzberg.dev/benchmarks>.
---
## Linting and Pre-commit
```bash title="Terminal"
task check # Full lint + format check (same as CI validate stage)
```
Language-specific:
```bash title="Terminal"
task rust:lint # clippy + rustfmt
task python:lint # ruff + mypy
task node:lint # eslint + typecheck
```
The repository uses pre-commit hooks that enforce conventional commit messages, code formatting, and linter rules. If a commit is rejected, the hook output tells you exactly what to fix.
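To check a commit message before the hook does, a rough validator is enough. The sketch below assumes the commonly used Conventional Commits type list; the types and scope rules actually configured in this repository's hook may differ, and `is_conventional` is a hypothetical helper.

```python
import re

# Assumed type list (the common Conventional Commits set), not read from the repo config.
CONVENTIONAL_COMMIT = re.compile(
    r"^(build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test)"
    r"(\([\w./-]+\))?"  # optional scope, e.g. (python)
    r"!?"  # optional breaking-change marker
    r": \S.*$"  # colon, space, non-empty description
)


def is_conventional(message: str) -> bool:
    """Check only the first line (the subject) of a commit message."""
    lines = message.splitlines()
    return bool(lines) and CONVENTIONAL_COMMIT.match(lines[0]) is not None


if __name__ == "__main__":
    for subject in ["feat(python): add batch extraction", "updated stuff"]:
        print(subject, "->", is_conventional(subject))
```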
---
## Working with Documentation
### Building Locally
```bash title="Terminal"
uv sync --group doc # Install the documentation dependencies
zensical build --clean # Clean build of the site
zensical serve # Preview the site locally
```
### How Snippets Work
Code examples in the docs aren't inline — they're pulled from `docs/snippets/` via the `--8<--` include directive. This keeps examples testable and reusable across pages.
```text
docs/snippets/
├── python/ # Python examples
│ ├── api/ # extract_file, batch_extract, etc.
│ ├── config/ # ExtractionConfig, OcrConfig, etc.
│ ├── ocr/ # OCR backends
│ ├── plugins/ # Plugin implementations
│ ├── mcp/ # MCP server and client
│ └── utils/ # Embeddings, chunking, errors
├── rust/ # Rust examples (same layout)
├── typescript/ # TypeScript examples
├── go/, java/, csharp/, ruby/, r/
├── docker/ # Docker commands
├── api_server/ # Server startup examples
└── cli/ # CLI usage
```
When you change a user-facing API, update the matching snippet. When you add a new feature, create a snippet and include it from the relevant doc page.
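Conceptually, an include directive is a line-for-file substitution performed before the page is rendered. The sketch below illustrates that idea only; it is not the plugin's implementation, it handles just the single-line quoted form, and the snippet path used in the example is hypothetical.

```python
from __future__ import annotations

import re

# Matches a line consisting only of the include marker and a quoted path.
INCLUDE = re.compile(r'^\s*--8<--\s+"(?P<path>[^"]+)"\s*$')


def expand_includes(markdown: str, snippets: dict[str, str]) -> str:
    """Replace each include line with the snippet stored under that path.

    `snippets` maps relative paths to file contents; a real build would read
    them from disk. Unknown paths are left untouched so the problem stays visible.
    """
    output = []
    for line in markdown.splitlines():
        match = INCLUDE.match(line)
        if match and match.group("path") in snippets:
            output.append(snippets[match.group("path")].rstrip("\n"))
        else:
            output.append(line)
    return "\n".join(output)


if __name__ == "__main__":
    # Hypothetical snippet path and content, for illustration only.
    page = 'Intro\n--8<-- "python/api/example.md"\nOutro'
    print(expand_includes(page, {"python/api/example.md": "print('hi')\n"}))
```

Because every page that includes a snippet receives the same text, updating the snippet file updates all of those pages at once.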
### Theme tokens (light mode)
Inline `code` and command-style monospace in light mode use the text token **`#26203A`**, defined in `docs/css/extra.css` as `--kb-text` (referenced as `var(--kb-text)`; brand backgrounds use the same value via `--kb-brand-ink`).
---
## Debugging
### Rust Panics
```bash title="Terminal"
RUST_BACKTRACE=1 cargo test -p kreuzberg test_name # Short backtrace on panic
RUST_BACKTRACE=full cargo test -p kreuzberg test_name # Full backtrace with every frame
```
### Python FFI Problems
When something goes wrong in the Rust core during a Python call, the error introspection API gives you the details:
```python title="debug_ffi.py"
from kreuzberg import get_last_error_code, get_error_details, get_last_panic_context
# Inspect the most recent error recorded by the Rust core
details = get_error_details()
print(f"Error: {details['message']}")
print(f"Code: {details['error_code']}")
# If the failure was a panic, print whatever context was captured
context = get_last_panic_context()
if context:
    print(f"Panic context: {context}")
```
### Verbose Logging
Crank up the log level to see what the Rust core is doing:
```bash title="Terminal"
RUST_LOG=debug task python:test
RUST_LOG=trace task rust:test
```
---
## CI/CD
CI runs on every push and PR to `main` via `.github/workflows/ci.yaml`. The pipeline has four stages:
1. **Validate** — conventional commits, formatting, clippy
2. **Build** — FFI libraries, Python wheels, Node packages, all bindings
3. **Test** — per-language test suites on Linux, macOS, and Windows
4. **Integration** — Docker build, Docker smoke tests, CLI tests
### Smart Change Detection
CI doesn't rebuild everything on every PR. A `changes` job detects which paths were touched and only runs the relevant build/test jobs. Edit a Python file? Only Python builds and tests run. Touch the Rust core? Everything downstream rebuilds.
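The logic amounts to mapping changed paths onto jobs and then widening the set when the core is touched. The sketch below is an illustration under stated assumptions: the glob patterns, the job names, and the `jobs_to_run` helper are hypothetical, and the real filters live in the workflow definition.

```python
from __future__ import annotations

from fnmatch import fnmatch

# Hypothetical path filters, based on the directories mentioned in this guide.
PATH_FILTERS = {
    "rust": ["crates/*"],
    "python": ["packages/python/*"],
    "node": ["packages/typescript/*"],
    "docs": ["docs/*"],
}

# Bindings that depend on the Rust core and must rebuild when it changes.
DOWNSTREAM_OF_RUST = {"python", "node"}


def jobs_to_run(changed_files: list[str]) -> set[str]:
    """Return the set of jobs whose path filters match any changed file."""
    jobs = set()
    for job, patterns in PATH_FILTERS.items():
        # fnmatch's `*` also matches `/`, so "crates/*" covers nested files.
        if any(fnmatch(path, pattern) for path in changed_files for pattern in patterns):
            jobs.add(job)
    if "rust" in jobs:
        jobs |= DOWNSTREAM_OF_RUST
    return jobs
```

With these assumed filters, a change under `packages/python/` selects only the Python job, while a change under `crates/` selects the Rust job plus every binding listed as downstream.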
### Running CI Checks Locally
Before pushing, you can run the same checks CI runs:
```bash title="Terminal"
task check # Matches the validate stage
task rust:test:ci # Rust tests with CI diagnostics
task python:test:ci # Python tests with CI diagnostics
task test:all:ci # Everything
```
### Other Workflows
| Workflow | When it runs | What it does |
| --------------------- | ------------------------------------- | ---------------------------------- |
| `ci.yaml` | Every push/PR to `main` | The main pipeline |
| `docs.yaml` | Changes to `docs/` or `zensical.toml` | Builds and validates documentation |
| `benchmarks.yaml` | Manual trigger | Runs the full benchmark suite |
| `profiling.yaml` | Manual trigger | Generates flamegraphs |
| `publish.yaml` | Release events | Publishes packages to registries |
| `publish-docker.yaml` | Tags and releases | Builds and pushes Docker images |
---
## Performance
Kreuzberg's core is written in Rust, which enables zero-copy memory handling, SIMD acceleration, and true multi-core parallelism, all compiled ahead of time and with no garbage collector.
### Why Rust Matters
- **Native compilation:** LLVM optimizes code ahead of time (inlining, vectorization, dead code elimination)
- **Zero-copy strings:** Slicing uses borrowed references, not heap allocations
- **SIMD acceleration:** Whitespace detection and character classification run 15-37x faster than scalar operations
- **No GIL:** True multi-core parallelism across all CPU cores
- **Deterministic memory:** Drop semantics free memory instantly, no GC pauses
### Key Optimizations
- **Batch processing:** 6-10x faster than sequential extraction through a work-stealing scheduler
- **Caching:** 85%+ hit rates for repeated files (SQLite-backed, automatic invalidation)
- **Streaming:** Large files processed in 4 KB chunks, constant memory regardless of file size
- **Lazy initialization:** Expensive subsystems (Tokio, plugins) initialized on first use only
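The streaming point is the easiest of these to illustrate. The sketch below is a Python analogue of the technique, not Kreuzberg's Rust implementation: it reads a stream in fixed 4 KB chunks and uses hashing as a stand-in for per-chunk work, so only one chunk is held in memory at a time. `sha256_streaming` is a hypothetical helper.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 4 * 1024  # 4 KB, matching the chunk size quoted above


def sha256_streaming(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a stream chunk by chunk so memory use stays constant."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # An empty read means end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # In-memory stream used only to keep the example self-contained.
    print(sha256_streaming(io.BytesIO(b"x" * 10_000)))
```

The result is identical to hashing the whole buffer at once; the difference is that peak memory depends on the chunk size rather than the file size.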
### Benchmarking Your Workload
Measure with your actual files using the benchmark harness (see the [Benchmarking](#benchmarking) section for full instructions). For detailed analysis and live benchmark results, visit <https://kreuzberg.dev/benchmarks>.
---