# Development Workflow

Everything you need to build, test, and debug Kreuzberg locally. This guide assumes you've already followed the [Contributing Guide](../contributing.md) to fork and clone the repository.

---

## The Task Runner

Kreuzberg uses [Task](https://taskfile.dev/) for all build and test workflows. One command bootstraps everything:

```bash title="Terminal"
task setup
```

That installs all toolchains and dependencies. The command is idempotent, so it is safe to re-run at any time.

### The Pattern

Tasks follow `<language>:<action>`, optionally followed by a variant such as `:dev`, `:release`, or `:ci`. Once you learn this pattern, the command for any task is predictable:

```bash title="Terminal"
task rust:build           # Build the Rust core
task rust:build:dev       # Debug build (faster compile, no optimizations)
task rust:build:release   # Release build (slow compile, fast binary)
task rust:test            # Run Rust tests
task rust:test:ci         # Same tests, with CI-level diagnostics

task python:build         # Build Python bindings via maturin
task python:test          # Run Python test suite
task node:build           # Build Node.js bindings via napi
task node:test            # Jest tests
```

The same pattern works for every language: `go:build`, `java:test`, `ruby:build`, `csharp:test`, and so on.

### Bulk Operations

```bash title="Terminal"
task build:all           # Build every binding
task test:all            # Test every binding (sequential)
task test:all:parallel   # Test every binding (parallel: faster, noisier output)
task check               # Lint + format check across the whole repo
```

---

## Testing Locally

### Rust

The core lives in `crates/kreuzberg/`. Most changes start here.

```bash title="Terminal"
# Run the full Rust test suite
task rust:test

# Run a single test by name and show its output
cargo test -p kreuzberg test_pdf_extraction -- --nocapture

# Same, with debug logging (replace test_name with the test you want)
RUST_LOG=debug cargo test -p kreuzberg test_name -- --nocapture
```

### Python

Python bindings are in `packages/python/`. Build first, then test:

```bash title="Terminal"
# Build the bindings, then run the whole suite
task python:build:dev
task python:test

# Or run a subset directly with pytest
cd packages/python
uv run pytest tests/ -k "test_extract" -v
```

The `RUST_LOG` environment variable works here too, because the Rust core writes its logs to the Python process's stderr:

```bash title="Terminal"
RUST_LOG=debug uv run pytest tests/ -v
```

### Node.js

TypeScript bindings are in `packages/typescript/`:

```bash title="Terminal"
task node:build:dev
task node:test

# Or run only the tests whose path matches "extract"
cd packages/typescript
pnpm test -- --testPathPattern="extract"
```

### Everything Else

Same pattern. Build, then test:

```bash title="Terminal"
task go:build && task go:test
task java:build && task java:test
task csharp:build && task csharp:test
task ruby:build && task ruby:test
task php:build && task php:test
task elixir:build && task elixir:test
task r:build && task r:test
task c:build && task c:test
task wasm:build && task wasm:test
```

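
Because every binding follows the same build-then-test sequence, the commands above can also be driven by a loop. The following is a minimal sketch, not something shipped in the repository: `run_bindings` and the `DRY_RUN` switch are hypothetical names introduced here, and only the `task <language>:build` and `task <language>:test` names come from this guide.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: build, then test, each binding passed as an argument.
# With DRY_RUN=1 the commands are printed instead of executed.
run_bindings() {
  local lang action
  for lang in "$@"; do
    for action in build test; do
      if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "task ${lang}:${action}"
      else
        # Under `set -e`, the first failing task stops the script.
        task "${lang}:${action}"
      fi
    done
  done
}

# Preview the commands for three bindings without running them
DRY_RUN=1 run_bindings go java ruby
```

Dropping `DRY_RUN=1` runs the tasks for real, stopping at the first failure.
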
### Testing the Live Browser Demo

The demo at `docs/demo.html` loads `@kreuzberg/wasm` from a CDN. To test local changes against it, use:

```bash title="Terminal"
task demo:dev
```

This builds the Wasm binary and TypeScript dist, patches the demo with local URLs, and starts two servers:

| Server | URL                     | Role                               |
| ------ | ----------------------- | ---------------------------------- |
| Docs   | `http://localhost:8001` | Serves the patched `demo-dev.html` |
| Assets | `http://localhost:9000` | Serves the local Wasm package      |

Open **`http://localhost:8001/demo-dev.html`**; no manual edits are needed. The patched file (`docs/demo-dev.html`) is gitignored and regenerated on every run. The two different ports reproduce the cross-origin setup the CDN creates in production.

To skip the slow Rust build when you've only changed TypeScript:

```bash title="Terminal"
SKIP_WASM_BUILD=1 task demo:dev
```

---

## End-to-End Test Suites

End-to-end tests verify that every language binding produces identical results for the same document. They live in `e2e/` as shared fixtures: test inputs paired with expected outputs.

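
The idea behind a shared fixture can be sketched in a few lines of shell. This is an illustration only: `check_parity`, the file names, and the byte-for-byte comparison are assumptions made here, and the real suites under `e2e/` may compare results differently.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: compare each binding's output file with the expected
# output of one fixture. Prints one line per file and returns non-zero if
# any file differs.
check_parity() {
  local expected="$1"
  shift
  local actual status=0
  for actual in "$@"; do
    if cmp -s "$expected" "$actual"; then
      echo "ok   ${actual}"
    else
      echo "DIFF ${actual}"
      status=1
    fi
  done
  return "$status"
}

# Example with made-up files in a temporary directory
workdir="$(mktemp -d)"
printf 'Hello, world\n' > "${workdir}/expected.txt"
printf 'Hello, world\n' > "${workdir}/python.txt"
printf 'Hello world\n'  > "${workdir}/go.txt"
check_parity "${workdir}/expected.txt" "${workdir}/python.txt" "${workdir}/go.txt" \
  || echo "bindings disagree"
rm -rf "${workdir}"
```

One expected output is checked against every binding, so a difference in any single binding shows up as a failure.
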
### Run End-to-End Tests

| Language             | Directory         | Run with               |
| -------------------- | ----------------- | ---------------------- |
| Python               | `e2e/python/`     | `task python:e2e:test` |
| TypeScript / Node.js | `e2e/typescript/` | `task node:e2e:test`   |
| Rust                 | `e2e/rust/`       | `task rust:e2e:test`   |
| Go                   | `e2e/go/`         | `task go:e2e:test`     |
| Java                 | `e2e/java/`       | `task java:e2e:test`   |
| .NET                 | `e2e/csharp/`     | `task csharp:e2e:test` |
| Ruby                 | `e2e/ruby/`       | `task ruby:e2e:test`   |
| PHP                  | `e2e/php/`        | `task php:e2e:test`    |
| R                    | `e2e/r/`          | `task r:e2e:test`      |

### Regenerate End-to-End Tests

When you add a feature that changes extraction behavior, regenerate the affected end-to-end suites:

```bash title="Terminal"
task python:e2e:generate
task node:e2e:generate
task <lang>:e2e:generate   # Placeholder: replace <lang> with any other binding
```

To regenerate and test all suites at once:

```bash title="Terminal"
task e2e:generate:all
task e2e:test:all
```

---

## Benchmarking

Measure extraction performance with the benchmark harness in `tools/benchmark-harness/`. Use it to track regressions, compare against alternatives, and identify bottlenecks with flamegraphs.

### Quick Start

```bash title="Terminal"
task benchmark:run FRAMEWORK=kreuzberg MODE=single-file
task benchmark:run FRAMEWORK=kreuzberg MODE=batch
```

### Common Modes

| Mode          | What it measures                        |
| ------------- | --------------------------------------- |
| `single-file` | Latency: one file at a time             |
| `batch`       | Throughput: multiple files in parallel  |

### With Profiling

Generate flamegraphs to see where time is spent:

```bash title="Terminal"
task benchmark:profile FRAMEWORK=kreuzberg MODE=single-file
```

Results appear in the `flamegraphs/` directory as interactive SVGs.

View live benchmark results at <https://kreuzberg.dev/benchmarks>.

---

## Linting and Pre-commit

```bash title="Terminal"
task check         # Full lint + format check (same as CI validate stage)
```

Language-specific:

```bash title="Terminal"
task rust:lint     # clippy + rustfmt
task python:lint   # ruff + mypy
task node:lint     # eslint + typecheck
```

The repository uses pre-commit hooks that enforce conventional commit messages, code formatting, and linter rules. If a commit is rejected, the hook output tells you exactly what to fix.

---

## Working with Documentation

### Building Locally

```bash title="Terminal"
# Install the documentation dependency group
uv sync --group doc
# Build the site from a clean state
zensical build --clean
# Serve the site locally for preview
zensical serve
```

### How Snippets Work

Code examples in the docs aren't written inline. They're pulled from `docs/snippets/` via the `--8<--` include directive, which keeps examples testable and reusable across pages.

```text
docs/snippets/
├── python/           # Python examples
│   ├── api/          # extract_file, batch_extract, etc.
│   ├── config/       # ExtractionConfig, OcrConfig, etc.
│   ├── ocr/          # OCR backends
│   ├── plugins/      # Plugin implementations
│   ├── mcp/          # MCP server and client
│   └── utils/        # Embeddings, chunking, errors
├── rust/             # Rust examples (same layout)
├── typescript/       # TypeScript examples
├── go/, java/, csharp/, ruby/, r/
├── docker/           # Docker commands
├── api_server/       # Server startup examples
└── cli/              # CLI usage
```

When you change a user-facing API, update the matching snippet. When you add a new feature, create a snippet and include it from the relevant doc page.

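
When snippets move or get renamed, it helps to confirm that every include still points at a real file. The sketch below is one way to do that, under stated assumptions: `find_missing_snippets` is a hypothetical helper, it only recognizes the single-line `--8<-- "path"` form without line-range suffixes, and it assumes include paths resolve against the directory passed as the second argument (adjust that to match the site configuration).

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: report `--8<-- "path"` includes whose target is missing.
# $1: docs directory to scan; $2: directory include paths are resolved
# against (an assumption here; defaults to the current directory).
find_missing_snippets() {
  local docs_dir="$1" base="${2:-.}"
  local matches line page target status=0
  # -H prefixes each match with "file:", -o keeps only the include itself
  matches="$(grep -rHoE --include='*.md' -e '--8<-- "[^"]+"' "$docs_dir" || true)"
  if [ -z "$matches" ]; then
    return 0
  fi
  while IFS= read -r line; do
    page="${line%%:*}"       # the page containing the include
    target="${line#*\"}"     # drop everything up to the opening quote
    target="${target%\"*}"   # drop the closing quote
    if [ ! -f "${base}/${target}" ]; then
      echo "${page}: missing snippet ${target}"
      status=1
    fi
  done <<EOF
$matches
EOF
  return "$status"
}

# Usage, from the repository root:
#   find_missing_snippets docs .
```

The function returns a non-zero status when any include is broken, so it can gate a local check before pushing.
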
### Theme Tokens (Light Mode)

Inline `code` and command-style monospace in light mode use the text token **`#26203A`**, defined in `docs/css/extra.css` as `--kb-text` (referenced as `var(--kb-text)`; brand backgrounds use the same value via `--kb-brand-ink`).

---

## Debugging

### Rust Panics

```bash title="Terminal"
# Print a backtrace when a test panics
RUST_BACKTRACE=1 cargo test -p kreuzberg test_name
# Print the full, unabridged backtrace
RUST_BACKTRACE=full cargo test -p kreuzberg test_name
```

### Python FFI Problems

When something goes wrong in the Rust core during a Python call, the error introspection API gives you the details:

```python title="debug_ffi.py"
from kreuzberg import get_last_error_code, get_error_details, get_last_panic_context

# Details of the most recent error reported by the Rust core
details = get_error_details()
print(f"Error: {details['message']}")
print(f"Code: {details['error_code']}")

# Panic context is printed only when one was recorded
context = get_last_panic_context()
if context:
    print(f"Panic context: {context}")
```

### Verbose Logging

Crank up the log level to see what the Rust core is doing:

```bash title="Terminal"
RUST_LOG=debug task python:test
RUST_LOG=trace task rust:test
```

---

## CI/CD

CI runs on every push and PR to `main` via `.github/workflows/ci.yaml`. The pipeline has four stages:

1. **Validate**: conventional commits, formatting, clippy
2. **Build**: FFI libraries, Python wheels, Node packages, all bindings
3. **Test**: per-language test suites on Linux, macOS, and Windows
4. **Integration**: Docker build, Docker smoke tests, CLI tests

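
Both the Validate stage and the local pre-commit hooks check for conventional commit messages. As a rough illustration of what such a check looks for, the sketch below tests a commit subject against a pattern. The function name and the list of accepted types are assumptions; the repository's own hook configuration defines the real rules.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical check: succeeds when the subject looks like
#   type(optional-scope)!: description
# The accepted types are a commonly used set, not necessarily the project's.
is_conventional_commit() {
  local pattern='^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([a-z0-9._/-]+\))?!?: .+'
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try it on one well-formed and one malformed subject
for subject in "fix(python): handle empty PDF input" "Fixed some stuff"; do
  if is_conventional_commit "$subject"; then
    echo "accepted: ${subject}"
  else
    echo "rejected: ${subject}"
  fi
done
```
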
### Smart Change Detection

CI doesn't rebuild everything on every PR. A `changes` job detects which paths were touched and only runs the relevant build/test jobs. Edit a Python file? Only Python builds and tests run. Touch the Rust core? Everything downstream rebuilds.

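
The selection logic can be pictured as a mapping from changed paths to jobs. The sketch below is an illustration under assumptions, not the workflow's actual filter: `select_jobs` and its path rules are hypothetical, built only from the directories named in this guide (`crates/`, `packages/python/`, `packages/typescript/`), and it covers just three of the bindings.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical mapping from changed paths to the jobs that should run.
select_jobs() {
  local rust=0 python=0 node=0 path
  for path in "$@"; do
    case "$path" in
      # A core change fans out to every downstream binding
      crates/*)              rust=1; python=1; node=1 ;;
      packages/python/*)     python=1 ;;
      packages/typescript/*) node=1 ;;
    esac
  done
  if [ "$rust" -eq 1 ]; then echo "rust"; fi
  if [ "$python" -eq 1 ]; then echo "python"; fi
  if [ "$node" -eq 1 ]; then echo "node"; fi
}

# A made-up Python-only change selects only the Python job
select_jobs packages/python/kreuzberg/__init__.py   # prints: python
```

Passing a path under `crates/` prints all three job names, which mirrors the rule that touching the Rust core rebuilds everything downstream.
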
### Running CI Checks Locally

Before pushing, you can run the same checks CI runs:

```bash title="Terminal"
task check            # Matches the validate stage
task rust:test:ci     # Rust tests with CI diagnostics
task python:test:ci   # Python tests with CI diagnostics
task test:all:ci      # Everything
```

### Other Workflows

| Workflow              | When it runs                          | What it does                       |
| --------------------- | ------------------------------------- | ---------------------------------- |
| `ci.yaml`             | Every push/PR to `main`               | The main pipeline                  |
| `docs.yaml`           | Changes to `docs/` or `zensical.toml` | Builds and validates documentation |
| `benchmarks.yaml`     | Manual trigger                        | Runs the full benchmark suite      |
| `profiling.yaml`      | Manual trigger                        | Generates flamegraphs              |
| `publish.yaml`        | Release events                        | Publishes packages to registries   |
| `publish-docker.yaml` | Tags and releases                     | Builds and pushes Docker images    |

---

## Performance

Kreuzberg's core is written in Rust, which compiles ahead of time to native code and enables zero-copy memory handling, SIMD acceleration, and true multi-core parallelism, all without a garbage collector.

### Why Rust Matters

- **Native compilation:** LLVM optimizes code ahead of time (inlining, vectorization, dead code elimination)
- **Zero-copy strings:** Slicing uses borrowed references, not heap allocations
- **SIMD acceleration:** Whitespace detection and character classification run 15-37x faster than scalar operations
- **No GIL:** True multi-core parallelism across all CPU cores
- **Deterministic memory:** Drop semantics free memory as soon as a value goes out of scope, with no GC pauses

### Key Optimizations

- **Batch processing:** 6-10x faster than sequential extraction through a work-stealing scheduler
- **Caching:** 85%+ hit rates for repeated files (SQLite-backed, automatic invalidation)
- **Streaming:** Large files are processed in 4 KB chunks, so memory use stays constant regardless of file size
- **Lazy initialization:** Expensive subsystems (Tokio, plugins) are initialized on first use only

### Benchmarking Your Workload

Measure with your actual files using the benchmark harness (see the [Benchmarking](#benchmarking) section for full instructions). For detailed analysis and live benchmark results, visit <https://kreuzberg.dev/benchmarks>.

---