Nomad changes

Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions


@@ -0,0 +1,65 @@
---
summary: Configuration loading precedence for CLI and server modes
---
# Configuration Loading & Precedence
## CLI Mode Precedence (highest to lowest)
1. Individual CLI flags (`--ocr`, `--output-format`, `--chunk`)
2. Inline JSON config (`--config-json` or `--config-json-base64`)
3. Config file (`--config path.toml`)
4. Auto-discovered config (`kreuzberg.{toml,yaml,json}` in cwd/parents)
5. Default values
## Server/MCP Mode Precedence
1. CLI arguments (`--host`, `--port`)
2. Environment variables (`KREUZBERG_HOST`, `KREUZBERG_PORT`)
3. Config file `[server]` section
4. Defaults (`127.0.0.1:8000`)
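The server precedence above can be sketched as a "first layer that sets the field wins" lookup. This is a minimal illustration, not the project's implementation: the `Layer` struct and `resolve_bind_addr` function are hypothetical names, and resolving host and port independently of each other is an assumption.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// One configuration layer (hypothetical shape); `None` means "not set here".
#[derive(Default, Clone, Copy)]
pub struct Layer {
    pub host: Option<IpAddr>,
    pub port: Option<u16>,
}

/// Resolve the bind address. `layers` is ordered highest precedence first:
/// CLI arguments, environment variables, then the `[server]` section.
pub fn resolve_bind_addr(layers: &[Layer]) -> SocketAddr {
    // For each field, take the first layer that sets it,
    // falling back to the documented defaults (127.0.0.1:8000).
    let host = layers
        .iter()
        .find_map(|l| l.host)
        .unwrap_or(IpAddr::V4(Ipv4Addr::LOCALHOST));
    let port = layers.iter().find_map(|l| l.port).unwrap_or(8000);
    SocketAddr::new(host, port)
}
```

With a CLI layer that sets only the port and an environment layer that sets both, the port comes from the CLI and the host from the environment.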
## Config File Discovery
Discovery searches the current directory, then each parent directory in turn, for `kreuzberg.toml`, `kreuzberg.yaml`, or `kreuzberg.json`, and stops at the first match.
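A minimal sketch of this upward search, using only the standard library. The function name `discover_config` is hypothetical, and the order in which the three file names are tried within a single directory is an assumption; the text above only states that the search stops at the first match.

```rust
use std::path::{Path, PathBuf};

// File names from the section above; the order within one directory is assumed.
const CONFIG_NAMES: [&str; 3] = ["kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json"];

/// Walk from `start` up through its parents and return the first config file found.
pub fn discover_config(start: &Path) -> Option<PathBuf> {
    // `ancestors()` yields `start` itself first, then each parent up to the root.
    for dir in start.ancestors() {
        for name in CONFIG_NAMES {
            let candidate = dir.join(name);
            if candidate.is_file() {
                return Some(candidate);
            }
        }
    }
    None
}
```

Because the nearest directory is checked first, a config file next to the working directory shadows one further up the tree.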
## Inline JSON Config
Field-level merge (not whole-object replacement):
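```rust
fn merge_json_into_config(base: &ExtractionConfig, json: Value) -> Result<ExtractionConfig> {
    // Serialize the base config so the merge can happen at the JSON level.
    let mut merged = serde_json::to_value(base)?;
    // Merge fields from `json` into `merged` (shown as a top-level merge;
    // whether nested objects are merged recursively is not specified here).
    if let (Some(dst), Value::Object(src)) = (merged.as_object_mut(), json) {
        dst.extend(src);
    }
    Ok(serde_json::from_value(merged)?)
}
```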
Use `--config-json-base64` to pass the same JSON base64-encoded, which avoids shell quoting and escaping problems.
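To make the difference between field-level merging and whole-object replacement concrete, here is a small sketch with plain structs instead of JSON. `OcrConfig`, `OcrOverride`, and `merge_ocr` are hypothetical names used only for illustration; the field names follow the `[ocr]` example in this document.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct OcrConfig {
    pub backend: String,
    pub languages: Vec<String>,
}

/// Hypothetical override layer: `None` means "not specified, keep the base value".
#[derive(Debug, Default)]
pub struct OcrOverride {
    pub backend: Option<String>,
    pub languages: Option<Vec<String>>,
}

/// Field-level merge: only fields present in `over` replace the base values.
pub fn merge_ocr(base: &OcrConfig, over: OcrOverride) -> OcrConfig {
    OcrConfig {
        backend: over.backend.unwrap_or_else(|| base.backend.clone()),
        languages: over.languages.unwrap_or_else(|| base.languages.clone()),
    }
}
```

An override that sets only `backend` leaves `languages` untouched, whereas whole-object replacement would have reset it to its default.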
## Config File Formats
**TOML** (`kreuzberg.toml`):
```toml
use_cache = true
[ocr]
backend = "tesseract"
languages = ["eng", "deu"]
[security_limits]
max_archive_size = 524288000
```
**YAML** and **JSON** follow equivalent structure.
## CLI Flag Overrides
In `commands.rs`: `apply_extraction_overrides()` applies individual flags on top of merged config.
## Critical Rules
1. CLI flags always win over config file
2. JSON merge is field-level, not whole-object
3. Auto-discovery stops at first config file found
4. `--config-json-base64` for shell-safe JSON passing
5. Server config uses `[server]` section + extraction config


@@ -0,0 +1,36 @@
---
priority: high
---
# Crate Structure
Version source of truth: root `Cargo.toml` `[workspace.package] version`.
## Workspace crates (`crates/`)
- `kreuzberg` — core library: extraction engine, MIME detection, plugin system, OCR, chunking, embeddings, API/MCP server
- `kreuzberg-cli` — CLI binary; thin wrapper over core with `cli` feature set
- `kreuzberg-ffi` — C FFI layer (`#[no_mangle] extern "C"`); opaque handles, cbindgen headers; used by Go, Java, C# bindings
- `kreuzberg-node` — NAPI-RS Node.js/TypeScript bindings
- `kreuzberg-py` — PyO3 Python bindings
- `kreuzberg-php` — ext-php-rs PHP bindings
- `kreuzberg-wasm` — wasm-bindgen WASM bindings; uses `wasm-target` feature set
- `kreuzberg-paddle-ocr` — PaddleOCR via ONNX Runtime; not available on WASM or Windows
- `kreuzberg-tesseract` — Rust bindings for Tesseract OCR
## Out-of-workspace bindings (`packages/`)
- `packages/python/` — PyPI (maturin + PyO3)
- `packages/typescript/` — npm type declarations
- `packages/ruby/` — RubyGems (Magnus); native ext compiled by `rake`
- `packages/php/` — Composer (ext-php-rs)
- `packages/go/v5/` — Go module; cgo over kreuzberg-ffi
- `packages/java/` — Maven; Foreign Function & Memory API over kreuzberg-ffi
- `packages/csharp/` — NuGet; P/Invoke over kreuzberg-ffi
- `packages/elixir/` — Hex; Rustler NIF (workspace member at `packages/elixir/native/kreuzberg_rustler`)
- `packages/r/` — CRAN; extendr (excluded from workspace)
## Tools (`tools/`)
- `tools/e2e-generator` — reads JSON fixtures, generates runnable test suites per language into `e2e/`
- `tools/benchmark-harness` — criterion-based benchmark runner


@@ -0,0 +1,56 @@
---
summary: MIME type detection and extractor routing logic
---
# MIME Detection & Routing
## Detection Flow
```text
Extension -> EXT_TO_MIME map -> validate -> Registry lookup -> Extractor
```
## Key Functions
| Function | Location | Purpose |
| ------------------------------------ | -------------- | --------------------------------------- |
| `detect_mime_type(path, inspect)` | `core/mime.rs` | Extension + optional content inspection |
| `detect_mime_type_from_bytes(bytes)` | `core/mime.rs` | Magic number detection (infer crate) |
| `validate_mime_type(mime)` | `core/mime.rs` | Check if any extractor supports it |
## Extension Mapping
118+ extensions mapped in `EXT_TO_MIME` (`core/mime.rs`). Case-insensitive.
Key mappings: `.pdf` -> `application/pdf`, `.docx` -> `application/vnd.openxmlformats-officedocument.wordprocessingml.document`, `.xlsx` -> spreadsheet variant, `.png`/`.jpg` -> `image/*`
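A minimal sketch of a case-insensitive extension lookup. The table below holds only three of the mappings listed above, and `ext_to_mime` and `mime_from_extension` are hypothetical stand-ins for whatever `core/mime.rs` actually exposes.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Tiny stand-in for the real `EXT_TO_MIME` table; keys are lowercase, no dot.
fn ext_to_mime() -> HashMap<&'static str, &'static str> {
    let mut m = HashMap::new();
    m.insert("pdf", "application/pdf");
    m.insert(
        "docx",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    );
    m.insert("png", "image/png");
    m
}

/// Look up the MIME type for a path by its extension, ignoring ASCII case.
pub fn mime_from_extension(path: &Path) -> Option<&'static str> {
    // Normalize to lowercase so `.PDF` and `.pdf` hit the same key.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    ext_to_mime().get(ext.as_str()).copied()
}
```

A path with no extension returns `None`, which is where content inspection would take over.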
## Registry Selection
```rust
// In core/extractor/bytes.rs
fn select_extractor_for_mime(mime_type: &str) -> Result<Arc<dyn DocumentExtractor>> {
    let registry = get_document_extractor_registry();
    let registry_guard = registry.read()?;
    registry_guard
        .get_for_mime_type(mime_type)
        .ok_or_else(|| KreuzbergError::UnsupportedFormat(mime_type.into()))
}
```
Selects highest-priority extractor registered for that MIME type.
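The priority rule can be illustrated with a flat list of registrations. This is a sketch under assumptions: `Registration` and `select_by_priority` are hypothetical, and treating a larger number as higher priority is assumed rather than taken from the registry code.

```rust
/// Hypothetical registry entry: one extractor registered for one MIME type.
pub struct Registration {
    pub name: &'static str,
    pub mime_type: &'static str,
    pub priority: i32, // assumption: larger value means higher priority
}

/// Return the highest-priority registration for `mime`, if any.
pub fn select_by_priority<'a>(regs: &'a [Registration], mime: &str) -> Option<&'a Registration> {
    regs.iter()
        .filter(|r| r.mime_type == mime)
        .max_by_key(|r| r.priority)
}
```

A `None` result corresponds to the `UnsupportedFormat` error in the snippet above.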
## Adding New MIME Types
1. Add extension mapping: `m.insert("ext", "application/x-new");` in `core/mime.rs`
2. Implement `DocumentExtractor` with `supported_mime_types()` returning the MIME
3. Register in `register_default_extractors()`
## Wildcard Support
Extractors can register for MIME type families: `"image/*"` matches `image/png`, `image/jpeg`, etc.
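One way such family matching could work is sketched below. `mime_matches` is a hypothetical helper, and comparing case-insensitively is an assumption; the registry's real matching logic may differ.

```rust
/// Return true if `mime` matches `pattern`, where `pattern` is either an exact
/// MIME type or a family wildcard such as `image/*`.
pub fn mime_matches(pattern: &str, mime: &str) -> bool {
    match pattern.strip_suffix("/*") {
        // Wildcard: compare only the top-level type (the part before '/').
        Some(family) => mime
            .split_once('/')
            .map_or(false, |(top, _)| top.eq_ignore_ascii_case(family)),
        // No wildcard: require an exact (case-insensitive) match.
        None => pattern.eq_ignore_ascii_case(mime),
    }
}
```

Splitting on `/` rather than using a prefix check keeps `image/*` from accidentally matching a type such as `imagex/png`.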
## Critical Rules
1. Always `validate_mime_type()` before extraction
2. Extension mapping is case-insensitive
3. Content inspection (infer crate) is the fallback for extension-less files
4. Registry validation is final authority on supported types
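The extension-first, content-inspection-fallback order from rule 3 can be sketched as follows. The document states that the real code uses the infer crate; this sketch swaps in two hand-written magic-number checks (the PDF and PNG signatures) so it stays dependency-free, and all function names are hypothetical.

```rust
use std::path::Path;

/// Hand-rolled stand-in for magic-number detection (the real code uses `infer`).
fn sniff(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else if bytes.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some("image/png")
    } else {
        None
    }
}

/// Case-insensitive extension lookup over a tiny illustrative table.
fn from_extension(path: &Path) -> Option<&'static str> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some("application/pdf"),
        "png" => Some("image/png"),
        _ => None,
    }
}

/// Try the extension first; fall back to inspecting the leading bytes.
pub fn detect(path: &Path, head: &[u8]) -> Option<&'static str> {
    from_extension(path).or_else(|| sniff(head))
}
```

A result from `detect` would still need to pass registry validation before extraction, per rules 1 and 4.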


@@ -0,0 +1,78 @@
---
summary: WASM build constraints and patterns for kreuzberg-wasm crate
---
# WASM Build Constraints
## Overview
WASM target in `crates/kreuzberg-wasm/`. Uses wasm-bindgen with sync-only internal APIs.
## Feature Flags
```toml
[features]
wasm-target = ["pdf", "html", "xml", "email", "language-detection", "chunking", "quality", "office"]
wasm-threads = ["dep:wasm-bindgen-rayon"] # Optional
```
## Critical Constraints
### 1. No Tokio Runtime
All operations must be synchronous internally. Use `#[cfg(not(feature = "tokio-runtime"))]` paths.
### 2. SyncExtractor Required
Every WASM-compatible extractor MUST implement `SyncExtractor`:
```rust
impl SyncExtractor for MyExtractor {
    fn extract_sync(&self, content: &[u8], mime_type: &str, config: &ExtractionConfig)
        -> Result<ExtractionResult> { /* sync implementation */ }
}
impl DocumentExtractor for MyExtractor {
    fn as_sync_extractor(&self) -> Option<&dyn SyncExtractor> {
        Some(self) // MUST return Some for WASM
    }
}
```
### 3. HTML Size Limit
```rust
const MAX_HTML_SIZE: usize = 2 * 1024 * 1024; // 2MB - stack constraint
```
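A guard built on this constant might look like the sketch below. `check_html_size` is a hypothetical helper, the `String` error type is a placeholder for the crate's own error type, and treating an input of exactly `MAX_HTML_SIZE` bytes as allowed is an assumption.

```rust
const MAX_HTML_SIZE: usize = 2 * 1024 * 1024; // 2 MiB, as in the constant above

/// Reject HTML input that exceeds the WASM size limit before parsing it.
pub fn check_html_size(content: &[u8]) -> Result<(), String> {
    if content.len() > MAX_HTML_SIZE {
        return Err(format!(
            "HTML input is {} bytes; the WASM limit is {} bytes",
            content.len(),
            MAX_HTML_SIZE
        ));
    }
    Ok(())
}
```

Checking the length up front fails fast with a clear message instead of risking a stack overflow inside the parser.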
## Build Config
```toml
[lib]
crate-type = ["cdylib", "rlib"]
[profile.release.package.kreuzberg-wasm]
opt-level = "z" # Size optimization
codegen-units = 1
```
## API Pattern
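```rust
#[wasm_bindgen]
pub async fn extract_from_bytes(
    content: Vec<u8>,
    mime_type: String, // assumed parameter: the sync extractor needs a MIME type to route on
    config: JsValue,
) -> Result<JsValue, JsValue> {
    let config: ExtractionConfig = serde_wasm_bindgen::from_value(config)?;
    // Extraction itself is synchronous; errors are converted to JsValue for the JS caller.
    let result = extract_bytes_sync(&content, &mime_type, &config)
        .map_err(|e| JsValue::from_str(&e.to_string()))?;
    Ok(serde_wasm_bindgen::to_value(&result)?)
}
```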
Functions can be `async` for JS compatibility, but internal extraction is sync.
## Critical Rules
1. **No tokio** -- all operations synchronous
2. **Implement SyncExtractor** for all WASM-compatible extractors
3. **HTML limited to 2MB** due to stack constraints
4. **Size optimization** via `opt-level = "z"`
5. **Gate WASM-only code** with `#[cfg(target_arch = "wasm32")]`