65
.ai-rulez/context/config-loading-precedence.md
Normal file
@@ -0,0 +1,65 @@
---
summary: Configuration loading precedence for CLI and server modes
---

# Configuration Loading & Precedence

## CLI Mode Precedence (highest to lowest)

1. Individual CLI flags (`--ocr`, `--output-format`, `--chunk`)
2. Inline JSON config (`--config-json` or `--config-json-base64`)
3. Config file (`--config path.toml`)
4. Auto-discovered config (`kreuzberg.{toml,yaml,json}` in cwd/parents)
5. Default values
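
The ordering above can be sketched as a chain of optional layers in which the first layer that supplies a value wins. This is a minimal illustration under stated assumptions, not the project's implementation: `OutputFormatLayers` and `resolve_output_format` are hypothetical names, and a single string setting stands in for the full config.

```rust
/// Hypothetical container: one optional value per configuration layer.
#[derive(Default)]
struct OutputFormatLayers<'a> {
    cli_flag: Option<&'a str>,        // 1. individual CLI flag
    inline_json: Option<&'a str>,     // 2. --config-json / --config-json-base64
    config_file: Option<&'a str>,     // 3. --config path.toml
    auto_discovered: Option<&'a str>, // 4. kreuzberg.{toml,yaml,json}
}

/// Returns the value from the highest-precedence layer that set one,
/// falling back to `default` when no layer did.
fn resolve_output_format<'a>(layers: &OutputFormatLayers<'a>, default: &'a str) -> &'a str {
    layers
        .cli_flag
        .or(layers.inline_json)
        .or(layers.config_file)
        .or(layers.auto_discovered)
        .unwrap_or(default)
}
```

With only the inline JSON and config file layers set, the inline JSON value is returned; setting the CLI flag as well overrides both.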

## Server/MCP Mode Precedence (highest to lowest)

1. CLI arguments (`--host`, `--port`)
2. Environment variables (`KREUZBERG_HOST`, `KREUZBERG_PORT`)
3. Config file `[server]` section
4. Defaults (`127.0.0.1:8000`)
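
A rough sketch of that resolution order for the bind address follows. The function names are hypothetical, and the environment lookup is passed in as a closure (in real use something like `|key| std::env::var(key).ok()`) so the example stays testable; the actual server code may be structured differently.

```rust
/// Resolves one server setting: CLI argument, then environment variable,
/// then the config file's `[server]` section, then the built-in default.
fn resolve_server_setting(
    cli_arg: Option<String>,
    env_value: Option<String>,
    file_value: Option<String>,
    default: &str,
) -> String {
    cli_arg
        .or(env_value)
        .or(file_value)
        .unwrap_or_else(|| default.to_string())
}

/// Builds the `host:port` bind address from all four layers.
/// `env` is a hypothetical lookup hook for environment variables.
fn resolve_bind_address(
    cli_host: Option<String>,
    cli_port: Option<String>,
    file_host: Option<String>,
    file_port: Option<String>,
    env: impl Fn(&str) -> Option<String>,
) -> String {
    let host = resolve_server_setting(cli_host, env("KREUZBERG_HOST"), file_host, "127.0.0.1");
    let port = resolve_server_setting(cli_port, env("KREUZBERG_PORT"), file_port, "8000");
    format!("{}:{}", host, port)
}
```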

## Config File Discovery

Searches the current directory, then each parent directory in turn, for `kreuzberg.toml`, `kreuzberg.yaml`, or `kreuzberg.json`, and stops at the first match.
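
The upward search can be sketched with `Path::ancestors()` from the standard library. `discover_config` is a hypothetical name, and the order in which the three file names are tried within one directory is an assumption of this sketch rather than something the text above specifies.

```rust
use std::path::{Path, PathBuf};

/// File names checked in each directory. The order within a single
/// directory is an assumption of this sketch.
const CANDIDATES: [&str; 3] = ["kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json"];

/// Walks from `start` up through its ancestors and returns the first
/// config file found, or `None` if no directory contains one.
fn discover_config(start: &Path) -> Option<PathBuf> {
    for dir in start.ancestors() {
        for name in CANDIDATES.iter() {
            let candidate = dir.join(name);
            if candidate.is_file() {
                return Some(candidate); // stop at the first match
            }
        }
    }
    None
}
```

Because the walk starts at the given directory, a config file closer to the working directory shadows one further up the tree.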

## Inline JSON Config

Inline JSON is merged into the existing config field by field (not whole-object replacement):

```rust
fn merge_json_into_config(base: &ExtractionConfig, json: Value) -> Result<ExtractionConfig> {
    // Serialize the base config so it can be merged as plain JSON
    let mut config_json = serde_json::to_value(base)?;
    // Merge fields from `json` into `config_json` (merge step elided in this outline)
    // Deserialize the merged value back into a typed config
    Ok(serde_json::from_value(config_json)?)
}
```

Use `--config-json-base64` to avoid shell-escaping problems when passing JSON on the command line.
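
To make "field-level merge" concrete, here is a dependency-free sketch of the merge step. It uses a tiny stand-in `Json` enum instead of `serde_json::Value`, and both `Json` and `merge_fields` are hypothetical names; the real merge may treat arrays or nulls differently.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value, used only to keep this sketch
/// free of external crates.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Bool(bool),
    Text(String),
    Object(BTreeMap<String, Json>),
}

/// Merges `overlay` into `base` field by field: nested objects are merged
/// recursively, any other value replaces the one in `base`.
fn merge_fields(base: &mut Json, overlay: Json) {
    match (base, overlay) {
        (Json::Object(base_map), Json::Object(overlay_map)) => {
            for (key, value) in overlay_map {
                match base_map.get_mut(&key) {
                    Some(existing) => merge_fields(existing, value),
                    None => {
                        base_map.insert(key, value);
                    }
                }
            }
        }
        // Scalars (or mismatched kinds) are overwritten by the overlay
        (slot, value) => *slot = value,
    }
}
```

Merging an overlay that sets only one key inside `ocr` leaves the other `ocr` keys and all top-level fields untouched, which is the difference from replacing the whole object.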

## Config File Formats

**TOML** (`kreuzberg.toml`):

```toml
use_cache = true
[ocr]
backend = "tesseract"
languages = ["eng", "deu"]
[security_limits]
max_archive_size = 524288000
```

**YAML** and **JSON** follow an equivalent structure.

## CLI Flag Overrides

In `commands.rs`, `apply_extraction_overrides()` applies individual flags on top of the merged config.

## Critical Rules

1. CLI flags always win over the config file
2. JSON merge is field-level, not whole-object
3. Auto-discovery stops at the first config file found
4. Use `--config-json-base64` for shell-safe JSON passing
5. Server config uses the `[server]` section plus the extraction config
36
.ai-rulez/context/crate-structure.md
Normal file
@@ -0,0 +1,36 @@
---
priority: high
---

# Crate Structure

Version source of truth: root `Cargo.toml` `[workspace.package] version`.

## Workspace crates (`crates/`)

- `kreuzberg` -- core library: extraction engine, MIME detection, plugin system, OCR, chunking, embeddings, API/MCP server
- `kreuzberg-cli` -- CLI binary; thin wrapper over core with `cli` feature set
- `kreuzberg-ffi` -- C FFI layer (`#[no_mangle] extern "C"`); opaque handles, cbindgen headers; used by Go, Java, C# bindings
- `kreuzberg-node` -- NAPI-RS Node.js/TypeScript bindings
- `kreuzberg-py` -- PyO3 Python bindings
- `kreuzberg-php` -- ext-php-rs PHP bindings
- `kreuzberg-wasm` -- wasm-bindgen WASM bindings; uses `wasm-target` feature set
- `kreuzberg-paddle-ocr` -- PaddleOCR via ONNX Runtime; not available on WASM or Windows
- `kreuzberg-tesseract` -- Rust bindings for Tesseract OCR

## Out-of-workspace bindings (`packages/`)

- `packages/python/` -- PyPI (maturin + PyO3)
- `packages/typescript/` -- npm type declarations
- `packages/ruby/` -- RubyGems (Magnus); native ext compiled by `rake`
- `packages/php/` -- Composer (ext-php-rs)
- `packages/go/v5/` -- Go module; cgo over kreuzberg-ffi
- `packages/java/` -- Maven; Foreign Function & Memory API over kreuzberg-ffi
- `packages/csharp/` -- NuGet; P/Invoke over kreuzberg-ffi
- `packages/elixir/` -- Hex; Rustler NIF (workspace member at `packages/elixir/native/kreuzberg_rustler`)
- `packages/r/` -- CRAN; extendr (excluded from workspace)

## Tools (`tools/`)

- `tools/e2e-generator` -- reads JSON fixtures, generates runnable test suites per language into `e2e/`
- `tools/benchmark-harness` -- criterion-based benchmark runner
56
.ai-rulez/context/mime-detection-routing.md
Normal file
@@ -0,0 +1,56 @@
---
summary: MIME type detection and extractor routing logic
---

# MIME Detection & Routing

## Detection Flow

```text
Extension -> EXT_TO_MIME map -> validate -> Registry lookup -> Extractor
```

## Key Functions

| Function                             | Location       | Purpose                                 |
| ------------------------------------ | -------------- | --------------------------------------- |
| `detect_mime_type(path, inspect)`    | `core/mime.rs` | Extension + optional content inspection |
| `detect_mime_type_from_bytes(bytes)` | `core/mime.rs` | Magic number detection (infer crate)    |
| `validate_mime_type(mime)`           | `core/mime.rs` | Check if any extractor supports it      |

## Extension Mapping

118+ extensions are mapped in `EXT_TO_MIME` (`core/mime.rs`). Lookup is case-insensitive.

Key mappings: `.pdf` -> `application/pdf`, `.docx` -> `application/vnd.openxmlformats-officedocument.wordprocessingml.document`, `.xlsx` -> `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`, `.png` -> `image/png`, `.jpg` -> `image/jpeg`
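
A case-insensitive extension lookup could look roughly like the following. The table here is a three-entry illustration, and `ext_to_mime` and `mime_from_extension` are hypothetical names; the real `EXT_TO_MIME` map and `detect_mime_type` are larger and may normalize input differently.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Builds a small illustrative table keyed by lowercase extension.
fn ext_to_mime() -> HashMap<&'static str, &'static str> {
    let mut m = HashMap::new();
    m.insert("pdf", "application/pdf");
    m.insert("png", "image/png");
    m.insert("jpg", "image/jpeg");
    m
}

/// Looks up a MIME type by file extension, ignoring ASCII case.
/// Returns `None` for unknown or missing extensions.
fn mime_from_extension(path: &Path) -> Option<&'static str> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    ext_to_mime().get(ext.as_str()).copied()
}
```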

## Registry Selection

```rust
// In core/extractor/bytes.rs
fn select_extractor_for_mime(mime_type: &str) -> Result<Arc<dyn DocumentExtractor>> {
    let registry = get_document_extractor_registry();
    let registry_guard = registry.read()?;
    registry_guard.get_for_mime_type(mime_type)
        .ok_or_else(|| KreuzbergError::UnsupportedFormat(mime_type.into()))
}
```

Selects the highest-priority extractor registered for that MIME type.
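
As an illustration of priority-based selection, the sketch below picks the entry with the largest priority among those registered for a MIME type. `Registration` and `select_highest_priority` are hypothetical, and "larger number wins" is an assumption; the real registry's data structure and priority scale are not shown in this document.

```rust
/// Hypothetical registry entry: an extractor name, the MIME type it
/// handles, and a priority where larger numbers win (assumed).
struct Registration {
    name: &'static str,
    mime_type: &'static str,
    priority: i32,
}

/// Returns the highest-priority registration for `mime_type`, if any.
fn select_highest_priority<'a>(
    entries: &'a [Registration],
    mime_type: &str,
) -> Option<&'a Registration> {
    entries
        .iter()
        .filter(|entry| entry.mime_type == mime_type)
        .max_by_key(|entry| entry.priority)
}
```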

## Adding New MIME Types

1. Add the extension mapping: `m.insert("ext", "application/x-new");` in `core/mime.rs`
2. Implement `DocumentExtractor` with `supported_mime_types()` returning the new MIME type
3. Register it in `register_default_extractors()`

## Wildcard Support

Extractors can register for MIME type families: `"image/*"` matches `image/png`, `image/jpeg`, etc.
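
One way to express that family match is shown below. `mime_matches` is a hypothetical helper written for illustration; the registry's actual matching logic may handle parameters or other edge cases differently.

```rust
/// Returns true when `pattern` matches `mime_type`. A pattern ending in
/// `/*` matches every subtype of that family; anything else must match
/// exactly.
fn mime_matches(pattern: &str, mime_type: &str) -> bool {
    match pattern.strip_suffix("/*") {
        Some(family) => mime_type
            .split_once('/')
            .map_or(false, |(kind, _subtype)| kind == family),
        None => pattern == mime_type,
    }
}
```

Comparing the part before the slash, rather than using a plain prefix check, keeps `image/*` from matching an unrelated type such as `imagery/png`.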

## Critical Rules

1. Always call `validate_mime_type()` before extraction
2. Extension mapping is case-insensitive
3. Content inspection (infer crate) is the fallback for extension-less files
4. Registry validation is the final authority on supported types
78
.ai-rulez/context/wasm-constraints.md
Normal file
@@ -0,0 +1,78 @@
---
summary: WASM build constraints and patterns for kreuzberg-wasm crate
---

# WASM Build Constraints

## Overview

The WASM target lives in `crates/kreuzberg-wasm/`. It uses wasm-bindgen with sync-only internal APIs.

## Feature Flags

```toml
[features]
wasm-target = ["pdf", "html", "xml", "email", "language-detection", "chunking", "quality", "office"]
wasm-threads = ["dep:wasm-bindgen-rayon"] # Optional
```

## Critical Constraints

### 1. No Tokio Runtime

All operations must be synchronous internally. Use the code paths gated by `#[cfg(not(feature = "tokio-runtime"))]`.

### 2. SyncExtractor Required

Every WASM-compatible extractor MUST implement `SyncExtractor`:

```rust
impl SyncExtractor for MyExtractor {
    fn extract_sync(&self, content: &[u8], mime_type: &str, config: &ExtractionConfig)
        -> Result<ExtractionResult> { /* sync implementation */ }
}

impl DocumentExtractor for MyExtractor {
    fn as_sync_extractor(&self) -> Option<&dyn SyncExtractor> {
        Some(self) // MUST return Some for WASM
    }
}
```

### 3. HTML Size Limit

```rust
const MAX_HTML_SIZE: usize = 2 * 1024 * 1024; // 2MB - stack constraint
```
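
A guard built on that constant might look like the sketch below. `check_html_size` is a hypothetical helper with a plain `String` error, and treating input of exactly 2 MB as allowed is an assumption; the crate's real check and error type are not shown here.

```rust
const MAX_HTML_SIZE: usize = 2 * 1024 * 1024; // 2 MB limit from the constraint above

/// Rejects HTML input larger than `MAX_HTML_SIZE` before any parsing starts.
fn check_html_size(content: &[u8]) -> Result<(), String> {
    if content.len() > MAX_HTML_SIZE {
        return Err(format!(
            "HTML input is {} bytes, above the {} byte limit",
            content.len(),
            MAX_HTML_SIZE
        ));
    }
    Ok(())
}
```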

## Build Config

```toml
[lib]
crate-type = ["cdylib", "rlib"]

[profile.release.package.kreuzberg-wasm]
opt-level = "z" # Size optimization
codegen-units = 1
```

## API Pattern

```rust
#[wasm_bindgen]
// `mime_type` is supplied by the caller (assumed parameter; the exact signature may differ)
pub async fn extract_from_bytes(content: Vec<u8>, mime_type: String, config: JsValue) -> Result<JsValue, JsValue> {
    // Deserialize the JS config object into the typed Rust config
    let config: ExtractionConfig = serde_wasm_bindgen::from_value(config)?;
    // Internal extraction is synchronous even though the export is async
    let result = extract_bytes_sync(&content, &mime_type, &config)?;
    Ok(serde_wasm_bindgen::to_value(&result)?)
}
```

Exported functions can be `async` for JS compatibility, but internal extraction is sync.

## Critical Rules

1. **No tokio** -- all operations synchronous
2. **Implement SyncExtractor** for all WASM-compatible extractors
3. **HTML limited to 2MB** due to stack constraints
4. **Size optimization** via `opt-level = "z"`
5. **Feature gate** with `#[cfg(target_arch = "wasm32")]`