fil/docs/guides/advanced.md
Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00

# Advanced Features
## Text Chunking
Split extracted text into chunks for RAG, vector databases, or LLM context windows. Four strategies:
- **Text** — splits on whitespace/punctuation boundaries
- **Markdown** — structure-aware; preserves headings, lists, and code blocks
- **YAML** — section-aware; preserves YAML document structure
- **Semantic** — topic-aware; splits at natural document boundaries
### Semantic
Set `chunker_type` to `"semantic"`. Uses an embedding model for topic detection when one is configured; otherwise falls back to structural heuristics.
```python
config = ExtractionConfig(
chunking=ChunkingConfig(chunker_type="semantic")
)
```
**Behavior:**
- **Without embeddings** — Uses structural heuristics: detects headers (ALL CAPS, numbered sections) and paragraph boundaries
- **With embeddings** — Compares consecutive paragraphs via embeddings to detect topic shifts, merging paragraphs below the `topic_threshold` (default: 0.5)
Use `topic_threshold` to control sensitivity: higher values (0.7–0.9) preserve more fine-grained topics, while lower values (0.1–0.3) merge more aggressively. The setting only applies when an embedding model is configured.
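To make the effect of `topic_threshold` concrete, here is a minimal sketch of threshold-based topic grouping. It is not Kreuzberg's implementation: the helper names `cosine` and `group_by_topic` are hypothetical, the vectors are hand-written stand-ins for real embeddings, and the rule "start a new chunk when similarity drops below the threshold" is an assumption chosen to match the sensitivity described above.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def group_by_topic(paragraphs, vectors, topic_threshold=0.5):
    """Hypothetical helper: merge consecutive paragraphs into chunks.

    Assumption: a new chunk starts whenever the similarity between a
    paragraph and its predecessor falls below topic_threshold.
    """
    if not paragraphs:
        return []
    groups = [[paragraphs[0]]]
    for i in range(1, len(paragraphs)):
        if cosine(vectors[i - 1], vectors[i]) < topic_threshold:
            groups.append([paragraphs[i]])    # topic shift: open a new chunk
        else:
            groups[-1].append(paragraphs[i])  # same topic: merge
    return ["\n\n".join(group) for group in groups]

paragraphs = ["Cats purr.", "Cats nap a lot.", "Rust has a borrow checker."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy 2-D "embeddings"
chunks = group_by_topic(paragraphs, vectors, topic_threshold=0.5)
```

With these toy vectors the two cat paragraphs are merged and the Rust paragraph becomes its own chunk. Raising the threshold close to 1.0 splits all three apart, and lowering it close to 0.0 merges everything.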
### Configuration
=== "Python"
--8<-- "snippets/python/config/chunking_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/chunking_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/chunking_config.md"
=== "Go"
--8<-- "snippets/go/config/chunking_config.md"
=== "Java"
--8<-- "snippets/java/config/chunking_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/chunking_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/chunking_config.md"
=== "R"
--8<-- "snippets/r/config/chunking_config.md"
=== "Wasm"
--8<-- "snippets/wasm/config/chunking_config.md"
### Chunk Output
Each chunk in `result.chunks` contains:
| Field | Description |
| --------------------------------------- | ------------------------------------------------ |
| `content` | Chunk text |
| `metadata.byte_start` / `byte_end` | Byte offsets in the original text |
| `metadata.chunk_index` / `total_chunks` | Position in sequence |
| `metadata.token_count` | Token count (when embeddings enabled) |
| `metadata.heading_context` | Active heading hierarchy (Markdown chunker only) |
| `embedding` | Embedding vector (when configured) |
Chunks can be sized by token count instead of characters — enable the `chunking-tokenizers` feature and set `sizing` to `token`.
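Because `byte_start` and `byte_end` are byte offsets rather than character indices, they cannot be used directly to slice a Python `str` that contains non-ASCII text. The sketch below illustrates the difference. The `Chunk` and `ChunkMetadata` dataclasses are hypothetical stand-ins that mirror the fields in the table, not the library's own types, and the sketch assumes the offsets index the UTF-8 encoding with an exclusive end.

```python
from dataclasses import dataclass

@dataclass
class ChunkMetadata:  # hypothetical mirror of the metadata fields above
    byte_start: int
    byte_end: int
    chunk_index: int
    total_chunks: int

@dataclass
class Chunk:  # hypothetical mirror of a chunk entry
    content: str
    metadata: ChunkMetadata

def slice_by_bytes(text: str, start: int, end: int) -> str:
    """Recover a chunk from the original text using UTF-8 byte offsets (assumed)."""
    return text.encode("utf-8")[start:end].decode("utf-8")

text = "Café menu. Prix fixe."
chunks = [
    Chunk("Café menu.", ChunkMetadata(0, 11, 0, 2)),   # "é" occupies two bytes
    Chunk("Prix fixe.", ChunkMetadata(12, 22, 1, 2)),
]

for chunk in chunks:
    meta = chunk.metadata
    recovered = slice_by_bytes(text, meta.byte_start, meta.byte_end)
    print(meta.chunk_index + 1, "of", meta.total_chunks, repr(recovered))
```

Slicing the string itself with the same numbers (`text[12:22]`) would be off by one character here, because the accented letter shifts every later byte offset.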
### RAG Pipeline Example
=== "Python"
--8<-- "snippets/python/utils/chunking_rag.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/chunking_rag.md"
=== "Rust"
--8<-- "snippets/rust/advanced/chunking_rag.md"
=== "Go"
--8<-- "snippets/go/advanced/chunking_rag.md"
=== "Java"
--8<-- "snippets/java/advanced/chunking_rag.md"
=== "C#"
--8<-- "snippets/csharp/advanced/chunking_rag.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/chunking_rag.md"
=== "R"
--8<-- "snippets/r/advanced/chunking_rag.md"
## Language Detection
Detect languages in extracted text using [`whatlang`](https://crates.io/crates/whatlang) — 60+ languages with ISO 639-3 codes. Set `detect_multiple: true` to chunk the text into 200-char segments and return all detected languages sorted by prevalence.
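The multi-language mode described above can be pictured as "detect per segment, then rank by how often each language wins". The following is a rough sketch of that idea only. `toy_detect` is a hypothetical stopword-counting stand-in for a real detector such as `whatlang`, and `detect_languages` is an illustrative helper, not a Kreuzberg function.

```python
from collections import Counter

# Tiny stopword lists keyed by ISO 639-3 code; purely illustrative.
STOPWORDS = {
    "eng": {"the", "and", "is"},
    "deu": {"der", "und", "ist"},
}

def toy_detect(segment):
    """Hypothetical detector: return the language with the most stopword hits, or None."""
    words = segment.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def detect_languages(text, detect, segment_chars=200):
    """Split text into fixed-size segments and rank languages by segment count."""
    counts = Counter()
    for start in range(0, len(text), segment_chars):
        lang = detect(text[start:start + segment_chars])
        if lang is not None:
            counts[lang] += 1
    return [lang for lang, _ in counts.most_common()]

english = "the cat is here and the dog is there " * 20
german = "der hund ist hier und der ball ist rot " * 5
languages = detect_languages(english + german, toy_detect)
```

Here the mostly English input yields `eng` first and `deu` second, since more segments are classified as English.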
### Configuration
=== "Python"
--8<-- "snippets/python/config/language_detection_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/language_detection_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/language_detection_config.md"
=== "Go"
--8<-- "snippets/go/config/language_detection_config.md"
=== "Java"
--8<-- "snippets/java/config/language_detection_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/language_detection_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/language_detection_config.md"
=== "R"
--8<-- "snippets/r/config/language_detection_config.md"
### Multilingual Example
=== "Python"
--8<-- "snippets/python/utils/language_detection_multilingual.md"
=== "TypeScript"
--8<-- "snippets/typescript/metadata/language_detection_multilingual.md"
=== "Rust"
--8<-- "snippets/rust/advanced/language_detection_multilingual.md"
=== "Go"
--8<-- "snippets/go/advanced/language_detection_multilingual.md"
=== "Java"
--8<-- "snippets/java/advanced/language_detection_multilingual.md"
=== "C#"
--8<-- "snippets/csharp/advanced/language_detection_multilingual.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/language_detection_multilingual.md"
=== "R"
--8<-- "snippets/r/advanced/language_detection_multilingual.md"
## Embedding Generation
Local in-process embeddings via ONNX for semantic search and RAG — no external API calls. Requires the `embeddings` feature.
| Preset | Model | Dimensions | Max Tokens | Use Case |
| -------------- | ---------------------------- | ---------- | ---------- | ------------------------------------------------------- |
| `fast` | all-MiniLM-L6-v2 (quantized) | 384 | 512 | Quick prototyping, development, resource-constrained |
| `balanced` | BGE-base-en-v1.5 | 768 | 1024 | General-purpose RAG, production deployments, English |
| `quality` | BGE-large-en-v1.5 | 1024 | 2000 | Complex documents, maximum accuracy, sufficient compute |
| `multilingual` | multilingual-e5-base | 768 | 1024 | International documents, mixed-language content |
### In-Process Embedding Backends (Plugin Variant)
Plug a caller-managed embedder (e.g. `llama-cpp-python`, `sentence-transformers`) into Kreuzberg via the `Plugin` variant of `EmbeddingModelType` — Kreuzberg calls back into the registered backend instead of running its own ONNX model.
1. Register the backend once at startup via `kreuzberg::plugins::register_embedding_backend(Arc::new(MyEmbedder))`. The backend implements `EmbeddingBackend` (a `Plugin`-inheriting async trait with `dimensions()` and `embed(texts) -> Vec<Vec<f32>>`).
2. Reference it by name in `EmbeddingConfig`: `{ "model": { "type": "plugin", "name": "my-embedder" } }`.
3. Optional: set `EmbeddingConfig.max_embed_duration_secs` (default 60) to bound the wait on a hung backend; `None` disables the timeout.
The CLI (`kreuzberg embed --provider plugin --plugin my-embedder`), MCP server (`embed_text` tool, `embedding_plugin` parameter), REST API, and env var `KREUZBERG_EMBEDDING_PLUGIN_NAME` all accept the Plugin variant once a backend is registered.
**Fork-safety**: Python callers running under `multiprocessing`, `gunicorn`'s prefork worker, or Celery prefork must re-register the backend in each child process — native-backed embedders (including `llama-cpp-python`) aren't fork-safe. Use `os.register_at_fork(after_in_child=reregister_fn)` to automate the re-registration.
### Configuration
=== "Python"
--8<-- "snippets/python/utils/embedding_with_chunking.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/embedding_with_chunking.md"
=== "Rust"
--8<-- "snippets/rust/advanced/embedding_with_chunking.md"
=== "Go"
--8<-- "snippets/go/advanced/embedding_with_chunking.md"
=== "Java"
--8<-- "snippets/java/advanced/embedding_with_chunking.md"
=== "C#"
--8<-- "snippets/csharp/advanced/embedding_with_chunking.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/embedding_with_chunking.md"
=== "R"
--8<-- "snippets/r/advanced/embedding_with_chunking.md"
### Vector Database Integration
=== "Python"
--8<-- "snippets/python/utils/vector_database_integration.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/vector_database_integration.md"
=== "Rust"
--8<-- "snippets/rust/advanced/vector_database_integration.md"
=== "Go"
--8<-- "snippets/go/advanced/vector_database_integration.md"
=== "Java"
--8<-- "snippets/java/advanced/vector_database_integration.md"
=== "C#"
--8<-- "snippets/csharp/advanced/vector_database_integration.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/vector_database_integration.md"
=== "R"
--8<-- "snippets/r/advanced/vector_database_integration.md"
## Token Reduction
Reduce token count while preserving meaning for LLM pipelines.
| Level | Reduction | Effect |
| ------------ | --------- | ---------------------------------------- |
| `off` | 0% | Pass-through |
| `moderate`   | 15–25%    | Stopwords + redundancy removal           |
| `aggressive` | 30–50%    | Semantic clustering + importance scoring |
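As a rough intuition for what a `moderate`-style pass does, the sketch below drops stopwords and immediately repeated words, then measures the reduction. It is an illustration under simplifying assumptions, not Kreuzberg's algorithm: the stopword list is a tiny hand-picked sample, `reduce_moderate` and `reduction_ratio` are hypothetical helpers, and whitespace-separated words are used as a crude proxy for tokens.

```python
STOPWORDS = {"the", "a", "an", "of", "to", "is", "and", "in", "that"}
PUNCTUATION = ".,;:!?"

def normalize(word):
    """Lowercase a word and strip surrounding punctuation for comparison."""
    return word.lower().strip(PUNCTUATION)

def reduce_moderate(text):
    """Hypothetical helper: drop stopwords and immediate word repetitions."""
    kept = []
    for word in text.split():
        token = normalize(word)
        if token in STOPWORDS:
            continue  # stopword removal
        if kept and normalize(kept[-1]) == token:
            continue  # redundancy removal (consecutive duplicates only)
        kept.append(word)
    return " ".join(kept)

def reduction_ratio(original, reduced):
    """Fraction of whitespace-separated words removed."""
    before = len(original.split())
    after = len(reduced.split())
    return 0.0 if before == 0 else 1 - after / before

original = "The report report is very very long and the summary is short"
reduced = reduce_moderate(original)
ratio = reduction_ratio(original, reduced)
```

The ratio on a contrived sentence like this one is much higher than on real documents, so the percentages in the table should be read as typical figures rather than guarantees.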
### Configuration
=== "Python"
--8<-- "snippets/python/config/token_reduction_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/token_reduction_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/token_reduction_config.md"
=== "Go"
--8<-- "snippets/go/config/token_reduction_config.md"
=== "Java"
--8<-- "snippets/java/config/token_reduction_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/token_reduction_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/token_reduction_config.md"
=== "R"
--8<-- "snippets/r/config/token_reduction_config.md"
### Example
=== "Python"
--8<-- "snippets/python/utils/token_reduction_example.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/token_reduction_example.md"
=== "Rust"
--8<-- "snippets/rust/advanced/token_reduction_example.md"
=== "Go"
--8<-- "snippets/go/advanced/token_reduction_example.md"
=== "Java"
--8<-- "snippets/java/advanced/token_reduction_example.md"
=== "C#"
--8<-- "snippets/csharp/advanced/token_reduction_example.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/token_reduction_example.md"
=== "R"
--8<-- "snippets/r/advanced/token_reduction_example.md"
## Keyword Extraction
Extract keywords using YAKE or RAKE algorithms. Requires the `keywords` feature flag. See [Keyword Extraction](keywords.md) for algorithm details and parameter reference.
### Configuration
=== "Python"
--8<-- "snippets/python/config/keyword_extraction_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/keyword_extraction_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/keyword_extraction_config.md"
=== "Go"
--8<-- "snippets/go/config/keyword_extraction_config.md"
=== "Java"
--8<-- "snippets/java/config/keyword_extraction_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/keyword_extraction_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/keyword_extraction_config.md"
=== "R"
--8<-- "snippets/r/config/keyword_extraction_config.md"
### Example
=== "Python"
--8<-- "snippets/python/utils/keyword_extraction_example.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/keyword_extraction_example.md"
=== "Rust"
--8<-- "snippets/rust/advanced/keyword_extraction_example.md"
=== "Go"
--8<-- "snippets/go/advanced/keyword_extraction_example.md"
=== "Java"
--8<-- "snippets/java/advanced/keyword_extraction_example.md"
=== "C#"
--8<-- "snippets/csharp/advanced/keyword_extraction_example.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/keyword_extraction_example.md"
=== "R"
--8<-- "snippets/r/advanced/keyword_extraction_example.md"
## Quality Processing
Score extracted text for quality issues (0.0–1.0, where 1.0 is highest quality). Detects OCR artifacts, script content, navigation elements, and structural issues.
| Factor | Weight | Detects |
| ------------------- | ------ | ------------------------------------------------------ |
| OCR Artifacts | 30% | Scattered chars, repeated punctuation, malformed words |
| Script Content | 20% | JavaScript, CSS, HTML tags |
| Navigation Elements | 10% | Breadcrumbs, pagination, skip links |
| Document Structure | 20% | Sentence/paragraph length, punctuation distribution |
| Metadata Quality | 10% | Presence of title, author, subject |
Score ranges: `0.0–0.3` very low, `0.3–0.6` low, `0.6–0.8` moderate, `0.8–1.0` high.
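The weights and score ranges above can be combined into a small worked example. This is a sketch under stated assumptions rather than the library's scoring code: each factor is assumed to produce a sub-score in 0.0–1.0 (1.0 meaning no issues found), the weighted sum is divided by the total of the listed weights so the result stays in that range, and a score that sits exactly on a boundary is assigned to the higher band. `combine` and `label` are hypothetical names.

```python
# Weights as listed in the table above.
WEIGHTS = {
    "ocr_artifacts": 0.30,
    "script_content": 0.20,
    "navigation_elements": 0.10,
    "document_structure": 0.20,
    "metadata_quality": 0.10,
}

def combine(factor_scores):
    """Weighted average of per-factor scores, normalized by the total weight (assumption)."""
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * factor_scores[name] for name in WEIGHTS) / total

def label(score):
    """Map a 0.0-1.0 score to the ranges above (boundaries go to the higher band)."""
    if score < 0.3:
        return "very low"
    if score < 0.6:
        return "low"
    if score < 0.8:
        return "moderate"
    return "high"

factor_scores = {
    "ocr_artifacts": 0.5,        # noticeable OCR noise
    "script_content": 1.0,
    "navigation_elements": 1.0,
    "document_structure": 1.0,
    "metadata_quality": 1.0,
}
score = combine(factor_scores)
print(round(score, 3), label(score))
```

In this example a document that is clean apart from moderate OCR noise still lands in the high band, which shows how a single factor is damped by the weighting.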
### Configuration
=== "Python"
--8<-- "snippets/python/config/quality_processing_config.md"
=== "TypeScript"
--8<-- "snippets/typescript/config/quality_processing_config.md"
=== "Rust"
--8<-- "snippets/rust/advanced/quality_processing_config.md"
=== "Go"
--8<-- "snippets/go/config/quality_processing_config.md"
=== "Java"
--8<-- "snippets/java/config/quality_processing_config.md"
=== "C#"
--8<-- "snippets/csharp/advanced/quality_processing_config.md"
=== "Ruby"
--8<-- "snippets/ruby/config/quality_processing_config.md"
=== "R"
--8<-- "snippets/r/config/quality_processing_config.md"
### Example
=== "Python"
--8<-- "snippets/python/utils/quality_processing_example.md"
=== "TypeScript"
--8<-- "snippets/typescript/utils/quality_processing_example.md"
=== "Rust"
--8<-- "snippets/rust/advanced/quality_processing_example.md"
=== "Go"
--8<-- "snippets/go/advanced/quality_processing_example.md"
=== "Java"
--8<-- "snippets/java/advanced/quality_processing_example.md"
=== "C#"
--8<-- "snippets/csharp/advanced/quality_processing_example.md"
=== "Ruby"
--8<-- "snippets/ruby/advanced/quality_processing_example.md"
=== "R"
--8<-- "snippets/r/advanced/quality_processing_example.md"
## Combining Features
=== "Python"
--8<-- "snippets/python/advanced/combining_all_features.md"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/combining_all_features.md"
=== "Rust"
--8<-- "snippets/rust/api/combining_all_features.md"
=== "Go"
--8<-- "snippets/go/api/combining_all_features.md"
=== "Java"
--8<-- "snippets/java/api/combining_all_features.md"
=== "C#"
--8<-- "snippets/csharp/advanced/combining_all_features.md"
=== "Ruby"
--8<-- "snippets/ruby/api/combining_all_features.md"
=== "R"
--8<-- "snippets/r/api/combining_all_features.md"