544 lines
14 KiB
Markdown
544 lines
14 KiB
Markdown
# Advanced Features
|
||
|
||
## Text Chunking
|
||
|
||
Split extracted text into chunks for RAG, vector databases, or LLM context windows. Four strategies:
|
||
|
||
- **Text** — splits on whitespace/punctuation boundaries
|
||
- **Markdown** — structure-aware; preserves headings, lists, and code blocks
|
||
- **YAML** — section-aware; preserves YAML document structure
|
||
- **Semantic** — topic-aware; splits at natural document boundaries
|
||
|
||
### Semantic
|
||
|
||
Set `chunker_type` to `"semantic"`. Uses an embedding model for topic detection when one is configured; otherwise falls back to structural heuristics.
|
||
|
||
```python
|
||
config = ExtractionConfig(
|
||
chunking=ChunkingConfig(chunker_type="semantic")
|
||
)
|
||
```
|
||
|
||
**Behavior:**
|
||
|
||
- **Without embeddings** — Uses structural heuristics: detects headers (ALL CAPS, numbered sections) and paragraph boundaries
|
||
- **With embeddings** — Compares consecutive paragraphs via embeddings to detect topic shifts, merging paragraphs below the `topic_threshold` (default: 0.5)
|
||
|
||
Use `topic_threshold` to control sensitivity: higher values (0.7–0.9) preserve more fine-grained topics, lower values (0.1–0.3) merge aggressive. Only applies when an embedding model is configured.
|
||
|
||
### Configuration
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/config/chunking_config.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/config/chunking_config.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/chunking_config.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/config/chunking_config.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/config/chunking_config.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/chunking_config.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/config/chunking_config.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/config/chunking_config.md"
|
||
|
||
=== "Wasm"
|
||
|
||
--8<-- "snippets/wasm/config/chunking_config.md"
|
||
|
||
### Chunk Output
|
||
|
||
Each chunk in `result.chunks` contains:
|
||
|
||
| Field | Description |
|
||
| --------------------------------------- | ------------------------------------------------ |
|
||
| `content` | Chunk text |
|
||
| `metadata.byte_start` / `byte_end` | Byte offsets in the original text |
|
||
| `metadata.chunk_index` / `total_chunks` | Position in sequence |
|
||
| `metadata.token_count` | Token count (when embeddings enabled) |
|
||
| `metadata.heading_context` | Active heading hierarchy (Markdown chunker only) |
|
||
| `embedding` | Embedding vector (when configured) |
|
||
|
||
Chunks can be sized by token count instead of characters — enable the `chunking-tokenizers` feature and set `sizing` to `token`.
|
||
|
||
### RAG Pipeline Example
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/utils/chunking_rag.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/utils/chunking_rag.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/chunking_rag.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/advanced/chunking_rag.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/advanced/chunking_rag.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/chunking_rag.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/advanced/chunking_rag.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/advanced/chunking_rag.md"
|
||
|
||
## Language Detection
|
||
|
||
Detect languages in extracted text using [`whatlang`](https://crates.io/crates/whatlang) — 60+ languages with ISO 639-3 codes. Set `detect_multiple: true` to chunk the text into 200-char segments and return all detected languages sorted by prevalence.
|
||
|
||
### Configuration
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/config/language_detection_config.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/config/language_detection_config.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/language_detection_config.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/config/language_detection_config.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/config/language_detection_config.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/language_detection_config.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/config/language_detection_config.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/config/language_detection_config.md"
|
||
|
||
### Multilingual Example
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/utils/language_detection_multilingual.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/metadata/language_detection_multilingual.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/language_detection_multilingual.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/advanced/language_detection_multilingual.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/advanced/language_detection_multilingual.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/language_detection_multilingual.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/advanced/language_detection_multilingual.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/advanced/language_detection_multilingual.md"
|
||
|
||
## Embedding Generation
|
||
|
||
Local in-process embeddings via ONNX for semantic search and RAG — no external API calls. Requires the `embeddings` feature.
|
||
|
||
| Preset | Model | Dimensions | Max Tokens | Use Case |
|
||
| -------------- | ---------------------------- | ---------- | ---------- | ------------------------------------------------------- |
|
||
| `fast` | all-MiniLM-L6-v2 (quantized) | 384 | 512 | Quick prototyping, development, resource-constrained |
|
||
| `balanced` | BGE-base-en-v1.5 | 768 | 1024 | General-purpose RAG, production deployments, English |
|
||
| `quality` | BGE-large-en-v1.5 | 1024 | 2000 | Complex documents, maximum accuracy, sufficient compute |
|
||
| `multilingual` | multilingual-e5-base | 768 | 1024 | International documents, mixed-language content |
|
||
|
||
### In-Process Embedding Backends (Plugin Variant)
|
||
|
||
Plug a caller-managed embedder (e.g. `llama-cpp-python`, `sentence-transformers`) into Kreuzberg via the `Plugin` variant of `EmbeddingModelType` — Kreuzberg calls back into the registered backend instead of running its own ONNX model.
|
||
|
||
1. Register the backend once at startup via `kreuzberg::plugins::register_embedding_backend(Arc::new(MyEmbedder))`. The backend implements `EmbeddingBackend` (a `Plugin`-inheriting async trait with `dimensions()` and `embed(texts) -> Vec<Vec<f32>>`).
|
||
2. Reference it by name in `EmbeddingConfig`: `{ "model": { "type": "plugin", "name": "my-embedder" } }`.
|
||
3. Optional: set `EmbeddingConfig.max_embed_duration_secs` (default 60) to bound the wait on a hung backend; `None` disables the timeout.
|
||
|
||
The CLI (`kreuzberg embed --provider plugin --plugin my-embedder`), MCP server (`embed_text` tool, `embedding_plugin` parameter), REST API, and env var `KREUZBERG_EMBEDDING_PLUGIN_NAME` all accept the Plugin variant once a backend is registered.
|
||
|
||
**Fork-safety**: Python callers running under `multiprocessing`, `gunicorn`'s prefork worker, or Celery prefork must re-register the backend in each child process — native-backed embedders (including `llama-cpp-python`) aren't fork-safe. Use `os.register_at_fork(after_in_child=reregister_fn)` to automate the re-registration.
|
||
|
||
### Configuration
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/utils/embedding_with_chunking.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/utils/embedding_with_chunking.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/embedding_with_chunking.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/advanced/embedding_with_chunking.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/advanced/embedding_with_chunking.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/embedding_with_chunking.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/advanced/embedding_with_chunking.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/advanced/embedding_with_chunking.md"
|
||
|
||
### Vector Database Integration
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/utils/vector_database_integration.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/utils/vector_database_integration.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/vector_database_integration.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/advanced/vector_database_integration.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/advanced/vector_database_integration.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/vector_database_integration.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/advanced/vector_database_integration.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/advanced/vector_database_integration.md"
|
||
|
||
## Token Reduction
|
||
|
||
Reduce token count while preserving meaning for LLM pipelines.
|
||
|
||
| Level | Reduction | Effect |
|
||
| ------------ | --------- | ---------------------------------------- |
|
||
| `off` | 0% | Pass-through |
|
||
| `moderate` | 15–25% | Stopwords + redundancy removal |
|
||
| `aggressive` | 30–50% | Semantic clustering + importance scoring |
|
||
|
||
### Configuration
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/config/token_reduction_config.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/config/token_reduction_config.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/token_reduction_config.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/config/token_reduction_config.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/config/token_reduction_config.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/token_reduction_config.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/config/token_reduction_config.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/config/token_reduction_config.md"
|
||
|
||
### Example
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/utils/token_reduction_example.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/utils/token_reduction_example.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/token_reduction_example.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/advanced/token_reduction_example.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/advanced/token_reduction_example.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/token_reduction_example.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/advanced/token_reduction_example.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/advanced/token_reduction_example.md"
|
||
|
||
## Keyword Extraction
|
||
|
||
Extract keywords using YAKE or RAKE algorithms. Requires the `keywords` feature flag. See [Keyword Extraction](keywords.md) for algorithm details and parameter reference.
|
||
|
||
### Configuration
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/config/keyword_extraction_config.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/config/keyword_extraction_config.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/keyword_extraction_config.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/config/keyword_extraction_config.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/config/keyword_extraction_config.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/keyword_extraction_config.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/config/keyword_extraction_config.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/config/keyword_extraction_config.md"
|
||
|
||
### Example
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/utils/keyword_extraction_example.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/utils/keyword_extraction_example.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/keyword_extraction_example.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/advanced/keyword_extraction_example.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/advanced/keyword_extraction_example.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/keyword_extraction_example.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/advanced/keyword_extraction_example.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/advanced/keyword_extraction_example.md"
|
||
|
||
## Quality Processing
|
||
|
||
Score extracted text for quality issues (0.0–1.0, where 1.0 is highest quality). Detects OCR artifacts, script content, navigation elements, and structural issues.
|
||
|
||
| Factor | Weight | Detects |
|
||
| ------------------- | ------ | ------------------------------------------------------ |
|
||
| OCR Artifacts | 30% | Scattered chars, repeated punctuation, malformed words |
|
||
| Script Content | 20% | JavaScript, CSS, HTML tags |
|
||
| Navigation Elements | 10% | Breadcrumbs, pagination, skip links |
|
||
| Document Structure | 20% | Sentence/paragraph length, punctuation distribution |
|
||
| Metadata Quality | 10% | Presence of title, author, subject |
|
||
|
||
Score ranges: `0.0–0.3` very low, `0.3–0.6` low, `0.6–0.8` moderate, `0.8–1.0` high.
|
||
|
||
### Configuration
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/config/quality_processing_config.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/config/quality_processing_config.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/quality_processing_config.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/config/quality_processing_config.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/config/quality_processing_config.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/quality_processing_config.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/config/quality_processing_config.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/config/quality_processing_config.md"
|
||
|
||
### Example
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/utils/quality_processing_example.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/utils/quality_processing_example.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/advanced/quality_processing_example.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/advanced/quality_processing_example.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/advanced/quality_processing_example.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/quality_processing_example.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/advanced/quality_processing_example.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/advanced/quality_processing_example.md"
|
||
|
||
## Combining Features
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/advanced/combining_all_features.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/getting-started/combining_all_features.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/api/combining_all_features.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/api/combining_all_features.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/api/combining_all_features.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/advanced/combining_all_features.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/api/combining_all_features.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/api/combining_all_features.md"
|