# Advanced Features ## Text Chunking Split extracted text into chunks for RAG, vector databases, or LLM context windows. Four strategies: - **Text** — splits on whitespace/punctuation boundaries - **Markdown** — structure-aware; preserves headings, lists, and code blocks - **YAML** — section-aware; preserves YAML document structure - **Semantic** — topic-aware; splits at natural document boundaries ### Semantic Set `chunker_type` to `"semantic"`. Uses an embedding model for topic detection when one is configured; otherwise falls back to structural heuristics. ```python config = ExtractionConfig( chunking=ChunkingConfig(chunker_type="semantic") ) ``` **Behavior:** - **Without embeddings** — Uses structural heuristics: detects headers (ALL CAPS, numbered sections) and paragraph boundaries - **With embeddings** — Compares consecutive paragraphs via embeddings to detect topic shifts, merging paragraphs below the `topic_threshold` (default: 0.5) Use `topic_threshold` to control sensitivity: higher values (0.7–0.9) preserve more fine-grained topics, lower values (0.1–0.3) merge aggressive. Only applies when an embedding model is configured. ### Configuration === "Python" --8<-- "snippets/python/config/chunking_config.md" === "TypeScript" --8<-- "snippets/typescript/config/chunking_config.md" === "Rust" --8<-- "snippets/rust/advanced/chunking_config.md" === "Go" --8<-- "snippets/go/config/chunking_config.md" === "Java" --8<-- "snippets/java/config/chunking_config.md" === "C#" --8<-- "snippets/csharp/advanced/chunking_config.md" === "Ruby" --8<-- "snippets/ruby/config/chunking_config.md" === "R" --8<-- "snippets/r/config/chunking_config.md" === "Wasm" --8<-- "snippets/wasm/config/chunking_config.md" ### Chunk Output Each chunk in `result.chunks` contains: | Field | Description | | --------------------------------------- | ------------------------------------------------ | | `content` | Chunk text | | `metadata.byte_start` / `byte_end` | Byte offsets in the original text | | `metadata.chunk_index` / `total_chunks` | Position in sequence | | `metadata.token_count` | Token count (when embeddings enabled) | | `metadata.heading_context` | Active heading hierarchy (Markdown chunker only) | | `embedding` | Embedding vector (when configured) | Chunks can be sized by token count instead of characters — enable the `chunking-tokenizers` feature and set `sizing` to `token`. ### RAG Pipeline Example === "Python" --8<-- "snippets/python/utils/chunking_rag.md" === "TypeScript" --8<-- "snippets/typescript/utils/chunking_rag.md" === "Rust" --8<-- "snippets/rust/advanced/chunking_rag.md" === "Go" --8<-- "snippets/go/advanced/chunking_rag.md" === "Java" --8<-- "snippets/java/advanced/chunking_rag.md" === "C#" --8<-- "snippets/csharp/advanced/chunking_rag.md" === "Ruby" --8<-- "snippets/ruby/advanced/chunking_rag.md" === "R" --8<-- "snippets/r/advanced/chunking_rag.md" ## Language Detection Detect languages in extracted text using [`whatlang`](https://crates.io/crates/whatlang) — 60+ languages with ISO 639-3 codes. Set `detect_multiple: true` to chunk the text into 200-char segments and return all detected languages sorted by prevalence. ### Configuration === "Python" --8<-- "snippets/python/config/language_detection_config.md" === "TypeScript" --8<-- "snippets/typescript/config/language_detection_config.md" === "Rust" --8<-- "snippets/rust/advanced/language_detection_config.md" === "Go" --8<-- "snippets/go/config/language_detection_config.md" === "Java" --8<-- "snippets/java/config/language_detection_config.md" === "C#" --8<-- "snippets/csharp/advanced/language_detection_config.md" === "Ruby" --8<-- "snippets/ruby/config/language_detection_config.md" === "R" --8<-- "snippets/r/config/language_detection_config.md" ### Multilingual Example === "Python" --8<-- "snippets/python/utils/language_detection_multilingual.md" === "TypeScript" --8<-- "snippets/typescript/metadata/language_detection_multilingual.md" === "Rust" --8<-- "snippets/rust/advanced/language_detection_multilingual.md" === "Go" --8<-- "snippets/go/advanced/language_detection_multilingual.md" === "Java" --8<-- "snippets/java/advanced/language_detection_multilingual.md" === "C#" --8<-- "snippets/csharp/advanced/language_detection_multilingual.md" === "Ruby" --8<-- "snippets/ruby/advanced/language_detection_multilingual.md" === "R" --8<-- "snippets/r/advanced/language_detection_multilingual.md" ## Embedding Generation Local in-process embeddings via ONNX for semantic search and RAG — no external API calls. Requires the `embeddings` feature. | Preset | Model | Dimensions | Max Tokens | Use Case | | -------------- | ---------------------------- | ---------- | ---------- | ------------------------------------------------------- | | `fast` | all-MiniLM-L6-v2 (quantized) | 384 | 512 | Quick prototyping, development, resource-constrained | | `balanced` | BGE-base-en-v1.5 | 768 | 1024 | General-purpose RAG, production deployments, English | | `quality` | BGE-large-en-v1.5 | 1024 | 2000 | Complex documents, maximum accuracy, sufficient compute | | `multilingual` | multilingual-e5-base | 768 | 1024 | International documents, mixed-language content | ### In-Process Embedding Backends (Plugin Variant) Plug a caller-managed embedder (e.g. `llama-cpp-python`, `sentence-transformers`) into Kreuzberg via the `Plugin` variant of `EmbeddingModelType` — Kreuzberg calls back into the registered backend instead of running its own ONNX model. 1. Register the backend once at startup via `kreuzberg::plugins::register_embedding_backend(Arc::new(MyEmbedder))`. The backend implements `EmbeddingBackend` (a `Plugin`-inheriting async trait with `dimensions()` and `embed(texts) -> Vec>`). 2. Reference it by name in `EmbeddingConfig`: `{ "model": { "type": "plugin", "name": "my-embedder" } }`. 3. Optional: set `EmbeddingConfig.max_embed_duration_secs` (default 60) to bound the wait on a hung backend; `None` disables the timeout. The CLI (`kreuzberg embed --provider plugin --plugin my-embedder`), MCP server (`embed_text` tool, `embedding_plugin` parameter), REST API, and env var `KREUZBERG_EMBEDDING_PLUGIN_NAME` all accept the Plugin variant once a backend is registered. **Fork-safety**: Python callers running under `multiprocessing`, `gunicorn`'s prefork worker, or Celery prefork must re-register the backend in each child process — native-backed embedders (including `llama-cpp-python`) aren't fork-safe. Use `os.register_at_fork(after_in_child=reregister_fn)` to automate the re-registration. ### Configuration === "Python" --8<-- "snippets/python/utils/embedding_with_chunking.md" === "TypeScript" --8<-- "snippets/typescript/utils/embedding_with_chunking.md" === "Rust" --8<-- "snippets/rust/advanced/embedding_with_chunking.md" === "Go" --8<-- "snippets/go/advanced/embedding_with_chunking.md" === "Java" --8<-- "snippets/java/advanced/embedding_with_chunking.md" === "C#" --8<-- "snippets/csharp/advanced/embedding_with_chunking.md" === "Ruby" --8<-- "snippets/ruby/advanced/embedding_with_chunking.md" === "R" --8<-- "snippets/r/advanced/embedding_with_chunking.md" ### Vector Database Integration === "Python" --8<-- "snippets/python/utils/vector_database_integration.md" === "TypeScript" --8<-- "snippets/typescript/utils/vector_database_integration.md" === "Rust" --8<-- "snippets/rust/advanced/vector_database_integration.md" === "Go" --8<-- "snippets/go/advanced/vector_database_integration.md" === "Java" --8<-- "snippets/java/advanced/vector_database_integration.md" === "C#" --8<-- "snippets/csharp/advanced/vector_database_integration.md" === "Ruby" --8<-- "snippets/ruby/advanced/vector_database_integration.md" === "R" --8<-- "snippets/r/advanced/vector_database_integration.md" ## Token Reduction Reduce token count while preserving meaning for LLM pipelines. | Level | Reduction | Effect | | ------------ | --------- | ---------------------------------------- | | `off` | 0% | Pass-through | | `moderate` | 15–25% | Stopwords + redundancy removal | | `aggressive` | 30–50% | Semantic clustering + importance scoring | ### Configuration === "Python" --8<-- "snippets/python/config/token_reduction_config.md" === "TypeScript" --8<-- "snippets/typescript/config/token_reduction_config.md" === "Rust" --8<-- "snippets/rust/advanced/token_reduction_config.md" === "Go" --8<-- "snippets/go/config/token_reduction_config.md" === "Java" --8<-- "snippets/java/config/token_reduction_config.md" === "C#" --8<-- "snippets/csharp/advanced/token_reduction_config.md" === "Ruby" --8<-- "snippets/ruby/config/token_reduction_config.md" === "R" --8<-- "snippets/r/config/token_reduction_config.md" ### Example === "Python" --8<-- "snippets/python/utils/token_reduction_example.md" === "TypeScript" --8<-- "snippets/typescript/utils/token_reduction_example.md" === "Rust" --8<-- "snippets/rust/advanced/token_reduction_example.md" === "Go" --8<-- "snippets/go/advanced/token_reduction_example.md" === "Java" --8<-- "snippets/java/advanced/token_reduction_example.md" === "C#" --8<-- "snippets/csharp/advanced/token_reduction_example.md" === "Ruby" --8<-- "snippets/ruby/advanced/token_reduction_example.md" === "R" --8<-- "snippets/r/advanced/token_reduction_example.md" ## Keyword Extraction Extract keywords using YAKE or RAKE algorithms. Requires the `keywords` feature flag. See [Keyword Extraction](keywords.md) for algorithm details and parameter reference. ### Configuration === "Python" --8<-- "snippets/python/config/keyword_extraction_config.md" === "TypeScript" --8<-- "snippets/typescript/config/keyword_extraction_config.md" === "Rust" --8<-- "snippets/rust/advanced/keyword_extraction_config.md" === "Go" --8<-- "snippets/go/config/keyword_extraction_config.md" === "Java" --8<-- "snippets/java/config/keyword_extraction_config.md" === "C#" --8<-- "snippets/csharp/advanced/keyword_extraction_config.md" === "Ruby" --8<-- "snippets/ruby/config/keyword_extraction_config.md" === "R" --8<-- "snippets/r/config/keyword_extraction_config.md" ### Example === "Python" --8<-- "snippets/python/utils/keyword_extraction_example.md" === "TypeScript" --8<-- "snippets/typescript/utils/keyword_extraction_example.md" === "Rust" --8<-- "snippets/rust/advanced/keyword_extraction_example.md" === "Go" --8<-- "snippets/go/advanced/keyword_extraction_example.md" === "Java" --8<-- "snippets/java/advanced/keyword_extraction_example.md" === "C#" --8<-- "snippets/csharp/advanced/keyword_extraction_example.md" === "Ruby" --8<-- "snippets/ruby/advanced/keyword_extraction_example.md" === "R" --8<-- "snippets/r/advanced/keyword_extraction_example.md" ## Quality Processing Score extracted text for quality issues (0.0–1.0, where 1.0 is highest quality). Detects OCR artifacts, script content, navigation elements, and structural issues. | Factor | Weight | Detects | | ------------------- | ------ | ------------------------------------------------------ | | OCR Artifacts | 30% | Scattered chars, repeated punctuation, malformed words | | Script Content | 20% | JavaScript, CSS, HTML tags | | Navigation Elements | 10% | Breadcrumbs, pagination, skip links | | Document Structure | 20% | Sentence/paragraph length, punctuation distribution | | Metadata Quality | 10% | Presence of title, author, subject | Score ranges: `0.0–0.3` very low, `0.3–0.6` low, `0.6–0.8` moderate, `0.8–1.0` high. ### Configuration === "Python" --8<-- "snippets/python/config/quality_processing_config.md" === "TypeScript" --8<-- "snippets/typescript/config/quality_processing_config.md" === "Rust" --8<-- "snippets/rust/advanced/quality_processing_config.md" === "Go" --8<-- "snippets/go/config/quality_processing_config.md" === "Java" --8<-- "snippets/java/config/quality_processing_config.md" === "C#" --8<-- "snippets/csharp/advanced/quality_processing_config.md" === "Ruby" --8<-- "snippets/ruby/config/quality_processing_config.md" === "R" --8<-- "snippets/r/config/quality_processing_config.md" ### Example === "Python" --8<-- "snippets/python/utils/quality_processing_example.md" === "TypeScript" --8<-- "snippets/typescript/utils/quality_processing_example.md" === "Rust" --8<-- "snippets/rust/advanced/quality_processing_example.md" === "Go" --8<-- "snippets/go/advanced/quality_processing_example.md" === "Java" --8<-- "snippets/java/advanced/quality_processing_example.md" === "C#" --8<-- "snippets/csharp/advanced/quality_processing_example.md" === "Ruby" --8<-- "snippets/ruby/advanced/quality_processing_example.md" === "R" --8<-- "snippets/r/advanced/quality_processing_example.md" ## Combining Features === "Python" --8<-- "snippets/python/advanced/combining_all_features.md" === "TypeScript" --8<-- "snippets/typescript/getting-started/combining_all_features.md" === "Rust" --8<-- "snippets/rust/api/combining_all_features.md" === "Go" --8<-- "snippets/go/api/combining_all_features.md" === "Java" --8<-- "snippets/java/api/combining_all_features.md" === "C#" --8<-- "snippets/csharp/advanced/combining_all_features.md" === "Ruby" --8<-- "snippets/ruby/api/combining_all_features.md" === "R" --8<-- "snippets/r/api/combining_all_features.md"