fil/docs/guides/advanced.md

# Advanced Features

## Text Chunking

Split extracted text into chunks for RAG, vector databases, or LLM context windows. Four strategies:

- **Text** — splits on whitespace/punctuation boundaries
- **Markdown** — structure-aware; preserves headings, lists, and code blocks
- **YAML** — section-aware; preserves YAML document structure
- **Semantic** — topic-aware; splits at natural document boundaries

### Semantic

Set `chunker_type` to `"semantic"`. Uses an embedding model for topic detection when one is configured; otherwise falls back to structural heuristics.

```python
config = ExtractionConfig(
    chunking=ChunkingConfig(chunker_type="semantic")
)
```

**Behavior:**

- **Without embeddings** — Uses structural heuristics: detects headers (ALL CAPS, numbered sections) and paragraph boundaries
- **With embeddings** — Compares consecutive paragraphs via embeddings to detect topic shifts, merging paragraphs below the `topic_threshold` (default: 0.5)

Use `topic_threshold` to control sensitivity: higher values (0.7–0.9) preserve more fine-grained topics, lower values (0.1–0.3) merge aggressive. Only applies when an embedding model is configured.

### Configuration

=== "Python"

    --8<-- "snippets/python/config/chunking_config.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/chunking_config.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/chunking_config.md"

=== "Go"

    --8<-- "snippets/go/config/chunking_config.md"

=== "Java"

    --8<-- "snippets/java/config/chunking_config.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/chunking_config.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/chunking_config.md"

=== "R"

    --8<-- "snippets/r/config/chunking_config.md"

=== "Wasm"

    --8<-- "snippets/wasm/config/chunking_config.md"

### Chunk Output

Each chunk in `result.chunks` contains:

| Field                                   | Description                                      |
| --------------------------------------- | ------------------------------------------------ |
| `content`                               | Chunk text                                       |
| `metadata.byte_start` / `byte_end`      | Byte offsets in the original text                |
| `metadata.chunk_index` / `total_chunks` | Position in sequence                             |
| `metadata.token_count`                  | Token count (when embeddings enabled)            |
| `metadata.heading_context`              | Active heading hierarchy (Markdown chunker only) |
| `embedding`                             | Embedding vector (when configured)               |

Chunks can be sized by token count instead of characters — enable the `chunking-tokenizers` feature and set `sizing` to `token`.

### RAG Pipeline Example

=== "Python"

    --8<-- "snippets/python/utils/chunking_rag.md"

=== "TypeScript"

    --8<-- "snippets/typescript/utils/chunking_rag.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/chunking_rag.md"

=== "Go"

    --8<-- "snippets/go/advanced/chunking_rag.md"

=== "Java"

    --8<-- "snippets/java/advanced/chunking_rag.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/chunking_rag.md"

=== "Ruby"

    --8<-- "snippets/ruby/advanced/chunking_rag.md"

=== "R"

    --8<-- "snippets/r/advanced/chunking_rag.md"

## Language Detection

Detect languages in extracted text using [`whatlang`](https://crates.io/crates/whatlang) — 60+ languages with ISO 639-3 codes. Set `detect_multiple: true` to chunk the text into 200-char segments and return all detected languages sorted by prevalence.

### Configuration

=== "Python"

    --8<-- "snippets/python/config/language_detection_config.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/language_detection_config.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/language_detection_config.md"

=== "Go"

    --8<-- "snippets/go/config/language_detection_config.md"

=== "Java"

    --8<-- "snippets/java/config/language_detection_config.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/language_detection_config.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/language_detection_config.md"

=== "R"

    --8<-- "snippets/r/config/language_detection_config.md"

### Multilingual Example

=== "Python"

    --8<-- "snippets/python/utils/language_detection_multilingual.md"

=== "TypeScript"

    --8<-- "snippets/typescript/metadata/language_detection_multilingual.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/language_detection_multilingual.md"

=== "Go"

    --8<-- "snippets/go/advanced/language_detection_multilingual.md"

=== "Java"

    --8<-- "snippets/java/advanced/language_detection_multilingual.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/language_detection_multilingual.md"

=== "Ruby"

    --8<-- "snippets/ruby/advanced/language_detection_multilingual.md"

=== "R"

    --8<-- "snippets/r/advanced/language_detection_multilingual.md"

## Embedding Generation

Local in-process embeddings via ONNX for semantic search and RAG — no external API calls. Requires the `embeddings` feature.

| Preset         | Model                        | Dimensions | Max Tokens | Use Case                                                |
| -------------- | ---------------------------- | ---------- | ---------- | ------------------------------------------------------- |
| `fast`         | all-MiniLM-L6-v2 (quantized) | 384        | 512        | Quick prototyping, development, resource-constrained    |
| `balanced`     | BGE-base-en-v1.5             | 768        | 1024       | General-purpose RAG, production deployments, English    |
| `quality`      | BGE-large-en-v1.5            | 1024       | 2000       | Complex documents, maximum accuracy, sufficient compute |
| `multilingual` | multilingual-e5-base         | 768        | 1024       | International documents, mixed-language content         |

### In-Process Embedding Backends (Plugin Variant)

Plug a caller-managed embedder (e.g. `llama-cpp-python`, `sentence-transformers`) into Kreuzberg via the `Plugin` variant of `EmbeddingModelType` — Kreuzberg calls back into the registered backend instead of running its own ONNX model.

1. Register the backend once at startup via `kreuzberg::plugins::register_embedding_backend(Arc::new(MyEmbedder))`. The backend implements `EmbeddingBackend` (a `Plugin`-inheriting async trait with `dimensions()` and `embed(texts) -> Vec<Vec<f32>>`).
2. Reference it by name in `EmbeddingConfig`: `{ "model": { "type": "plugin", "name": "my-embedder" } }`.
3. Optional: set `EmbeddingConfig.max_embed_duration_secs` (default 60) to bound the wait on a hung backend; `None` disables the timeout.

The CLI (`kreuzberg embed --provider plugin --plugin my-embedder`), MCP server (`embed_text` tool, `embedding_plugin` parameter), REST API, and env var `KREUZBERG_EMBEDDING_PLUGIN_NAME` all accept the Plugin variant once a backend is registered.

**Fork-safety**: Python callers running under `multiprocessing`, `gunicorn`'s prefork worker, or Celery prefork must re-register the backend in each child process — native-backed embedders (including `llama-cpp-python`) aren't fork-safe. Use `os.register_at_fork(after_in_child=reregister_fn)` to automate the re-registration.

### Configuration

=== "Python"

    --8<-- "snippets/python/utils/embedding_with_chunking.md"

=== "TypeScript"

    --8<-- "snippets/typescript/utils/embedding_with_chunking.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/embedding_with_chunking.md"

=== "Go"

    --8<-- "snippets/go/advanced/embedding_with_chunking.md"

=== "Java"

    --8<-- "snippets/java/advanced/embedding_with_chunking.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/embedding_with_chunking.md"

=== "Ruby"

    --8<-- "snippets/ruby/advanced/embedding_with_chunking.md"

=== "R"

    --8<-- "snippets/r/advanced/embedding_with_chunking.md"

### Vector Database Integration

=== "Python"

    --8<-- "snippets/python/utils/vector_database_integration.md"

=== "TypeScript"

    --8<-- "snippets/typescript/utils/vector_database_integration.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/vector_database_integration.md"

=== "Go"

    --8<-- "snippets/go/advanced/vector_database_integration.md"

=== "Java"

    --8<-- "snippets/java/advanced/vector_database_integration.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/vector_database_integration.md"

=== "Ruby"

    --8<-- "snippets/ruby/advanced/vector_database_integration.md"

=== "R"

    --8<-- "snippets/r/advanced/vector_database_integration.md"

## Token Reduction

Reduce token count while preserving meaning for LLM pipelines.

| Level        | Reduction | Effect                                   |
| ------------ | --------- | ---------------------------------------- |
| `off`        | 0%        | Pass-through                             |
| `moderate`   | 15–25%    | Stopwords + redundancy removal           |
| `aggressive` | 30–50%    | Semantic clustering + importance scoring |

### Configuration

=== "Python"

    --8<-- "snippets/python/config/token_reduction_config.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/token_reduction_config.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/token_reduction_config.md"

=== "Go"

    --8<-- "snippets/go/config/token_reduction_config.md"

=== "Java"

    --8<-- "snippets/java/config/token_reduction_config.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/token_reduction_config.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/token_reduction_config.md"

=== "R"

    --8<-- "snippets/r/config/token_reduction_config.md"

### Example

=== "Python"

    --8<-- "snippets/python/utils/token_reduction_example.md"

=== "TypeScript"

    --8<-- "snippets/typescript/utils/token_reduction_example.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/token_reduction_example.md"

=== "Go"

    --8<-- "snippets/go/advanced/token_reduction_example.md"

=== "Java"

    --8<-- "snippets/java/advanced/token_reduction_example.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/token_reduction_example.md"

=== "Ruby"

    --8<-- "snippets/ruby/advanced/token_reduction_example.md"

=== "R"

    --8<-- "snippets/r/advanced/token_reduction_example.md"

## Keyword Extraction

Extract keywords using YAKE or RAKE algorithms. Requires the `keywords` feature flag. See [Keyword Extraction](keywords.md) for algorithm details and parameter reference.

### Configuration

=== "Python"

    --8<-- "snippets/python/config/keyword_extraction_config.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/keyword_extraction_config.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/keyword_extraction_config.md"

=== "Go"

    --8<-- "snippets/go/config/keyword_extraction_config.md"

=== "Java"

    --8<-- "snippets/java/config/keyword_extraction_config.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/keyword_extraction_config.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/keyword_extraction_config.md"

=== "R"

    --8<-- "snippets/r/config/keyword_extraction_config.md"

### Example

=== "Python"

    --8<-- "snippets/python/utils/keyword_extraction_example.md"

=== "TypeScript"

    --8<-- "snippets/typescript/utils/keyword_extraction_example.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/keyword_extraction_example.md"

=== "Go"

    --8<-- "snippets/go/advanced/keyword_extraction_example.md"

=== "Java"

    --8<-- "snippets/java/advanced/keyword_extraction_example.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/keyword_extraction_example.md"

=== "Ruby"

    --8<-- "snippets/ruby/advanced/keyword_extraction_example.md"

=== "R"

    --8<-- "snippets/r/advanced/keyword_extraction_example.md"

## Quality Processing

Score extracted text for quality issues (0.0–1.0, where 1.0 is highest quality). Detects OCR artifacts, script content, navigation elements, and structural issues.

| Factor              | Weight | Detects                                                |
| ------------------- | ------ | ------------------------------------------------------ |
| OCR Artifacts       | 30%    | Scattered chars, repeated punctuation, malformed words |
| Script Content      | 20%    | JavaScript, CSS, HTML tags                             |
| Navigation Elements | 10%    | Breadcrumbs, pagination, skip links                    |
| Document Structure  | 20%    | Sentence/paragraph length, punctuation distribution    |
| Metadata Quality    | 10%    | Presence of title, author, subject                     |

Score ranges: `0.0–0.3` very low, `0.3–0.6` low, `0.6–0.8` moderate, `0.8–1.0` high.

### Configuration

=== "Python"

    --8<-- "snippets/python/config/quality_processing_config.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/quality_processing_config.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/quality_processing_config.md"

=== "Go"

    --8<-- "snippets/go/config/quality_processing_config.md"

=== "Java"

    --8<-- "snippets/java/config/quality_processing_config.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/quality_processing_config.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/quality_processing_config.md"

=== "R"

    --8<-- "snippets/r/config/quality_processing_config.md"

### Example

=== "Python"

    --8<-- "snippets/python/utils/quality_processing_example.md"

=== "TypeScript"

    --8<-- "snippets/typescript/utils/quality_processing_example.md"

=== "Rust"

    --8<-- "snippets/rust/advanced/quality_processing_example.md"

=== "Go"

    --8<-- "snippets/go/advanced/quality_processing_example.md"

=== "Java"

    --8<-- "snippets/java/advanced/quality_processing_example.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/quality_processing_example.md"

=== "Ruby"

    --8<-- "snippets/ruby/advanced/quality_processing_example.md"

=== "R"

    --8<-- "snippets/r/advanced/quality_processing_example.md"

## Combining Features

=== "Python"

    --8<-- "snippets/python/advanced/combining_all_features.md"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/combining_all_features.md"

=== "Rust"

    --8<-- "snippets/rust/api/combining_all_features.md"

=== "Go"

    --8<-- "snippets/go/api/combining_all_features.md"

=== "Java"

    --8<-- "snippets/java/api/combining_all_features.md"

=== "C#"

    --8<-- "snippets/csharp/advanced/combining_all_features.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/combining_all_features.md"

=== "R"

    --8<-- "snippets/r/api/combining_all_features.md"