fil/docs/guides/advanced.md
Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00


Advanced Features

Text Chunking

Split extracted text into chunks for RAG, vector databases, or LLM context windows. Four strategies:

  • Text — splits on whitespace/punctuation boundaries
  • Markdown — structure-aware; preserves headings, lists, and code blocks
  • YAML — section-aware; preserves YAML document structure
  • Semantic — topic-aware; splits at natural document boundaries

Semantic

Set chunker_type to "semantic". Uses an embedding model for topic detection when one is configured; otherwise falls back to structural heuristics.

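from kreuzberg import ChunkingConfig, ExtractionConfig

# Select the semantic chunker; all other chunking options keep their defaults.
config = ExtractionConfig(
    chunking=ChunkingConfig(chunker_type="semantic")
)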

Behavior:

  • Without embeddings — Uses structural heuristics: detects headers (ALL CAPS, numbered sections) and paragraph boundaries
  • With embeddings — Compares consecutive paragraphs via embeddings to detect topic shifts, merging paragraphs below the topic_threshold (default: 0.5)

Use topic_threshold to control sensitivity: higher values (0.7–0.9) preserve more fine-grained topics, while lower values (0.1–0.3) merge more aggressively. The setting only applies when an embedding model is configured.
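
To make the effect of topic_threshold concrete, here is a minimal sketch of threshold-based grouping. It is an illustration only, not Kreuzberg's implementation: the function names, the toy two-dimensional vectors, and the rule "keep consecutive paragraphs together when their cosine similarity is at or above the threshold" are all assumptions chosen to match the behavior described above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_paragraphs(paragraphs, vectors, topic_threshold=0.5):
    """Hypothetical helper: start a new chunk whenever the similarity between
    consecutive paragraphs drops below topic_threshold."""
    if not paragraphs:
        return []
    chunks = [[paragraphs[0]]]
    for i in range(1, len(paragraphs)):
        if cosine(vectors[i - 1], vectors[i]) >= topic_threshold:
            chunks[-1].append(paragraphs[i])   # same topic: extend current chunk
        else:
            chunks.append([paragraphs[i]])     # topic shift: open a new chunk
    return ["\n\n".join(chunk) for chunk in chunks]


# Toy data: made-up vectors standing in for real paragraph embeddings.
paragraphs = ["Cats purr.", "Cats nap often.", "Rust has ownership."]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]

coarse = group_paragraphs(paragraphs, vectors, topic_threshold=0.3)
fine = group_paragraphs(paragraphs, vectors, topic_threshold=0.9)
```

With these toy vectors the consecutive similarities are roughly 0.8 and 0.6, so a low threshold yields a single chunk while a high threshold keeps every paragraph separate.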

Configuration

=== "Python"

--8<-- "snippets/python/config/chunking_config.md"

=== "TypeScript"

--8<-- "snippets/typescript/config/chunking_config.md"

=== "Rust"

--8<-- "snippets/rust/advanced/chunking_config.md"

=== "Go"

--8<-- "snippets/go/config/chunking_config.md"

=== "Java"

--8<-- "snippets/java/config/chunking_config.md"

=== "C#"

--8<-- "snippets/csharp/advanced/chunking_config.md"

=== "Ruby"

--8<-- "snippets/ruby/config/chunking_config.md"

=== "R"

--8<-- "snippets/r/config/chunking_config.md"

=== "Wasm"

--8<-- "snippets/wasm/config/chunking_config.md"

Chunk Output

Each chunk in result.chunks contains:

| Field | Description |
|-------|-------------|
| content | Chunk text |
| metadata.byte_start / byte_end | Byte offsets in the original text |
| metadata.chunk_index / total_chunks | Position in sequence |
| metadata.token_count | Token count (when embeddings enabled) |
| metadata.heading_context | Active heading hierarchy (Markdown chunker only) |
| embedding | Embedding vector (when configured) |

Chunks can be sized by token count instead of characters — enable the chunking-tokenizers feature and set sizing to token.
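
Because byte_start and byte_end are byte offsets rather than character offsets, slicing a Python str with them directly goes wrong as soon as the text contains multi-byte characters. The sketch below assumes the offsets refer to the UTF-8 encoding of the original text and that byte_end is exclusive; both are assumptions, and the chunk dictionary is a hypothetical stand-in for a real chunk object.

```python
def slice_by_bytes(text, byte_start, byte_end):
    """Return the part of `text` covered by UTF-8 byte offsets [byte_start, byte_end)."""
    return text.encode("utf-8")[byte_start:byte_end].decode("utf-8")


text = "Café menu. Second sentence."

# Hypothetical chunk: "é" occupies two bytes, so the second sentence
# starts at byte 12 even though it starts at character 11.
chunk = {
    "content": "Second sentence.",
    "metadata": {"byte_start": 12, "byte_end": 28},
}

meta = chunk["metadata"]
recovered = slice_by_bytes(text, meta["byte_start"], meta["byte_end"])
naive = text[meta["byte_start"]:meta["byte_end"]]  # character slicing drops the "S"
```

Here recovered matches the chunk content, while the naive character slice is shifted by one character.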

RAG Pipeline Example

=== "Python"

--8<-- "snippets/python/utils/chunking_rag.md"

=== "TypeScript"

--8<-- "snippets/typescript/utils/chunking_rag.md"

=== "Rust"

--8<-- "snippets/rust/advanced/chunking_rag.md"

=== "Go"

--8<-- "snippets/go/advanced/chunking_rag.md"

=== "Java"

--8<-- "snippets/java/advanced/chunking_rag.md"

=== "C#"

--8<-- "snippets/csharp/advanced/chunking_rag.md"

=== "Ruby"

--8<-- "snippets/ruby/advanced/chunking_rag.md"

=== "R"

--8<-- "snippets/r/advanced/chunking_rag.md"

Language Detection

Detect languages in extracted text using whatlang — 60+ languages with ISO 639-3 codes. Set detect_multiple: true to chunk the text into 200-char segments and return all detected languages sorted by prevalence.
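
The detect_multiple behavior can be pictured as "split, detect per segment, rank by count". The following is a rough sketch of that idea, not the library's code: detect_languages and toy_detect are hypothetical names, the toy detector is a keyword check rather than whatlang, and ranking by segment count is an assumed reading of "sorted by prevalence".

```python
from collections import Counter


def detect_languages(text, detect, segment_chars=200):
    """Split text into fixed-size segments, detect each one, and return
    language codes ordered from most to least frequent."""
    segments = [text[i:i + segment_chars] for i in range(0, len(text), segment_chars)]
    counts = Counter()
    for segment in segments:
        code = detect(segment)
        if code is not None:          # skip segments the detector cannot classify
            counts[code] += 1
    return [code for code, _ in counts.most_common()]


def toy_detect(segment):
    """Stand-in detector for illustration only; returns ISO 639-3 style codes."""
    if " the " in segment:
        return "eng"
    if " und " in segment:
        return "deu"
    return None


# 400 characters of English-like text followed by 200 characters of German-like text.
sample = "the cat " * 50 + "x und y " * 25
languages = detect_languages(sample, toy_detect)
```

For this sample, two segments are classified as eng and one as deu, so eng is listed first.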

Configuration

=== "Python"

--8<-- "snippets/python/config/language_detection_config.md"

=== "TypeScript"

--8<-- "snippets/typescript/config/language_detection_config.md"

=== "Rust"

--8<-- "snippets/rust/advanced/language_detection_config.md"

=== "Go"

--8<-- "snippets/go/config/language_detection_config.md"

=== "Java"

--8<-- "snippets/java/config/language_detection_config.md"

=== "C#"

--8<-- "snippets/csharp/advanced/language_detection_config.md"

=== "Ruby"

--8<-- "snippets/ruby/config/language_detection_config.md"

=== "R"

--8<-- "snippets/r/config/language_detection_config.md"

Multilingual Example

=== "Python"

--8<-- "snippets/python/utils/language_detection_multilingual.md"

=== "TypeScript"

--8<-- "snippets/typescript/metadata/language_detection_multilingual.md"

=== "Rust"

--8<-- "snippets/rust/advanced/language_detection_multilingual.md"

=== "Go"

--8<-- "snippets/go/advanced/language_detection_multilingual.md"

=== "Java"

--8<-- "snippets/java/advanced/language_detection_multilingual.md"

=== "C#"

--8<-- "snippets/csharp/advanced/language_detection_multilingual.md"

=== "Ruby"

--8<-- "snippets/ruby/advanced/language_detection_multilingual.md"

=== "R"

--8<-- "snippets/r/advanced/language_detection_multilingual.md"

Embedding Generation

Local in-process embeddings via ONNX for semantic search and RAG — no external API calls. Requires the embeddings feature.

| Preset | Model | Dimensions | Max Tokens | Use Case |
|--------|-------|------------|------------|----------|
| fast | all-MiniLM-L6-v2 (quantized) | 384 | 512 | Quick prototyping, development, resource-constrained |
| balanced | BGE-base-en-v1.5 | 768 | 1024 | General-purpose RAG, production deployments, English |
| quality | BGE-large-en-v1.5 | 1024 | 2000 | Complex documents, maximum accuracy, sufficient compute |
| multilingual | multilingual-e5-base | 768 | 1024 | International documents, mixed-language content |

In-Process Embedding Backends (Plugin Variant)

Plug a caller-managed embedder (e.g. llama-cpp-python, sentence-transformers) into Kreuzberg via the Plugin variant of EmbeddingModelType — Kreuzberg calls back into the registered backend instead of running its own ONNX model.

  1. Register the backend once at startup via kreuzberg::plugins::register_embedding_backend(Arc::new(MyEmbedder)). The backend implements EmbeddingBackend (a Plugin-inheriting async trait with dimensions() and embed(texts) -> Vec<Vec<f32>>).
  2. Reference it by name in EmbeddingConfig: { "model": { "type": "plugin", "name": "my-embedder" } }.
  3. Optional: set EmbeddingConfig.max_embed_duration_secs (default 60) to bound the wait on a hung backend; None disables the timeout.
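
The contract in the steps above (a dimensions() accessor, an async embed(texts) call, and a bounded wait) can be sketched in plain Python. This is a language-neutral illustration under stated assumptions: ToyEmbedder and embed_with_timeout are hypothetical names, the source only documents the Rust trait, and nothing here is the actual Kreuzberg Python binding.

```python
import asyncio


class ToyEmbedder:
    """Hypothetical backend with the same shape as the trait described above."""

    def dimensions(self):
        return 3

    async def embed(self, texts):
        # Deterministic toy vectors: [length, number of spaces, constant].
        return [[float(len(t)), float(t.count(" ")), 1.0] for t in texts]


async def embed_with_timeout(backend, texts, max_embed_duration_secs=60):
    """Call the backend, giving up after max_embed_duration_secs (None = wait forever)."""
    if max_embed_duration_secs is None:
        vectors = await backend.embed(texts)
    else:
        vectors = await asyncio.wait_for(
            backend.embed(texts), timeout=max_embed_duration_secs
        )
    # Sanity check: every vector must match the dimension the backend advertises.
    expected = backend.dimensions()
    if any(len(vector) != expected for vector in vectors):
        raise ValueError("backend returned a vector with unexpected dimensions")
    return vectors


vectors = asyncio.run(embed_with_timeout(ToyEmbedder(), ["a b", "c"]))
```

A backend that never returns would raise asyncio.TimeoutError from asyncio.wait_for here instead of blocking the caller indefinitely, which is the role the timeout setting plays in the description above.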

The CLI (kreuzberg embed --provider plugin --plugin my-embedder), MCP server (embed_text tool, embedding_plugin parameter), REST API, and env var KREUZBERG_EMBEDDING_PLUGIN_NAME all accept the Plugin variant once a backend is registered.

Fork-safety: Python callers running under multiprocessing, gunicorn's prefork worker, or Celery prefork must re-register the backend in each child process — native-backed embedders (including llama-cpp-python) aren't fork-safe. Use os.register_at_fork(after_in_child=reregister_fn) to automate the re-registration.

Configuration

=== "Python"

--8<-- "snippets/python/utils/embedding_with_chunking.md"

=== "TypeScript"

--8<-- "snippets/typescript/utils/embedding_with_chunking.md"

=== "Rust"

--8<-- "snippets/rust/advanced/embedding_with_chunking.md"

=== "Go"

--8<-- "snippets/go/advanced/embedding_with_chunking.md"

=== "Java"

--8<-- "snippets/java/advanced/embedding_with_chunking.md"

=== "C#"

--8<-- "snippets/csharp/advanced/embedding_with_chunking.md"

=== "Ruby"

--8<-- "snippets/ruby/advanced/embedding_with_chunking.md"

=== "R"

--8<-- "snippets/r/advanced/embedding_with_chunking.md"

Vector Database Integration

=== "Python"

--8<-- "snippets/python/utils/vector_database_integration.md"

=== "TypeScript"

--8<-- "snippets/typescript/utils/vector_database_integration.md"

=== "Rust"

--8<-- "snippets/rust/advanced/vector_database_integration.md"

=== "Go"

--8<-- "snippets/go/advanced/vector_database_integration.md"

=== "Java"

--8<-- "snippets/java/advanced/vector_database_integration.md"

=== "C#"

--8<-- "snippets/csharp/advanced/vector_database_integration.md"

=== "Ruby"

--8<-- "snippets/ruby/advanced/vector_database_integration.md"

=== "R"

--8<-- "snippets/r/advanced/vector_database_integration.md"

Token Reduction

Reduce token count while preserving meaning for LLM pipelines.

| Level | Reduction | Effect |
|-------|-----------|--------|
| off | 0% | Pass-through |
| moderate | 15–25% | Stopwords + redundancy removal |
| aggressive | 30–50% | Semantic clustering + importance scoring |
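
To give a feel for what "stopwords + redundancy removal" means, here is a deliberately small sketch. It is not Kreuzberg's algorithm: the stopword list is a tiny hand-picked set, redundancy is reduced to "drop an immediately repeated word", and whitespace-separated words are used as a rough proxy for tokens. All names are hypothetical.

```python
STOPWORDS = {"the", "a", "an", "of", "to", "is", "in", "and", "that", "it"}


def reduce_moderate(text):
    """Drop stopwords and immediately repeated words; keep everything else in order."""
    kept = []
    for word in text.split():
        key = word.strip(".,;:!?").lower()
        if key in STOPWORDS:
            continue                                  # stopword removal
        if kept and kept[-1].lower() == word.lower():
            continue                                  # naive redundancy removal
        kept.append(word)
    return " ".join(kept)


def reduction_ratio(original, reduced):
    """Fraction of whitespace-separated words removed (0.0 for empty input)."""
    before = len(original.split())
    after = len(reduced.split())
    return 0.0 if before == 0 else 1 - after / before


original = "The quick brown fox is in the garden and it is very very happy."
reduced = reduce_moderate(original)
ratio = reduction_ratio(original, reduced)
```

Such a short, stopword-heavy sentence shrinks far more than the percentages in the table, which describe typical documents rather than toy inputs.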

Configuration

=== "Python"

--8<-- "snippets/python/config/token_reduction_config.md"

=== "TypeScript"

--8<-- "snippets/typescript/config/token_reduction_config.md"

=== "Rust"

--8<-- "snippets/rust/advanced/token_reduction_config.md"

=== "Go"

--8<-- "snippets/go/config/token_reduction_config.md"

=== "Java"

--8<-- "snippets/java/config/token_reduction_config.md"

=== "C#"

--8<-- "snippets/csharp/advanced/token_reduction_config.md"

=== "Ruby"

--8<-- "snippets/ruby/config/token_reduction_config.md"

=== "R"

--8<-- "snippets/r/config/token_reduction_config.md"

Example

=== "Python"

--8<-- "snippets/python/utils/token_reduction_example.md"

=== "TypeScript"

--8<-- "snippets/typescript/utils/token_reduction_example.md"

=== "Rust"

--8<-- "snippets/rust/advanced/token_reduction_example.md"

=== "Go"

--8<-- "snippets/go/advanced/token_reduction_example.md"

=== "Java"

--8<-- "snippets/java/advanced/token_reduction_example.md"

=== "C#"

--8<-- "snippets/csharp/advanced/token_reduction_example.md"

=== "Ruby"

--8<-- "snippets/ruby/advanced/token_reduction_example.md"

=== "R"

--8<-- "snippets/r/advanced/token_reduction_example.md"

Keyword Extraction

Extract keywords using YAKE or RAKE algorithms. Requires the keywords feature flag. See Keyword Extraction for algorithm details and parameter reference.
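
For readers unfamiliar with RAKE, the sketch below shows the core idea in a simplified form: split text into candidate phrases at stopwords and punctuation, score each word as degree divided by frequency, and score a phrase as the sum of its word scores. It uses a tiny hypothetical stopword list and is not the implementation Kreuzberg ships.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "for", "from", "in", "is", "of", "the", "to", "with"}


def rake(text, top_n=3):
    """Simplified RAKE: return the top_n (phrase, score) pairs, best first."""
    # 1. Candidate phrases: runs of non-stopwords within a punctuation-delimited fragment.
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9][a-z0-9-]*", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    # 2. Word statistics: frequency, and degree (summed length of phrases containing the word).
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # 3. Phrase score: sum of degree/frequency over its words.
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Sort by score descending, then alphabetically for stable ties.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


sample = "Kreuzberg extracts text from documents. Text chunking is useful for vector databases."
keywords = rake(sample)
```

Longer phrases made of rarely repeated words score highest, which is why multi-word terms tend to dominate RAKE output.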

Configuration

=== "Python"

--8<-- "snippets/python/config/keyword_extraction_config.md"

=== "TypeScript"

--8<-- "snippets/typescript/config/keyword_extraction_config.md"

=== "Rust"

--8<-- "snippets/rust/advanced/keyword_extraction_config.md"

=== "Go"

--8<-- "snippets/go/config/keyword_extraction_config.md"

=== "Java"

--8<-- "snippets/java/config/keyword_extraction_config.md"

=== "C#"

--8<-- "snippets/csharp/advanced/keyword_extraction_config.md"

=== "Ruby"

--8<-- "snippets/ruby/config/keyword_extraction_config.md"

=== "R"

--8<-- "snippets/r/config/keyword_extraction_config.md"

Example

=== "Python"

--8<-- "snippets/python/utils/keyword_extraction_example.md"

=== "TypeScript"

--8<-- "snippets/typescript/utils/keyword_extraction_example.md"

=== "Rust"

--8<-- "snippets/rust/advanced/keyword_extraction_example.md"

=== "Go"

--8<-- "snippets/go/advanced/keyword_extraction_example.md"

=== "Java"

--8<-- "snippets/java/advanced/keyword_extraction_example.md"

=== "C#"

--8<-- "snippets/csharp/advanced/keyword_extraction_example.md"

=== "Ruby"

--8<-- "snippets/ruby/advanced/keyword_extraction_example.md"

=== "R"

--8<-- "snippets/r/advanced/keyword_extraction_example.md"

Quality Processing

Score extracted text for quality issues (0.0–1.0, where 1.0 is highest quality). Detects OCR artifacts, script content, navigation elements, and structural issues.

| Factor | Weight | Detects |
|--------|--------|---------|
| OCR Artifacts | 30% | Scattered chars, repeated punctuation, malformed words |
| Script Content | 20% | JavaScript, CSS, HTML tags |
| Navigation Elements | 10% | Breadcrumbs, pagination, skip links |
| Document Structure | 20% | Sentence/paragraph length, punctuation distribution |
| Metadata Quality | 10% | Presence of title, author, subject |

Score ranges: 0.0–0.3 very low, 0.3–0.6 low, 0.6–0.8 moderate, 0.8–1.0 high.
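
When filtering documents by quality, it can help to turn the numeric score into one of the labels above. The helper below is a small sketch with a hypothetical name; since the ranges share their endpoints, it assumes each boundary value belongs to the higher band (for example, 0.3 counts as low rather than very low).

```python
def quality_label(score):
    """Map a quality score in [0.0, 1.0] to a band label.
    Assumption: boundary values fall into the higher band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score < 0.3:
        return "very low"
    if score < 0.6:
        return "low"
    if score < 0.8:
        return "moderate"
    return "high"


labels = [quality_label(s) for s in (0.1, 0.45, 0.7, 0.95)]
```

A pipeline could, for instance, route anything labeled very low or low to a re-extraction step with different OCR settings before indexing.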

Configuration

=== "Python"

--8<-- "snippets/python/config/quality_processing_config.md"

=== "TypeScript"

--8<-- "snippets/typescript/config/quality_processing_config.md"

=== "Rust"

--8<-- "snippets/rust/advanced/quality_processing_config.md"

=== "Go"

--8<-- "snippets/go/config/quality_processing_config.md"

=== "Java"

--8<-- "snippets/java/config/quality_processing_config.md"

=== "C#"

--8<-- "snippets/csharp/advanced/quality_processing_config.md"

=== "Ruby"

--8<-- "snippets/ruby/config/quality_processing_config.md"

=== "R"

--8<-- "snippets/r/config/quality_processing_config.md"

Example

=== "Python"

--8<-- "snippets/python/utils/quality_processing_example.md"

=== "TypeScript"

--8<-- "snippets/typescript/utils/quality_processing_example.md"

=== "Rust"

--8<-- "snippets/rust/advanced/quality_processing_example.md"

=== "Go"

--8<-- "snippets/go/advanced/quality_processing_example.md"

=== "Java"

--8<-- "snippets/java/advanced/quality_processing_example.md"

=== "C#"

--8<-- "snippets/csharp/advanced/quality_processing_example.md"

=== "Ruby"

--8<-- "snippets/ruby/advanced/quality_processing_example.md"

=== "R"

--8<-- "snippets/r/advanced/quality_processing_example.md"

Combining Features

=== "Python"

--8<-- "snippets/python/advanced/combining_all_features.md"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/combining_all_features.md"

=== "Rust"

--8<-- "snippets/rust/api/combining_all_features.md"

=== "Go"

--8<-- "snippets/go/api/combining_all_features.md"

=== "Java"

--8<-- "snippets/java/api/combining_all_features.md"

=== "C#"

--8<-- "snippets/csharp/advanced/combining_all_features.md"

=== "Ruby"

--8<-- "snippets/ruby/api/combining_all_features.md"

=== "R"

--8<-- "snippets/r/api/combining_all_features.md"