Files
fil/.ai-rulez/skills/chunking-embeddings/SKILL.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

5.6 KiB

description, name, priority
description name priority
Chunking, embeddings, and RAG pipeline integration chunking-embeddings critical

Chunking & Embeddings

Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration

Chunking Architecture Overview

Location: crates/kreuzberg/src/chunking/, crates/kreuzberg/src/embeddings.rs

Extracted Text
    |
[1. Normalization] -> Clean whitespace, remove control chars
    |
[2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
    |
[3. Overlap Management] -> Control context window overlap
    |
[4. Optional Embedding] -> Generate vectors with FastEmbed
    |
Output: Vec<Chunk> with text, vectors, metadata

Chunking Strategies

Location: crates/kreuzberg/src/chunking/mod.rs

Strategy Pattern Best For
Fixed-Size Sliding window with configurable overlap Uniform chunks for embedding models with fixed token limits
Semantic Split by sentences, merge/split by similarity threshold Smart context preservation for LLM consumption and semantic search
Syntax-Aware Split by paragraph/section/heading/code-block structure Preserving document structure (sections, code blocks) in RAG
Recursive (LangChain pattern) Try separators in order: \n\n, \n, , Best general-purpose chunking; auto-finds optimal split points

Key config fields per strategy (see struct definitions in chunking/mod.rs):

  • Fixed-Size: chunk_size, overlap, trim_whitespace
  • Semantic: target_chunk_size, min/max_chunk_size, semantic_threshold, use_sentence_boundaries
  • Syntax-Aware: chunk_by (Paragraph/Section/Heading/Sentence/CodeBlock), max_chunk_size, respect_code_blocks
  • Recursive: separators[], chunk_size, overlap

Chunking Configuration Presets

Location: crates/kreuzberg/src/chunking/mod.rs

Preset Chunk Size Overlap Strategy Use Case
Balanced 512 tokens 50 Semantic RAG sweet spot
Compact 256 tokens 32 Fixed-Size Dense vectors
Extended 1024 tokens 100 Recursive Full context
Minimal 128 tokens 16 (default) Lightweight embeddings

Usage: set config.chunking.preset = Some("balanced") in ExtractionConfig.

Embedding Generation with FastEmbed

Location: crates/kreuzberg/src/embeddings.rs

Model Selection

Model Dimensions Notes
BAAI/bge-small-en-v1.5 (default) 384 Fast, excellent for RAG
BAAI/bge-small-zh-v1.5 384 Chinese optimized
BAAI/bge-base-en-v1.5 768 Better quality, slower
jinaai/jina-embeddings-v2-base-en 768 Long context (up to 8192 tokens)
Custom(path) varies Custom ONNX model path

Embedding Pattern

TextEmbeddingManager provides singleton-cached models per config. Pattern:

  1. get_or_init_model() -- lazy-loads ONNX model (downloads if needed), caches in Arc<RwLock<HashMap>>
  2. embed_chunks() -- collects chunk texts, calls model.embed(texts, batch_size), zips results back to ChunkWithEmbedding

Default config: batch_size=256, device=CPU, parallel_requests=4.

ONNX Runtime Requirement

Embeddings require ONNX Runtime. Feature-gated via:

[features]
embeddings = ["dep:fastembed", "dep:ort"]

Install: brew install onnxruntime (macOS) / apt install libonnxruntime libonnxruntime-dev (Linux). Verify: echo $ORT_DYLIB_PATH.

RAG Integration Pattern

The full extraction-to-RAG pipeline:

  1. Extract: extract_file(path, config) -> ExtractionResult
  2. Chunk: Apply preset strategy to result.content -> Vec<Chunk>
  3. Embed: If embedding config present, TextEmbeddingManager::embed_chunks() -> Vec<ChunkWithEmbedding>
  4. Output: RagDocument { file_path, metadata, chunks } ready for vector DB ingestion

See ChunkWithEmbedding struct in types.rs: contains text, embedding: Vec<f32>, dimensions, norm, metadata.

Critical Rules

  1. Chunking is preprocessing - Always apply before embedding to ensure consistent vector sizes
  2. Overlap prevents information loss - Set overlap to 15-20% of chunk size
  3. Embedding models are stateful - Lazy load and cache to avoid repeated initialization
  4. ONNX Runtime is required - Gracefully degrade if not available (skip embeddings)
  5. Batch embedding for performance - Never embed single chunks; batch 50-1000 chunks
  6. Normalize embeddings for search - Use L2 norm for cosine similarity
  7. Cache embedding results - Don't re-embed identical text chunks
  8. Model selection impacts quality - bge-small (384) for speed, bge-base (768) for quality
  • extraction-pipeline-patterns - Text extraction preceding chunking
  • api-server-mcp - Endpoint for chunking + embedding operations
  • ocr-backend-management - OCR text quality affects chunking success