---
title: "C API Reference"
---

## C API Reference v5.0.0-rc.3

### Functions

#### kreuzberg_extract_bytes()

Extract content from a byte array.

This is the main entry point for in-memory extraction. It performs the following steps:

1. Validate MIME type
2. Handle legacy format conversion if needed
3. Select appropriate extractor from registry
4. Extract content
5. Run post-processing pipeline

**Returns:** An `ExtractionResult` containing the extracted content and metadata.

**Errors:** Returns `KreuzbergError.Validation` if the MIME type is invalid. Returns `KreuzbergError.UnsupportedFormat` if the MIME type is not supported.

**Signature:**

```c
KreuzbergExtractionResult* kreuzberg_extract_bytes(const uint8_t* content, const char* mime_type, KreuzbergExtractionConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `content` | `const uint8_t*` | Yes | The byte array to extract |
| `mime_type` | `const char*` | Yes | MIME type of the content |
| `config` | `KreuzbergExtractionConfig` | Yes | Extraction configuration |

**Returns:** `KreuzbergExtractionResult*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_extract_file()

Extract content from a file.

This is the main entry point for file-based extraction. It performs the following steps:

1. Check cache for existing result (if caching enabled)
2. Detect or validate MIME type
3. Select appropriate extractor from registry
4. Extract content
5. Run post-processing pipeline
6. Store result in cache (if caching enabled)

**Returns:** An `ExtractionResult` containing the extracted content and metadata.

**Errors:** Returns `KreuzbergError.Io` if the file doesn't exist (NotFound) or for other file I/O errors. Returns `KreuzbergError.UnsupportedFormat` if the MIME type is not supported.

**Signature:**

```c
KreuzbergExtractionResult* kreuzberg_extract_file(const char* path, const char* mime_type, KreuzbergExtractionConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `path` | `const char*` | Yes | Path to the file to extract |
| `mime_type` | `const char*` | No | Optional MIME type override. If `NULL`, the MIME type is auto-detected |
| `config` | `KreuzbergExtractionConfig` | Yes | Extraction configuration |

**Returns:** `KreuzbergExtractionResult*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_extract_file_sync()

Synchronous wrapper for `extract_file`.

This is a convenience function that blocks the current thread until extraction completes. For async code, use `extract_file` directly.

Uses the global Tokio runtime for 100x+ performance improvement over creating a new runtime per call. Always uses the global runtime to avoid nested runtime issues.

This function is only available with the `tokio-runtime` feature. For WASM targets, use a truly synchronous extraction approach instead.

**Signature:**

```c
KreuzbergExtractionResult* kreuzberg_extract_file_sync(const char* path, const char* mime_type, KreuzbergExtractionConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `path` | `const char*` | Yes | Path to the file |
| `mime_type` | `const char*` | No | The MIME type |
| `config` | `KreuzbergExtractionConfig` | Yes | The configuration options |

**Returns:** `KreuzbergExtractionResult*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_extract_bytes_sync()

Synchronous wrapper for `extract_bytes`.

Uses the global Tokio runtime for 100x+ performance improvement over creating a new runtime per call.

With the `tokio-runtime` feature, this blocks the current thread using the global Tokio runtime. Without it (WASM), this calls a truly synchronous implementation.

**Signature:**

```c
KreuzbergExtractionResult* kreuzberg_extract_bytes_sync(const uint8_t* content, const char* mime_type, KreuzbergExtractionConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `content` | `const uint8_t*` | Yes | The content to process |
| `mime_type` | `const char*` | Yes | The MIME type |
| `config` | `KreuzbergExtractionConfig` | Yes | The configuration options |

**Returns:** `KreuzbergExtractionResult*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_batch_extract_files_sync()

Synchronous wrapper for `batch_extract_files`.

Uses the global Tokio runtime for optimal performance. Only available with `tokio-runtime` (WASM has no filesystem).

**Signature:**

```c
KreuzbergExtractionResult* kreuzberg_batch_extract_files_sync(KreuzbergBatchFileItem* items, KreuzbergExtractionConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `KreuzbergBatchFileItem*` | Yes | The items |
| `config` | `KreuzbergExtractionConfig` | Yes | The configuration options |

**Returns:** `KreuzbergExtractionResult*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_batch_extract_bytes_sync()

Synchronous wrapper for `batch_extract_bytes`.

Uses the global Tokio runtime for optimal performance.

With the `tokio-runtime` feature, this blocks the current thread using the global Tokio runtime. Without it (WASM), this calls a truly synchronous implementation that iterates through items and calls `extract_bytes_sync()`.

**Signature:**

```c
KreuzbergExtractionResult* kreuzberg_batch_extract_bytes_sync(KreuzbergBatchBytesItem* items, KreuzbergExtractionConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `KreuzbergBatchBytesItem*` | Yes | The items |
| `config` | `KreuzbergExtractionConfig` | Yes | The configuration options |

**Returns:** `KreuzbergExtractionResult*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_batch_extract_files()

Extract content from multiple files concurrently.

This function processes multiple files in parallel, automatically managing concurrency to prevent resource exhaustion. The concurrency limit can be configured via `ExtractionConfig.max_concurrent_extractions` or defaults to `(num_cpus * 1.5).ceil()`.

Each file can optionally specify a `FileExtractionConfig` that overrides specific fields from the batch-level `config`. Pass `NULL` for a file to use the batch defaults. Batch-level settings like `max_concurrent_extractions` and `use_cache` are always taken from the batch-level `config`.

- `items` - Vector of `BatchFileItem` structs, each containing a path and optional per-file configuration overrides.
- `config` - Batch-level extraction configuration (provides defaults and batch settings)

**Returns:** A vector of `ExtractionResult` in the same order as the input items.

**Errors:** Individual file errors are captured in the result metadata. System errors (IO, RuntimeError equivalents) will bubble up and fail the entire batch.

**Signature:**

```c
KreuzbergExtractionResult* kreuzberg_batch_extract_files(KreuzbergBatchFileItem* items, KreuzbergExtractionConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `KreuzbergBatchFileItem*` | Yes | Vector of `BatchFileItem` structs, each containing a path and optional per-file configuration overrides |
| `config` | `KreuzbergExtractionConfig` | Yes | Batch-level extraction configuration (provides defaults and batch settings) |

**Returns:** `KreuzbergExtractionResult*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_batch_extract_bytes()

Extract content from multiple byte arrays concurrently.

This function processes multiple byte arrays in parallel, automatically managing concurrency to prevent resource exhaustion. The concurrency limit can be configured via `ExtractionConfig.max_concurrent_extractions` or defaults to `(num_cpus * 1.5).ceil()`.

Each item can optionally specify a `FileExtractionConfig` that overrides specific fields from the batch-level `config`. Pass `NULL` as the config to use the batch-level defaults for that item.

- `items` - Vector of `BatchBytesItem` structs, each containing content bytes, MIME type, and optional per-item configuration overrides.
- `config` - Batch-level extraction configuration

**Returns:** A vector of `ExtractionResult` in the same order as the input items.

**Signature:**

```c
KreuzbergExtractionResult* kreuzberg_batch_extract_bytes(KreuzbergBatchBytesItem* items, KreuzbergExtractionConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `KreuzbergBatchBytesItem*` | Yes | Vector of `BatchBytesItem` structs, each containing content bytes, MIME type, and optional per-item configuration overrides |
| `config` | `KreuzbergExtractionConfig` | Yes | Batch-level extraction configuration |

**Returns:** `KreuzbergExtractionResult*`

**Errors:** Returns `NULL` on error.

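The default concurrency limit quoted for the batch functions, `(num_cpus * 1.5).ceil()`, can be worked through with a small sketch. The helper name `default_max_concurrent` is hypothetical and not part of the kreuzberg API; it only reproduces the arithmetic, using integers so no floating-point library is needed.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the kreuzberg API.
 * Computes ceil(num_cpus * 1.5) with integer arithmetic only. */
size_t default_max_concurrent(size_t num_cpus)
{
    /* n * 1.5 == n + n/2; rounding n/2 up yields the ceiling. */
    return num_cpus + (num_cpus + 1) / 2;
}
```

Under this formula a 4-core machine would get a limit of 6 and a 3-core machine a limit of 5. Setting `max_concurrent_extractions` on the batch-level config replaces the default.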
---

#### kreuzberg_detect_mime_type_from_bytes()

Detect MIME type from raw file bytes.

Uses magic byte signatures to detect file type from content. Falls back to the `infer` crate for comprehensive detection.

For ZIP-based files, inspects contents to distinguish Office Open XML formats (DOCX, XLSX, PPTX) from plain ZIP archives.

**Returns:** The detected MIME type string.

**Errors:** Returns `KreuzbergError.UnsupportedFormat` if the MIME type cannot be determined.

**Signature:**

```c
const char* kreuzberg_detect_mime_type_from_bytes(const uint8_t* content);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `content` | `const uint8_t*` | Yes | Raw file bytes |

**Returns:** `const char*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_get_extensions_for_mime()

Get file extensions for a given MIME type.

Returns all known file extensions that map to the specified MIME type.

**Returns:** A vector of file extensions (without leading dot) for the MIME type.

**Signature:**

```c
const char** kreuzberg_get_extensions_for_mime(const char* mime_type);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `mime_type` | `const char*` | Yes | The MIME type to look up |

**Returns:** `const char**`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_clear_embedding_backends()

Clear all embedding backends from the global registry.

Calls `shutdown()` on every registered backend, then empties the registry.

**Errors:**

- Any error returned by a backend's `shutdown()` method. The first error encountered stops processing of remaining backends.

**Signature:**

```c
void kreuzberg_clear_embedding_backends();
```

**Returns:** `void`

---

#### kreuzberg_list_embedding_backends()

List the names of all registered embedding backends.

Used by `kreuzberg-cli`, the api/mcp endpoints, and generated language bindings.

**Signature:**

```c
const char** kreuzberg_list_embedding_backends();
```

**Returns:** `const char**`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_list_document_extractors()

List names of all registered document extractors.

**Signature:**

```c
const char** kreuzberg_list_document_extractors();
```

**Returns:** `const char**`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_clear_document_extractors()

Clear all document extractors from the global registry.

Calls `shutdown()` on every registered extractor, then empties the registry.

**Errors:**

- Any error returned by an extractor's `shutdown()` method. The first error encountered stops processing of remaining extractors.

**Signature:**

```c
void kreuzberg_clear_document_extractors();
```

**Returns:** `void`

---

#### kreuzberg_list_ocr_backends()

List all registered OCR backends.

Returns the names of all OCR backends currently registered in the global registry.

**Returns:** A vector of OCR backend names.

**Signature:**

```c
const char** kreuzberg_list_ocr_backends();
```

**Returns:** `const char**`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_clear_ocr_backends()

Clear all OCR backends from the global registry.

Removes all OCR backends and calls their `shutdown()` methods.

**Returns:**

- `Ok(())` if all backends were cleared successfully
- `Err(...)` if any shutdown method failed

**Signature:**

```c
void kreuzberg_clear_ocr_backends();
```

**Returns:** `void`

---

#### kreuzberg_list_post_processors()

List all registered post-processor names.

Returns a vector of all post-processor names currently registered in the global registry.

**Returns:**

- `Ok(Vec)` - Vector of post-processor names
- `Err(...)` if the registry lock is poisoned

**Signature:**

```c
const char** kreuzberg_list_post_processors();
```

**Returns:** `const char**`

**Errors:** Returns `NULL` on error.

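Since the `kreuzberg_list_*` functions return `NULL` on error, callers should check the pointer before walking the array. The sketch below shows one defensive pattern. It rests on an assumption that this reference does not confirm: that the returned array ends with a `NULL` entry. The helper `count_names` is hypothetical and not part of the kreuzberg API.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the kreuzberg API.
 * ASSUMPTION: the array is terminated by a NULL entry; the reference
 * does not state how the number of names is reported. */
size_t count_names(const char **names)
{
    size_t n = 0;

    if (names == NULL) {
        return 0; /* the listing call reported an error */
    }
    while (names[n] != NULL) {
        n++;
    }
    return n;
}
```

A caller would pass in the result of, for example, `kreuzberg_list_post_processors()`. If the library reports the length some other way, only the loop condition needs to change; the `NULL` check stays the same.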
---

#### kreuzberg_clear_post_processors()

Remove all registered post-processors.

**Signature:**

```c
void kreuzberg_clear_post_processors();
```

**Returns:** `void`

---

#### kreuzberg_list_renderers()

List names of all registered renderers.

**Errors:** Returns an error if the registry lock is poisoned.

**Signature:**

```c
const char** kreuzberg_list_renderers();
```

**Returns:** `const char**`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_clear_renderers()

Clear all renderers from the global registry.

Removes every renderer, including the built-in defaults (markdown, html, djot, plain). After calling this, no renderers are registered; re-register as needed.

**Errors:** Returns an error if the registry lock is poisoned.

**Signature:**

```c
void kreuzberg_clear_renderers();
```

**Returns:** `void`

---

#### kreuzberg_list_validators()

List names of all registered validators.

**Signature:**

```c
const char** kreuzberg_list_validators();
```

**Returns:** `const char**`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_clear_validators()

Remove all registered validators.

**Signature:**

```c
void kreuzberg_clear_validators();
```

**Returns:** `void`

---

#### kreuzberg_compare()

Compare two extraction results and return a structured diff.

The comparison is purely structural: no I/O, no side effects. All fields of `ExtractionDiff` are populated according to the provided `DiffOptions`.

**Signature:**

```c
KreuzbergExtractionDiff* kreuzberg_compare(KreuzbergExtractionResult a, KreuzbergExtractionResult b, KreuzbergDiffOptions opts);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `a` | `KreuzbergExtractionResult` | Yes | The first extraction result to compare |
| `b` | `KreuzbergExtractionResult` | Yes | The second extraction result to compare |
| `opts` | `KreuzbergDiffOptions` | Yes | The options to use |

**Returns:** `KreuzbergExtractionDiff*`

---

#### kreuzberg_embed_texts_async()

Generate embeddings asynchronously for a list of text strings.

This is the async counterpart to `embed_texts`. It offloads the blocking ONNX inference work to a dedicated blocking thread pool via Tokio's `spawn_blocking`, keeping the async executor free.

Returns one embedding vector per input text in the same order.

**Errors:**

- `KreuzbergError.MissingDependency` if ONNX Runtime is not installed
- `KreuzbergError.Embedding` if the preset name is unknown, model download fails, or the blocking inference task panics

**Signature:**

```c
float** kreuzberg_embed_texts_async(const char** texts, KreuzbergEmbeddingConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `texts` | `const char**` | Yes | Array of strings to embed (owned, sent to blocking thread) |
| `config` | `KreuzbergEmbeddingConfig` | Yes | Embedding configuration specifying model, batch size, and normalization |

**Returns:** `float**`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_render_pdf_page_to_png()

Render a single PDF page to PNG bytes.

Returns raw PNG-encoded bytes for the specified page at the given DPI. Uses pdf_oxide with tiny-skia for pure-Rust rendering.

**Errors:** Returns `KreuzbergError.Parsing` if the PDF cannot be opened, authenticated, or rendered, or if `page_index` is out of range.

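To estimate how large the rendered PNG will be for a given `dpi`, a rough sketch follows. It assumes the renderer scales the page linearly from PDF points (1 point = 1/72 inch), which is the standard PDF unit but is not confirmed by this reference for this function. The helper `points_to_pixels` is hypothetical and not part of the kreuzberg API.

```c
/* Hypothetical helper, NOT part of the kreuzberg API.
 * ASSUMPTION: output size scales linearly with DPI from PDF points
 * (1 pt = 1/72 inch). Rounds to the nearest whole pixel. */
int points_to_pixels(double points, int dpi)
{
    return (int)(points * dpi / 72.0 + 0.5);
}
```

Under that assumption, a US Letter page (612 x 792 points) rendered at the default 150 DPI would come out at about 1275 x 1650 pixels.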
**Signature:**

```c
const uint8_t* kreuzberg_render_pdf_page_to_png(const uint8_t* pdf_bytes, uintptr_t page_index, int32_t dpi, const char* password);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `pdf_bytes` | `const uint8_t*` | Yes | Raw PDF file bytes |
| `page_index` | `uintptr_t` | Yes | Zero-based page index |
| `dpi` | `int32_t*` | No | Resolution in dots per inch (default: 150) |
| `password` | `const char*` | No | Optional password for encrypted PDFs |

**Returns:** `const uint8_t*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_detect_mime_type()

Detect the MIME type of a file at the given path.

Uses the file extension and optionally the file content to determine the MIME type. Set `check_exists` to `true` to verify the file exists before detection.

**Signature:**

```c
const char* kreuzberg_detect_mime_type(const char* path, bool check_exists);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `path` | `const char*` | Yes | Path to the file |
| `check_exists` | `bool` | Yes | Whether to verify that the file exists before detection |

**Returns:** `const char*`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_embed_texts()

Embed a list of texts using the configured embedding model.

Returns a 2D vector where each inner vector is the embedding for the corresponding text.

**Signature:**

```c
float** kreuzberg_embed_texts(const char** texts, KreuzbergEmbeddingConfig config);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `texts` | `const char**` | Yes | The texts |
| `config` | `KreuzbergEmbeddingConfig` | Yes | The configuration options |

**Returns:** `float**`

**Errors:** Returns `NULL` on error.

---

#### kreuzberg_get_embedding_preset()

Get an embedding preset by name.

Returns `NULL` if no preset with the given name exists. Returns an owned clone so the value is safe to pass across FFI boundaries.

**Signature:**

```c
KreuzbergEmbeddingPreset* kreuzberg_get_embedding_preset(const char* name);
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `name` | `const char*` | Yes | The name |

**Returns:** `KreuzbergEmbeddingPreset*`

---

#### kreuzberg_list_embedding_presets()

List the names of all available embedding presets.

Returns owned `String`s so the values are safe to pass across FFI boundaries.

**Signature:**

```c
const char** kreuzberg_list_embedding_presets();
```

**Returns:** `const char**`

---

### Types

#### KreuzbergAccelerationConfig

Hardware acceleration configuration for ONNX Runtime models.

Controls which execution provider (CPU, CoreML, CUDA, TensorRT) is used for inference in layout detection and embedding generation.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `provider` | `KreuzbergExecutionProviderType` | `KREUZBERG_KREUZBERG_AUTO` | Execution provider to use for ONNX inference. |
| `device_id` | `uint32_t` | — | GPU device ID (for CUDA/TensorRT). Ignored for CPU/CoreML/Auto. |

---

#### KreuzbergArchiveEntry

A single file extracted from an archive.

When archives (ZIP, TAR, 7Z, GZIP) are extracted with recursive extraction enabled, each processable file produces its own full `ExtractionResult`.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `path` | `const char*` | — | Archive-relative file path (e.g. "folder/document.pdf"). |
| `mime_type` | `const char*` | — | Detected MIME type of the file. |
| `result` | `KreuzbergExtractionResult` | — | Full extraction result for this file. |

---

#### KreuzbergArchiveMetadata

Archive (ZIP/TAR/7Z) metadata.

Extracted from compressed archive files containing file lists and size information.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `format` | `const char*` | — | Archive format ("ZIP", "TAR", "7Z", etc.) |
| `file_count` | `uint32_t` | — | Total number of files in the archive |
| `file_list` | `const char**` | `NULL` | List of file paths within the archive |
| `total_size` | `uint64_t` | — | Total uncompressed size in bytes |
| `compressed_size` | `uint64_t*` | `NULL` | Compressed size in bytes (if available) |

---

#### KreuzbergBBox

Bounding box in original image coordinates: (x1, y1) is the top-left corner, (x2, y2) the bottom-right.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `x1` | `float` | — | X1 |
| `y1` | `float` | — | Y1 |
| `x2` | `float` | — | X2 |
| `y2` | `float` | — | Y2 |

---

#### KreuzbergBatchBytesItem

Batch item for byte array extraction.

Used with `batch_extract_bytes` and `batch_extract_bytes_sync` to represent a single item in a batch extraction job.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `const uint8_t*` | — | The content bytes to extract from |
| `mime_type` | `const char*` | — | MIME type of the content (e.g., "application/pdf", "text/html") |
| `config` | `KreuzbergFileExtractionConfig*` | `NULL` | Per-item configuration overrides (`NULL` uses batch-level defaults) |

---

#### KreuzbergBatchFileItem

Batch item for file extraction.

Used with `batch_extract_files` and `batch_extract_files_sync` to represent a single file in a batch extraction job.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `path` | `const char*` | — | Path to the file to extract from |
| `config` | `KreuzbergFileExtractionConfig*` | `NULL` | Per-file configuration overrides (`NULL` uses batch-level defaults) |

---

#### KreuzbergBibtexMetadata

BibTeX bibliography metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `entry_count` | `uintptr_t` | — | Number of entries in the bibliography. |
| `citation_keys` | `const char**` | `NULL` | Citation keys |
| `authors` | `const char**` | `NULL` | Authors |
| `year_range` | `KreuzbergYearRange*` | `NULL` | Year range |
| `entry_types` | `void**` | `NULL` | Entry types |

---

#### KreuzbergBoundingBox

Bounding box coordinates for element positioning.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `x0` | `double` | — | Left x-coordinate |
| `y0` | `double` | — | Bottom y-coordinate |
| `x1` | `double` | — | Right x-coordinate |
| `y1` | `double` | — | Top y-coordinate |

---

#### KreuzbergCacheStats

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `total_files` | `uintptr_t` | — | Total number of files |
| `total_size_mb` | `double` | — | Total size (MB) |
| `available_space_mb` | `double` | — | Available space (MB) |
| `oldest_file_age_days` | `double` | — | Age of the oldest file (days) |
| `newest_file_age_days` | `double` | — | Age of the newest file (days) |

---

#### KreuzbergCellChange

A single changed cell within a table.

Defined here (rather than only in `crate.diff`) so `RevisionDelta` can reference it unconditionally, without requiring the `diff` Cargo feature. `crate.diff` re-exports this type verbatim.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `row` | `uintptr_t` | — | Zero-based row index. |
| `col` | `uintptr_t` | — | Zero-based column index. |
| `from` | `const char*` | — | Value before the change. |
| `to` | `const char*` | — | Value after the change. |

---

#### KreuzbergChunk

A text chunk with optional embedding and metadata.

Chunks are created when chunking is enabled in `ExtractionConfig`. Each chunk contains the text content, optional embedding vector (if embedding generation is configured), and metadata about its position in the document.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `const char*` | — | The text content of this chunk. |
| `chunk_type` | `KreuzbergChunkType` | `/* serde(default) */` | Semantic structural classification of this chunk. Assigned by the heuristic classifier based on content patterns and heading context. Defaults to `ChunkType.Unknown` when no rule matches. |
| `embedding` | `float**` | `NULL` | Optional embedding vector for this chunk. Only populated when `EmbeddingConfig` is provided in chunking configuration. The dimensionality depends on the chosen embedding model. |
| `metadata` | `KreuzbergChunkMetadata` | — | Metadata about this chunk's position and properties. |

---

#### KreuzbergChunkMetadata

Metadata about a chunk's position in the original document.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `byte_start` | `uintptr_t` | — | Byte offset where this chunk starts in the original text (UTF-8 valid boundary). |
| `byte_end` | `uintptr_t` | — | Byte offset where this chunk ends in the original text (UTF-8 valid boundary). |
| `token_count` | `uintptr_t*` | `NULL` | Number of tokens in this chunk (if available). This is calculated by the embedding model's tokenizer if embeddings are enabled. |
| `chunk_index` | `uintptr_t` | — | Zero-based index of this chunk in the document. |
| `total_chunks` | `uintptr_t` | — | Total number of chunks in the document. |
| `first_page` | `uint32_t*` | `NULL` | First page number this chunk spans (1-indexed). Only populated when page tracking is enabled in extraction configuration. |
| `last_page` | `uint32_t*` | `NULL` | Last page number this chunk spans (1-indexed, equal to first_page for single-page chunks). Only populated when page tracking is enabled in extraction configuration. |
| `heading_context` | `KreuzbergHeadingContext*` | `/* serde(default) */` | Heading context when using Markdown chunker. Contains the heading hierarchy this chunk falls under. Only populated when `ChunkerType.Markdown` is used. |
| `image_indices` | `uint32_t*` | `/* serde(default) */` | Indices into `ExtractionResult.images` for images on pages covered by this chunk. Contains zero-based indices into the top-level `images` collection for every image whose `page_number` falls within `[first_page, last_page]`. Empty when image extraction is disabled or the chunk spans no pages with images. |

---

#### KreuzbergChunkingConfig

Chunking configuration.

Configures text chunking for document content, including chunk size, overlap, trimming behavior, and optional embeddings.

Start from the default constructor when constructing a configuration, to allow for future field additions.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `max_characters` | `uintptr_t` | `1000` | Maximum size per chunk (in units determined by `sizing`). When `sizing` is `Characters` (default), this is the max character count. When using token-based sizing, this is the max token count. Default: 1000 |
| `overlap` | `uintptr_t` | `200` | Overlap between chunks (in units determined by `sizing`). Default: 200 |
| `trim` | `bool` | `true` | Whether to trim whitespace from chunk boundaries. Default: true |
| `chunker_type` | `KreuzbergChunkerType` | `KREUZBERG_KREUZBERG_TEXT` | Type of chunker to use (Text or Markdown). Default: Text |
| `embedding` | `KreuzbergEmbeddingConfig*` | `NULL` | Optional embedding configuration for chunk embeddings. |
| `preset` | `const char**` | `NULL` | Use a preset configuration (overrides individual settings if provided). |
| `sizing` | `KreuzbergChunkSizing` | `KREUZBERG_KREUZBERG_CHARACTERS` | How to measure chunk size. Default: `Characters` (Unicode character count). Enable `chunking-tiktoken` or `chunking-tokenizers` features for token-based sizing. |
| `prepend_heading_context` | `bool` | `false` | When `true` and `chunker_type` is `Markdown`, prepend the heading hierarchy path (e.g. `"# Title > ## Section\n\n"`) to each chunk's content string. This is useful for RAG pipelines where each chunk needs self-contained context about its position in the document structure. Default: `false` |
| `topic_threshold` | `float*` | `NULL` | Optional cosine similarity threshold for semantic topic boundary detection. Only used when `chunker_type` is `Semantic` and an `EmbeddingConfig` is provided. You almost never need to set this. When omitted, defaults to `0.75`, which works well for most documents. Lower values detect more topic boundaries (more, smaller chunks); higher values detect fewer. Range: `0.0..=1.0`. |

### Methods

#### kreuzberg_default()

**Signature:**

```c
KreuzbergChunkingConfig kreuzberg_default();
```

---

#### KreuzbergCitationMetadata

Citation file metadata (RIS, PubMed, EndNote).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `citation_count` | `uintptr_t` | — | Number of citations |
| `format` | `const char**` | `NULL` | Format |
| `authors` | `const char**` | `NULL` | Authors |
| `year_range` | `KreuzbergYearRange*` | `NULL` | Year range |
| `dois` | `const char**` | `NULL` | DOIs |
| `keywords` | `const char**` | `NULL` | Keywords |

---

#### KreuzbergContentFilterConfig

Cross-extractor content filtering configuration.

Controls whether "furniture" content (headers, footers, page numbers, watermarks, repeating text) is included in or stripped from extraction results. Applies across all extractors (PDF, DOCX, RTF, ODT, HTML, etc.) with format-specific implementation.

When `NULL` on `ExtractionConfig`, each extractor uses its current default behavior unchanged.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `include_headers` | `bool` | `false` | Include running headers in extraction output. - PDF: Disables top-margin furniture stripping and prevents the layout model from treating `PageHeader`-classified regions as furniture. - DOCX: Includes document headers in text output.
- RTF/ODT: Headers already included; this is a no-op when true. - HTML/EPUB: Keeps `
` element content. Default: `false` (headers are stripped or excluded). | | `include_footers` | `bool` | `false` | Include running footers in extraction output. - PDF: Disables bottom-margin furniture stripping and prevents the layout model from treating `PageFooter`-classified regions as furniture. - DOCX: Includes document footers in text output. - RTF/ODT: Footers already included; this is a no-op when true. - HTML/EPUB: Keeps `