# Kreuzberg Rust API Reference

Complete API reference for the Kreuzberg document extraction library in Rust.

## Setup

Add to your `Cargo.toml`:

```toml
[dependencies]
kreuzberg = { version = "4", features = [
    "tokio-runtime",
    "pdf",
    "ocr",
    "chunking",
    "embeddings",
    "language-detection",
    "keywords-yake",
    "keywords-rake",
    "api",
    "mcp"
] }
tokio = { version = "1", features = ["full"] }
```

### Core Features

- **tokio-runtime**: Enables async/sync extraction (default). Required for `extract_file_sync`, `batch_extract_file`, `batch_extract_bytes`, and `batch_extract_file_sync`
- **pdf**: PDF extraction with PDFium
- **ocr**: Tesseract-based OCR for scanned documents
- **chunking**: Text chunking for RAG pipelines
- **embeddings**: Vector embeddings generation
- **language-detection**: Detect document language
- **keywords-yake** / **keywords-rake**: Extract keywords using YAKE or RAKE
- **api**: HTTP API with Axum
- **mcp**: Model Context Protocol support

---

## Core Extraction Functions

### `extract_file` (async)

Extract content from a file path.

```rust
pub async fn extract_file(
    path: impl AsRef<Path>,
    mime_type: Option<&str>,
    config: &ExtractionConfig,
) -> Result<ExtractionResult>
```

**Always available.** Requires async context (`#[tokio::main]`, `tokio::spawn`, etc.).

```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("Content: {}", result.content);
    Ok(())
}
```

### `extract_bytes` (async)

Extract content from byte data.

```rust
pub async fn extract_bytes(
    data: &[u8],
    mime_type: &str,
    config: &ExtractionConfig,
) -> Result<ExtractionResult>
```

**Always available.** Requires async context.

```rust
use kreuzberg::{extract_bytes, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let pdf_bytes = std::fs::read("document.pdf")?;
    let result = extract_bytes(&pdf_bytes, "application/pdf", &config).await?;
    Ok(())
}
```

### `extract_file_sync` (sync)

Synchronous wrapper around `extract_file`.

```rust
pub fn extract_file_sync(
    path: impl AsRef<Path>,
    mime_type: Option<&str>,
    config: &ExtractionConfig,
) -> Result<ExtractionResult>
```

**Requires tokio-runtime feature.** Blocks the current thread using a global Tokio runtime.

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("Content: {}", result.content);
    Ok(())
}
```

### `extract_bytes_sync` (sync)

Synchronous wrapper around `extract_bytes`.

```rust
pub fn extract_bytes_sync(
    content: &[u8],
    mime_type: &str,
    config: &ExtractionConfig,
) -> Result<ExtractionResult>
```

**Always available.** Works in sync and async contexts.

```rust
use kreuzberg::{extract_bytes_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let bytes = b"Hello, world!";
    let result = extract_bytes_sync(bytes, "text/plain", &config)?;
    Ok(())
}
```

### `batch_extract_file` (async, parallel)

Extract multiple files concurrently.

```rust
pub async fn batch_extract_file(
    paths: Vec<impl AsRef<Path>>,
    config: &ExtractionConfig,
) -> Result<Vec<ExtractionResult>>
```

**Requires tokio-runtime feature.** Processes files in parallel with automatic concurrency management (defaults to `num_cpus * 1.5`).

```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let paths = vec!["doc1.pdf", "doc2.pdf", "doc3.pdf"];
    let results = batch_extract_file(paths, &config).await?;
    println!("Processed {} files", results.len());
    Ok(())
}
```
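
The default limit of `num_cpus * 1.5` mentioned above can be approximated with the standard library. This is a minimal sketch, not Kreuzberg's own code: the helper names `concurrency_for` and `default_max_concurrency`, the use of `std::thread::available_parallelism`, and the round-up behavior are all assumptions.

```rust
use std::thread;

/// Hypothetical helper: approximates a `num_cpus * 1.5` limit.
/// Rounding up and the floor of 1 are assumptions of this sketch.
fn concurrency_for(cpus: usize) -> usize {
    // Integer form of ceil(cpus * 1.5), never less than 1.
    ((cpus * 3 + 1) / 2).max(1)
}

/// Hypothetical helper: derives the limit from the CPUs visible to this process.
fn default_max_concurrency() -> usize {
    // Fall back to a single CPU if the platform cannot report parallelism.
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    concurrency_for(cpus)
}
```

A value computed this way could be passed as `max_concurrent_extractions: Some(limit)` in `ExtractionConfig` to make the limit explicit instead of relying on the default.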

### `batch_extract_bytes` (async, parallel)

Extract multiple byte arrays concurrently.

```rust
pub async fn batch_extract_bytes(
    contents: Vec<(Vec<u8>, String)>,
    config: &ExtractionConfig,
) -> Result<Vec<ExtractionResult>>
```

**Requires tokio-runtime feature.** Each tuple is `(bytes, mime_type)`.

```rust
use kreuzberg::{batch_extract_bytes, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let contents = vec![
        (b"PDF content".to_vec(), "application/pdf".to_string()),
        (b"Text content".to_vec(), "text/plain".to_string()),
    ];
    let results = batch_extract_bytes(contents, &config).await?;
    Ok(())
}
```

### `batch_extract_file_sync` (sync, parallel)

Synchronous wrapper for batch file extraction.

```rust
pub fn batch_extract_file_sync(
    paths: Vec<impl AsRef<Path>>,
    config: &ExtractionConfig,
) -> Result<Vec<ExtractionResult>>
```

**Requires tokio-runtime feature.** Uses global runtime for concurrency.

```rust
use kreuzberg::{batch_extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let paths = vec!["doc1.pdf", "doc2.pdf"];
    let results = batch_extract_file_sync(paths, &config)?;
    Ok(())
}
```

### `batch_extract_bytes_sync` (sync, parallel)

Synchronous wrapper for batch byte extraction.

```rust
pub fn batch_extract_bytes_sync(
    contents: Vec<(Vec<u8>, String)>,
    config: &ExtractionConfig,
) -> Result<Vec<ExtractionResult>>
```

**Always available.** Each tuple is `(bytes, mime_type)`.

```rust
use kreuzberg::{batch_extract_bytes_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let contents = vec![
        (b"content 1".to_vec(), "text/plain".to_string()),
        (b"content 2".to_vec(), "text/plain".to_string()),
    ];
    let results = batch_extract_bytes_sync(contents, &config)?;
    Ok(())
}
```

### `FileExtractionConfig`

Per-file overrides for batch operations, passed as an optional parameter to `batch_extract_file` / `batch_extract_bytes` (and their sync variants). Every field is an `Option<T>`; `None` means the batch-level default applies.

> **Note (v4.5.0):** The separate `batch_extract_file_with_configs` / `batch_extract_bytes_with_configs` functions have been removed. Per-file configs are now an optional parameter on the unified batch functions.

```rust
pub struct FileExtractionConfig {
    pub enable_quality_processing: Option<bool>,
    pub ocr: Option<OcrConfig>,
    pub force_ocr: Option<bool>,
    pub chunking: Option<ChunkingConfig>,
    pub images: Option<ImageExtractionConfig>,
    pub pdf_options: Option<PdfConfig>,
    pub token_reduction: Option<TokenReductionConfig>,
    pub language_detection: Option<LanguageDetectionConfig>,
    pub pages: Option<PageConfig>,
    pub postprocessor: Option<PostProcessorConfig>,
    pub output_format: Option<OutputFormat>,
    pub include_document_structure: Option<bool>,
}
```

Batch-level fields that cannot be overridden per file: `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`.
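
The "`None` falls back to the batch default" rule can be pictured as a field-by-field merge. The sketch below uses stand-in types (`BatchDefaults`, `FileOverrides`) and a hypothetical `resolve` function that mirror only two fields; it illustrates the idea and makes no claim about how Kreuzberg merges configs internally.

```rust
/// Stand-in for the batch-level config; only two fields are mirrored here.
#[derive(Debug, Clone, PartialEq)]
struct BatchDefaults {
    force_ocr: bool,
    enable_quality_processing: bool,
}

/// Stand-in for `FileExtractionConfig`: every field is optional.
#[derive(Debug, Clone, Default)]
struct FileOverrides {
    force_ocr: Option<bool>,
    enable_quality_processing: Option<bool>,
}

/// Hypothetical merge: a `Some` override wins, `None` keeps the batch default.
fn resolve(defaults: &BatchDefaults, overrides: &FileOverrides) -> BatchDefaults {
    BatchDefaults {
        force_ocr: overrides.force_ocr.unwrap_or(defaults.force_ocr),
        enable_quality_processing: overrides
            .enable_quality_processing
            .unwrap_or(defaults.enable_quality_processing),
    }
}
```

With this shape, a file that only needs forced OCR sets `force_ocr: Some(true)` and leaves everything else as `None`.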

---

## Configuration

### `ExtractionConfig`

Main configuration struct for all extraction operations.

```rust
pub struct ExtractionConfig {
    /// Enable caching (default: true)
    pub use_cache: bool,

    /// Enable quality post-processing (default: true)
    pub enable_quality_processing: bool,

    /// OCR configuration (None = OCR disabled)
    pub ocr: Option<OcrConfig>,

    /// Force OCR even for searchable PDFs (default: false)
    pub force_ocr: bool,

    /// Text chunking configuration (None = disabled)
    pub chunking: Option<ChunkingConfig>,

    /// Image extraction configuration (None = disabled)
    pub images: Option<ImageExtractionConfig>,

    /// PDF-specific options (requires pdf feature)
    #[cfg(feature = "pdf")]
    pub pdf_options: Option<PdfConfig>,

    /// Token reduction configuration (None = disabled)
    pub token_reduction: Option<TokenReductionConfig>,

    /// Language detection configuration (None = disabled)
    pub language_detection: Option<LanguageDetectionConfig>,

    /// Page extraction configuration (None = disabled)
    pub pages: Option<PageConfig>,

    /// Keyword extraction configuration (requires keywords-yake or keywords-rake)
    #[cfg(any(feature = "keywords-yake", feature = "keywords-rake"))]
    pub keywords: Option<KeywordConfig>,

    /// Post-processor configuration (None = use defaults)
    pub postprocessor: Option<PostProcessorConfig>,

    /// HTML to Markdown conversion options (requires html feature)
    #[cfg(feature = "html")]
    pub html_options: Option<ConversionOptions>,

    /// Maximum concurrent extractions in batch (None = num_cpus * 1.5)
    pub max_concurrent_extractions: Option<usize>,

    /// Result structure format (default: Unified)
    /// Uses types::OutputFormat (Unified | ElementBased)
    pub result_format: types::OutputFormat,

    /// Security limits for archives (requires archives feature)
    #[cfg(feature = "archives")]
    pub security_limits: Option<SecurityLimits>,

    /// Content output format (default: Plain)
    /// Uses config::OutputFormat (Plain | Markdown | Djot | Html)
    pub output_format: OutputFormat,
}
```

#### Creating Configs

```rust
use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

// Default configuration
let config = ExtractionConfig::default();

// With OCR
let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".to_string(),
        ..Default::default()
    }),
    ..Default::default()
};

// With chunking
let config = ExtractionConfig {
    chunking: Some(ChunkingConfig {
        max_characters: 512,
        overlap: 50,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};
```

---

## Output Formats

There are two separate enums both named `OutputFormat` in different modules:

### Content `OutputFormat` (`core::config::formats::OutputFormat`)

Controls the format of the `content` field text. Used by `ExtractionConfig::output_format`.

```rust
pub enum OutputFormat {
    /// Plain text (default)
    Plain,
    /// Markdown formatted
    Markdown,
    /// Djot markup format
    Djot,
    /// HTML format
    Html,
}
```

### Result `OutputFormat` (`types::extraction::OutputFormat`)

Controls the result structure. Used by `ExtractionConfig::result_format`.

```rust
pub enum OutputFormat {
    /// Unified format with all content in `content` field (default)
    Unified,
    /// Element-based format with semantic element extraction
    ElementBased,
}
```

```rust
use kreuzberg::{ExtractionConfig, OutputFormat};

let config = ExtractionConfig {
    output_format: OutputFormat::Markdown,  // content format (Plain/Markdown/Djot/Html)
    // result_format uses types::OutputFormat (Unified/ElementBased) — defaults to Unified
    ..Default::default()
};
```
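
When both enums are needed in the same scope, import aliases keep them apart. The sketch below uses local stand-in modules (`config`, `types`) that mirror the two enums so it compiles on its own; the alias names `ContentFormat` and `ResultFormat` and the helper `pick_formats` are hypothetical. With the real crate, the same `use ... as ...` form would apply to whichever paths your Kreuzberg version exports.

```rust
// Stand-in modules mirroring the two enums described above.
#[allow(dead_code)]
mod config {
    #[derive(Debug, Clone, Copy, PartialEq, Default)]
    pub enum OutputFormat {
        #[default]
        Plain,
        Markdown,
        Djot,
        Html,
    }
}

#[allow(dead_code)]
mod types {
    #[derive(Debug, Clone, Copy, PartialEq, Default)]
    pub enum OutputFormat {
        #[default]
        Unified,
        ElementBased,
    }
}

// Aliases remove the name clash at the import site.
use self::config::OutputFormat as ContentFormat;
use self::types::OutputFormat as ResultFormat;

/// Hypothetical helper: picks a content format and a result format together.
fn pick_formats(want_markdown: bool, want_elements: bool) -> (ContentFormat, ResultFormat) {
    let content = if want_markdown {
        ContentFormat::Markdown
    } else {
        ContentFormat::default()
    };
    let result = if want_elements {
        ResultFormat::ElementBased
    } else {
        ResultFormat::default()
    };
    (content, result)
}
```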

---

## Extraction Result

### `ExtractionResult`

Result returned by all extraction functions.

```rust
pub struct ExtractionResult {
    /// Main extracted content
    pub content: String,

    /// Document MIME type
    pub mime_type: Cow<'static, str>,

    /// Metadata about extraction
    pub metadata: Metadata,

    /// Extracted tables (HTML/Markdown)
    pub tables: Vec<Table>,

    /// Detected languages (if language-detection enabled)
    pub detected_languages: Option<Vec<String>>,

    /// Text chunks (if chunking enabled)
    pub chunks: Option<Vec<Chunk>>,

    /// Extracted images (if image extraction enabled)
    pub images: Option<Vec<ExtractedImage>>,

    /// Per-page content (if page extraction enabled)
    pub pages: Option<Vec<PageContent>>,

    /// Semantic elements (if element-based format enabled)
    pub elements: Option<Vec<Element>>,

    /// Djot document structure (if extracting Djot)
    pub djot_content: Option<DjotContent>,

    /// Extracted keywords with relevance scores (if keyword extraction enabled)
    pub extracted_keywords: Option<Vec<ExtractedKeyword>>,

    /// Quality score for extraction result (0.0-1.0)
    pub quality_score: Option<f64>,

    /// Non-fatal warnings during processing pipeline
    pub processing_warnings: Vec<ProcessingWarning>,
}
```

### `ExtractedKeyword`

Extracted keyword with relevance score and position information.

```rust
pub struct ExtractedKeyword {
    /// Keyword text
    pub text: String,

    /// Relevance score (0.0-1.0)
    pub score: f32,

    /// Algorithm used for extraction ("tfidf", "textrank", "yake", etc.)
    pub algorithm: String,

    /// Character positions in content (if available)
    pub positions: Option<Vec<usize>>,
}
```
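
Because `score` is an `f32`, keywords cannot be sorted with a plain `sort()`. The sketch below ranks a stand-in `Keyword` struct (mirroring only `text` and `score`) with `f32::total_cmp`. The helper `top_keywords` is hypothetical, and it assumes a higher score means a more relevant keyword.

```rust
/// Stand-in for `ExtractedKeyword`, mirroring only the fields used here.
#[derive(Debug, Clone)]
struct Keyword {
    text: String,
    score: f32,
}

/// Hypothetical helper: returns the `n` highest-scoring keywords.
fn top_keywords(mut keywords: Vec<Keyword>, n: usize) -> Vec<Keyword> {
    // f32 is not `Ord`; `total_cmp` provides a total order, so NaN cannot panic the sort.
    keywords.sort_by(|a, b| b.score.total_cmp(&a.score));
    keywords.truncate(n);
    keywords
}
```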

### `ProcessingWarning`

Non-fatal warning encountered during document processing.

```rust
pub struct ProcessingWarning {
    /// Component that generated the warning
    pub source: String,

    /// Warning message describing the issue
    pub message: String,
}
```

### `Chunk`

Text chunk with optional embedding.

```rust
pub struct Chunk {
    /// Chunk text content
    pub content: String,

    /// Optional embedding vector
    pub embedding: Option<Vec<f32>>,

    /// Chunk metadata
    pub metadata: ChunkMetadata,
}

pub struct ChunkMetadata {
    pub byte_start: usize,
    pub byte_end: usize,
    pub token_count: Option<usize>,
    pub chunk_index: usize,
    pub total_chunks: usize,
    pub first_page: Option<usize>,
    pub last_page: Option<usize>,
}
```
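
The `byte_start` / `byte_end` pair can be used to map a chunk back onto the text it came from. The sketch below assumes the offsets are byte positions into the full extracted content and that `byte_end` is exclusive; neither is confirmed by this reference. `ChunkSpan` and `chunk_text` are hypothetical stand-ins.

```rust
/// Stand-in for the offset fields of `ChunkMetadata`.
struct ChunkSpan {
    byte_start: usize,
    byte_end: usize,
}

/// Hypothetical helper: returns the text covered by `span`, or `None` if the
/// range is out of bounds or does not fall on UTF-8 character boundaries.
fn chunk_text<'a>(content: &'a str, span: &ChunkSpan) -> Option<&'a str> {
    // `str::get` checks bounds and boundaries instead of panicking like `&content[a..b]`.
    content.get(span.byte_start..span.byte_end)
}
```

Using `get` rather than direct indexing matters for non-ASCII documents, where a byte offset can land in the middle of a multi-byte character.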

### `ExtractedImage`

Image extracted from document.

```rust
pub struct ExtractedImage {
    /// Raw image bytes
    pub data: Bytes,

    /// Format: "jpeg", "png", "webp", etc.
    pub format: Cow<'static, str>,

    /// Zero-indexed position
    pub image_index: usize,

    /// Page number (1-indexed)
    pub page_number: Option<usize>,

    /// Image dimensions
    pub width: Option<u32>,
    pub height: Option<u32>,

    /// Colorspace: "RGB", "CMYK", "Gray"
    pub colorspace: Option<String>,

    /// Bits per component
    pub bits_per_component: Option<u32>,

    /// Whether this is a mask image
    pub is_mask: bool,

    /// Image description
    pub description: Option<String>,

    /// Nested OCR result (if OCRed)
    pub ocr_result: Option<Box<ExtractionResult>>,
}
```

---

## Error Handling

### `KreuzbergError` enum

```rust
pub enum KreuzbergError {
    /// File system errors (always bubble up)
    Io(std::io::Error),

    /// Document parsing errors
    Parsing {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// OCR processing errors
    Ocr {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Configuration/input validation errors
    Validation {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Cache operation errors
    Cache {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Image processing errors
    ImageProcessing {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Serialization errors (JSON, MessagePack)
    Serialization {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Missing system dependency (e.g. Tesseract)
    MissingDependency(String),

    /// Plugin-specific errors
    Plugin {
        message: String,
        plugin_name: String,
    },

    /// Mutex/RwLock poisoning
    LockPoisoned(String),

    /// Unsupported MIME type or format
    UnsupportedFormat(String),

    /// Other errors
    Other(String),
}
```

#### Error Constructors

```rust
use kreuzberg::KreuzbergError;

// Create errors
let err = KreuzbergError::parsing("invalid PDF");
let err = KreuzbergError::ocr("Tesseract failed");
let err = KreuzbergError::validation("config invalid");
let err = KreuzbergError::unsupported_format("application/unknown");
let err = KreuzbergError::missing_dependency("tesseract");

// With source
let source = std::io::Error::new(std::io::ErrorKind::NotFound, "file missing");
let err = KreuzbergError::parsing_with_source("corrupt PDF", source);
```

#### Handling Errors

```rust
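use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

#[tokio::main]
async fn main() {
    let config = ExtractionConfig::default();

    match extract_file("doc.pdf", None, &config).await {
        Ok(result) => println!("Success: {}", result.content),
        // File system problems (missing file, permissions, ...)
        Err(KreuzbergError::Io(e)) => eprintln!("File error: {}", e),
        // No extractor is available for this MIME type or format
        Err(KreuzbergError::UnsupportedFormat(fmt)) => eprintln!("Unsupported: {}", fmt),
        Err(e) => eprintln!("Other error: {}", e),
    }
}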
```
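
Several variants carry an optional `source`. Assuming `KreuzbergError` implements `std::error::Error` and exposes that field through `source()` (not confirmed here), the full cause chain can be collected for logging. The sketch below uses a stand-in `ParsingError` type so it compiles without the crate; `error_chain` is a hypothetical helper.

```rust
use std::error::Error;
use std::fmt;

/// Stand-in for an error with an optional source, shaped like the `Parsing` variant.
#[derive(Debug)]
struct ParsingError {
    message: String,
    source: Option<Box<dyn Error + Send + Sync>>,
}

impl fmt::Display for ParsingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parsing error: {}", self.message)
    }
}

impl Error for ParsingError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        // Drop the Send + Sync bounds to match the trait's return type.
        self.source.as_deref().map(|e| e as &(dyn Error + 'static))
    }
}

/// Hypothetical helper: collects the message of `err` and of every underlying cause.
fn error_chain(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        messages.push(cause.to_string());
        current = cause.source();
    }
    messages
}
```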

---

## MIME Type Detection

### `detect_mime_type`

Detect MIME type from file path.

```rust
pub fn detect_mime_type(path: impl AsRef<Path>) -> Result<String>
```

```rust
use kreuzberg::detect_mime_type;

let mime = detect_mime_type("document.pdf")?;
assert_eq!(mime, "application/pdf");
```

### `detect_mime_type_from_bytes`

Detect MIME type from byte data.

```rust
pub fn detect_mime_type_from_bytes(data: &[u8]) -> Result<String>
```

### `validate_mime_type`

Check if a MIME type is supported.

```rust
pub fn validate_mime_type(mime_type: &str) -> Result<()>
```

```rust
use kreuzberg::validate_mime_type;

validate_mime_type("application/pdf")?;  // OK
validate_mime_type("application/unknown")?;  // Error
```

### `get_extensions_for_mime`

Get file extensions for a MIME type.

```rust
pub fn get_extensions_for_mime(mime_type: &str) -> Vec<String>
```

```rust
use kreuzberg::get_extensions_for_mime;

let exts = get_extensions_for_mime("application/pdf");
// ["pdf"]

let exts = get_extensions_for_mime("text/plain");
// ["txt", "text"]
```

### MIME Type Constants

```rust
use kreuzberg::{
    PDF_MIME_TYPE,
    PLAIN_TEXT_MIME_TYPE,
    HTML_MIME_TYPE,
    MARKDOWN_MIME_TYPE,
    JSON_MIME_TYPE,
    XML_MIME_TYPE,
    DOCX_MIME_TYPE,
    POWER_POINT_MIME_TYPE,
    EXCEL_MIME_TYPE,
};

assert_eq!(PDF_MIME_TYPE, "application/pdf");
assert_eq!(PLAIN_TEXT_MIME_TYPE, "text/plain");
```

---

## Plugin Registry

Access the registries for document extractors, OCR backends, post-processors, and validators.

### `get_document_extractor_registry`

Get all available document extractors.

```rust
pub fn get_document_extractor_registry() -> Arc<RwLock<DocumentExtractorRegistry>>
```

### `get_ocr_backend_registry`

Get all available OCR backends.

```rust
pub fn get_ocr_backend_registry() -> Arc<RwLock<OcrBackendRegistry>>
```

### `get_post_processor_registry`

Get all available post-processors.

```rust
pub fn get_post_processor_registry() -> Arc<RwLock<PostProcessorRegistry>>
```

### `get_validator_registry`

Get all available validators.

```rust
pub fn get_validator_registry() -> Arc<RwLock<ValidatorRegistry>>
```
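
Each accessor returns a shared handle of the form `Arc<RwLock<...>>`. The sketch below shows one way to read from such a handle, assuming the lock is `std::sync::RwLock` (the reference does not say which lock type is used). `DemoRegistry`, its `names` field, and `list_names` are hypothetical stand-ins, not part of the Kreuzberg API.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in registry; the real registry types expose their own methods.
#[derive(Default)]
struct DemoRegistry {
    names: Vec<String>,
}

/// Hypothetical helper: takes a read lock and copies out the registered names.
fn list_names(registry: &Arc<RwLock<DemoRegistry>>) -> Result<Vec<String>, String> {
    // A poisoned lock means a writer panicked; report it instead of unwrapping.
    let guard = registry
        .read()
        .map_err(|e| format!("lock poisoned: {}", e))?;
    Ok(guard.names.clone())
}
```

Copying the data out keeps the read guard short-lived, so writers are not blocked while the caller processes the result.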

---

## Complete Example

```rust
use kreuzberg::{
    extract_file, ExtractionConfig, OutputFormat,
    ChunkingConfig, OcrConfig, LanguageDetectionConfig,
};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    // Configure extraction
    let config = ExtractionConfig {
        output_format: OutputFormat::Markdown,
        chunking: Some(ChunkingConfig {
            max_characters: 512,
            overlap: 50,
            ..Default::default()
        }),
        language_detection: Some(LanguageDetectionConfig::default()),
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            ..Default::default()
        }),
        force_ocr: false,
        ..Default::default()
    };

    // Extract from file
    let result = extract_file("document.pdf", None, &config).await?;

    // Use results
    println!("Content:\n{}", result.content);
    println!("MIME: {}", result.mime_type);

    if let Some(langs) = result.detected_languages {
        println!("Languages: {:?}", langs);
    }

    if let Some(chunks) = result.chunks {
        println!("Chunks: {}", chunks.len());
        for chunk in chunks {
            // Take characters, not bytes: slicing at a fixed byte offset can panic on UTF-8 text
            println!("  - {}", chunk.content.chars().take(50).collect::<String>());
        }
    }

    if let Some(images) = result.images {
        println!("Images: {}", images.len());
    }

    if let Some(pages) = result.pages {
        println!("Pages: {}", pages.len());
    }

    Ok(())
}
```

---

## Result Type Alias

```rust
pub type Result<T> = std::result::Result<T, KreuzbergError>;
```

All fallible operations return `Result<T>` where errors are `KreuzbergError`.
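
The alias pattern works together with `From` conversions so that `?` can lift lower-level errors into the crate's error type, as the `std::fs::read(...)?` call in the `extract_bytes` example suggests. The sketch below reproduces the pattern with stand-in names (`DemoError`, `DemoResult`, `read_non_empty`), all of them hypothetical.

```rust
use std::io;

/// Stand-in error type with two variants shaped like `Io` and `Validation`.
#[derive(Debug)]
enum DemoError {
    Io(io::Error),
    Validation(String),
}

impl From<io::Error> for DemoError {
    fn from(err: io::Error) -> Self {
        DemoError::Io(err)
    }
}

/// Stand-in for the crate-level alias.
type DemoResult<T> = std::result::Result<T, DemoError>;

/// Hypothetical helper: reads a file and rejects empty input.
fn read_non_empty(path: &str) -> DemoResult<Vec<u8>> {
    // `?` converts io::Error into DemoError::Io through the From impl above.
    let bytes = std::fs::read(path)?;
    if bytes.is_empty() {
        return Err(DemoError::Validation(format!("{} is empty", path)));
    }
    Ok(bytes)
}
```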

---

## Feature Flags Summary

| Feature            | Availability | Dependencies                                   |
| ------------------ | ------------ | ---------------------------------------------- |
| tokio-runtime      | Default      | Tokio runtime for async/sync                   |
| pdf                | Default      | PDFium                                         |
| ocr                | Optional     | Tesseract                                      |
| chunking           | Optional     | text-splitter                                  |
| embeddings         | Optional     | FastEmbed, requires tokio-runtime              |
| language-detection | Optional     | whatlang                                       |
| keywords-yake      | Optional     | yake-rust                                      |
| keywords-rake      | Optional     | rake                                           |
| api                | Optional     | Axum, requires tokio-runtime                   |
| mcp                | Optional     | Model Context Protocol, requires tokio-runtime |

---

## Version

This reference is for Kreuzberg 4.x.
Nomad changes 2026-06-01 23:40:55 +02:00			`# Kreuzberg Rust API Reference`

			`Complete API reference for the Kreuzberg document extraction library in Rust.`

			`## Setup`

			Add to your `Cargo.toml`:

			```toml
			`[dependencies]`
			`kreuzberg = { version = "4", features = [`
			`"tokio-runtime",`
			`"pdf",`
			`"ocr",`
			`"chunking",`
			`"embeddings",`
			`"language-detection",`
			`"keywords-yake",`
			`"keywords-rake",`
			`"api",`
			`"mcp"`
			`] }`
			`tokio = { version = "1", features = ["full"] }`
			```

			`### Core Features`

			- tokio-runtime: Enables async/sync extraction (default). Required for `extract_file_sync`, `batch_extract_file_sync`, `batch_extract_file`
			`- pdf: PDF extraction with PDFium`
			`- ocr: Tesseract-based OCR for scanned documents`
			`- chunking: Text chunking for RAG pipelines`
			`- embeddings: Vector embeddings generation`
			`- language-detection: Detect document language`
			`- keywords-yake / keywords-rake: Extract keywords using YAKE or RAKE`
			`- api: HTTP API with Axum`
			`- mcp: Model Context Protocol support`

			`---`

			`## Core Extraction Functions`

			### `extract_file` (async)

			`Extract content from a file path.`

			```rust
			`pub async fn extract_file(`
			`path: impl AsRef<Path>,`
			`mime_type: Option<&str>,`
			`config: &ExtractionConfig,`
			`) -> Result<ExtractionResult>`
			```

			Always available. Requires async context (`#[tokio::main]`, `tokio::spawn`, etc.).

			```rust
			`use kreuzberg::{extract_file, ExtractionConfig};`
			`use std::path::Path;`

			`#[tokio::main]`
			`async fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let result = extract_file("document.pdf", None, &config).await?;`
			`println!("Content: {}", result.content);`
			`Ok(())`
			`}`
			```

			### `extract_bytes` (async)

			`Extract content from byte data.`

			```rust
			`pub async fn extract_bytes(`
			`data: &[u8],`
			`mime_type: &str,`
			`config: &ExtractionConfig,`
			`) -> Result<ExtractionResult>`
			```

			`Always available. Requires async context.`

			```rust
			`#[tokio::main]`
			`async fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let pdf_bytes = std::fs::read("document.pdf")?;`
			`let result = extract_bytes(&pdf_bytes, "application/pdf", &config).await?;`
			`Ok(())`
			`}`
			```

			### `extract_file_sync` (sync)

			Synchronous wrapper around `extract_file`.

			```rust
			`pub fn extract_file_sync(`
			`path: impl AsRef<Path>,`
			`mime_type: Option<&str>,`
			`config: &ExtractionConfig,`
			`) -> Result<ExtractionResult>`
			```

			`Requires tokio-runtime feature. Blocks the current thread using a global Tokio runtime.`

			```rust
			`use kreuzberg::{extract_file_sync, ExtractionConfig};`

			`fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let result = extract_file_sync("document.pdf", None, &config)?;`
			`println!("Content: {}", result.content);`
			`Ok(())`
			`}`
			```

			### `extract_bytes_sync` (sync)

			Synchronous wrapper around `extract_bytes`.

			```rust
			`pub fn extract_bytes_sync(`
			`content: &[u8],`
			`mime_type: &str,`
			`config: &ExtractionConfig,`
			`) -> Result<ExtractionResult>`
			```

			`Always available. Works in sync and async contexts.`

			```rust
			`fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let bytes = b"Hello, world!";`
			`let result = extract_bytes_sync(bytes, "text/plain", &config)?;`
			`Ok(())`
			`}`
			```

			### `batch_extract_file` (async, parallel)

			`Extract multiple files concurrently.`

			```rust
			`pub async fn batch_extract_file(`
			`paths: Vec<impl AsRef<Path>>,`
			`config: &ExtractionConfig,`
			`) -> Result<Vec<ExtractionResult>>`
			```

			Requires tokio-runtime feature. Processes files in parallel with automatic concurrency management (defaults to `num_cpus * 1.5`).

			```rust
			`#[tokio::main]`
			`async fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let paths = vec!["doc1.pdf", "doc2.pdf", "doc3.pdf"];`
			`let results = batch_extract_file(paths, &config).await?;`
			`println!("Processed {} files", results.len());`
			`Ok(())`
			`}`
			```

			### `batch_extract_bytes` (async, parallel)

			`Extract multiple byte arrays concurrently.`

			```rust
			`pub async fn batch_extract_bytes(`
			`contents: Vec<(Vec<u8>, String)>,`
			`config: &ExtractionConfig,`
			`) -> Result<Vec<ExtractionResult>>`
			```

			Requires tokio-runtime feature. Each tuple is `(bytes, mime_type)`.

			```rust
			`#[tokio::main]`
			`async fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let contents = vec![`
			`(b"PDF content".to_vec(), "application/pdf".to_string()),`
			`(b"Text content".to_vec(), "text/plain".to_string()),`
			`];`
			`let results = batch_extract_bytes(contents, &config).await?;`
			`Ok(())`
			`}`
			```

			### `batch_extract_file_sync` (sync, parallel)

			`Synchronous wrapper for batch file extraction.`

			```rust
			`pub fn batch_extract_file_sync(`
			`paths: Vec<impl AsRef<Path>>,`
			`config: &ExtractionConfig,`
			`) -> Result<Vec<ExtractionResult>>`
			```

			`Requires tokio-runtime feature. Uses global runtime for concurrency.`

			```rust
			`fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let paths = vec!["doc1.pdf", "doc2.pdf"];`
			`let results = batch_extract_file_sync(paths, &config)?;`
			`Ok(())`
			`}`
			```

			### `batch_extract_bytes_sync` (sync, parallel)

			`Synchronous wrapper for batch byte extraction.`

			```rust
			`pub fn batch_extract_bytes_sync(`
			`contents: Vec<(Vec<u8>, String)>,`
			`config: &ExtractionConfig,`
			`) -> Result<Vec<ExtractionResult>>`
			```

			Always available. Each tuple is `(bytes, mime_type)`.

			```rust
			`fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let contents = vec![`
			`(b"content 1".to_vec(), "text/plain".to_string()),`
			`(b"content 2".to_vec(), "text/plain".to_string()),`
			`];`
			`let results = batch_extract_bytes_sync(contents, &config)?;`
			`Ok(())`
			`}`
			```

			### `FileExtractionConfig`

			Per-file overrides for batch operations, passed as an optional parameter to `batch_extract_file` / `batch_extract_bytes` (and their sync variants). All fields `Option<T>` — `None` = use batch default.

			> Note (v4.5.0): The separate `batch_extract_file_with_configs` / `batch_extract_bytes_with_configs` functions have been removed. Per-file configs are now an optional parameter on the unified batch functions.

			```rust
			`pub struct FileExtractionConfig {`
			`pub enable_quality_processing: Option<bool>,`
			`pub ocr: Option<OcrConfig>,`
			`pub force_ocr: Option<bool>,`
			`pub chunking: Option<ChunkingConfig>,`
			`pub images: Option<ImageExtractionConfig>,`
			`pub pdf_options: Option<PdfConfig>,`
			`pub token_reduction: Option<TokenReductionConfig>,`
			`pub language_detection: Option<LanguageDetectionConfig>,`
			`pub pages: Option<PageConfig>,`
			`pub postprocessor: Option<PostProcessorConfig>,`
			`pub output_format: Option<OutputFormat>,`
			`pub include_document_structure: Option<bool>,`
			`}`
			```

			Excluded batch-level fields: `max_concurrent_extractions`, `use_cache`, `acceleration`, `security_limits`.

			`---`

			`## Configuration`

			### `ExtractionConfig`

			`Main configuration struct for all extraction operations.`

			```rust
			`pub struct ExtractionConfig {`
			`/// Enable caching (default: true)`
			`pub use_cache: bool,`

			`/// Enable quality post-processing (default: true)`
			`pub enable_quality_processing: bool,`

			`/// OCR configuration (None = OCR disabled)`
			`pub ocr: Option<OcrConfig>,`

			`/// Force OCR even for searchable PDFs (default: false)`
			`pub force_ocr: bool,`

			`/// Text chunking configuration (None = disabled)`
			`pub chunking: Option<ChunkingConfig>,`

			`/// Image extraction configuration (None = disabled)`
			`pub images: Option<ImageExtractionConfig>,`

			`/// PDF-specific options (requires pdf feature)`
			`#[cfg(feature = "pdf")]`
			`pub pdf_options: Option<PdfConfig>,`

			`/// Token reduction configuration (None = disabled)`
			`pub token_reduction: Option<TokenReductionConfig>,`

			`/// Language detection configuration (None = disabled)`
			`pub language_detection: Option<LanguageDetectionConfig>,`

			`/// Page extraction configuration (None = disabled)`
			`pub pages: Option<PageConfig>,`

			`/// Keyword extraction configuration (requires keywords-yake or keywords-rake)`
			`#[cfg(any(feature = "keywords-yake", feature = "keywords-rake"))]`
			`pub keywords: Option<KeywordConfig>,`

			`/// Post-processor configuration (None = use defaults)`
			`pub postprocessor: Option<PostProcessorConfig>,`

			`/// HTML to Markdown conversion options (requires html feature)`
			`#[cfg(feature = "html")]`
			`pub html_options: Option<ConversionOptions>,`

			`/// Maximum concurrent extractions in batch (None = num_cpus * 1.5)`
			`pub max_concurrent_extractions: Option<usize>,`

			`/// Result structure format (default: Unified)`
			`/// Uses types::OutputFormat (Unified \| ElementBased)`
			`pub result_format: types::OutputFormat,`

			`/// Security limits for archives (requires archives feature)`
			`#[cfg(feature = "archives")]`
			`pub security_limits: Option<SecurityLimits>,`

			`/// Content output format (default: Plain)`
			`/// Uses config::OutputFormat (Plain \| Markdown \| Djot \| Html)`
			`pub output_format: OutputFormat,`
			`}`
			```

			`#### Creating Configs`

			```rust
			`use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};`

			`// Default configuration`
			`let config = ExtractionConfig::default();`

			`// With OCR`
			`let config = ExtractionConfig {`
			`ocr: Some(OcrConfig {`
			`backend: "tesseract".to_string(),`
			`..Default::default()`
			`}),`
			`..Default::default()`
			`};`

			`// With chunking`
			`let config = ExtractionConfig {`
			`chunking: Some(ChunkingConfig {`
			`max_characters: 512,`
			`overlap: 50,`
			`..Default::default()`
			`}),`
			`output_format: OutputFormat::Markdown,`
			`..Default::default()`
			`};`
			```

---

## Output Formats

Two separate enums are both named `OutputFormat`; they live in different modules:

### Content `OutputFormat` (`core::config::formats::OutputFormat`)

Controls the format of the `content` field text. Used by `ExtractionConfig::output_format`.

```rust
pub enum OutputFormat {
    /// Plain text (default)
    Plain,
    /// Markdown formatted
    Markdown,
    /// Djot markup format
    Djot,
    /// HTML format
    Html,
}
```

### Result `OutputFormat` (`types::extraction::OutputFormat`)

Controls the result structure. Used by `ExtractionConfig::result_format`.

```rust
pub enum OutputFormat {
    /// Unified format with all content in `content` field (default)
    Unified,
    /// Element-based format with semantic element extraction
    ElementBased,
}
```

```rust
use kreuzberg::{ExtractionConfig, OutputFormat};

let config = ExtractionConfig {
    output_format: OutputFormat::Markdown, // content format (Plain/Markdown/Djot/Html)
    // result_format uses types::OutputFormat (Unified/ElementBased) and defaults to Unified
    ..Default::default()
};
```
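Because both enums share one name, importing them side by side needs aliases. The sketch below is self-contained and does not use the Kreuzberg crate: the `config` and `types` modules, the `FormatSettings` struct, and the alias names `ContentFormat` and `ResultFormat` are hypothetical stand-ins that only mirror the variants listed above.

```rust
// Stand-in modules mirroring the two enums described above.
// These are NOT the real Kreuzberg modules; names are for illustration only.
mod config {
    #[derive(Debug, Default, Clone, Copy, PartialEq)]
    pub enum OutputFormat {
        #[default]
        Plain,
        Markdown,
        Djot,
        Html,
    }
}

mod types {
    #[derive(Debug, Default, Clone, Copy, PartialEq)]
    pub enum OutputFormat {
        #[default]
        Unified,
        ElementBased,
    }
}

// Aliases keep the two enums apart at the call site.
use config::OutputFormat as ContentFormat;
use types::OutputFormat as ResultFormat;

/// Hypothetical struct holding both format choices, like the two
/// `ExtractionConfig` fields `output_format` and `result_format`.
#[derive(Debug, Default)]
pub struct FormatSettings {
    pub output_format: ContentFormat,
    pub result_format: ResultFormat,
}

/// Markdown text combined with element-based result structure.
pub fn markdown_elements() -> FormatSettings {
    FormatSettings {
        output_format: ContentFormat::Markdown,
        result_format: ResultFormat::ElementBased,
    }
}
```

The same `use ... as ...` pattern should apply to the real enums once their actual import paths are confirmed against the crate documentation.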

---

## Extraction Result

### `ExtractionResult`

Result returned by all extraction functions.

```rust
pub struct ExtractionResult {
    /// Main extracted content
    pub content: String,

    /// Document MIME type
    pub mime_type: Cow<'static, str>,

    /// Metadata about extraction
    pub metadata: Metadata,

    /// Extracted tables (HTML/Markdown)
    pub tables: Vec<Table>,

    /// Detected languages (if language-detection enabled)
    pub detected_languages: Option<Vec<String>>,

    /// Text chunks (if chunking enabled)
    pub chunks: Option<Vec<Chunk>>,

    /// Extracted images (if image extraction enabled)
    pub images: Option<Vec<ExtractedImage>>,

    /// Per-page content (if page extraction enabled)
    pub pages: Option<Vec<PageContent>>,

    /// Semantic elements (if element-based format enabled)
    pub elements: Option<Vec<Element>>,

    /// Djot document structure (if extracting Djot)
    pub djot_content: Option<DjotContent>,

    /// Extracted keywords with relevance scores (if keyword extraction enabled)
    pub extracted_keywords: Option<Vec<ExtractedKeyword>>,

    /// Quality score for extraction result (0.0-1.0)
    pub quality_score: Option<f64>,

    /// Non-fatal warnings during processing pipeline
    pub processing_warnings: Vec<ProcessingWarning>,
}
```

### `ExtractedKeyword`

Extracted keyword with relevance score and position information.

```rust
pub struct ExtractedKeyword {
    /// Keyword text
    pub text: String,

    /// Relevance score (0.0-1.0)
    pub score: f32,

    /// Algorithm used for extraction ("tfidf", "textrank", "yake", etc.)
    pub algorithm: String,

    /// Character positions in content (if available)
    pub positions: Option<Vec<usize>>,
}
```
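A common follow-up is keeping only the highest-scoring keywords. The sketch below uses a local copy of the struct so it compiles without the crate; the helper `top_keywords` and its `min_score` threshold are hypothetical and not part of the Kreuzberg API.

```rust
/// Local mirror of the `ExtractedKeyword` fields shown above, so this
/// sketch compiles on its own. With the crate, use the real type instead.
#[derive(Debug, Clone, PartialEq)]
pub struct ExtractedKeyword {
    pub text: String,
    pub score: f32,
    pub algorithm: String,
    pub positions: Option<Vec<usize>>,
}

/// Hypothetical helper: returns up to `n` keywords whose score is at least
/// `min_score`, ordered from highest to lowest score.
pub fn top_keywords(
    keywords: &[ExtractedKeyword],
    n: usize,
    min_score: f32,
) -> Vec<&ExtractedKeyword> {
    let mut kept: Vec<&ExtractedKeyword> = keywords
        .iter()
        .filter(|k| k.score >= min_score)
        .collect();
    // `total_cmp` is a total order on f32, so the sort never panics.
    kept.sort_by(|a, b| b.score.total_cmp(&a.score));
    kept.truncate(n);
    kept
}
```

With a real result, the input slice would come from `result.extracted_keywords.as_deref().unwrap_or(&[])`.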

### `ProcessingWarning`

Non-fatal warning encountered during document processing.

```rust
pub struct ProcessingWarning {
    /// Component that generated the warning
    pub source: String,

    /// Warning message describing the issue
    pub message: String,
}
```
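Since warnings do not fail the extraction, callers usually log or summarize them. One possible approach, sketched with a local copy of the struct and a hypothetical helper `warnings_by_source`, groups messages by the component that emitted them:

```rust
use std::collections::BTreeMap;

/// Local mirror of the `ProcessingWarning` fields shown above.
pub struct ProcessingWarning {
    pub source: String,
    pub message: String,
}

/// Hypothetical helper: groups warning messages by their `source`
/// component, keeping the original order within each group.
pub fn warnings_by_source(warnings: &[ProcessingWarning]) -> BTreeMap<&str, Vec<&str>> {
    let mut grouped: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for warning in warnings {
        grouped
            .entry(warning.source.as_str())
            .or_default()
            .push(warning.message.as_str());
    }
    grouped
}
```

A `BTreeMap` is used here so that iteration order over sources is deterministic, which keeps log output stable between runs.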

### `Chunk`

Text chunk with optional embedding.

```rust
pub struct Chunk {
    /// Chunk text content
    pub content: String,

    /// Optional embedding vector
    pub embedding: Option<Vec<f32>>,

    /// Chunk metadata
    pub metadata: ChunkMetadata,
}

pub struct ChunkMetadata {
    pub byte_start: usize,
    pub byte_end: usize,
    pub token_count: Option<usize>,
    pub chunk_index: usize,
    pub total_chunks: usize,
    pub first_page: Option<usize>,
    pub last_page: Option<usize>,
}
```
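The byte offsets can be used to map a chunk back to its source text. The sketch below assumes, without confirmation from the source, that `byte_start` and `byte_end` index into the UTF-8 bytes of `ExtractionResult::content`; the trimmed `ChunkMetadata` mirror and both helpers are hypothetical.

```rust
/// Trimmed local mirror of `ChunkMetadata` (only the offset fields).
pub struct ChunkMetadata {
    pub byte_start: usize,
    pub byte_end: usize,
}

/// Hypothetical helper: returns the slice of `content` covered by a chunk.
/// `str::get` returns `None` instead of panicking when the range is out of
/// bounds or does not fall on UTF-8 character boundaries.
pub fn chunk_slice<'a>(content: &'a str, meta: &ChunkMetadata) -> Option<&'a str> {
    content.get(meta.byte_start..meta.byte_end)
}

/// Hypothetical helper: number of bytes shared by two consecutive chunks
/// (0 when they do not overlap).
pub fn overlap_bytes(prev: &ChunkMetadata, next: &ChunkMetadata) -> usize {
    prev.byte_end.saturating_sub(next.byte_start)
}
```

Using `get` rather than direct indexing matters for non-ASCII documents, where an offset in the middle of a multi-byte character would otherwise panic.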

### `ExtractedImage`

Image extracted from document.

```rust
pub struct ExtractedImage {
    /// Raw image bytes
    pub data: Bytes,

    /// Format: "jpeg", "png", "webp", etc.
    pub format: Cow<'static, str>,

    /// Zero-indexed position
    pub image_index: usize,

    /// Page number (1-indexed)
    pub page_number: Option<usize>,

    /// Image dimensions
    pub width: Option<u32>,
    pub height: Option<u32>,

    /// Colorspace: "RGB", "CMYK", "Gray"
    pub colorspace: Option<String>,

    /// Bits per component
    pub bits_per_component: Option<u32>,

    /// Whether this is a mask image
    pub is_mask: bool,

    /// Image description
    pub description: Option<String>,

    /// Nested OCR result (if OCRed)
    pub ocr_result: Option<Box<ExtractionResult>>,
}
```
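Several of these fields are optional, so filtering code has to handle missing dimensions explicitly. As an illustration only, the sketch below keeps images that are not masks and whose known width and height both reach a minimum; `ImageInfo` is a trimmed local stand-in for `ExtractedImage`, and `usable_images` is a hypothetical helper.

```rust
/// Trimmed local stand-in for `ExtractedImage` (metadata fields only).
pub struct ImageInfo {
    pub image_index: usize,
    pub width: Option<u32>,
    pub height: Option<u32>,
    pub is_mask: bool,
}

/// Hypothetical helper: returns the `image_index` of every image that is
/// not a mask and whose width and height are both known and at least
/// `min_side` pixels. Images with unknown dimensions are skipped.
pub fn usable_images(images: &[ImageInfo], min_side: u32) -> Vec<usize> {
    images
        .iter()
        .filter(|img| !img.is_mask)
        .filter(|img| {
            matches!(
                (img.width, img.height),
                (Some(w), Some(h)) if w >= min_side && h >= min_side
            )
        })
        .map(|img| img.image_index)
        .collect()
}
```

Whether to skip or keep images with unknown dimensions is a design choice; this sketch skips them so that tiny decorative images never slip through.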

---

## Error Handling

### `KreuzbergError` enum

```rust
pub enum KreuzbergError {
    /// File system errors (always bubble up)
    Io(std::io::Error),

    /// Document parsing errors
    Parsing {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// OCR processing errors
    Ocr {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Configuration/input validation errors
    Validation {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Cache operation errors
    Cache {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Image processing errors
    ImageProcessing {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Serialization errors (JSON, MessagePack)
    Serialization {
        message: String,
        source: Option<Box<dyn std::error::Error + Send + Sync>>,
    },

    /// Missing system dependency (e.g. Tesseract)
    MissingDependency(String),

    /// Plugin-specific errors
    Plugin {
        message: String,
        plugin_name: String,
    },

    /// Mutex/RwLock poisoning
    LockPoisoned(String),

    /// Unsupported MIME type or format
    UnsupportedFormat(String),

    /// Other errors
    Other(String),
}
```

#### Error Constructors

```rust
use kreuzberg::KreuzbergError;

// Create errors
let err = KreuzbergError::parsing("invalid PDF");
let err = KreuzbergError::ocr("Tesseract failed");
let err = KreuzbergError::validation("config invalid");
let err = KreuzbergError::unsupported_format("application/unknown");
let err = KreuzbergError::missing_dependency("tesseract");

// With source
let source = std::io::Error::new(std::io::ErrorKind::NotFound, "file missing");
let err = KreuzbergError::parsing_with_source("corrupt PDF", source);
```

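#### Handling Errors

```rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

// Must run inside an async context (for example under `#[tokio::main]`).
let config = ExtractionConfig::default();

match extract_file("doc.pdf", None, &config).await {
    Ok(result) => println!("Success: {}", result.content),
    // The file could not be read
    Err(KreuzbergError::Io(e)) => {
        eprintln!("File error: {}", e);
    }
    // The MIME type or format has no extractor
    Err(KreuzbergError::UnsupportedFormat(fmt)) => {
        eprintln!("Unsupported: {}", fmt);
    }
    // Every other variant
    Err(e) => eprintln!("Other error: {}", e),
}
```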
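Several variants carry an optional boxed `source`, which suggests that underlying causes can be reached through the standard `std::error::Error::source` chain. The sketch below shows that general pattern with a local `ParsingError` type; it is a stand-in, and whether `KreuzbergError` exposes its `source` field this way is an assumption to verify against the crate.

```rust
use std::error::Error;
use std::fmt;

/// Local stand-in shaped like the `Parsing` variant above: a message plus
/// an optional boxed cause. Not the real Kreuzberg type.
#[derive(Debug)]
pub struct ParsingError {
    pub message: String,
    pub source: Option<Box<dyn Error + Send + Sync>>,
}

impl fmt::Display for ParsingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parsing error: {}", self.message)
    }
}

impl Error for ParsingError {
    // Expose the boxed cause through the standard `source` method.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source
            .as_deref()
            .map(|cause| cause as &(dyn Error + 'static))
    }
}

/// Collects the message of `err` and of every cause reachable via `source`.
pub fn error_chain(err: &(dyn Error + 'static)) -> Vec<String> {
    let mut messages = vec![err.to_string()];
    let mut current = err.source();
    while let Some(cause) = current {
        messages.push(cause.to_string());
        current = cause.source();
    }
    messages
}
```

Logging the whole chain rather than only the top-level message tends to make failures such as a corrupt file behind a parsing error much easier to diagnose.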

---

## MIME Type Detection

### `detect_mime_type`

Detect MIME type from file path.

```rust
pub fn detect_mime_type(path: impl AsRef<Path>) -> Result<String>
```

```rust
use kreuzberg::detect_mime_type;

let mime = detect_mime_type("document.pdf")?;
assert_eq!(mime, "application/pdf");
```

### `detect_mime_type_from_bytes`

Detect MIME type from byte data.

```rust
pub fn detect_mime_type_from_bytes(data: &[u8]) -> Result<String>
```

### `validate_mime_type`

Check if a MIME type is supported.

```rust
pub fn validate_mime_type(mime_type: &str) -> Result<()>
```

```rust
use kreuzberg::validate_mime_type;

validate_mime_type("application/pdf")?;     // OK
validate_mime_type("application/unknown")?; // Error
```

### `get_extensions_for_mime`

Get file extensions for a MIME type.

```rust
pub fn get_extensions_for_mime(mime_type: &str) -> Vec<String>
```

```rust
use kreuzberg::get_extensions_for_mime;

let exts = get_extensions_for_mime("application/pdf");
// ["pdf"]

let exts = get_extensions_for_mime("text/plain");
// ["txt", "text"]
```

### MIME Type Constants

```rust
use kreuzberg::{
    PDF_MIME_TYPE,
    PLAIN_TEXT_MIME_TYPE,
    HTML_MIME_TYPE,
    MARKDOWN_MIME_TYPE,
    JSON_MIME_TYPE,
    XML_MIME_TYPE,
    DOCX_MIME_TYPE,
    POWER_POINT_MIME_TYPE,
    EXCEL_MIME_TYPE,
};

assert_eq!(PDF_MIME_TYPE, "application/pdf");
assert_eq!(PLAIN_TEXT_MIME_TYPE, "text/plain");
```

---

## Plugin Registry

Access extractors, OCR backends, post-processors, and validators.

### `get_document_extractor_registry`

Get all available document extractors.

```rust
pub fn get_document_extractor_registry() -> Arc<RwLock<DocumentExtractorRegistry>>
```

### `get_ocr_backend_registry`

Get all available OCR backends.

```rust
pub fn get_ocr_backend_registry() -> Arc<RwLock<OcrBackendRegistry>>
```

### `get_post_processor_registry`

Get all available post-processors.

```rust
pub fn get_post_processor_registry() -> Arc<RwLock<PostProcessorRegistry>>
```

### `get_validator_registry`

Get all available validators.

```rust
pub fn get_validator_registry() -> Arc<RwLock<ValidatorRegistry>>
```
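All four functions return an `Arc<RwLock<...>>`, so callers take a read or write lock before touching the registry and need to decide how to treat a poisoned lock. The sketch below shows that access pattern with a hypothetical `DemoRegistry`; the real registry types have their own methods, which are not documented here, and the sketch assumes the standard library `RwLock`.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for a registry: extractor name -> MIME types.
/// The real Kreuzberg registries have a different, richer API.
#[derive(Default)]
pub struct DemoRegistry {
    extractors: BTreeMap<String, Vec<String>>,
}

/// Registers an entry under a write lock.
pub fn register(
    registry: &Arc<RwLock<DemoRegistry>>,
    name: &str,
    mime_types: &[&str],
) -> Result<(), String> {
    // A poisoned lock is turned into an error instead of a panic.
    let mut guard = registry
        .write()
        .map_err(|e| format!("registry lock poisoned: {e}"))?;
    guard.extractors.insert(
        name.to_string(),
        mime_types.iter().map(|m| m.to_string()).collect(),
    );
    Ok(())
}

/// Lists registered names under a read lock; many readers may hold it at once.
pub fn registered_names(registry: &Arc<RwLock<DemoRegistry>>) -> Result<Vec<String>, String> {
    let guard = registry
        .read()
        .map_err(|e| format!("registry lock poisoned: {e}"))?;
    Ok(guard.extractors.keys().cloned().collect())
}
```

Mapping poisoning to an error value mirrors the `LockPoisoned(String)` variant listed under Error Handling; keeping lock guards short-lived, as both helpers do, avoids blocking other threads.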

---

## Complete Example

```rust
use kreuzberg::{
    extract_file, ExtractionConfig, OutputFormat,
    ChunkingConfig, OcrConfig, LanguageDetectionConfig,
};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    // Configure extraction
    let config = ExtractionConfig {
        output_format: OutputFormat::Markdown,
        chunking: Some(ChunkingConfig {
            max_characters: 512,
            overlap: 50,
            ..Default::default()
        }),
        language_detection: Some(LanguageDetectionConfig::default()),
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            ..Default::default()
        }),
        force_ocr: false,
        ..Default::default()
    };

    // Extract from file
    let result = extract_file("document.pdf", None, &config).await?;

    // Use results
    println!("Content:\n{}", result.content);
    println!("MIME: {}", result.mime_type);

    if let Some(langs) = result.detected_languages {
        println!("Languages: {:?}", langs);
    }

    if let Some(chunks) = result.chunks {
        println!("Chunks: {}", chunks.len());
        for chunk in chunks {
            // Truncate by characters: slicing by bytes can panic in the
            // middle of a multi-byte UTF-8 character.
            let preview: String = chunk.content.chars().take(50).collect();
            println!("  - {}", preview);
        }
    }

    if let Some(images) = result.images {
        println!("Images: {}", images.len());
    }

    if let Some(pages) = result.pages {
        println!("Pages: {}", pages.len());
    }

    Ok(())
}
```

---

## Result Type Alias

```rust
pub type Result<T> = std::result::Result<T, KreuzbergError>;
```

All fallible operations return `Result<T>` where errors are `KreuzbergError`.
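A crate-level alias like this pairs naturally with the `?` operator. The sketch below illustrates the pattern with a local `DemoError` and a `DemoResult<T>` alias (renamed to avoid shadowing the standard `Result`); both are hypothetical, and the `From<std::io::Error>` conversion is an assumption about how an `Io` variant would typically be wired up, not a confirmed detail of `KreuzbergError`.

```rust
use std::fmt;
use std::io;

/// Hypothetical error enum with an `Io` variant, loosely shaped like
/// the one documented above.
#[derive(Debug)]
pub enum DemoError {
    Io(io::Error),
    Validation(String),
}

impl fmt::Display for DemoError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DemoError::Io(e) => write!(f, "I/O error: {e}"),
            DemoError::Validation(msg) => write!(f, "validation error: {msg}"),
        }
    }
}

impl std::error::Error for DemoError {}

// Lets `?` convert `io::Error` into `DemoError` automatically.
impl From<io::Error> for DemoError {
    fn from(e: io::Error) -> Self {
        DemoError::Io(e)
    }
}

/// Alias in the same style as `kreuzberg::Result<T>`.
pub type DemoResult<T> = std::result::Result<T, DemoError>;

/// Reads a file and rejects empty content.
pub fn read_non_empty(path: &str) -> DemoResult<String> {
    let text = std::fs::read_to_string(path)?; // io::Error -> DemoError::Io
    if text.trim().is_empty() {
        return Err(DemoError::Validation(format!("{path} is empty")));
    }
    Ok(text)
}
```

The alias keeps signatures short while every function still reports failures through one error type.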

---

## Feature Flags Summary

| Feature            | Availability | Dependencies                                    |
| ------------------ | ------------ | ----------------------------------------------- |
| tokio-runtime      | Default      | Tokio runtime for async/sync                    |
| pdf                | Default      | PDFium                                          |
| ocr                | Optional     | Tesseract                                       |
| chunking           | Optional     | text-splitter                                   |
| embeddings         | Optional     | FastEmbed, requires tokio-runtime               |
| language-detection | Optional     | whatlang                                        |
| keywords-yake      | Optional     | yake-rust                                       |
| keywords-rake      | Optional     | rake                                            |
| api                | Optional     | Axum, requires tokio-runtime                    |
| mcp                | Optional     | Model Context Protocol, requires tokio-runtime  |

---

## Version

This reference is for Kreuzberg 4.x.