Nomad changes

2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions
--- a/docs/concepts/architecture.md
+++ b/docs/concepts/architecture.md
@@ -0,0 +1,172 @@
+# Architecture
+
+Kreuzberg is a document extraction library with a Rust core and native bindings for Python, TypeScript, Ruby, and more. The core handles all the expensive work (PDF parsing, OCR, text processing) and exposes it through thin language-specific wrappers. Your code calls directly into compiled Rust. No subprocesses, no serialization, no IPC overhead.
+
+---
+
+## Design Principles
+
+Three ideas shape how Kreuzberg is built:
+
+1. **Rust does the heavy lifting.** Every performance-critical operation runs as native Rust code - compiled, optimized, and fast.
+2. **Plugins cross language boundaries.** A Python OCR backend can register itself with the Rust core and participate in the extraction pipeline as a first-class citizen.
+3. **Minimize data copying.** Data passes across FFI boundaries using zero-copy techniques wherever possible. When a Python plugin receives file bytes, it gets a buffer protocol view into Rust-owned memory, not a copy.
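+
+The buffer-protocol idea in the third principle can be illustrated in plain Python, without Kreuzberg itself. In this minimal sketch a `bytearray` stands in for Rust-owned memory (an assumption for illustration only; in Kreuzberg the buffer belongs to the Rust core), and a `memoryview` plays the role of the view a plugin receives.
+
+```python
+# Stand-in for a Rust-owned buffer (assumption: illustration only).
+file_bytes = bytearray(b"%PDF-1.7 example payload")
+
+# A memoryview is a window onto the same memory, not a copy.
+view = memoryview(file_bytes)
+header = view[:8]  # slicing a view is also zero-copy
+
+# Mutating the underlying buffer is visible through the view,
+# which shows that both names refer to the same bytes.
+file_bytes[5] = ord("2")
+print(bytes(header))  # b'%PDF-2.7'
+
+# bytes(...) is the explicit step that creates an independent copy.
+snapshot = bytes(view)
+```
+
+The copy only happens at the point where the plugin asks for one, which is the behavior the principle describes.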
+
+---
+
+## System Layers
+
+```mermaid
+flowchart TB
+    subgraph your_code ["Your Code"]
+        Python["Python"]
+        Node["TypeScript\nNode.js"]
+        Wasm["TypeScript\nWASM"]
+        Ruby["Ruby"]
+    end
+
+    subgraph bridges ["FFI Bridges"]
+        PyO3["PyO3"]
+        NAPI["NAPI-RS"]
+        WB["wasm-bindgen"]
+        Magnus["Magnus"]
+    end
+
+    subgraph engine ["Rust Core"]
+        Core["kreuzberg\ncrate"]
+    end
+
+    Python --> PyO3
+    Node --> NAPI
+    Wasm --> WB
+    Ruby --> Magnus
+
+    PyO3 --> Core
+    NAPI --> Core
+    WB --> Core
+    Magnus --> Core
+
+    style Core fill:#e1f5ff,stroke:#0288d1
+    style PyO3 fill:#ffe1e1,stroke:#c62828
+    style NAPI fill:#ffe1e1,stroke:#c62828
+    style WB fill:#fff3e0,stroke:#ef6c00
+    style Magnus fill:#ffe1e1,stroke:#c62828
+```
+
+Your code sits at the top. It calls into a bridge layer that translates types between your language and Rust. The bridge forwards the call to the Rust core, which does the actual extraction, OCR, and text processing. Results come back through the same bridge.
+
+### TypeScript: Native vs Wasm
+
+There are two TypeScript packages because server and browser environments have fundamentally different constraints:
+
+- **`@kreuzberg/node`** (native) - compiled via NAPI-RS. Maximum performance on Node.js, Bun, and Deno. Requires a platform-specific native binary.
+- **`@kreuzberg/wasm`** (WebAssembly) - compiled via wasm-bindgen. Runs in browsers, Cloudflare Workers, Vercel Edge, and any JavaScript runtime. About 60-80% of native speed, but zero native dependencies.
+
+Rule of thumb: use native on servers, Wasm in browsers and edge runtimes. See the [Installation Guide](../getting-started/installation.md#typescript) for setup.
+
+---
+
+## Rust Core Structure
+
+The core crate (`crates/kreuzberg`) is organized into modules with clear responsibilities:
+
+```mermaid
+flowchart LR
+    subgraph crate ["kreuzberg crate"]
+        Core["core/\nOrchestration\nPipeline entry points"]
+        Plugins["plugins/\nTrait definitions\nRegistries"]
+        Extractors["extractors/\nMIME → handler\nmapping"]
+        Extraction["extraction/\nPDF · Excel · Email\nHTML · XML · Text"]
+        OCR["ocr/\nTesseract\nTable detection"]
+        Text["text/\nToken reduction\nQuality scoring"]
+        Types["types/\nExtractionResult\nMetadata · Chunk"]
+        Error["error/\nKreuzbergError"]
+    end
+
+    Core --> Plugins
+    Core --> Extractors
+    Extractors --> Extraction
+    Extractors --> Plugins
+    Extraction --> OCR
+    Extraction --> Text
+    Core --> Types
+    Core --> Error
+
+    style Core fill:#bbdefb,stroke:#1565c0
+    style Plugins fill:#c8e6c9,stroke:#2e7d32
+    style Extraction fill:#fff9c4,stroke:#f9a825
+    style Extractors fill:#ffccbc,stroke:#d84315
+```
+
+| Module          | Responsibility                                                                                                                                                                                                          |
+| --------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| **core/**       | Main entry points (`extract_file`, `extract_bytes`), MIME detection, config loading, pipeline orchestration                                                                                                             |
+| **plugins/**    | Plugin trait definitions (`DocumentExtractor`, `OcrBackend`, `PostProcessor`, `Validator`, `Renderer`) and the registry system (ExtractorRegistry, OcrRegistry, ValidatorRegistry, ProcessorRegistry, RendererRegistry) |
+| **extractors/** | Maps MIME types to the correct extractor implementation and registers them with the plugin system                                                                                                                       |
+| **extraction/** | Format-specific extraction logic - PDF via pdf_oxide, Excel via calamine, email parsing, and so on.                                                                                                                     |
+| **ocr/**        | OCR orchestration - Tesseract bindings, HOCR parsing, table detection                                                                                                                                                   |
+| **text/**       | Text processing utilities - token reduction, quality scoring, string manipulation                                                                                                                                       |
+| **types/**      | Shared data structures: `ExtractionResult`, `Metadata`, `Chunk`, and friends                                                                                                                                            |
+| **error/**      | Centralized error handling with the `KreuzbergError` enum                                                                                                                                                               |
+
+---
+
+## Rendering Pipeline
+
+After extraction, the raw internal document representation is passed through the **RendererRegistry** to produce the final output in the requested content format. Kreuzberg uses a comrak-based AST bridge for GFM Markdown and HTML5 rendering, ensuring high-fidelity output with full table, heading, and list support.
+
+```mermaid
+flowchart LR
+    Extractor["Extractor"] --> ID["InternalDocument"]
+    ID --> RR["RendererRegistry"]
+    RR --> GFM["GFM Markdown"]
+    RR --> HTML["HTML5"]
+    RR --> Djot["Djot"]
+    RR --> Plain["Plain Text"]
+    RR --> Custom["Custom Renderer"]
+
+    style RR fill:#c8e6c9,stroke:#2e7d32
+    style ID fill:#bbdefb,stroke:#1565c0
+```
+
+The RendererRegistry selects the appropriate renderer based on the requested content format (`--content-format`). Built-in renderers cover Markdown (GFM via comrak), HTML5 (also via comrak), Djot, and plain text. Custom renderers can be registered through the plugin system to support additional output formats.
+
+---
+
+## Why Rust?
+
+**Speed.** Rust compiles to native machine code with LLVM optimizations. PDF parsing uses pdf_oxide — a pure-Rust library with no system-library overhead. Text processing uses SIMD instructions to handle multiple characters per CPU cycle. Batch extraction runs on all CPU cores through Tokio's async runtime.
+
+**Safety.** Rust's type system and ownership model rule out entire categories of bugs in safe code. Null pointer dereferences, data races, and use-after-free are rejected at compile time, and out-of-bounds accesses are stopped by bounds checks instead of corrupting memory.
+
+**Real concurrency.** Unlike Python (limited by the GIL), Rust executes on all available cores simultaneously. Tokio's work-stealing scheduler distributes async tasks efficiently. File I/O is non-blocking, so threads never stall waiting on disk.
+
+For detailed performance analysis, see [Performance](../guides/development.md#performance).
+
+---
+
+## Using Kreuzberg from Rust
+
+The Rust core is a standalone library. You don't need Python or Node.js to use it:
+
+```rust title="main.rs"
+use kreuzberg::{extract_file_sync, ExtractionConfig};
+
+fn main() -> kreuzberg::Result<()> {
+    let config = ExtractionConfig::default();
+    let result = extract_file_sync("document.pdf", None, &config)?;
+    println!("Extracted: {}", result.content);
+    Ok(())
+}
+```
+
+This makes Kreuzberg a good fit for Rust-native applications, command-line tools, high-performance API servers, and embedded systems where Python or Node.js aren't practical.
+
+---
+
+## What to Read Next
+
+- [Extraction Pipeline](extraction-pipeline.md) - how files flow through the system stage by stage
+- [Plugin System](plugin-system.md) - extending Kreuzberg with custom extractors, OCR backends, and processors
+- [Performance](../guides/development.md#performance) - why Rust matters for extraction performance
+- [Creating Plugins](../guides/plugins.md) - step-by-step plugin development guide
--- a/docs/concepts/extraction-pipeline.md
+++ b/docs/concepts/extraction-pipeline.md
@@ -0,0 +1,233 @@
+# Extraction Pipeline
+
+Every file Kreuzberg processes follows the same multi-stage pipeline. A PDF, a scanned image, a spreadsheet, an email attachment: they all enter at the top and come out as a structured `ExtractionResult` at the bottom. The stages run in a fixed order, but several of them are conditional. Caching can short-circuit the entire flow. OCR only runs when images are present. Post-processing steps only fire if you've configured them.
+
+This page walks through each stage in detail so you understand what happens to your file, when, and why.
+
+---
+
+## How the Pipeline Works
+
+```mermaid
+flowchart TD
+    Input(["Input: file path or raw bytes"]):::input
+
+    Input --> S1["<b>1. Cache Lookup</b>\nHash file + config, check for stored result"]
+    S1 -->|Cache hit| FastReturn(["Return cached ExtractionResult"]):::cached
+
+    S1 -->|Cache miss| S2["<b>2. MIME Detection</b>\nResolve file type from extension or explicit param"]
+    S2 --> S3["<b>3. Registry Lookup</b>\nFind the right DocumentExtractor for this MIME type"]
+    S3 --> S4["<b>4. Format Extraction</b>\nRun the extractor: PDF, Excel, image, email, etc."]
+
+    S4 --> S5{"<b>5. OCR</b>\nImages present\nand OCR enabled?"}
+    S5 -->|Yes| OCR["Run OCR backend\n(Tesseract / PaddleOCR / EasyOCR)"]
+    S5 -->|No| S6
+
+    OCR --> S6["<b>6. Validators</b>\nCheck result meets requirements"]
+    S6 --> S7["<b>7. Quality + Chunking</b>\nScore quality, split into chunks"]
+    S7 --> S8["<b>8. Post-Processors</b>\nTransform result (Early → Middle → Late)"]
+
+    S8 --> S9["<b>9. Cache Store</b>\nSave result for future lookups"]
+    S9 --> Output(["Return ExtractionResult"]):::output
+
+    classDef input fill:#e3f2fd,stroke:#1565c0,color:#0d47a1
+    classDef output fill:#e8f5e9,stroke:#2e7d32,color:#1b5e20
+    classDef cached fill:#fff8e1,stroke:#f9a825,color:#e65100
+```
+
+The diagram above shows every stage in sequence. Let's break each one down.
+
+---
+
+## 1. Cache Lookup
+
+When caching is enabled (`cache=True` in your `ExtractionConfig`), the pipeline starts by computing a hash from the file's content and your configuration. If a result with that exact hash already exists in the cache, it's returned immediately. No extraction, no OCR, no post-processing. The entire pipeline is skipped.
+
+This is significant for workloads that reprocess the same files. Repeated extractions of the same document go from hundreds of milliseconds to single-digit milliseconds.
+
+Cache keys are content-based, not path-based. If you rename a file but the bytes are identical, the cache still hits. If you change your config (switch OCR backends, adjust chunking), a new cache key is generated so stale results are never returned.
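+
+A content-based key of this kind can be sketched as follows. This is a minimal illustration, not Kreuzberg's actual hashing scheme: the `cache_key` helper, the choice of SHA-256, and the dict-shaped config are all assumptions made for the example.
+
+```python
+import hashlib
+import json
+
+def cache_key(file_bytes: bytes, config: dict) -> str:
+    """Hypothetical key: SHA-256 over the file content plus a canonical config dump."""
+    digest = hashlib.sha256()
+    digest.update(file_bytes)
+    digest.update(b"\0")  # separator between content and config
+    # sort_keys makes the serialization independent of dict insertion order.
+    digest.update(json.dumps(config, sort_keys=True).encode("utf-8"))
+    return digest.hexdigest()
+
+content = b"same bytes, different file name"
+key_a = cache_key(content, {"ocr_backend": "tesseract"})
+key_b = cache_key(content, {"ocr_backend": "tesseract"})  # renamed file, same bytes
+key_c = cache_key(content, {"ocr_backend": "paddleocr"})  # changed config
+
+print(key_a == key_b)  # True
+print(key_a == key_c)  # False
+```
+
+Because the file path never enters the hash, renaming a file cannot cause a miss, and because the config does, changing it cannot cause a stale hit.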
+
+---
+
+## 2. MIME Detection
+
+Before Kreuzberg can extract anything, it needs to know what format the file is. It resolves the MIME type through one of two paths:
+
+- **Explicit:** You pass `mime_type="application/pdf"` and Kreuzberg validates it against the list of supported types.
+- **Auto-detection:** Kreuzberg takes the file extension and looks it up in an internal mapping table (for example, `.pdf` → `application/pdf`).
+
+If the resolved MIME type isn't in the supported list, the pipeline stops immediately with an `UnsupportedFormat` error. No compute is wasted on files Kreuzberg can't handle.
+
+For the full details on how extension mapping, normalization, and validation work, see [Format Support](../reference/formats.md).
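+
+The two resolution paths can be sketched in a few lines. The table contents, the `resolve_mime` helper, and the `UnsupportedFormat` class below are hypothetical stand-ins; the real mapping is larger and lives in the Rust core.
+
+```python
+from pathlib import Path
+from typing import Optional
+
+# Hypothetical excerpt of the extension table (assumption: illustration only).
+EXTENSION_TO_MIME = {
+    ".pdf": "application/pdf",
+    ".png": "image/png",
+    ".eml": "message/rfc822",
+}
+SUPPORTED_MIME_TYPES = set(EXTENSION_TO_MIME.values())
+
+class UnsupportedFormat(Exception):
+    """Stand-in for the error named in the text."""
+
+def resolve_mime(path: str, mime_type: Optional[str] = None) -> str:
+    # Explicit path: use the caller's value. Auto-detection: look up the extension.
+    resolved = mime_type or EXTENSION_TO_MIME.get(Path(path).suffix.lower())
+    # Both paths are validated against the supported list before any work starts.
+    if resolved not in SUPPORTED_MIME_TYPES:
+        raise UnsupportedFormat(f"cannot handle {path!r} (resolved type: {resolved})")
+    return resolved
+
+print(resolve_mime("report.PDF"))             # application/pdf
+print(resolve_mime("scan.bin", "image/png"))  # image/png
+```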
+
+---
+
+## 3. Registry Lookup
+
+With the MIME type resolved, Kreuzberg queries the extractor registry to find the `DocumentExtractor` that handles this format. The registry is a map from MIME types to extractor implementations, managed by the [plugin system](plugin-system.md).
+
+If multiple extractors are registered for the same MIME type (for example, you registered a custom PDF extractor alongside the built-in one), the one with the higher `priority()` value is selected. All built-in extractors have a priority of 0, so any custom extractor with a priority above 0 takes precedence.
+
+```rust title="registry_lookup.rs"
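+let registry = get_document_extractor_registry();
+// Assumes the same lock-guarded handle as `get_renderer_registry()`: take a read lock before querying.
+let registry = registry.read().unwrap();
+let extractor = registry.get("application/pdf")?;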
+```
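+
+The priority rule can be modeled with a small stand-in registry. This is a sketch in Python rather than the Rust implementation; the `Extractor` and `ExtractorRegistry` classes are hypothetical, and tie-breaking between equal priorities is not specified in the text, so the sketch simply keeps the first registration.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class Extractor:
+    name: str
+    mime_type: str
+    priority: int = 0  # built-in extractors use 0, per the text above
+
+class ExtractorRegistry:
+    """Hypothetical registry: MIME type -> extractors, highest priority wins."""
+
+    def __init__(self) -> None:
+        self._by_mime: dict[str, list[Extractor]] = {}
+
+    def register(self, extractor: Extractor) -> None:
+        self._by_mime.setdefault(extractor.mime_type, []).append(extractor)
+
+    def get(self, mime_type: str) -> Extractor:
+        candidates = self._by_mime.get(mime_type)
+        if not candidates:
+            raise KeyError(f"no extractor registered for {mime_type}")
+        # max() returns the first maximal element, so ties keep registration order.
+        return max(candidates, key=lambda e: e.priority)
+
+registry = ExtractorRegistry()
+registry.register(Extractor("builtin-pdf", "application/pdf"))
+registry.register(Extractor("custom-pdf", "application/pdf", priority=100))
+print(registry.get("application/pdf").name)  # custom-pdf
+```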
+
+---
+
+## 4. Format Extraction
+
+This is the core of the pipeline. The selected extractor reads the file and produces an `ExtractionResult` containing the extracted text, metadata (author, title, creation date), page count, and detected language.
+
+Each file format has a tailored extraction strategy:
+
+| Format                             | What happens                                                                                                                                                                                           |
+| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **PDF**                            | Text is extracted directly from the PDF text layer using pdf_oxide (pure Rust). If the PDF contains embedded images (scanned pages, diagrams), those images are collected and passed to the OCR stage. |
+| **Excel / Spreadsheets**           | Each sheet is parsed individually using calamine. Cell values are assembled into structured Markdown tables, preserving column alignment.                                                              |
+| **Images** (JPEG, PNG, TIFF, etc.) | The image bytes are loaded into memory and forwarded directly to the OCR backend. There is no text layer to extract from an image.                                                                     |
+| **XML / Plain text**               | A streaming parser processes the file incrementally. This keeps memory usage constant even for multi-gigabyte files because the entire file is never loaded at once.                                   |
+| **Email** (`.eml`, `.msg`)         | The MIME structure is parsed. The email body (plain text or HTML) is extracted as the main content. Attachments are extracted recursively using the same pipeline.                                     |
+| **Office** (DOCX, PPTX)            | The file is a ZIP archive containing XML. Kreuzberg opens the archive, locates the content XML parts, and parses the document structure into text.                                                     |
+
+The extraction result at this point contains raw extracted text. It hasn't been validated, scored, or chunked yet.
+
+---
+
+## 5. OCR (Conditional)
+
+OCR runs only when two conditions are true: the file contains images (or is an image itself), and OCR is enabled in the configuration. Even when both conditions are met, Kreuzberg applies a third check: if the format extractor already produced text, OCR is skipped. This avoids redundant processing on PDFs that have a searchable text layer.
+
+You can override this behavior with `force_ocr=True`, which tells Kreuzberg to always run OCR regardless of whether text was already extracted. This is useful for PDFs where the text layer is unreliable or incomplete.
+
+Conversely, `disable_ocr=True` skips OCR entirely. Image files that would normally require OCR return empty content instead of raising a `MissingDependencyError`. This is useful when you want to extract text from non-image formats only and avoid OCR overhead or dependency requirements.
+
+```mermaid
+flowchart LR
+    A{"Images present?"} -->|No| Skip(["Skip OCR"])
+
+    A -->|Yes| B{"force_ocr?"}
+    B -->|Yes| Run["Run OCR backend"]
+    B -->|No| C{"Text already\nextracted?"}
+    C -->|Yes| Skip
+    C -->|No| Run
+
+    Run --> Merge["Merge OCR output\nwith extracted text"]
+
+    style Skip fill:#f5f5f5,stroke:#bdbdbd
+    style Run fill:#e8f5e9,stroke:#2e7d32
+```
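+
+The same decision can be written as a single predicate. The function name and parameters below are hypothetical and only mirror the flags described in this section. One assumption is labeled in the code: the text does not say which flag wins when `disable_ocr` and `force_ocr` are both set, so the sketch lets `disable_ocr` take precedence.
+
+```python
+def should_run_ocr(
+    images_present: bool,
+    text_extracted: bool,
+    force_ocr: bool = False,
+    disable_ocr: bool = False,
+) -> bool:
+    # Assumption: disable_ocr overrides force_ocr when both are set.
+    if disable_ocr or not images_present:
+        return False
+    # force_ocr skips the "text already extracted?" check.
+    if force_ocr:
+        return True
+    # Lazy OCR: only run when the format extractor produced no text.
+    return not text_extracted
+
+print(should_run_ocr(images_present=True, text_extracted=True))                  # False
+print(should_run_ocr(images_present=True, text_extracted=True, force_ocr=True))  # True
+print(should_run_ocr(images_present=True, text_extracted=False))                 # True
+```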
+
+Kreuzberg ships three OCR backends:
+
+| Backend       | Engine               | When to use it                                                                                                               |
+| ------------- | -------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
+| **Tesseract** | Native Rust bindings | Default. Fast, solid accuracy for Latin scripts. Good general-purpose choice.                                                |
+| **PaddleOCR** | ONNX Runtime         | Best accuracy for Chinese, Japanese, Korean (CJK) scripts. Runs natively without Python.                                     |
+| **EasyOCR**   | Python + PyTorch     | Supports 80+ languages including Arabic, Hindi, Thai, and other complex scripts. Only available through the Python bindings. |
+
+When OCR completes, the OCR output is merged with any text the format extractor already produced. The merged result moves to post-processing.
+
+---
+
+## 6. Validators
+
+Validators are the first post-processing step. They inspect the `ExtractionResult` and decide whether it meets your requirements. If a validator rejects the result, the pipeline stops immediately and the error is returned to the caller. No further processing happens.
+
+This is intentionally strict. Validators exist to catch results that are fundamentally wrong (empty text, garbled output, suspiciously short content) before downstream systems consume them.
+
+```python title="example_validator.py"
+class MinLengthValidator:
+    def validate(self, result, config):
+        if len(result.content) < 100:
+            raise ValidationError("Extracted text too short")
+```
+
+You register validators through the plugin system. See [Plugin System](plugin-system.md) for details.
+
+---
+
+## 7. Quality Scoring + Chunking
+
+These two steps run after validation.
+
+**Quality scoring** is optional. When `enable_quality_processing=True`, Kreuzberg analyzes the extracted text and assigns a numeric score between 0.0 and 1.0. The score factors in the ratio of alphabetic characters to non-text characters, word frequency distribution (gibberish scores low), and the presence of formatting artifacts like repeated whitespace or encoding errors. The result is stored in `result.quality_score`.
+
+**Chunking** is also optional. When you provide a `ChunkingConfig`, the extracted text is split into overlapping fragments with configurable maximum size and overlap. Each chunk records its start and end offset relative to the original text.
+
+```python title="chunking_config.py"
+from kreuzberg import ChunkingConfig, ExtractionConfig
+
+config = ExtractionConfig(
+    chunking=ChunkingConfig(max_chars=1000, max_overlap=100)
+)
+# result.chunks → list of Chunk objects with .text, .start_offset, .end_offset
+```
+
+Chunking is designed for RAG (Retrieval-Augmented Generation) pipelines. The overlap ensures that context at chunk boundaries isn't lost when chunks are embedded and retrieved independently.
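+
+To make the overlap concrete, here is a minimal fixed-window sketch. It is not Kreuzberg's chunker, which may split on smarter boundaries; the `chunk_text` function and the local `Chunk` dataclass are hypothetical and only reuse the field and parameter names shown above.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class Chunk:
+    text: str
+    start_offset: int
+    end_offset: int
+
+def chunk_text(text: str, max_chars: int, max_overlap: int) -> list[Chunk]:
+    if not 0 <= max_overlap < max_chars:
+        raise ValueError("max_overlap must be non-negative and smaller than max_chars")
+    step = max_chars - max_overlap  # how far each new window advances
+    chunks = []
+    start = 0
+    while start < len(text):
+        end = min(start + max_chars, len(text))
+        chunks.append(Chunk(text[start:end], start, end))
+        if end == len(text):
+            break
+        start += step
+    return chunks
+
+for chunk in chunk_text("abcdefghij", max_chars=4, max_overlap=1):
+    print(chunk.start_offset, chunk.end_offset, chunk.text)
+# 0 4 abcd
+# 3 7 defg
+# 6 10 ghij
+```
+
+Each chunk repeats the last character of its predecessor, so a sentence that straddles a boundary is still partly visible from both sides.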
+
+---
+
+## 8. Post-Processors
+
+Post-processors are the final transformation step. They receive the `ExtractionResult` and can modify it in any way: clean up text, extract entities, redact sensitive content, reformat output, or add custom metadata.
+
+Post-processors run in three ordered stages so you can control what happens first:
+
+| Stage      | Purpose               | Examples                                                            |
+| ---------- | --------------------- | ------------------------------------------------------------------- |
+| **Early**  | Raw text cleanup      | Strip control characters, fix encoding issues, normalize whitespace |
+| **Middle** | Content analysis      | Extract named entities, detect language, classify document type     |
+| **Late**   | Final transformations | Apply output formatting, generate summaries, redact PII             |
+
+An important design choice: **post-processor errors do not fail the extraction.** If a post-processor throws an exception, the error is logged and the pipeline continues with the result as-is. This means a buggy post-processor can't take down your extraction pipeline.
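+
+The stage ordering and the log-and-continue behavior can be sketched together. Everything here is a hypothetical model: the `Stage` enum, the `run_post_processors` function, and the dict-shaped result are assumptions, not the library's API.
+
+```python
+import logging
+from enum import IntEnum
+
+class Stage(IntEnum):
+    EARLY = 0
+    MIDDLE = 1
+    LATE = 2
+
+def run_post_processors(result: dict, processors: list) -> dict:
+    # sorted() is stable, so processors within a stage keep registration order.
+    for stage, process in sorted(processors, key=lambda item: item[0]):
+        try:
+            result = process(result)
+        except Exception:
+            # Non-fatal: log the failure and keep the result as it was.
+            logging.exception("post-processor failed in stage %s", stage.name)
+    return result
+
+def normalize_whitespace(result):
+    return {**result, "content": " ".join(result["content"].split())}
+
+def broken(result):
+    raise RuntimeError("bug in processor")
+
+def add_length(result):
+    return {**result, "length": len(result["content"])}
+
+final = run_post_processors(
+    {"content": "hello   world "},
+    [(Stage.LATE, add_length), (Stage.MIDDLE, broken), (Stage.EARLY, normalize_whitespace)],
+)
+print(final)  # {'content': 'hello world', 'length': 11}
+```
+
+The failing `Middle` processor leaves no trace in the output, and the `Late` processor still sees the cleaned-up text from the `Early` stage.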
+
+---
+
+## 9. Cache Store + Return
+
+If caching is enabled and the extraction completed without errors, the result is written to the cache for future lookups.
+
+The final `ExtractionResult` returned to you contains:
+
+- **`content`** - the fully processed text
+- **`metadata`** — format-specific metadata (author, title, creation date, page count, etc.)
+- **`chunks`** — optional list of text chunks with offsets (if chunking was configured)
+- **`quality_score`** — optional quality assessment (if quality processing was enabled)
+- **Processing history** — a trace of which stages ran, useful for debugging
+
+---
+
+## Error Handling Strategy
+
+The pipeline follows a deliberate error strategy: fail early for things the developer can fix, be resilient for things that are beyond their control.
+
+| Stage             | Error type                                                | What happens                                                                         |
+| ----------------- | --------------------------------------------------------- | ------------------------------------------------------------------------------------ |
+| MIME detection    | `UnsupportedFormat`                                       | Pipeline stops. The file type isn't supported.                                       |
+| Format extraction | `ParsingError`                                            | Pipeline stops. The file is corrupt or the format couldn't be parsed.                |
+| Validators        | `ValidationError`                                         | Pipeline stops. The result didn't meet your defined requirements.                    |
+| Post-processors   | Non-fatal processor error                                 | Error is logged. Pipeline continues. Result is returned without that transformation. |
+| System            | I/O failure, out-of-memory, or other system-level failure | Always propagated. These indicate infrastructure problems.                           |
+
+For the complete error taxonomy, see [Error Handling](../reference/errors.md).
+
+---
+
+## Built-in Optimizations
+
+The pipeline includes several optimizations that run automatically without configuration:
+
+- **Cache short-circuits** bypass every processing stage when a cached result exists
+- **Lazy OCR** avoids redundant OCR when the format extractor already produced usable text
+- **Streaming parsers** process XML, text, and archive files incrementally with constant memory
+- **Parallel batching** with `batch_extract_file` distributes files across all CPU cores via Tokio
+- **Shared async runtime** reuses a single Tokio runtime across calls, avoiding repeated initialization
+
+---
+
+## What to Read Next
+
+- [Architecture](architecture.md) — how the system is designed
+- [Plugin System](plugin-system.md) — building custom extractors, OCR backends, and processors
+- [Format Support](../reference/formats.md) — how file types are identified
+- [Configuration Guide](../guides/configuration.md) — tuning the pipeline
+- [OCR Guide](../guides/ocr.md) — configuring OCR backends
--- a/docs/concepts/plugin-system.md
+++ b/docs/concepts/plugin-system.md
@@ -0,0 +1,254 @@
+# Plugin System <span class="version-badge">v5.0.0</span>
+
+Kreuzberg's extraction pipeline is entirely plugin-driven. Every format extractor, OCR engine, post-processor, validator, and renderer is a plugin that registers itself into a typed registry. The pipeline queries these registries at each stage to find the right handler. You extend Kreuzberg by writing your own plugin and registering it. The pipeline picks it up automatically.
+
+This page explains the five plugin types, the registry mechanism, the plugin lifecycle, and how plugins work across language boundaries.
+
+---
+
+## Overview
+
+The plugin system has three layers: plugins, registries, and the pipeline. Plugins implement a trait. Registries store them by key (MIME type, name, or processing stage). The pipeline queries the registries during extraction.
+
+```mermaid
+flowchart TB
+    subgraph layer1 ["You write plugins"]
+        direction LR
+        E["DocumentExtractor\n<i>Handles a file format</i>"]
+        O["OcrBackend\n<i>Runs OCR on images</i>"]
+        V["Validator\n<i>Rejects bad results</i>"]
+        P["PostProcessor\n<i>Transforms results</i>"]
+        R["Renderer\n<i>Formats output</i>"]
+    end
+
+    subgraph layer2 ["Registries store them"]
+        direction LR
+        ER["Extractor Registry\n<i>MIME type → extractor</i>"]
+        OR["OCR Registry\n<i>name → backend</i>"]
+        VR["Validator Registry\n<i>name → validator</i>"]
+        PR["Processor Registry\n<i>stage → processors</i>"]
+        RR["Renderer Registry\n<i>name → renderer</i>"]
+    end
+
+    subgraph layer3 ["Pipeline uses them"]
+        direction LR
+        P1["Format\nextraction"]
+        P2["OCR"]
+        P3["Validation"]
+        P4["Post-\nprocessing"]
+        P5["Rendering"]
+    end
+
+    E --> ER
+    O --> OR
+    V --> VR
+    P --> PR
+    R --> RR
+
+    ER --> P1
+    OR --> P2
+    VR --> P3
+    PR --> P4
+    RR --> P5
+
+    style ER fill:#bbdefb,stroke:#1565c0
+    style OR fill:#c8e6c9,stroke:#2e7d32
+    style VR fill:#ffccbc,stroke:#d84315
+    style PR fill:#fff9c4,stroke:#f9a825
+    style RR fill:#e1bee7,stroke:#7b1fa2
+```
+
+You register a plugin once. From that point on, the pipeline uses it wherever the MIME type, name, or stage matches. No wiring, no config files, no boilerplate.
+
+---
+
+## The Five Plugin Types
+
+### DocumentExtractor
+
+A `DocumentExtractor` teaches Kreuzberg how to extract text from a specific file format. It declares supported MIME types and provides async methods to extract from file paths or raw bytes.
+
+See [`DocumentExtractor`](../reference/types.md#documentextractor) for the trait signature.
+
+Kreuzberg ships with built-in extractors for PDF, Excel, images (routed to OCR), XML, plain text, email, and Office formats (DOCX, PPTX).
+
+**Priority resolution.** When two extractors are registered for the same MIME type, the one with the higher `priority()` value wins. Every built-in extractor has a priority of 0. To override the built-in PDF extractor with your own, register yours with a higher priority:
+
+```rust title="override_builtin.rs"
+impl DocumentExtractor for BetterPDFExtractor {
+    fn priority(&self) -> i32 { 100 }
+    // ...
+}
+```
+
+Now when the pipeline encounters `application/pdf`, it selects `BetterPDFExtractor` instead of the default.
+
+---
+
+### OcrBackend
+
+An `OcrBackend` performs optical character recognition on image data. It declares supported languages and provides async methods to process image bytes or files.
+
+See [`OcrBackend`](../reference/types.md#ocrbackend) for the trait signature.
+
+Three backends ship out of the box:
+
+| Backend       | Engine               | Strengths                                                                                         |
+| ------------- | -------------------- | ------------------------------------------------------------------------------------------------- |
+| **Tesseract** | Native Rust bindings | Fast, general-purpose, default backend. Good accuracy for Latin scripts.                          |
+| **PaddleOCR** | ONNX Runtime         | Best accuracy for CJK (Chinese, Japanese, Korean) scripts. No Python dependency.                  |
+| **EasyOCR**   | Python + PyTorch     | Supports 80+ languages including Arabic, Hindi, and Thai. Only available through Python bindings. |
+
+You can register your own OCR backend (for example, a cloud-based API or a custom model) using the same trait.
+
+---
+
+### PostProcessor
+
+A `PostProcessor` transforms extraction results after the main extraction and OCR stages are complete. Each processor declares a processing stage that determines its execution order.
+
+See [`PostProcessor`](../reference/types.md#postprocessor) for the trait signature.
+
+The three stages execute in fixed order:
+
+| Stage    | Runs   | Purpose              | Examples                                                        |
+| -------- | ------ | -------------------- | --------------------------------------------------------------- |
+| `Early`  | First  | Clean up raw text    | Strip control characters, fix encoding, normalize whitespace    |
+| `Middle` | Second | Analyze content      | Extract named entities, detect language, classify document type |
+| `Late`   | Third  | Final output shaping | Format output, generate summaries, redact PII                   |
+
+**Error handling:** Post-processor errors do not fail the extraction. The error is logged and the pipeline continues with the result as it stood before the failing processor ran, so no single processor can take down extraction.
+
+---
+
+### Validator
+
+A `Validator` inspects extraction results and can reject them if they don't meet requirements. Unlike post-processors, validator errors stop the pipeline immediately — they're a hard gate.
+
+See [`Validator`](../reference/types.md#validator) for the trait signature.
+
+Two common validator patterns:
+
+```python title="example_validators.py"
+class MinimumLengthValidator:
+    """Reject extractions that produce fewer than 100 characters."""
+    def validate(self, result, config):
+        if len(result.content) < 100:
+            raise ValidationError("Text too short")
+
+class QualityThresholdValidator:
+    """Reject extractions with a quality score below 0.5."""
+    def validate(self, result, config):
+        if (result.quality_score or 0.0) < 0.5:
+            raise ValidationError("Quality below threshold")
+```
+
+Validators run before post-processors. This means you can catch and reject bad results before any transformation work happens.
+
+---
+
+### Renderer
+
+A `Renderer` converts the internal document representation into a specific output format. It declares a name and provides a render method.
+
+```rust
+pub trait Renderer: Send + Sync {
+    fn name(&self) -> &str;
+    fn render(&self, document: &InternalDocument) -> Result<String>;
+}
+```
+
+Kreuzberg ships with four built-in renderers:
+
+| Renderer     | Output       | Description                                                              |
+| ------------ | ------------ | ------------------------------------------------------------------------ |
+| **Markdown** | GFM Markdown | GitHub Flavored Markdown via comrak AST bridge. Tables, headings, lists. |
+| **HTML**     | HTML5        | Full HTML5 rendering via comrak.                                         |
+| **djot**     | Djot         | Djot markup format.                                                      |
+| **plain**    | Plain text   | Raw text with no markup.                                                 |
+
+To register a custom renderer:
+
+```rust title="custom_renderer.rs"
+use kreuzberg::plugins::registry::get_renderer_registry;
+use std::sync::Arc;
+
+let registry = get_renderer_registry();
+let mut registry = registry.write().unwrap();
+registry.register(Arc::new(MyCustomRenderer))?;
+```
+
+Custom renderers participate in the pipeline just like built-in ones. When the user requests your renderer's name via `--content-format`, the RendererRegistry dispatches to your implementation.
+
+---
+
+## Plugin Lifecycle
+
+Every plugin follows the same lifecycle from creation to shutdown.
+
+```mermaid
+stateDiagram-v2
+    [*] --> Created: new()
+    Created --> Registered: registry.register()
+    Registered --> Active: initialize()
+    Active --> Active: called by pipeline
+    Active --> [*]: shutdown()
+```
+
+See [`Plugin`](../reference/types.md#plugin) for the base trait signature.
+
+Key behaviors: `initialize()` is called lazily the first time the plugin is used, not at registration. This avoids startup overhead for plugins that may never be invoked. `shutdown()` runs when the plugin is unregistered or on process exit. Both have default no-op implementations — override only if your plugin needs setup or cleanup.
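+
+Lazy initialization of this kind can be modeled with a thin wrapper. The sketch below is Python, not the Rust registry code; `LazyPlugin`, `UppercasePlugin`, and the `process` method are hypothetical names, and skipping `shutdown()` for a plugin that was never initialized is an assumption the text does not confirm.
+
+```python
+import threading
+
+class LazyPlugin:
+    """Hypothetical wrapper that defers initialize() until the first call."""
+
+    def __init__(self, plugin):
+        self._plugin = plugin
+        self._initialized = False
+        self._lock = threading.Lock()  # the pipeline may call from several threads
+
+    def call(self, *args):
+        with self._lock:
+            if not self._initialized:
+                self._plugin.initialize()
+                self._initialized = True
+        return self._plugin.process(*args)
+
+    def shutdown(self):
+        with self._lock:
+            # Assumption: a plugin that was never initialized needs no cleanup.
+            if self._initialized:
+                self._plugin.shutdown()
+                self._initialized = False
+
+class UppercasePlugin:
+    def __init__(self):
+        self.events = []
+
+    def initialize(self):
+        self.events.append("initialize")
+
+    def process(self, text):
+        self.events.append("process")
+        return text.upper()
+
+    def shutdown(self):
+        self.events.append("shutdown")
+
+plugin = UppercasePlugin()
+lazy = LazyPlugin(plugin)  # registration: nothing is initialized yet
+lazy.call("a")
+lazy.call("b")
+lazy.shutdown()
+print(plugin.events)  # ['initialize', 'process', 'process', 'shutdown']
+```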
+
+---
+
+## Registering Plugins
+
+Get the appropriate registry for your plugin type and call `register()`. Once registered, the pipeline automatically dispatches to your plugin based on MIME type (extractors), backend name (OCR), processing stage (post-processors), validator name (validators), or renderer name (renderers).
+
+---
+
+## Cross-Language Plugins
+
+Plugins written in Python can integrate directly with the Rust extraction pipeline via PyO3 FFI. The bridge layer handles all type conversion automatically.
+
+```mermaid
+sequenceDiagram
+    participant P as Python Plugin
+    participant B as PyO3 Bridge
+    participant R as Rust Pipeline
+
+    P->>B: register(plugin)
+    B->>R: Store as Arc<dyn DocumentExtractor>
+
+    Note over R: During extraction...
+    R->>B: extract_file(path, mime, config)
+    B->>P: Call plugin.extract_file()
+    P-->>B: Return result as dict
+    B-->>R: Convert to ExtractionResult
+```
+
+Type mapping: `Vec<u8>` ↔ `bytes`, `String` ↔ `str`, Rust structs ↔ Python dataclasses. Large buffers use Python's buffer protocol to minimize copying.
+
+---
+
+## Thread Safety
+
+All plugins must implement `Send + Sync` because the extraction pipeline invokes them concurrently from Tokio's worker thread pool. For mutable internal state, use `Mutex`, `RwLock`, or atomic types. The compiler will enforce this requirement.
+
+---
+
+## Plugin Discovery
+
+Plugins can be registered in two ways:
+
+1. **Built-in** — automatically registered when Kreuzberg initializes. These are the default extractors, OCR backends, and processors.
+2. **Programmatic** — registered manually via the registry API at runtime.
+
+---
+
+## What to Read Next
+
+- [Creating Plugins](../guides/plugins.md) — step-by-step guide to building a custom plugin
+- [Extraction Pipeline](extraction-pipeline.md) — where each plugin type fits in the extraction flow
+- [Architecture](architecture.md) — overall system design
+- [API Reference](../reference/api-python.md) — plugin API documentation