Files
fil/docs/guides/plugins.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

7.2 KiB
Raw Permalink Blame History

Creating Plugins v4.0.0

Extend Kreuzberg with custom extractors, post-processors, OCR backends, and validators registered globally for use across all extraction calls.

!!! Note "Wasm" Custom plugins are not supported in Wasm environments. Use Python, Rust, or other native bindings.

Plugin Types

Type Purpose Use case
DocumentExtractor Extract content from file formats New format support, override built-in extractors
PostProcessor Transform extraction results Metadata enrichment, content filtering, text normalization
OcrBackend Perform OCR on images Cloud OCR services, custom OCR engines
Validator Validate extraction quality Minimum content length, quality score thresholds

All plugins must be thread-safe (Send + Sync in Rust, thread-safe in Python) and implement initialize() / shutdown() lifecycle methods.

Document Extractors

Implementation

=== "Rust"

--8<-- "snippets/rust/plugins/plugin_extractor.md"

=== "Python"

--8<-- "snippets/python/plugins/plugin_extractor.md"

Registration

=== "Python"

--8<-- "snippets/python/plugins/extractor_registration.md"

=== "TypeScript"

--8<-- "snippets/typescript/plugins/custom_extractor_plugin.md"

=== "Rust"

--8<-- "snippets/rust/plugins/extractor_registration.md"

=== "Go"

--8<-- "snippets/go/plugins/extractor_registration.md"

=== "Java"

--8<-- "snippets/java/plugins/extractor_registration.md"

=== "C#"

--8<-- "snippets/csharp/extractor_registration.md"

=== "Ruby"

--8<-- "snippets/ruby/plugins/extractor_registration.md"

=== "R"

--8<-- "snippets/r/plugins/extractor_registration.md"

Priority System

When multiple extractors support the same MIME type, the highest priority wins:

Range Level
025 Fallback / low-quality
2649 Alternative
50 Default (built-in)
5175 Enhanced / premium
76100 Specialized / high-priority

Post-Processors

Processors execute in three stages:

  • Early — Foundational: language detection, quality scoring, text normalization
  • Middle — Transformation: keyword extraction, token reduction, summarization
  • Late — Final: custom metadata, analytics, output formatting

Implementation

=== "Rust"

--8<-- "snippets/rust/plugins/word_count_processor.md"

=== "Python"

--8<-- "snippets/python/plugins/word_count_processor.md"

Conditional Processing

=== "Python"

--8<-- "snippets/python/plugins/pdf_only_processor.md"

=== "Rust"

--8<-- "snippets/rust/metadata/pdf_only_processor.md"

=== "Go"

--8<-- "snippets/go/plugins/pdf_only_processor.md"

=== "Java"

--8<-- "snippets/java/plugins/pdf_only_processor.md"

=== "C#"

--8<-- "snippets/csharp/pdf_only_processor.md"

OCR Backends

Implementation

=== "Rust"

--8<-- "snippets/rust/ocr/cloud_ocr_backend.md"

=== "Python"

--8<-- "snippets/python/ocr/cloud_ocr_backend.md"

=== "Java"

--8<-- "snippets/java/ocr/cloud_ocr_backend.md"

=== "C#"

--8<-- "snippets/csharp/cloud_ocr_backend.md"

=== "Ruby"

--8<-- "snippets/ruby/ocr/cloud_ocr_backend.md"

=== "R"

--8<-- "snippets/r/ocr/cloud_ocr_backend.md"

Registration

Register the backend and set its name in OcrConfig:

=== "Python"

```python title="Python"
from kreuzberg import register_ocr_backend, unregister_ocr_backend

backend = CloudOcrBackend(api_key="your-api-key")
register_ocr_backend(backend)

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(ocr=OcrConfig(backend="cloud-ocr", language="eng"))
result = extract_file_sync("scanned.pdf", config=config)

unregister_ocr_backend("cloud-ocr")
```

Using EasyOCR (Built-in)

The built-in EasyOCR backend supports 80+ languages and GPU acceleration — just point OcrConfig at it:

=== "Python"

--8<-- "snippets/python/ocr/ocr_easyocr.md"

Validators

!!! Warning Validation errors cause extraction to fail. Use validators for critical quality checks only.

=== "Rust"

--8<-- "snippets/rust/plugins/min_length_validator.md"

=== "Python"

--8<-- "snippets/python/plugins/min_length_validator.md"

=== "Java"

--8<-- "snippets/java/plugins/min_length_validator.md"

=== "C#"

--8<-- "snippets/csharp/min_length_validator.md"

Quality Score Validator

=== "Rust"

--8<-- "snippets/rust/plugins/quality_score_validator.md"

=== "Python"

--8<-- "snippets/python/plugins/quality_score_validator.md"

=== "Java"

--8<-- "snippets/java/plugins/quality_score_validator.md"

=== "C#"

--8<-- "snippets/csharp/quality_score_validator.md"

Plugin Management

Listing

=== "Python"

--8<-- "snippets/python/plugins/list_plugins.md"

=== "Rust"

--8<-- "snippets/rust/plugins/list_plugins.md"

=== "Java"

--8<-- "snippets/java/plugins/list_plugins.md"

=== "C#"

--8<-- "snippets/csharp/list_plugins.md"

Unregistering

=== "Python"

--8<-- "snippets/python/plugins/unregister_plugins.md"

=== "Rust"

--8<-- "snippets/rust/plugins/unregister_plugins.md"

=== "Java"

--8<-- "snippets/java/plugins/unregister_plugins.md"

=== "C#"

--8<-- "snippets/csharp/unregister_plugins.md"

Clearing All

=== "Python"

--8<-- "snippets/python/plugins/clear_plugins.md"

=== "Rust"

--8<-- "snippets/rust/plugins/clear_plugins.md"

=== "Java"

--8<-- "snippets/java/plugins/clear_plugins.md"

=== "C#"

--8<-- "snippets/csharp/clear_plugins.md"

Thread Safety

=== "Rust"

--8<-- "snippets/rust/plugins/stateful_plugin.md"

=== "Python"

--8<-- "snippets/python/plugins/stateful_plugin.md"

=== "Java"

--8<-- "snippets/java/plugins/stateful_plugin.md"

=== "C#"

--8<-- "snippets/csharp/stateful_plugin.md"

Best Practices

Naming: Use kebab-case (my-custom-plugin), lowercase only, no spaces or special characters.

Logging

=== "Python"

--8<-- "snippets/python/plugins/plugin_logging.md"

=== "Rust"

--8<-- "snippets/rust/plugins/plugin_logging.md"

=== "Java"

--8<-- "snippets/java/plugins/plugin_logging.md"

=== "C#"

--8<-- "snippets/csharp/plugin_logging.md"

Testing

=== "Python"

--8<-- "snippets/python/plugins/plugin_testing.md"

=== "Rust"

--8<-- "snippets/rust/plugins/plugin_testing.md"

=== "Java"

--8<-- "snippets/java/plugins/plugin_testing.md"

=== "C#"

--8<-- "snippets/csharp/plugin_testing.md"

Complete Example: PDF Metadata Extractor

=== "Python"

--8<-- "snippets/python/metadata/pdf_metadata_extractor.md"

=== "Go"

--8<-- "snippets/go/plugins/pdf_metadata_extractor.md"

=== "Java"

--8<-- "snippets/java/plugins/pdf_metadata_extractor.md"

=== "C#"

--8<-- "snippets/csharp/pdf_metadata_extractor.md"

=== "Ruby"

--8<-- "snippets/ruby/plugins/pdf_metadata_extractor.md"

=== "R"

--8<-- "snippets/r/plugins/pdf_metadata_extractor.md"