Creating Plugins v4.0.0
Extend Kreuzberg with custom extractors, post-processors, OCR backends, and validators registered globally for use across all extraction calls.
!!! Note "Wasm"
    Custom plugins are not supported in Wasm environments. Use Python, Rust, or other native bindings.
Plugin Types
| Type | Purpose | Use case |
|---|---|---|
| DocumentExtractor | Extract content from file formats | New format support, override built-in extractors |
| PostProcessor | Transform extraction results | Metadata enrichment, content filtering, text normalization |
| OcrBackend | Perform OCR on images | Cloud OCR services, custom OCR engines |
| Validator | Validate extraction quality | Minimum content length, quality score thresholds |
All plugins must be thread-safe (Send + Sync in Rust, safe for concurrent calls in Python) and must implement the initialize() / shutdown() lifecycle methods.
Document Extractors
Implementation
=== "Rust"
--8<-- "snippets/rust/plugins/plugin_extractor.md"
=== "Python"
--8<-- "snippets/python/plugins/plugin_extractor.md"
Registration
=== "Python"
--8<-- "snippets/python/plugins/extractor_registration.md"
=== "TypeScript"
--8<-- "snippets/typescript/plugins/custom_extractor_plugin.md"
=== "Rust"
--8<-- "snippets/rust/plugins/extractor_registration.md"
=== "Go"
--8<-- "snippets/go/plugins/extractor_registration.md"
=== "Java"
--8<-- "snippets/java/plugins/extractor_registration.md"
=== "C#"
--8<-- "snippets/csharp/extractor_registration.md"
=== "Ruby"
--8<-- "snippets/ruby/plugins/extractor_registration.md"
=== "R"
--8<-- "snippets/r/plugins/extractor_registration.md"
Priority System
When multiple extractors support the same MIME type, the highest priority wins:
| Range | Level |
|---|---|
| 0–25 | Fallback / low-quality |
| 26–49 | Alternative |
| 50 | Default (built-in) |
| 51–75 | Enhanced / premium |
| 76–100 | Specialized / high-priority |
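The selection rule above can be illustrated with a small standalone sketch. It does not use Kreuzberg's registry: `ExtractorEntry` and `select_extractor` are hypothetical names invented for this example, and the tie-breaking behavior (first registered wins) is an assumption of the sketch, not documented behavior.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class ExtractorEntry:
    """Hypothetical registry entry; not part of the Kreuzberg API."""

    name: str
    mime_types: FrozenSet[str]
    priority: int = 50  # 50 matches the built-in default level in the table


def select_extractor(
    entries: List[ExtractorEntry], mime_type: str
) -> Optional[ExtractorEntry]:
    """Return the highest-priority entry that supports mime_type, or None."""
    candidates = [e for e in entries if mime_type in e.mime_types]
    if not candidates:
        return None
    # max() keeps the first of several equal-priority entries (sketch assumption).
    return max(candidates, key=lambda e: e.priority)


entries = [
    ExtractorEntry("built-in-pdf", frozenset({"application/pdf"}), 50),
    ExtractorEntry("custom-pdf", frozenset({"application/pdf"}), 80),
    ExtractorEntry("fallback-text", frozenset({"text/plain"}), 10),
]

chosen = select_extractor(entries, "application/pdf")
print(chosen.name)  # custom-pdf
```

Here a custom extractor registered at priority 80 takes precedence over the built-in one at 50, while one registered below 50 would only act as a fallback.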
Post-Processors
Processors execute in three stages:
- Early — Foundational: language detection, quality scoring, text normalization
- Middle — Transformation: keyword extraction, token reduction, summarization
- Late — Final: custom metadata, analytics, output formatting
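As a rough illustration of the three stages, the sketch below orders a list of processors by stage. `Stage` and `order_processors` are hypothetical names for this example only and are not the Kreuzberg API; it also assumes that processors within one stage keep their registration order, which the text above does not specify.

```python
from enum import IntEnum
from typing import List, Tuple


class Stage(IntEnum):
    """Hypothetical stage enum mirroring the three stages listed above."""

    EARLY = 0
    MIDDLE = 1
    LATE = 2


def order_processors(processors: List[Tuple[str, Stage]]) -> List[Tuple[str, Stage]]:
    """Sort (name, stage) pairs so that earlier stages run first."""
    # sorted() is stable: processors in the same stage keep their input order.
    return sorted(processors, key=lambda item: item[1])


registered = [
    ("output-formatting", Stage.LATE),
    ("keyword-extraction", Stage.MIDDLE),
    ("language-detection", Stage.EARLY),
    ("text-normalization", Stage.EARLY),
]

ordered = [name for name, _ in order_processors(registered)]
print(ordered)
```

Whatever the registration order, the early processors run before the middle ones, and the late processors run last.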
Implementation
=== "Rust"
--8<-- "snippets/rust/plugins/word_count_processor.md"
=== "Python"
--8<-- "snippets/python/plugins/word_count_processor.md"
Conditional Processing
=== "Python"
--8<-- "snippets/python/plugins/pdf_only_processor.md"
=== "Rust"
--8<-- "snippets/rust/metadata/pdf_only_processor.md"
=== "Go"
--8<-- "snippets/go/plugins/pdf_only_processor.md"
=== "Java"
--8<-- "snippets/java/plugins/pdf_only_processor.md"
=== "C#"
--8<-- "snippets/csharp/pdf_only_processor.md"
OCR Backends
Implementation
=== "Rust"
--8<-- "snippets/rust/ocr/cloud_ocr_backend.md"
=== "Python"
--8<-- "snippets/python/ocr/cloud_ocr_backend.md"
=== "Java"
--8<-- "snippets/java/ocr/cloud_ocr_backend.md"
=== "C#"
--8<-- "snippets/csharp/cloud_ocr_backend.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/cloud_ocr_backend.md"
=== "R"
--8<-- "snippets/r/ocr/cloud_ocr_backend.md"
Registration
Register the backend, then select it by name in OcrConfig:
=== "Python"
```python title="Python"
from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync
from kreuzberg import register_ocr_backend, unregister_ocr_backend

# CloudOcrBackend is the custom class defined under Implementation above.
backend = CloudOcrBackend(api_key="your-api-key")
register_ocr_backend(backend)

# Select the backend by the name it registered under.
config = ExtractionConfig(ocr=OcrConfig(backend="cloud-ocr", language="eng"))
result = extract_file_sync("scanned.pdf", config=config)

# Remove the backend once it is no longer needed.
unregister_ocr_backend("cloud-ocr")
```
Using EasyOCR (Built-in)
The built-in EasyOCR backend supports 80+ languages and GPU acceleration. To use it, select it in OcrConfig:
=== "Python"
--8<-- "snippets/python/ocr/ocr_easyocr.md"
Validators
!!! Warning
    Validation errors cause extraction to fail. Use validators for critical quality checks only.
=== "Rust"
--8<-- "snippets/rust/plugins/min_length_validator.md"
=== "Python"
--8<-- "snippets/python/plugins/min_length_validator.md"
=== "Java"
--8<-- "snippets/java/plugins/min_length_validator.md"
=== "C#"
--8<-- "snippets/csharp/min_length_validator.md"
Quality Score Validator
=== "Rust"
--8<-- "snippets/rust/plugins/quality_score_validator.md"
=== "Python"
--8<-- "snippets/python/plugins/quality_score_validator.md"
=== "Java"
--8<-- "snippets/java/plugins/quality_score_validator.md"
=== "C#"
--8<-- "snippets/csharp/quality_score_validator.md"
Plugin Management
Listing
=== "Python"
--8<-- "snippets/python/plugins/list_plugins.md"
=== "Rust"
--8<-- "snippets/rust/plugins/list_plugins.md"
=== "Java"
--8<-- "snippets/java/plugins/list_plugins.md"
=== "C#"
--8<-- "snippets/csharp/list_plugins.md"
Unregistering
=== "Python"
--8<-- "snippets/python/plugins/unregister_plugins.md"
=== "Rust"
--8<-- "snippets/rust/plugins/unregister_plugins.md"
=== "Java"
--8<-- "snippets/java/plugins/unregister_plugins.md"
=== "C#"
--8<-- "snippets/csharp/unregister_plugins.md"
Clearing All
=== "Python"
--8<-- "snippets/python/plugins/clear_plugins.md"
=== "Rust"
--8<-- "snippets/rust/plugins/clear_plugins.md"
=== "Java"
--8<-- "snippets/java/plugins/clear_plugins.md"
=== "C#"
--8<-- "snippets/csharp/clear_plugins.md"
Thread Safety
=== "Rust"
--8<-- "snippets/rust/plugins/stateful_plugin.md"
=== "Python"
--8<-- "snippets/python/plugins/stateful_plugin.md"
=== "Java"
--8<-- "snippets/java/plugins/stateful_plugin.md"
=== "C#"
--8<-- "snippets/csharp/stateful_plugin.md"
Best Practices
Naming: Use kebab-case (my-custom-plugin), lowercase only, no spaces or special characters.
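One way to enforce the naming rule before registering a plugin is a small check like the one below. `is_valid_plugin_name` is a hypothetical helper, not a Kreuzberg function, and allowing digits in names is an assumption of this sketch.

```python
import re

# Lowercase alphanumeric words joined by single hyphens.
# Allowing digits is an assumption; the rule above does not mention them.
_KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_plugin_name(name: str) -> bool:
    """Return True if name is lowercase kebab-case with no spaces or symbols."""
    return _KEBAB_CASE.fullmatch(name) is not None


print(is_valid_plugin_name("my-custom-plugin"))  # True
print(is_valid_plugin_name("My Custom Plugin"))  # False
print(is_valid_plugin_name("my_custom_plugin"))  # False
```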
Logging
=== "Python"
--8<-- "snippets/python/plugins/plugin_logging.md"
=== "Rust"
--8<-- "snippets/rust/plugins/plugin_logging.md"
=== "Java"
--8<-- "snippets/java/plugins/plugin_logging.md"
=== "C#"
--8<-- "snippets/csharp/plugin_logging.md"
Testing
=== "Python"
--8<-- "snippets/python/plugins/plugin_testing.md"
=== "Rust"
--8<-- "snippets/rust/plugins/plugin_testing.md"
=== "Java"
--8<-- "snippets/java/plugins/plugin_testing.md"
=== "C#"
--8<-- "snippets/csharp/plugin_testing.md"
Complete Example: PDF Metadata Extractor
=== "Python"
--8<-- "snippets/python/metadata/pdf_metadata_extractor.md"
=== "Go"
--8<-- "snippets/go/plugins/pdf_metadata_extractor.md"
=== "Java"
--8<-- "snippets/java/plugins/pdf_metadata_extractor.md"
=== "C#"
--8<-- "snippets/csharp/pdf_metadata_extractor.md"
=== "Ruby"
--8<-- "snippets/ruby/plugins/pdf_metadata_extractor.md"
=== "R"
--8<-- "snippets/r/plugins/pdf_metadata_extractor.md"