349 lines
7.2 KiB
Markdown
349 lines
7.2 KiB
Markdown
# Creating Plugins <span class="version-badge">v4.0.0</span>
|
||
|
||
Extend Kreuzberg with custom extractors, post-processors, OCR backends, and validators registered globally for use across all extraction calls.
|
||
|
||
!!! Note "Wasm" Custom plugins are not supported in Wasm environments. Use Python, Rust, or other native bindings.
|
||
|
||
## Plugin Types
|
||
|
||
| Type | Purpose | Use case |
|
||
| --------------------- | --------------------------------- | ---------------------------------------------------------- |
|
||
| **DocumentExtractor** | Extract content from file formats | New format support, override built-in extractors |
|
||
| **PostProcessor** | Transform extraction results | Metadata enrichment, content filtering, text normalization |
|
||
| **OcrBackend** | Perform OCR on images | Cloud OCR services, custom OCR engines |
|
||
| **Validator** | Validate extraction quality | Minimum content length, quality score thresholds |
|
||
|
||
All plugins must be thread-safe (`Send + Sync` in Rust, thread-safe in Python) and implement `initialize()` / `shutdown()` lifecycle methods.
|
||
|
||
## Document Extractors
|
||
|
||
### Implementation
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/plugin_extractor.md"
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/plugin_extractor.md"
|
||
|
||
### Registration
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/extractor_registration.md"
|
||
|
||
=== "TypeScript"
|
||
|
||
--8<-- "snippets/typescript/plugins/custom_extractor_plugin.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/extractor_registration.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/plugins/extractor_registration.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/extractor_registration.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/extractor_registration.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/plugins/extractor_registration.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/plugins/extractor_registration.md"
|
||
|
||
### Priority System
|
||
|
||
When multiple extractors support the same MIME type, the highest priority wins:
|
||
|
||
| Range | Level |
|
||
| ------ | --------------------------- |
|
||
| 0–25 | Fallback / low-quality |
|
||
| 26–49 | Alternative |
|
||
| **50** | **Default (built-in)** |
|
||
| 51–75 | Enhanced / premium |
|
||
| 76–100 | Specialized / high-priority |
|
||
|
||
## Post-Processors
|
||
|
||
Processors execute in three stages:
|
||
|
||
- **Early** — Foundational: language detection, quality scoring, text normalization
|
||
- **Middle** — Transformation: keyword extraction, token reduction, summarization
|
||
- **Late** — Final: custom metadata, analytics, output formatting
|
||
|
||
### Implementation
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/word_count_processor.md"
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/word_count_processor.md"
|
||
|
||
### Conditional Processing
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/pdf_only_processor.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/metadata/pdf_only_processor.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/plugins/pdf_only_processor.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/pdf_only_processor.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/pdf_only_processor.md"
|
||
|
||
## OCR Backends
|
||
|
||
### Implementation
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/ocr/cloud_ocr_backend.md"
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/ocr/cloud_ocr_backend.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/ocr/cloud_ocr_backend.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/cloud_ocr_backend.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/ocr/cloud_ocr_backend.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/ocr/cloud_ocr_backend.md"
|
||
|
||
### Registration
|
||
|
||
Register the backend and set its name in `OcrConfig`:
|
||
|
||
=== "Python"
|
||
|
||
```python title="Python"
|
||
from kreuzberg import register_ocr_backend, unregister_ocr_backend
|
||
|
||
backend = CloudOcrBackend(api_key="your-api-key")
|
||
register_ocr_backend(backend)
|
||
|
||
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
|
||
|
||
config = ExtractionConfig(ocr=OcrConfig(backend="cloud-ocr", language="eng"))
|
||
result = extract_file_sync("scanned.pdf", config=config)
|
||
|
||
unregister_ocr_backend("cloud-ocr")
|
||
```
|
||
|
||
### Using EasyOCR (Built-in)
|
||
|
||
The built-in EasyOCR backend supports 80+ languages and GPU acceleration — just point `OcrConfig` at it:
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/ocr/ocr_easyocr.md"
|
||
|
||
## Validators
|
||
|
||
!!! Warning Validation errors cause extraction to fail. Use validators for critical quality checks only.
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/min_length_validator.md"
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/min_length_validator.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/min_length_validator.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/min_length_validator.md"
|
||
|
||
### Quality Score Validator
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/quality_score_validator.md"
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/quality_score_validator.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/quality_score_validator.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/quality_score_validator.md"
|
||
|
||
## Plugin Management
|
||
|
||
### Listing
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/list_plugins.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/list_plugins.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/list_plugins.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/list_plugins.md"
|
||
|
||
### Unregistering
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/unregister_plugins.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/unregister_plugins.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/unregister_plugins.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/unregister_plugins.md"
|
||
|
||
### Clearing All
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/clear_plugins.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/clear_plugins.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/clear_plugins.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/clear_plugins.md"
|
||
|
||
## Thread Safety
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/stateful_plugin.md"
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/stateful_plugin.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/stateful_plugin.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/stateful_plugin.md"
|
||
|
||
## Best Practices
|
||
|
||
**Naming:** Use kebab-case (`my-custom-plugin`), lowercase only, no spaces or special characters.
|
||
|
||
### Logging
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/plugin_logging.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/plugin_logging.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/plugin_logging.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/plugin_logging.md"
|
||
|
||
### Testing
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/plugins/plugin_testing.md"
|
||
|
||
=== "Rust"
|
||
|
||
--8<-- "snippets/rust/plugins/plugin_testing.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/plugin_testing.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/plugin_testing.md"
|
||
|
||
## Complete Example: PDF Metadata Extractor
|
||
|
||
=== "Python"
|
||
|
||
--8<-- "snippets/python/metadata/pdf_metadata_extractor.md"
|
||
|
||
=== "Go"
|
||
|
||
--8<-- "snippets/go/plugins/pdf_metadata_extractor.md"
|
||
|
||
=== "Java"
|
||
|
||
--8<-- "snippets/java/plugins/pdf_metadata_extractor.md"
|
||
|
||
=== "C#"
|
||
|
||
--8<-- "snippets/csharp/pdf_metadata_extractor.md"
|
||
|
||
=== "Ruby"
|
||
|
||
--8<-- "snippets/ruby/plugins/pdf_metadata_extractor.md"
|
||
|
||
=== "R"
|
||
|
||
--8<-- "snippets/r/plugins/pdf_metadata_extractor.md"
|