# Creating Plugins <span class="version-badge">v4.0.0</span>
Extend Kreuzberg with custom extractors, post-processors, OCR backends, and validators. Plugins are registered globally and apply to all extraction calls.
!!! Note "Wasm"
    Custom plugins are not supported in Wasm environments. Use Python, Rust, or other native bindings.
## Plugin Types
| Type | Purpose | Use case |
| --------------------- | --------------------------------- | ---------------------------------------------------------- |
| **DocumentExtractor** | Extract content from file formats | New format support, override built-in extractors |
| **PostProcessor** | Transform extraction results | Metadata enrichment, content filtering, text normalization |
| **OcrBackend** | Perform OCR on images | Cloud OCR services, custom OCR engines |
| **Validator** | Validate extraction quality | Minimum content length, quality score thresholds |
All plugins must be thread-safe (`Send + Sync` in Rust, thread-safe in Python) and implement `initialize()` / `shutdown()` lifecycle methods.
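As a rough illustration of these two requirements, the sketch below shows a plain Python class that guards its shared state with a lock and exposes `initialize()` / `shutdown()`. The class name `CountingPlugin`, its `name()` and `process()` methods, and the call counter are hypothetical; the class does not derive from any Kreuzberg base type, and the real plugin interfaces are the ones shown in the snippets of the sections that follow.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingPlugin:
    """Hypothetical plugin shape: lifecycle hooks plus lock-guarded state."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls = 0
        self._ready = False

    def name(self) -> str:
        return "counting-plugin"

    def initialize(self) -> None:
        # Acquire resources (clients, models, caches) here.
        with self._lock:
            self._ready = True

    def shutdown(self) -> None:
        # Release resources here.
        with self._lock:
            self._ready = False

    def process(self, text: str) -> int:
        # Assumed entry point that a host may call from several threads.
        with self._lock:
            if not self._ready:
                raise RuntimeError("plugin used before initialize()")
            self._calls += 1
        return len(text.split())

    @property
    def calls(self) -> int:
        with self._lock:
            return self._calls


def run_demo() -> int:
    """Call the plugin from 8 worker threads and return the call count."""
    plugin = CountingPlugin()
    plugin.initialize()
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(plugin.process, ["some text"] * 200))
    total = plugin.calls
    plugin.shutdown()
    return total
```

Without the lock, the `self._calls += 1` update could lose increments under concurrent calls, which is the kind of bug the thread-safety requirement is meant to rule out.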
## Document Extractors
### Implementation
=== "Rust"
--8<-- "snippets/rust/plugins/plugin_extractor.md"
=== "Python"
--8<-- "snippets/python/plugins/plugin_extractor.md"
### Registration
=== "Python"
--8<-- "snippets/python/plugins/extractor_registration.md"
=== "TypeScript"
--8<-- "snippets/typescript/plugins/custom_extractor_plugin.md"
=== "Rust"
--8<-- "snippets/rust/plugins/extractor_registration.md"
=== "Go"
--8<-- "snippets/go/plugins/extractor_registration.md"
=== "Java"
--8<-- "snippets/java/plugins/extractor_registration.md"
=== "C#"
--8<-- "snippets/csharp/extractor_registration.md"
=== "Ruby"
--8<-- "snippets/ruby/plugins/extractor_registration.md"
=== "R"
--8<-- "snippets/r/plugins/extractor_registration.md"
### Priority System
When multiple extractors support the same MIME type, the highest priority wins:
| Range | Level |
| ------ | --------------------------- |
| 0-25 | Fallback / low-quality |
| 26-49 | Alternative |
| **50** | **Default (built-in)** |
| 51-75 | Enhanced / premium |
| 76-100 | Specialized / high-priority |
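The selection rule can be sketched in a few lines of plain Python. This is a simplified model, not Kreuzberg's registry code: `ExtractorEntry`, `select_extractor`, and the sample entries are hypothetical names used only for illustration, and the handling of equal priorities (the first matching entry wins here) is an assumption the source does not specify.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 50  # priority of built-in extractors, per the table above


@dataclass(frozen=True)
class ExtractorEntry:
    """Hypothetical registry record for one extractor."""
    name: str
    mime_types: tuple
    priority: int = DEFAULT_PRIORITY


def select_extractor(entries, mime_type):
    """Return the highest-priority entry supporting mime_type, or None."""
    candidates = [e for e in entries if mime_type in e.mime_types]
    if not candidates:
        return None
    # max() keeps the first entry on ties (assumed tie-breaking rule).
    return max(candidates, key=lambda e: e.priority)


registry = [
    ExtractorEntry("builtin-pdf", ("application/pdf",)),
    ExtractorEntry("fallback-pdf", ("application/pdf",), priority=10),
    ExtractorEntry("custom-pdf", ("application/pdf",), priority=80),
]

chosen = select_extractor(registry, "application/pdf")
```

With these sample entries, `chosen` is the `custom-pdf` entry, because 80 outranks both the built-in default of 50 and the fallback at 10.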
## Post-Processors
Processors execute in three stages:
- **Early** — Foundational: language detection, quality scoring, text normalization
- **Middle** — Transformation: keyword extraction, token reduction, summarization
- **Late** — Final: custom metadata, analytics, output formatting
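The staged ordering above can be modeled as a stable sort on a stage value. The sketch below is a simplified illustration, not Kreuzberg's pipeline: the `Stage` enum, `run_pipeline`, the dict-based result, and the three sample processors are all hypothetical, and keeping registration order within a stage is an assumption.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical stage markers mirroring the three stages above."""
    EARLY = 0
    MIDDLE = 1
    LATE = 2


def run_pipeline(result, processors):
    """Apply (stage, func) pairs in stage order.

    sorted() is stable, so processors that share a stage keep the order
    in which they were listed (an assumed rule, not a documented one).
    """
    for _, func in sorted(processors, key=lambda pair: pair[0]):
        result = func(result)
    return result


def detect_language(result):
    # Placeholder value; no real language detection happens here.
    return {**result, "language": "en"}


def add_word_count(result):
    return {**result, "word_count": len(result["content"].split())}


def add_summary_line(result):
    # Depends on fields written by the two earlier stages.
    return {**result, "summary": f'{result["language"]}: {result["word_count"]} words'}


processors = [
    (Stage.LATE, add_summary_line),
    (Stage.EARLY, detect_language),
    (Stage.MIDDLE, add_word_count),
]

output = run_pipeline({"content": "hello plugin world"}, processors)
```

The processors are deliberately listed out of order: the late-stage function only works because the early and middle stages have already populated `language` and `word_count`.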
### Implementation
=== "Rust"
--8<-- "snippets/rust/plugins/word_count_processor.md"
=== "Python"
--8<-- "snippets/python/plugins/word_count_processor.md"
### Conditional Processing
=== "Python"
--8<-- "snippets/python/plugins/pdf_only_processor.md"
=== "Rust"
--8<-- "snippets/rust/metadata/pdf_only_processor.md"
=== "Go"
--8<-- "snippets/go/plugins/pdf_only_processor.md"
=== "Java"
--8<-- "snippets/java/plugins/pdf_only_processor.md"
=== "C#"
--8<-- "snippets/csharp/pdf_only_processor.md"
## OCR Backends
### Implementation
=== "Rust"
--8<-- "snippets/rust/ocr/cloud_ocr_backend.md"
=== "Python"
--8<-- "snippets/python/ocr/cloud_ocr_backend.md"
=== "Java"
--8<-- "snippets/java/ocr/cloud_ocr_backend.md"
=== "C#"
--8<-- "snippets/csharp/cloud_ocr_backend.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/cloud_ocr_backend.md"
=== "R"
--8<-- "snippets/r/ocr/cloud_ocr_backend.md"
### Registration
Register the backend, then select it by name in `OcrConfig`:
=== "Python"
```python title="Python"
from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync
from kreuzberg import register_ocr_backend, unregister_ocr_backend

# CloudOcrBackend is the class from the Implementation tab above.
backend = CloudOcrBackend(api_key="your-api-key")
register_ocr_backend(backend)

# Select the backend by its registered name.
config = ExtractionConfig(ocr=OcrConfig(backend="cloud-ocr", language="eng"))
result = extract_file_sync("scanned.pdf", config=config)

# Unregister once the backend is no longer needed.
unregister_ocr_backend("cloud-ocr")
```
### Using EasyOCR (Built-in)
The built-in EasyOCR backend supports 80+ languages and GPU acceleration — just point `OcrConfig` at it:
=== "Python"
--8<-- "snippets/python/ocr/ocr_easyocr.md"
## Validators
!!! Warning
    Validation errors cause extraction to fail. Use validators for critical quality checks only.
=== "Rust"
--8<-- "snippets/rust/plugins/min_length_validator.md"
=== "Python"
--8<-- "snippets/python/plugins/min_length_validator.md"
=== "Java"
--8<-- "snippets/java/plugins/min_length_validator.md"
=== "C#"
--8<-- "snippets/csharp/min_length_validator.md"
### Quality Score Validator
=== "Rust"
--8<-- "snippets/rust/plugins/quality_score_validator.md"
=== "Python"
--8<-- "snippets/python/plugins/quality_score_validator.md"
=== "Java"
--8<-- "snippets/java/plugins/quality_score_validator.md"
=== "C#"
--8<-- "snippets/csharp/quality_score_validator.md"
## Plugin Management
### Listing
=== "Python"
--8<-- "snippets/python/plugins/list_plugins.md"
=== "Rust"
--8<-- "snippets/rust/plugins/list_plugins.md"
=== "Java"
--8<-- "snippets/java/plugins/list_plugins.md"
=== "C#"
--8<-- "snippets/csharp/list_plugins.md"
### Unregistering
=== "Python"
--8<-- "snippets/python/plugins/unregister_plugins.md"
=== "Rust"
--8<-- "snippets/rust/plugins/unregister_plugins.md"
=== "Java"
--8<-- "snippets/java/plugins/unregister_plugins.md"
=== "C#"
--8<-- "snippets/csharp/unregister_plugins.md"
### Clearing All
=== "Python"
--8<-- "snippets/python/plugins/clear_plugins.md"
=== "Rust"
--8<-- "snippets/rust/plugins/clear_plugins.md"
=== "Java"
--8<-- "snippets/java/plugins/clear_plugins.md"
=== "C#"
--8<-- "snippets/csharp/clear_plugins.md"
## Thread Safety
=== "Rust"
--8<-- "snippets/rust/plugins/stateful_plugin.md"
=== "Python"
--8<-- "snippets/python/plugins/stateful_plugin.md"
=== "Java"
--8<-- "snippets/java/plugins/stateful_plugin.md"
=== "C#"
--8<-- "snippets/csharp/stateful_plugin.md"
## Best Practices
**Naming:** Use kebab-case (`my-custom-plugin`), lowercase only, no spaces or special characters.
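A name can be checked against this convention before registration. The helper below is a minimal sketch: `is_valid_plugin_name` is a hypothetical function, not part of Kreuzberg, and permitting digits in a name is an assumption, since the rule above only mentions lowercase, spaces, and special characters.

```python
import re

# Lowercase letters and digits in hyphen-separated groups.
# Allowing digits is an assumption; the naming rule does not mention them.
_KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_plugin_name(name: str) -> bool:
    """Return True if name is kebab-case: lowercase, no spaces or symbols."""
    return _KEBAB_CASE.fullmatch(name) is not None
```

Under this pattern `my-custom-plugin` passes, while `My-Plugin`, `my plugin`, and `my_plugin` are rejected, as are names with a leading, trailing, or doubled hyphen.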
### Logging
=== "Python"
--8<-- "snippets/python/plugins/plugin_logging.md"
=== "Rust"
--8<-- "snippets/rust/plugins/plugin_logging.md"
=== "Java"
--8<-- "snippets/java/plugins/plugin_logging.md"
=== "C#"
--8<-- "snippets/csharp/plugin_logging.md"
### Testing
=== "Python"
--8<-- "snippets/python/plugins/plugin_testing.md"
=== "Rust"
--8<-- "snippets/rust/plugins/plugin_testing.md"
=== "Java"
--8<-- "snippets/java/plugins/plugin_testing.md"
=== "C#"
--8<-- "snippets/csharp/plugin_testing.md"
## Complete Example: PDF Metadata Extractor
=== "Python"
--8<-- "snippets/python/metadata/pdf_metadata_extractor.md"
=== "Go"
--8<-- "snippets/go/plugins/pdf_metadata_extractor.md"
=== "Java"
--8<-- "snippets/java/plugins/pdf_metadata_extractor.md"
=== "C#"
--8<-- "snippets/csharp/pdf_metadata_extractor.md"
=== "Ruby"
--8<-- "snippets/ruby/plugins/pdf_metadata_extractor.md"
=== "R"
--8<-- "snippets/r/plugins/pdf_metadata_extractor.md"