fil/docs/guides/plugins.md
Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00

# Creating Plugins <span class="version-badge">v4.0.0</span>
Extend Kreuzberg with custom extractors, post-processors, OCR backends, and validators registered globally for use across all extraction calls.
!!! Note "Wasm"

    Custom plugins are not supported in Wasm environments. Use Python, Rust, or other native bindings.
## Plugin Types
| Type | Purpose | Use case |
| --------------------- | --------------------------------- | ---------------------------------------------------------- |
| **DocumentExtractor** | Extract content from file formats | New format support, override built-in extractors |
| **PostProcessor** | Transform extraction results | Metadata enrichment, content filtering, text normalization |
| **OcrBackend** | Perform OCR on images | Cloud OCR services, custom OCR engines |
| **Validator** | Validate extraction quality | Minimum content length, quality score thresholds |
All plugins must be thread-safe (`Send + Sync` in Rust, thread-safe in Python) and implement `initialize()` / `shutdown()` lifecycle methods.
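As a rough illustration of the two requirements above, the sketch below shows a stateful plugin that guards its shared state with a lock and exposes `initialize()` / `shutdown()`. It uses only the Python standard library; the class `CountingPlugin`, its `process()` method, and the helper `run_demo()` are hypothetical names for this example and are not part of the Kreuzberg API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingPlugin:
    """Hypothetical plugin skeleton (not a Kreuzberg base class)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls = 0
        self._ready = False

    def name(self) -> str:
        return "counting-plugin"

    def initialize(self) -> None:
        # Acquire resources (clients, models, caches) here.
        with self._lock:
            self._ready = True

    def shutdown(self) -> None:
        # Release resources here.
        with self._lock:
            self._ready = False

    def process(self, text: str) -> int:
        # Guard shared state so concurrent calls do not lose updates.
        with self._lock:
            if not self._ready:
                raise RuntimeError("plugin used before initialize()")
            self._calls += 1
        return len(text.split())

    def calls(self) -> int:
        with self._lock:
            return self._calls


def run_demo(n: int = 200) -> int:
    """Call the plugin from several threads and return the call count."""
    plugin = CountingPlugin()
    plugin.initialize()
    try:
        with ThreadPoolExecutor(max_workers=8) as pool:
            list(pool.map(plugin.process, ["some text"] * n))
        return plugin.calls()
    finally:
        plugin.shutdown()


print(run_demo())
```

The lock keeps the counter consistent no matter how many worker threads call `process()` at once, and the `try` / `finally` pairing makes sure `shutdown()` runs even if a call fails.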
## Document Extractors
### Implementation
=== "Rust"

    --8<-- "snippets/rust/plugins/plugin_extractor.md"

=== "Python"

    --8<-- "snippets/python/plugins/plugin_extractor.md"
### Registration
=== "Python"

    --8<-- "snippets/python/plugins/extractor_registration.md"

=== "TypeScript"

    --8<-- "snippets/typescript/plugins/custom_extractor_plugin.md"

=== "Rust"

    --8<-- "snippets/rust/plugins/extractor_registration.md"

=== "Go"

    --8<-- "snippets/go/plugins/extractor_registration.md"

=== "Java"

    --8<-- "snippets/java/plugins/extractor_registration.md"

=== "C#"

    --8<-- "snippets/csharp/extractor_registration.md"

=== "Ruby"

    --8<-- "snippets/ruby/plugins/extractor_registration.md"

=== "R"

    --8<-- "snippets/r/plugins/extractor_registration.md"
### Priority System
When multiple extractors support the same MIME type, the highest priority wins:
| Range | Level |
| ------ | --------------------------- |
| 0–25 | Fallback / low-quality |
| 26–49 | Alternative |
| **50** | **Default (built-in)** |
| 51–75 | Enhanced / premium |
| 76–100 | Specialized / high-priority |
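The selection rule can be sketched in a few lines of plain Python. This is an illustration of the "highest priority wins" idea only, not Kreuzberg's registry implementation: `ExtractorEntry` and `pick_extractor` are hypothetical names, and the tie-breaking behaviour (first registered wins) is an assumption of this sketch, since the rule above does not say how equal priorities are resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractorEntry:
    """Hypothetical registry record used only for this sketch."""

    name: str
    mime_types: frozenset
    priority: int = 50  # 50 is the built-in default in the table above


def pick_extractor(entries, mime_type):
    """Return the highest-priority entry supporting mime_type, or None."""
    candidates = [e for e in entries if mime_type in e.mime_types]
    if not candidates:
        return None
    # max() keeps the first entry on ties (an assumption of this sketch).
    return max(candidates, key=lambda e: e.priority)


entries = [
    ExtractorEntry("builtin-pdf", frozenset({"application/pdf"})),
    ExtractorEntry("fallback-pdf", frozenset({"application/pdf"}), priority=10),
    ExtractorEntry("premium-pdf", frozenset({"application/pdf"}), priority=75),
]

chosen = pick_extractor(entries, "application/pdf")
print(chosen.name)  # premium-pdf
```

Registering a custom extractor with a priority above 50 is therefore enough to take precedence over a built-in one for the same MIME type, while a value below 50 leaves it as a fallback.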
## Post-Processors
Processors execute in three stages:
- **Early** — Foundational: language detection, quality scoring, text normalization
- **Middle** — Transformation: keyword extraction, token reduction, summarization
- **Late** — Final: custom metadata, analytics, output formatting
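A minimal sketch of stage ordering follows, assuming processors are plain functions tagged with a stage. The `Stage` enum, `run_pipeline`, and the three example functions are hypothetical and only model the ordering described above; the real processor interface is shown in the implementation snippets below.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical stage tags; lower values run first."""

    EARLY = 0
    MIDDLE = 1
    LATE = 2


def run_pipeline(result, processors):
    """Apply (stage, function) pairs in stage order."""
    # sorted() is stable, so processors within one stage keep their
    # registration order.
    for _stage, func in sorted(processors, key=lambda p: p[0]):
        result = func(result)
    return result


def normalize(result):
    # Early: collapse runs of whitespace.
    return {**result, "content": " ".join(result["content"].split())}


def add_keywords(result):
    # Middle: derive keywords from the normalized text.
    words = result["content"].lower().split()
    return {**result, "keywords": sorted(set(words))}


def add_word_count(result):
    # Late: attach final metadata.
    return {**result, "word_count": len(result["content"].split())}


# Registration order is deliberately shuffled; the stage decides execution order.
processors = [
    (Stage.LATE, add_word_count),
    (Stage.EARLY, normalize),
    (Stage.MIDDLE, add_keywords),
]

out = run_pipeline({"content": "  Hello   hello world "}, processors)
print(out)
```

Because later stages see the output of earlier ones, a late-stage processor can rely on normalized text and on metadata added in the early and middle stages.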
### Implementation
=== "Rust"

    --8<-- "snippets/rust/plugins/word_count_processor.md"

=== "Python"

    --8<-- "snippets/python/plugins/word_count_processor.md"
### Conditional Processing
=== "Python"

    --8<-- "snippets/python/plugins/pdf_only_processor.md"

=== "Rust"

    --8<-- "snippets/rust/metadata/pdf_only_processor.md"

=== "Go"

    --8<-- "snippets/go/plugins/pdf_only_processor.md"

=== "Java"

    --8<-- "snippets/java/plugins/pdf_only_processor.md"

=== "C#"

    --8<-- "snippets/csharp/pdf_only_processor.md"
## OCR Backends
### Implementation
=== "Rust"

    --8<-- "snippets/rust/ocr/cloud_ocr_backend.md"

=== "Python"

    --8<-- "snippets/python/ocr/cloud_ocr_backend.md"

=== "Java"

    --8<-- "snippets/java/ocr/cloud_ocr_backend.md"

=== "C#"

    --8<-- "snippets/csharp/cloud_ocr_backend.md"

=== "Ruby"

    --8<-- "snippets/ruby/ocr/cloud_ocr_backend.md"

=== "R"

    --8<-- "snippets/r/ocr/cloud_ocr_backend.md"
### Registration
Register the backend and set its name in `OcrConfig`:
=== "Python"

    ```python title="Python"
    from kreuzberg import ExtractionConfig, OcrConfig, extract_file_sync
    from kreuzberg import register_ocr_backend, unregister_ocr_backend

    # CloudOcrBackend is the class defined in the Implementation section above.
    backend = CloudOcrBackend(api_key="your-api-key")
    register_ocr_backend(backend)

    # Select the registered backend by name in OcrConfig.
    config = ExtractionConfig(ocr=OcrConfig(backend="cloud-ocr", language="eng"))
    result = extract_file_sync("scanned.pdf", config=config)

    # Unregister the backend once it is no longer needed.
    unregister_ocr_backend("cloud-ocr")
    ```
### Using EasyOCR (Built-in)
The built-in EasyOCR backend supports 80+ languages and GPU acceleration. To use it, select it in `OcrConfig`:
=== "Python"

    --8<-- "snippets/python/ocr/ocr_easyocr.md"
## Validators
!!! Warning

    Validation errors cause extraction to fail. Use validators for critical quality checks only.
=== "Rust"

    --8<-- "snippets/rust/plugins/min_length_validator.md"

=== "Python"

    --8<-- "snippets/python/plugins/min_length_validator.md"

=== "Java"

    --8<-- "snippets/java/plugins/min_length_validator.md"

=== "C#"

    --8<-- "snippets/csharp/min_length_validator.md"
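To make the warning at the top of this section concrete, here is a small standalone model of how a failing validator aborts an extraction. Everything in it is hypothetical: `ValidationError`, `min_length_validator`, and `extract_with_validators` are names invented for this sketch and do not mirror Kreuzberg's actual classes or error types.

```python
class ValidationError(Exception):
    """Hypothetical error type for this sketch."""


def min_length_validator(min_chars):
    """Build a validator that rejects results with too little content."""

    def validate(result):
        if len(result["content"].strip()) < min_chars:
            raise ValidationError(
                f"content shorter than {min_chars} characters"
            )

    return validate


def extract_with_validators(content, validators):
    """Stand-in for an extraction call followed by validation."""
    result = {"content": content}
    for validate in validators:
        # Any exception raised here propagates, so no result is returned.
        validate(result)
    return result


validators = [min_length_validator(10)]

print(extract_with_validators("A sufficiently long document.", validators))

try:
    extract_with_validators("too short", validators)
except ValidationError as err:
    print(f"extraction failed: {err}")
```

Since a single failed check discards the whole result, checks that are merely informative are usually better expressed as a post-processor that records a metadata field rather than as a validator.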
### Quality Score Validator
=== "Rust"

    --8<-- "snippets/rust/plugins/quality_score_validator.md"

=== "Python"

    --8<-- "snippets/python/plugins/quality_score_validator.md"

=== "Java"

    --8<-- "snippets/java/plugins/quality_score_validator.md"

=== "C#"

    --8<-- "snippets/csharp/quality_score_validator.md"
## Plugin Management
### Listing
=== "Python"

    --8<-- "snippets/python/plugins/list_plugins.md"

=== "Rust"

    --8<-- "snippets/rust/plugins/list_plugins.md"

=== "Java"

    --8<-- "snippets/java/plugins/list_plugins.md"

=== "C#"

    --8<-- "snippets/csharp/list_plugins.md"
### Unregistering
=== "Python"

    --8<-- "snippets/python/plugins/unregister_plugins.md"

=== "Rust"

    --8<-- "snippets/rust/plugins/unregister_plugins.md"

=== "Java"

    --8<-- "snippets/java/plugins/unregister_plugins.md"

=== "C#"

    --8<-- "snippets/csharp/unregister_plugins.md"
### Clearing All
=== "Python"

    --8<-- "snippets/python/plugins/clear_plugins.md"

=== "Rust"

    --8<-- "snippets/rust/plugins/clear_plugins.md"

=== "Java"

    --8<-- "snippets/java/plugins/clear_plugins.md"

=== "C#"

    --8<-- "snippets/csharp/clear_plugins.md"
## Thread Safety
=== "Rust"

    --8<-- "snippets/rust/plugins/stateful_plugin.md"

=== "Python"

    --8<-- "snippets/python/plugins/stateful_plugin.md"

=== "Java"

    --8<-- "snippets/java/plugins/stateful_plugin.md"

=== "C#"

    --8<-- "snippets/csharp/stateful_plugin.md"
## Best Practices
**Naming:** Use kebab-case (`my-custom-plugin`), lowercase only, no spaces or special characters.
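One way to enforce the naming rule before registration is a small check like the one below. The pattern is an assumed reading of the rule (lowercase letters and digits, words joined by single hyphens); whether digits are allowed is not stated above, and `is_valid_plugin_name` is a hypothetical helper, not a Kreuzberg function.

```python
import re

# Assumed reading of the rule: lowercase letters or digits, with words
# joined by single hyphens and no leading or trailing hyphen.
KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_plugin_name(name: str) -> bool:
    """Return True if name looks like a kebab-case plugin name."""
    return KEBAB_CASE.fullmatch(name) is not None


for candidate in ["my-custom-plugin", "My-Plugin", "my plugin", "my_plugin"]:
    print(candidate, is_valid_plugin_name(candidate))
```

Running a check like this in a plugin's constructor or test suite catches invalid names early instead of at registration time.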
### Logging
=== "Python"

    --8<-- "snippets/python/plugins/plugin_logging.md"

=== "Rust"

    --8<-- "snippets/rust/plugins/plugin_logging.md"

=== "Java"

    --8<-- "snippets/java/plugins/plugin_logging.md"

=== "C#"

    --8<-- "snippets/csharp/plugin_logging.md"
### Testing
=== "Python"

    --8<-- "snippets/python/plugins/plugin_testing.md"

=== "Rust"

    --8<-- "snippets/rust/plugins/plugin_testing.md"

=== "Java"

    --8<-- "snippets/java/plugins/plugin_testing.md"

=== "C#"

    --8<-- "snippets/csharp/plugin_testing.md"
## Complete Example: PDF Metadata Extractor
=== "Python"

    --8<-- "snippets/python/metadata/pdf_metadata_extractor.md"

=== "Go"

    --8<-- "snippets/go/plugins/pdf_metadata_extractor.md"

=== "Java"

    --8<-- "snippets/java/plugins/pdf_metadata_extractor.md"

=== "C#"

    --8<-- "snippets/csharp/pdf_metadata_extractor.md"

=== "Ruby"

    --8<-- "snippets/ruby/plugins/pdf_metadata_extractor.md"

=== "R"

    --8<-- "snippets/r/plugins/pdf_metadata_extractor.md"