Nomad changes
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s

This commit is contained in:
Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

725
docs/reference/formats.md Normal file
View File

@@ -0,0 +1,725 @@
# Format Support
Kreuzberg supports 90+ file formats across major categories, providing comprehensive document intelligence capabilities through native Rust extractors.
## Overview
Kreuzberg v4 uses a high-performance Rust core with two extraction methods:
- **Native Rust Extractors**: Fast, memory-efficient extractors for all supported formats
> **Note:** LibreOffice was a required system dependency for legacy .doc/.ppt extraction in Kreuzberg < 4.3. Since 4.3, these formats are extracted natively without any external tools.
All formats support async/await and batch processing. Image formats and PDFs support optional OCR when configured.
## Format Support Matrix
### Office Documents
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
| ------------------------ | --------------------------------------------------------------------------- | --------------------------------------------------------------------------- | ----------------------- | ------------------------- | ----------------------------------------------------------- |
| PDF | `.pdf` | `application/pdf` | Native Rust (pdf_oxide) | Yes | Metadata extraction, image extraction, text layer detection |
| Excel | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xlam`, `.xla`, `.xltx`, `.xlt`, `.ods` | Various Excel MIME types | Native Rust (calamine) | No | Multi-sheet support, formula preservation |
| PowerPoint | `.pptx`, `.pptm`, `.ppsx` | `application/vnd.openxmlformats-officedocument.presentationml.presentation` | Native Rust (roxmltree) | Yes (for embedded images) | Slide extraction, image OCR, table detection |
| PowerPoint Template | `.potx`, `.potm`, `.pot` | Various PowerPoint template MIME types | Native Rust (roxmltree) | Yes (for embedded images) | Template slide extraction |
| Word (Modern) | `.docx` | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Native Rust | No | Preserves formatting, extracts metadata |
| Word (Macro/Template) | `.docm`, `.dotx`, `.dotm`, `.dot` | Various Word MIME types | Native Rust | No | Macro-enabled and template variants |
| Word (Legacy) | `.doc` | `application/msword` | Native OLE/CFB | Yes | Direct binary parsing |
| PowerPoint (Legacy) | `.ppt` | `application/vnd.ms-powerpoint` | Native OLE/CFB | Yes | Direct binary parsing |
| OpenDocument Text | `.odt` | `application/vnd.oasis.opendocument.text` | Native Rust | No | Full OpenDocument support |
| OpenDocument Spreadsheet | `.ods` | `application/vnd.oasis.opendocument.spreadsheet` | Native Rust (calamine) | No | Multi-sheet support |
| dBASE | `.dbf` | `application/x-dbf` | Native Rust (dbase) | No | Table data extraction, field type support |
| Hangul Word Processor | `.hwp`, `.hwpx` | `application/x-hwp` | Native Rust (hwpers) | No | Korean document format, text extraction |
| Apple Pages | `.pages` | `application/x-iwork-pages-sffpages` | Native Rust | No | Modern iWork format support |
| Apple Numbers | `.numbers` | `application/x-iwork-numbers-sffnumbers` | Native Rust | No | Spreadsheet extraction |
| Apple Keynote | `.key` | `application/x-iwork-keynote-sffkey` | Native Rust | No | Slide and speaker notes extraction |
### Text & Markup
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
| ---------------- | ------------------ | ------------------------------------ | -------------------------------------------------------------------------------- | ----------- | -------------------------------------------------------------------------------- |
| Plain Text | `.txt` | `text/plain` | Native Rust (streaming) | No | Line/word/character counting, memory-efficient streaming |
| Markdown | `.md`, `.markdown` | `text/markdown`, `text/x-markdown` | Native Rust (streaming) | No | Header extraction, link detection, code block detection |
| HTML | `.html`, `.htm` | `text/html`, `application/xhtml+xml` | Native Rust ([html-to-markdown-rs](https://docs.html-to-markdown.kreuzberg.dev)) | No | Converts to Markdown, metadata extraction |
| XML | `.xml` | `application/xml`, `text/xml` | Native Rust (quick-xml streaming) | No | Element counting, unique element tracking |
| SVG | `.svg` | `image/svg+xml` | Native Rust (XML parser) | No | Treated as XML document |
| reStructuredText | `.rst` | `text/x-rst` | Native (rst-parser) | No | Full reST syntax support |
| Org Mode | `.org` | `text/x-org` | Native (org) | No | Emacs Org mode support |
| Rich Text Format | `.rtf` | `application/rtf`, `text/rtf` | Native (rtf-parser) | No | RTF 1.x support |
| Djot | `.djot` | `text/x-djot` | Native Rust (jotdown) | No | Smart punctuation, tables, code blocks, YAML frontmatter, footnotes, math blocks |
| MDX | `.mdx` | `text/mdx` | Native Rust (pulldown-cmark) | No | JSX-in-Markdown, component-based documents |
### Structured Data
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
| ------ | --------------- | ------------------------------------------------ | ------------------------ | ----------- | ------------------------------------------- |
| JSON | `.json` | `application/json`, `text/json` | Native Rust (serde_json) | No | Field counting, nested structure extraction |
| YAML | `.yaml`, `.yml` | `application/x-yaml`, `text/yaml`, `text/x-yaml` | Native Rust (serde_yaml) | No | Multi-document support, field counting |
| TOML | `.toml` | `application/toml`, `text/toml` | Native Rust (toml crate) | No | Configuration file support |
| CSV | `.csv` | `text/csv` | Native Rust | No | Tabular data extraction |
| TSV | `.tsv` | `text/tab-separated-values` | Native Rust | No | Tab-separated data extraction |
### Email
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
| ------ | ---------- | ---------------------------- | ------------------------- | ----------- | ---------------------------------------------------------------- |
| EML | `.eml` | `message/rfc822` | Native Rust (mail-parser) | No | Header extraction, attachment listing, body text, UTF-16 support |
| MSG | `.msg` | `application/vnd.ms-outlook` | Native Rust (mail-parser) | No | Outlook message support, metadata extraction |
### Images
All image formats support OCR when configured with `ocr` parameter in `ExtractionConfig`.
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
| ---------- | ------------------------------ | -------------------------------------------------- | ---------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------- |
| PNG | `.png` | `image/png` | Native Rust (image-rs) | Yes | EXIF metadata extraction |
| JPEG | `.jpg`, `.jpeg` | `image/jpeg`, `image/jpg` | Native Rust (image-rs) | Yes | EXIF metadata extraction |
| WebP | `.webp` | `image/webp` | Native Rust (image-rs) | Yes | Modern format support |
| BMP | `.bmp` | `image/bmp`, `image/x-bmp`, `image/x-ms-bmp` | Native Rust (image-rs) | Yes | Uncompressed format |
| TIFF | `.tiff`, `.tif` | `image/tiff`, `image/x-tiff` | Native Rust (image-rs) | Yes | Multi-page support |
| GIF | `.gif` | `image/gif` | Native Rust (image-rs) | Yes | Animation frame extraction |
| JPEG 2000 | `.jp2`, `.jpx`, `.jpm`, `.mj2` | `image/jp2`, `image/jpx`, `image/jpm`, `image/mj2` | Native Rust (hayro-jpeg2000) | Yes | OCR: Pure Rust, memory-safe decoder for JP2 container and J2K codestream formats, table detection, format-specific metadata |
| JBIG2 | `.jbig2`, `.jb2` | `image/x-jbig2` | Native Rust (hayro-jbig2) | Yes | OCR: Pure Rust bi-level decoder, commonly found in scanned PDFs |
| PNM Family | `.pnm`, `.pbm`, `.pgm`, `.ppm` | `image/x-portable-anymap`, and so on. | Native Rust (image-rs) | Yes | NetPBM formats |
### Archives
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
| ------ | -------------- | ----------------------------------------------------------------------------------- | ------------------------- | ----------- | ------------------------------------------------ |
| ZIP | `.zip` | `application/zip`, `application/x-zip-compressed` | Native Rust (zip crate) | No | File listing, text content extraction |
| TAR | `.tar`, `.tgz` | `application/x-tar`, `application/tar`, `application/x-gtar`, `application/x-ustar` | Native Rust (tar crate) | No | Unix archive support, gzip compression detection |
| 7-Zip | `.7z` | `application/x-7z-compressed` | Native Rust (sevenz-rust) | No | High compression format support |
| Gzip | `.gz` | `application/gzip`, `application/x-gzip` | Native Rust (flate2) | No | Gzip decompression with text extraction |
### Academic & Publishing (Native)
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
| ---------------- | ------------------ | ------------------------------------------------ | --------------------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------ |
| LaTeX | `.tex`, `.latex` | `application/x-latex`, `text/x-tex` | Native (manual parser) | No | Full LaTeX document support |
| EPUB | `.epub` | `application/epub+zip` | Native (zip + roxmltree + [html-to-markdown-rs](https://docs.html-to-markdown.kreuzberg.dev)) | No | E-book format, metadata extraction |
| BibTeX | `.bib` | `application/x-bibtex`, `application/x-biblatex` | Native (biblatex) | No | Bibliography database support |
| Typst | `.typst`, `.typ` | `application/x-typst` | Native (typst-syntax) | No | Modern typesetting format |
| Jupyter Notebook | `.ipynb` | `application/x-ipynb+json` | Native (JSON parsing) | No | Code cells, markdown cells, output extraction |
| FictionBook | `.fb2` | `application/x-fictionbook+xml` | Native (fb2) | No | XML-based e-book format |
| DocBook | `.docbook`, `.dbk` | `application/docbook+xml` | Native (roxmltree) | No | Technical documentation format |
| JATS | `.jats` | `application/x-jats+xml` | Native (roxmltree) | No | Journal article XML format |
| OPML | `.opml` | `application/x-opml+xml` | Native (roxmltree) | No | Outline format |
| RIS | `.ris` | `application/x-research-info-systems` | Native (biblib) | No | Structured citation parsing with title, authors, DOI, and abstract extraction |
| EndNote XML | `.enw` | `application/x-endnote+xml` | Native (biblib) | No | Structured citation parsing with title, authors, DOI, and keywords extraction |
| PubMed/MEDLINE | `.nbib` | `application/x-pubmed` | Native (biblib) | No | Structured citation parsing with author affiliations, MeSH terms, and abstract |
| CSL JSON | `.csl` | `application/csl+json` | Native (JSON parser) | No | Citation Style Language JSON |
### Markdown Variants (Native)
| Format | MIME Type | Extraction Method | Special Features |
| ------------------------ | ----------------------- | ----------------------- | -------------------------------------------- |
| CommonMark | `text/x-commonmark` | Native (pulldown-cmark) | Standard Markdown spec |
| GitHub Flavored Markdown | `text/x-gfm` | Native (pulldown-cmark) | GFM extensions (tables, strikethrough, etc.) |
| MultiMarkdown | `text/x-multimarkdown` | Native (pulldown-cmark) | MMD extensions |
| Markdown Extra | `text/x-markdown-extra` | Native (pulldown-cmark) | PHP Markdown Extra extensions |
| MDX | `text/mdx` | Native (pulldown-cmark) | JSX-in-Markdown format |
| Djot | `text/x-djot` | Native (jotdown) | Djot markup format with extended features |
### Other Formats
| Format | MIME Type | Extraction Method | Special Features |
| --------- | ----------------- | ------------------------ | ------------------------- |
| Man Pages | `text/x-mdoc` | Native (mdoc-parser) | Unix manual page format |
| Troff | `text/troff` | Native (troff-parser) | Unix document format |
| POD | `text/x-pod` | Native (pod-parser) | Perl documentation format |
| DokuWiki | `text/x-dokuwiki` | Native (dokuwiki-parser) | Wiki markup format |
## Wire Formats vs Content Formats
Kreuzberg distinguishes between two kinds of format:
### Wire Formats (`--format`)
Wire formats control how the extraction result is **serialized** for output. They determine the structure of the data you receive.
| Format | Flag | Description |
| -------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Text** | `--format text` | Plain text output of the `content` field only. Default for `extract`. |
| **JSON** | `--format json` | Standard JSON serialization of the full result object. Default for `batch`. |
| **TOON** | `--format toon` | Token-Oriented Object Notation. Losslessly convertible to/from JSON, but optimized for LLM prompts. Produces ~30-50% fewer tokens than equivalent JSON. |
TOON is designed for RAG and LLM pipelines where every token counts against context window limits and API costs. It encodes the same information as JSON but uses a more compact notation that language models parse equally well.
### Content Formats (`--content-format`)
Content formats control how extracted text is **rendered** inside the `content` field of the result. This determines the markup used for the document's textual content.
| Format | Flag | Description |
| ------------ | --------------------------- | ----------------------------------------------------------------------------- |
| **Plain** | `--content-format plain` | Raw text with no markup. Default. |
| **Markdown** | `--content-format markdown` | GitHub Flavored Markdown (GFM) via comrak. Tables, headings, lists preserved. |
| **HTML** | `--content-format html` | HTML5 rendering via comrak. |
| **Djot** | `--content-format djot` | Djot markup format. |
Wire format and content format are orthogonal. You can combine them freely, for example `--content-format markdown --format toon` produces a TOON-serialized result where the `content` field contains Markdown-formatted text.
!!! Note
The `--output-format` flag is a deprecated alias for `--content-format` and will be removed in a future release.
## Architecture Diagram
```mermaid
graph TD
A[File Input] --> B{MIME Detection}
B --> C{Extraction Method}
C -->|Native Format| D[Rust Core Extractors]
D --> G[PDF Extractor]
D --> H[Excel Extractor]
D --> I[Image Extractor]
D --> J[XML/Text/HTML Extractors]
D --> K[Email Extractor]
D --> L[Archive Extractor]
D --> M[OLE/CFB Parser for .doc/.ppt]
G --> P{OCR Needed?}
I --> P
P -->|Yes| Q[Tesseract OCR]
P -->|No| R[Text Output]
Q --> R
H --> R
J --> R
K --> R
L --> R
M --> R
R --> S[Post-Processing Pipeline]
S --> T[Final Result]
```
## Feature Flags
Kreuzberg uses Cargo feature flags to enable optional format support:
| Feature Flag | Formats Enabled | Default |
| ------------ | --------------------------------- | ------- |
| `pdf` | PDF documents | No |
| `excel` | Excel spreadsheets (all variants) | No |
| `office` | PowerPoint and Office formats | No |
| `ocr` | OCR for images and PDFs | No |
| `email` | EML, MSG email formats | No |
| `html` | HTML to Markdown conversion | No |
| `xml` | XML document parsing | No |
| `archives` | ZIP, TAR, 7z archive support | No |
| `markdown` | Markdown documents | No |
| `djot` | Djot documents | No |
| `mdx` | MDX documents | No |
**Note:** No features are enabled by default (`default = []`). You must explicitly enable the features you need.
To enable specific features:
```toml title="Cargo.toml"
[dependencies]
# Enable only PDF and Excel format support
kreuzberg = { version = "4.0", features = ["pdf", "excel"] }
```
To enable all features with `--all-features`:
```bash title="Terminal"
# Build with all format extraction features enabled
cargo build --all-features
```
Or use the convenience bundles:
All format extraction features (no server components):
```toml title="Cargo.toml"
[dependencies]
kreuzberg = { version = "4.0", features = ["full"] }
```
Server features (API, MCP) with common format support:
```toml title="Cargo.toml"
[dependencies]
kreuzberg = { version = "4.0", features = ["server"] }
```
CLI features with commonly used formats:
```toml title="Cargo.toml"
[dependencies]
kreuzberg = { version = "4.0", features = ["cli"] }
```
## System Dependencies
Some formats require external system tools:
### Tesseract OCR (Optional)
Required for OCR on images and PDFs:
```bash title="Terminal"
# Install Tesseract OCR on macOS
brew install tesseract
# Install Tesseract OCR on Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Install Tesseract OCR on RHEL/CentOS/Fedora
sudo dnf install tesseract
# Install Tesseract OCR on Windows (using Scoop)
scoop install tesseract
```
**Docker Note**: All system dependencies are pre-installed in official Kreuzberg Docker images.
## Format Detection
Kreuzberg automatically detects file formats using:
1. **File Extension Mapping**: 85+ formats mapped to MIME types
2. **mime_guess Crate**: Fallback for unknown extensions
3. **Manual Override**: Explicit MIME type can be provided
Example with manual override:
=== "C#"
```csharp title="format_detection.cs"
using Kreuzberg;
// Automatic format detection from file extension
var result = KreuzbergClient.ExtractFileSync("document.pdf");
// Manual MIME type override for files without extensions
var result2 = KreuzbergClient.ExtractFileAsBytes(rawBytes, "application/pdf", null);
```
=== "Go"
```go title="format_detection.go"
import "kreuzberg"
// Automatic format detection from file extension
result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
if err != nil {
log.Fatal(err)
}
// Manual MIME type override for ambiguous files
config := &kreuzberg.ExtractionConfig{}
mimeBytes, _ := ioutil.ReadFile("document.dat")
result2, err := kreuzberg.ExtractBytesSync(mimeBytes, "application/pdf", config)
```
=== "Java"
```java title="FormatDetection.java"
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
// Automatic format detection from file extension
ExtractionResult result = Kreuzberg.extractFile("document.pdf");
// Manual MIME type override using detectMimeType for byte arrays
String mimeType = Kreuzberg.detectMimeType(new byte[]{/* PDF header bytes */});
ExtractionResult result2 = Kreuzberg.extractFileAsBytes(rawBytes, mimeType, null);
```
=== "Python"
```python title="format_detection.py"
from kreuzberg import extract_file
# Automatic format detection from file extension
result = extract_file("document.pdf")
# Manual MIME type override for unknown extensions
result = extract_file("document.dat", mime_type="application/pdf")
```
=== "Ruby"
```ruby title="format_detection.rb"
require 'kreuzberg'
# Automatic format detection from file extension
result = Kreuzberg.extract_file_sync('document.pdf')
# Manual MIME type override for files with ambiguous extensions
config = Kreuzberg::Config::Extraction.new
result = Kreuzberg.extract_file_sync('document.dat', mime_type: 'application/pdf', config: config)
```
=== "Rust"
```rust title="format_detection.rs"
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
// Automatic format detection from file extension
let result = extract_file("document.pdf", None, &config).await?;
// Manual MIME type override for extensionless files
let result = extract_file("document.dat", Some("application/pdf"), &config).await?;
Ok(())
}
```
=== "TypeScript"
```typescript title="format_detection.ts"
import { extractFile } from '@kreuzberg/node';
// Automatic format detection from file extension
const result = await extractFile('document.pdf');
// Manual MIME type override for files with no extension
const result2 = await extractFile('document.dat', { mimeType: 'application/pdf' });
```
## OCR Support
OCR is available for:
- All image formats (PNG, JPEG, WebP, BMP, TIFF, GIF, etc.)
- PDF documents (with automatic fallback for scanned PDFs)
- Embedded images in PowerPoint presentations
### Configuration
```python title="ocr_configuration.py"
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig
# Configure OCR with multi-language support and custom Tesseract settings
config = ExtractionConfig(
ocr=OcrConfig(
tesseract_config=TesseractConfig(
lang="eng+deu", # Multiple languages: English and German
psm=3, # Page segmentation mode: Auto
oem=1 # OCR Engine mode: LSTM neural net
)
),
force_ocr=False # Only use OCR when native text extraction is insufficient
)
result = extract_file("scanned_document.pdf", config=config)
```
### Automatic OCR Decision
For PDFs, Kreuzberg automatically decides whether OCR is needed by analyzing native text:
- **No OCR**: Document has substantial, meaningful text (>64 non-whitespace chars, >32 chars/page average)
- **OCR Fallback**: Document appears scanned (mostly punctuation, very low alphanumeric ratio)
Override with `force_ocr=True` to always use OCR regardless of native text quality.
## Performance Characteristics
### Native Rust Extractors
- **PDF**: Significantly faster than Python libraries due to native Rust implementation
- **Excel**: Streaming parser, handles multi-GB files
- **XML**: Streaming parser, memory-efficient for large documents
- **Text/Markdown**: Streaming parser with lazy regex compilation
- **Archives**: Efficient extraction without full decompression
### OLE/CFB Extractors
- Direct binary parsing of OLE2/CFB compound files
- Used for legacy formats (`.doc`, `.ppt`)
- No external tool dependencies, native Rust implementation
### Batch Processing
All formats support concurrent batch processing:
```python title="batch_processing.py"
from kreuzberg import batch_extract_file, ExtractionConfig
# Process multiple files concurrently for better throughput
paths = ["file1.pdf", "file2.docx", "file3.xlsx"]
config = ExtractionConfig(max_concurrent_extractions=8)
results = batch_extract_file(paths, config=config)
```
## Format Limitations
### Known Limitations
- **Password-Protected PDFs**: Requires `crypto` extra (`pip install kreuzberg[crypto]`)
- **Legacy Excel (.xls)**: Formula evaluation not supported (values only)
- **Encrypted Office Documents**: Password protection not supported
- **Multi-page TIFF**: OCR processes first page only (configurable)
- **Animated GIF**: Extracts first frame only
### Unsupported Formats
- Video formats (MP4, AVI, MOV, etc.)
- Audio formats (MP3, WAV, FLAC, etc.)
- CAD formats (DWG, DXF, etc.)
- Database files (MDB, ACCDB, etc.)
- Compressed Office formats without proper headers
## Adding New Formats
Kreuzberg's plugin system allows adding custom format extractors:
=== "C#"
```csharp title="CustomExtractor.cs"
using Kreuzberg;
using Kreuzberg.Plugins;
// Custom document extractor for proprietary format support
public class CustomExtractor : IDocumentExtractor
{
public string Name => "custom-format-extractor";
public string[] SupportedMimeTypes => new[] { "application/x-custom" };
public ExtractionResult ExtractBytes(byte[] content, string mimeType, ExtractionConfig config)
{
// Implement custom extraction logic for your format
var text = ParseCustomFormat(content);
return new ExtractionResult
{
Content = text,
MimeType = mimeType,
Metadata = new Dictionary<string, object>()
};
}
}
// Register the custom extractor with Kreuzberg
KreuzbergClient.RegisterDocumentExtractor(new CustomExtractor());
```
=== "Go"
```go title="custom_extractor.go"
package main
import (
"kreuzberg"
"log"
)
// CustomExtractor implements DocumentExtractor for proprietary formats
type CustomExtractor struct{}
func (e *CustomExtractor) Name() string {
return "custom-format-extractor"
}
func (e *CustomExtractor) SupportedMimeTypes() []string {
return []string{"application/x-custom"}
}
func (e *CustomExtractor) ExtractBytes(content []byte, mimeType string, config *kreuzberg.ExtractionConfig) (*kreuzberg.ExtractionResult, error) {
// Implement custom parsing logic for your file format
text := parseCustomFormat(content)
return &kreuzberg.ExtractionResult{
Content: text,
MimeType: mimeType,
Success: true,
}, nil
}
// Register the custom extractor during package initialization
func init() {
if err := kreuzberg.RegisterDocumentExtractor("custom-format-extractor", &CustomExtractor{}); err != nil {
log.Fatal(err)
}
}
```
=== "Java"
```java title="CustomExtractor.java"
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.DocumentExtractorProtocol;
import dev.kreuzberg.ExtractionResult;
import dev.kreuzberg.config.ExtractionConfig;
// Custom document extractor for unsupported file formats
public class CustomExtractor implements DocumentExtractorProtocol {
@Override
public String name() {
return "custom-format-extractor";
}
@Override
public String[] supportedMimeTypes() {
return new String[]{"application/x-custom"};
}
@Override
public ExtractionResult extractBytes(
byte[] content,
String mimeType,
ExtractionConfig config) throws Exception {
// Implement format-specific extraction logic
String text = parseCustomFormat(content);
return new ExtractionResult(text, mimeType, true, null);
}
}
// Register the custom extractor
Kreuzberg.registerDocumentExtractor(new CustomExtractor());
```
=== "Python"
```python title="custom_extractor.py"
from kreuzberg import DocumentExtractor, ExtractionResult, Metadata
# Custom extractor for proprietary or unsupported file formats
class CustomExtractor(DocumentExtractor):
def name(self) -> str:
return "custom-format-extractor"
def supported_mime_types(self) -> list[str]:
return ["application/x-custom"]
def extract_bytes(self, content: bytes, mime_type: str, config) -> ExtractionResult:
# Implement parsing logic specific to your format
text = parse_custom_format(content)
return ExtractionResult(
content=text,
mime_type=mime_type,
metadata=Metadata()
)
# Register the custom extractor with Kreuzberg's registry
from kreuzberg import get_document_extractor_registry
registry = get_document_extractor_registry()
registry.register(CustomExtractor())
```
=== "Ruby"
```ruby title="custom_extractor.rb"
require 'kreuzberg'
# Custom document extractor for new file format support
class CustomExtractor
def name
'custom-format-extractor'
end
def supported_mime_types
['application/x-custom']
end
def extract_bytes(content, mime_type, config)
# Implement your custom format parsing logic
text = parse_custom_format(content)
Kreuzberg::Result.new(
content: text,
mime_type: mime_type,
metadata: {}
)
end
end
# Register the custom extractor
Kreuzberg.register_document_extractor(CustomExtractor.new)
```
=== "Rust"
```rust title="custom_extractor.rs"
use kreuzberg::plugins::{DocumentExtractor, Plugin};
use kreuzberg::types::ExtractionResult;
use async_trait::async_trait;
// Custom document extractor for proprietary file formats
pub struct CustomExtractor;
impl Plugin for CustomExtractor {
fn name(&self) -> &str {
"custom-format-extractor"
}
fn version(&self) -> String {
"1.0.0".to_string()
}
}
#[async_trait]
impl DocumentExtractor for CustomExtractor {
async fn extract_bytes(
&self,
content: &[u8],
mime_type: &str,
config: &ExtractionConfig,
) -> kreuzberg::Result<ExtractionResult> {
// Implement format-specific parsing logic
let text = parse_custom_format(content)?;
Ok(ExtractionResult {
content: text,
mime_type: mime_type.to_string(),
..Default::default()
})
}
fn supported_mime_types(&self) -> &[&str] {
&["application/x-custom"]
}
}
// Register the custom extractor with Kreuzberg's plugin registry
use kreuzberg::plugins::registry::get_document_extractor_registry;
use std::sync::Arc;
let registry = get_document_extractor_registry();
registry.write().unwrap().register(Arc::new(CustomExtractor))?;
```
=== "TypeScript"
```typescript title="custom_extractor.ts"
import { registerDocumentExtractor, type DocumentExtractorProtocol } from '@kreuzberg/node';
// Custom document extractor for new or proprietary file formats
class CustomExtractor implements DocumentExtractorProtocol {
name(): string {
return "custom-format-extractor";
}
supportedMimeTypes(): string[] {
return ["application/x-custom"];
}
async extractBytes(content: Uint8Array, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult> {
// Implement custom parsing logic for your format
const text = parseCustomFormat(content);
return {
content: text,
mimeType: mimeType,
success: true,
metadata: {}
};
}
}
// Register the custom extractor
registerDocumentExtractor(new CustomExtractor());
```
## See Also
- [Configuration Reference](configuration.md) - Detailed configuration options
- [Extraction Guide](../guides/extraction.md) - Extraction examples
- [OCR Guide](../guides/ocr.md) - OCR configuration and usage
- [Plugin System](../concepts/plugin-system.md) - Custom extractor development