726 lines
40 KiB
Markdown
726 lines
40 KiB
Markdown
|
|
# Format Support
|
||
|
|
|
||
|
|
Kreuzberg supports 90+ file formats across major categories, providing comprehensive document intelligence capabilities through native Rust extractors.
|
||
|
|
|
||
|
|
## Overview
|
||
|
|
|
||
|
|
Kreuzberg v4 uses a high-performance Rust core with two extraction methods:
|
||
|
|
|
||
|
|
- **Native Rust Extractors**: Fast, memory-efficient extractors for all supported formats
|
||
|
|
|
||
|
|
> **Note:** LibreOffice was a required system dependency for legacy .doc/.ppt extraction in Kreuzberg < 4.3. Since 4.3, these formats are extracted natively without any external tools.
|
||
|
|
|
||
|
|
All formats support async/await and batch processing. Image formats and PDFs support optional OCR when configured.
|
||
|
|
|
||
|
|
## Format Support Matrix
|
||
|
|
|
||
|
|
### Office Documents
|
||
|
|
|
||
|
|
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|
||
|
|
| ------------------------ | --------------------------------------------------------------------------- | --------------------------------------------------------------------------- | ----------------------- | ------------------------- | ----------------------------------------------------------- |
|
||
|
|
| PDF | `.pdf` | `application/pdf` | Native Rust (pdf_oxide) | Yes | Metadata extraction, image extraction, text layer detection |
|
||
|
|
| Excel | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xlam`, `.xla`, `.xltx`, `.xlt`, `.ods` | Various Excel MIME types | Native Rust (calamine) | No | Multi-sheet support, formula preservation |
|
||
|
|
| PowerPoint | `.pptx`, `.pptm`, `.ppsx` | `application/vnd.openxmlformats-officedocument.presentationml.presentation` | Native Rust (roxmltree) | Yes (for embedded images) | Slide extraction, image OCR, table detection |
|
||
|
|
| PowerPoint Template | `.potx`, `.potm`, `.pot` | Various PowerPoint template MIME types | Native Rust (roxmltree) | Yes (for embedded images) | Template slide extraction |
|
||
|
|
| Word (Modern) | `.docx` | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Native Rust | No | Preserves formatting, extracts metadata |
|
||
|
|
| Word (Macro/Template) | `.docm`, `.dotx`, `.dotm`, `.dot` | Various Word MIME types | Native Rust | No | Macro-enabled and template variants |
|
||
|
|
| Word (Legacy) | `.doc` | `application/msword` | Native OLE/CFB | Yes | Direct binary parsing |
|
||
|
|
| PowerPoint (Legacy) | `.ppt` | `application/vnd.ms-powerpoint` | Native OLE/CFB | Yes | Direct binary parsing |
|
||
|
|
| OpenDocument Text | `.odt` | `application/vnd.oasis.opendocument.text` | Native Rust | No | Full OpenDocument support |
|
||
|
|
| OpenDocument Spreadsheet | `.ods` | `application/vnd.oasis.opendocument.spreadsheet` | Native Rust (calamine) | No | Multi-sheet support |
|
||
|
|
| dBASE | `.dbf` | `application/x-dbf` | Native Rust (dbase) | No | Table data extraction, field type support |
|
||
|
|
| Hangul Word Processor | `.hwp`, `.hwpx` | `application/x-hwp` | Native Rust (hwpers) | No | Korean document format, text extraction |
|
||
|
|
| Apple Pages | `.pages` | `application/x-iwork-pages-sffpages` | Native Rust | No | Modern iWork format support |
|
||
|
|
| Apple Numbers | `.numbers` | `application/x-iwork-numbers-sffnumbers` | Native Rust | No | Spreadsheet extraction |
|
||
|
|
| Apple Keynote | `.key` | `application/x-iwork-keynote-sffkey` | Native Rust | No | Slide and speaker notes extraction |
|
||
|
|
|
||
|
|
### Text & Markup
|
||
|
|
|
||
|
|
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|
||
|
|
| ---------------- | ------------------ | ------------------------------------ | -------------------------------------------------------------------------------- | ----------- | -------------------------------------------------------------------------------- |
|
||
|
|
| Plain Text | `.txt` | `text/plain` | Native Rust (streaming) | No | Line/word/character counting, memory-efficient streaming |
|
||
|
|
| Markdown | `.md`, `.markdown` | `text/markdown`, `text/x-markdown` | Native Rust (streaming) | No | Header extraction, link detection, code block detection |
|
||
|
|
| HTML | `.html`, `.htm` | `text/html`, `application/xhtml+xml` | Native Rust ([html-to-markdown-rs](https://docs.html-to-markdown.kreuzberg.dev)) | No | Converts to Markdown, metadata extraction |
|
||
|
|
| XML | `.xml` | `application/xml`, `text/xml` | Native Rust (quick-xml streaming) | No | Element counting, unique element tracking |
|
||
|
|
| SVG | `.svg` | `image/svg+xml` | Native Rust (XML parser) | No | Treated as XML document |
|
||
|
|
| reStructuredText | `.rst` | `text/x-rst` | Native (rst-parser) | No | Full reST syntax support |
|
||
|
|
| Org Mode | `.org` | `text/x-org` | Native (org) | No | Emacs Org mode support |
|
||
|
|
| Rich Text Format | `.rtf` | `application/rtf`, `text/rtf` | Native (rtf-parser) | No | RTF 1.x support |
|
||
|
|
| Djot | `.djot` | `text/x-djot` | Native Rust (jotdown) | No | Smart punctuation, tables, code blocks, YAML frontmatter, footnotes, math blocks |
|
||
|
|
| MDX | `.mdx` | `text/mdx` | Native Rust (pulldown-cmark) | No | JSX-in-Markdown, component-based documents |
|
||
|
|
|
||
|
|
### Structured Data
|
||
|
|
|
||
|
|
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|
||
|
|
| ------ | --------------- | ------------------------------------------------ | ------------------------ | ----------- | ------------------------------------------- |
|
||
|
|
| JSON | `.json` | `application/json`, `text/json` | Native Rust (serde_json) | No | Field counting, nested structure extraction |
|
||
|
|
| YAML | `.yaml`, `.yml` | `application/x-yaml`, `text/yaml`, `text/x-yaml` | Native Rust (serde_yaml) | No | Multi-document support, field counting |
|
||
|
|
| TOML | `.toml` | `application/toml`, `text/toml` | Native Rust (toml crate) | No | Configuration file support |
|
||
|
|
| CSV | `.csv` | `text/csv` | Native Rust | No | Tabular data extraction |
|
||
|
|
| TSV | `.tsv` | `text/tab-separated-values` | Native Rust | No | Tab-separated data extraction |
|
||
|
|
|
||
|
|
### Email
|
||
|
|
|
||
|
|
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|
||
|
|
| ------ | ---------- | ---------------------------- | ------------------------- | ----------- | ---------------------------------------------------------------- |
|
||
|
|
| EML | `.eml` | `message/rfc822` | Native Rust (mail-parser) | No | Header extraction, attachment listing, body text, UTF-16 support |
|
||
|
|
| MSG | `.msg` | `application/vnd.ms-outlook` | Native Rust (mail-parser) | No | Outlook message support, metadata extraction |
|
||
|
|
|
||
|
|
### Images
|
||
|
|
|
||
|
|
All image formats support OCR when configured with `ocr` parameter in `ExtractionConfig`.
|
||
|
|
|
||
|
|
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|
||
|
|
| ---------- | ------------------------------ | -------------------------------------------------- | ---------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------- |
|
||
|
|
| PNG | `.png` | `image/png` | Native Rust (image-rs) | Yes | EXIF metadata extraction |
|
||
|
|
| JPEG | `.jpg`, `.jpeg` | `image/jpeg`, `image/jpg` | Native Rust (image-rs) | Yes | EXIF metadata extraction |
|
||
|
|
| WebP | `.webp` | `image/webp` | Native Rust (image-rs) | Yes | Modern format support |
|
||
|
|
| BMP | `.bmp` | `image/bmp`, `image/x-bmp`, `image/x-ms-bmp` | Native Rust (image-rs) | Yes | Uncompressed format |
|
||
|
|
| TIFF | `.tiff`, `.tif` | `image/tiff`, `image/x-tiff` | Native Rust (image-rs) | Yes | Multi-page support |
|
||
|
|
| GIF | `.gif` | `image/gif` | Native Rust (image-rs) | Yes | Animation frame extraction |
|
||
|
|
| JPEG 2000 | `.jp2`, `.jpx`, `.jpm`, `.mj2` | `image/jp2`, `image/jpx`, `image/jpm`, `image/mj2` | Native Rust (hayro-jpeg2000) | Yes | OCR: Pure Rust, memory-safe decoder for JP2 container and J2K codestream formats, table detection, format-specific metadata |
|
||
|
|
| JBIG2 | `.jbig2`, `.jb2` | `image/x-jbig2` | Native Rust (hayro-jbig2) | Yes | OCR: Pure Rust bi-level decoder, commonly found in scanned PDFs |
|
||
|
|
| PNM Family | `.pnm`, `.pbm`, `.pgm`, `.ppm` | `image/x-portable-anymap`, and so on. | Native Rust (image-rs) | Yes | NetPBM formats |
|
||
|
|
|
||
|
|
### Archives
|
||
|
|
|
||
|
|
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|
||
|
|
| ------ | -------------- | ----------------------------------------------------------------------------------- | ------------------------- | ----------- | ------------------------------------------------ |
|
||
|
|
| ZIP | `.zip` | `application/zip`, `application/x-zip-compressed` | Native Rust (zip crate) | No | File listing, text content extraction |
|
||
|
|
| TAR | `.tar`, `.tgz` | `application/x-tar`, `application/tar`, `application/x-gtar`, `application/x-ustar` | Native Rust (tar crate) | No | Unix archive support, gzip compression detection |
|
||
|
|
| 7-Zip | `.7z` | `application/x-7z-compressed` | Native Rust (sevenz-rust) | No | High compression format support |
|
||
|
|
| Gzip | `.gz` | `application/gzip`, `application/x-gzip` | Native Rust (flate2) | No | Gzip decompression with text extraction |
|
||
|
|
|
||
|
|
### Academic & Publishing (Native)
|
||
|
|
|
||
|
|
| Format | Extensions | MIME Type | Extraction Method | OCR Support | Special Features |
|
||
|
|
| ---------------- | ------------------ | ------------------------------------------------ | --------------------------------------------------------------------------------------------- | ----------- | ------------------------------------------------------------------------------ |
|
||
|
|
| LaTeX | `.tex`, `.latex` | `application/x-latex`, `text/x-tex` | Native (manual parser) | No | Full LaTeX document support |
|
||
|
|
| EPUB | `.epub` | `application/epub+zip` | Native (zip + roxmltree + [html-to-markdown-rs](https://docs.html-to-markdown.kreuzberg.dev)) | No | E-book format, metadata extraction |
|
||
|
|
| BibTeX | `.bib` | `application/x-bibtex`, `application/x-biblatex` | Native (biblatex) | No | Bibliography database support |
|
||
|
|
| Typst | `.typst`, `.typ` | `application/x-typst` | Native (typst-syntax) | No | Modern typesetting format |
|
||
|
|
| Jupyter Notebook | `.ipynb` | `application/x-ipynb+json` | Native (JSON parsing) | No | Code cells, markdown cells, output extraction |
|
||
|
|
| FictionBook | `.fb2` | `application/x-fictionbook+xml` | Native (fb2) | No | XML-based e-book format |
|
||
|
|
| DocBook | `.docbook`, `.dbk` | `application/docbook+xml` | Native (roxmltree) | No | Technical documentation format |
|
||
|
|
| JATS | `.jats` | `application/x-jats+xml` | Native (roxmltree) | No | Journal article XML format |
|
||
|
|
| OPML | `.opml` | `application/x-opml+xml` | Native (roxmltree) | No | Outline format |
|
||
|
|
| RIS | `.ris` | `application/x-research-info-systems` | Native (biblib) | No | Structured citation parsing with title, authors, DOI, and abstract extraction |
|
||
|
|
| EndNote XML | `.enw` | `application/x-endnote+xml` | Native (biblib) | No | Structured citation parsing with title, authors, DOI, and keywords extraction |
|
||
|
|
| PubMed/MEDLINE | `.nbib` | `application/x-pubmed` | Native (biblib) | No | Structured citation parsing with author affiliations, MeSH terms, and abstract |
|
||
|
|
| CSL JSON | `.csl` | `application/csl+json` | Native (JSON parser) | No | Citation Style Language JSON |
|
||
|
|
|
||
|
|
### Markdown Variants (Native)
|
||
|
|
|
||
|
|
| Format | MIME Type | Extraction Method | Special Features |
|
||
|
|
| ------------------------ | ----------------------- | ----------------------- | -------------------------------------------- |
|
||
|
|
| CommonMark | `text/x-commonmark` | Native (pulldown-cmark) | Standard Markdown spec |
|
||
|
|
| GitHub Flavored Markdown | `text/x-gfm` | Native (pulldown-cmark) | GFM extensions (tables, strikethrough, etc.) |
|
||
|
|
| MultiMarkdown | `text/x-multimarkdown` | Native (pulldown-cmark) | MMD extensions |
|
||
|
|
| Markdown Extra | `text/x-markdown-extra` | Native (pulldown-cmark) | PHP Markdown Extra extensions |
|
||
|
|
| MDX | `text/mdx` | Native (pulldown-cmark) | JSX-in-Markdown format |
|
||
|
|
| Djot | `text/x-djot` | Native (jotdown) | Djot markup format with extended features |
|
||
|
|
|
||
|
|
### Other Formats
|
||
|
|
|
||
|
|
| Format | MIME Type | Extraction Method | Special Features |
|
||
|
|
| --------- | ----------------- | ------------------------ | ------------------------- |
|
||
|
|
| Man Pages | `text/x-mdoc` | Native (mdoc-parser) | Unix manual page format |
|
||
|
|
| Troff | `text/troff` | Native (troff-parser) | Unix document format |
|
||
|
|
| POD | `text/x-pod` | Native (pod-parser) | Perl documentation format |
|
||
|
|
| DokuWiki | `text/x-dokuwiki` | Native (dokuwiki-parser) | Wiki markup format |
|
||
|
|
|
||
|
|
## Wire Formats vs Content Formats
|
||
|
|
|
||
|
|
Kreuzberg distinguishes between two kinds of format:
|
||
|
|
|
||
|
|
### Wire Formats (`--format`)
|
||
|
|
|
||
|
|
Wire formats control how the extraction result is **serialized** for output. They determine the structure of the data you receive.
|
||
|
|
|
||
|
|
| Format | Flag | Description |
|
||
|
|
| -------- | --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||
|
|
| **Text** | `--format text` | Plain text output of the `content` field only. Default for `extract`. |
|
||
|
|
| **JSON** | `--format json` | Standard JSON serialization of the full result object. Default for `batch`. |
|
||
|
|
| **TOON** | `--format toon` | Token-Oriented Object Notation. Losslessly convertible to/from JSON, but optimized for LLM prompts. Produces ~30-50% fewer tokens than equivalent JSON. |
|
||
|
|
|
||
|
|
TOON is designed for RAG and LLM pipelines where every token counts against context window limits and API costs. It encodes the same information as JSON but uses a more compact notation that language models parse equally well.
|
||
|
|
|
||
|
|
### Content Formats (`--content-format`)
|
||
|
|
|
||
|
|
Content formats control how extracted text is **rendered** inside the `content` field of the result. This determines the markup used for the document's textual content.
|
||
|
|
|
||
|
|
| Format | Flag | Description |
|
||
|
|
| ------------ | --------------------------- | ----------------------------------------------------------------------------- |
|
||
|
|
| **Plain** | `--content-format plain` | Raw text with no markup. Default. |
|
||
|
|
| **Markdown** | `--content-format markdown` | GitHub Flavored Markdown (GFM) via comrak. Tables, headings, lists preserved. |
|
||
|
|
| **HTML** | `--content-format html` | HTML5 rendering via comrak. |
|
||
|
|
| **Djot** | `--content-format djot` | Djot markup format. |
|
||
|
|
|
||
|
|
Wire format and content format are orthogonal. You can combine them freely, for example `--content-format markdown --format toon` produces a TOON-serialized result where the `content` field contains Markdown-formatted text.
|
||
|
|
|
||
|
|
!!! Note
|
||
|
|
The `--output-format` flag is a deprecated alias for `--content-format` and will be removed in a future release.
|
||
|
|
|
||
|
|
## Architecture Diagram
|
||
|
|
|
||
|
|
```mermaid
|
||
|
|
graph TD
|
||
|
|
A[File Input] --> B{MIME Detection}
|
||
|
|
B --> C{Extraction Method}
|
||
|
|
|
||
|
|
C -->|Native Format| D[Rust Core Extractors]
|
||
|
|
|
||
|
|
D --> G[PDF Extractor]
|
||
|
|
D --> H[Excel Extractor]
|
||
|
|
D --> I[Image Extractor]
|
||
|
|
D --> J[XML/Text/HTML Extractors]
|
||
|
|
D --> K[Email Extractor]
|
||
|
|
D --> L[Archive Extractor]
|
||
|
|
D --> M[OLE/CFB Parser for .doc/.ppt]
|
||
|
|
|
||
|
|
G --> P{OCR Needed?}
|
||
|
|
I --> P
|
||
|
|
P -->|Yes| Q[Tesseract OCR]
|
||
|
|
P -->|No| R[Text Output]
|
||
|
|
Q --> R
|
||
|
|
|
||
|
|
H --> R
|
||
|
|
J --> R
|
||
|
|
K --> R
|
||
|
|
L --> R
|
||
|
|
M --> R
|
||
|
|
|
||
|
|
R --> S[Post-Processing Pipeline]
|
||
|
|
S --> T[Final Result]
|
||
|
|
```
|
||
|
|
|
||
|
|
## Feature Flags
|
||
|
|
|
||
|
|
Kreuzberg uses Cargo feature flags to enable optional format support:
|
||
|
|
|
||
|
|
| Feature Flag | Formats Enabled | Default |
|
||
|
|
| ------------ | --------------------------------- | ------- |
|
||
|
|
| `pdf` | PDF documents | No |
|
||
|
|
| `excel` | Excel spreadsheets (all variants) | No |
|
||
|
|
| `office` | PowerPoint and Office formats | No |
|
||
|
|
| `ocr` | OCR for images and PDFs | No |
|
||
|
|
| `email` | EML, MSG email formats | No |
|
||
|
|
| `html` | HTML to Markdown conversion | No |
|
||
|
|
| `xml` | XML document parsing | No |
|
||
|
|
| `archives` | ZIP, TAR, 7z archive support | No |
|
||
|
|
| `markdown` | Markdown documents | No |
|
||
|
|
| `djot` | Djot documents | No |
|
||
|
|
| `mdx` | MDX documents | No |
|
||
|
|
|
||
|
|
**Note:** No features are enabled by default (`default = []`). You must explicitly enable the features you need.
|
||
|
|
|
||
|
|
To enable specific features:
|
||
|
|
|
||
|
|
```toml title="Cargo.toml"
|
||
|
|
[dependencies]
|
||
|
|
# Enable only PDF and Excel format support
|
||
|
|
kreuzberg = { version = "4.0", features = ["pdf", "excel"] }
|
||
|
|
```
|
||
|
|
|
||
|
|
To enable all features with `--all-features`:
|
||
|
|
|
||
|
|
```bash title="Terminal"
|
||
|
|
# Build with all format extraction features enabled
|
||
|
|
cargo build --all-features
|
||
|
|
```
|
||
|
|
|
||
|
|
Or use the convenience bundles:
|
||
|
|
|
||
|
|
All format extraction features (no server components):
|
||
|
|
|
||
|
|
```toml title="Cargo.toml"
|
||
|
|
[dependencies]
|
||
|
|
kreuzberg = { version = "4.0", features = ["full"] }
|
||
|
|
```
|
||
|
|
|
||
|
|
Server features (API, MCP) with common format support:
|
||
|
|
|
||
|
|
```toml title="Cargo.toml"
|
||
|
|
[dependencies]
|
||
|
|
kreuzberg = { version = "4.0", features = ["server"] }
|
||
|
|
```
|
||
|
|
|
||
|
|
CLI features with commonly used formats:
|
||
|
|
|
||
|
|
```toml title="Cargo.toml"
|
||
|
|
[dependencies]
|
||
|
|
kreuzberg = { version = "4.0", features = ["cli"] }
|
||
|
|
```
|
||
|
|
|
||
|
|
## System Dependencies
|
||
|
|
|
||
|
|
Some formats require external system tools:
|
||
|
|
|
||
|
|
### Tesseract OCR (Optional)
|
||
|
|
|
||
|
|
Required for OCR on images and PDFs:
|
||
|
|
|
||
|
|
```bash title="Terminal"
|
||
|
|
# Install Tesseract OCR on macOS
|
||
|
|
brew install tesseract
|
||
|
|
|
||
|
|
# Install Tesseract OCR on Ubuntu/Debian
|
||
|
|
sudo apt-get install tesseract-ocr
|
||
|
|
|
||
|
|
# Install Tesseract OCR on RHEL/CentOS/Fedora
|
||
|
|
sudo dnf install tesseract
|
||
|
|
|
||
|
|
# Install Tesseract OCR on Windows (using Scoop)
|
||
|
|
scoop install tesseract
|
||
|
|
```
|
||
|
|
|
||
|
|
**Docker Note**: All system dependencies are pre-installed in official Kreuzberg Docker images.
|
||
|
|
|
||
|
|
## Format Detection
|
||
|
|
|
||
|
|
Kreuzberg automatically detects file formats using:
|
||
|
|
|
||
|
|
1. **File Extension Mapping**: 85+ formats mapped to MIME types
|
||
|
|
2. **mime_guess Crate**: Fallback for unknown extensions
|
||
|
|
3. **Manual Override**: Explicit MIME type can be provided
|
||
|
|
|
||
|
|
Example with manual override:
|
||
|
|
|
||
|
|
=== "C#"
|
||
|
|
|
||
|
|
```csharp title="format_detection.cs"
|
||
|
|
using Kreuzberg;
|
||
|
|
|
||
|
|
// Automatic format detection from file extension
|
||
|
|
var result = KreuzbergClient.ExtractFileSync("document.pdf");
|
||
|
|
|
||
|
|
// Manual MIME type override for files without extensions
|
||
|
|
var result2 = KreuzbergClient.ExtractFileAsBytes(rawBytes, "application/pdf", null);
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Go"
|
||
|
|
|
||
|
|
```go title="format_detection.go"
|
||
|
|
import "kreuzberg"
|
||
|
|
|
||
|
|
// Automatic format detection from file extension
|
||
|
|
result, err := kreuzberg.ExtractFileSync("document.pdf", nil)
|
||
|
|
if err != nil {
|
||
|
|
log.Fatal(err)
|
||
|
|
}
|
||
|
|
|
||
|
|
// Manual MIME type override for ambiguous files
|
||
|
|
config := &kreuzberg.ExtractionConfig{}
|
||
|
|
mimeBytes, _ := ioutil.ReadFile("document.dat")
|
||
|
|
result2, err := kreuzberg.ExtractBytesSync(mimeBytes, "application/pdf", config)
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Java"
|
||
|
|
|
||
|
|
```java title="FormatDetection.java"
|
||
|
|
import dev.kreuzberg.Kreuzberg;
|
||
|
|
import dev.kreuzberg.ExtractionResult;
|
||
|
|
|
||
|
|
// Automatic format detection from file extension
|
||
|
|
ExtractionResult result = Kreuzberg.extractFile("document.pdf");
|
||
|
|
|
||
|
|
// Manual MIME type override using detectMimeType for byte arrays
|
||
|
|
String mimeType = Kreuzberg.detectMimeType(new byte[]{/* PDF header bytes */});
|
||
|
|
ExtractionResult result2 = Kreuzberg.extractFileAsBytes(rawBytes, mimeType, null);
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Python"
|
||
|
|
|
||
|
|
```python title="format_detection.py"
|
||
|
|
from kreuzberg import extract_file
|
||
|
|
|
||
|
|
# Automatic format detection from file extension
|
||
|
|
result = extract_file("document.pdf")
|
||
|
|
|
||
|
|
# Manual MIME type override for unknown extensions
|
||
|
|
result = extract_file("document.dat", mime_type="application/pdf")
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Ruby"
|
||
|
|
|
||
|
|
```ruby title="format_detection.rb"
|
||
|
|
require 'kreuzberg'
|
||
|
|
|
||
|
|
# Automatic format detection from file extension
|
||
|
|
result = Kreuzberg.extract_file_sync('document.pdf')
|
||
|
|
|
||
|
|
# Manual MIME type override for files with ambiguous extensions
|
||
|
|
config = Kreuzberg::Config::Extraction.new
|
||
|
|
result = Kreuzberg.extract_file_sync('document.dat', mime_type: 'application/pdf', config: config)
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Rust"
|
||
|
|
|
||
|
|
```rust title="format_detection.rs"
|
||
|
|
use kreuzberg::{extract_file, ExtractionConfig};
|
||
|
|
|
||
|
|
#[tokio::main]
|
||
|
|
async fn main() -> kreuzberg::Result<()> {
|
||
|
|
let config = ExtractionConfig::default();
|
||
|
|
|
||
|
|
// Automatic format detection from file extension
|
||
|
|
let result = extract_file("document.pdf", None, &config).await?;
|
||
|
|
|
||
|
|
// Manual MIME type override for extensionless files
|
||
|
|
let result = extract_file("document.dat", Some("application/pdf"), &config).await?;
|
||
|
|
|
||
|
|
Ok(())
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "TypeScript"
|
||
|
|
|
||
|
|
```typescript title="format_detection.ts"
|
||
|
|
import { extractFile } from '@kreuzberg/node';
|
||
|
|
|
||
|
|
// Automatic format detection from file extension
|
||
|
|
const result = await extractFile('document.pdf');
|
||
|
|
|
||
|
|
// Manual MIME type override for files with no extension
|
||
|
|
const result2 = await extractFile('document.dat', { mimeType: 'application/pdf' });
|
||
|
|
```
|
||
|
|
|
||
|
|
## OCR Support
|
||
|
|
|
||
|
|
OCR is available for:
|
||
|
|
|
||
|
|
- All image formats (PNG, JPEG, WebP, BMP, TIFF, GIF, etc.)
|
||
|
|
- PDF documents (with automatic fallback for scanned PDFs)
|
||
|
|
- Embedded images in PowerPoint presentations
|
||
|
|
|
||
|
|
### Configuration
|
||
|
|
|
||
|
|
```python title="ocr_configuration.py"
|
||
|
|
from kreuzberg import extract_file, ExtractionConfig, OcrConfig, TesseractConfig
|
||
|
|
|
||
|
|
# Configure OCR with multi-language support and custom Tesseract settings
|
||
|
|
config = ExtractionConfig(
|
||
|
|
ocr=OcrConfig(
|
||
|
|
tesseract_config=TesseractConfig(
|
||
|
|
lang="eng+deu", # Multiple languages: English and German
|
||
|
|
psm=3, # Page segmentation mode: Auto
|
||
|
|
oem=1 # OCR Engine mode: LSTM neural net
|
||
|
|
)
|
||
|
|
),
|
||
|
|
force_ocr=False # Only use OCR when native text extraction is insufficient
|
||
|
|
)
|
||
|
|
|
||
|
|
result = extract_file("scanned_document.pdf", config=config)
|
||
|
|
```
|
||
|
|
|
||
|
|
### Automatic OCR Decision
|
||
|
|
|
||
|
|
For PDFs, Kreuzberg automatically decides whether OCR is needed by analyzing native text:
|
||
|
|
|
||
|
|
- **No OCR**: Document has substantial, meaningful text (>64 non-whitespace chars, >32 chars/page average)
|
||
|
|
- **OCR Fallback**: Document appears scanned (mostly punctuation, very low alphanumeric ratio)
|
||
|
|
|
||
|
|
Override with `force_ocr=True` to always use OCR regardless of native text quality.
|
||
|
|
|
||
|
|
## Performance Characteristics
|
||
|
|
|
||
|
|
### Native Rust Extractors
|
||
|
|
|
||
|
|
- **PDF**: Significantly faster than Python libraries due to native Rust implementation
|
||
|
|
- **Excel**: Streaming parser, handles multi-GB files
|
||
|
|
- **XML**: Streaming parser, memory-efficient for large documents
|
||
|
|
- **Text/Markdown**: Streaming parser with lazy regex compilation
|
||
|
|
- **Archives**: Efficient extraction without full decompression
|
||
|
|
|
||
|
|
### OLE/CFB Extractors
|
||
|
|
|
||
|
|
- Direct binary parsing of OLE2/CFB compound files
|
||
|
|
- Used for legacy formats (`.doc`, `.ppt`)
|
||
|
|
- No external tool dependencies, native Rust implementation
|
||
|
|
|
||
|
|
### Batch Processing
|
||
|
|
|
||
|
|
All formats support concurrent batch processing:
|
||
|
|
|
||
|
|
```python title="batch_processing.py"
|
||
|
|
from kreuzberg import batch_extract_file, ExtractionConfig
|
||
|
|
|
||
|
|
# Process multiple files concurrently for better throughput
|
||
|
|
paths = ["file1.pdf", "file2.docx", "file3.xlsx"]
|
||
|
|
config = ExtractionConfig(max_concurrent_extractions=8)
|
||
|
|
|
||
|
|
results = batch_extract_file(paths, config=config)
|
||
|
|
```
|
||
|
|
|
||
|
|
## Format Limitations
|
||
|
|
|
||
|
|
### Known Limitations
|
||
|
|
|
||
|
|
- **Password-Protected PDFs**: Requires `crypto` extra (`pip install kreuzberg[crypto]`)
|
||
|
|
- **Legacy Excel (.xls)**: Formula evaluation not supported (values only)
|
||
|
|
- **Encrypted Office Documents**: Password protection not supported
|
||
|
|
- **Multi-page TIFF**: OCR processes first page only (configurable)
|
||
|
|
- **Animated GIF**: Extracts first frame only
|
||
|
|
|
||
|
|
### Unsupported Formats
|
||
|
|
|
||
|
|
- Video formats (MP4, AVI, MOV, etc.)
|
||
|
|
- Audio formats (MP3, WAV, FLAC, etc.)
|
||
|
|
- CAD formats (DWG, DXF, etc.)
|
||
|
|
- Database files (MDB, ACCDB, etc.)
|
||
|
|
- Compressed Office formats without proper headers
|
||
|
|
|
||
|
|
## Adding New Formats
|
||
|
|
|
||
|
|
Kreuzberg's plugin system allows adding custom format extractors:
|
||
|
|
|
||
|
|
=== "C#"
|
||
|
|
|
||
|
|
```csharp title="CustomExtractor.cs"
|
||
|
|
using Kreuzberg;
|
||
|
|
using Kreuzberg.Plugins;
|
||
|
|
|
||
|
|
// Custom document extractor for proprietary format support
|
||
|
|
public class CustomExtractor : IDocumentExtractor
|
||
|
|
{
|
||
|
|
public string Name => "custom-format-extractor";
|
||
|
|
|
||
|
|
public string[] SupportedMimeTypes => new[] { "application/x-custom" };
|
||
|
|
|
||
|
|
public ExtractionResult ExtractBytes(byte[] content, string mimeType, ExtractionConfig config)
|
||
|
|
{
|
||
|
|
// Implement custom extraction logic for your format
|
||
|
|
var text = ParseCustomFormat(content);
|
||
|
|
return new ExtractionResult
|
||
|
|
{
|
||
|
|
Content = text,
|
||
|
|
MimeType = mimeType,
|
||
|
|
Metadata = new Dictionary<string, object>()
|
||
|
|
};
|
||
|
|
}
|
||
|
|
}
|
||
|
|
|
||
|
|
// Register the custom extractor with Kreuzberg
|
||
|
|
KreuzbergClient.RegisterDocumentExtractor(new CustomExtractor());
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Go"
|
||
|
|
|
||
|
|
```go title="custom_extractor.go"
|
||
|
|
package main
|
||
|
|
|
||
|
|
import (
|
||
|
|
"kreuzberg"
|
||
|
|
"log"
|
||
|
|
)
|
||
|
|
|
||
|
|
// CustomExtractor implements DocumentExtractor for proprietary formats
|
||
|
|
type CustomExtractor struct{}
|
||
|
|
|
||
|
|
func (e *CustomExtractor) Name() string {
|
||
|
|
return "custom-format-extractor"
|
||
|
|
}
|
||
|
|
|
||
|
|
func (e *CustomExtractor) SupportedMimeTypes() []string {
|
||
|
|
return []string{"application/x-custom"}
|
||
|
|
}
|
||
|
|
|
||
|
|
func (e *CustomExtractor) ExtractBytes(content []byte, mimeType string, config *kreuzberg.ExtractionConfig) (*kreuzberg.ExtractionResult, error) {
|
||
|
|
// Implement custom parsing logic for your file format
|
||
|
|
text := parseCustomFormat(content)
|
||
|
|
return &kreuzberg.ExtractionResult{
|
||
|
|
Content: text,
|
||
|
|
MimeType: mimeType,
|
||
|
|
Success: true,
|
||
|
|
}, nil
|
||
|
|
}
|
||
|
|
|
||
|
|
// Register the custom extractor during package initialization
|
||
|
|
func init() {
|
||
|
|
if err := kreuzberg.RegisterDocumentExtractor("custom-format-extractor", &CustomExtractor{}); err != nil {
|
||
|
|
log.Fatal(err)
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Java"
|
||
|
|
|
||
|
|
```java title="CustomExtractor.java"
|
||
|
|
import dev.kreuzberg.Kreuzberg;
|
||
|
|
import dev.kreuzberg.DocumentExtractorProtocol;
|
||
|
|
import dev.kreuzberg.ExtractionResult;
|
||
|
|
import dev.kreuzberg.config.ExtractionConfig;
|
||
|
|
|
||
|
|
// Custom document extractor for unsupported file formats
|
||
|
|
public class CustomExtractor implements DocumentExtractorProtocol {
|
||
|
|
@Override
|
||
|
|
public String name() {
|
||
|
|
return "custom-format-extractor";
|
||
|
|
}
|
||
|
|
|
||
|
|
@Override
|
||
|
|
public String[] supportedMimeTypes() {
|
||
|
|
return new String[]{"application/x-custom"};
|
||
|
|
}
|
||
|
|
|
||
|
|
@Override
|
||
|
|
public ExtractionResult extractBytes(
|
||
|
|
byte[] content,
|
||
|
|
String mimeType,
|
||
|
|
ExtractionConfig config) throws Exception {
|
||
|
|
// Implement format-specific extraction logic
|
||
|
|
String text = parseCustomFormat(content);
|
||
|
|
return new ExtractionResult(text, mimeType, true, null);
|
||
|
|
}
|
||
|
|
}
|
||
|
|
|
||
|
|
// Register the custom extractor
|
||
|
|
Kreuzberg.registerDocumentExtractor(new CustomExtractor());
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Python"
|
||
|
|
|
||
|
|
```python title="custom_extractor.py"
|
||
|
|
from kreuzberg import DocumentExtractor, ExtractionResult, Metadata
|
||
|
|
|
||
|
|
# Custom extractor for proprietary or unsupported file formats
|
||
|
|
class CustomExtractor(DocumentExtractor):
|
||
|
|
def name(self) -> str:
|
||
|
|
return "custom-format-extractor"
|
||
|
|
|
||
|
|
def supported_mime_types(self) -> list[str]:
|
||
|
|
return ["application/x-custom"]
|
||
|
|
|
||
|
|
def extract_bytes(self, content: bytes, mime_type: str, config) -> ExtractionResult:
|
||
|
|
# Implement parsing logic specific to your format
|
||
|
|
text = parse_custom_format(content)
|
||
|
|
return ExtractionResult(
|
||
|
|
content=text,
|
||
|
|
mime_type=mime_type,
|
||
|
|
metadata=Metadata()
|
||
|
|
)
|
||
|
|
|
||
|
|
# Register the custom extractor with Kreuzberg's registry
|
||
|
|
from kreuzberg import get_document_extractor_registry
|
||
|
|
registry = get_document_extractor_registry()
|
||
|
|
registry.register(CustomExtractor())
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Ruby"
|
||
|
|
|
||
|
|
```ruby title="custom_extractor.rb"
|
||
|
|
require 'kreuzberg'
|
||
|
|
|
||
|
|
# Custom document extractor for new file format support
|
||
|
|
class CustomExtractor
|
||
|
|
def name
|
||
|
|
'custom-format-extractor'
|
||
|
|
end
|
||
|
|
|
||
|
|
def supported_mime_types
|
||
|
|
['application/x-custom']
|
||
|
|
end
|
||
|
|
|
||
|
|
def extract_bytes(content, mime_type, config)
|
||
|
|
# Implement your custom format parsing logic
|
||
|
|
text = parse_custom_format(content)
|
||
|
|
Kreuzberg::Result.new(
|
||
|
|
content: text,
|
||
|
|
mime_type: mime_type,
|
||
|
|
metadata: {}
|
||
|
|
)
|
||
|
|
end
|
||
|
|
end
|
||
|
|
|
||
|
|
# Register the custom extractor
|
||
|
|
Kreuzberg.register_document_extractor(CustomExtractor.new)
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "Rust"
|
||
|
|
|
||
|
|
```rust title="custom_extractor.rs"
|
||
|
|
use kreuzberg::plugins::{DocumentExtractor, Plugin};
|
||
|
|
use kreuzberg::types::ExtractionResult;
|
||
|
|
use async_trait::async_trait;
|
||
|
|
|
||
|
|
// Custom document extractor for proprietary file formats
|
||
|
|
pub struct CustomExtractor;
|
||
|
|
|
||
|
|
impl Plugin for CustomExtractor {
|
||
|
|
fn name(&self) -> &str {
|
||
|
|
"custom-format-extractor"
|
||
|
|
}
|
||
|
|
|
||
|
|
fn version(&self) -> String {
|
||
|
|
"1.0.0".to_string()
|
||
|
|
}
|
||
|
|
}
|
||
|
|
|
||
|
|
#[async_trait]
|
||
|
|
impl DocumentExtractor for CustomExtractor {
|
||
|
|
async fn extract_bytes(
|
||
|
|
&self,
|
||
|
|
content: &[u8],
|
||
|
|
mime_type: &str,
|
||
|
|
config: &ExtractionConfig,
|
||
|
|
) -> kreuzberg::Result<ExtractionResult> {
|
||
|
|
// Implement format-specific parsing logic
|
||
|
|
let text = parse_custom_format(content)?;
|
||
|
|
Ok(ExtractionResult {
|
||
|
|
content: text,
|
||
|
|
mime_type: mime_type.to_string(),
|
||
|
|
..Default::default()
|
||
|
|
})
|
||
|
|
}
|
||
|
|
|
||
|
|
fn supported_mime_types(&self) -> &[&str] {
|
||
|
|
&["application/x-custom"]
|
||
|
|
}
|
||
|
|
}
|
||
|
|
|
||
|
|
// Register the custom extractor with Kreuzberg's plugin registry
|
||
|
|
use kreuzberg::plugins::registry::get_document_extractor_registry;
|
||
|
|
use std::sync::Arc;
|
||
|
|
|
||
|
|
let registry = get_document_extractor_registry();
|
||
|
|
registry.write().unwrap().register(Arc::new(CustomExtractor))?;
|
||
|
|
```
|
||
|
|
|
||
|
|
=== "TypeScript"
|
||
|
|
|
||
|
|
```typescript title="custom_extractor.ts"
|
||
|
|
import { registerDocumentExtractor, type DocumentExtractorProtocol } from '@kreuzberg/node';
|
||
|
|
|
||
|
|
// Custom document extractor for new or proprietary file formats
|
||
|
|
class CustomExtractor implements DocumentExtractorProtocol {
|
||
|
|
name(): string {
|
||
|
|
return "custom-format-extractor";
|
||
|
|
}
|
||
|
|
|
||
|
|
supportedMimeTypes(): string[] {
|
||
|
|
return ["application/x-custom"];
|
||
|
|
}
|
||
|
|
|
||
|
|
async extractBytes(content: Uint8Array, mimeType: string, config?: ExtractionConfig): Promise<ExtractionResult> {
|
||
|
|
// Implement custom parsing logic for your format
|
||
|
|
const text = parseCustomFormat(content);
|
||
|
|
return {
|
||
|
|
content: text,
|
||
|
|
mimeType: mimeType,
|
||
|
|
success: true,
|
||
|
|
metadata: {}
|
||
|
|
};
|
||
|
|
}
|
||
|
|
}
|
||
|
|
|
||
|
|
// Register the custom extractor
|
||
|
|
registerDocumentExtractor(new CustomExtractor());
|
||
|
|
```
|
||
|
|
|
||
|
|
## See Also
|
||
|
|
|
||
|
|
- [Configuration Reference](configuration.md) - Detailed configuration options
|
||
|
|
- [Extraction Guide](../guides/extraction.md) - Extraction examples
|
||
|
|
- [OCR Guide](../guides/ocr.md) - OCR configuration and usage
|
||
|
|
- [Plugin System](../concepts/plugin-system.md) - Custom extractor development
|