
# Supported Formats Reference

Kreuzberg supports 91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction. All formats support text and metadata extraction. Additional capabilities like OCR and table detection are noted per format.
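
As a rough illustration of extension-based lookup, the sketch below maps a handful of extensions from the tables in this reference to their MIME types. It is not Kreuzberg's detection logic: `EXTENSION_TO_MIME` and `guess_mime_type` are hypothetical names, and the mapping covers only a few rows.

```python
from pathlib import Path
from typing import Optional

# A few rows copied from the tables in this reference; deliberately incomplete.
EXTENSION_TO_MIME = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".md": "text/markdown",
    ".csv": "text/csv",
    ".eml": "message/rfc822",
}


def guess_mime_type(path: str) -> Optional[str]:
    """Hypothetical helper: look up a MIME type by file extension, ignoring case."""
    return EXTENSION_TO_MIME.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ("Report.PDF", "photo.jpeg", "unknown.xyz"):
        print(name, "->", guess_mime_type(name))
```

A lookup like this only inspects the file name. It says nothing about the file's actual contents, so treat it as a first guess.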

## Office Documents

### Word Processing

| Format             | Extensions               | MIME Type                                                                 | Capabilities                                                    |
| ------------------ | ------------------------ | ------------------------------------------------------------------------- | --------------------------------------------------------------- |
| Microsoft Word     | `.docx`                  | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Full text extraction, tables, embedded images, metadata, styles |
| Word Macro-Enabled | `.docm`                  | `application/vnd.ms-word.document.macroEnabled.12`                        | Macro-enabled document extraction, metadata                     |
| Word Template      | `.dotx`, `.dotm`, `.dot` | Various Word template MIME types                                          | Template document extraction, metadata                          |
| OpenDocument Text  | `.odt`                   | `application/vnd.oasis.opendocument.text`                                 | Full text extraction, tables, embedded images, metadata, styles |

### Spreadsheets

| Format                   | Extensions | MIME Type                                                              | Capabilities                                             |
| ------------------------ | ---------- | ---------------------------------------------------------------------- | -------------------------------------------------------- |
| Excel Workbook           | `.xlsx`    | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`    | Sheet data, cell values, formulas, cell metadata, charts |
| Excel Macro-Enabled      | `.xlsm`    | `application/vnd.ms-excel.sheet.macroEnabled.12`                       | Sheet data, formulas, macros (text only), metadata       |
| Excel Binary             | `.xlsb`    | `application/vnd.ms-excel.sheet.binary.macroEnabled.12`                | Binary sheet data extraction, metadata                   |
| Excel Legacy             | `.xls`     | `application/vnd.ms-excel`                                             | Legacy sheet data extraction, metadata                   |
| Excel Add-in             | `.xla`     | `application/vnd.ms-excel`                                             | Add-in data extraction                                   |
| Excel Macro Add-in       | `.xlam`    | `application/vnd.ms-excel.addin.macroEnabled.12`                       | Macro add-in metadata                                    |
| Excel Template           | `.xltm`    | `application/vnd.ms-excel.template.macroEnabled.12`                    | Template data and metadata                               |
| Excel Template (XML)     | `.xltx`    | `application/vnd.openxmlformats-officedocument.spreadsheetml.template` | XML template data and metadata                           |
| Excel Template (Legacy)  | `.xlt`     | `application/vnd.ms-excel`                                             | Legacy template data extraction                          |
| OpenDocument Spreadsheet | `.ods`     | `application/vnd.oasis.opendocument.spreadsheet`                       | Sheet data, cell values, formulas, metadata              |

### Presentations

| Format                  | Extensions               | MIME Type                                                                   | Capabilities                                         |
| ----------------------- | ------------------------ | --------------------------------------------------------------------------- | ---------------------------------------------------- |
| PowerPoint Presentation | `.pptx`                  | `application/vnd.openxmlformats-officedocument.presentationml.presentation` | Slide text, speaker notes, embedded images, metadata |
| PowerPoint Legacy       | `.ppt`                   | `application/vnd.ms-powerpoint`                                             | Legacy slide text extraction, metadata               |
| PowerPoint Slideshow    | `.ppsx`                  | `application/vnd.openxmlformats-officedocument.presentationml.slideshow`    | Slideshow content, speaker notes, metadata           |
| PowerPoint Template     | `.potx`, `.potm`, `.pot` | Various PowerPoint template MIME types                                      | Template slide extraction, metadata                  |

### PDF

| Format                   | Extensions | MIME Type         | Capabilities                                                                                       |
| ------------------------ | ---------- | ----------------- | -------------------------------------------------------------------------------------------------- |
| Portable Document Format | `.pdf`     | `application/pdf` | Text extraction, tables, embedded images, metadata, OCR (when needed), password protection support |

### eBooks

| Format      | Extensions | MIME Type                       | Capabilities                                           |
| ----------- | ---------- | ------------------------------- | ------------------------------------------------------ |
| EPUB        | `.epub`    | `application/epub+zip`          | Chapter text, metadata, embedded resources, navigation |
| FictionBook | `.fb2`     | `application/x-fictionbook+xml` | Book content, metadata, chapter structure              |

### Database

| Format | Extensions | MIME Type           | Capabilities                                          |
| ------ | ---------- | ------------------- | ----------------------------------------------------- |
| dBASE  | `.dbf`     | `application/x-dbf` | Table data extraction as markdown, field type support |

### Hangul

| Format                | Extensions      | MIME Type                                       | Capabilities                            |
| --------------------- | --------------- | ----------------------------------------------- | --------------------------------------- |
| Hangul Word Processor | `.hwp`, `.hwpx` | `application/x-hwp`, `application/haansofthwpx` | Korean document format, text extraction |

## Images (OCR-Enabled)

### Raster Images

| Format | Extensions      | MIME Type    | Capabilities                                                                 |
| ------ | --------------- | ------------ | ---------------------------------------------------------------------------- |
| PNG    | `.png`          | `image/png`  | OCR text extraction, table detection, EXIF metadata, dimensions, color space |
| JPEG   | `.jpg`, `.jpeg` | `image/jpeg` | OCR text extraction, table detection, EXIF metadata, color profile           |
| GIF    | `.gif`          | `image/gif`  | OCR text extraction, animation metadata, dimensions                          |
| WebP   | `.webp`         | `image/webp` | OCR text extraction, metadata, lossy/lossless detection                      |
| Bitmap | `.bmp`          | `image/bmp`  | OCR text extraction, dimensions, color depth                                 |
| TIFF   | `.tiff`, `.tif` | `image/tiff` | OCR text extraction, multi-page support, EXIF metadata, compression info     |

### Advanced Image Formats

| Format             | Extensions                     | MIME Type                 | Capabilities                                                                     |
| ------------------ | ------------------------------ | ------------------------- | -------------------------------------------------------------------------------- |
| JPEG 2000          | `.jp2`                         | `image/jp2`               | OCR via pure Rust decoder (hayro-jpeg2000), table detection, resolution metadata |
| JPEG 2000 Extended | `.jpx`                         | `image/jpx`               | Advanced JPEG 2000 features, high-resolution content, metadata                   |
| JPEG 2000 Compound | `.jpm`                         | `image/jpm`               | Compound image support, mixed content                                            |
| Motion JPEG 2000   | `.mj2`                         | `video/mj2`               | JPEG 2000 video/sequence metadata                                                |
| JBIG2              | `.jbig2`, `.jb2`               | `image/jbig2`             | Bi-level image OCR, high compression, technical documents                        |
| Portable PixMap    | `.pnm`, `.pbm`, `.pgm`, `.ppm` | `image/x-portable-pixmap` | OCR for plain image formats, raw pixel data                                      |

### Vector Graphics

| Format                   | Extensions | MIME Type       | Capabilities                                                              |
| ------------------------ | ---------- | --------------- | ------------------------------------------------------------------------- |
| Scalable Vector Graphics | `.svg`     | `image/svg+xml` | DOM parsing, embedded text extraction, graphics metadata, vector elements |

## Web & Data

### Markup & Structured Text

| Format           | Extensions      | MIME Type               | Capabilities                                                                       |
| ---------------- | --------------- | ----------------------- | ---------------------------------------------------------------------------------- |
| HyperText Markup | `.html`, `.htm` | `text/html`             | DOM parsing, text extraction, metadata (Open Graph, Twitter Card), link extraction |
| XHTML            | `.xhtml`        | `application/xhtml+xml` | XHTML parsing, metadata extraction, semantic structure                             |
| XML              | `.xml`          | `application/xml`       | DOM parsing, namespace handling, text extraction, structure analysis               |

### Structured Data Formats

| Format | Extensions      | MIME Type                   | Capabilities                                               |
| ------ | --------------- | --------------------------- | ---------------------------------------------------------- |
| JSON   | `.json`         | `application/json`          | Schema detection, nested structure parsing, validation     |
| YAML   | `.yaml`, `.yml` | `application/x-yaml`        | Hierarchical data parsing, custom tags, nested structures  |
| TOML   | `.toml`         | `application/toml`          | Configuration parsing, table structures, type preservation |
| CSV    | `.csv`          | `text/csv`                  | Delimiter detection, header inference, type detection      |
| TSV    | `.tsv`          | `text/tab-separated-values` | Tab-separated value parsing, header detection              |
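
To make "delimiter detection" and "header inference" concrete, here is a minimal sketch using Python's standard `csv.Sniffer`. It only illustrates the general technique; it is not a description of how Kreuzberg implements it, and the sample data is made up.

```python
import csv
import io

# Made-up sample that uses semicolons instead of commas.
sample = "name;age;city\nAda;36;London\nLinus;54;Portland\n"

sniffer = csv.Sniffer()

# Restrict the candidates to common delimiters to keep the guess stable.
dialect = sniffer.sniff(sample, delimiters=",;\t")

# Heuristic: compares the first row against the value types of later rows.
has_header = sniffer.has_header(sample)

rows = list(csv.reader(io.StringIO(sample), dialect))

if __name__ == "__main__":
    print("delimiter:", repr(dialect.delimiter))
    print("header row detected:", has_header)
    print("data rows:", rows[1:] if has_header else rows)
```

Both calls are heuristics and can misfire on very short or irregular samples, so it is worth allowing callers to override the result.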

### Text & Markup Languages

| Format           | Extensions         | MIME Type         | Capabilities                                      |
| ---------------- | ------------------ | ----------------- | ------------------------------------------------- |
| Plain Text       | `.txt`             | `text/plain`      | Raw text extraction, encoding detection           |
| Markdown         | `.md`, `.markdown` | `text/markdown`   | CommonMark parsing, GFM extensions, front matter  |
| Djot             | `.djot`            | `text/djot`       | Djot format parsing, semantic structure           |
| reStructuredText | `.rst`             | `text/x-rst`      | RST parsing, directive handling, role extraction  |
| Org Mode         | `.org`             | `text/org`        | Org mode structure, outline parsing, metadata     |
| Rich Text Format | `.rtf`             | `application/rtf` | Text with formatting extraction, font information |

## Email & Archives

### Email Formats

| Format            | Extensions | MIME Type                    | Capabilities                                                                           |
| ----------------- | ---------- | ---------------------------- | -------------------------------------------------------------------------------------- |
| Email Message     | `.eml`     | `message/rfc822`             | Headers (from, to, subject, date), body (HTML/plain text), attachments, threading info |
| Microsoft Outlook | `.msg`     | `application/vnd.ms-outlook` | Outlook headers, body content, attachments, recipient metadata                         |
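
For a sense of what header, body, and attachment extraction from an `.eml` message involves, the sketch below uses Python's standard `email` package on a made-up message. It is an illustration of the general idea under those assumptions, not Kreuzberg's extractor.

```python
from email import message_from_string, policy

# Made-up RFC 822 message: headers, a blank line, then a plain-text body.
raw = (
    "From: ada@example.com\n"
    "To: linus@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "See attached.\n"
)

# policy.default yields an EmailMessage with the modern accessor methods.
msg = message_from_string(raw, policy=policy.default)

headers = {name: str(msg[name]) for name in ("From", "To", "Subject")}

# Prefer the plain-text part; for this single-part message it is the message itself.
body = msg.get_body(preferencelist=("plain",)).get_content()

# File names of any attachments (none in this sample).
attachments = [part.get_filename() for part in msg.iter_attachments()]

if __name__ == "__main__":
    print(headers)
    print(body.strip())
    print(attachments)
```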

### Archive Formats

| Format      | Extensions | MIME Type                     | Capabilities                                               |
| ----------- | ---------- | ----------------------------- | ---------------------------------------------------------- |
| ZIP Archive | `.zip`     | `application/zip`             | File listing, nested archive support, compression metadata |
| Tar Archive | `.tar`     | `application/x-tar`           | File listing, permission metadata, nested archives         |
| Gzip Tar    | `.tgz`     | `application/gzip`            | Compressed archive listing, metadata                       |
| Gzip        | `.gz`      | `application/gzip`            | Compressed file metadata                                   |
| 7-Zip       | `.7z`      | `application/x-7z-compressed` | File listing, compression info, nested archives            |

## Academic & Scientific

### Citation Formats

| Format                  | Extensions  | MIME Type                                | Capabilities                                      |
| ----------------------- | ----------- | ---------------------------------------- | ------------------------------------------------- |
| BibTeX                  | `.bib`      | `text/bibtex`                            | Structured parsing, entry types, field extraction |
| BibLaTeX                | `.biblatex` | `text/bibtex`                            | Extended BibTeX format, advanced field support    |
| RIS                     | `.ris`      | `application/x-research-info-systems`    | Structured RIS format parsing, type detection     |
| NIH RIS                 | `.nbib`     | `application/x-research-info-systems`    | NIH/PubMed format, structured citation data       |
| EndNote                 | `.enw`      | `application/x-endnote`                  | EndNote XML format, citation metadata             |
| Citation Style Language | `.csl`      | `application/vnd.citationstyles.csl+xml` | CSL JSON/XML parsing, style definitions           |

### Scientific & Technical Formats

| Format           | Extensions       | MIME Type                  | Capabilities                                                |
| ---------------- | ---------------- | -------------------------- | ----------------------------------------------------------- |
| LaTeX            | `.tex`, `.latex` | `application/x-latex`      | LaTeX source parsing, commands, document structure          |
| Typst            | `.typ`           | `text/plain`               | Typst markup parsing, document structure                    |
| JATS XML         | `.jats`          | `application/xml`          | PubMed JATS parsing, article structure, metadata            |
| Jupyter Notebook | `.ipynb`         | `application/x-ipynb+json` | Cell extraction (code + markdown), output parsing, metadata |
| DocBook          | `.docbook`       | `application/docbook+xml`  | DocBook XML parsing, semantic structure                     |

### Documentation Formats

| Format      | Extensions | MIME Type                | Capabilities                                    |
| ----------- | ---------- | ------------------------ | ----------------------------------------------- |
| OPML        | `.opml`    | `application/x-opml+xml` | Outline parsing, hierarchy extraction, metadata |
| Perl POD    | `.pod`     | `text/x-pod`             | Perl documentation parsing, section extraction  |
| Manual Page | `.mdoc`    | `text/plain`             | UNIX manual page parsing, section structure     |
| Troff/Groff | `.troff`   | `text/troff`             | Typesetting markup parsing, document structure  |

## Format Capabilities Summary

### Text Extraction

All supported formats offer full or partial text extraction. Document structure and encoding are detected automatically.

### Metadata Support

Comprehensive metadata extraction includes:

- Document properties (title, author, subject, creation date, modification date)
- Format-specific metadata (page count, dimensions, encoding, language)
- EXIF data (for images)
- Document statistics (word count, character count)

### OCR (Optical Character Recognition)

OCR is available for image formats and, when needed, for PDF documents with scanned content. Supported image inputs and backends:

- **Raster Images**: PNG, JPEG, GIF, WebP, BMP, TIFF
- **Advanced Formats**: JPEG 2000, JBIG2, PNM/PBM/PGM/PPM
- **Configurable Backends**: Tesseract (all languages), EasyOCR, PaddleOCR (Python), Guten (Node.js)
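
The routing implied by the list above can be sketched as a simple extension check. This is a hypothetical helper, not part of Kreuzberg's API: `OCR_IMAGE_EXTENSIONS`, `may_need_ocr`, and the `has_text_layer` flag are assumptions made for illustration, and the extension set is taken from the image tables in this reference.

```python
from pathlib import Path

# Extensions of the image formats listed above as OCR-capable.
OCR_IMAGE_EXTENSIONS = frozenset({
    ".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff", ".tif",
    ".jp2", ".jbig2", ".jb2", ".pnm", ".pbm", ".pgm", ".ppm",
})


def may_need_ocr(path: str, has_text_layer: bool = False) -> bool:
    """Hypothetical check: images always go through OCR, PDFs only without a text layer."""
    suffix = Path(path).suffix.lower()
    if suffix in OCR_IMAGE_EXTENSIONS:
        return True
    if suffix == ".pdf":
        # Assumption: the caller already knows whether the PDF has extractable text.
        return not has_text_layer
    return False


if __name__ == "__main__":
    print(may_need_ocr("scan.TIFF"))
    print(may_need_ocr("report.pdf", has_text_layer=True))
    print(may_need_ocr("notes.md"))
```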

### Table Detection

Table detection and reconstruction are available for:

- PDF documents (native tables and scanned content with OCR)
- Office documents (Excel, Word)
- Images (via OCR backends)
- HTML/XML (from markup structure)

### Archive & Nested Document Support

Archives and nested formats support file listing and sequential extraction:

- ZIP, TAR, TGZ, 7Z archives
- Email attachments
- Nested archives within archives
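
File listing with nested archives can be sketched with Python's standard `zipfile` module. The function below is a minimal illustration for ZIP only, not Kreuzberg's implementation. `list_zip_entries`, the `!/` path separator, and the `max_depth` limit are all assumptions chosen for the example.

```python
import io
import zipfile
from typing import List


def list_zip_entries(data: bytes, prefix: str = "", max_depth: int = 3) -> List[str]:
    """Hypothetical helper: list files in a ZIP, descending into nested ZIPs."""
    entries: List[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = prefix + info.filename
            entries.append(name)
            # Recurse into nested ZIPs, bounded by max_depth to avoid runaway nesting.
            if max_depth > 0 and info.filename.lower().endswith(".zip"):
                try:
                    nested = archive.read(info)
                    entries.extend(
                        list_zip_entries(nested, prefix=name + "!/", max_depth=max_depth - 1)
                    )
                except zipfile.BadZipFile:
                    # Named like a ZIP but not readable as one; keep it as a plain entry.
                    pass
    return entries


def build_demo_archive() -> bytes:
    """Build a small in-memory ZIP that contains another ZIP."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as z:
        z.writestr("notes.txt", "hello")
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as z:
        z.writestr("readme.md", "# hi")
        z.writestr("inner.zip", inner.getvalue())
    return outer.getvalue()


if __name__ == "__main__":
    for entry in list_zip_entries(build_demo_archive()):
        print(entry)
```

The depth limit is a defensive choice: deeply nested or maliciously crafted archives can otherwise consume unbounded time and memory.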

## Getting Started

For language-specific examples and detailed API documentation, see the [API Reference](https://docs.kreuzberg.dev/reference/api-python/).

For OCR configuration and backend selection, see the [OCR Backends Guide](https://docs.kreuzberg.dev/guides/ocr/).

For comprehensive format details and format detection, see the [Complete Format Reference](https://docs.kreuzberg.dev/reference/formats/).