fil/skills/kreuzberg/references/supported-formats.md
Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00

Supported Formats Reference

Kreuzberg supports 91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction. All formats support text and metadata extraction. Additional capabilities like OCR and table detection are noted per format.

Office Documents

Word Processing

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| Microsoft Word | `.docx` | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | Full text extraction, tables, embedded images, metadata, styles |
| Word Macro-Enabled | `.docm` | `application/vnd.ms-word.document.macroEnabled.12` | Macro-enabled document extraction, metadata |
| Word Template | `.dotx`, `.dotm`, `.dot` | Various Word template MIME types | Template document extraction, metadata |
| OpenDocument Text | `.odt` | `application/vnd.oasis.opendocument.text` | Full text extraction, tables, embedded images, metadata, styles |

Spreadsheets

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| Excel Workbook | `.xlsx` | `application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` | Sheet data, cell values, formulas, cell metadata, charts |
| Excel Macro-Enabled | `.xlsm` | `application/vnd.ms-excel.sheet.macroEnabled.12` | Sheet data, formulas, macros (text only), metadata |
| Excel Binary | `.xlsb` | `application/vnd.ms-excel.sheet.binary.macroEnabled.12` | Binary sheet data extraction, metadata |
| Excel Legacy | `.xls` | `application/vnd.ms-excel` | Legacy sheet data extraction, metadata |
| Excel Add-in | `.xla` | `application/vnd.ms-excel` | Add-in data extraction |
| Excel Macro Add-in | `.xlam` | `application/vnd.ms-excel.addin.macroEnabled.12` | Macro add-in metadata |
| Excel Template | `.xltm` | `application/vnd.ms-excel.template.macroEnabled.12` | Template data and metadata |
| Excel Template (XML) | `.xltx` | `application/vnd.openxmlformats-officedocument.spreadsheetml.template` | XML template data and metadata |
| Excel Template (Legacy) | `.xlt` | `application/vnd.ms-excel` | Legacy template data extraction |
| OpenDocument Spreadsheet | `.ods` | `application/vnd.oasis.opendocument.spreadsheet` | Sheet data, cell values, formulas, metadata |
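
Several rows in the spreadsheet table share a MIME type: `.xls`, `.xla`, and `.xlt` all map to `application/vnd.ms-excel`, so the MIME type alone does not identify the format. The following is a minimal sketch of a lookup built from that table. It uses only the Python standard library; the names `SPREADSHEET_MIME`, `spreadsheet_mime`, and `extensions_for` are hypothetical helpers for illustration and are not part of Kreuzberg's API or a description of its detection logic.

```python
from pathlib import PurePath
from typing import List, Optional

# Extension -> MIME type, copied from the spreadsheet table in this document.
SPREADSHEET_MIME = {
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xlsm": "application/vnd.ms-excel.sheet.macroEnabled.12",
    ".xlsb": "application/vnd.ms-excel.sheet.binary.macroEnabled.12",
    ".xls": "application/vnd.ms-excel",
    ".xla": "application/vnd.ms-excel",
    ".xlam": "application/vnd.ms-excel.addin.macroEnabled.12",
    ".xltm": "application/vnd.ms-excel.template.macroEnabled.12",
    ".xltx": "application/vnd.openxmlformats-officedocument.spreadsheetml.template",
    ".xlt": "application/vnd.ms-excel",
    ".ods": "application/vnd.oasis.opendocument.spreadsheet",
}


def spreadsheet_mime(filename: str) -> Optional[str]:
    """Return the MIME type for a spreadsheet file name, or None if unknown."""
    # Compare case-insensitively so "Report.XLSX" matches ".xlsx".
    return SPREADSHEET_MIME.get(PurePath(filename).suffix.lower())


def extensions_for(mime: str) -> List[str]:
    """Return every extension in the table that maps to the given MIME type."""
    return sorted(ext for ext, m in SPREADSHEET_MIME.items() if m == mime)
```

Calling `extensions_for("application/vnd.ms-excel")` returns three extensions, which is why a caller that needs to tell a legacy workbook from an add-in or a template has to look at the extension (or the file content) as well as the MIME type.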

Presentations

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| PowerPoint Presentation | `.pptx` | `application/vnd.openxmlformats-officedocument.presentationml.presentation` | Slide text, speaker notes, embedded images, metadata |
| PowerPoint Legacy | `.ppt` | `application/vnd.ms-powerpoint` | Legacy slide text extraction, metadata |
| PowerPoint Slideshow | `.ppsx` | `application/vnd.openxmlformats-officedocument.presentationml.slideshow` | Slideshow content, speaker notes, metadata |
| PowerPoint Template | `.potx`, `.potm`, `.pot` | Various PowerPoint template MIME types | Template slide extraction, metadata |

PDF

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| Portable Document Format | `.pdf` | `application/pdf` | Text extraction, tables, embedded images, metadata, OCR (when needed), password protection support |

eBooks

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| EPUB | `.epub` | `application/epub+zip` | Chapter text, metadata, embedded resources, navigation |
| FictionBook | `.fb2` | `application/x-fictionbook+xml` | Book content, metadata, chapter structure |

Database

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| dBASE | `.dbf` | `application/x-dbf` | Table data extraction as markdown, field type support |

Hangul

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| Hangul Word Processor | `.hwp`, `.hwpx` | `application/x-hwp`, `application/haansofthwpx` | Korean document format, text extraction |

Images (OCR-Enabled)

Raster Images

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| PNG | `.png` | `image/png` | OCR text extraction, table detection, EXIF metadata, dimensions, color space |
| JPEG | `.jpg`, `.jpeg` | `image/jpeg` | OCR text extraction, table detection, EXIF metadata, color profile |
| GIF | `.gif` | `image/gif` | OCR text extraction, animation metadata, dimensions |
| WebP | `.webp` | `image/webp` | OCR text extraction, metadata, lossy/lossless detection |
| Bitmap | `.bmp` | `image/bmp` | OCR text extraction, dimensions, color depth |
| TIFF | `.tiff`, `.tif` | `image/tiff` | OCR text extraction, multi-page support, EXIF metadata, compression info |

Advanced Image Formats

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| JPEG 2000 | `.jp2` | `image/jp2` | OCR via pure Rust decoder (hayro-jpeg2000), table detection, resolution metadata |
| JPEG 2000 Extended | `.jpx` | `image/jpx` | Advanced JPEG 2000 features, high-resolution content, metadata |
| JPEG 2000 Compound | `.jpm` | `image/jpm` | Compound image support, mixed content |
| Motion JPEG 2000 | `.mj2` | `video/mj2` | JPEG 2000 video/sequence metadata |
| JBIG2 | `.jbig2`, `.jb2` | `image/jbig2` | Bi-level image OCR, high compression, technical documents |
| Portable PixMap | `.pnm`, `.pbm`, `.pgm`, `.ppm` | `image/x-portable-pixmap` | OCR for plain image formats, raw pixel data |

Vector Graphics

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| Scalable Vector Graphics | `.svg` | `image/svg+xml` | DOM parsing, embedded text extraction, graphics metadata, vector elements |

Web & Data

Markup & Structured Text

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| HyperText Markup | `.html`, `.htm` | `text/html` | DOM parsing, text extraction, metadata (Open Graph, Twitter Card), link extraction |
| XHTML | `.xhtml` | `application/xhtml+xml` | XHTML parsing, metadata extraction, semantic structure |
| XML | `.xml` | `application/xml` | DOM parsing, namespace handling, text extraction, structure analysis |
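
To make the HTML row concrete, here is a minimal sketch of what "metadata (Open Graph, Twitter Card)" and "link extraction" can look like. It uses Python's standard `html.parser` module rather than Kreuzberg itself; `MetaTagParser` and `extract_html_metadata` are hypothetical names chosen for this illustration, and the real extractor may collect more fields or use a different parser.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class MetaTagParser(HTMLParser):
    """Collects Open Graph / Twitter Card <meta> tags and <a href> links."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: Dict[str, str] = {}
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            # Open Graph uses property="og:...", Twitter Cards use name="twitter:...".
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if key and content is not None and key.startswith(("og:", "twitter:")):
                self.meta[key] = content
        elif tag == "a" and attr.get("href"):
            self.links.append(attr["href"])


def extract_html_metadata(html: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (social metadata, links) found in an HTML string."""
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.meta, parser.links
```

A page whose head contains `<meta property="og:title" content="Hello">` would yield `{"og:title": "Hello"}` in the first element of the returned tuple.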

Structured Data Formats

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| JSON | `.json` | `application/json` | Schema detection, nested structure parsing, validation |
| YAML | `.yaml`, `.yml` | `application/x-yaml` | Hierarchical data parsing, custom tags, nested structures |
| TOML | `.toml` | `application/toml` | Configuration parsing, table structures, type preservation |
| CSV | `.csv` | `text/csv` | Delimiter detection, header inference, type detection |
| TSV | `.tsv` | `text/tab-separated-values` | Tab-separated value parsing, header detection |
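
The CSV and TSV rows mention delimiter detection and header inference. As a rough illustration of those two ideas, the sketch below uses the heuristics in Python's standard `csv.Sniffer`. This is an assumption-laden stand-in, not Kreuzberg's algorithm; `sniff_table` is a hypothetical helper, and the candidate delimiter set (comma, tab, semicolon) is chosen only for the example.

```python
import csv
from typing import Tuple


def sniff_table(sample: str) -> Tuple[str, bool]:
    """Guess the delimiter of a delimited-text sample and whether row 1 is a header."""
    sniffer = csv.Sniffer()
    # Restrict the guess to a few common delimiters to reduce false positives.
    dialect = sniffer.sniff(sample, delimiters=",\t;")
    # has_header compares the first row against the types seen in later rows.
    return dialect.delimiter, sniffer.has_header(sample)


CSV_SAMPLE = "name,age\nalice,30\nbob,25\ncarol,41\n"
TSV_SAMPLE = "name\tage\nalice\t30\nbob\t25\ncarol\t41\n"
```

Both heuristics work from a sample of the file, so very short or irregular inputs can be misclassified; a production extractor would typically fall back to the file extension (`.csv` versus `.tsv`) when sniffing fails.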

Text & Markup Languages

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| Plain Text | `.txt` | `text/plain` | Raw text extraction, encoding detection |
| Markdown | `.md`, `.markdown` | `text/markdown` | CommonMark parsing, GFM extensions, front matter |
| Djot | `.djot` | `text/djot` | Djot format parsing, semantic structure |
| reStructuredText | `.rst` | `text/x-rst` | RST parsing, directive handling, role extraction |
| Org Mode | `.org` | `text/org` | Org mode structure, outline parsing, metadata |
| Rich Text Format | `.rtf` | `application/rtf` | Text with formatting extraction, font information |

Email & Archives

Email Formats

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| Email Message | `.eml` | `message/rfc822` | Headers (from, to, subject, date), body (HTML/plain text), attachments, threading info |
| Microsoft Outlook | `.msg` | `application/vnd.ms-outlook` | Outlook headers, body content, attachments, recipient metadata |
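
For readers unfamiliar with the `.eml` structure, the following sketch shows the kind of fields listed in the Email Message row (headers, body, attachment names) being pulled from an RFC 822 message with Python's standard `email` package. It does not call Kreuzberg; `summarize_eml` and the sample message are hypothetical and exist only to illustrate the shape of the data.

```python
from email import message_from_string, policy
from typing import Any, Dict

RAW_EML = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Mon, 01 Jun 2026 10:00:00 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Please find the numbers below.
"""


def summarize_eml(raw: str) -> Dict[str, Any]:
    """Return the headers, preferred body text, and attachment names of a message."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the plain-text body, falling back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "date": str(msg["Date"]),
        "body": body.get_content().strip() if body is not None else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```

Outlook `.msg` files use a different binary container and cannot be parsed this way; the sketch applies to `.eml` only.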

Archive Formats

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| ZIP Archive | `.zip` | `application/zip` | File listing, nested archive support, compression metadata |
| Tar Archive | `.tar` | `application/x-tar` | File listing, permission metadata, nested archives |
| Gzip Tar | `.tgz` | `application/gzip` | Compressed archive listing, metadata |
| Gzip | `.gz` | `application/gzip` | Compressed file metadata |
| 7-Zip | `.7z` | `application/x-7z-compressed` | File listing, compression info, nested archives |

Academic & Scientific

Citation Formats

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| BibTeX | `.bib` | `text/bibtex` | Structured parsing, entry types, field extraction |
| BibLaTeX | `.biblatex` | `text/bibtex` | Extended BibTeX format, advanced field support |
| RIS | `.ris` | `application/x-research-info-systems` | Structured RIS format parsing, type detection |
| NIH RIS | `.nbib` | `application/x-research-info-systems` | NIH/PubMed format, structured citation data |
| EndNote | `.enw` | `application/x-endnote` | EndNote XML format, citation metadata |
| Citation Style Language | `.csl` | `application/vnd.citationstyles.csl+xml` | CSL JSON/XML parsing, style definitions |

Scientific & Technical Formats

| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| LaTeX | `.tex`, `.latex` | `application/x-latex` | LaTeX source parsing, commands, document structure |
| Typst | `.typ` | `text/plain` | Typst markup parsing, document structure |
| JATS XML | `.jats` | `application/xml` | PubMed JATS parsing, article structure, metadata |
| Jupyter Notebook | `.ipynb` | `application/x-ipynb+json` | Cell extraction (code + markdown), output parsing, metadata |
| DocBook | `.docbook` | `application/docbook+xml` | DocBook XML parsing, semantic structure |
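
A Jupyter notebook is a JSON document, so "cell extraction (code + markdown)" amounts to walking its `cells` array. The sketch below shows this with the standard `json` module, assuming the common nbformat 4 layout in which each cell has a `cell_type` and a `source` that is either a string or a list of lines. `extract_cells` is a hypothetical helper, not a Kreuzberg function, and it ignores cell outputs for brevity.

```python
import json
from typing import List, Tuple


def extract_cells(notebook_json: str) -> List[Tuple[str, str]]:
    """Return (cell_type, source_text) pairs for every cell in a notebook."""
    notebook = json.loads(notebook_json)
    cells = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # nbformat allows the source to be stored as a list of line strings.
        if isinstance(source, list):
            source = "".join(source)
        cells.append((cell.get("cell_type", "unknown"), source))
    return cells
```

Joining the code and markdown sources in order gives a linear text rendering of the notebook, which is one plausible form for extracted content.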

Documentation Formats

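| Format | Extensions | MIME Type | Capabilities |
|---|---|---|---|
| OPML | `.opml` | `application/x-opml+xml` | Outline parsing, hierarchy extraction, metadata |
| Perl POD | `.pod` | `text/x-pod` | Perl documentation parsing, section extraction |
| Manual Page | `.mdoc` | `text/plain` | UNIX manual page parsing, section structure |
| Troff/Groff | `.troff` | `text/troff` | Typesetting markup parsing, document structure |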
Troff/Groff .troff text/troff Typesetting markup parsing, document structure

Format Capabilities Summary

Text Extraction

All 91+ formats support full or partial text extraction. Document structure and encoding are detected automatically.

Metadata Support

Comprehensive metadata extraction includes:

  • Document properties (title, author, subject, creation date, modification date)
  • Format-specific metadata (page count, dimensions, encoding, language)
  • EXIF data (for images)
  • Document statistics (word count, character count)

OCR (Optical Character Recognition)

OCR is available for the image formats below, and for PDF documents when needed (for example, scanned content):

  • Raster Images: PNG, JPEG, GIF, WebP, BMP, TIFF
  • Advanced Formats: JPEG 2000, JBIG2, PNM/PBM/PGM/PPM
  • Configurable Backends: Tesseract (all languages), EasyOCR, PaddleOCR (Python), Guten (Node.js)

Table Detection

Smart table detection and reconstruction available for:

  • PDF documents (native tables and scanned content with OCR)
  • Office documents (Excel, Word)
  • Images (via OCR backends)
  • HTML/XML (from markup structure)

Archive & Nested Document Support

Archives and nested formats support file listing and sequential extraction:

  • ZIP, TAR, TGZ, 7Z archives
  • Email attachments
  • Nested archives within archives
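
As an illustration of file listing with nested archive support, the sketch below walks a ZIP file held in memory and recurses into any member whose name ends in `.zip`. It relies only on Python's standard `zipfile` module; `list_zip`, `make_zip`, and the `!/` separator used to show nesting are hypothetical choices for this example and do not describe Kreuzberg's output format. TAR, TGZ, and 7Z archives would need their own readers.

```python
import io
import zipfile
from typing import Dict, List


def list_zip(data: bytes, prefix: str = "") -> List[str]:
    """List files in a ZIP archive, descending into nested .zip members."""
    names: List[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            names.append(path)
            if info.filename.lower().endswith(".zip"):
                # Read the nested archive into memory and list it in turn.
                names.extend(list_zip(archive.read(info), prefix=path + "!/"))
    return names


def make_zip(files: Dict[str, bytes]) -> bytes:
    """Build a small in-memory ZIP archive (used here only to create test input)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

Because each nested archive is read fully into memory before it is listed, a real implementation would usually cap the recursion depth and the decompressed size to guard against archive bombs.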

Getting Started

For language-specific examples and detailed API documentation, see the API Reference.

For OCR configuration and backend selection, see the OCR Backends Guide.

For comprehensive format details and format detection, see the Complete Format Reference.