
Henrik Jess Nielsen b4c07d3693


2026-06-01 23:40:55 +02:00


Supported Formats Reference

Kreuzberg supports 91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction. All formats support text and metadata extraction. Additional capabilities like OCR and table detection are noted per format.
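
As a rough illustration of how the Extensions and MIME Type columns below relate, here is a minimal sketch of an extension-based lookup. The `EXTENSION_TO_MIME` table and the `guess_mime_type` helper are hypothetical names introduced for this example; the rows are copied from the tables in this reference, and Kreuzberg's own detection logic is not shown here and may work differently.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table: a handful of rows copied from this reference.
# This is NOT Kreuzberg's detection code, only an illustration of the mapping.
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".epub": "application/epub+zip",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".md": "text/markdown",
    ".csv": "text/csv",
    ".eml": "message/rfc822",
}


def guess_mime_type(path: str) -> Optional[str]:
    """Return the MIME type for the file's final extension, or None if unknown."""
    suffix = Path(path).suffix.lower()  # case-insensitive: ".PDF" -> ".pdf"
    return EXTENSION_TO_MIME.get(suffix)
```

Several extensions in the tables share one MIME type (for example `.xls`, `.xla`, and `.xlt`), so a lookup in this direction is unambiguous while the reverse is not.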

Office Documents

Word Processing

Format	Extensions	MIME Type	Capabilities
Microsoft Word	`.docx`	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	Full text extraction, tables, embedded images, metadata, styles
Word Macro-Enabled	`.docm`	`application/vnd.ms-word.document.macroEnabled.12`	Macro-enabled document extraction, metadata
Word Template	`.dotx`, `.dotm`, `.dot`	Various Word template MIME types	Template document extraction, metadata
OpenDocument Text	`.odt`	`application/vnd.oasis.opendocument.text`	Full text extraction, tables, embedded images, metadata, styles

Spreadsheets

Format	Extensions	MIME Type	Capabilities
Excel Workbook	`.xlsx`	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`	Sheet data, cell values, formulas, cell metadata, charts
Excel Macro-Enabled	`.xlsm`	`application/vnd.ms-excel.sheet.macroEnabled.12`	Sheet data, formulas, macros (text only), metadata
Excel Binary	`.xlsb`	`application/vnd.ms-excel.sheet.binary.macroEnabled.12`	Binary sheet data extraction, metadata
Excel Legacy	`.xls`	`application/vnd.ms-excel`	Legacy sheet data extraction, metadata
Excel Add-in	`.xla`	`application/vnd.ms-excel`	Add-in data extraction
Excel Macro Add-in	`.xlam`	`application/vnd.ms-excel.addin.macroEnabled.12`	Macro add-in metadata
Excel Template	`.xltm`	`application/vnd.ms-excel.template.macroEnabled.12`	Template data and metadata
Excel Template (XML)	`.xltx`	`application/vnd.openxmlformats-officedocument.spreadsheetml.template`	XML template data and metadata
Excel Template (Legacy)	`.xlt`	`application/vnd.ms-excel`	Legacy template data extraction
OpenDocument Spreadsheet	`.ods`	`application/vnd.oasis.opendocument.spreadsheet`	Sheet data, cell values, formulas, metadata

Presentations

Format	Extensions	MIME Type	Capabilities
PowerPoint Presentation	`.pptx`	`application/vnd.openxmlformats-officedocument.presentationml.presentation`	Slide text, speaker notes, embedded images, metadata
PowerPoint Legacy	`.ppt`	`application/vnd.ms-powerpoint`	Legacy slide text extraction, metadata
PowerPoint Slideshow	`.ppsx`	`application/vnd.openxmlformats-officedocument.presentationml.slideshow`	Slideshow content, speaker notes, metadata
PowerPoint Template	`.potx`, `.potm`, `.pot`	Various PowerPoint template MIME types	Template slide extraction, metadata

PDF

Format	Extensions	MIME Type	Capabilities
Portable Document Format	`.pdf`	`application/pdf`	Text extraction, tables, embedded images, metadata, OCR (when needed), password protection support

eBooks

Format	Extensions	MIME Type	Capabilities
EPUB	`.epub`	`application/epub+zip`	Chapter text, metadata, embedded resources, navigation
FictionBook	`.fb2`	`application/x-fictionbook+xml`	Book content, metadata, chapter structure

Database

Format	Extensions	MIME Type	Capabilities
dBASE	`.dbf`	`application/x-dbf`	Table data extraction as markdown, field type support

Hangul

Format	Extensions	MIME Type	Capabilities
Hangul Word Processor	`.hwp`, `.hwpx`	`application/x-hwp`, `application/haansofthwpx`	Korean document format, text extraction

Images (OCR-Enabled)

Raster Images

Format	Extensions	MIME Type	Capabilities
PNG	`.png`	`image/png`	OCR text extraction, table detection, EXIF metadata, dimensions, color space
JPEG	`.jpg`, `.jpeg`	`image/jpeg`	OCR text extraction, table detection, EXIF metadata, color profile
GIF	`.gif`	`image/gif`	OCR text extraction, animation metadata, dimensions
WebP	`.webp`	`image/webp`	OCR text extraction, metadata, lossy/lossless detection
Bitmap	`.bmp`	`image/bmp`	OCR text extraction, dimensions, color depth
TIFF	`.tiff`, `.tif`	`image/tiff`	OCR text extraction, multi-page support, EXIF metadata, compression info

Advanced Image Formats

Format	Extensions	MIME Type	Capabilities
JPEG 2000	`.jp2`	`image/jp2`	OCR via pure Rust decoder (hayro-jpeg2000), table detection, resolution metadata
JPEG 2000 Extended	`.jpx`	`image/jpx`	Advanced JPEG 2000 features, high-resolution content, metadata
JPEG 2000 Compound	`.jpm`	`image/jpm`	Compound image support, mixed content
Motion JPEG 2000	`.mj2`	`video/mj2`	JPEG 2000 video/sequence metadata
JBIG2	`.jbig2`, `.jb2`	`image/jbig2`	Bi-level image OCR, high compression, technical documents
Netpbm (Portable Anymap family)	`.pnm`, `.pbm`, `.pgm`, `.ppm`	`image/x-portable-pixmap`	OCR for plain image formats, raw pixel data

Vector Graphics

Format	Extensions	MIME Type	Capabilities
Scalable Vector Graphics	`.svg`	`image/svg+xml`	DOM parsing, embedded text extraction, graphics metadata, vector elements

Web & Data

Markup & Structured Text

Format	Extensions	MIME Type	Capabilities
HyperText Markup	`.html`, `.htm`	`text/html`	DOM parsing, text extraction, metadata (Open Graph, Twitter Card), link extraction
XHTML	`.xhtml`	`application/xhtml+xml`	XHTML parsing, metadata extraction, semantic structure
XML	`.xml`	`application/xml`	DOM parsing, namespace handling, text extraction, structure analysis

Structured Data Formats

Format	Extensions	MIME Type	Capabilities
JSON	`.json`	`application/json`	Schema detection, nested structure parsing, validation
YAML	`.yaml`, `.yml`	`application/x-yaml`	Hierarchical data parsing, custom tags, nested structures
TOML	`.toml`	`application/toml`	Configuration parsing, table structures, type preservation
CSV	`.csv`	`text/csv`	Delimiter detection, header inference, type detection
TSV	`.tsv`	`text/tab-separated-values`	Tab-separated value parsing, header detection
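
Delimiter detection and header inference, as listed for CSV and TSV above, can be approximated with Python's standard `csv.Sniffer`. This is a sketch of the general technique, not Kreuzberg's implementation; the `read_delimited` helper is a hypothetical name, and the candidate delimiter set is an assumption.

```python
import csv
import io
from typing import List, Optional, Tuple


def read_delimited(text: str) -> Tuple[str, Optional[List[str]], List[List[str]]]:
    """Guess the delimiter and header row of delimited text.

    Returns (delimiter, header or None, data rows).
    """
    sniffer = csv.Sniffer()
    # Restrict the guess to common delimiters (an assumption for this sketch).
    dialect = sniffer.sniff(text, delimiters=",;\t|")
    rows = list(csv.reader(io.StringIO(text), dialect))
    # has_header() is a heuristic: it compares the first row against the
    # types and lengths of the values in the rows that follow.
    header = rows.pop(0) if rows and sniffer.has_header(text) else None
    return dialect.delimiter, header, rows
```

Both calls are heuristics and can misjudge short or irregular samples, so a caller would normally let users override the delimiter and header settings.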

Text & Markup Languages

Format	Extensions	MIME Type	Capabilities
Plain Text	`.txt`	`text/plain`	Raw text extraction, encoding detection
Markdown	`.md`, `.markdown`	`text/markdown`	CommonMark parsing, GFM extensions, front matter
Djot	`.djot`	`text/djot`	Djot format parsing, semantic structure
reStructuredText	`.rst`	`text/x-rst`	RST parsing, directive handling, role extraction
Org Mode	`.org`	`text/org`	Org mode structure, outline parsing, metadata
Rich Text Format	`.rtf`	`application/rtf`	Text with formatting extraction, font information

Email & Archives

Email Formats

Format	Extensions	MIME Type	Capabilities
Email Message	`.eml`	`message/rfc822`	Headers (from, to, subject, date), body (HTML/plain text), attachments, threading info
Microsoft Outlook	`.msg`	`application/vnd.ms-outlook`	Outlook headers, body content, attachments, recipient metadata
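
To make the `.eml` capabilities above concrete, the following sketch pulls headers, the plain-text body, and attachment names from an RFC 822 message using Python's standard `email` package. It illustrates the kind of data involved rather than Kreuzberg's own extractor; `summarize_eml` and the sample message are made up for this example.

```python
from email import message_from_string, policy

# A small, invented multipart message with one body part and one attachment.
RAW = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Mon, 01 Jun 2026 09:00:00 +0200
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="utf-8"

See the attached notes.
--BOUNDARY
Content-Type: text/plain; charset="utf-8"
Content-Disposition: attachment; filename="notes.txt"

Line one of the notes.
--BOUNDARY--
"""


def summarize_eml(raw: str) -> dict:
    """Collect headers, the preferred body text, and attachment filenames."""
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the plain-text body and fall back to HTML if there is none.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "date": str(msg["Date"]),
        "body": body.get_content().strip() if body is not None else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```

Outlook `.msg` files use a different binary container and are not handled by this sketch.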

Archive Formats

Format	Extensions	MIME Type	Capabilities
ZIP Archive	`.zip`	`application/zip`	File listing, nested archive support, compression metadata
Tar Archive	`.tar`	`application/x-tar`	File listing, permission metadata, nested archives
Gzip Tar	`.tgz`	`application/gzip`	Compressed archive listing, metadata
Gzip	`.gz`	`application/gzip`	Compressed file metadata
7-Zip	`.7z`	`application/x-7z-compressed`	File listing, compression info, nested archives
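
The table lists the same MIME type, `application/gzip`, for both `.tgz` and `.gz`, so the MIME type alone cannot tell a compressed tar archive from a single compressed file. One way to distinguish them, sketched here with the Python standard library, is to check the gzip magic bytes and then try to read the decompressed stream as a tar archive. The `classify_gzip` helper is hypothetical and does not describe how Kreuzberg resolves this.

```python
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def classify_gzip(data: bytes) -> str:
    """Return 'tgz', 'gz', or 'not-gzip' for an in-memory payload."""
    if not data.startswith(GZIP_MAGIC):
        return "not-gzip"
    try:
        # If the decompressed stream parses as tar, treat it as a .tgz archive.
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
            archive.getmembers()
        return "tgz"
    except (tarfile.TarError, OSError, EOFError):
        # Valid gzip header, but the content is not a readable tar archive.
        return "gz"
```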

Academic & Scientific

Citation Formats

Format	Extensions	MIME Type	Capabilities
BibTeX	`.bib`	`text/bibtex`	Structured parsing, entry types, field extraction
BibLaTeX	`.biblatex`	`text/bibtex`	Extended BibTeX format, advanced field support
RIS	`.ris`	`application/x-research-info-systems`	Structured RIS format parsing, type detection
NIH RIS	`.nbib`	`application/x-research-info-systems`	NIH/PubMed format, structured citation data
EndNote	`.enw`	`application/x-endnote`	EndNote XML format, citation metadata
Citation Style Language	`.csl`	`application/vnd.citationstyles.csl+xml`	CSL JSON/XML parsing, style definitions

Scientific & Technical Formats

Format	Extensions	MIME Type	Capabilities
LaTeX	`.tex`, `.latex`	`application/x-latex`	LaTeX source parsing, commands, document structure
Typst	`.typ`	`text/plain`	Typst markup parsing, document structure
JATS XML	`.jats`	`application/xml`	PubMed JATS parsing, article structure, metadata
Jupyter Notebook	`.ipynb`	`application/x-ipynb+json`	Cell extraction (code + markdown), output parsing, metadata
DocBook	`.docbook`	`application/docbook+xml`	DocBook XML parsing, semantic structure
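
As an illustration of the cell extraction listed for Jupyter notebooks, the sketch below reads the `cells` array of an `.ipynb` file, which is plain JSON, and returns each cell's type and source text. `extract_cells` is a hypothetical helper, not part of Kreuzberg's API, and it ignores cell outputs and metadata for brevity.

```python
import json
from typing import List, Tuple


def extract_cells(notebook_json: str) -> List[Tuple[str, str]]:
    """Return (cell_type, source) pairs from notebook JSON text."""
    notebook = json.loads(notebook_json)
    cells = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # The notebook format allows source as one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        cells.append((cell.get("cell_type", "unknown"), source))
    return cells
```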

Documentation Formats

Format	Extensions	MIME Type	Capabilities
OPML	`.opml`	`application/x-opml+xml`	Outline parsing, hierarchy extraction, metadata
Perl POD	`.pod`	`text/x-pod`	Perl documentation parsing, section extraction
Manual Page	`.mdoc`	`text/plain`	UNIX manual page parsing, section structure
Troff/Groff	`.troff`	`text/troff`	Typesetting markup parsing, document structure

Format Capabilities Summary

Text Extraction

All 91+ formats support full or partial text extraction. Document structure and encoding are detected automatically.

Metadata Support

Comprehensive metadata extraction includes:

Document properties (title, author, subject, creation date, modification date)
Format-specific metadata (page count, dimensions, encoding, language)
EXIF data (for images)
Document statistics (word count, character count)
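
The document statistics named in the list above can be sketched in a few lines. The definitions used here (whitespace-separated words, characters counted as Unicode code points including whitespace) are assumptions made for illustration; Kreuzberg may count differently, and `document_statistics` is a hypothetical helper.

```python
def document_statistics(text: str) -> dict:
    """Compute simple word and character counts for extracted text."""
    return {
        # Words are runs of non-whitespace characters (an assumed definition).
        "word_count": len(text.split()),
        # Characters include spaces and newlines.
        "character_count": len(text),
    }
```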

OCR (Optical Character Recognition)

OCR is available for image formats and, when needed, for PDF documents:

Raster Images: PNG, JPEG, GIF, WebP, BMP, TIFF
Advanced Formats: JPEG 2000, JBIG2, PNM/PBM/PGM/PPM
PDF: scanned content (OCR applied when needed)
Configurable Backends: Tesseract (all languages), EasyOCR, PaddleOCR (Python), Guten (Node.js)

Table Detection

Smart table detection and reconstruction available for:

PDF documents (native tables and scanned content with OCR)
Office documents (Excel, Word)
Images (via OCR backends)
HTML/XML (from markup structure)

Archive & Nested Document Support

Archives and nested formats support file listing and sequential extraction:

ZIP, TAR, TGZ, 7Z archives
Email attachments
Nested archives within archives
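
The idea of listing files in archives nested within archives can be sketched with Python's standard `zipfile` module. The example below handles only ZIP and caps the recursion depth; the `list_archive` helper, the `!/` path separator, and the depth limit are assumptions made for this illustration and say nothing about how Kreuzberg traverses archives.

```python
import io
import zipfile
from typing import List


def list_archive(data: bytes, prefix: str = "", max_depth: int = 3) -> List[str]:
    """List entries of an in-memory ZIP, descending into nested .zip members."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            path = prefix + info.filename
            entries.append(path)
            # Recurse into nested ZIP files, with a depth cap as a guard
            # against deeply nested or maliciously crafted archives.
            if max_depth > 0 and info.filename.lower().endswith(".zip"):
                nested = archive.read(info)
                entries.extend(list_archive(nested, path + "!/", max_depth - 1))
    return entries
```

A production implementation would also want limits on total decompressed size, which this sketch omits.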

Getting Started

For language-specific examples and detailed API documentation, see the API Reference.

For OCR configuration and backend selection, see the OCR Backends Guide.

For comprehensive format details and format detection, see the Complete Format Reference.
