---
title: "Types Reference"
---

## Types Reference

All types defined by the library, grouped by category. Types are shown using Rust as the canonical representation.

### Result Types

#### StructuredDataResult

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | The extracted text content |
| `format` | `String` | — | Format |
| `metadata` | `HashMap` | — | Document metadata |
| `text_fields` | `Vec` | — | Text fields |

---

#### ExtractionResult

General extraction result used by the core extraction API. This is the main result type returned by all extraction functions.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | The extracted text content |
| `mime_type` | `String` | — | The detected MIME type |
| `metadata` | `Metadata` | — | Document metadata |
| `extraction_method` | `Option` | `Default::default()` | Extraction strategy used to produce the returned text. Populated when the extractor can reliably distinguish native text extraction, OCR-only extraction, or mixed native/OCR output. |
| `tables` | `Vec` | `vec![]` | Tables extracted from the document |
| `detected_languages` | `Vec` | `vec![]` | Detected languages |
| `chunks` | `Vec` | `vec![]` | Text chunks when chunking is enabled. When chunking configuration is provided, the content is split into overlapping chunks for efficient processing. Each chunk contains the text, optional embeddings (if enabled), and metadata about its position. |
| `images` | `Vec` | `vec![]` | Extracted images from the document. When image extraction is enabled via `ImageExtractionConfig`, this field contains all images found in the document with their raw data and metadata. Each image may optionally contain a nested `ocr_result` if OCR was performed. |
| `pages` | `Vec` | `vec![]` | Per-page content when page extraction is enabled. When page extraction is configured, the document is split into per-page content with tables and images mapped to their respective pages. |
| `elements` | `Vec` | `vec![]` | Semantic elements when element-based result format is enabled. When `result_format` is set to ElementBased, this field contains semantic elements with type classification, unique identifiers, and metadata for Unstructured-compatible element-based processing. |
| `djot_content` | `Option` | `Default::default()` | Rich Djot content structure (when extracting Djot documents). When extracting Djot documents with structured extraction enabled, this field contains the full semantic structure, including: block-level elements with nesting; inline formatting with attributes; links, images, footnotes; math expressions; complete attribute information. The `content` field still contains plain text for backward compatibility. Always `None` for non-Djot documents. |
| `ocr_elements` | `Vec` | `vec![]` | OCR elements with full spatial and confidence metadata. When OCR is performed with element extraction enabled, this field contains the structured representation of detected text, including: bounding geometry (rectangles or quadrilaterals); confidence scores (detection and recognition); rotation information; hierarchical relationships (Tesseract only). This field preserves all metadata that would otherwise be lost when converting to plain text or markdown output formats. Only populated when `OcrElementConfig.include_elements` is true. |
| `document` | `Option` | `Default::default()` | Structured document tree (when document structure extraction is enabled). When `include_document_structure` is true in `ExtractionConfig`, this field contains the full hierarchical representation of the document, including: heading-driven section nesting; table grids with cell-level metadata; content layer classification (body, header, footer, footnote); inline text annotations (formatting, links); bounding boxes and page numbers. Independent of `result_format`; can be combined with Unified or ElementBased. |
| `extracted_keywords` | `Vec` | `vec![]` | Extracted keywords when keyword extraction is enabled. When keyword extraction (RAKE or YAKE) is configured, this field contains the extracted keywords with scores, algorithm info, and position data. Previously stored in `metadata.additional["keywords"]`. |
| `quality_score` | `Option` | `Default::default()` | Document quality score from quality analysis. A value between 0.0 and 1.0 indicating the overall text quality. Previously stored in `metadata.additional["quality_score"]`. |
| `processing_warnings` | `Vec` | `vec![]` | Non-fatal warnings collected during processing pipeline stages. Captures errors from optional pipeline features (embedding, chunking, language detection, output formatting) that don't prevent extraction but may indicate degraded results. Previously stored as individual keys in `metadata.additional`. |
| `annotations` | `Vec` | `vec![]` | PDF annotations extracted from the document. When annotation extraction is enabled via `PdfConfig.extract_annotations`, this field contains text notes, highlights, links, stamps, and other annotations found in PDF documents. |
| `children` | `Vec` | `vec![]` | Nested extraction results from archive contents. When extracting archives, each processable file inside produces its own full extraction result. Set to `None` for non-archive formats. Use `max_archive_depth` in config to control recursion depth. |
| `uris` | `Vec` | `vec![]` | URIs/links discovered during document extraction. Contains hyperlinks, image references, citations, email addresses, and other URI-like references found in the document. Always extracted when present in the source document. |
| `revisions` | `Vec` | `vec![]` | Tracked changes embedded in the source document. Populated by per-format extractors that understand change-tracking metadata (DOCX `w:ins`/`w:del`/`w:rPrChange`, ODT `text:change-*`, …). Every extractor defaults to `None` until its format-specific implementation is added. Extractors that do populate this field follow the "accepted-changes" convention: inserted text is present in `content`, deleted text is absent, and the revision list is the separate audit trail. |
| `structured_output` | `Option` | `Default::default()` | Structured extraction output from LLM-based JSON schema extraction. When `structured_extraction` is configured in `ExtractionConfig`, the extracted document content is sent to a VLM with the provided JSON schema. The response is parsed and stored here as a JSON value matching the schema. |
| `code_intelligence` | `Option` | `Default::default()` | Code intelligence results from tree-sitter analysis. Populated when extracting source code files with the `tree-sitter` feature. Contains metrics, structural analysis, imports/exports, comments, docstrings, symbols, diagnostics, and optionally chunked code segments. Stored as an opaque JSON value so that all language bindings (Go, Java, C#, …) can deserialize it as a raw JSON object rather than a typed struct. The underlying type is `tree_sitter_language_pack.ProcessResult`. |
| `llm_usage` | `Vec` | `vec![]` | LLM token usage and cost data for all LLM calls made during this extraction. Contains one entry per LLM call. Multiple entries are produced when VLM OCR, structured extraction, or LLM embeddings run during the same extraction. `None` when no LLM was used. |
| `formatted_content` | `Option` | `Default::default()` | Pre-rendered content in the requested output format. Populated during `derive_extraction_result` before tree derivation consumes element data. `apply_output_format` swaps this into `content` at the end of the pipeline, after post-processors have operated on plain text. |
| `ocr_internal_document` | `Option` | `Default::default()` | Structured hOCR document for the OCR+layout pipeline. When Tesseract produces hOCR output, the parsed `InternalDocument` carries paragraph structure with bounding boxes and confidence scores. The layout classification step enriches these elements before final rendering. |

---

#### XmlExtractionResult

XML extraction result. Contains extracted text content from XML files along with structural statistics about the XML document.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Extracted text content (XML structure filtered out) |
| `element_count` | `usize` | — | Total number of XML elements processed |
| `unique_elements` | `Vec` | — | List of unique element names found (sorted) |

---

#### TextExtractionResult

Plain text and Markdown extraction result. Contains the extracted text along with statistics and, for Markdown files, structural elements like headers and links.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Extracted text content |
| `line_count` | `usize` | — | Number of lines |
| `word_count` | `usize` | — | Number of words |
| `character_count` | `usize` | — | Number of characters |
| `headers` | `Vec` | `None` | Markdown headers (text only, Markdown files only) |
| `links` | `Vec` | `None` | Markdown links as (text, URL) tuples (Markdown files only) |
| `code_blocks` | `Vec` | `None` | Code blocks as (language, code) tuples (Markdown files only) |

---

#### PptxExtractionResult

PowerPoint (PPTX) extraction result. Contains extracted slide content, metadata, and embedded images/tables.

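
As a rough illustration of consuming this result, the sketch below renders the `hyperlinks` field (described in the table that follows as (url, optional_label) pairs) as a Markdown list. The `Hyperlink` alias and its `(String, Option<String>)` element type are assumptions made for this example, since the table lists the field only as `Vec`; `hyperlinks_to_markdown` is a hypothetical helper and not part of the library.

```rust
/// Assumed element type for `PptxExtractionResult.hyperlinks`: (url, optional_label).
/// The reference table only shows `Vec`, so this shape is an assumption.
type Hyperlink = (String, Option<String>);

/// Hypothetical helper: renders each hyperlink as one Markdown bullet.
/// Links with a non-blank label become `[label](url)`; the rest become `<url>`.
fn hyperlinks_to_markdown(links: &[Hyperlink]) -> String {
    links
        .iter()
        .map(|(url, label)| match label {
            Some(text) if !text.trim().is_empty() => {
                format!("- [{}]({})", text.trim(), url)
            }
            _ => format!("- <{}>", url),
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // Sample data standing in for `result.hyperlinks`.
    let links: Vec<Hyperlink> = vec![
        (
            "https://example.com/report".to_string(),
            Some("Quarterly report".to_string()),
        ),
        ("https://example.com/raw".to_string(), None),
    ];
    println!("{}", hyperlinks_to_markdown(&links));
}
```

Treating a blank label the same as a missing one is a design choice of this sketch; the source does not say whether the extractor ever emits empty labels.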

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Extracted text content from all slides |
| `metadata` | `PptxMetadata` | — | Presentation metadata |
| `slide_count` | `usize` | — | Total number of slides |
| `image_count` | `usize` | — | Total number of embedded images |
| `table_count` | `usize` | — | Total number of tables |
| `images` | `Vec` | — | Extracted images from the presentation |
| `page_structure` | `Option` | `None` | Slide structure with boundaries (when page tracking is enabled) |
| `page_contents` | `Vec` | `None` | Per-slide content (when page tracking is enabled) |
| `document` | `Option` | `None` | Structured document representation |
| `hyperlinks` | `Vec` | `/* serde(default) */` | Hyperlinks discovered in slides as (url, optional_label) pairs. |
| `office_metadata` | `HashMap` | `/* serde(default) */` | Office metadata extracted from docProps/core.xml and docProps/app.xml. Contains keys like "title", "author", "created_by", "subject", "keywords", "modified_by", "created_at", "modified_at", etc. |
| `revisions` | `Vec` | `/* serde(default) */` | Slide comments as revisions. Each `<p:cm>` element in `ppt/comments/comment{N}.xml` becomes a `DocumentRevision { kind: Comment }` with author (resolved from `ppt/commentAuthors.xml`), ISO-8601 timestamp, and `RevisionAnchor.Slide { index }`. `None` when no comment XML parts exist. |

---

#### EmailExtractionResult

Email extraction result. Complete representation of an extracted email message (.eml or .msg) including headers, body content, and attachments.

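
A minimal sketch of how a caller might work with the body and recipient fields listed in the table that follows. `EmailSummary` is a hypothetical, trimmed-down mirror of `EmailExtractionResult` defined only for this example; the `Option<String>` and `Vec<String>` element types are assumptions, because the table shows just `Option` and `Vec`. The fallback order in `preferred_body` is one possible policy, not documented library behavior.

```rust
/// Hypothetical, trimmed-down mirror of `EmailExtractionResult`.
/// Element types (`String`) are assumed; the reference shows only `Option`/`Vec`.
struct EmailSummary {
    plain_text: Option<String>,
    content: String,
    to_emails: Vec<String>,
    cc_emails: Vec<String>,
    bcc_emails: Vec<String>,
}

/// One possible policy: prefer a non-blank plain-text body,
/// otherwise fall back to the cleaned `content` field.
fn preferred_body(email: &EmailSummary) -> &str {
    match email.plain_text.as_deref() {
        Some(text) if !text.trim().is_empty() => text,
        _ => &email.content,
    }
}

/// Collects To, CC, and BCC recipients, in that order.
fn all_recipients(email: &EmailSummary) -> Vec<&str> {
    email
        .to_emails
        .iter()
        .chain(email.cc_emails.iter())
        .chain(email.bcc_emails.iter())
        .map(String::as_str)
        .collect()
}

fn main() {
    let email = EmailSummary {
        plain_text: None,
        content: "Cleaned body".to_string(),
        to_emails: vec!["a@example.com".to_string()],
        cc_emails: vec!["b@example.com".to_string()],
        bcc_emails: vec![],
    };
    println!("body: {}", preferred_body(&email));
    println!("recipients: {:?}", all_recipients(&email));
}
```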

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `subject` | `Option` | `None` | Email subject line |
| `from_email` | `Option` | `None` | Sender email address |
| `to_emails` | `Vec` | — | Primary recipient email addresses |
| `cc_emails` | `Vec` | — | CC recipient email addresses |
| `bcc_emails` | `Vec` | — | BCC recipient email addresses |
| `date` | `Option` | `None` | Email date/timestamp |
| `message_id` | `Option` | `None` | Message-ID header value |
| `plain_text` | `Option` | `None` | Plain text version of the email body |
| `html_content` | `Option` | `None` | HTML version of the email body |
| `content` | `String` | — | Cleaned/processed text content. Aliased as `cleaned_text` for back-compat. |
| `attachments` | `Vec` | — | List of email attachments |
| `metadata` | `HashMap` | — | Additional email headers and metadata |

---

#### OcrExtractionResult

OCR extraction result. Result of performing OCR on an image or scanned document, including recognized text and detected tables.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Recognized text content |
| `mime_type` | `String` | — | Original MIME type of the processed image |
| `metadata` | `HashMap` | — | OCR processing metadata (confidence scores, language, etc.) |
| `tables` | `Vec` | — | Tables detected and extracted via OCR |
| `ocr_elements` | `Vec` | `/* serde(default) */` | Structured OCR elements with bounding boxes and confidence scores. Available when TSV output is requested or table detection is enabled. |
| `internal_document` | `Option` | `None` | Structured document produced from hOCR parsing. Carries paragraph structure, bounding boxes, and confidence scores that the flattened `content` string discards. |

---

#### OrientationResult

Document orientation detection result.

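
The sketch below shows one way a caller might decide whether to trust a detected orientation, using the `degrees` and `confidence` fields from the table that follows. The struct here is a local stand-in with the same two fields, defined only so the example compiles on its own; `trusted_orientation` and the 0.8 threshold are hypothetical choices, not library defaults.

```rust
/// Local stand-in for `OrientationResult` with the two documented fields.
#[derive(Debug, Clone, Copy, PartialEq)]
struct OrientationResult {
    /// Detected orientation in degrees (0, 90, 180, or 270).
    degrees: u32,
    /// Confidence score in the range 0.0-1.0.
    confidence: f32,
}

/// Hypothetical helper: returns the detected angle only when it is one of the
/// four documented values and the confidence meets `min_confidence`.
fn trusted_orientation(result: &OrientationResult, min_confidence: f32) -> Option<u32> {
    let valid_angle = matches!(result.degrees, 0 | 90 | 180 | 270);
    if valid_angle && result.confidence >= min_confidence {
        Some(result.degrees)
    } else {
        None
    }
}

fn main() {
    let detected = OrientationResult { degrees: 90, confidence: 0.92 };
    // 0.8 is an arbitrary threshold chosen for this example.
    match trusted_orientation(&detected, 0.8) {
        Some(0) => println!("page is upright"),
        Some(angle) => println!("page is rotated by {angle} degrees"),
        None => println!("orientation is uncertain; leaving page as-is"),
    }
}
```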

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `degrees` | `u32` | — | Detected orientation in degrees (0, 90, 180, or 270). |
| `confidence` | `f32` | — | Confidence score (0.0-1.0). |

---

#### DetectionResult

Page-level detection result containing all detections and page metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `page_width` | `u32` | — | Page width |
| `page_height` | `u32` | — | Page height |
| `detections` | `Vec` | — | Detections |

---

### Configuration Types

See [Configuration Reference](configuration.md) for detailed defaults and language-specific representations.

#### AccelerationConfig

Hardware acceleration configuration for ONNX Runtime models. Controls which execution provider (CPU, CoreML, CUDA, TensorRT) is used for inference in layout detection and embedding generation.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `provider` | `ExecutionProviderType` | `ExecutionProviderType::Auto` | Execution provider to use for ONNX inference. |
| `device_id` | `u32` | — | GPU device ID (for CUDA/TensorRT). Ignored for CPU/CoreML/Auto. |

---

#### ContentFilterConfig

Cross-extractor content filtering configuration. Controls whether "furniture" content (headers, footers, page numbers, watermarks, repeating text) is included in or stripped from extraction results. Applies across all extractors (PDF, DOCX, RTF, ODT, HTML, etc.) with format-specific implementation. When `None` on `ExtractionConfig`, each extractor uses its current default behavior unchanged.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `include_headers` | `bool` | `false` | Include running headers in extraction output. PDF: Disables top-margin furniture stripping and prevents the layout model from treating `PageHeader`-classified regions as furniture. DOCX: Includes document headers in text output. RTF/ODT: Headers already included; this is a no-op when true. HTML/EPUB: Keeps `<header>` element content. Default: `false` (headers are stripped or excluded). |
| `include_footers` | `bool` | `false` | Include running footers in extraction output. PDF: Disables bottom-margin furniture stripping and prevents the layout model from treating `PageFooter`-classified regions as furniture. DOCX: Includes document footers in text output. RTF/ODT: Footers already included; this is a no-op when true. HTML/EPUB: Keeps `<footer>` element content. Default: `false` (footers are stripped or excluded). |
| `strip_repeating_text` | `bool` | `true` | Enable the heuristic cross-page repeating text detector. When `true` (default), text that repeats verbatim across a supermajority of pages is classified as furniture and stripped. Disable this if brand names or repeated headings are being incorrectly removed by the heuristic. Note: when a layout-detection model is active, the model may independently classify page-header / page-footer regions as furniture on a per-page basis. To preserve those regions, set `include_headers = true`, `include_footers = true`, or both, in addition to disabling this flag. Primarily affects PDF extraction. Default: `true`. |
| `include_watermarks` | `bool` | `false` | Include watermark text in extraction output. PDF: Keeps watermark artifacts and arXiv identifiers. Other formats: No effect currently. Default: `false` (watermarks are stripped). |

---

#### EmailConfig

Configuration for email extraction.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `msg_fallback_codepage` | `Option` | `Default::default()` | Windows codepage number to use when an MSG file contains no codepage property. Defaults to `None`, which falls back to windows-1252. If an unrecognized or invalid codepage number is supplied (including 0), the behavior silently falls back to windows-1252, the same as when the MSG file itself contains an unrecognized codepage. No error or warning is emitted. Users should verify output when supplying unusual values. Common values: 1250 (Central European: Polish, Czech, Hungarian, etc.); 1251 (Cyrillic: Russian, Ukrainian, Bulgarian, etc.); 1252 (Western European, the default); 1253 (Greek); 1254 (Turkish); 1255 (Hebrew); 1256 (Arabic); 932 (Japanese, Shift-JIS); 936 (Simplified Chinese, GBK). |

---

#### ExtractionConfig

Main extraction configuration. This struct contains all configuration options for the extraction process.
It can be loaded from TOML, YAML, or JSON files, or created programmatically.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `use_cache` | `bool` | `true` | Enable caching of extraction results |
| `enable_quality_processing` | `bool` | `true` | Enable quality post-processing |
| `ocr` | `Option` | `None` | OCR configuration (None = OCR disabled) |
| `force_ocr` | `bool` | `false` | Force OCR even for searchable PDFs |
| `force_ocr_pages` | `Vec` | `None` | Force OCR on specific pages only (1-indexed page numbers, must be >= 1). When set, only the listed pages are OCR'd regardless of text layer quality. Unlisted pages use native text extraction. Ignored when `force_ocr` is `true`. Only applies to PDF documents. Duplicates are automatically deduplicated. An `ocr` config is recommended for backend/language selection; defaults are used if absent. |
| `disable_ocr` | `bool` | `false` | Disable OCR entirely, even for images. When `true`, OCR is skipped for all document types. Images return metadata only (dimensions, format, EXIF) without text extraction. PDFs use only native text extraction without OCR fallback. Cannot be `true` simultaneously with `force_ocr`. *Added in v4.7.0.* |
| `chunking` | `Option` | `None` | Text chunking configuration (None = chunking disabled) |
| `content_filter` | `Option` | `None` | Content filtering configuration (None = use extractor defaults). Controls whether document "furniture" (headers, footers, watermarks, repeating text) is included in or stripped from extraction results. See `ContentFilterConfig` for per-field documentation. |
| `images` | `Option` | `None` | Image extraction configuration (None = no image extraction) |
| `pdf_options` | `Option` | `None` | PDF-specific options (None = use defaults) |
| `token_reduction` | `Option` | `None` | Token reduction configuration (None = no token reduction) |
| `language_detection` | `Option` | `None` | Language detection configuration (None = no language detection) |
| `pages` | `Option` | `None` | Page extraction configuration (None = no page tracking) |
| `keywords` | `Option` | `None` | Keyword extraction configuration (None = no keyword extraction) |
| `postprocessor` | `Option` | `None` | Post-processor configuration (None = use defaults) |
| `html_options` | `Option` | `None` | HTML to Markdown conversion options (None = use defaults). Configure how HTML documents are converted to Markdown, including heading styles, list formatting, code block styles, and preprocessing options. |
| `html_output` | `Option` | `None` | Styled HTML output configuration. When set alongside `output_format = OutputFormat.Html`, the extraction pipeline uses `StyledHtmlRenderer` which emits stable `kb-*` CSS class hooks on every structural element and optionally embeds theme CSS or user-supplied CSS in a `