# Output Formats <span class="version-badge">v4.1.0</span>

Choose the format that matches your downstream processing:

- **Unified (default)** — Plain text/Markdown, for LLM prompts and full-text search
- **Element-Based** — Flat array of typed elements with metadata, for RAG chunking and semantic search
- **Document Structure** — Hierarchical tree with explicit parent-child references, for knowledge graphs and structured apps
- **PDF Hierarchy** — Font-size classification into heading levels (H1–H6) for PDFs

## Unified Output (Default)

No configuration required. The result contains:

- `content` — Full document text with minimal formatting
- `pages` — Per-page breakdown for PDFs, DOCX, and PPTX
- `tables` — Extracted tables in structured format
- `images` — Image metadata and paths

---

## Element-Based Output <span class="version-badge">v4.1.0</span>

A flat array of typed elements (titles, paragraphs, tables, list items, code blocks, images, etc.). Each carries a page number; PDF text elements also carry bounding boxes when hierarchy extraction is enabled.

Use for RAG chunking, semantic search, or Unstructured.io-compatible pipelines.

### Enable

=== "Python"

    --8<-- "snippets/python/config/element_based_output.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/element_based_output.md"

=== "Rust"

    --8<-- "snippets/rust/config/element_based_output.md"

=== "Go"

    --8<-- "snippets/go/config/element_based_output.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/element_based_output.md"

=== "R"

    --8<-- "snippets/r/config/element_based_output.md"

=== "PHP"

    --8<-- "snippets/php/config/element_based_output.md"

Elements are in `result.elements`. Each element has `element_id`, `element_type`, `text`, and `metadata`.

### Element Types

| `element_type`   | Description                        | Key `additional` fields                    |
| ---------------- | ---------------------------------- | ------------------------------------------ |
| `title`          | Main title or top-level heading    | `level` (h1–h6), `font_size`, `font_name`  |
| `heading`        | Section/subsection heading         | `level` (h1–h6)                            |
| `narrative_text` | Body paragraph                     | —                                          |
| `list_item`      | Bullet, numbered, or indented item | `list_type`, `list_marker`, `indent_level` |
| `table`          | Tabular data                       | `row_count`, `column_count`, `format`      |
| `image`          | Embedded image                     | `format`, `width`, `height`, `alt_text`    |
| `code_block`     | Code snippet                       | `language`, `line_count`                   |
| `block_quote`    | Quoted text                        | —                                          |
| `header`         | Recurring page header              | `position`                                 |
| `footer`         | Recurring page footer              | `position`                                 |
| `page_break`     | Page boundary marker               | `next_page`                                |

### Metadata

Every element's `metadata` contains:

| Field           | Type                  | Description |
| --------------- | --------------------- | ----------- |
| `page_number`   | `int \| None`         | 1-indexed page number (PDF, DOCX, PPTX) |
| `filename`      | `str \| None`         | Source filename |
| `coordinates`   | `BoundingBox \| None` | `x0`, `y0`, `x1`, `y1` in PDF points. Only populated for **text elements** when `pdf_options.hierarchy` is enabled with `include_bbox=True`. Table and image elements do not carry coordinates. |
| `element_index` | `int`                 | Zero-based position in the elements array |
| `additional`    | `dict[str, str]`      | Element-type-specific fields (see table above) |

PDF coordinates use a bottom-left origin and are measured in points (1/72 inch).

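Many image and UI toolkits place the origin at the top-left instead. The sketch below shows one way to flip a `coordinates` box into that convention. It assumes the box is a plain dict with the `x0`, `y0`, `x1`, `y1` keys described above and that you know the page height in points; `to_top_left` and `page_height` are illustrative names, not part of the library.

```python
def to_top_left(bbox, page_height):
    """Flip a bottom-left-origin box (PDF points) to a top-left origin.

    `page_height` is supplied by the caller (792.0 for US Letter);
    it is an assumption of this sketch, not a field in the metadata.
    """
    return {
        "x0": bbox["x0"],
        "y0": page_height - bbox["y1"],  # old top edge becomes the new y0
        "x1": bbox["x1"],
        "y1": page_height - bbox["y0"],  # old bottom edge becomes the new y1
    }


coords = {"x0": 72.0, "y0": 700.0, "x1": 540.0, "y1": 730.0}
flipped = to_top_left(coords, page_height=792.0)
print(flipped)
```

The horizontal values are unchanged; only the vertical axis is mirrored around the page height.
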
### Example Output

```json
{
  "element_id": "elem-a3f2b1c4",
  "element_type": "title",
  "text": "Introduction to Machine Learning",
  "metadata": {
    "page_number": 1,
    "element_index": 0,
    "coordinates": { "x0": 72.0, "y0": 700.0, "x1": 540.0, "y1": 730.0 },
    "additional": { "level": "h1", "font_size": "24" }
  }
}
```

### Filtering Elements

```python
from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig(result_format="element_based")
result = extract_file_sync("document.pdf", config=config)

# Select elements by type
titles = [e for e in result.elements if e.element_type == "title"]
tables = [e for e in result.elements if e.element_type == "table"]

# Values in `additional` are strings, so `level` is "h1" through "h6"
for title in titles:
    level = title.metadata.additional.get("level", "h1")
    print(f"[{level}] {title.text}")
```

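For RAG chunking, a common next step is to group body elements under the heading that precedes them. The following is a minimal sketch of that idea, assuming the elements have been converted to plain dicts shaped like the JSON example above; `chunk_by_heading` and the sample data are illustrative and not part of the library.

```python
HEADING_TYPES = {"title", "heading"}
SKIPPED_TYPES = {"header", "footer", "page_break"}  # page furniture, not content


def chunk_by_heading(elements):
    """Group element text into chunks that each start at a title or heading."""
    chunks = []
    for el in elements:
        kind = el["element_type"]
        if kind in SKIPPED_TYPES:
            continue
        if kind in HEADING_TYPES:
            chunks.append({"heading": el["text"], "body": []})
        else:
            if not chunks:  # text that appears before the first heading
                chunks.append({"heading": None, "body": []})
            chunks[-1]["body"].append(el["text"])
    return chunks


sample = [
    {"element_type": "title", "text": "Introduction to Machine Learning"},
    {"element_type": "narrative_text", "text": "Machine learning studies algorithms."},
    {"element_type": "footer", "text": "Page 1"},
    {"element_type": "heading", "text": "Supervised Learning"},
    {"element_type": "list_item", "text": "Regression"},
]
chunks = chunk_by_heading(sample)
for chunk in chunks:
    print(chunk["heading"], "->", " ".join(chunk["body"]))
```

Because the array is flat, the grouping relies only on element order. If you need nested sections, compare the `level` values in `additional` or use the document structure format instead.
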
### Migrating from Unstructured.io

If you're migrating from Unstructured.io, element-based output follows a similar structure with these key differences:

| Aspect      | Unstructured.io                       | Kreuzberg                                   |
| ----------- | ------------------------------------- | ------------------------------------------- |
| Type names  | PascalCase (`Title`, `NarrativeText`) | snake_case (`title`, `narrative_text`)      |
| Element IDs | Not always present                    | Always present (deterministic hash)         |
| Metadata    | Basic (`page_number`, `filename`)     | Extended (coordinates, `additional` fields) |
| Config key  | —                                     | `result_format="element_based"`             |

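If downstream code still expects PascalCase type names, a mechanical conversion is often enough to get started. The sketch below is only a naming shim: `to_pascal_case` is a hypothetical helper, and the table above confirms only the `Title` and `NarrativeText` pairs, so check every other converted name against the types your pipeline actually handles.

```python
def to_pascal_case(element_type):
    """Convert a snake_case element type to PascalCase (narrative_text -> NarrativeText)."""
    return "".join(part.capitalize() for part in element_type.split("_"))


# The two pairs named in the comparison table above
converted = {name: to_pascal_case(name) for name in ("title", "narrative_text")}
print(converted)
```
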

---

## Document Structure

A flat list of nodes with explicit parent-child index references. Together the nodes form a traversable tree with heading levels, content layers, inline annotations, and structured table grids.

Use when you need hierarchical relationships between sections.

### Comparison

| Aspect             | Unified (default)      | Element-based        | Document structure                |
| ------------------ | ---------------------- | -------------------- | --------------------------------- |
| Output shape       | `content: string`      | `elements: array`    | `nodes: array` with index refs    |
| Hierarchy          | None                   | Inferred from levels | Explicit parent/child indices     |
| Inline annotations | No                     | No                   | Bold, italic, links per node      |
| Tables             | `result.tables`        | Table elements       | `TableGrid` with cell coords      |
| Content layers     | Not classified         | Not classified       | body, header, footer, footnote    |
| Best for           | LLM prompts, full-text | RAG chunking         | Knowledge graphs, structured apps |

### Enable

=== "Python"

    --8<-- "snippets/python/config/document_structure_config.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/document_structure_config.md"

=== "Rust"

    --8<-- "snippets/rust/config/document_structure_config.md"

=== "Go"

    --8<-- "snippets/go/config/document_structure_config.md"

=== "Java"

    --8<-- "snippets/java/config/document_structure_config.md"

=== "C#"

    --8<-- "snippets/csharp/config/document_structure_config.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/document_structure_config.md"

=== "R"

    --8<-- "snippets/r/config/document_structure_config.md"

### Node Shape

Each node in `result.document.nodes`:

```json
{
  "id": "node-a3f2b1c4",
  "content": { "node_type": "heading", "level": 2, "text": "Supervised Learning" },
  "parent": 0,
  "children": [4, 5, 6],
  "content_layer": "body",
  "page": 5,
  "page_end": null,
  "bbox": { "x0": 72.0, "y0": 600.0, "x1": 400.0, "y1": 620.0 },
  "annotations": []
}
```

- `parent` and `children` are integer indices into the `nodes` array (`null` if absent)
- `bbox` is present when bounding box data is available
- `annotations` contains inline formatting spans

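Because `parent` and `children` are indices rather than nested objects, walking the tree means following those indices back into the same array. The sketch below prints an indented outline from a small hand-written `nodes` list. It assumes roots are the nodes whose `parent` is `None` and treats a `null` `children` value as an empty list; `walk` and `outline` are illustrative names.

```python
def walk(nodes, index, depth=0):
    """Yield (depth, node) pairs depth-first, following `children` indices."""
    node = nodes[index]
    yield depth, node
    for child_index in node.get("children") or []:  # tolerate null/missing children
        yield from walk(nodes, child_index, depth + 1)


def outline(nodes):
    """Return one indented line per node, starting from every root node."""
    lines = []
    roots = [i for i, n in enumerate(nodes) if n.get("parent") is None]
    for root in roots:
        for depth, node in walk(nodes, root):
            content = node["content"]
            label = content.get("text", content["node_type"])
            lines.append("  " * depth + label)
    return lines


nodes = [
    {"content": {"node_type": "title", "text": "ML Handbook"}, "parent": None, "children": [1]},
    {"content": {"node_type": "heading", "level": 2, "text": "Supervised Learning"}, "parent": 0, "children": [2]},
    {"content": {"node_type": "paragraph", "text": "Labeled data drives training."}, "parent": 1, "children": None},
]
lines = outline(nodes)
print("\n".join(lines))
```

Container nodes without `text` (such as `list` or `quote`) fall back to printing their `node_type`.
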
### Node Types

| `node_type`  | Key fields                               | Notes                                       |
| ------------ | ---------------------------------------- | ------------------------------------------- |
| `title`      | `text`                                   | Document title                              |
| `heading`    | `level` (1–6), `text`                    | Section heading                             |
| `paragraph`  | `text`                                   | Body paragraph; may have `annotations`      |
| `list`       | `ordered` (bool)                         | Container; children are `list_item` nodes   |
| `list_item`  | `text`                                   | Child of `list`                             |
| `table`      | `grid` ([TableGrid](#table-grid))        | Grid with cell-level data                   |
| `image`      | `description`, `image_index`             | `image_index` references `result.images`    |
| `code`       | `text`, `language`                       | Code block                                  |
| `quote`      | _(container)_                            | Children are typically paragraphs           |
| `formula`    | `text`                                   | Math formula (plain text, LaTeX, or MathML) |
| `footnote`   | `text`                                   | Usually `content_layer: "footnote"`         |
| `group`      | `label`, `heading_level`, `heading_text` | Section grouping container                  |
| `page_break` | _(marker)_                               | Page boundary                               |

### Content Layers

| Layer      | Description                                |
| ---------- | ------------------------------------------ |
| `body`     | Main document content                      |
| `header`   | Page header area (repeated chapter titles) |
| `footer`   | Page footer area (page numbers, copyright) |
| `footnote` | Footnotes and endnotes                     |

```python
# process_main_content is a placeholder for your own handler
for node in result.document["nodes"]:
    if node["content_layer"] == "body":
        process_main_content(node)
```

### Text Annotations

Paragraphs carry a list of `annotations` marking character spans:

```json
{ "start": 0, "end": 16, "kind": { "annotation_type": "bold" } }
```

| `annotation_type`                              | Extra fields              |
| ---------------------------------------------- | ------------------------- |
| `bold`, `italic`, `underline`, `strikethrough` | —                         |
| `code`, `subscript`, `superscript`             | —                         |
| `link`                                         | `url`, `title` (optional) |

```python
for node in result.document["nodes"]:
    # Container nodes have no text, so fall back to an empty string
    text = node["content"].get("text", "")
    for ann in node.get("annotations", []):
        span = text[ann["start"]:ann["end"]]
        kind = ann["kind"]["annotation_type"]
        if kind == "link":
            print(f"Link: {span} -> {ann['kind']['url']}")
        else:
            print(f"{kind}: {span}")
```

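The same spans can be used to rebuild lightly formatted text. The sketch below wraps annotated spans in Markdown markers. It assumes spans do not overlap and that `start`/`end` are character offsets into the node text, as in the loop above; `render_markdown` and the `WRAPPERS` mapping are illustrative choices, and annotation types without an obvious Markdown form are left untouched.

```python
WRAPPERS = {
    "bold": ("**", "**"),
    "italic": ("*", "*"),
    "code": ("`", "`"),
    "strikethrough": ("~~", "~~"),
}


def render_markdown(text, annotations):
    """Apply non-overlapping annotation spans to `text` as Markdown markup."""
    out = text
    # Work from the last span to the first so earlier offsets stay valid.
    for ann in sorted(annotations, key=lambda a: a["start"], reverse=True):
        kind = ann["kind"]
        name = kind["annotation_type"]
        span = out[ann["start"]:ann["end"]]
        if name == "link":
            replacement = f"[{span}]({kind['url']})"
        elif name in WRAPPERS:
            opener, closer = WRAPPERS[name]
            replacement = f"{opener}{span}{closer}"
        else:
            continue  # underline, subscript, superscript: left as plain text
        out = out[:ann["start"]] + replacement + out[ann["end"]:]
    return out


text = "Machine learning is fun. See docs."
annotations = [
    {"start": 0, "end": 16, "kind": {"annotation_type": "bold"}},
    {"start": 29, "end": 33, "kind": {"annotation_type": "link", "url": "https://example.com"}},
]
rendered = render_markdown(text, annotations)
print(rendered)
```
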
### Table Grid

Table nodes contain a `grid` with cell-level data:

```json
{
  "rows": 3,
  "cols": 3,
  "cells": [
    { "content": "Algorithm", "row": 0, "col": 0, "row_span": 1, "col_span": 1, "is_header": true },
    {
      "content": "Decision Tree",
      "row": 1,
      "col": 0,
      "row_span": 1,
      "col_span": 1,
      "is_header": false
    }
  ]
}
```

Each cell has `row`, `col`, `row_span`, `col_span`, `is_header`, and optionally `bbox`.

```python
for node in result.document["nodes"]:
    if node["content"]["node_type"] == "table":
        grid = node["content"]["grid"]
        rows, cols = grid["rows"], grid["cols"]
        # Start from an empty rows x cols matrix, then place each cell
        table = [[None] * cols for _ in range(rows)]
        for cell in grid["cells"]:
            table[cell["row"]][cell["col"]] = cell["content"]
        for row in table:
            print(" | ".join(str(c or "") for c in row))
```

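The loop above writes each cell only at its top-left position, so merged cells leave gaps. One possible way to handle `row_span` and `col_span` is to repeat the cell content across every position it covers, as sketched below. Repeating the value is a design choice of this sketch rather than library behavior, `expand_grid` is an illustrative name, and spans are assumed to stay within the grid bounds.

```python
def expand_grid(grid):
    """Build a rows x cols matrix, repeating content across spanned cells."""
    table = [[None] * grid["cols"] for _ in range(grid["rows"])]
    for cell in grid["cells"]:
        row_end = cell["row"] + cell.get("row_span", 1)
        col_end = cell["col"] + cell.get("col_span", 1)
        for r in range(cell["row"], row_end):
            for c in range(cell["col"], col_end):
                table[r][c] = cell["content"]
    return table


grid = {
    "rows": 2,
    "cols": 2,
    "cells": [
        {"content": "Models", "row": 0, "col": 0, "row_span": 1, "col_span": 2, "is_header": True},
        {"content": "Decision Tree", "row": 1, "col": 0, "row_span": 1, "col_span": 1, "is_header": False},
        {"content": "SVM", "row": 1, "col": 1, "row_span": 1, "col_span": 1, "is_header": False},
    ],
}
expanded = expand_grid(grid)
for row in expanded:
    print(" | ".join(str(c or "") for c in row))
```
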

---

## PDF Hierarchy Detection

Classifies PDF text blocks into heading levels (H1–H6) and body text by running K-means clustering on font sizes. Clusters are ranked by font size: the cluster with the largest font size becomes H1, the next largest H2, and so on.

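To make the ranking idea concrete, here is a small pure-Python sketch of one-dimensional K-means over font sizes. It is an illustration of the technique only, not the library's implementation: the initialization, the iteration limit, and the rule that the smallest-size cluster is labeled `body` are all assumptions of this sketch.

```python
def kmeans_1d(values, k, iterations=50):
    """Cluster numbers into at most k groups; return the cluster centroids."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))
    if k == 1:
        centroids = [distinct[0]]
    else:
        # Spread the initial centroids evenly across the distinct sizes.
        centroids = [distinct[i * (len(distinct) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid as the mean of its members (keep it if empty).
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def classify(font_sizes, k):
    """Label each size by cluster rank: largest -> h1; smallest -> body (assumed)."""
    centroids = sorted(kmeans_1d(font_sizes, k), reverse=True)
    labels = []
    for size in font_sizes:
        rank = min(range(len(centroids)), key=lambda i: abs(size - centroids[i]))
        labels.append("body" if rank == len(centroids) - 1 else f"h{rank + 1}")
    return labels


sizes = [24.0, 18.0, 18.0, 12.0, 12.0, 12.0, 12.0]
labels = classify(sizes, k=3)
print(list(zip(sizes, labels)))
```

Because levels come from rank rather than absolute size, an 18 pt heading is H2 in this sample but would become H1 in a document whose largest text is 18 pt.
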
### Quick Start

=== "Python"

    --8<-- "snippets/python/config/pdf_hierarchy_config.md"

=== "TypeScript"

    --8<-- "snippets/typescript/config/pdf_hierarchy_config.md"

=== "Rust"

    --8<-- "snippets/rust/config/pdf_hierarchy_config.md"

=== "Go"

    --8<-- "snippets/go/config/pdf_hierarchy_config.md"

=== "Java"

    --8<-- "snippets/java/config/pdf_hierarchy_config.md"

=== "C#"

    --8<-- "snippets/csharp/config/pdf_hierarchy_config.md"

=== "Ruby"

    --8<-- "snippets/ruby/config/pdf_hierarchy_config.md"

### Output

Hierarchy data is in `result.pages[n].hierarchy`. Each page has a `blocks` list:

```json
{
  "block_count": 4,
  "blocks": [
    {
      "text": "Chapter 1: Introduction",
      "level": "h1",
      "font_size": 24.0,
      "bbox": [50.0, 100.0, 400.0, 125.0]
    },
    { "text": "Background", "level": "h2", "font_size": 18.0, "bbox": [50.0, 150.0, 300.0, 168.0] },
    {
      "text": "This chapter provides...",
      "level": "body",
      "font_size": 12.0,
      "bbox": [50.0, 200.0, 550.0, 450.0]
    }
  ]
}
```

- `bbox`: `[left, top, right, bottom]` in PDF points (present when `include_bbox=True`). This is the only way to obtain bounding box coordinates for text elements; they are not included by default.
- `level`: `"h1"` – `"h6"` or `"body"`

### Configuration

| Parameter                | Type            | Default | Description                                         |
| ------------------------ | --------------- | ------- | --------------------------------------------------- |
| `enabled`                | `bool`          | `true`  | Enable hierarchy extraction                         |
| `k_clusters`             | `int`           | `6`     | Font size clusters (2–10), maps to heading levels   |
| `include_bbox`           | `bool`          | `true`  | Include bounding box coordinates                    |
| `ocr_coverage_threshold` | `float \| None` | `None`  | Trigger OCR if text coverage is below this fraction |

#### Choosing k_clusters

| `k_clusters` | Heading levels | Use when                                |
| ------------ | -------------- | --------------------------------------- |
| 2–3          | H1–H2          | Simple documents with 1–2 heading sizes |
| 4–5          | H1–H4          | Standard documents                      |
| 6 (default)  | H1–H6          | Most documents                          |
| 7–8          | H1–H6+         | Books, specs with deep nesting          |

#### Choosing ocr_coverage_threshold

| Threshold | Behavior                                    |
| --------- | ------------------------------------------- |
| `None`    | OCR is never triggered by coverage          |
| `0.3`     | OCR if less than 30% of the page has text   |
| `0.5`     | OCR if less than 50% of the page has text   |

Requires an OCR backend to be configured separately.

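The rule in the table reduces to a single comparison. The sketch below restates it as code for clarity; `should_run_ocr` and `text_coverage` are hypothetical names, and how the library actually measures text coverage is not described here.

```python
def should_run_ocr(text_coverage, threshold):
    """Return True when coverage falls below the threshold; None disables the check."""
    if threshold is None:
        return False
    return text_coverage < threshold


# A page where 20% of the area has text, checked against each table row
decisions = {t: should_run_ocr(0.2, t) for t in (None, 0.3, 0.5)}
print(decisions)
```

A higher threshold triggers OCR on more pages, which helps with partially scanned documents at the cost of extra processing time.
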
### Troubleshooting

- **`hierarchy` is `None`** — Check that `hierarchy.enabled` is `True`. If the PDF is image-only, enable OCR. If there are fewer text blocks than `k_clusters`, reduce `k_clusters`.
- **Most blocks classified as `body`** — The document may use uniform font sizes. Reduce `k_clusters` (try 3–4).
- **Heading levels don't match visual inspection** — Levels are assigned by font size rank, not absolute size. Filter on `block.font_size` directly for absolute thresholds.

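As a sketch of the absolute-threshold approach from the last item, the snippet below filters blocks by `font_size` instead of `level`. It works on plain dicts shaped like the JSON in the Output section; `blocks_at_least` and the 16-point cutoff are illustrative assumptions.

```python
def blocks_at_least(blocks, min_font_size):
    """Return the text of every block whose font size meets the cutoff."""
    return [b["text"] for b in blocks if b["font_size"] >= min_font_size]


blocks = [
    {"text": "Chapter 1: Introduction", "level": "h1", "font_size": 24.0},
    {"text": "Background", "level": "h2", "font_size": 18.0},
    {"text": "This chapter provides...", "level": "body", "font_size": 12.0},
]
large_text = blocks_at_least(blocks, min_font_size=16.0)
print(large_text)
```
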
See the [HierarchyConfig reference](../reference/configuration.md#hierarchyconfig) for the full parameter list.