fil/docs/guides/layout-detection.md
Henrik Jess Nielsen b4c07d3693
Nomad changes
2026-06-01 23:40:55 +02:00

# Layout Detection <span class="version-badge">v4.5.0</span>
Detect document layout regions (tables, figures, headers, text blocks, etc.) in PDFs using ONNX-based deep learning models. Enables table extraction, figure isolation, reading-order reconstruction, and selective OCR.
!!! Note "Feature gate"
    Requires the `layout-detection` Cargo feature. Not included in the default feature set.
## Model
Layout detection uses **RT-DETR v2**, an ONNX deep learning model that detects 17 layout element classes: text blocks, titles, section headers, tables, pictures, captions, formulas, footnotes, list items, code, page headers, page footers, document indices, selected and unselected checkboxes, forms, and key-value regions. The exact class names are listed under "Layout Classes" below.
### When to Enable
**Recommended for:** complex multi-column PDFs, scanned documents, academic papers, business forms, and any document where layout understanding improves extraction accuracy.
**Less beneficial for:** simple single-column text documents, high-throughput pipelines where latency is critical (consider GPU acceleration), or documents already well-handled by PDF structure trees.
### Performance Impact
| Pipeline | Structure F1 | Text F1 | Avg time/doc |
| -------- | ------------ | ------- | ------------ |
| Baseline | 33.9% | 87.4% | 447 ms |
| Layout | 41.1% | 90.1% | 1500 ms |
_171-document PDF corpus, CPU only. GPU acceleration significantly reduces the time penalty._
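To make the trade-off in the table concrete, the short sketch below derives the accuracy gain in percentage points and the latency multiplier from the published figures. The `tradeoff` helper and the dictionary keys are illustrative names, not part of Kreuzberg.

```python
# Figures copied from the benchmark table above (171-document corpus, CPU only).
BASELINE = {"structure_f1": 33.9, "text_f1": 87.4, "avg_ms": 447}
LAYOUT = {"structure_f1": 41.1, "text_f1": 90.1, "avg_ms": 1500}


def tradeoff(baseline: dict, layout: dict) -> dict:
    """Return F1 gains in percentage points and the latency multiplier."""
    return {
        "structure_f1_gain_pts": round(layout["structure_f1"] - baseline["structure_f1"], 1),
        "text_f1_gain_pts": round(layout["text_f1"] - baseline["text_f1"], 1),
        "slowdown": round(layout["avg_ms"] / baseline["avg_ms"], 2),
    }


summary = tradeoff(BASELINE, LAYOUT)
print(summary)
```

On CPU, layout detection therefore buys about 7 points of Structure F1 and about 3 points of Text F1 for roughly 3.4 times the processing time per document.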
!!! Note "Layout Detection Model"
    Kreuzberg uses only the RT-DETR v2 model for layout detection. The `preset` field is not available in `LayoutDetectionConfig`. Configure table structure recognition separately via `table_model`; see "Table Structure Models" below.
## Configuration
=== "Python"
    ```python
    import asyncio

    from kreuzberg import ExtractionConfig, LayoutDetectionConfig, extract_file

    config = ExtractionConfig(
        layout=LayoutDetectionConfig(
            confidence_threshold=0.5,
            apply_heuristics=True,
            table_model="tatr",
        )
    )

    async def main() -> None:
        # extract_file is awaited, so it has to run inside an event loop
        result = await extract_file("document.pdf", config=config)

    asyncio.run(main())
    ```
=== "TypeScript"
    ```typescript
    const result = await extract("document.pdf", {
      layout: {
        confidenceThreshold: 0.5,
        applyHeuristics: true,
        tableModel: "tatr",
      },
    });
    ```
=== "Rust"
    ```rust
    use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig};

    let config = ExtractionConfig {
        layout: Some(LayoutDetectionConfig {
            confidence_threshold: Some(0.5),
            apply_heuristics: true,
            table_model: Some("tatr".to_string()),
            ..Default::default()
        }),
        ..Default::default()
    };
    ```
=== "TOML"
    ```toml title="kreuzberg.toml"
    [layout]
    apply_heuristics = true
    # table_model = "tatr"
    ```
=== "CLI"
    ```bash title="Terminal"
    # Enable layout detection with default settings
    kreuzberg extract document.pdf --layout --content-format markdown

    # Custom confidence threshold
    kreuzberg extract document.pdf --layout-confidence 0.5 --content-format markdown

    # Specific table model
    kreuzberg extract document.pdf --layout --layout-table-model slanet_wired

    # Combined with GPU acceleration
    kreuzberg extract document.pdf --layout --acceleration coreml
    ```
See [LayoutDetectionConfig](../reference/configuration.md#layoutdetectionconfig) for all fields.
## Table Structure Models <span class="version-badge">v4.5.3</span>
When layout detection identifies a table region, a table structure model analyzes rows, columns, headers, and spanning cells. Set `LayoutDetectionConfig.table_model` to one of:
| Value | Notes |
| ----------------- | ----------------------------------------------------------- |
| `tatr` | Default. Fast (~30 MB). General-purpose. |
| `slanet_wired` | Higher accuracy for bordered/gridlined tables (~365 MB). |
| `slanet_wireless` | Higher accuracy for borderless tables (~365 MB). |
| `slanet_auto` | Auto-classifies per page (~737 MB). Slowest. |
| `slanet_plus` | Smallest (~7.78 MB). For resource-constrained environments. |
| `disabled` | Skip table structure recognition. |
!!! Note "Model Download"
    SLANeXT models are not downloaded by default. Use `cache warm --all-table-models` to pre-download, or they download automatically on first use.
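As a rough illustration of how the table above might drive a choice, the sketch below picks a `table_model` value from two inputs: whether the tables are bordered, and a download-size budget. The `choose_table_model` helper and its fallback order are assumptions made for this example and are not part of Kreuzberg. Only the model names and approximate sizes come from the table.

```python
from typing import Optional

# Approximate download sizes in MB, copied from the table above.
TABLE_MODEL_SIZE_MB = {
    "tatr": 30.0,
    "slanet_wired": 365.0,
    "slanet_wireless": 365.0,
    "slanet_auto": 737.0,
    "slanet_plus": 7.78,
}


def choose_table_model(bordered: Optional[bool], max_size_mb: float) -> str:
    """Hypothetical helper: return a table_model value that fits the size budget.

    bordered=True  -> tables have visible gridlines
    bordered=False -> tables are borderless
    bordered=None  -> unknown or mixed
    """
    if bordered is True:
        preferred = ["slanet_wired", "tatr", "slanet_plus"]
    elif bordered is False:
        preferred = ["slanet_wireless", "tatr", "slanet_plus"]
    else:
        preferred = ["slanet_auto", "tatr", "slanet_plus"]
    for name in preferred:
        if TABLE_MODEL_SIZE_MB[name] <= max_size_mb:
            return name
    # Nothing fits the budget: skip table structure recognition.
    return "disabled"


print(choose_table_model(bordered=True, max_size_mb=500))
```

The returned string can be passed as `table_model` in `LayoutDetectionConfig`. The fallback order (specialised model, then `tatr`, then `slanet_plus`) is one reasonable policy, not a recommendation from the project.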
## GPU Acceleration
Layout detection uses ONNX Runtime with automatic provider selection:
| Provider | Platform | Notes |
| -------- | -------------- | ----------------------------- |
| CPU | All | Default, no setup needed |
| CUDA | Linux, Windows | Requires CUDA toolkit + cuDNN |
| CoreML | macOS | Automatic on Apple Silicon |
| TensorRT | Linux | Requires TensorRT |
To override:
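```python
from kreuzberg import AccelerationConfig, ExtractionConfig, LayoutDetectionConfig

config = ExtractionConfig(
    layout=LayoutDetectionConfig(),
    # Force the CUDA provider on the first GPU instead of automatic selection
    acceleration=AccelerationConfig(provider="cuda", device_id=0),
)
```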
See [AccelerationConfig reference](../reference/configuration.md#accelerationconfig) for details.
## Layout Classes
The RT-DETR v2 model detects 17 classes. Each `LayoutRegion.class_name` is one of:
`caption`, `footnote`, `formula`, `list_item`, `page_footer`, `page_header`, `picture`, `section_header`, `table`, `text`, `title`, `document_index`, `code`, `checkbox_selected`, `checkbox_unselected`, `form`, `key_value_region`.
See [`LayoutRegion`](../reference/types.md#layoutregion) in the types reference for the full field shape.
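Because class names are plain strings, a typo or a near-synonym (for example `figure` instead of `picture`) silently matches nothing. A minimal sketch of a guard against this is shown below. `LAYOUT_CLASSES` and `validate_classes` are illustrative names defined here, not library exports.

```python
# The 17 class names listed above, as they appear in `class_name`.
LAYOUT_CLASSES = frozenset({
    "caption", "footnote", "formula", "list_item", "page_footer",
    "page_header", "picture", "section_header", "table", "text",
    "title", "document_index", "code", "checkbox_selected",
    "checkbox_unselected", "form", "key_value_region",
})


def validate_classes(wanted):
    """Return the requested class names as a set, raising on unknown names."""
    unknown = set(wanted) - LAYOUT_CLASSES
    if unknown:
        raise ValueError(f"unknown layout classes: {sorted(unknown)}")
    return set(wanted)


selected = validate_classes(["table", "picture"])
print(sorted(selected))
```

Validating a filter list once, before looping over pages, turns a silent empty result into an immediate error.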
## Accessing Layout Regions
When layout detection is enabled AND page extraction is enabled, each page in the result includes `layout_regions` — a list of detected regions with class, confidence score, bounding box, and area fraction. This enables programmatic filtering and analysis of specific layout elements.
=== "Python"
    ```python
    import asyncio

    from kreuzberg import ExtractionConfig, LayoutDetectionConfig, PagesConfig, extract_file

    async def main() -> None:
        result = await extract_file(
            "document.pdf",
            config=ExtractionConfig(
                layout=LayoutDetectionConfig(),
                pages=PagesConfig(extract_pages=True),
            ),
        )
        for page in result.pages:
            # layout_regions may be missing, so guard before iterating
            if page.layout_regions:
                for region in page.layout_regions:
                    if region.class_name == "picture" and region.confidence > 0.9:
                        print(
                            f"Page {page.page_number}: diagram detected "
                            f"(confidence={region.confidence:.2f}, "
                            f"area={region.area_fraction:.0%})"
                        )

    asyncio.run(main())
    ```
=== "TypeScript"
    ```typescript
    const result = await extract("document.pdf", {
      layout: {},
      pages: { extractPages: true },
    });

    for (const page of result.pages ?? []) {
      if (page.layoutRegions) {
        for (const region of page.layoutRegions) {
          if (region.className === "picture" && region.confidence > 0.9) {
            console.log(
              `Page ${page.pageNumber}: diagram detected ` +
                `(confidence=${region.confidence.toFixed(2)}, ` +
                `area=${(region.areaFraction * 100).toFixed(0)}%)`
            );
          }
        }
      }
    }
    ```
=== "Rust"
    ```rust
    use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig, PagesConfig};

    let result = extract_file(
        "document.pdf",
        ExtractionConfig {
            layout: Some(LayoutDetectionConfig::default()),
            pages: Some(PagesConfig {
                extract_pages: true,
                ..Default::default()
            }),
            ..Default::default()
        },
    ).await?;

    for page in &result.pages {
        if let Some(regions) = &page.layout_regions {
            for region in regions {
                if region.class_name == "picture" && region.confidence > 0.9 {
                    println!(
                        "Page {}: diagram detected (confidence={:.2}, area={:.0}%)",
                        page.page_number,
                        region.confidence,
                        region.area_fraction * 100.0
                    );
                }
            }
        }
    }
    ```
### Tips
- Use `confidence` to filter out low-confidence detections; a threshold of 0.8 to 0.9 is typical for downstream operations
- Use `area_fraction` to distinguish between inline images and full-page diagrams (e.g., `area_fraction > 0.1` for significant figures)
- Layout detection and page extraction are configured independently; enable both to get per-page content together with its `layout_regions`
- Available across all bindings (Python, TypeScript, Rust, Ruby, Java, Go, Elixir, C#, PHP)
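The first two tips can be combined into one filter, sketched below. The `Region` dataclass is a stand-in defined here so the example runs on its own. It is not the library's `LayoutRegion` type and mirrors only the three fields used above. The default thresholds come from the tips and should be tuned per corpus.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Region:
    """Stand-in for a detected region (NOT the library type)."""
    class_name: str
    confidence: float
    area_fraction: float


def significant_figures(
    regions: Iterable[Region],
    min_confidence: float = 0.8,
    min_area: float = 0.1,
) -> List[Region]:
    """Keep confident `picture` regions that cover a meaningful share of the page."""
    return [
        r
        for r in regions
        if r.class_name == "picture"
        and r.confidence >= min_confidence
        and r.area_fraction > min_area
    ]


sample = [
    Region("picture", 0.95, 0.40),  # large, confident figure: kept
    Region("picture", 0.95, 0.02),  # small inline image: dropped by area
    Region("picture", 0.55, 0.50),  # low confidence: dropped
    Region("table", 0.99, 0.30),    # wrong class: dropped
]
kept = significant_figures(sample)
print(len(kept))
```

The same predicate should work on real regions as long as they expose `class_name`, `confidence`, and `area_fraction`, as the examples above suggest.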
## Acknowledgments
- **[Docling](https://github.com/DS4SD/docling)** — RT-DETR v2 model and layout classification approach
- **[TATR](https://github.com/microsoft/table-transformer)** — Table structure recognition with ONNX
- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)** — SLANeXT table structure and PP-LCNet classifier models
## Related
- [Configuration Reference](../reference/configuration.md#layoutdetectionconfig) — full field reference
- [Element-Based Output](output-formats.md#element-based-output-v410) — using layout-aware results