docs/guides/layout-detection.md
Normal file
@@ -0,0 +1,242 @@
# Layout Detection <span class="version-badge">v4.5.0</span>

Detect document layout regions (tables, figures, headers, text blocks, and so on) in PDFs using ONNX-based deep learning models. Layout detection enables table extraction, figure isolation, reading-order reconstruction, and selective OCR.

!!! Note "Feature gate"
    Requires the `layout-detection` Cargo feature. Not included in the default feature set.

## Model

Layout detection uses **RT-DETR v2**, an ONNX-based deep learning model that detects 17 layout element classes: text, titles, section headers, list items, tables, pictures, captions, formulas, footnotes, page headers, page footers, code, document indices, forms, key-value regions, and selected and unselected checkboxes. See [Layout Classes](#layout-classes) for the exact class names.

### When to Enable

**Recommended for:** complex multi-column PDFs, scanned documents, academic papers, business forms, and any document where layout understanding improves extraction accuracy.

**Less beneficial for:** simple single-column text documents, high-throughput pipelines where latency is critical (consider GPU acceleration), or documents already well-handled by PDF structure trees.

### Performance Impact

| Pipeline | Structure F1 | Text F1 | Avg time/doc |
| -------- | ------------ | ------- | ------------ |
| Baseline | 33.9%        | 87.4%   | 447 ms       |
| Layout   | 41.1%        | 90.1%   | 1500 ms      |

_171-document PDF corpus, CPU only. GPU acceleration significantly reduces the time penalty._
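
To make the trade-off in the table concrete, the short sketch below works out the accuracy gain and the time cost from the four reported figures. It uses only the numbers in the table; the variable names are illustrative and not part of Kreuzberg.

```python
# Figures copied from the table above (CPU only, 171-document corpus).
baseline = {"structure_f1": 33.9, "text_f1": 87.4, "ms_per_doc": 447}
layout = {"structure_f1": 41.1, "text_f1": 90.1, "ms_per_doc": 1500}

# Absolute gains in percentage points.
structure_gain = layout["structure_f1"] - baseline["structure_f1"]
text_gain = layout["text_f1"] - baseline["text_f1"]

# Time cost, both as extra milliseconds and as a multiple of the baseline.
extra_ms = layout["ms_per_doc"] - baseline["ms_per_doc"]
slowdown = layout["ms_per_doc"] / baseline["ms_per_doc"]

print(f"Structure F1: +{structure_gain:.1f} points")
print(f"Text F1:      +{text_gain:.1f} points")
print(f"Extra time:   {extra_ms} ms/doc ({slowdown:.1f}x baseline)")
```

On CPU, layout detection therefore costs roughly one extra second per document for about 7 points of Structure F1, which is why it suits accuracy-sensitive workloads better than latency-critical ones.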

!!! Note "Layout Detection Model"
    Kreuzberg uses only the RT-DETR v2 model for layout detection. The `preset` field is not available in `LayoutDetectionConfig`. Configure table structure recognition separately via `table_model`; see "Table Structure Models" below.

## Configuration

=== "Python"

    ```python
    import asyncio

    from kreuzberg import ExtractionConfig, LayoutDetectionConfig, extract_file

    config = ExtractionConfig(
        layout=LayoutDetectionConfig(
            confidence_threshold=0.5,
            apply_heuristics=True,
            table_model="tatr",  # see "Table Structure Models"
        )
    )


    async def main() -> None:
        # extract_file is awaited, so it has to run inside a coroutine
        result = await extract_file("document.pdf", config=config)


    asyncio.run(main())
    ```

=== "TypeScript"

    ```typescript
    const result = await extract("document.pdf", {
      layout: {
        confidenceThreshold: 0.5,
        applyHeuristics: true,
        tableModel: "tatr",
      },
    });
    ```

=== "Rust"

    ```rust
    use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig};

    let config = ExtractionConfig {
        layout: Some(LayoutDetectionConfig {
            confidence_threshold: Some(0.5),
            apply_heuristics: true,
            table_model: Some("tatr".to_string()),
            ..Default::default()
        }),
        ..Default::default()
    };
    ```

=== "TOML"

    ```toml title="kreuzberg.toml"
    [layout]
    apply_heuristics = true
    # table_model = "tatr"
    ```

=== "CLI"

    ```bash title="Terminal"
    # Enable layout detection with default settings
    kreuzberg extract document.pdf --layout --content-format markdown

    # Custom confidence threshold
    kreuzberg extract document.pdf --layout-confidence 0.5 --content-format markdown

    # Specific table model
    kreuzberg extract document.pdf --layout --layout-table-model slanet_wired

    # Combined with GPU acceleration
    kreuzberg extract document.pdf --layout --acceleration coreml
    ```

See [LayoutDetectionConfig](../reference/configuration.md#layoutdetectionconfig) for all fields.

## Table Structure Models <span class="version-badge">v4.5.3</span>

When layout detection identifies a table region, a table structure model analyzes its rows, columns, headers, and spanning cells. Set `LayoutDetectionConfig.table_model` to one of:

| Value             | Notes                                                       |
| ----------------- | ----------------------------------------------------------- |
| `tatr`            | Default. Fast (~30 MB). General-purpose.                    |
| `slanet_wired`    | Higher accuracy for bordered/gridlined tables (~365 MB).    |
| `slanet_wireless` | Higher accuracy for borderless tables (~365 MB).            |
| `slanet_auto`     | Auto-classifies per page (~737 MB). Slowest.                |
| `slanet_plus`     | Smallest (~7.78 MB). For resource-constrained environments. |
| `disabled`        | Skip table structure recognition.                           |

!!! Note "Model Download"
    SLANeXT models are not downloaded by default. Pre-download them with `cache warm --all-table-models`; otherwise they are downloaded automatically on first use.

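One way to choose among these values is to start from the kind of tables you expect and fall back to a smaller model when the download size is a concern. The sketch below shows one such policy. `pick_table_model` and its parameters are hypothetical and not part of Kreuzberg; only the model names and approximate sizes come from the table above.

```python
from typing import Optional

# Approximate download sizes in MB, copied from the table above.
MODEL_SIZE_MB = {
    "tatr": 30.0,
    "slanet_wired": 365.0,
    "slanet_wireless": 365.0,
    "slanet_auto": 737.0,
    "slanet_plus": 7.78,
}


def pick_table_model(bordered: Optional[bool], max_download_mb: float) -> str:
    """Hypothetical helper: choose a `table_model` value under a size budget.

    bordered=True  -> tables have gridlines
    bordered=False -> tables are borderless
    bordered=None  -> mixed or unknown
    """
    if bordered is True:
        preferred = "slanet_wired"
    elif bordered is False:
        preferred = "slanet_wireless"
    else:
        preferred = "slanet_auto"

    # Try the specialised model first, then fall back to smaller ones.
    for name in (preferred, "tatr", "slanet_plus"):
        if MODEL_SIZE_MB[name] <= max_download_mb:
            return name
    return "disabled"  # nothing fits: skip table structure recognition


print(pick_table_model(bordered=True, max_download_mb=500))   # slanet_wired
print(pick_table_model(bordered=None, max_download_mb=100))   # tatr
print(pick_table_model(bordered=False, max_download_mb=10))   # slanet_plus
```

The returned string can be passed as `table_model` in `LayoutDetectionConfig`. The fallback order (specialised model, then `tatr`, then `slanet_plus`) is an assumption of this sketch, not a recommendation from the benchmarks.
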
## GPU Acceleration

Layout detection uses ONNX Runtime with automatic provider selection:

| Provider | Platform       | Notes                         |
| -------- | -------------- | ----------------------------- |
| CPU      | All            | Default, no setup needed      |
| CUDA     | Linux, Windows | Requires CUDA toolkit + cuDNN |
| CoreML   | macOS          | Automatic on Apple Silicon    |
| TensorRT | Linux          | Requires TensorRT             |

To override:

```python
from kreuzberg import AccelerationConfig, ExtractionConfig, LayoutDetectionConfig

config = ExtractionConfig(
    layout=LayoutDetectionConfig(),
    # Force the CUDA provider on GPU 0 instead of automatic selection
    acceleration=AccelerationConfig(provider="cuda", device_id=0),
)
```

See [AccelerationConfig reference](../reference/configuration.md#accelerationconfig) for details.

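If you override the provider in code that runs on several operating systems, the platform column of the table above can be encoded as a small lookup. This is a minimal sketch: `candidate_providers` is a hypothetical helper, the strings `"tensorrt"` and `"cpu"` are assumed spellings (only `"cuda"` and `"coreml"` appear in this guide), and the preference order is this sketch's own choice.

```python
import platform
from typing import List, Optional

# Providers worth trying per operating system, following the table above.
# CPU is always last as the fallback. Ordering is an assumption.
PROVIDERS_BY_SYSTEM = {
    "Linux": ["tensorrt", "cuda", "cpu"],
    "Windows": ["cuda", "cpu"],
    "Darwin": ["coreml", "cpu"],  # platform.system() reports macOS as "Darwin"
}


def candidate_providers(system: Optional[str] = None) -> List[str]:
    """Return provider names to try, most specialised first."""
    system = system or platform.system()
    # Unknown platforms fall back to CPU only.
    return PROVIDERS_BY_SYSTEM.get(system, ["cpu"])


print(candidate_providers("Darwin"))   # ['coreml', 'cpu']
print(candidate_providers("Windows"))  # ['cuda', 'cpu']
print(candidate_providers())           # depends on the machine running this
```

Whether a listed provider is actually usable still depends on the installed drivers and toolkits noted in the table, so treat the result as a list of candidates.
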
## Layout Classes

The RT-DETR v2 model detects 17 classes. Each `LayoutRegion.class_name` is one of:

`caption`, `footnote`, `formula`, `list_item`, `page_footer`, `page_header`, `picture`, `section_header`, `table`, `text`, `title`, `document_index`, `code`, `checkbox_selected`, `checkbox_unselected`, `form`, `key_value_region`.

See [`LayoutRegion`](../reference/types.md#layoutregion) in the types reference for the full field shape.

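Because `class_name` is a plain string, a typo in a filter (for example `"figure"` instead of `"picture"`) silently matches nothing. The sketch below keeps the 17 names from the list above in one set and uses it to tally regions per class. `LAYOUT_CLASSES` and `count_by_class` are illustrative names, not part of the Kreuzberg API.

```python
from collections import Counter
from typing import Iterable

# The 17 class names listed above.
LAYOUT_CLASSES = frozenset({
    "caption", "footnote", "formula", "list_item", "page_footer",
    "page_header", "picture", "section_header", "table", "text", "title",
    "document_index", "code", "checkbox_selected", "checkbox_unselected",
    "form", "key_value_region",
})


def count_by_class(class_names: Iterable[str]) -> Counter:
    """Tally class names, rejecting any name that is not a known class."""
    counts: Counter = Counter()
    for name in class_names:
        if name not in LAYOUT_CLASSES:
            raise ValueError(f"unknown layout class: {name!r}")
        counts[name] += 1
    return counts


counts = count_by_class(["text", "text", "table", "picture"])
print(counts["text"], counts["table"], counts["caption"])  # 2 1 0
```
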
## Accessing Layout Regions

When both layout detection and page extraction are enabled, each page in the result includes `layout_regions`: a list of detected regions, each with a class, confidence score, bounding box, and area fraction. This lets you filter and analyze specific layout elements programmatically.

=== "Python"

    ```python
    import asyncio

    from kreuzberg import extract_file, ExtractionConfig, LayoutDetectionConfig, PagesConfig


    async def main() -> None:
        result = await extract_file(
            "document.pdf",
            config=ExtractionConfig(
                layout=LayoutDetectionConfig(),
                pages=PagesConfig(extract_pages=True),
            ),
        )

        # Guard against a missing page list, as the TypeScript example does
        for page in result.pages or []:
            if page.layout_regions:
                for region in page.layout_regions:
                    if region.class_name == "picture" and region.confidence > 0.9:
                        print(f"Page {page.page_number}: diagram detected "
                              f"(confidence={region.confidence:.2f}, "
                              f"area={region.area_fraction:.0%})")


    asyncio.run(main())
    ```

=== "TypeScript"

    ```typescript
    const result = await extract("document.pdf", {
      layout: {},
      pages: { extractPages: true },
    });

    for (const page of result.pages ?? []) {
      if (page.layoutRegions) {
        for (const region of page.layoutRegions) {
          if (region.className === "picture" && region.confidence > 0.9) {
            console.log(
              `Page ${page.pageNumber}: diagram detected ` +
                `(confidence=${region.confidence.toFixed(2)}, ` +
                `area=${(region.areaFraction * 100).toFixed(0)}%)`
            );
          }
        }
      }
    }
    ```

=== "Rust"

    ```rust
    use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig, PagesConfig};

    let result = extract_file(
        "document.pdf",
        ExtractionConfig {
            layout: Some(LayoutDetectionConfig::default()),
            pages: Some(PagesConfig {
                extract_pages: true,
                ..Default::default()
            }),
            ..Default::default()
        },
    ).await?;

    for page in &result.pages {
        if let Some(regions) = &page.layout_regions {
            for region in regions {
                if region.class_name == "picture" && region.confidence > 0.9 {
                    println!(
                        "Page {}: diagram detected (confidence={:.2}, area={:.0}%)",
                        page.page_number,
                        region.confidence,
                        region.area_fraction * 100.0
                    );
                }
            }
        }
    }
    ```

### Tips

- Use `confidence` to filter out low-confidence detections; a threshold of 0.8 to 0.9 or higher is typical for downstream operations
- Use `area_fraction` to distinguish inline images from full-page diagrams (for example, `area_fraction > 0.1` for significant figures)
- Layout regions are reported per page, so enable page extraction together with layout detection to access both content and layout structure
- Available across all bindings (Python, TypeScript, Rust, Ruby, Java, Go, Elixir, C#, PHP)

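The first two tips combine naturally into a single filter. The sketch below uses a stand-in `Region` dataclass carrying only the three fields it needs, so that it runs without Kreuzberg installed; the helper name `significant_pictures` and its default thresholds (0.8 and 0.1, taken from the tips) are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Region:
    """Stand-in for a layout region; only the fields used below."""
    class_name: str
    confidence: float
    area_fraction: float


def significant_pictures(
    regions: Iterable[Region],
    min_confidence: float = 0.8,
    min_area_fraction: float = 0.1,
) -> List[Region]:
    """Keep confident `picture` regions that cover a meaningful share of the page."""
    return [
        r for r in regions
        if r.class_name == "picture"
        and r.confidence >= min_confidence
        and r.area_fraction > min_area_fraction
    ]


regions = [
    Region("picture", 0.95, 0.40),  # large and confident: kept
    Region("picture", 0.92, 0.03),  # small inline image: dropped
    Region("picture", 0.55, 0.50),  # low confidence: dropped
    Region("table", 0.99, 0.30),    # not a picture: dropped
]
kept = significant_pictures(regions)
print(len(kept), kept[0].area_fraction)  # 1 0.4
```

Since the function only reads `class_name`, `confidence`, and `area_fraction`, it should work equally on the regions in `page.layout_regions` from the Python example above.
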
## Acknowledgments

- **[Docling](https://github.com/DS4SD/docling)**: RT-DETR v2 model and layout classification approach
- **[TATR](https://github.com/microsoft/table-transformer)**: Table structure recognition with ONNX
- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)**: SLANeXT table structure and PP-LCNet classifier models

## Related

- [Configuration Reference](../reference/configuration.md#layoutdetectionconfig): full field reference
- [Element-Based Output](output-formats.md#element-based-output-v410): using layout-aware results