Nomad changes

2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions
--- a/docs/guides/layout-detection.md
+++ b/docs/guides/layout-detection.md
@@ -0,0 +1,242 @@
+# Layout Detection <span class="version-badge">v4.5.0</span>
+
+Detect document layout regions (tables, figures, headers, text blocks, etc.) in PDFs using ONNX-based deep learning models. Enables table extraction, figure isolation, reading-order reconstruction, and selective OCR.
+
+!!! Note "Feature gate"
+    Requires the `layout-detection` Cargo feature. Not included in the default feature set.
+
+## Model
+
+Layout detection uses the **RT-DETR v2** model, an ONNX-based deep learning model that detects 17 layout classes: text, titles, section headers, captions, footnotes, formulas, list items, page headers, page footers, pictures, tables, code, document indices, selected and unselected checkboxes, forms, and key-value regions.
+
+### When to Enable
+
+**Recommended for:** complex multi-column PDFs, scanned documents, academic papers, business forms, and any document where layout understanding improves extraction accuracy.
+
+**Less beneficial for:** simple single-column text documents, high-throughput pipelines where latency is critical (consider GPU acceleration), or documents already well-handled by PDF structure trees.
+
+### Performance Impact
+
+| Pipeline | Structure F1 | Text F1 | Avg time/doc |
+| -------- | ------------ | ------- | ------------ |
+| Baseline | 33.9%        | 87.4%   | 447 ms       |
+| Layout   | 41.1%        | 90.1%   | 1500 ms      |
+
+_171-document PDF corpus, CPU only. GPU acceleration significantly reduces the time penalty._
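+
+To put the table in perspective, the short calculation below derives the absolute F1 gains and the relative slowdown from the four reported figures. It is a worked example only: the dictionary keys and the `compare` function are illustrative names, not part of Kreuzberg.
+
+```python
+# Figures copied from the benchmark table above (171-document PDF corpus, CPU only).
+baseline = {"structure_f1": 33.9, "text_f1": 87.4, "ms_per_doc": 447}
+layout = {"structure_f1": 41.1, "text_f1": 90.1, "ms_per_doc": 1500}
+
+
+def compare(base: dict, new: dict) -> dict:
+    """Return absolute F1 gains (percentage points) and the slowdown factor."""
+    return {
+        "structure_f1_gain": round(new["structure_f1"] - base["structure_f1"], 1),
+        "text_f1_gain": round(new["text_f1"] - base["text_f1"], 1),
+        "slowdown": round(new["ms_per_doc"] / base["ms_per_doc"], 2),
+    }
+
+
+summary = compare(baseline, layout)
+print(summary)
+# {'structure_f1_gain': 7.2, 'text_f1_gain': 2.7, 'slowdown': 3.36}
+```
+
+On this corpus, layout detection adds 7.2 points of structure F1 and 2.7 points of text F1 at roughly 3.4 times the CPU time per document.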
+
+!!! Note "Layout Detection Model"
+    Kreuzberg uses only the RT-DETR v2 model for layout detection. The `preset` field is not available in `LayoutDetectionConfig`. Configure table structure recognition separately via `table_model`; see "Table Structure Models" below.
+
+## Configuration
+
+=== "Python"
+
+    ```python
+    import asyncio
+
+    from kreuzberg import ExtractionConfig, LayoutDetectionConfig, extract_file
+
+    config = ExtractionConfig(
+        layout=LayoutDetectionConfig(
+            confidence_threshold=0.5,
+            apply_heuristics=True,
+            table_model="tatr",
+        )
+    )
+
+    async def main() -> None:
+        # extract_file is awaited, so it has to run inside a coroutine
+        result = await extract_file("document.pdf", config=config)
+
+    asyncio.run(main())
+    ```
+
+=== "TypeScript"
+
+    ```typescript
+    const result = await extract("document.pdf", {
+      layout: {
+        confidenceThreshold: 0.5,
+        applyHeuristics: true,
+        tableModel: "tatr",
+      },
+    });
+    ```
+
+=== "Rust"
+
+    ```rust
+    use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig};
+
+    let config = ExtractionConfig {
+        layout: Some(LayoutDetectionConfig {
+            confidence_threshold: Some(0.5),
+            apply_heuristics: true,
+            table_model: Some("tatr".to_string()),
+            ..Default::default()
+        }),
+        ..Default::default()
+    };
+    ```
+
+=== "TOML"
+
+    ```toml title="kreuzberg.toml"
+    [layout]
+    apply_heuristics = true
+    # table_model = "tatr"
+    ```
+
+=== "CLI"
+
+    ```bash title="Terminal"
+    # Enable layout detection with default settings
+    kreuzberg extract document.pdf --layout --content-format markdown
+
+    # Custom confidence threshold
+    kreuzberg extract document.pdf --layout-confidence 0.5 --content-format markdown
+
+    # Specific table model
+    kreuzberg extract document.pdf --layout --layout-table-model slanet_wired
+
+    # Combined with GPU acceleration
+    kreuzberg extract document.pdf --layout --acceleration coreml
+    ```
+
+See [LayoutDetectionConfig](../reference/configuration.md#layoutdetectionconfig) for all fields.
+
+## Table Structure Models <span class="version-badge">v4.5.3</span>
+
+When layout detection identifies a table region, a table structure model analyzes rows, columns, headers, and spanning cells. Set `LayoutDetectionConfig.table_model` to one of:
+
+| Value             | Notes                                                       |
+| ----------------- | ----------------------------------------------------------- |
+| `tatr`            | Default. Fast (~30 MB). General-purpose.                    |
+| `slanet_wired`    | Higher accuracy for bordered/gridlined tables (~365 MB).    |
+| `slanet_wireless` | Higher accuracy for borderless tables (~365 MB).            |
+| `slanet_auto`     | Auto-classifies per page (~737 MB). Slowest.                |
+| `slanet_plus`     | Smallest (~7.78 MB). For resource-constrained environments. |
+| `disabled`        | Skip table structure recognition.                           |
+
+!!! Note "Model Download"
+    SLANeXT models are not downloaded by default. They download automatically on first use, or you can pre-download them with `cache warm --all-table-models`.
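+
+When choosing a value for `table_model`, download size is often the deciding factor. The sketch below validates a setting against the values in the table above and lists the models that fit a given download budget. The helper names (`check_table_model`, `models_within_budget`) are hypothetical and not part of Kreuzberg; the sizes are the approximate figures from the table.
+
+```python
+# Approximate download sizes in MB, copied from the table above.
+TABLE_MODEL_SIZES_MB = {
+    "tatr": 30.0,
+    "slanet_wired": 365.0,
+    "slanet_wireless": 365.0,
+    "slanet_auto": 737.0,
+    "slanet_plus": 7.78,
+}
+# "disabled" is a valid setting but has no model to download.
+VALID_TABLE_MODELS = set(TABLE_MODEL_SIZES_MB) | {"disabled"}
+
+
+def check_table_model(value: str) -> str:
+    """Return value unchanged if it is a documented table_model, else raise."""
+    if value not in VALID_TABLE_MODELS:
+        raise ValueError(
+            f"unknown table_model {value!r}; expected one of {sorted(VALID_TABLE_MODELS)}"
+        )
+    return value
+
+
+def models_within_budget(max_mb: float) -> list[str]:
+    """List models whose download fits in max_mb, smallest first."""
+    fitting = [(size, name) for name, size in TABLE_MODEL_SIZES_MB.items() if size <= max_mb]
+    return [name for size, name in sorted(fitting)]
+
+
+print(models_within_budget(50))  # ['slanet_plus', 'tatr']
+```
+
+With a 50 MB budget only `slanet_plus` and the default `tatr` qualify; raising the budget to 400 MB adds `slanet_wired` and `slanet_wireless`.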
+
+## GPU Acceleration
+
+Layout detection uses ONNX Runtime with automatic provider selection:
+
+| Provider | Platform       | Notes                         |
+| -------- | -------------- | ----------------------------- |
+| CPU      | All            | Default, no setup needed      |
+| CUDA     | Linux, Windows | Requires CUDA toolkit + cuDNN |
+| CoreML   | macOS          | Automatic on Apple Silicon    |
+| TensorRT | Linux          | Requires TensorRT             |
+
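+To override the automatic selection, set `acceleration` explicitly:
+
+```python
+# Assumes AccelerationConfig is exported from the top-level package like the other config classes
+from kreuzberg import AccelerationConfig, ExtractionConfig, LayoutDetectionConfig
+
+config = ExtractionConfig(
+    layout=LayoutDetectionConfig(),
+    acceleration=AccelerationConfig(provider="cuda", device_id=0),
+)
+```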
+
+See [AccelerationConfig reference](../reference/configuration.md#accelerationconfig) for details.
+
+## Layout Classes
+
+The RT-DETR v2 model detects 17 classes. Each `LayoutRegion.class_name` is one of:
+
+`caption`, `footnote`, `formula`, `list_item`, `page_footer`, `page_header`, `picture`, `section_header`, `table`, `text`, `title`, `document_index`, `code`, `checkbox_selected`, `checkbox_unselected`, `form`, `key_value_region`.
+
+See [`LayoutRegion`](../reference/types.md#layoutregion) in the types reference for the full field shape.
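+
+Because `class_name` is drawn from a fixed set, downstream code can check names against that set and tally detections per class. The following is a minimal sketch that works on plain strings; `LAYOUT_CLASSES` and `count_by_class` are illustrative names, not part of Kreuzberg.
+
+```python
+from collections import Counter
+
+# The 17 class names listed above.
+LAYOUT_CLASSES = frozenset({
+    "caption", "footnote", "formula", "list_item", "page_footer", "page_header",
+    "picture", "section_header", "table", "text", "title", "document_index",
+    "code", "checkbox_selected", "checkbox_unselected", "form", "key_value_region",
+})
+
+
+def count_by_class(class_names: list[str]) -> Counter:
+    """Count detections per class, rejecting names outside the documented set."""
+    unknown = set(class_names) - LAYOUT_CLASSES
+    if unknown:
+        raise ValueError(f"unexpected layout classes: {sorted(unknown)}")
+    return Counter(class_names)
+
+
+counts = count_by_class(["text", "text", "table", "picture", "text"])
+print(counts.most_common(1))  # [('text', 3)]
+```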
+
+## Accessing Layout Regions
+
+When both layout detection and page extraction are enabled, each page in the result includes `layout_regions`: a list of detected regions, each with a class name, confidence score, bounding box, and area fraction. This enables programmatic filtering and analysis of specific layout elements.
+
+=== "Python"
+
+    ```python
+    import asyncio
+
+    from kreuzberg import extract_file, ExtractionConfig, LayoutDetectionConfig, PagesConfig
+
+    async def main() -> None:
+        result = await extract_file(
+            "document.pdf",
+            config=ExtractionConfig(
+                layout=LayoutDetectionConfig(),
+                pages=PagesConfig(extract_pages=True),
+            ),
+        )
+
+        # Report every high-confidence picture region, page by page
+        for page in result.pages:
+            if page.layout_regions:
+                for region in page.layout_regions:
+                    if region.class_name == "picture" and region.confidence > 0.9:
+                        print(f"Page {page.page_number}: diagram detected "
+                              f"(confidence={region.confidence:.2f}, "
+                              f"area={region.area_fraction:.0%})")
+
+    asyncio.run(main())
+    ```
+
+=== "TypeScript"
+
+    ```typescript
+    const result = await extract("document.pdf", {
+      layout: {},
+      pages: { extractPages: true },
+    });
+
+    for (const page of result.pages ?? []) {
+      if (page.layoutRegions) {
+        for (const region of page.layoutRegions) {
+          if (region.className === "picture" && region.confidence > 0.9) {
+            console.log(
+              `Page ${page.pageNumber}: diagram detected ` +
+              `(confidence=${region.confidence.toFixed(2)}, ` +
+              `area=${(region.areaFraction * 100).toFixed(0)}%)`
+            );
+          }
+        }
+      }
+    }
+    ```
+
+=== "Rust"
+
+    ```rust
+    use kreuzberg::core::{ExtractionConfig, LayoutDetectionConfig, PagesConfig};
+
+    let result = extract_file(
+        "document.pdf",
+        ExtractionConfig {
+            layout: Some(LayoutDetectionConfig::default()),
+            pages: Some(PagesConfig {
+                extract_pages: true,
+                ..Default::default()
+            }),
+            ..Default::default()
+        },
+    ).await?;
+
+    for page in &result.pages {
+        if let Some(regions) = &page.layout_regions {
+            for region in regions {
+                if region.class_name == "picture" && region.confidence > 0.9 {
+                    println!(
+                        "Page {}: diagram detected (confidence={:.2}, area={:.0}%)",
+                        page.page_number,
+                        region.confidence,
+                        region.area_fraction * 100.0
+                    );
+                }
+            }
+        }
+    }
+    ```
+
+### Tips
+
+- Use `confidence` to filter out low-confidence detections; a minimum of 0.8 to 0.9 is typical for downstream operations
+- Use `area_fraction` to distinguish inline images from full-page diagrams (e.g., `area_fraction > 0.1` for significant figures)
+- Layout regions are attached to pages, so enable page extraction alongside layout detection to access both content and layout structure
+- Available across all bindings (Python, TypeScript, Rust, Ruby, Java, Go, Elixir, C#, PHP)
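+
+The first two tips can be combined into one filter. The sketch below uses a hypothetical `Region` dataclass as a stand-in for `LayoutRegion`, carrying only the three fields it needs, so it runs without Kreuzberg installed. The default cut-offs mirror the values suggested above and are assumptions to tune per corpus.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Region:
+    """Hypothetical stand-in for LayoutRegion, limited to the fields used here."""
+
+    class_name: str
+    confidence: float
+    area_fraction: float
+
+
+def significant_regions(regions, class_name, min_confidence=0.8, min_area=0.1):
+    """Keep regions of one class that pass both the confidence and area cut-offs."""
+    return [
+        r for r in regions
+        if r.class_name == class_name
+        and r.confidence >= min_confidence
+        and r.area_fraction > min_area
+    ]
+
+
+regions = [
+    Region("picture", 0.95, 0.40),  # large and confident: kept
+    Region("picture", 0.92, 0.03),  # confident but tiny (inline image): dropped
+    Region("picture", 0.55, 0.30),  # large but low confidence: dropped
+    Region("table", 0.97, 0.25),    # different class: dropped
+]
+kept = significant_regions(regions, "picture")
+print(len(kept))  # 1
+```
+
+The same function works on real results by passing `page.layout_regions or []`, since it only reads `class_name`, `confidence`, and `area_fraction`.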
+
+## Acknowledgments
+
+- **[Docling](https://github.com/DS4SD/docling)** — RT-DETR v2 model and layout classification approach
+- **[TATR](https://github.com/microsoft/table-transformer)** — Table structure recognition with ONNX
+- **[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)** — SLANeXT table structure and PP-LCNet classifier models
+
+## Related
+
+- [Configuration Reference](../reference/configuration.md#layoutdetectionconfig) — full field reference
+- [Element-Based Output](output-formats.md#element-based-output-v410) — using layout-aware results