Nomad changes

2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions
--- a/docs/migration/from-unstructured.md
+++ b/docs/migration/from-unstructured.md
@@ -0,0 +1,322 @@
+# Migrating from Unstructured to Kreuzberg
+
+This guide helps you migrate from Unstructured.io to Kreuzberg for document intelligence workloads.
+
+## Quick Start
+
+**Unstructured API**:
+
+```bash
+curl -X POST "https://api.unstructured.io/general/v0/general" \
+  -F 'files=@document.pdf'
+```
+
+**Kreuzberg API**:
+
+```bash
+curl -X POST "http://localhost:8080/extract" \
+  -F 'files=@document.pdf' \
+  -F 'output_format=element_based'
+```
+
+## Output Format Comparison
+
+### Unified Output (Default)
+
+Kreuzberg's default unified output carries richer metadata than Unstructured's element list:
+
+**Kreuzberg Unified**:
+
+```json
+{
+  "content": "Full document text...",
+  "mime_type": "application/pdf",
+  "metadata": {
+    "title": "Document Title",
+    "authors": ["Author Name"],
+    "created_at": "2024-01-15T10:30:00Z",
+    "format": {
+      "format_type": "pdf",
+      "page_count": 10,
+      "version": "1.7"
+    }
+  },
+  "tables": [...],
+  "images": [...],
+  "pages": [...]
+}
+```
+
+### Element-Based Output
+
+**Kreuzberg** (when `output_format=element_based`):
+
+```json
+{
+  "elements": [
+    {
+      "element_id": "elem-a3f2b1c4",
+      "element_type": "title",
+      "text": "Introduction",
+      "metadata": {
+        "page_number": 1,
+        "filename": "Document Title",
+        "coordinates": {
+          "x0": 72.0,
+          "y0": 100.0,
+          "x1": 540.0,
+          "y1": 130.0
+        },
+        "element_index": 0,
+        "additional": {
+          "level": "h1",
+          "font_size": "24.0"
+        }
+      }
+    },
+    {
+      "element_type": "narrative_text",
+      "text": "This is a paragraph...",
+      "metadata": {
+        "page_number": 1
+      }
+    }
+  ]
+}
+```
+
+**Unstructured**:
+
+```json
+[
+  {
+    "type": "Title",
+    "text": "Introduction",
+    "metadata": {
+      "page_number": 1,
+      "filename": "document.pdf"
+    }
+  },
+  {
+    "type": "NarrativeText",
+    "text": "This is a paragraph...",
+    "metadata": {
+      "page_number": 1
+    }
+  }
+]
+```
+
+## API Endpoint Mapping
+
+| Unstructured               | Kreuzberg          | Notes                             |
+| -------------------------- | ------------------ | --------------------------------- |
+| `POST /general/v0/general` | `POST /extract`    | Single/batch extraction           |
+| N/A                        | `POST /embed`      | Built-in embeddings (ONNX models) |
+| N/A                        | `GET /health`      | Health check                      |
+| N/A                        | `GET /cache/stats` | Cache statistics                  |
+
+## Element Type Mapping
+
+| Unstructured    | Kreuzberg        | Notes                               |
+| --------------- | ---------------- | ----------------------------------- |
+| `Title`         | `title`          | PDF hierarchy (h1-h6) detection     |
+| `NarrativeText` | `narrative_text` | Paragraphs split on double newlines |
+| `ListItem`      | `list_item`      | Bullets, numbered, lettered         |
+| `Table`         | `table`          | Tab-separated text representation   |
+| `Image`         | `image`          | Format, dimensions in metadata      |
+| `PageBreak`     | `page_break`     | Between pages in multi-page docs    |
+| `Header`        | `header`         | Page header text                    |
+| `Footer`        | `footer`         | Page footer text                    |
+| N/A             | `heading`        | Section headings (beyond title)     |
+| N/A             | `code_block`     | Code snippets                       |
+| N/A             | `block_quote`    | Quoted text blocks                  |
+
+## Code Examples
+
+### Python
+
+**Unstructured**:
+
+```python
+from unstructured.partition.auto import partition
+
+elements = partition(filename="document.pdf")
+for element in elements:
+    print(f"{element.category}: {element.text}")
+```
+
+**Kreuzberg**:
+
+```python
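+from pathlib import Path
+
+from kreuzberg import extract_bytes
+
+# Read the document into memory; extract_bytes takes raw bytes plus a MIME type
+pdf_bytes = Path("document.pdf").read_bytes()
+
+# Option 1: Element-based output
+config = {"output_format": "element_based"}
+result = extract_bytes(pdf_bytes, "application/pdf", config)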
+
+for element in result.elements:
+    print(f"{element.element_type}: {element.text}")
+    if element.metadata.page_number:
+        print(f"  Page: {element.metadata.page_number}")
+
+# Option 2: Unified output (default, richer metadata)
+result = extract_bytes(pdf_bytes, "application/pdf")
+print(result.content)  # Full text
+print(result.metadata.title)  # Document metadata
+for page in result.pages:
+    print(f"Page {page.page_number}: {page.content[:100]}")
+```
+
+### TypeScript
+
+**Unstructured** (via API):
+
+```typescript
+const formData = new FormData();
+formData.append("files", fileBlob);
+
+const response = await fetch("https://api.unstructured.io/general/v0/general", {
+  method: "POST",
+  body: formData,
+});
+const elements = await response.json();
+```
+
+**Kreuzberg**:
+
+```typescript
+import { extractBytes } from "kreuzberg";
+
+// Option 1: Element-based output
+const result = await extractBytes(pdfBuffer, "application/pdf", {
+  output_format: "element_based",
+});
+
+for (const element of result.elements) {
+  console.log(`${element.element_type}: ${element.text}`);
+}
+
+// Option 2: Unified output with pages
+// (named pagedResult because `result` is already declared above)
+const pagedResult = await extractBytes(pdfBuffer, "application/pdf", {
+  pages: { extract_pages: true },
+});
+
+for (const page of pagedResult.pages) {
+  console.log(`Page ${page.page_number}:`, page.content);
+}
+```
+
+### cURL
+
+**Unstructured**:
+
+```bash
+curl -X POST "https://api.unstructured.io/general/v0/general" \
+  -H "unstructured-api-key: $API_KEY" \
+  -F 'files=@document.pdf' \
+  -F 'strategy=hi_res'
+```
+
+**Kreuzberg**:
+
+```bash
+# Element-based output
+curl -X POST "http://localhost:8080/extract" \
+  -F 'files=@document.pdf' \
+  -F 'output_format=element_based'
+
+# With configuration JSON
+curl -X POST "http://localhost:8080/extract" \
+  -F 'files=@document.pdf' \
+  -F 'config={"output_format":"element_based","pages":{"extract_pages":true}}'
+```
+
+## Feature Comparison
+
+### What Kreuzberg Adds
+
+1. **Richer Metadata**: Format-specific discriminated unions (PDF, Excel, Email, etc.)
+2. **Native Per-Page**: `PageContent` with byte offsets, hierarchy, tables, images per page
+3. **90+ Formats**: vs Unstructured's ~30 formats
+4. **Performance**: Rust-based native implementation (vs Python-based)
+5. **10 Language Bindings**: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, Rust, WASM
+6. **Built-in Embeddings**: ONNX models via `/embed` endpoint (no external API)
+7. **Smart Hierarchy**: PDF font-size clustering for h1-h6 detection
+8. **Bounding Boxes**: Preserved from PDF source in element coordinates
+
+### What Unstructured Has
+
+1. **Layout Detection Models**: ML-based layout analysis (GPU-accelerated)
+2. **Cloud API**: Hosted service (Kreuzberg requires self-hosting)
+3. **More Element Types**: More granular element classification
+4. **Mature Ecosystem**: Larger community, more integrations
+
+## Configuration Mapping
+
+| Unstructured Parameter                | Kreuzberg Config                     | Notes                              |
+| ------------------------------------- | ------------------------------------ | ---------------------------------- |
+| `strategy=hi_res`                     | `pdf_options.hierarchy.enabled=true` | PDF hierarchy extraction           |
+| `coordinates=true`                    | Always included when available       | Bounding boxes in element metadata |
+| `languages=["eng"]`                   | `ocr.language="eng"`                 | OCR language                       |
+| `extract_image_block_types=["image"]` | `images.extract_images=true`         | Image extraction                   |
+| `chunking_strategy="by_title"`        | `chunking.max_chars=1000`            | Text chunking (basic)              |
+| `embedding_model="..."`               | `chunking.embedding.model="..."`     | Embedding generation               |
+
+## Migration Checklist
+
+- [ ] Update API endpoint URLs (Unstructured → Kreuzberg)
+- [ ] Add `output_format=element_based` if using element-based workflow
+- [ ] Update element type references (`Title` → `title`, PascalCase → snake_case)
+- [ ] Update metadata field references (Kreuzberg has richer metadata structure)
+- [ ] Test with sample documents to verify output equivalence
+- [ ] Update error handling (Kreuzberg uses HTTP 422 for validation errors)
+- [ ] Configure caching if needed (Kreuzberg has built-in file-based cache)
+- [ ] Set up embeddings if using RAG pipeline (Kreuzberg has built-in ONNX support)
+
+## Advanced: Hybrid Approach
+
+You can use **both formats** simultaneously:
+
+```python
+from kreuzberg import extract_bytes
+
+result = extract_bytes(pdf_bytes, "application/pdf", {
+    "output_format": "element_based",  # Get elements
+    "pages": {"extract_pages": True}   # Also get per-page content
+})
+
+# Element-based processing
+for element in result.elements:
+    if element.element_type == "title":
+        index_heading(element.text)
+
+# Page-based processing
+for page in result.pages:
+    if page.hierarchy:
+        for block in page.hierarchy.blocks:
+            if block.level == "h1":
+                process_section(block.text)
+```
+
+## Performance Tips
+
+1. **Enable Caching**: `use_cache: true` (default) for repeated extractions
+2. **Disable OCR**: If documents are searchable PDFs, set `force_ocr: false`
+3. **Limit Page Extraction**: Only enable `pages` if you need per-page content
+4. **Batch Processing**: Send multiple files in single request (up to 10MB total)
+5. **Use Embeddings Wisely**: Enable only for chunked content destined for vector DB
+
+## Getting Help
+
+- **Documentation**: <https://github.com/kreuzberg-dev/Kreuzberg>
+- **Issues**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
+- **API Reference**: See `docs/api/` for endpoint documentation
+
+## Next Steps
+
+After migration:
+
+1. Review the [Kreuzberg vs Unstructured Comparison](../comparisons/kreuzberg-vs-unstructured.md)
+2. Explore Kreuzberg-specific features (hierarchy, per-page metadata, embeddings)
+3. Optimize your pipeline with native Rust performance
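Note on `docs/migration/from-unstructured.md`: every pair in the Element Type Mapping table differs only in casing, so fixtures or downstream code that still hold Unstructured-style elements can be reshaped mechanically while verifying output equivalence. The sketch below is a minimal illustration using only the Python standard library. `to_kreuzberg_type` and `to_kreuzberg_element` are hypothetical helper names, not part of either library, and the Kreuzberg-only types (`heading`, `code_block`, `block_quote`) have no Unstructured counterpart to convert from.

```python
import re


def to_kreuzberg_type(unstructured_type: str) -> str:
    """Convert a PascalCase type name (e.g. "NarrativeText") to snake_case."""
    # Insert "_" before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", unstructured_type).lower()


def to_kreuzberg_element(element: dict) -> dict:
    """Reshape one Unstructured-style element dict into the Kreuzberg element layout.

    Hypothetical helper: only the fields shown in the guide's JSON samples are handled.
    """
    return {
        "element_type": to_kreuzberg_type(element["type"]),
        "text": element["text"],
        "metadata": dict(element.get("metadata", {})),
    }


# Sample data copied from the Unstructured JSON example in the guide.
unstructured_elements = [
    {
        "type": "Title",
        "text": "Introduction",
        "metadata": {"page_number": 1, "filename": "document.pdf"},
    },
    {
        "type": "NarrativeText",
        "text": "This is a paragraph...",
        "metadata": {"page_number": 1},
    },
]

converted = [to_kreuzberg_element(e) for e in unstructured_elements]
for element in converted:
    print(f"{element['element_type']}: {element['text']}")
```

A conversion like this is mainly useful for comparing old Unstructured fixtures against new Kreuzberg output during the "verify output equivalence" step of the checklist. Fields that only Kreuzberg emits, such as `element_id` or `coordinates`, are simply absent from the converted dicts.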
--- a/docs/migration/v4.0-fonts.md
+++ b/docs/migration/v4.0-fonts.md
@@ -0,0 +1,644 @@
+# Font Configuration Breaking Change (v4.0)
+
+## Summary
+
+Custom font provider is now **enabled by default** for improved PDF performance.
+
+## Breaking Change
+
+**Previous behavior** (v3.x):
+
+- Font provider always enabled, not configurable
+- Used system fonts only
+- No user control over font loading
+
+**New behavior** (v4.0):
+
+- Font provider enabled by default
+- Configurable via `FontConfig` in `PdfConfig`
+- Can disable or add custom font directories
+- ~12-13% faster PDF processing with font caching
+
+## Impact
+
+**Who is affected?**
+
+- Users who rely on the PDF extractor's default font fallback behavior
+- Users who want to disable the custom font provider
+- Users who need to add custom font directories
+
+**What changes?**
+
+- Default: Custom font provider now active (breaking change)
+- Performance: PDF extraction 12-13% faster
+- API: New `font_config` option in `PdfConfig`
+
+## Migration
+
+### No Action Required (Recommended)
+
+For most users, no changes needed. Default behavior provides performance improvements:
+
+=== "Rust"
+
+    ```rust
+    use kreuzberg::ExtractionConfig;
+
+    // Previous (v3.x) - no font configuration
+    let config = ExtractionConfig::default();
+
+    // Current (v4.0) - same code, now with font provider enabled
+    let config = ExtractionConfig::default();
+    // Font provider automatically enabled with system fonts
+    ```
+
+=== "Python"
+
+    ```python
+    from kreuzberg import ExtractionConfig
+
+    # Previous (v3.x)
+    config = ExtractionConfig()
+
+    # Current (v4.0) - same code, now with font provider enabled
+    config = ExtractionConfig()
+    # Font provider automatically enabled with system fonts
+    ```
+
+=== "TypeScript"
+
+    ```typescript
+    import { ExtractionConfig } from 'kreuzberg';
+
+    // Previous (v3.x)
+    const config: ExtractionConfig = {};
+
+    // Current (v4.0) - same code, now with font provider enabled
+    const config: ExtractionConfig = {};
+    // Font provider automatically enabled with system fonts
+    ```
+
+=== "Java"
+
+    ```java
+    import dev.kreuzberg.config.*;
+
+    // Previous (v3.x)
+    ExtractionConfig config = ExtractionConfig.builder().build();
+
+    // Current (v4.0) - same code, now with font provider enabled
+    ExtractionConfig config = ExtractionConfig.builder().build();
+    // Font provider automatically enabled with system fonts
+    ```
+
+=== "Go"
+
+    ```go
+    import "github.com/kreuzberg-dev/kreuzberg/v4"
+
+    // Previous (v3.x)
+    config := &kreuzberg.ExtractionConfig{}
+
+    // Current (v4.0) - same code, now with font provider enabled
+    config := &kreuzberg.ExtractionConfig{}
+    // Font provider automatically enabled with system fonts
+    ```
+
+=== "Ruby"
+
+    ```ruby
+    require 'kreuzberg'
+
+    # Previous (v3.x)
+    config = Kreuzberg::ExtractionConfig.new
+
+    # Current (v4.0) - same code, now with font provider enabled
+    config = Kreuzberg::ExtractionConfig.new
+    # Font provider automatically enabled with system fonts
+    ```
+
+=== "C#"
+
+    ```csharp
+    using Kreuzberg;
+
+    // Previous (v3.x)
+    var config = new ExtractionConfig();
+
+    // Current (v4.0) - same code, now with font provider enabled
+    var config = new ExtractionConfig();
+    // Font provider automatically enabled with system fonts
+    ```
+
+### Disable Font Provider
+
+If you prefer the PDF extractor's built-in font fallback, disable the custom font provider:
+
+=== "Rust"
+
+    ```rust
+    use kreuzberg::{ExtractionConfig, PdfConfig, FontConfig};
+
+    let config = ExtractionConfig {
+        pdf_options: Some(PdfConfig {
+            font_config: Some(FontConfig {
+                enabled: false,
+                custom_font_dirs: None,
+            }),
+            ..Default::default()
+        }),
+        ..Default::default()
+    };
+    ```
+
+=== "Python"
+
+    ```python
+    from kreuzberg import ExtractionConfig, PdfConfig, FontConfig
+
+    config = ExtractionConfig(
+        pdf_options=PdfConfig(
+            font_config=FontConfig(enabled=False)
+        )
+    )
+    ```
+
+=== "TypeScript"
+
+    ```typescript
+    import { ExtractionConfig } from 'kreuzberg';
+
+    const config: ExtractionConfig = {
+      pdfOptions: {
+        fontConfig: {
+          enabled: false
+        }
+      }
+    };
+    ```
+
+=== "Java"
+
+    ```java
+    import dev.kreuzberg.config.*;
+
+    FontConfig fontConfig = FontConfig.builder()
+        .enabled(false)
+        .build();
+
+    PdfConfig pdfConfig = PdfConfig.builder()
+        .fontConfig(fontConfig)
+        .build();
+
+    ExtractionConfig config = ExtractionConfig.builder()
+        .pdfOptions(pdfConfig)
+        .build();
+    ```
+
+=== "Go"
+
+    ```go
+    import "github.com/kreuzberg-dev/kreuzberg/v4"
+
+    config := &kreuzberg.ExtractionConfig{
+        PdfOptions: &kreuzberg.PdfConfig{
+            FontConfig: &kreuzberg.FontConfig{
+                Enabled: false,
+            },
+        },
+    }
+    ```
+
+=== "Ruby"
+
+    ```ruby
+    require 'kreuzberg'
+
+    config = Kreuzberg::ExtractionConfig.new(
+      pdf_options: Kreuzberg::PdfConfig.new(
+        font_config: Kreuzberg::FontConfig.new(enabled: false)
+      )
+    )
+    ```
+
+=== "C#"
+
+    ```csharp
+    using Kreuzberg;
+
+    var fontConfig = new FontConfig { Enabled = false };
+    var pdfConfig = new PdfConfig { FontConfig = fontConfig };
+    var config = new ExtractionConfig { PdfOptions = pdfConfig };
+    ```
+
+### Add Custom Font Directories
+
+To use fonts from custom directories (in addition to system fonts):
+
+=== "Rust"
+
+    ```rust
+    use kreuzberg::{ExtractionConfig, PdfConfig, FontConfig};
+    use std::path::PathBuf;
+
+    let config = ExtractionConfig {
+        pdf_options: Some(PdfConfig {
+            font_config: Some(FontConfig {
+                enabled: true,
+                custom_font_dirs: Some(vec![
+                    PathBuf::from("/usr/share/fonts/custom"),
+                    PathBuf::from("~/my-fonts"),  // Tilde expanded automatically
+                ]),
+            }),
+            ..Default::default()
+        }),
+        ..Default::default()
+    };
+    ```
+
+=== "Python"
+
+    ```python
+    from kreuzberg import ExtractionConfig, PdfConfig, FontConfig
+
+    config = ExtractionConfig(
+        pdf_options=PdfConfig(
+            font_config=FontConfig(
+                enabled=True,
+                custom_font_dirs=[
+                    "/usr/share/fonts/custom",
+                    "~/my-fonts"  # Tilde expanded automatically
+                ]
+            )
+        )
+    )
+    ```
+
+=== "TypeScript"
+
+    ```typescript
+    import { ExtractionConfig } from 'kreuzberg';
+
+    const config: ExtractionConfig = {
+      pdfOptions: {
+        fontConfig: {
+          enabled: true,
+          customFontDirs: [
+            '/usr/share/fonts/custom',
+            '~/my-fonts'  // Tilde expanded automatically
+          ]
+        }
+      }
+    };
+    ```
+
+=== "Java"
+
+    ```java
+    import dev.kreuzberg.config.*;
+    import java.nio.file.Paths;
+    import java.util.Arrays;
+
+    FontConfig fontConfig = FontConfig.builder()
+        .enabled(true)
+        .customFontDirs(Arrays.asList(
+            Paths.get("/usr/share/fonts/custom"),
+            Paths.get("~/my-fonts")  // Tilde expanded automatically
+        ))
+        .build();
+
+    PdfConfig pdfConfig = PdfConfig.builder()
+        .fontConfig(fontConfig)
+        .build();
+
+    ExtractionConfig config = ExtractionConfig.builder()
+        .pdfOptions(pdfConfig)
+        .build();
+    ```
+
+=== "Go"
+
+    ```go
+    import "github.com/kreuzberg-dev/kreuzberg/v4"
+
+    config := &kreuzberg.ExtractionConfig{
+        PdfOptions: &kreuzberg.PdfConfig{
+            FontConfig: &kreuzberg.FontConfig{
+                Enabled: true,
+                CustomFontDirs: []string{
+                    "/usr/share/fonts/custom",
+                    "~/my-fonts",  // Tilde expanded automatically
+                },
+            },
+        },
+    }
+    ```
+
+=== "Ruby"
+
+    ```ruby
+    require 'kreuzberg'
+
+    config = Kreuzberg::ExtractionConfig.new(
+      pdf_options: Kreuzberg::PdfConfig.new(
+        font_config: Kreuzberg::FontConfig.new(
+          enabled: true,
+          custom_font_dirs: [
+            '/usr/share/fonts/custom',
+            '~/my-fonts'  # Tilde expanded automatically
+          ]
+        )
+      )
+    )
+    ```
+
+=== "C#"
+
+    ```csharp
+    using Kreuzberg;
+
+    var fontConfig = new FontConfig
+    {
+        Enabled = true,
+        CustomFontDirs = new[]
+        {
+            "/usr/share/fonts/custom",
+            "~/my-fonts"  // Tilde expanded automatically
+        }
+    };
+
+    var pdfConfig = new PdfConfig { FontConfig = fontConfig };
+    var config = new ExtractionConfig { PdfOptions = pdfConfig };
+    ```
+
+## Configuration Files
+
+### TOML Format
+
+```toml title="Font Configuration in TOML"
+[pdf_options.font_config]
+enabled = true
+custom_font_dirs = ["/usr/share/fonts/custom", "~/my-fonts"]
+```
+
+### YAML Format
+
+```yaml title="Font Configuration in YAML"
+pdf_options:
+  font_config:
+    enabled: true
+    custom_font_dirs:
+      - /usr/share/fonts/custom
+      - ~/my-fonts
+```
+
+### JSON Format
+
+```json title="Font Configuration in JSON"
+{
+  "pdf_options": {
+    "font_config": {
+      "enabled": true,
+      "custom_font_dirs": ["/usr/share/fonts/custom", "~/my-fonts"]
+    }
+  }
+}
+```
+
+## Path Handling
+
+The font configuration automatically handles:
+
+- **Tilde expansion**: `~/fonts` → `/Users/username/fonts`
+- **Relative paths**: `./fonts` → `/absolute/path/to/fonts`
+- **Symlinks**: Resolved to canonical paths (security measure)
+- **Validation**: Directories must exist; warnings logged if not found
+- **Graceful degradation**: Missing directories don't cause failures
+
+## Global Configuration
+
+**Important**: Font configuration is global per process and must be set **before the first PDF extraction**.
+
+=== "Rust"
+
+    ```rust
+    // CORRECT: Set config before first extraction
+    let config = ExtractionConfig {
+        pdf_options: Some(PdfConfig {
+            font_config: Some(FontConfig {
+                enabled: true,
+                custom_font_dirs: Some(vec![
+                    PathBuf::from("/usr/share/fonts/custom"),
+                ]),
+            }),
+            ..Default::default()
+        }),
+        ..Default::default()
+    };
+
+    let result = kreuzberg::extract_file("document.pdf", &config)?;
+
+    // INCORRECT: Attempting to change config after first extraction
+    let new_config = ExtractionConfig {
+        pdf_options: Some(PdfConfig {
+            font_config: Some(FontConfig {
+                enabled: false,
+                custom_font_dirs: None,
+            }),
+            ..Default::default()
+        }),
+        ..Default::default()
+    };
+    let result2 = kreuzberg::extract_file("document2.pdf", &new_config)?;
+    // Warning logged: "Font config already initialized"
+    ```
+
+=== "Python"
+
+    ```python
+    # CORRECT: Set config before first extraction
+    config = ExtractionConfig(
+        pdf_options=PdfConfig(
+            font_config=FontConfig(
+                enabled=True,
+                custom_font_dirs=["/usr/share/fonts/custom"]
+            )
+        )
+    )
+    result = extract_file("document.pdf", config)
+
+    # INCORRECT: Attempting to change config after first extraction
+    new_config = ExtractionConfig(
+        pdf_options=PdfConfig(
+            font_config=FontConfig(enabled=False)
+        )
+    )
+    result2 = extract_file("document2.pdf", new_config)
+    # Warning logged: "Font config already initialized"
+    ```
+
+=== "TypeScript"
+
+    ```typescript
+    // CORRECT: Set config before first extraction
+    const config: ExtractionConfig = {
+      pdfOptions: {
+        fontConfig: {
+          enabled: true,
+          customFontDirs: ['/usr/share/fonts/custom']
+        }
+      }
+    };
+    const result = await extractFile('document.pdf', config);
+
+    // INCORRECT: Attempting to change config after first extraction
+    const newConfig: ExtractionConfig = {
+      pdfOptions: {
+        fontConfig: { enabled: false }
+      }
+    };
+    const result2 = await extractFile('document2.pdf', newConfig);
+    // Warning logged: "Font config already initialized"
+    ```
+
+=== "Java"
+
+    ```java
+    // CORRECT: Set config before first extraction
+    FontConfig fontConfig = FontConfig.builder()
+        .enabled(true)
+        .customFontDirs(Arrays.asList(Paths.get("/usr/share/fonts/custom")))
+        .build();
+    PdfConfig pdfConfig = PdfConfig.builder()
+        .fontConfig(fontConfig)
+        .build();
+    ExtractionConfig config = ExtractionConfig.builder()
+        .pdfOptions(pdfConfig)
+        .build();
+    ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
+
+    // INCORRECT: Attempting to change config after first extraction
+    FontConfig newFontConfig = FontConfig.builder()
+        .enabled(false)
+        .build();
+    PdfConfig newPdfConfig = PdfConfig.builder()
+        .fontConfig(newFontConfig)
+        .build();
+    ExtractionConfig newConfig = ExtractionConfig.builder()
+        .pdfOptions(newPdfConfig)
+        .build();
+    ExtractionResult result2 = Kreuzberg.extractFile("document2.pdf", newConfig);
+    // Warning logged: "Font config already initialized"
+    ```
+
+=== "Go"
+
+    ```go
+    // CORRECT: Set config before first extraction
+    config := &kreuzberg.ExtractionConfig{
+        PdfOptions: &kreuzberg.PdfConfig{
+            FontConfig: &kreuzberg.FontConfig{
+                Enabled: true,
+                CustomFontDirs: []string{"/usr/share/fonts/custom"},
+            },
+        },
+    }
+    result, _ := kreuzberg.ExtractFile("document.pdf", config)
+
+    // INCORRECT: Attempting to change config after first extraction
+    newConfig := &kreuzberg.ExtractionConfig{
+        PdfOptions: &kreuzberg.PdfConfig{
+            FontConfig: &kreuzberg.FontConfig{
+                Enabled: false,
+            },
+        },
+    }
+    result2, _ := kreuzberg.ExtractFile("document2.pdf", newConfig)
+    // Warning logged: "Font config already initialized"
+    ```
+
+=== "Ruby"
+
+    ```ruby
+    # CORRECT: Set config before first extraction
+    config = Kreuzberg::ExtractionConfig.new(
+      pdf_options: Kreuzberg::PdfConfig.new(
+        font_config: Kreuzberg::FontConfig.new(
+          enabled: true,
+          custom_font_dirs: ['/usr/share/fonts/custom']
+        )
+      )
+    )
+    result = Kreuzberg.extract_file('document.pdf', config)
+
+    # INCORRECT: Attempting to change config after first extraction
+    new_config = Kreuzberg::ExtractionConfig.new(
+      pdf_options: Kreuzberg::PdfConfig.new(
+        font_config: Kreuzberg::FontConfig.new(enabled: false)
+      )
+    )
+    result2 = Kreuzberg.extract_file('document2.pdf', new_config)
+    # Warning logged: "Font config already initialized"
+    ```
+
+=== "C#"
+
+    ```csharp
+    // CORRECT: Set config before first extraction
+    var fontConfig = new FontConfig
+    {
+        Enabled = true,
+        CustomFontDirs = new[] { "/usr/share/fonts/custom" }
+    };
+    var pdfConfig = new PdfConfig { FontConfig = fontConfig };
+    var config = new ExtractionConfig { PdfOptions = pdfConfig };
+    var result = Kreuzberg.ExtractFile("document.pdf", config);
+
+    // INCORRECT: Attempting to change config after first extraction
+    var newFontConfig = new FontConfig { Enabled = false };
+    var newPdfConfig = new PdfConfig { FontConfig = newFontConfig };
+    var newConfig = new ExtractionConfig { PdfOptions = newPdfConfig };
+    var result2 = Kreuzberg.ExtractFile("document2.pdf", newConfig);
+    // Warning logged: "Font config already initialized"
+    ```
+
+## Performance Impact
+
+With default settings (enabled=true, system fonts):
+
+- **PDF extraction**: ~12-13% faster
+- **Memory**: Minimal increase (~100KB for font cache)
+- **Startup**: Lazy initialization (no overhead for non-PDF workloads)
+
+## Troubleshooting
+
+### Custom fonts not working
+
+**Symptom**: PDF still uses fallback fonts
+
+**Solutions**:
+
+1. Verify directories exist and contain .ttf/.otf/.ttc files
+2. Check logs for "Custom font directory not found" warnings
+3. Ensure paths are absolute or properly expanded
+4. Verify font files are readable
+
+### "Font config already initialized" warning
+
+**Symptom**: Configuration changes ignored after first PDF extraction
+
+**Solution**: Set `FontConfig` in the **first** `ExtractionConfig` used. Font configuration is global per process, so changes made after the first PDF extraction are ignored.
+
+### Performance regression
+
+**Symptom**: PDF extraction slower after upgrade
+
+**Solution**: This is unexpected. Please report as a bug with:
+
+- PDF sample (if shareable)
+- Benchmark comparison (before/after)
+- Configuration used
+
+## Questions?
+
+- **Issue tracker**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
+- **Discussions**: <https://github.com/kreuzberg-dev/kreuzberg/discussions>
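Note on `docs/migration/v4.0-fonts.md`: the Path Handling rules (tilde expansion, relative-path resolution, symlink canonicalization, and skipping missing directories with a warning) can be pictured with a few lines of standard-library Python. This is only a sketch of the described behavior, not Kreuzberg's actual implementation. `normalize_font_dirs` is a hypothetical name, and the warning text is borrowed from the Troubleshooting section.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("font_dirs_sketch")


def normalize_font_dirs(dirs):
    """Return canonical paths for the directories that exist, skipping the rest.

    Hypothetical helper illustrating the documented rules; not Kreuzberg code.
    """
    resolved = []
    for raw in dirs:
        # expanduser() handles "~"; resolve() makes the path absolute
        # and follows symlinks to their canonical target.
        path = Path(raw).expanduser().resolve()
        if path.is_dir():
            resolved.append(path)
        else:
            # Graceful degradation: warn and continue instead of raising.
            logger.warning("Custom font directory not found: %s", path)
    return resolved


# Demonstration with one existing and one missing directory.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "fonts"
    existing.mkdir()
    kept = normalize_font_dirs([str(existing), str(Path(tmp) / "missing")])
    kept_names = [p.name for p in kept]

print(kept_names)
```

Running a check like this against your own `custom_font_dirs` before the first PDF extraction can help catch typos early, since the guide states that font configuration cannot be changed once it has been initialized.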
--- a/docs/migration/v4.0-html-metadata.md
+++ b/docs/migration/v4.0-html-metadata.md
@@ -0,0 +1,799 @@
+# HTML Metadata Structure Changes (v4.0)
+
+## Summary
+
+HTML metadata has been restructured for better organization and type safety. The changes consolidate individual Open Graph and Twitter Card fields into maps, and convert keywords from a single string to an array.
+
+## Breaking Changes
+
+### 1. Keywords: String to Array
+
+**Before (v3.x):**
+
+```rust title="Keywords as Comma-Separated String"
+// Option<String> - comma-separated or space-separated
+html_meta.keywords  // "seo, metadata, html"
+```
+
+**After (v4.0):**
+
+```rust title="Keywords as Structured Array"
+// Vec<String> - structured array
+html_meta.keywords  // vec!["seo", "metadata", "html"]
+```
+
+### 2. Canonical URL: Field Rename
+
+**Before (v3.x):**
+
+```rust title="Canonical Field (v3.x)"
+html_meta.canonical  // Option<String>
+```
+
+**After (v4.0):**
+
+```rust title="Canonical URL Field (v4.0)"
+html_meta.canonical_url  // Option<String>
+```
+
+### 3. Open Graph: Individual Fields to Map
+
+**Before (v3.x):**
+
+```rust title="Open Graph as Individual Fields"
+html_meta.og_title          // Option<String>
+html_meta.og_description    // Option<String>
+html_meta.og_image          // Option<String>
+html_meta.og_url            // Option<String>
+html_meta.og_type           // Option<String>
+html_meta.og_site_name      // Option<String>
+```
+
+**After (v4.0):**
+
+```rust title="Open Graph as Map Structure"
+html_meta.open_graph        // BTreeMap<String, String>
+html_meta.open_graph.get("title")         // Option<&String>
+html_meta.open_graph.get("description")   // Option<&String>
+html_meta.open_graph.get("image")         // Option<&String>
+html_meta.open_graph.get("url")           // Option<&String>
+html_meta.open_graph.get("type")          // Option<&String>
+html_meta.open_graph.get("site_name")     // Option<&String>
+```
+
+### 4. Twitter Card: Individual Fields to Map
+
+**Before (v3.x):**
+
+```rust title="Twitter Card as Individual Fields"
+html_meta.twitter_card          // Option<String>
+html_meta.twitter_title         // Option<String>
+html_meta.twitter_description   // Option<String>
+html_meta.twitter_image         // Option<String>
+html_meta.twitter_site          // Option<String>
+html_meta.twitter_creator       // Option<String>
+```
+
+**After (v4.0):**
+
+```rust title="Twitter Card as Map Structure"
+html_meta.twitter_card          // BTreeMap<String, String>
+html_meta.twitter_card.get("card")          // Option<&String>
+html_meta.twitter_card.get("title")         // Option<&String>
+html_meta.twitter_card.get("description")   // Option<&String>
+html_meta.twitter_card.get("image")         // Option<&String>
+html_meta.twitter_card.get("site")          // Option<&String>
+html_meta.twitter_card.get("creator")       // Option<&String>
+```
+
+### 5. Removed Fields
+
+The following link-related fields have been removed:
+
+- `link_author`
+- `link_license`
+- `link_alternate`
+
+Use the new `links` field instead for comprehensive link extraction.
+
+### 6. New Fields
+
+HTML metadata now includes additional fields describing page content:
+
+- **`language`**: Document language (for example, "en", "fr")
+- **`text_direction`**: Text direction ("ltr", "rtl")
+- **`headers`**: List of page headers/headings with structured metadata
+- **`links`**: List of links with detailed metadata and type classification
+- **`images`**: List of images with alt text, dimensions, and type classification
+- **`structured_data`**: Parsed JSON-LD, microdata, and RDFa data
+- **`meta_tags`**: All meta tags as a map
+
+## Migration Guide
+
+### Rust
+
+=== "Before (v3.x)"
+
+    ```rust
+    use kreuzberg::{extract_file_sync, ExtractionConfig};
+
+    let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
+    if let Some(html_meta) = result.metadata.html {
+        // Keywords as single string
+        if let Some(keywords) = html_meta.keywords {
+            let keyword_vec: Vec<&str> = keywords.split(',').map(|s| s.trim()).collect();
+            println!("Keywords: {:?}", keyword_vec);
+        }
+
+        // Canonical as separate field
+        if let Some(canonical) = html_meta.canonical {
+            println!("Canonical: {}", canonical);
+        }
+
+        // Open Graph as individual fields
+        if let Some(og_title) = html_meta.og_title {
+            println!("OG Title: {}", og_title);
+        }
+        if let Some(og_image) = html_meta.og_image {
+            println!("OG Image: {}", og_image);
+        }
+
+        // Twitter as individual fields
+        if let Some(twitter_card) = html_meta.twitter_card {
+            println!("Twitter Card: {}", twitter_card);
+        }
+    }
+    ```
+
+=== "After (v4.0)"
+
+    ```rust
+    use kreuzberg::{extract_file_sync, ExtractionConfig};
+
+    let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
+    if let Some(html_meta) = result.metadata.html {
+        // Keywords as array
+        if !html_meta.keywords.is_empty() {
+            println!("Keywords: {:?}", html_meta.keywords);
+        }
+
+        // Canonical renamed
+        if let Some(canonical_url) = html_meta.canonical_url {
+            println!("Canonical URL: {}", canonical_url);
+        }
+
+        // Open Graph from map
+        if let Some(og_title) = html_meta.open_graph.get("title") {
+            println!("OG Title: {}", og_title);
+        }
+        if let Some(og_image) = html_meta.open_graph.get("image") {
+            println!("OG Image: {}", og_image);
+        }
+
+        // Twitter from map
+        if let Some(twitter_card) = html_meta.twitter_card.get("card") {
+            println!("Twitter Card: {}", twitter_card);
+        }
+
+        // New fields
+        if let Some(lang) = html_meta.language {
+            println!("Language: {}", lang);
+        }
+        // `headers` and `links` are plain `Vec`s of structured entries, not `Option`s
+        for header in &html_meta.headers {
+            println!("Header (h{}): {}", header.level, header.text);
+        }
+        for link in &html_meta.links {
+            println!("Link: {} ({})", link.href, link.text);
+        }
+    }
+    ```
+
+### Python
+
+=== "Before (v3.x)"
+
+    ```python
+    from kreuzberg import extract_file_sync, ExtractionConfig
+
+    result = extract_file_sync("page.html", config=ExtractionConfig())
+    html_meta = result.metadata.get("html", {})
+
+    # Keywords as single string
+    if html_meta.get('keywords'):
+        keyword_list = html_meta['keywords'].split(',')
+        print(f"Keywords: {keyword_list}")
+
+    # Canonical as separate field
+    if html_meta.get('canonical'):
+        print(f"Canonical: {html_meta['canonical']}")
+
+    # Open Graph as individual fields
+    if html_meta.get('og_title'):
+        print(f"OG Title: {html_meta['og_title']}")
+    if html_meta.get('og_image'):
+        print(f"OG Image: {html_meta['og_image']}")
+
+    # Twitter as individual fields
+    if html_meta.get('twitter_card'):
+        print(f"Twitter Card: {html_meta['twitter_card']}")
+    ```
+
+=== "After (v4.0)"
+
+    ```python
+    from kreuzberg import extract_file_sync, ExtractionConfig
+
+    result = extract_file_sync("page.html", config=ExtractionConfig())
+    html_meta = result.metadata.get("html", {})
+
+    # Keywords as array
+    if html_meta.get('keywords'):
+        print(f"Keywords: {html_meta['keywords']}")
+
+    # Canonical renamed
+    if html_meta.get('canonical_url'):
+        print(f"Canonical URL: {html_meta['canonical_url']}")
+
+    # Open Graph from map
+    open_graph = html_meta.get('open_graph', {})
+    if open_graph.get('title'):
+        print(f"OG Title: {open_graph['title']}")
+    if open_graph.get('image'):
+        print(f"OG Image: {open_graph['image']}")
+
+    # Twitter from map
+    twitter_card = html_meta.get('twitter_card', {})
+    if twitter_card.get('card'):
+        print(f"Twitter Card: {twitter_card['card']}")
+
+    # New fields
+    if html_meta.get('language'):
+        print(f"Language: {html_meta['language']}")
+
+    if html_meta.get('headers'):
+        print(f"Headers: {html_meta['headers']}")
+
+    if html_meta.get('links'):
+        for link in html_meta['links']:
+            print(f"Link: {link['href']} ({link['text']})")
+    ```
+
+### TypeScript
+
+=== "Before (v3.x)"
+
+    ```typescript
+    import { extractFileSync } from '@kreuzberg/node';
+
+    const result = extractFileSync('page.html');
+    const htmlMeta = result.metadata;
+
+    // Keywords as single string
+    if (htmlMeta.keywords) {
+        const keywordArray = htmlMeta.keywords.split(',');
+        console.log('Keywords:', keywordArray);
+    }
+
+    // Canonical as separate field
+    if (htmlMeta.canonical) {
+        console.log('Canonical:', htmlMeta.canonical);
+    }
+
+    // Open Graph as individual fields
+    if (htmlMeta.ogTitle) {
+        console.log('OG Title:', htmlMeta.ogTitle);
+    }
+    if (htmlMeta.ogImage) {
+        console.log('OG Image:', htmlMeta.ogImage);
+    }
+
+    // Twitter as individual fields
+    if (htmlMeta.twitterCard) {
+        console.log('Twitter Card:', htmlMeta.twitterCard);
+    }
+    ```
+
+=== "After (v4.0)"
+
+    ```typescript
+    import { extractFileSync } from '@kreuzberg/node';
+
+    const result = extractFileSync('page.html');
+    const htmlMeta = result.metadata;
+
+    // Keywords as array
+    if (htmlMeta.keywords?.length > 0) {
+        console.log('Keywords:', htmlMeta.keywords);
+    }
+
+    // Canonical renamed
+    if (htmlMeta.canonicalUrl) {
+        console.log('Canonical URL:', htmlMeta.canonicalUrl);
+    }
+
+    // Open Graph from map
+    if (htmlMeta.openGraph) {
+        if (htmlMeta.openGraph['title']) {
+            console.log('OG Title:', htmlMeta.openGraph['title']);
+        }
+        if (htmlMeta.openGraph['image']) {
+            console.log('OG Image:', htmlMeta.openGraph['image']);
+        }
+    }
+
+    // Twitter from map
+    if (htmlMeta.twitterCard) {
+        if (htmlMeta.twitterCard['card']) {
+            console.log('Twitter Card:', htmlMeta.twitterCard['card']);
+        }
+    }
+
+    // New fields
+    if (htmlMeta.language) {
+        console.log('Language:', htmlMeta.language);
+    }
+
+    if (htmlMeta.headers?.length > 0) {
+        console.log('Headers:', htmlMeta.headers);
+    }
+
+    if (htmlMeta.links?.length > 0) {
+        htmlMeta.links.forEach((link) => {
+            console.log(`Link: ${link.href} (${link.text})`);
+        });
+    }
+    ```
+
+### Java
+
+=== "Before (v3.x)"
+
+    ```java
+    import dev.kreuzberg.Kreuzberg;
+    import dev.kreuzberg.ExtractionResult;
+    import java.util.Arrays;
+    import java.util.Map;
+
+    ExtractionResult result = Kreuzberg.extractFileSync("page.html");
+    Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");
+
+    // Keywords as single string
+    String keywords = (String) htmlMeta.get("keywords");
+    if (keywords != null) {
+        String[] keywordArray = keywords.split(",");
+        System.out.println("Keywords: " + Arrays.toString(keywordArray));
+    }
+
+    // Canonical as separate field
+    String canonical = (String) htmlMeta.get("canonical");
+    if (canonical != null) {
+        System.out.println("Canonical: " + canonical);
+    }
+
+    // Open Graph as individual fields
+    String ogTitle = (String) htmlMeta.get("og_title");
+    if (ogTitle != null) {
+        System.out.println("OG Title: " + ogTitle);
+    }
+
+    // Twitter as individual fields
+    String twitterCard = (String) htmlMeta.get("twitter_card");
+    if (twitterCard != null) {
+        System.out.println("Twitter Card: " + twitterCard);
+    }
+    ```
+
+=== "After (v4.0)"
+
+    ```java
+    import dev.kreuzberg.Kreuzberg;
+    import dev.kreuzberg.ExtractionResult;
+    import java.util.Map;
+    import java.util.List;
+
+    ExtractionResult result = Kreuzberg.extractFileSync("page.html");
+    Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");
+
+    // Keywords as array
+    @SuppressWarnings("unchecked")
+    List<String> keywords = (List<String>) htmlMeta.get("keywords");
+    if (keywords != null && !keywords.isEmpty()) {
+        System.out.println("Keywords: " + keywords);
+    }
+
+    // Canonical renamed
+    String canonicalUrl = (String) htmlMeta.get("canonical_url");
+    if (canonicalUrl != null) {
+        System.out.println("Canonical URL: " + canonicalUrl);
+    }
+
+    // Open Graph from map
+    @SuppressWarnings("unchecked")
+    Map<String, String> openGraph = (Map<String, String>) htmlMeta.get("open_graph");
+    if (openGraph != null) {
+        String ogTitle = openGraph.get("title");
+        if (ogTitle != null) {
+            System.out.println("OG Title: " + ogTitle);
+        }
+    }
+
+    // Twitter from map
+    @SuppressWarnings("unchecked")
+    Map<String, String> twitterCard = (Map<String, String>) htmlMeta.get("twitter_card");
+    if (twitterCard != null) {
+        String card = twitterCard.get("card");
+        if (card != null) {
+            System.out.println("Twitter Card: " + card);
+        }
+    }
+
+    // New fields
+    String language = (String) htmlMeta.get("language");
+    if (language != null) {
+        System.out.println("Language: " + language);
+    }
+
+    @SuppressWarnings("unchecked")
+    List<Map<String, Object>> headers = (List<Map<String, Object>>) htmlMeta.get("headers");
+    if (headers != null && !headers.isEmpty()) {
+        for (Map<String, Object> header : headers) {
+            System.out.println("Header: " + header.get("text"));
+        }
+    }
+    ```
+
+### Go
+
+=== "Before (v3.x)"
+
+    ```go
+    package main
+
+    import (
+        "fmt"
+        "log"
+        "strings"
+        "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
+    )
+
+    func main() {
+        result, err := kreuzberg.ExtractFileSync("page.html", nil)
+        if err != nil {
+            log.Fatalf("extract: %v", err)
+        }
+
+        if html, ok := result.Metadata.HTMLMetadata(); ok {
+            // Keywords as single string
+            if html.Keywords != nil {
+                keywordSlice := strings.Split(*html.Keywords, ",")
+                fmt.Println("Keywords:", keywordSlice)
+            }
+
+            // Canonical as separate field
+            if html.Canonical != nil {
+                fmt.Println("Canonical:", *html.Canonical)
+            }
+
+            // Open Graph as individual fields
+            if html.OGTitle != nil {
+                fmt.Println("OG Title:", *html.OGTitle)
+            }
+            if html.OGImage != nil {
+                fmt.Println("OG Image:", *html.OGImage)
+            }
+
+            // Twitter as individual fields
+            if html.TwitterCard != nil {
+                fmt.Println("Twitter Card:", *html.TwitterCard)
+            }
+        }
+    }
+    ```
+
+=== "After (v4.0)"
+
+    ```go
+    package main
+
+    import (
+        "fmt"
+        "log"
+        "strings"
+        "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
+    )
+
+    func main() {
+        result, err := kreuzberg.ExtractFileSync("page.html", nil)
+        if err != nil {
+            log.Fatalf("extract: %v", err)
+        }
+
+        if html, ok := result.Metadata.HTMLMetadata(); ok {
+            // Keywords as array
+            if len(html.Keywords) > 0 {
+                fmt.Println("Keywords:", strings.Join(html.Keywords, ", "))
+            }
+
+            // Canonical renamed
+            if html.CanonicalURL != nil {
+                fmt.Println("Canonical URL:", *html.CanonicalURL)
+            }
+
+            // Open Graph from map
+            if len(html.OpenGraph) > 0 {
+                if ogTitle, ok := html.OpenGraph["title"]; ok {
+                    fmt.Println("OG Title:", ogTitle)
+                }
+                if ogImage, ok := html.OpenGraph["image"]; ok {
+                    fmt.Println("OG Image:", ogImage)
+                }
+            }
+
+            // Twitter from map
+            if len(html.TwitterCard) > 0 {
+                if card, ok := html.TwitterCard["card"]; ok {
+                    fmt.Println("Twitter Card:", card)
+                }
+            }
+
+            // New fields
+            if html.Language != nil {
+                fmt.Println("Language:", *html.Language)
+            }
+
+            // Headers and links are slices of structs, not strings or pairs
+            for _, header := range html.Headers {
+                fmt.Printf("Header (h%d): %s\n", header.Level, header.Text)
+            }
+
+            for _, link := range html.Links {
+                fmt.Printf("Link: %s (%s)\n", link.Href, link.Text)
+            }
+        }
+    }
+    ```
+
+### Ruby
+
+=== "Before (v3.x)"
+
+    ```ruby
+    require 'kreuzberg'
+
+    result = Kreuzberg.extract_file_sync('page.html')
+    html_meta = result.metadata['html']
+
+    # Keywords as single string
+    if html_meta['keywords']
+        keyword_array = html_meta['keywords'].split(',').map(&:strip)
+        puts "Keywords: #{keyword_array}"
+    end
+
+    # Canonical as separate field
+    if html_meta['canonical']
+        puts "Canonical: #{html_meta['canonical']}"
+    end
+
+    # Open Graph as individual fields
+    if html_meta['og_title']
+        puts "OG Title: #{html_meta['og_title']}"
+    end
+    if html_meta['og_image']
+        puts "OG Image: #{html_meta['og_image']}"
+    end
+
+    # Twitter as individual fields
+    if html_meta['twitter_card']
+        puts "Twitter Card: #{html_meta['twitter_card']}"
+    end
+    ```
+
+=== "After (v4.0)"
+
+    ```ruby
+    require 'kreuzberg'
+
+    result = Kreuzberg.extract_file_sync('page.html')
+    html_meta = result.metadata['html']
+
+    # Keywords as array
+    if html_meta['keywords'] && !html_meta['keywords'].empty?
+        puts "Keywords: #{html_meta['keywords']}"
+    end
+
+    # Canonical renamed
+    if html_meta['canonical_url']
+        puts "Canonical URL: #{html_meta['canonical_url']}"
+    end
+
+    # Open Graph from map
+    open_graph = html_meta['open_graph'] || {}
+    if open_graph['title']
+        puts "OG Title: #{open_graph['title']}"
+    end
+    if open_graph['image']
+        puts "OG Image: #{open_graph['image']}"
+    end
+
+    # Twitter from map
+    twitter_card = html_meta['twitter_card'] || {}
+    if twitter_card['card']
+        puts "Twitter Card: #{twitter_card['card']}"
+    end
+
+    # New fields
+    if html_meta['language']
+        puts "Language: #{html_meta['language']}"
+    end
+
+    if html_meta['headers'] && !html_meta['headers'].empty?
+        header_texts = html_meta['headers'].map { |header| header['text'] }
+        puts "Headers: #{header_texts.join(', ')}"
+    end
+
+    if html_meta['links'] && !html_meta['links'].empty?
+        html_meta['links'].each do |link|
+            puts "Link: #{link['href']} (#{link['text']})"
+        end
+    end
+    ```
+
+## API Reference
+
+For complete details on all HTML metadata fields and types, see:
+
+- [HTML Metadata Type Reference](../reference/types.md#htmlmetadata)
+
+## Structured Types Reference
+
+### HeaderMetadata
+
+Header elements extracted from the HTML document with hierarchy information.
+
+```rust title="HeaderMetadata Struct Definition"
+pub struct HeaderMetadata {
+    pub level: u8,                    // 1-6 (h1-h6)
+    pub text: String,                 // Normalized text content
+    pub id: Option<String>,           // HTML id attribute
+    pub depth: usize,                 // Document tree depth
+    pub html_offset: usize,           // Byte offset in original HTML
+}
+```
+
+**Example:**
+
+```json title="HeaderMetadata JSON Example"
+{
+  "level": 1,
+  "text": "Welcome to Our Site",
+  "id": "welcome-section",
+  "depth": 2,
+  "html_offset": 512
+}
+```
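
As an illustration of how the `level` field can be used, the following sketch renders a list of headers as an indented outline. The struct is a local copy of the definition above so that the example compiles on its own, and `render_outline` is a hypothetical helper rather than a Kreuzberg function.

```rust
/// Local copy of the `HeaderMetadata` definition shown above.
pub struct HeaderMetadata {
    pub level: u8,
    pub text: String,
    pub id: Option<String>,
    pub depth: usize,
    pub html_offset: usize,
}

/// Renders headers as a bullet outline, indenting two spaces per level below h1.
pub fn render_outline(headers: &[HeaderMetadata]) -> String {
    let mut out = String::new();
    for header in headers {
        // Clamp to 1..=6 so an unexpected level cannot underflow the subtraction.
        let level = header.level.clamp(1, 6) as usize;
        out.push_str(&"  ".repeat(level - 1));
        out.push_str("- ");
        out.push_str(&header.text);
        // Append the anchor id when the heading has one.
        if let Some(id) = &header.id {
            out.push_str(" (#");
            out.push_str(id);
            out.push(')');
        }
        out.push('\n');
    }
    out
}
```

The outline relies only on `level`, `text`, and `id`; `depth` and `html_offset` are carried along unchanged.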
+
+### LinkMetadata
+
+Link elements with type classification and detailed attributes.
+
+```rust title="LinkMetadata Struct and LinkType Enum"
+pub struct LinkMetadata {
+    pub href: String,                        // The href URL value
+    pub text: String,                        // Link text content
+    pub title: Option<String>,               // Title attribute
+    pub link_type: LinkType,                 // Classification enum
+    pub rel: Vec<String>,                    // Rel attribute values
+    pub attributes: HashMap<String, String>, // Additional attributes
+}
+
+pub enum LinkType {
+    Anchor,    // #section anchors
+    Internal,  // Same domain links
+    External,  // Different domain links
+    Email,     // mailto: links
+    Phone,     // tel: links
+    Other,     // Other link types
+}
+```
+
+**Example:**
+
+```json title="LinkMetadata JSON Example"
+{
+  "href": "https://example.com",
+  "text": "Visit Example",
+  "title": "Example Website",
+  "link_type": "external",
+  "rel": ["nofollow"],
+  "attributes": {
+    "data-tracking": "yes"
+  }
+}
+```
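
The `link_type` classification makes it straightforward to summarize a page's links. The sketch below counts links per type. It uses a trimmed local stand-in for `LinkMetadata`, and it assumes `LinkType` can derive `Copy`, `Eq`, and `Hash`, which the definition above does not state. `count_by_type` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Local stand-in for `LinkType`; the derives are an assumption made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum LinkType {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
    Other,
}

/// Local stand-in for `LinkMetadata`, trimmed to the fields used here.
pub struct LinkMetadata {
    pub href: String,
    pub text: String,
    pub link_type: LinkType,
}

/// Counts how many links fall into each `LinkType`.
pub fn count_by_type(links: &[LinkMetadata]) -> HashMap<LinkType, usize> {
    let mut counts = HashMap::new();
    for link in links {
        *counts.entry(link.link_type).or_insert(0) += 1;
    }
    counts
}
```

Types that never occur are simply absent from the map, so callers should use `get` rather than indexing.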
+
+### ImageMetadataType
+
+Image elements with type classification and dimensions.
+
+```rust title="ImageMetadataType Struct and ImageType Enum"
+pub struct ImageMetadataType {
+    pub src: String,                         // Image source (URL, data URI, or SVG)
+    pub alt: Option<String>,                 // Alt text
+    pub title: Option<String>,               // Title attribute
+    pub dimensions: Option<(u32, u32)>,      // Width x Height
+    pub image_type: ImageType,               // Classification enum
+    pub attributes: HashMap<String, String>, // Additional attributes
+}
+
+pub enum ImageType {
+    DataUri,    // data: URI
+    InlineSvg,  // Inline <svg> content
+    External,   // External URL
+    Relative,   // Relative path
+}
+```
+
+**Example:**
+
+```json title="ImageMetadataType JSON Example"
+{
+  "src": "https://cdn.example.com/image.jpg",
+  "alt": "Product photo",
+  "title": "Featured product",
+  "dimensions": [400, 300],
+  "image_type": "external",
+  "attributes": {
+    "loading": "lazy"
+  }
+}
+```
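
One possible use of the `alt` field is a quick accessibility check. The following sketch lists images whose alt text is missing or blank. The struct is a trimmed local stand-in for `ImageMetadataType`, and `images_missing_alt` is a hypothetical helper.

```rust
/// Local stand-in for `ImageMetadataType`, trimmed to the fields used here.
pub struct ImageMetadataType {
    pub src: String,
    pub alt: Option<String>,
    pub dimensions: Option<(u32, u32)>,
}

/// Returns the `src` of every image whose alt text is absent or only whitespace.
pub fn images_missing_alt(images: &[ImageMetadataType]) -> Vec<&str> {
    images
        .iter()
        .filter(|img| img.alt.as_deref().map_or(true, |alt| alt.trim().is_empty()))
        .map(|img| img.src.as_str())
        .collect()
}
```

Treating a whitespace-only `alt` as missing is a choice made for this sketch. Decorative images legitimately use an empty `alt`, so a real check may want to separate the two cases.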
+
+### StructuredData
+
+Extracted structured data blocks (JSON-LD, microdata, RDFa).
+
+```rust title="StructuredData Struct and StructuredDataType Enum"
+pub struct StructuredData {
+    pub data_type: StructuredDataType,  // Classification enum
+    pub raw_json: String,               // Raw JSON string
+    pub schema_type: Option<String>,    // Schema type (e.g., "Article")
+}
+
+pub enum StructuredDataType {
+    JsonLd,    // JSON-LD
+    Microdata, // Microdata
+    RDFa,      // RDFa
+}
+```
+
+**Example:**
+
+```json title="StructuredData JSON Example"
+{
+  "data_type": "json-ld",
+  "raw_json": "{\"@context\": \"https://schema.org\", \"@type\": \"Article\", ...}",
+  "schema_type": "Article"
+}
+```
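
Because `raw_json` is an unparsed string, callers will typically select the blocks they care about first and parse them afterwards with a JSON library of their choice. The sketch below shows only the selection step, using local copies of the types above. `raw_json_for_schema` is a hypothetical helper.

```rust
/// Local copy of the `StructuredDataType` enum shown above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StructuredDataType {
    JsonLd,
    Microdata,
    RDFa,
}

/// Local copy of the `StructuredData` struct shown above.
pub struct StructuredData {
    pub data_type: StructuredDataType,
    pub raw_json: String,
    pub schema_type: Option<String>,
}

/// Returns the raw JSON of every block whose `schema_type` equals `schema`.
pub fn raw_json_for_schema<'a>(blocks: &'a [StructuredData], schema: &str) -> Vec<&'a str> {
    blocks
        .iter()
        .filter(|block| block.schema_type.as_deref() == Some(schema))
        .map(|block| block.raw_json.as_str())
        .collect()
}
```

Blocks without a `schema_type` never match, so they need separate handling if they matter to the caller.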
+
+## Summary of Changes
+
+| Field                                           | v3.x                               | v4.0                                                |
+| ----------------------------------------------- | ---------------------------------- | --------------------------------------------------- |
+| `keywords`                                      | `Option<String>`                   | `Vec<String>` with `#[serde(default)]`              |
+| `canonical`                                     | `Option<String>`                   | Renamed to `canonical_url`                          |
+| `og_*` fields (7 fields)                        | Individual `Option<String>` fields | `open_graph: BTreeMap<String, String>`              |
+| `twitter_*` fields (6 fields)                   | Individual `Option<String>` fields | `twitter_card: BTreeMap<String, String>`            |
+| `link_author`, `link_license`, `link_alternate` | Individual fields                  | Removed (use `links` field)                         |
+| New: `language`                                 | N/A                                | `Option<String>`                                    |
+| New: `text_direction`                           | N/A                                | `Option<TextDirection>`                             |
+| New: `headers`                                  | N/A                                | `Vec<HeaderMetadata>` with `#[serde(default)]`      |
+| New: `links`                                    | N/A                                | `Vec<LinkMetadata>` with `#[serde(default)]`        |
+| New: `images`                                   | N/A                                | `Vec<ImageMetadataType>` with `#[serde(default)]`   |
+| New: `structured_data`                          | N/A                                | `Vec<StructuredData>` with `#[serde(default)]`      |
+| New: `meta_tags`                                | N/A                                | `BTreeMap<String, String>` with `#[serde(default)]` |
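
If metadata stored under v3.x has to coexist with v4.0 output, a small conversion shim for `keywords` may help. This is a sketch under the assumption that legacy values are comma-separated, as in the "Before" examples above. Both function names are hypothetical.

```rust
/// Splits a v3.x-style comma-separated keywords string into the v4.0 list form,
/// trimming whitespace and dropping empty entries.
pub fn keywords_from_legacy(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|keyword| !keyword.is_empty())
        .map(String::from)
        .collect()
}

/// Joins a v4.0 keywords list back into the v3.x string form.
/// An empty list maps to `None`, mirroring the old `Option<String>` field.
pub fn keywords_to_legacy(keywords: &[String]) -> Option<String> {
    if keywords.is_empty() {
        None
    } else {
        Some(keywords.join(", "))
    }
}
```

The round trip is lossy for keywords that themselves contain commas, which the old string representation could not express unambiguously either.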
+
+## Questions?
+
+- See the [Types Reference](../reference/types.md) for complete API details
+- Check [Working with Metadata](../getting-started/quickstart.md#read-document-metadata) for examples
+- Open an issue on [GitHub](https://github.com/kreuzberg-dev/kreuzberg/issues)
--- a/docs/migration/v5.0-image-indices.md
+++ b/docs/migration/v5.0-image-indices.md
@@ -0,0 +1,52 @@
+# Image Index References (v5.0)
+
+## Summary
+
+`PageContent.images: Vec<Arc<ExtractedImage>>` is removed. Pages now carry `image_indices: Vec<u32>` — zero-based indices into `ExtractionResult.images`.
+
+## Breaking Change
+
+**Previous behavior** (v4.x):
+
+```rust
+let result = extractor.extract(path, &config).await?;
+for page in result.pages.unwrap_or_default() {
+    for image in &page.images {
+        println!("{:?}", image.data);
+    }
+}
+```
+
+**New behavior** (v5.0):
+
+```rust
+let result = extractor.extract(path, &config).await?;
+let images = result.images.as_deref().unwrap_or(&[]);
+for page in result.pages.unwrap_or_default() {
+    for &idx in &page.image_indices {
+        println!("{:?}", images[idx as usize].data);
+    }
+}
+```
+
+`ChunkMetadata` gains the same `image_indices: Vec<u32>` field, populated post-chunking by matching each image's `page_number` against `[first_page, last_page]`.
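
The page-range matching described above can be sketched as follows. This is not the library's implementation. It assumes each image's `page_number` and the chunk's `first_page`/`last_page` are `Option<u32>` values and that the range is inclusive. `image_indices_for_range` and its `image_pages` parameter are hypothetical names.

```rust
/// Returns the zero-based indices of all images whose page number lies in the
/// inclusive range `[first_page, last_page]`.
///
/// `image_pages[i]` is the page number of image `i`, or `None` if unknown.
pub fn image_indices_for_range(
    image_pages: &[Option<u32>],
    first_page: Option<u32>,
    last_page: Option<u32>,
) -> Vec<u32> {
    // Without page provenance on the chunk there is nothing to match against.
    let (Some(first), Some(last)) = (first_page, last_page) else {
        return Vec::new();
    };
    image_pages
        .iter()
        .enumerate()
        .filter_map(|(idx, page)| match page {
            Some(p) if (first..=last).contains(p) => Some(idx as u32),
            _ => None,
        })
        .collect()
}
```

In this sketch, a chunk with no page range or an image with no page number yields no match, so neither case can panic.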
+
+## Impact
+
+**Who is affected?**
+
+- Users reading `page.images` directly
+- Users passing `PageContent` values across FFI boundaries
+- All polyglot bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir) — bindings are regenerated automatically
+
+**What changes?**
+
+| Before                   | After                                                                  |
+| ------------------------ | ---------------------------------------------------------------------- |
+| `page.images[i].data`    | `result.images.as_ref().unwrap()[page.image_indices[i] as usize].data` |
+| `page.images.len()`      | `page.image_indices.len()`                                             |
+| `page.images.is_empty()` | `page.image_indices.is_empty()`                                        |
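
Direct indexing panics if an index is out of range. Where that matters, a small lookup helper can resolve a page's indices defensively. The sketch below uses trimmed local stand-ins for `PageContent` and `ExtractedImage`, and `images_for_page` is a hypothetical helper, not part of the Kreuzberg API.

```rust
/// Local stand-in for `ExtractedImage`, trimmed to the field used here.
pub struct ExtractedImage {
    pub data: Vec<u8>,
}

/// Local stand-in for `PageContent`, trimmed to the field used here.
pub struct PageContent {
    pub image_indices: Vec<u32>,
}

/// Resolves a page's image indices against the result-level image list,
/// skipping any index that is out of range instead of panicking.
pub fn images_for_page<'a>(
    page: &PageContent,
    images: &'a [ExtractedImage],
) -> Vec<&'a ExtractedImage> {
    page.image_indices
        .iter()
        .filter_map(|&idx| images.get(idx as usize))
        .collect()
}
```

With the real types, a call might look like `images_for_page(&page, result.images.as_deref().unwrap_or(&[]))`, which mirrors the "New behavior" example above.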
+
+## Known Limitation
+
+`YamlSectionChunker` does not track page provenance (`first_page`/`last_page` are always `None`), so its chunks always produce empty `image_indices`. Tracked in a separate issue.