Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00


HTML Metadata Structure Changes (v4.0)

Summary

HTML metadata has been restructured for better organization and type safety. The changes consolidate individual Open Graph and Twitter Card fields into maps, and convert keywords from a single string to an array.

Breaking Changes

1. Keywords: String to Array

Before (v3.x):

// Option<String> - comma-separated or space-separated
html_meta.keywords  // "seo, metadata, html"

After (v4.0):

// Vec<String> - structured array
html_meta.keywords  // vec!["seo", "metadata", "html"]
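If downstream code still holds v3.x keyword strings (for example in a cache or database), a small helper can normalize them to the v4.0 shape. The following is only a sketch: `split_legacy_keywords` is a hypothetical name, not part of kreuzberg, and it assumes commas take precedence, with whitespace used as the separator only when the string contains no comma.

```rust
/// Hypothetical helper (not part of kreuzberg): converts a v3.x keyword
/// string into the v4.0 `Vec<String>` shape.
fn split_legacy_keywords(raw: &str) -> Vec<String> {
    // Prefer commas; fall back to whitespace when the string has none.
    let parts: Vec<&str> = if raw.contains(',') {
        raw.split(',').collect()
    } else {
        raw.split_whitespace().collect()
    };

    parts
        .into_iter()
        .map(str::trim)
        .filter(|part| !part.is_empty()) // drop empty entries such as ", ,"
        .map(String::from)
        .collect()
}
```

Calling `split_legacy_keywords("seo, metadata, html")` yields the same three entries shown in the v4.0 example above.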

2. Canonical URL: Field Rename

Before (v3.x):

html_meta.canonical  // Option<String>

After (v4.0):

html_meta.canonical_url  // Option<String>

3. Open Graph: Individual Fields to Map

Before (v3.x):

html_meta.og_title          // Option<String>
html_meta.og_description    // Option<String>
html_meta.og_image          // Option<String>
html_meta.og_url            // Option<String>
html_meta.og_type           // Option<String>
html_meta.og_site_name      // Option<String>

After (v4.0):

html_meta.open_graph        // BTreeMap<String, String>
html_meta.open_graph.get("title")         // Option<&String>
html_meta.open_graph.get("description")   // Option<&String>
html_meta.open_graph.get("image")         // Option<&String>
html_meta.open_graph.get("url")           // Option<&String>
html_meta.open_graph.get("type")          // Option<&String>
html_meta.open_graph.get("site_name")     // Option<&String>
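During a gradual migration it may help to keep accepting the old field names at call sites. A minimal sketch of such a shim follows; `og_legacy` is a hypothetical helper, and it assumes the map keys are exactly the v3.x names with the `og_` prefix removed, as in the listing above.

```rust
use std::collections::BTreeMap;

/// Hypothetical shim (not part of kreuzberg): looks up a v3.x-style name
/// such as "og_title" in the v4.0 `open_graph` map.
fn og_legacy<'a>(open_graph: &'a BTreeMap<String, String>, legacy_name: &str) -> Option<&'a str> {
    // Assumption: v4.0 keys are the legacy names without the "og_" prefix.
    let key = legacy_name.strip_prefix("og_").unwrap_or(legacy_name);
    open_graph.get(key).map(String::as_str)
}
```

With this in place, `og_legacy(&html_meta.open_graph, "og_title")` and `html_meta.open_graph.get("title")` refer to the same entry.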

4. Twitter Card: Individual Fields to Map

Before (v3.x):

html_meta.twitter_card          // Option<String>
html_meta.twitter_title         // Option<String>
html_meta.twitter_description   // Option<String>
html_meta.twitter_image         // Option<String>
html_meta.twitter_site          // Option<String>
html_meta.twitter_creator       // Option<String>

After (v4.0):

html_meta.twitter_card          // BTreeMap<String, String>
html_meta.twitter_card.get("card")          // Option<&String>
html_meta.twitter_card.get("title")         // Option<&String>
html_meta.twitter_card.get("description")   // Option<&String>
html_meta.twitter_card.get("image")         // Option<&String>
html_meta.twitter_card.get("site")          // Option<&String>
html_meta.twitter_card.get("creator")       // Option<&String>
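Code that serializes metadata into a flat key/value form (logs, CSV columns) can rebuild the v3.x-style names from the map. This is a sketch under the assumption that prefixing each key with `twitter_` reproduces the old field names; `flatten_with_prefix` is a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Hypothetical adapter (not part of kreuzberg): rebuilds flat names such as
/// "twitter_card" and "twitter_title" from a v4.0 metadata map.
fn flatten_with_prefix(prefix: &str, map: &BTreeMap<String, String>) -> Vec<(String, String)> {
    // A BTreeMap iterates in sorted key order, so the output order is stable.
    map.iter()
        .map(|(key, value)| (format!("{}_{}", prefix, key), value.clone()))
        .collect()
}
```

The same function works for `open_graph` with the prefix `"og"`.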

5. Removed Fields

The following link-related fields have been removed:

  • link_author
  • link_license
  • link_alternate

Use the new links field instead for comprehensive link extraction.
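One possible way to recover the information the removed fields carried is to filter `links` by their `rel` values. The sketch below uses a local `Link` struct as a stand-in for the `href` and `rel` fields of `LinkMetadata`; whether author, license, and alternate links actually appear with those `rel` values in a given document is an assumption to verify against real output.

```rust
/// Minimal stand-in for the `href` and `rel` fields of `LinkMetadata`.
struct Link {
    href: String,
    rel: Vec<String>,
}

/// Hypothetical helper: returns the hrefs of all links whose `rel` contains
/// `wanted` (compared case-insensitively), e.g. "author" or "license".
fn hrefs_with_rel<'a>(links: &'a [Link], wanted: &str) -> Vec<&'a str> {
    links
        .iter()
        .filter(|link| link.rel.iter().any(|rel| rel.eq_ignore_ascii_case(wanted)))
        .map(|link| link.href.as_str())
        .collect()
}
```

Unlike the old single-value fields, this returns every matching link, so callers that expect one value need to decide which to keep.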

6. New Fields

HTML metadata now includes the following fields describing page content:

  • language: Document language (for example, "en", "fr")
  • text_direction: Text direction ("ltr", "rtl")
  • headers: List of page headers/headings with structured metadata
  • links: List of links with detailed metadata and type classification
  • images: List of images with alt text, dimensions, and type classification
  • structured_data: Parsed JSON-LD, microdata, and RDFa data
  • meta_tags: All meta tags as a map

Migration Guide

Rust

=== "Before (v3.x)"

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
if let Some(html_meta) = result.metadata.html {
    // Keywords as single string
    if let Some(keywords) = html_meta.keywords {
        let keyword_vec: Vec<&str> = keywords.split(',').map(|s| s.trim()).collect();
        println!("Keywords: {:?}", keyword_vec);
    }

    // Canonical as separate field
    if let Some(canonical) = html_meta.canonical {
        println!("Canonical: {}", canonical);
    }

    // Open Graph as individual fields
    if let Some(og_title) = html_meta.og_title {
        println!("OG Title: {}", og_title);
    }
    if let Some(og_image) = html_meta.og_image {
        println!("OG Image: {}", og_image);
    }

    // Twitter as individual fields
    if let Some(twitter_card) = html_meta.twitter_card {
        println!("Twitter Card: {}", twitter_card);
    }
}
```

=== "After (v4.0)"

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
if let Some(html_meta) = result.metadata.html {
    // Keywords as array
    if !html_meta.keywords.is_empty() {
        println!("Keywords: {:?}", html_meta.keywords);
    }

    // Canonical renamed
    if let Some(canonical_url) = html_meta.canonical_url {
        println!("Canonical URL: {}", canonical_url);
    }

    // Open Graph from map
    if let Some(og_title) = html_meta.open_graph.get("title") {
        println!("OG Title: {}", og_title);
    }
    if let Some(og_image) = html_meta.open_graph.get("image") {
        println!("OG Image: {}", og_image);
    }

    // Twitter from map
    if let Some(twitter_card) = html_meta.twitter_card.get("card") {
        println!("Twitter Card: {}", twitter_card);
    }

    // New fields
    if let Some(lang) = html_meta.language {
        println!("Language: {}", lang);
    }
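    // headers and links are plain Vecs (empty when nothing was found)
    if !html_meta.headers.is_empty() {
        println!("Headers: {:?}", html_meta.headers);
    }
    // Each link is a LinkMetadata struct, not a (url, text) tuple
    for link in &html_meta.links {
        println!("Link: {} ({})", link.href, link.text);
    }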
}
```

Python

=== "Before (v3.x)"

```python
from kreuzberg import extract_file_sync, ExtractionConfig

result = extract_file_sync("page.html", config=ExtractionConfig())
html_meta = result.metadata.get("html", {})

# Keywords as single string
if html_meta.get('keywords'):
    keyword_list = html_meta['keywords'].split(',')
    print(f"Keywords: {keyword_list}")

# Canonical as separate field
if html_meta.get('canonical'):
    print(f"Canonical: {html_meta['canonical']}")

# Open Graph as individual fields
if html_meta.get('og_title'):
    print(f"OG Title: {html_meta['og_title']}")
if html_meta.get('og_image'):
    print(f"OG Image: {html_meta['og_image']}")

# Twitter as individual fields
if html_meta.get('twitter_card'):
    print(f"Twitter Card: {html_meta['twitter_card']}")
```

=== "After (v4.0)"

```python
from kreuzberg import extract_file_sync, ExtractionConfig

result = extract_file_sync("page.html", config=ExtractionConfig())
html_meta = result.metadata.get("html", {})

# Keywords as array
if html_meta.get('keywords'):
    print(f"Keywords: {html_meta['keywords']}")

# Canonical renamed
if html_meta.get('canonical_url'):
    print(f"Canonical URL: {html_meta['canonical_url']}")

# Open Graph from map
open_graph = html_meta.get('open_graph', {})
if open_graph.get('title'):
    print(f"OG Title: {open_graph['title']}")
if open_graph.get('image'):
    print(f"OG Image: {open_graph['image']}")

# Twitter from map
twitter_card = html_meta.get('twitter_card', {})
if twitter_card.get('card'):
    print(f"Twitter Card: {twitter_card['card']}")

# New fields
if html_meta.get('language'):
    print(f"Language: {html_meta['language']}")

if html_meta.get('headers'):
    print(f"Headers: {html_meta['headers']}")

if html_meta.get('links'):
    # Each link is a mapping with 'href', 'text', 'link_type', ...
    for link in html_meta['links']:
        print(f"Link: {link['href']} ({link['text']})")
```

TypeScript

=== "Before (v3.x)"

```typescript
import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('page.html');
const htmlMeta = result.metadata;

// Keywords as single string
if (htmlMeta.keywords) {
    const keywordArray = htmlMeta.keywords.split(',');
    console.log('Keywords:', keywordArray);
}

// Canonical as separate field
if (htmlMeta.canonical) {
    console.log('Canonical:', htmlMeta.canonical);
}

// Open Graph as individual fields
if (htmlMeta.ogTitle) {
    console.log('OG Title:', htmlMeta.ogTitle);
}
if (htmlMeta.ogImage) {
    console.log('OG Image:', htmlMeta.ogImage);
}

// Twitter as individual fields
if (htmlMeta.twitterCard) {
    console.log('Twitter Card:', htmlMeta.twitterCard);
}
```

=== "After (v4.0)"

```typescript
import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('page.html');
const htmlMeta = result.metadata;

// Keywords as array
if (htmlMeta.keywords?.length) {
    console.log('Keywords:', htmlMeta.keywords);
}

// Canonical renamed
if (htmlMeta.canonicalUrl) {
    console.log('Canonical URL:', htmlMeta.canonicalUrl);
}

// Open Graph from map
if (htmlMeta.openGraph) {
    if (htmlMeta.openGraph['title']) {
        console.log('OG Title:', htmlMeta.openGraph['title']);
    }
    if (htmlMeta.openGraph['image']) {
        console.log('OG Image:', htmlMeta.openGraph['image']);
    }
}

// Twitter from map
if (htmlMeta.twitterCard) {
    if (htmlMeta.twitterCard['card']) {
        console.log('Twitter Card:', htmlMeta.twitterCard['card']);
    }
}

// New fields
if (htmlMeta.language) {
    console.log('Language:', htmlMeta.language);
}

if (htmlMeta.headers?.length) {
    console.log('Headers:', htmlMeta.headers);
}

if (htmlMeta.links?.length) {
    // Each link is an object (href, text, ...), not a [url, text] tuple
    htmlMeta.links.forEach((link) => {
        console.log(`Link: ${link.href} (${link.text})`);
    });
}
```

Java

=== "Before (v3.x)"

```java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.util.Arrays;
import java.util.Map;

ExtractionResult result = Kreuzberg.extractFileSync("page.html");
Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

// Keywords as single string
String keywords = (String) htmlMeta.get("keywords");
if (keywords != null) {
    String[] keywordArray = keywords.split(",");
    System.out.println("Keywords: " + Arrays.toString(keywordArray));
}

// Canonical as separate field
String canonical = (String) htmlMeta.get("canonical");
if (canonical != null) {
    System.out.println("Canonical: " + canonical);
}

// Open Graph as individual fields
String ogTitle = (String) htmlMeta.get("og_title");
if (ogTitle != null) {
    System.out.println("OG Title: " + ogTitle);
}

// Twitter as individual fields
String twitterCard = (String) htmlMeta.get("twitter_card");
if (twitterCard != null) {
    System.out.println("Twitter Card: " + twitterCard);
}
```

=== "After (v4.0)"

```java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.util.Map;
import java.util.List;

ExtractionResult result = Kreuzberg.extractFileSync("page.html");
Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

// Keywords as array
@SuppressWarnings("unchecked")
List<String> keywords = (List<String>) htmlMeta.get("keywords");
if (keywords != null && !keywords.isEmpty()) {
    System.out.println("Keywords: " + keywords);
}

// Canonical renamed
String canonicalUrl = (String) htmlMeta.get("canonical_url");
if (canonicalUrl != null) {
    System.out.println("Canonical URL: " + canonicalUrl);
}

// Open Graph from map
@SuppressWarnings("unchecked")
Map<String, String> openGraph = (Map<String, String>) htmlMeta.get("open_graph");
if (openGraph != null) {
    String ogTitle = openGraph.get("title");
    if (ogTitle != null) {
        System.out.println("OG Title: " + ogTitle);
    }
}

// Twitter from map
@SuppressWarnings("unchecked")
Map<String, String> twitterCard = (Map<String, String>) htmlMeta.get("twitter_card");
if (twitterCard != null) {
    String card = twitterCard.get("card");
    if (card != null) {
        System.out.println("Twitter Card: " + card);
    }
}

// New fields
String language = (String) htmlMeta.get("language");
if (language != null) {
    System.out.println("Language: " + language);
}

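@SuppressWarnings("unchecked")
List<Map<String, Object>> headers = (List<Map<String, Object>>) htmlMeta.get("headers");
if (headers != null && !headers.isEmpty()) {
    // Each header is a map with "level", "text", "id", ...
    for (Map<String, Object> header : headers) {
        System.out.println("Header: " + header.get("text"));
    }
}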
```

Go

=== "Before (v3.x)"

```go
package main

import (
    "fmt"
    "log"
    "strings"
    "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)

func main() {
    result, err := kreuzberg.ExtractFileSync("page.html", nil)
    if err != nil {
        log.Fatalf("extract: %v", err)
    }

    if html, ok := result.Metadata.HTMLMetadata(); ok {
        // Keywords as single string
        if html.Keywords != nil {
            keywordSlice := strings.Split(*html.Keywords, ",")
            fmt.Println("Keywords:", keywordSlice)
        }

        // Canonical as separate field
        if html.Canonical != nil {
            fmt.Println("Canonical:", *html.Canonical)
        }

        // Open Graph as individual fields
        if html.OGTitle != nil {
            fmt.Println("OG Title:", *html.OGTitle)
        }
        if html.OGImage != nil {
            fmt.Println("OG Image:", *html.OGImage)
        }

        // Twitter as individual fields
        if html.TwitterCard != nil {
            fmt.Println("Twitter Card:", *html.TwitterCard)
        }
    }
}
```

=== "After (v4.0)"

```go
package main

import (
    "fmt"
    "log"
    "strings"
    "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)

func main() {
    result, err := kreuzberg.ExtractFileSync("page.html", nil)
    if err != nil {
        log.Fatalf("extract: %v", err)
    }

    if html, ok := result.Metadata.HTMLMetadata(); ok {
        // Keywords as array
        if len(html.Keywords) > 0 {
            fmt.Println("Keywords:", strings.Join(html.Keywords, ", "))
        }

        // Canonical renamed
        if html.CanonicalURL != nil {
            fmt.Println("Canonical URL:", *html.CanonicalURL)
        }

        // Open Graph from map
        if len(html.OpenGraph) > 0 {
            if ogTitle, ok := html.OpenGraph["title"]; ok {
                fmt.Println("OG Title:", ogTitle)
            }
            if ogImage, ok := html.OpenGraph["image"]; ok {
                fmt.Println("OG Image:", ogImage)
            }
        }

        // Twitter from map
        if len(html.TwitterCard) > 0 {
            if card, ok := html.TwitterCard["card"]; ok {
                fmt.Println("Twitter Card:", card)
            }
        }

        // New fields
        if html.Language != nil {
            fmt.Println("Language:", *html.Language)
        }

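        // Headers and links are slices of structs, so print selected fields
        // (field names here mirror HeaderMetadata and LinkMetadata)
        for _, header := range html.Headers {
            fmt.Printf("Header (h%d): %s\n", header.Level, header.Text)
        }

        for _, link := range html.Links {
            fmt.Printf("Link: %s (%s)\n", link.Href, link.Text)
        }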
    }
}
```

Ruby

=== "Before (v3.x)"

```ruby
require 'kreuzberg'

result = Kreuzberg.extract_file_sync('page.html')
html_meta = result.metadata['html']

# Keywords as single string
if html_meta['keywords']
    keyword_array = html_meta['keywords'].split(',').map(&:strip)
    puts "Keywords: #{keyword_array}"
end

# Canonical as separate field
if html_meta['canonical']
    puts "Canonical: #{html_meta['canonical']}"
end

# Open Graph as individual fields
if html_meta['og_title']
    puts "OG Title: #{html_meta['og_title']}"
end
if html_meta['og_image']
    puts "OG Image: #{html_meta['og_image']}"
end

# Twitter as individual fields
if html_meta['twitter_card']
    puts "Twitter Card: #{html_meta['twitter_card']}"
end
```

=== "After (v4.0)"

```ruby
require 'kreuzberg'

result = Kreuzberg.extract_file_sync('page.html')
html_meta = result.metadata['html']

# Keywords as array
if html_meta['keywords'] && !html_meta['keywords'].empty?
    puts "Keywords: #{html_meta['keywords']}"
end

# Canonical renamed
if html_meta['canonical_url']
    puts "Canonical URL: #{html_meta['canonical_url']}"
end

# Open Graph from map
open_graph = html_meta['open_graph'] || {}
if open_graph['title']
    puts "OG Title: #{open_graph['title']}"
end
if open_graph['image']
    puts "OG Image: #{open_graph['image']}"
end

# Twitter from map
twitter_card = html_meta['twitter_card'] || {}
if twitter_card['card']
    puts "Twitter Card: #{twitter_card['card']}"
end

# New fields
if html_meta['language']
    puts "Language: #{html_meta['language']}"
end

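if html_meta['headers'] && !html_meta['headers'].empty?
    # Each header is a hash with 'level', 'text', 'id', ...
    puts "Headers: #{html_meta['headers'].map { |h| h['text'] }.join(', ')}"
end

if html_meta['links'] && !html_meta['links'].empty?
    # Each link is a hash with 'href', 'text', 'link_type', ...
    html_meta['links'].each do |link|
        puts "Link: #{link['href']} (#{link['text']})"
    end
end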
```

API Reference

For complete details on all HTML metadata fields and types, see:

Structured Types Reference

HeaderMetadata

Header elements extracted from the HTML document with hierarchy information.

pub struct HeaderMetadata {
    pub level: u8,                    // 1-6 (h1-h6)
    pub text: String,                 // Normalized text content
    pub id: Option<String>,           // HTML id attribute
    pub depth: usize,                 // Document tree depth
    pub html_offset: usize,           // Byte offset in original HTML
}

Example:

{
  "level": 1,
  "text": "Welcome to Our Site",
  "id": "welcome-section",
  "depth": 2,
  "html_offset": 512
}
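Because each header carries its level, a simple outline or table of contents can be derived from the list. The sketch below uses a local `Header` struct as a stand-in for the `level` and `text` fields of `HeaderMetadata`; `render_outline` is a hypothetical helper, and indenting by two spaces per level is an arbitrary choice.

```rust
/// Minimal stand-in for the `level` and `text` fields of `HeaderMetadata`.
struct Header {
    level: u8,
    text: String,
}

/// Hypothetical helper: renders headers as an indented bullet outline,
/// two spaces of indentation per level below h1.
fn render_outline(headers: &[Header]) -> String {
    headers
        .iter()
        .map(|header| {
            // saturating_sub guards against an unexpected level of 0.
            let indent = "  ".repeat(usize::from(header.level.saturating_sub(1)));
            format!("{}- {}", indent, header.text)
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

For an h1 "Intro" followed by an h2 "Setup", this produces two lines, with the second indented under the first.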

LinkMetadata

Link elements with type classification and detailed attributes.

pub struct LinkMetadata {
    pub href: String,                        // The href URL value
    pub text: String,                        // Link text content
    pub title: Option<String>,               // Title attribute
    pub link_type: LinkType,                 // Classification enum
    pub rel: Vec<String>,                    // Rel attribute values
    pub attributes: HashMap<String, String>, // Additional attributes
}

pub enum LinkType {
    Anchor,    // #section anchors
    Internal,  // Same domain links
    External,  // Different domain links
    Email,     // mailto: links
    Phone,     // tel: links
    Other,     // Other link types
}

Example:

{
  "href": "https://example.com",
  "text": "Visit Example",
  "title": "Example Website",
  "link_type": "external",
  "rel": ["nofollow"],
  "attributes": {
    "data-tracking": "yes"
  }
}
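To make the `LinkType` categories concrete, here is one way such a classification could be written. This is an illustrative sketch only, not kreuzberg's actual logic: the `classify` function, its `page_host` parameter, and the rule that relative paths count as internal are all assumptions, and the host comparison ignores ports and subdomains.

```rust
#[derive(Debug, PartialEq)]
enum LinkType {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
    Other,
}

/// Hypothetical classifier: maps an href to a `LinkType`, given the host of
/// the page the link was found on.
fn classify(href: &str, page_host: &str) -> LinkType {
    if href.starts_with('#') {
        return LinkType::Anchor;
    }
    if href.starts_with("mailto:") {
        return LinkType::Email;
    }
    if href.starts_with("tel:") {
        return LinkType::Phone;
    }
    for scheme in ["http://", "https://"] {
        if let Some(rest) = href.strip_prefix(scheme) {
            // The host ends at the first path, query, or fragment delimiter.
            let host = rest
                .split(|c: char| c == '/' || c == '?' || c == '#')
                .next()
                .unwrap_or("");
            return if host.eq_ignore_ascii_case(page_host) {
                LinkType::Internal
            } else {
                LinkType::External
            };
        }
    }
    // Any other scheme (e.g. "javascript:") is Other; the rest are
    // treated as relative paths on the same site.
    if href.contains(':') {
        LinkType::Other
    } else {
        LinkType::Internal
    }
}
```

When consuming extracted metadata, rely on the `link_type` value the library returns; the sketch is only meant to show what each category represents.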

ImageMetadataType

Image elements with type classification and dimensions.

pub struct ImageMetadataType {
    pub src: String,                         // Image source (URL, data URI, or SVG)
    pub alt: Option<String>,                 // Alt text
    pub title: Option<String>,               // Title attribute
    pub dimensions: Option<(u32, u32)>,      // Width x Height
    pub image_type: ImageType,               // Classification enum
    pub attributes: HashMap<String, String>, // Additional attributes
}

pub enum ImageType {
    DataUri,    // data: URI
    InlineSvg,  // Inline <svg> content
    External,   // External URL
    Relative,   // Relative path
}

Example:

{
  "src": "https://cdn.example.com/image.jpg",
  "alt": "Product photo",
  "title": "Featured product",
  "dimensions": [400, 300],
  "image_type": "external",
  "attributes": {
    "loading": "lazy"
  }
}
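The optional `alt` and `dimensions` fields lend themselves to simple audits. The sketch below uses a local `Image` struct as a stand-in for three fields of `ImageMetadataType`; `missing_alt` and `aspect_ratio` are hypothetical helpers, and treating whitespace-only alt text as missing is an assumption.

```rust
/// Minimal stand-in for the `src`, `alt`, and `dimensions` fields of
/// `ImageMetadataType`.
struct Image {
    src: String,
    alt: Option<String>,
    dimensions: Option<(u32, u32)>,
}

/// Hypothetical helper: sources of images with no usable alt text.
fn missing_alt(images: &[Image]) -> Vec<&str> {
    images
        .iter()
        .filter(|img| img.alt.as_deref().map_or(true, |alt| alt.trim().is_empty()))
        .map(|img| img.src.as_str())
        .collect()
}

/// Hypothetical helper: width divided by height, or None when the
/// dimensions are unknown or the height is zero.
fn aspect_ratio(img: &Image) -> Option<f64> {
    match img.dimensions {
        Some((width, height)) if height > 0 => Some(f64::from(width) / f64::from(height)),
        _ => None,
    }
}
```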

StructuredData

Extracted structured data blocks (JSON-LD, microdata, RDFa).

pub struct StructuredData {
    pub data_type: StructuredDataType,  // Classification enum
    pub raw_json: String,               // Raw JSON string
    pub schema_type: Option<String>,    // Schema type (e.g., "Article")
}

pub enum StructuredDataType {
    JsonLd,    // JSON-LD
    Microdata, // Microdata
    RDFa,      // RDFa
}

Example:

{
  "data_type": "json-ld",
  "raw_json": "{\"@context\": \"https://schema.org\", \"@type\": \"Article\", ...}",
  "schema_type": "Article"
}
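Since `raw_json` is kept as a string, callers can select the blocks they care about by `schema_type` before handing the JSON to a parser of their choice. The sketch below uses a local `Block` struct as a stand-in for two fields of `StructuredData`; `raw_json_for` is a hypothetical helper.

```rust
/// Minimal stand-in for the `raw_json` and `schema_type` fields of
/// `StructuredData`.
struct Block {
    raw_json: String,
    schema_type: Option<String>,
}

/// Hypothetical helper: raw JSON of every block whose schema type matches
/// `schema` exactly (e.g. "Article").
fn raw_json_for<'a>(blocks: &'a [Block], schema: &str) -> Vec<&'a str> {
    blocks
        .iter()
        .filter(|block| block.schema_type.as_deref() == Some(schema))
        .map(|block| block.raw_json.as_str())
        .collect()
}
```

Blocks without a `schema_type` are skipped by this filter, so code that needs them has to inspect `raw_json` directly.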

Summary of Changes

| Field | v3.x | v4.0 |
| --- | --- | --- |
| `keywords` | `Option<String>` | `Vec<String>` with `#[serde(default)]` |
| `canonical` | `Option<String>` | Renamed to `canonical_url` |
| `og_*` fields (7 fields) | Individual `Option<String>` fields | `open_graph: BTreeMap<String, String>` |
| `twitter_*` fields (6 fields) | Individual `Option<String>` fields | `twitter_card: BTreeMap<String, String>` |
| `link_author`, `link_license`, `link_alternate` | Individual fields | Removed (use `links` field) |
| New: `language` | N/A | `Option<String>` |
| New: `text_direction` | N/A | `Option<TextDirection>` |
| New: `headers` | N/A | `Vec<HeaderMetadata>` with `#[serde(default)]` |
| New: `links` | N/A | `Vec<LinkMetadata>` with `#[serde(default)]` |
| New: `images` | N/A | `Vec<ImageMetadataType>` with `#[serde(default)]` |
| New: `structured_data` | N/A | `Vec<StructuredData>` with `#[serde(default)]` |
| New: `meta_tags` | N/A | `BTreeMap<String, String>` with `#[serde(default)]` |

Questions?