# HTML Metadata Structure Changes (v4.0)

## Summary

HTML metadata has been restructured for better organization and type safety. The changes consolidate individual Open Graph and Twitter Card fields into maps, and convert keywords from a single string to an array.

## Breaking Changes

### 1. Keywords: String to Array

**Before (v3.x):**

```rust title="Keywords as Comma-Separated String"
// Option<String> - comma-separated or space-separated
html_meta.keywords  // "seo, metadata, html"
```

**After (v4.0):**

```rust title="Keywords as Structured Array"
// Vec<String> - structured array
html_meta.keywords  // vec!["seo", "metadata", "html"]
```

### 2. Canonical URL: Field Rename

**Before (v3.x):**

```rust title="Canonical Field (v3.x)"
html_meta.canonical  // Option<String>
```

**After (v4.0):**

```rust title="Canonical URL Field (v4.0)"
html_meta.canonical_url  // Option<String>
```

### 3. Open Graph: Individual Fields to Map

**Before (v3.x):**

```rust title="Open Graph as Individual Fields"
html_meta.og_title        // Option<String>
html_meta.og_description  // Option<String>
html_meta.og_image        // Option<String>
html_meta.og_url          // Option<String>
html_meta.og_type         // Option<String>
html_meta.og_site_name    // Option<String>
```

**After (v4.0):**

```rust title="Open Graph as Map Structure"
html_meta.open_graph                     // BTreeMap<String, String>
html_meta.open_graph.get("title")        // Option<&String>
html_meta.open_graph.get("description")  // Option<&String>
html_meta.open_graph.get("image")        // Option<&String>
html_meta.open_graph.get("url")          // Option<&String>
html_meta.open_graph.get("type")         // Option<&String>
html_meta.open_graph.get("site_name")    // Option<&String>
```

### 4. Twitter Card: Individual Fields to Map

**Before (v3.x):**

```rust title="Twitter Card as Individual Fields"
html_meta.twitter_card         // Option<String>
html_meta.twitter_title        // Option<String>
html_meta.twitter_description  // Option<String>
html_meta.twitter_image        // Option<String>
html_meta.twitter_site         // Option<String>
html_meta.twitter_creator      // Option<String>
```

**After (v4.0):**

```rust title="Twitter Card as Map Structure"
html_meta.twitter_card                     // BTreeMap<String, String>
html_meta.twitter_card.get("card")         // Option<&String>
html_meta.twitter_card.get("title")        // Option<&String>
html_meta.twitter_card.get("description")  // Option<&String>
html_meta.twitter_card.get("image")        // Option<&String>
html_meta.twitter_card.get("site")         // Option<&String>
html_meta.twitter_card.get("creator")      // Option<&String>
```

### 5. Removed Fields

The following link-related fields have been removed:

- `link_author`
- `link_license`
- `link_alternate`

Use the new `links` field instead for comprehensive link extraction.

### 6. New Fields

HTML metadata now includes rich metadata about page content:

- **`language`**: Document language (for example, "en", "fr")
- **`text_direction`**: Text direction ("ltr", "rtl")
- **`headers`**: List of page headers/headings with structured metadata
- **`links`**: List of links with detailed metadata and type classification
- **`images`**: List of images with alt text, dimensions, and type classification
- **`structured_data`**: Parsed JSON-LD, microdata, and RDFa data
- **`meta_tags`**: All meta tags as a map

## Migration Guide

### Rust

=== "Before (v3.x)"

    ```rust
    use kreuzberg::{extract_file_sync, ExtractionConfig};

    let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;

    if let Some(html_meta) = result.metadata.html {
        // Keywords as single string
        if let Some(keywords) = html_meta.keywords {
            let keyword_vec: Vec<&str> = keywords.split(',').map(|s| s.trim()).collect();
            println!("Keywords: {:?}", keyword_vec);
        }

        // Canonical as separate field
        if let Some(canonical) = html_meta.canonical {
            println!("Canonical: {}", canonical);
        }

        // Open Graph as individual fields
        if let Some(og_title) = html_meta.og_title {
            println!("OG Title: {}", og_title);
        }
        if let Some(og_image) = html_meta.og_image {
            println!("OG Image: {}", og_image);
        }

        // Twitter as individual fields
        if let Some(twitter_card) = html_meta.twitter_card {
            println!("Twitter Card: {}", twitter_card);
        }
    }
    ```

=== "After (v4.0)"

    ```rust
    use kreuzberg::{extract_file_sync, ExtractionConfig};

    let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;

    if let Some(html_meta) = result.metadata.html {
        // Keywords as array
        if !html_meta.keywords.is_empty() {
            println!("Keywords: {:?}", html_meta.keywords);
        }

        // Canonical renamed
        if let Some(canonical_url) = html_meta.canonical_url {
            println!("Canonical URL: {}", canonical_url);
        }

        // Open Graph from map
        if let Some(og_title) = html_meta.open_graph.get("title") {
            println!("OG Title: {}", og_title);
        }
        if let Some(og_image) = html_meta.open_graph.get("image") {
            println!("OG Image: {}", og_image);
        }

        // Twitter from map
        if let Some(twitter_card) = html_meta.twitter_card.get("card") {
            println!("Twitter Card: {}", twitter_card);
        }

        // New fields
        if let Some(lang) = html_meta.language {
            println!("Language: {}", lang);
        }
        // `headers` and `links` are plain vectors of structs (see the
        // Structured Types Reference), so iterate them directly.
        for header in &html_meta.headers {
            println!("H{}: {}", header.level, header.text);
        }
        for link in &html_meta.links {
            println!("Link: {} ({})", link.href, link.text);
        }
    }
    ```

### Python

=== "Before (v3.x)"

    ```python
    from kreuzberg import extract_file_sync, ExtractionConfig

    result = extract_file_sync("page.html", config=ExtractionConfig())
    html_meta = result.metadata.get("html", {})

    # Keywords as single string
    if html_meta.get('keywords'):
        keyword_list = html_meta['keywords'].split(',')
        print(f"Keywords: {keyword_list}")

    # Canonical as separate field
    if html_meta.get('canonical'):
        print(f"Canonical: {html_meta['canonical']}")

    # Open Graph as individual fields
    if html_meta.get('og_title'):
        print(f"OG Title: {html_meta['og_title']}")
    if html_meta.get('og_image'):
        print(f"OG Image: {html_meta['og_image']}")

    # Twitter as individual fields
    if html_meta.get('twitter_card'):
        print(f"Twitter Card: {html_meta['twitter_card']}")
    ```

=== "After (v4.0)"

    ```python
    from kreuzberg import extract_file_sync, ExtractionConfig

    result = extract_file_sync("page.html", config=ExtractionConfig())
    html_meta = result.metadata.get("html", {})

    # Keywords as array
    if html_meta.get('keywords'):
        print(f"Keywords: {html_meta['keywords']}")

    # Canonical renamed
    if html_meta.get('canonical_url'):
        print(f"Canonical URL: {html_meta['canonical_url']}")

    # Open Graph from map
    open_graph = html_meta.get('open_graph', {})
    if open_graph.get('title'):
        print(f"OG Title: {open_graph['title']}")
    if open_graph.get('image'):
        print(f"OG Image: {open_graph['image']}")

    # Twitter from map
    twitter_card = html_meta.get('twitter_card', {})
    if twitter_card.get('card'):
        print(f"Twitter Card: {twitter_card['card']}")

    # New fields
    if html_meta.get('language'):
        print(f"Language: {html_meta['language']}")
    if html_meta.get('headers'):
        print(f"Headers: {html_meta['headers']}")
    # Each link is a structured entry (see LinkMetadata), not a (url, text) pair
    for link in html_meta.get('links', []):
        print(f"Link: {link['href']} ({link['text']})")
    ```

### TypeScript

=== "Before (v3.x)"

    ```typescript
    import { extractFileSync } from '@kreuzberg/node';

    const result = extractFileSync('page.html');
    const htmlMeta = result.metadata;

    // Keywords as single string
    if (htmlMeta.keywords) {
      const keywordArray = htmlMeta.keywords.split(',');
      console.log('Keywords:', keywordArray);
    }

    // Canonical as separate field
    if (htmlMeta.canonical) {
      console.log('Canonical:', htmlMeta.canonical);
    }

    // Open Graph as individual fields
    if (htmlMeta.ogTitle) {
      console.log('OG Title:', htmlMeta.ogTitle);
    }
    if (htmlMeta.ogImage) {
      console.log('OG Image:', htmlMeta.ogImage);
    }

    // Twitter as individual fields
    if (htmlMeta.twitterCard) {
      console.log('Twitter Card:', htmlMeta.twitterCard);
    }
    ```

=== "After (v4.0)"

    ```typescript
    import { extractFileSync } from '@kreuzberg/node';

    const result = extractFileSync('page.html');
    const htmlMeta = result.metadata;

    // Keywords as array
    if (htmlMeta.keywords?.length) {
      console.log('Keywords:', htmlMeta.keywords);
    }

    // Canonical renamed
    if (htmlMeta.canonicalUrl) {
      console.log('Canonical URL:', htmlMeta.canonicalUrl);
    }

    // Open Graph from map
    if (htmlMeta.openGraph) {
      if (htmlMeta.openGraph['title']) {
        console.log('OG Title:', htmlMeta.openGraph['title']);
      }
      if (htmlMeta.openGraph['image']) {
        console.log('OG Image:', htmlMeta.openGraph['image']);
      }
    }

    // Twitter from map
    if (htmlMeta.twitterCard) {
      if (htmlMeta.twitterCard['card']) {
        console.log('Twitter Card:', htmlMeta.twitterCard['card']);
      }
    }

    // New fields
    if (htmlMeta.language) {
      console.log('Language:', htmlMeta.language);
    }
    if (htmlMeta.headers?.length) {
      console.log('Headers:', htmlMeta.headers);
    }
    // Each link is a structured entry (see LinkMetadata), not a [url, text] pair
    if (htmlMeta.links?.length) {
      htmlMeta.links.forEach((link) => {
        console.log(`Link: ${link.href} (${link.text})`);
      });
    }
    ```

### Java

=== "Before (v3.x)"

    ```java
    import dev.kreuzberg.Kreuzberg;
    import dev.kreuzberg.ExtractionResult;
    import java.util.Arrays;
    import java.util.Map;

    ExtractionResult result = Kreuzberg.extractFileSync("page.html");
    @SuppressWarnings("unchecked")
    Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

    // Keywords as single string
    String keywords = (String) htmlMeta.get("keywords");
    if (keywords != null) {
        String[] keywordArray = keywords.split(",");
        System.out.println("Keywords: " + Arrays.toString(keywordArray));
    }

    // Canonical as separate field
    String canonical = (String) htmlMeta.get("canonical");
    if (canonical != null) {
        System.out.println("Canonical: " + canonical);
    }

    // Open Graph as individual fields
    String ogTitle = (String) htmlMeta.get("og_title");
    if (ogTitle != null) {
        System.out.println("OG Title: " + ogTitle);
    }

    // Twitter as individual fields
    String twitterCard = (String) htmlMeta.get("twitter_card");
    if (twitterCard != null) {
        System.out.println("Twitter Card: " + twitterCard);
    }
    ```

=== "After (v4.0)"

    ```java
    import dev.kreuzberg.Kreuzberg;
    import dev.kreuzberg.ExtractionResult;
    import java.util.Map;
    import java.util.List;

    ExtractionResult result = Kreuzberg.extractFileSync("page.html");
    @SuppressWarnings("unchecked")
    Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

    // Keywords as array
    @SuppressWarnings("unchecked")
    List<String> keywords = (List<String>) htmlMeta.get("keywords");
    if (keywords != null && !keywords.isEmpty()) {
        System.out.println("Keywords: " + keywords);
    }

    // Canonical renamed
    String canonicalUrl = (String) htmlMeta.get("canonical_url");
    if (canonicalUrl != null) {
        System.out.println("Canonical URL: " + canonicalUrl);
    }

    // Open Graph from map
    @SuppressWarnings("unchecked")
    Map<String, String> openGraph = (Map<String, String>) htmlMeta.get("open_graph");
    if (openGraph != null) {
        String ogTitle = openGraph.get("title");
        if (ogTitle != null) {
            System.out.println("OG Title: " + ogTitle);
        }
    }

    // Twitter from map
    @SuppressWarnings("unchecked")
    Map<String, String> twitterCard = (Map<String, String>) htmlMeta.get("twitter_card");
    if (twitterCard != null) {
        String card = twitterCard.get("card");
        if (card != null) {
            System.out.println("Twitter Card: " + card);
        }
    }

    // New fields
    String language = (String) htmlMeta.get("language");
    if (language != null) {
        System.out.println("Language: " + language);
    }

    // Each header is a structured entry (see HeaderMetadata)
    @SuppressWarnings("unchecked")
    List<Map<String, Object>> headers = (List<Map<String, Object>>) htmlMeta.get("headers");
    if (headers != null && !headers.isEmpty()) {
        System.out.println("Headers: " + headers);
    }
    ```

### Go

=== "Before (v3.x)"

    ```go
    package main

    import (
    	"fmt"
    	"log"
    	"strings"

    	"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
    )

    func main() {
    	result, err := kreuzberg.ExtractFileSync("page.html", nil)
    	if err != nil {
    		log.Fatalf("extract: %v", err)
    	}

    	if html, ok := result.Metadata.HTMLMetadata(); ok {
    		// Keywords as single string
    		if html.Keywords != nil {
    			keywordSlice := strings.Split(*html.Keywords, ",")
    			fmt.Println("Keywords:", keywordSlice)
    		}

    		// Canonical as separate field
    		if html.Canonical != nil {
    			fmt.Println("Canonical:", *html.Canonical)
    		}

    		// Open Graph as individual fields
    		if html.OGTitle != nil {
    			fmt.Println("OG Title:", *html.OGTitle)
    		}
    		if html.OGImage != nil {
    			fmt.Println("OG Image:", *html.OGImage)
    		}

    		// Twitter as individual fields
    		if html.TwitterCard != nil {
    			fmt.Println("Twitter Card:", *html.TwitterCard)
    		}
    	}
    }
    ```

=== "After (v4.0)"

    ```go
    package main

    import (
    	"fmt"
    	"log"
    	"strings"

    	"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
    )

    func main() {
    	result, err := kreuzberg.ExtractFileSync("page.html", nil)
    	if err != nil {
    		log.Fatalf("extract: %v", err)
    	}

    	if html, ok := result.Metadata.HTMLMetadata(); ok {
    		// Keywords as array
    		if len(html.Keywords) > 0 {
    			fmt.Println("Keywords:", strings.Join(html.Keywords, ", "))
    		}

    		// Canonical renamed
    		if html.CanonicalURL != nil {
    			fmt.Println("Canonical URL:", *html.CanonicalURL)
    		}

    		// Open Graph from map
    		if len(html.OpenGraph) > 0 {
    			if ogTitle, ok := html.OpenGraph["title"]; ok {
    				fmt.Println("OG Title:", ogTitle)
    			}
    			if ogImage, ok := html.OpenGraph["image"]; ok {
    				fmt.Println("OG Image:", ogImage)
    			}
    		}

    		// Twitter from map
    		if len(html.TwitterCard) > 0 {
    			if card, ok := html.TwitterCard["card"]; ok {
    				fmt.Println("Twitter Card:", card)
    			}
    		}

    		// New fields
    		if html.Language != nil {
    			fmt.Println("Language:", *html.Language)
    		}
    		// Headers and links are structured entries (see the Structured
    		// Types Reference); %+v prints them without assuming field names.
    		for _, header := range html.Headers {
    			fmt.Printf("Header: %+v\n", header)
    		}
    		for _, link := range html.Links {
    			fmt.Printf("Link: %+v\n", link)
    		}
    	}
    }
    ```

### Ruby

=== "Before (v3.x)"

    ```ruby
    require 'kreuzberg'

    result = Kreuzberg.extract_file_sync('page.html')
    html_meta = result.metadata['html']

    # Keywords as single string
    if html_meta['keywords']
      keyword_array = html_meta['keywords'].split(',').map(&:strip)
      puts "Keywords: #{keyword_array}"
    end

    # Canonical as separate field
    if html_meta['canonical']
      puts "Canonical: #{html_meta['canonical']}"
    end

    # Open Graph as individual fields
    if html_meta['og_title']
      puts "OG Title: #{html_meta['og_title']}"
    end
    if html_meta['og_image']
      puts "OG Image: #{html_meta['og_image']}"
    end

    # Twitter as individual fields
    if html_meta['twitter_card']
      puts "Twitter Card: #{html_meta['twitter_card']}"
    end
    ```

=== "After (v4.0)"

    ```ruby
    require 'kreuzberg'

    result = Kreuzberg.extract_file_sync('page.html')
    html_meta = result.metadata['html']

    # Keywords as array
    if html_meta['keywords'] && !html_meta['keywords'].empty?
      puts "Keywords: #{html_meta['keywords']}"
    end

    # Canonical renamed
    if html_meta['canonical_url']
      puts "Canonical URL: #{html_meta['canonical_url']}"
    end

    # Open Graph from map
    open_graph = html_meta['open_graph'] || {}
    if open_graph['title']
      puts "OG Title: #{open_graph['title']}"
    end
    if open_graph['image']
      puts "OG Image: #{open_graph['image']}"
    end

    # Twitter from map
    twitter_card = html_meta['twitter_card'] || {}
    if twitter_card['card']
      puts "Twitter Card: #{twitter_card['card']}"
    end

    # New fields
    if html_meta['language']
      puts "Language: #{html_meta['language']}"
    end
    # Headers and links are structured entries (see the Structured Types Reference)
    (html_meta['headers'] || []).each do |header|
      puts "H#{header['level']}: #{header['text']}"
    end
    (html_meta['links'] || []).each do |link|
      puts "Link: #{link['href']} (#{link['text']})"
    end
    ```

## API Reference

For complete details on all HTML metadata fields and types, see:

- [HTML Metadata Type Reference](../reference/types.md#htmlmetadata)

## Structured Types Reference

### HeaderMetadata

Header elements extracted from the HTML document with hierarchy information.

```rust title="HeaderMetadata Struct Definition"
pub struct HeaderMetadata {
    pub level: u8,            // 1-6 (h1-h6)
    pub text: String,         // Normalized text content
    pub id: Option<String>,   // HTML id attribute
    pub depth: usize,         // Document tree depth
    pub html_offset: usize,   // Byte offset in original HTML
}
```

**Example:**

```json title="HeaderMetadata JSON Example"
{
  "level": 1,
  "text": "Welcome to Our Site",
  "id": "welcome-section",
  "depth": 2,
  "html_offset": 512
}
```

### LinkMetadata

Link elements with type classification and detailed attributes.

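
To make the classification concrete, the sketch below shows one way an `href` might be mapped onto the `LinkType` variants documented in this section. It is an illustration under stated assumptions, not the library's implementation: `classify_href` and its `page_host` parameter are hypothetical names, the enum is a local mirror so the snippet compiles on its own, and the real extractor's rules (ports, subdomains, protocol handling) may well differ.

```rust
/// Local mirror of the documented `LinkType` variants (not imported from the crate).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LinkType {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
    Other,
}

/// Hypothetical helper: classify `href` relative to the host of the page it appears on.
pub fn classify_href(href: &str, page_host: &str) -> LinkType {
    let lower = href.trim().to_ascii_lowercase();

    if lower.starts_with('#') {
        return LinkType::Anchor;
    }
    if lower.starts_with("mailto:") {
        return LinkType::Email;
    }
    if lower.starts_with("tel:") {
        return LinkType::Phone;
    }

    // Absolute or protocol-relative URLs: compare the host with the page host.
    for scheme in ["http://", "https://", "//"].iter() {
        if let Some(rest) = lower.strip_prefix(*scheme) {
            let host = rest
                .split(|c: char| c == '/' || c == '?' || c == '#')
                .next()
                .unwrap_or("");
            return if host.eq_ignore_ascii_case(page_host) {
                LinkType::Internal
            } else {
                LinkType::External
            };
        }
    }

    // A colon before the first '/', '?' or '#' signals some other scheme
    // (for example `javascript:`); anything else is treated as a relative path.
    let head = lower
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    if head.contains(':') {
        LinkType::Other
    } else {
        LinkType::Internal
    }
}
```

With this sketch, `classify_href("/docs/intro", "example.com")` yields `LinkType::Internal`, while `classify_href("https://other.org/", "example.com")` yields `LinkType::External`.
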
```rust title="LinkMetadata Struct and LinkType Enum"
pub struct LinkMetadata {
    pub href: String,                         // The href URL value
    pub text: String,                         // Link text content
    pub title: Option<String>,                // Title attribute
    pub link_type: LinkType,                  // Classification enum
    pub rel: Vec<String>,                     // Rel attribute values
    pub attributes: HashMap<String, String>,  // Additional attributes
}

pub enum LinkType {
    Anchor,    // #section anchors
    Internal,  // Same domain links
    External,  // Different domain links
    Email,     // mailto: links
    Phone,     // tel: links
    Other,     // Other link types
}
```

**Example:**

```json title="LinkMetadata JSON Example"
{
  "href": "https://example.com",
  "text": "Visit Example",
  "title": "Example Website",
  "link_type": "external",
  "rel": ["nofollow"],
  "attributes": {
    "data-tracking": "yes"
  }
}
```

### ImageMetadataType

Image elements with type classification and dimensions.

```rust title="ImageMetadataType Struct and ImageType Enum"
pub struct ImageMetadataType {
    pub src: String,                          // Image source (URL, data URI, or SVG)
    pub alt: Option<String>,                  // Alt text
    pub title: Option<String>,                // Title attribute
    pub dimensions: Option<(u32, u32)>,       // Width x Height
    pub image_type: ImageType,                // Classification enum
    pub attributes: HashMap<String, String>,  // Additional attributes
}

pub enum ImageType {
    DataUri,    // data: URI
    InlineSvg,  // Inline <svg> content
    External,   // External URL
    Relative,   // Relative path
}
```

**Example:**

```json title="ImageMetadataType JSON Example"
{
  "src": "https://cdn.example.com/image.jpg",
  "alt": "Product photo",
  "title": "Featured product",
  "dimensions": [400, 300],
  "image_type": "external",
  "attributes": {
    "loading": "lazy"
  }
}
```

### StructuredData

Extracted structured data blocks (JSON-LD, microdata, RDFa).

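
Because each block keeps its payload as a raw JSON string, consumers will often want to filter on `schema_type` before parsing `raw_json`. The sketch below shows one way that might look. The types are local mirrors of the `StructuredData` definitions in this section so the snippet compiles on its own, and `raw_json_for_schema` is a hypothetical helper rather than part of the library.

```rust
/// Local mirror of the documented enum (not imported from the crate).
#[allow(non_camel_case_types, dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StructuredDataType {
    JsonLd,
    Microdata,
    RDFa,
}

/// Local mirror of the documented struct.
#[allow(dead_code)]
#[derive(Debug, Clone)]
pub struct StructuredData {
    pub data_type: StructuredDataType,
    pub raw_json: String,
    pub schema_type: Option<String>,
}

/// Hypothetical helper: the raw JSON of every block whose `schema_type` equals `schema`.
pub fn raw_json_for_schema<'a>(blocks: &'a [StructuredData], schema: &str) -> Vec<&'a str> {
    blocks
        .iter()
        .filter(|block| block.schema_type.as_deref() == Some(schema))
        .map(|block| block.raw_json.as_str())
        .collect()
}
```

The returned slices borrow from the input, so the strings can be handed to whichever JSON parser the application already uses without copying.
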
```rust title="StructuredData Struct and StructuredDataType Enum"
pub struct StructuredData {
    pub data_type: StructuredDataType,  // Classification enum
    pub raw_json: String,               // Raw JSON string
    pub schema_type: Option<String>,    // Schema type (e.g., "Article")
}

pub enum StructuredDataType {
    JsonLd,     // JSON-LD
    Microdata,  // microdata
    RDFa,       // RDFa
}
```

**Example:**

```json title="StructuredData JSON Example"
{
  "data_type": "json-ld",
  "raw_json": "{\"@context\": \"https://schema.org\", \"@type\": \"Article\", ...}",
  "schema_type": "Article"
}
```

## Summary of Changes

| Field                                           | v3.x                                | v4.0                                                    |
| ----------------------------------------------- | ----------------------------------- | ------------------------------------------------------- |
| `keywords`                                      | `Option<String>`                    | `Vec<String>` with `#[serde(default)]`                  |
| `canonical`                                     | `Option<String>`                    | Renamed to `canonical_url`                              |
| `og_*` fields (7 fields)                        | Individual `Option<String>` fields  | `open_graph: BTreeMap<String, String>`                  |
| `twitter_*` fields (6 fields)                   | Individual `Option<String>` fields  | `twitter_card: BTreeMap<String, String>`                |
| `link_author`, `link_license`, `link_alternate` | Individual fields                   | Removed (use `links` field)                             |
| New: `language`                                 | N/A                                 | `Option<String>`                                        |
| New: `text_direction`                           | N/A                                 | `Option<TextDirection>`                                 |
| New: `headers`                                  | N/A                                 | `Vec<HeaderMetadata>` with `#[serde(default)]`          |
| New: `links`                                    | N/A                                 | `Vec<LinkMetadata>` with `#[serde(default)]`            |
| New: `images`                                   | N/A                                 | `Vec<ImageMetadataType>` with `#[serde(default)]`       |
| New: `structured_data`                          | N/A                                 | `Vec<StructuredData>` with `#[serde(default)]`          |
| New: `meta_tags`                                | N/A                                 | `BTreeMap<String, String>` with `#[serde(default)]`     |

## Questions?

- See the [Types Reference](../reference/types.md) for complete API details
- Check [Working with Metadata](../getting-started/quickstart.md#read-document-metadata) for examples
- Open an issue on [GitHub](https://github.com/kreuzberg-dev/kreuzberg/issues)