24 KiB
HTML Metadata Structure Changes (v4.0)
Summary
HTML metadata has been restructured for better organization and type safety. The changes consolidate individual Open Graph and Twitter Card fields into maps, and convert keywords from a single string to an array.
Breaking Changes
1. Keywords: String to Array
Before (v3.x):
// Option<String> - comma-separated or space-separated
html_meta.keywords // "seo, metadata, html"
After (v4.0):
// Vec<String> - structured array
html_meta.keywords // vec!["seo", "metadata", "html"]
2. Canonical URL: Field Rename
Before (v3.x):
html_meta.canonical // Option<String>
After (v4.0):
html_meta.canonical_url // Option<String>
3. Open Graph: Individual Fields to Map
Before (v3.x):
html_meta.og_title // Option<String>
html_meta.og_description // Option<String>
html_meta.og_image // Option<String>
html_meta.og_url // Option<String>
html_meta.og_type // Option<String>
html_meta.og_site_name // Option<String>
After (v4.0):
html_meta.open_graph // BTreeMap<String, String>
html_meta.open_graph.get("title") // Option<&String>
html_meta.open_graph.get("description") // Option<&String>
html_meta.open_graph.get("image") // Option<&String>
html_meta.open_graph.get("url") // Option<&String>
html_meta.open_graph.get("type") // Option<&String>
html_meta.open_graph.get("site_name") // Option<&String>
4. Twitter Card: Individual Fields to Map
Before (v3.x):
html_meta.twitter_card // Option<String>
html_meta.twitter_title // Option<String>
html_meta.twitter_description // Option<String>
html_meta.twitter_image // Option<String>
html_meta.twitter_site // Option<String>
html_meta.twitter_creator // Option<String>
After (v4.0):
html_meta.twitter_card // BTreeMap<String, String>
html_meta.twitter_card.get("card") // Option<&String>
html_meta.twitter_card.get("title") // Option<&String>
html_meta.twitter_card.get("description") // Option<&String>
html_meta.twitter_card.get("image") // Option<&String>
html_meta.twitter_card.get("site") // Option<&String>
html_meta.twitter_card.get("creator") // Option<&String>
5. Removed Fields
The following link-related fields have been removed:
link_authorlink_licenselink_alternate
Use the new links field instead for comprehensive link extraction.
6. New Fields
HTML metadata now includes rich metadata about page content:
language: Document language (for example, "en", "fr")text_direction: Text direction ("ltr", "rtl")headers: List of page headers/headings with structured metadatalinks: List of links with detailed metadata and type classificationimages: List of images with alt text, dimensions, and type classificationstructured_data: Parsed JSON-LD, microdata, and RDFa datameta_tags: All meta tags as a map
Migration Guide
Rust
=== "Before (v3.x)"
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};
let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
if let Some(html_meta) = result.metadata.html {
// Keywords as single string
if let Some(keywords) = html_meta.keywords {
let keyword_vec: Vec<&str> = keywords.split(',').map(|s| s.trim()).collect();
println!("Keywords: {:?}", keyword_vec);
}
// Canonical as separate field
if let Some(canonical) = html_meta.canonical {
println!("Canonical: {}", canonical);
}
// Open Graph as individual fields
if let Some(og_title) = html_meta.og_title {
println!("OG Title: {}", og_title);
}
if let Some(og_image) = html_meta.og_image {
println!("OG Image: {}", og_image);
}
// Twitter as individual fields
if let Some(twitter_card) = html_meta.twitter_card {
println!("Twitter Card: {}", twitter_card);
}
}
```
=== "After (v4.0)"
```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};
let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
if let Some(html_meta) = result.metadata.html {
// Keywords as array
if !html_meta.keywords.is_empty() {
println!("Keywords: {:?}", html_meta.keywords);
}
// Canonical renamed
if let Some(canonical_url) = html_meta.canonical_url {
println!("Canonical URL: {}", canonical_url);
}
// Open Graph from map
if let Some(og_title) = html_meta.open_graph.get("title") {
println!("OG Title: {}", og_title);
}
if let Some(og_image) = html_meta.open_graph.get("image") {
println!("OG Image: {}", og_image);
}
// Twitter from map
if let Some(twitter_card) = html_meta.twitter_card.get("card") {
println!("Twitter Card: {}", twitter_card);
}
// New fields
if let Some(lang) = html_meta.language {
println!("Language: {}", lang);
}
if let Some(headers) = html_meta.headers {
println!("Headers: {:?}", headers);
}
if let Some(links) = html_meta.links {
for (url, text) in links {
println!("Link: {} ({})", url, text);
}
}
}
```
Python
=== "Before (v3.x)"
```python
from kreuzberg import extract_file_sync, ExtractionConfig
result = extract_file_sync("page.html", config=ExtractionConfig())
html_meta = result.metadata.get("html", {})
# Keywords as single string
if html_meta.get('keywords'):
keyword_list = html_meta['keywords'].split(',')
print(f"Keywords: {keyword_list}")
# Canonical as separate field
if html_meta.get('canonical'):
print(f"Canonical: {html_meta['canonical']}")
# Open Graph as individual fields
if html_meta.get('og_title'):
print(f"OG Title: {html_meta['og_title']}")
if html_meta.get('og_image'):
print(f"OG Image: {html_meta['og_image']}")
# Twitter as individual fields
if html_meta.get('twitter_card'):
print(f"Twitter Card: {html_meta['twitter_card']}")
```
=== "After (v4.0)"
```python
from kreuzberg import extract_file_sync, ExtractionConfig
result = extract_file_sync("page.html", config=ExtractionConfig())
html_meta = result.metadata.get("html", {})
# Keywords as array
if html_meta.get('keywords'):
print(f"Keywords: {html_meta['keywords']}")
# Canonical renamed
if html_meta.get('canonical_url'):
print(f"Canonical URL: {html_meta['canonical_url']}")
# Open Graph from map
open_graph = html_meta.get('open_graph', {})
if open_graph.get('title'):
print(f"OG Title: {open_graph['title']}")
if open_graph.get('image'):
print(f"OG Image: {open_graph['image']}")
# Twitter from map
twitter_card = html_meta.get('twitter_card', {})
if twitter_card.get('card'):
print(f"Twitter Card: {twitter_card['card']}")
# New fields
if html_meta.get('language'):
print(f"Language: {html_meta['language']}")
if html_meta.get('headers'):
print(f"Headers: {html_meta['headers']}")
if html_meta.get('links'):
for url, text in html_meta['links']:
print(f"Link: {url} ({text})")
```
TypeScript
=== "Before (v3.x)"
```typescript
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('page.html');
const htmlMeta = result.metadata;
// Keywords as single string
if (htmlMeta.keywords) {
const keywordArray = htmlMeta.keywords.split(',');
console.log('Keywords:', keywordArray);
}
// Canonical as separate field
if (htmlMeta.canonical) {
console.log('Canonical:', htmlMeta.canonical);
}
// Open Graph as individual fields
if (htmlMeta.ogTitle) {
console.log('OG Title:', htmlMeta.ogTitle);
}
if (htmlMeta.ogImage) {
console.log('OG Image:', htmlMeta.ogImage);
}
// Twitter as individual fields
if (htmlMeta.twitterCard) {
console.log('Twitter Card:', htmlMeta.twitterCard);
}
```
=== "After (v4.0)"
```typescript
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('page.html');
const htmlMeta = result.metadata;
// Keywords as array
if (htmlMeta.keywords?.length > 0) {
console.log('Keywords:', htmlMeta.keywords);
}
// Canonical renamed
if (htmlMeta.canonicalUrl) {
console.log('Canonical URL:', htmlMeta.canonicalUrl);
}
// Open Graph from map
if (htmlMeta.openGraph) {
if (htmlMeta.openGraph['title']) {
console.log('OG Title:', htmlMeta.openGraph['title']);
}
if (htmlMeta.openGraph['image']) {
console.log('OG Image:', htmlMeta.openGraph['image']);
}
}
// Twitter from map
if (htmlMeta.twitterCard) {
if (htmlMeta.twitterCard['card']) {
console.log('Twitter Card:', htmlMeta.twitterCard['card']);
}
}
// New fields
if (htmlMeta.language) {
console.log('Language:', htmlMeta.language);
}
if (htmlMeta.headers?.length > 0) {
console.log('Headers:', htmlMeta.headers);
}
if (htmlMeta.links?.length > 0) {
htmlMeta.links.forEach(([url, text]) => {
console.log(`Link: ${url} (${text})`);
});
}
```
Java
=== "Before (v3.x)"
```java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.util.Map;
ExtractionResult result = Kreuzberg.extractFileSync("page.html");
Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");
// Keywords as single string
String keywords = (String) htmlMeta.get("keywords");
if (keywords != null) {
String[] keywordArray = keywords.split(",");
System.out.println("Keywords: " + Arrays.toString(keywordArray));
}
// Canonical as separate field
String canonical = (String) htmlMeta.get("canonical");
if (canonical != null) {
System.out.println("Canonical: " + canonical);
}
// Open Graph as individual fields
String ogTitle = (String) htmlMeta.get("og_title");
if (ogTitle != null) {
System.out.println("OG Title: " + ogTitle);
}
// Twitter as individual fields
String twitterCard = (String) htmlMeta.get("twitter_card");
if (twitterCard != null) {
System.out.println("Twitter Card: " + twitterCard);
}
```
=== "After (v4.0)"
```java
import dev.kreuzberg.Kreuzberg;
import dev.kreuzberg.ExtractionResult;
import java.util.Map;
import java.util.List;
ExtractionResult result = Kreuzberg.extractFileSync("page.html");
Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");
// Keywords as array
@SuppressWarnings("unchecked")
List<String> keywords = (List<String>) htmlMeta.get("keywords");
if (keywords != null && !keywords.isEmpty()) {
System.out.println("Keywords: " + keywords);
}
// Canonical renamed
String canonicalUrl = (String) htmlMeta.get("canonical_url");
if (canonicalUrl != null) {
System.out.println("Canonical URL: " + canonicalUrl);
}
// Open Graph from map
@SuppressWarnings("unchecked")
Map<String, String> openGraph = (Map<String, String>) htmlMeta.get("open_graph");
if (openGraph != null) {
String ogTitle = openGraph.get("title");
if (ogTitle != null) {
System.out.println("OG Title: " + ogTitle);
}
}
// Twitter from map
@SuppressWarnings("unchecked")
Map<String, String> twitterCard = (Map<String, String>) htmlMeta.get("twitter_card");
if (twitterCard != null) {
String card = twitterCard.get("card");
if (card != null) {
System.out.println("Twitter Card: " + card);
}
}
// New fields
String language = (String) htmlMeta.get("language");
if (language != null) {
System.out.println("Language: " + language);
}
@SuppressWarnings("unchecked")
List<String> headers = (List<String>) htmlMeta.get("headers");
if (headers != null && !headers.isEmpty()) {
System.out.println("Headers: " + headers);
}
```
Go
=== "Before (v3.x)"
```go
package main
import (
"fmt"
"log"
"strings"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)
func main() {
result, err := kreuzberg.ExtractFileSync("page.html", nil)
if err != nil {
log.Fatalf("extract: %v", err)
}
if html, ok := result.Metadata.HTMLMetadata(); ok {
// Keywords as single string
if html.Keywords != nil {
keywordSlice := strings.Split(*html.Keywords, ",")
fmt.Println("Keywords:", keywordSlice)
}
// Canonical as separate field
if html.Canonical != nil {
fmt.Println("Canonical:", *html.Canonical)
}
// Open Graph as individual fields
if html.OGTitle != nil {
fmt.Println("OG Title:", *html.OGTitle)
}
if html.OGImage != nil {
fmt.Println("OG Image:", *html.OGImage)
}
// Twitter as individual fields
if html.TwitterCard != nil {
fmt.Println("Twitter Card:", *html.TwitterCard)
}
}
}
```
=== "After (v4.0)"
```go
package main
import (
"fmt"
"log"
"strings"
"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
)
func main() {
result, err := kreuzberg.ExtractFileSync("page.html", nil)
if err != nil {
log.Fatalf("extract: %v", err)
}
if html, ok := result.Metadata.HTMLMetadata(); ok {
// Keywords as array
if len(html.Keywords) > 0 {
fmt.Println("Keywords:", strings.Join(html.Keywords, ", "))
}
// Canonical renamed
if html.CanonicalURL != nil {
fmt.Println("Canonical URL:", *html.CanonicalURL)
}
// Open Graph from map
if len(html.OpenGraph) > 0 {
if ogTitle, ok := html.OpenGraph["title"]; ok {
fmt.Println("OG Title:", ogTitle)
}
if ogImage, ok := html.OpenGraph["image"]; ok {
fmt.Println("OG Image:", ogImage)
}
}
// Twitter from map
if len(html.TwitterCard) > 0 {
if card, ok := html.TwitterCard["card"]; ok {
fmt.Println("Twitter Card:", card)
}
}
// New fields
if html.Language != nil {
fmt.Println("Language:", *html.Language)
}
if len(html.Headers) > 0 {
fmt.Println("Headers:", strings.Join(html.Headers, ", "))
}
if len(html.Links) > 0 {
for _, link := range html.Links {
fmt.Printf("Link: %s (%s)\n", link[0], link[1])
}
}
}
}
```
Ruby
=== "Before (v3.x)"
```ruby
require 'kreuzberg'
result = Kreuzberg.extract_file_sync('page.html')
html_meta = result.metadata['html']
# Keywords as single string
if html_meta['keywords']
keyword_array = html_meta['keywords'].split(',').map(&:strip)
puts "Keywords: #{keyword_array}"
end
# Canonical as separate field
if html_meta['canonical']
puts "Canonical: #{html_meta['canonical']}"
end
# Open Graph as individual fields
if html_meta['og_title']
puts "OG Title: #{html_meta['og_title']}"
end
if html_meta['og_image']
puts "OG Image: #{html_meta['og_image']}"
end
# Twitter as individual fields
if html_meta['twitter_card']
puts "Twitter Card: #{html_meta['twitter_card']}"
end
```
=== "After (v4.0)"
```ruby
require 'kreuzberg'
result = Kreuzberg.extract_file_sync('page.html')
html_meta = result.metadata['html']
# Keywords as array
if html_meta['keywords'] && !html_meta['keywords'].empty?
puts "Keywords: #{html_meta['keywords']}"
end
# Canonical renamed
if html_meta['canonical_url']
puts "Canonical URL: #{html_meta['canonical_url']}"
end
# Open Graph from map
open_graph = html_meta['open_graph'] || {}
if open_graph['title']
puts "OG Title: #{open_graph['title']}"
end
if open_graph['image']
puts "OG Image: #{open_graph['image']}"
end
# Twitter from map
twitter_card = html_meta['twitter_card'] || {}
if twitter_card['card']
puts "Twitter Card: #{twitter_card['card']}"
end
# New fields
if html_meta['language']
puts "Language: #{html_meta['language']}"
end
if html_meta['headers'] && !html_meta['headers'].empty?
puts "Headers: #{html_meta['headers'].join(', ')}"
end
if html_meta['links'] && !html_meta['links'].empty?
html_meta['links'].each do |url, text|
puts "Link: #{url} (#{text})"
end
end
```
API Reference
For complete details on all HTML metadata fields and types, see:
Structured Types Reference
HeaderMetadata
Header elements extracted from the HTML document with hierarchy information.
pub struct HeaderMetadata {
pub level: u8, // 1-6 (h1-h6)
pub text: String, // Normalized text content
pub id: Option<String>, // HTML id attribute
pub depth: usize, // Document tree depth
pub html_offset: usize, // Byte offset in original HTML
}
Example:
{
"level": 1,
"text": "Welcome to Our Site",
"id": "welcome-section",
"depth": 2,
"html_offset": 512
}
LinkMetadata
Link elements with type classification and detailed attributes.
pub struct LinkMetadata {
pub href: String, // The href URL value
pub text: String, // Link text content
pub title: Option<String>, // Title attribute
pub link_type: LinkType, // Classification enum
pub rel: Vec<String>, // Rel attribute values
pub attributes: HashMap<String, String>, // Additional attributes
}
pub enum LinkType {
Anchor, // #section anchors
Internal, // Same domain links
External, // Different domain links
Email, // mailto: links
Phone, // tel: links
Other, // Other link types
}
Example:
{
"href": "https://example.com",
"text": "Visit Example",
"title": "Example Website",
"link_type": "external",
"rel": ["nofollow"],
"attributes": {
"data-tracking": "yes"
}
}
ImageMetadataType
Image elements with type classification and dimensions.
pub struct ImageMetadataType {
pub src: String, // Image source (URL, data URI, or SVG)
pub alt: Option<String>, // Alt text
pub title: Option<String>, // Title attribute
pub dimensions: Option<(u32, u32)>, // Width x Height
pub image_type: ImageType, // Classification enum
pub attributes: HashMap<String, String>, // Additional attributes
}
pub enum ImageType {
DataUri, // data: URI
InlineSvg, // Inline <svg> content
External, // External URL
Relative, // Relative path
}
Example:
{
"src": "https://cdn.example.com/image.jpg",
"alt": "Product photo",
"title": "Featured product",
"dimensions": [400, 300],
"image_type": "external",
"attributes": {
"loading": "lazy"
}
}
StructuredData
Extracted structured data blocks (JSON-LD, microdata, RDFa).
pub struct StructuredData {
pub data_type: StructuredDataType, // Classification enum
pub raw_json: String, // Raw JSON string
pub schema_type: Option<String>, // Schema type (e.g., "Article")
}
pub enum StructuredDataType {
JsonLd, // JSON-LD
Microdata, // microdata
RDFa, // RDFa
}
Example:
{
"data_type": "json-ld",
"raw_json": "{\"@context\": \"https://schema.org\", \"@type\": \"Article\", ...}",
"schema_type": "Article"
}
Summary of Changes
| Field | v3.x | v4.0 |
|---|---|---|
keywords |
Option<String> |
Vec<String> with #[serde(default)] |
canonical |
Option<String> |
Renamed to canonical_url |
og_* fields (7 fields) |
Individual Option<String> fields |
open_graph: BTreeMap<String, String> |
twitter_* fields (6 fields) |
Individual Option<String> fields |
twitter_card: BTreeMap<String, String> |
link_author, link_license, link_alternate |
Individual fields | Removed (use links field) |
New: language |
N/A | Option<String> |
New: text_direction |
N/A | Option<TextDirection> |
New: headers |
N/A | Vec<HeaderMetadata> with #[serde(default)] |
New: links |
N/A | Vec<LinkMetadata> with #[serde(default)] |
New: images |
N/A | Vec<ImageMetadataType> with #[serde(default)] |
New: structured_data |
N/A | Vec<StructuredData> with #[serde(default)] |
New: meta_tags |
N/A | BTreeMap<String, String> with #[serde(default)] |
Questions?
- See the Types Reference for complete API details
- Check Working with Metadata for examples
- Open an issue on GitHub