# HTML Metadata Structure Changes (v4.0)

## Summary

HTML metadata has been restructured for better organization and type safety. The changes consolidate individual Open Graph and Twitter Card fields into maps, and convert keywords from a single string to an array.

## Breaking Changes

### 1. Keywords: String to Array

**Before (v3.x):**

```rust title="Keywords as Comma-Separated String"
// Option<String> - comma-separated or space-separated
html_meta.keywords  // "seo, metadata, html"
```

**After (v4.0):**

```rust title="Keywords as Structured Array"
// Vec<String> - structured array
html_meta.keywords  // vec!["seo", "metadata", "html"]
```

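Code that still holds keyword strings produced by v3.x (for example, cached results) may need to convert them to the list form. The following is a minimal sketch, not part of the library: `split_legacy_keywords` is a hypothetical helper, and the rule "split on commas when present, otherwise on whitespace" is an assumption based on the v3.x comment above.

```rust
/// Hypothetical helper: converts a legacy v3.x keyword string into the
/// v4.0 list form. Splits on commas when any are present, otherwise on
/// whitespace; empty entries are dropped.
fn split_legacy_keywords(raw: &str) -> Vec<String> {
    let parts: Vec<&str> = if raw.contains(',') {
        raw.split(',').collect()
    } else {
        raw.split_whitespace().collect()
    };
    parts
        .into_iter()
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .collect()
}

fn main() {
    let keywords = split_legacy_keywords("seo, metadata, html");
    println!("{:?}", keywords);
}
```
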
### 2. Canonical URL: Field Rename

**Before (v3.x):**

```rust title="Canonical Field (v3.x)"
html_meta.canonical  // Option<String>
```

**After (v4.0):**

```rust title="Canonical URL Field (v4.0)"
html_meta.canonical_url  // Option<String>
```

### 3. Open Graph: Individual Fields to Map

**Before (v3.x):**

```rust title="Open Graph as Individual Fields"
html_meta.og_title        // Option<String>
html_meta.og_description  // Option<String>
html_meta.og_image        // Option<String>
html_meta.og_url          // Option<String>
html_meta.og_type         // Option<String>
html_meta.og_site_name    // Option<String>
```

**After (v4.0):**

```rust title="Open Graph as Map Structure"
html_meta.open_graph                     // BTreeMap<String, String>
html_meta.open_graph.get("title")        // Option<&String>
html_meta.open_graph.get("description")  // Option<&String>
html_meta.open_graph.get("image")        // Option<&String>
html_meta.open_graph.get("url")          // Option<&String>
html_meta.open_graph.get("type")         // Option<&String>
html_meta.open_graph.get("site_name")    // Option<&String>
```

### 4. Twitter Card: Individual Fields to Map

**Before (v3.x):**

```rust title="Twitter Card as Individual Fields"
html_meta.twitter_card         // Option<String>
html_meta.twitter_title        // Option<String>
html_meta.twitter_description  // Option<String>
html_meta.twitter_image        // Option<String>
html_meta.twitter_site         // Option<String>
html_meta.twitter_creator      // Option<String>
```

**After (v4.0):**

```rust title="Twitter Card as Map Structure"
html_meta.twitter_card                     // BTreeMap<String, String>
html_meta.twitter_card.get("card")         // Option<&String>
html_meta.twitter_card.get("title")        // Option<&String>
html_meta.twitter_card.get("description")  // Option<&String>
html_meta.twitter_card.get("image")        // Option<&String>
html_meta.twitter_card.get("site")         // Option<&String>
html_meta.twitter_card.get("creator")      // Option<&String>
```

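Because both `open_graph` and `twitter_card` are now plain `BTreeMap<String, String>` values, lookups across the two can be chained with standard `Option` combinators. The sketch below uses locally built maps rather than a real extraction result, and `social_title` is a hypothetical helper name chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: returns the Twitter Card title, falling back to
/// the Open Graph title when the Twitter map has no "title" entry.
fn social_title<'a>(
    twitter_card: &'a BTreeMap<String, String>,
    open_graph: &'a BTreeMap<String, String>,
) -> Option<&'a str> {
    twitter_card
        .get("title")
        .or_else(|| open_graph.get("title"))
        .map(String::as_str)
}

fn main() {
    // Stand-ins for html_meta.twitter_card and html_meta.open_graph.
    let twitter_card: BTreeMap<String, String> = BTreeMap::new();
    let mut open_graph: BTreeMap<String, String> = BTreeMap::new();
    open_graph.insert("title".to_string(), "Example page".to_string());

    println!("{:?}", social_title(&twitter_card, &open_graph));
}
```
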
### 5. Removed Fields

The following link-related fields have been removed:

- `link_author`
- `link_license`
- `link_alternate`

Use the new `links` field instead for comprehensive link extraction.

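One possible way to recover the removed values is to filter `links` by `rel`. This is a sketch under stated assumptions: it assumes that author, license, and alternate links appear in `links` with the matching token in their `rel` list, which this guide does not confirm. `LinkMetadata` here is a local stand-in reduced to two fields, and `hrefs_with_rel` is a hypothetical helper.

```rust
/// Local stand-in for the v4.0 link type, reduced to the fields used here.
struct LinkMetadata {
    href: String,
    rel: Vec<String>,
}

/// Hypothetical helper: collects the href of every link whose `rel`
/// list contains the given token (compared case-insensitively).
fn hrefs_with_rel<'a>(links: &'a [LinkMetadata], rel: &str) -> Vec<&'a str> {
    links
        .iter()
        .filter(|link| link.rel.iter().any(|r| r.eq_ignore_ascii_case(rel)))
        .map(|link| link.href.as_str())
        .collect()
}

fn main() {
    // Stand-in for html_meta.links.
    let links = vec![
        LinkMetadata {
            href: "https://example.com/about".to_string(),
            rel: vec!["author".to_string()],
        },
        LinkMetadata {
            href: "https://example.com/fr".to_string(),
            rel: vec!["alternate".to_string()],
        },
    ];

    println!("author: {:?}", hrefs_with_rel(&links, "author"));
    println!("license: {:?}", hrefs_with_rel(&links, "license"));
}
```
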
### 6. New Fields

HTML metadata now includes rich metadata about page content:

- **`language`**: Document language (for example, "en", "fr")
- **`text_direction`**: Text direction ("ltr", "rtl")
- **`headers`**: List of page headers/headings with structured metadata
- **`links`**: List of links with detailed metadata and type classification
- **`images`**: List of images with alt text, dimensions, and type classification
- **`structured_data`**: Parsed JSON-LD, microdata, and RDFa data
- **`meta_tags`**: All meta tags as a map

## Migration Guide

### Rust

=== "Before (v3.x)"

    ```rust
    use kreuzberg::{extract_file_sync, ExtractionConfig};

    let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
    if let Some(html_meta) = result.metadata.html {
        // Keywords as single string
        if let Some(keywords) = html_meta.keywords {
            let keyword_vec: Vec<&str> = keywords.split(',').map(|s| s.trim()).collect();
            println!("Keywords: {:?}", keyword_vec);
        }

        // Canonical as separate field
        if let Some(canonical) = html_meta.canonical {
            println!("Canonical: {}", canonical);
        }

        // Open Graph as individual fields
        if let Some(og_title) = html_meta.og_title {
            println!("OG Title: {}", og_title);
        }
        if let Some(og_image) = html_meta.og_image {
            println!("OG Image: {}", og_image);
        }

        // Twitter as individual fields
        if let Some(twitter_card) = html_meta.twitter_card {
            println!("Twitter Card: {}", twitter_card);
        }
    }
    ```

=== "After (v4.0)"

    ```rust
    use kreuzberg::{extract_file_sync, ExtractionConfig};

    let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
    if let Some(html_meta) = result.metadata.html {
        // Keywords as array
        if !html_meta.keywords.is_empty() {
            println!("Keywords: {:?}", html_meta.keywords);
        }

        // Canonical renamed
        if let Some(canonical_url) = html_meta.canonical_url {
            println!("Canonical URL: {}", canonical_url);
        }

        // Open Graph from map
        if let Some(og_title) = html_meta.open_graph.get("title") {
            println!("OG Title: {}", og_title);
        }
        if let Some(og_image) = html_meta.open_graph.get("image") {
            println!("OG Image: {}", og_image);
        }

        // Twitter from map
        if let Some(twitter_card) = html_meta.twitter_card.get("card") {
            println!("Twitter Card: {}", twitter_card);
        }

        // New fields
        if let Some(lang) = html_meta.language {
            println!("Language: {}", lang);
        }
        // `headers` and `links` are plain Vecs (not Options), so check for emptiness
        if !html_meta.headers.is_empty() {
            println!("Headers: {:?}", html_meta.headers);
        }
        // Each entry is a LinkMetadata struct with `href` and `text` fields
        for link in &html_meta.links {
            println!("Link: {} ({})", link.href, link.text);
        }
    }
    ```

### Python

=== "Before (v3.x)"

    ```python
    from kreuzberg import extract_file_sync, ExtractionConfig

    result = extract_file_sync("page.html", config=ExtractionConfig())
    html_meta = result.metadata.get("html", {})

    # Keywords as single string
    if html_meta.get('keywords'):
        keyword_list = html_meta['keywords'].split(',')
        print(f"Keywords: {keyword_list}")

    # Canonical as separate field
    if html_meta.get('canonical'):
        print(f"Canonical: {html_meta['canonical']}")

    # Open Graph as individual fields
    if html_meta.get('og_title'):
        print(f"OG Title: {html_meta['og_title']}")
    if html_meta.get('og_image'):
        print(f"OG Image: {html_meta['og_image']}")

    # Twitter as individual fields
    if html_meta.get('twitter_card'):
        print(f"Twitter Card: {html_meta['twitter_card']}")
    ```

=== "After (v4.0)"

    ```python
    from kreuzberg import extract_file_sync, ExtractionConfig

    result = extract_file_sync("page.html", config=ExtractionConfig())
    html_meta = result.metadata.get("html", {})

    # Keywords as array
    if html_meta.get('keywords'):
        print(f"Keywords: {html_meta['keywords']}")

    # Canonical renamed
    if html_meta.get('canonical_url'):
        print(f"Canonical URL: {html_meta['canonical_url']}")

    # Open Graph from map
    open_graph = html_meta.get('open_graph', {})
    if open_graph.get('title'):
        print(f"OG Title: {open_graph['title']}")
    if open_graph.get('image'):
        print(f"OG Image: {open_graph['image']}")

    # Twitter from map
    twitter_card = html_meta.get('twitter_card', {})
    if twitter_card.get('card'):
        print(f"Twitter Card: {twitter_card['card']}")

    # New fields
    if html_meta.get('language'):
        print(f"Language: {html_meta['language']}")

    if html_meta.get('headers'):
        print(f"Headers: {html_meta['headers']}")

    # Each link is a LinkMetadata entry with `href` and `text` keys
    if html_meta.get('links'):
        for link in html_meta['links']:
            print(f"Link: {link['href']} ({link['text']})")
    ```

### TypeScript

=== "Before (v3.x)"

    ```typescript
    import { extractFileSync } from '@kreuzberg/node';

    const result = extractFileSync('page.html');
    const htmlMeta = result.metadata;

    // Keywords as single string
    if (htmlMeta.keywords) {
      const keywordArray = htmlMeta.keywords.split(',');
      console.log('Keywords:', keywordArray);
    }

    // Canonical as separate field
    if (htmlMeta.canonical) {
      console.log('Canonical:', htmlMeta.canonical);
    }

    // Open Graph as individual fields
    if (htmlMeta.ogTitle) {
      console.log('OG Title:', htmlMeta.ogTitle);
    }
    if (htmlMeta.ogImage) {
      console.log('OG Image:', htmlMeta.ogImage);
    }

    // Twitter as individual fields
    if (htmlMeta.twitterCard) {
      console.log('Twitter Card:', htmlMeta.twitterCard);
    }
    ```

=== "After (v4.0)"

    ```typescript
    import { extractFileSync } from '@kreuzberg/node';

    const result = extractFileSync('page.html');
    const htmlMeta = result.metadata;

    // Keywords as array (truthiness check also covers an undefined field)
    if (htmlMeta.keywords?.length) {
      console.log('Keywords:', htmlMeta.keywords);
    }

    // Canonical renamed
    if (htmlMeta.canonicalUrl) {
      console.log('Canonical URL:', htmlMeta.canonicalUrl);
    }

    // Open Graph from map
    if (htmlMeta.openGraph) {
      if (htmlMeta.openGraph['title']) {
        console.log('OG Title:', htmlMeta.openGraph['title']);
      }
      if (htmlMeta.openGraph['image']) {
        console.log('OG Image:', htmlMeta.openGraph['image']);
      }
    }

    // Twitter from map
    if (htmlMeta.twitterCard) {
      if (htmlMeta.twitterCard['card']) {
        console.log('Twitter Card:', htmlMeta.twitterCard['card']);
      }
    }

    // New fields
    if (htmlMeta.language) {
      console.log('Language:', htmlMeta.language);
    }

    if (htmlMeta.headers?.length) {
      console.log('Headers:', htmlMeta.headers);
    }

    // Each link is a LinkMetadata object with `href` and `text` properties
    if (htmlMeta.links?.length) {
      htmlMeta.links.forEach((link) => {
        console.log(`Link: ${link.href} (${link.text})`);
      });
    }
    ```

### Java

=== "Before (v3.x)"

    ```java
    import dev.kreuzberg.Kreuzberg;
    import dev.kreuzberg.ExtractionResult;
    import java.util.Arrays;
    import java.util.Map;

    ExtractionResult result = Kreuzberg.extractFileSync("page.html");
    Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

    // Keywords as single string
    String keywords = (String) htmlMeta.get("keywords");
    if (keywords != null) {
        String[] keywordArray = keywords.split(",");
        System.out.println("Keywords: " + Arrays.toString(keywordArray));
    }

    // Canonical as separate field
    String canonical = (String) htmlMeta.get("canonical");
    if (canonical != null) {
        System.out.println("Canonical: " + canonical);
    }

    // Open Graph as individual fields
    String ogTitle = (String) htmlMeta.get("og_title");
    if (ogTitle != null) {
        System.out.println("OG Title: " + ogTitle);
    }

    // Twitter as individual fields
    String twitterCard = (String) htmlMeta.get("twitter_card");
    if (twitterCard != null) {
        System.out.println("Twitter Card: " + twitterCard);
    }
    ```

=== "After (v4.0)"

    ```java
    import dev.kreuzberg.Kreuzberg;
    import dev.kreuzberg.ExtractionResult;
    import java.util.Map;
    import java.util.List;

    ExtractionResult result = Kreuzberg.extractFileSync("page.html");
    Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

    // Keywords as array
    @SuppressWarnings("unchecked")
    List<String> keywords = (List<String>) htmlMeta.get("keywords");
    if (keywords != null && !keywords.isEmpty()) {
        System.out.println("Keywords: " + keywords);
    }

    // Canonical renamed
    String canonicalUrl = (String) htmlMeta.get("canonical_url");
    if (canonicalUrl != null) {
        System.out.println("Canonical URL: " + canonicalUrl);
    }

    // Open Graph from map
    @SuppressWarnings("unchecked")
    Map<String, String> openGraph = (Map<String, String>) htmlMeta.get("open_graph");
    if (openGraph != null) {
        String ogTitle = openGraph.get("title");
        if (ogTitle != null) {
            System.out.println("OG Title: " + ogTitle);
        }
    }

    // Twitter from map
    @SuppressWarnings("unchecked")
    Map<String, String> twitterCard = (Map<String, String>) htmlMeta.get("twitter_card");
    if (twitterCard != null) {
        String card = twitterCard.get("card");
        if (card != null) {
            System.out.println("Twitter Card: " + card);
        }
    }

    // New fields
    String language = (String) htmlMeta.get("language");
    if (language != null) {
        System.out.println("Language: " + language);
    }

    // Headers are structured HeaderMetadata entries, not plain strings
    @SuppressWarnings("unchecked")
    List<Map<String, Object>> headers = (List<Map<String, Object>>) htmlMeta.get("headers");
    if (headers != null && !headers.isEmpty()) {
        System.out.println("Headers: " + headers);
    }
    ```

### Go

=== "Before (v3.x)"

    ```go
    package main

    import (
        "fmt"
        "log"
        "strings"

        "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
    )

    func main() {
        result, err := kreuzberg.ExtractFileSync("page.html", nil)
        if err != nil {
            log.Fatalf("extract: %v", err)
        }

        if html, ok := result.Metadata.HTMLMetadata(); ok {
            // Keywords as single string
            if html.Keywords != nil {
                keywordSlice := strings.Split(*html.Keywords, ",")
                fmt.Println("Keywords:", keywordSlice)
            }

            // Canonical as separate field
            if html.Canonical != nil {
                fmt.Println("Canonical:", *html.Canonical)
            }

            // Open Graph as individual fields
            if html.OGTitle != nil {
                fmt.Println("OG Title:", *html.OGTitle)
            }
            if html.OGImage != nil {
                fmt.Println("OG Image:", *html.OGImage)
            }

            // Twitter as individual fields
            if html.TwitterCard != nil {
                fmt.Println("Twitter Card:", *html.TwitterCard)
            }
        }
    }
    ```

=== "After (v4.0)"

    ```go
    package main

    import (
        "fmt"
        "log"
        "strings"

        "github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
    )

    func main() {
        result, err := kreuzberg.ExtractFileSync("page.html", nil)
        if err != nil {
            log.Fatalf("extract: %v", err)
        }

        if html, ok := result.Metadata.HTMLMetadata(); ok {
            // Keywords as array
            if len(html.Keywords) > 0 {
                fmt.Println("Keywords:", strings.Join(html.Keywords, ", "))
            }

            // Canonical renamed
            if html.CanonicalURL != nil {
                fmt.Println("Canonical URL:", *html.CanonicalURL)
            }

            // Open Graph from map
            if len(html.OpenGraph) > 0 {
                if ogTitle, ok := html.OpenGraph["title"]; ok {
                    fmt.Println("OG Title:", ogTitle)
                }
                if ogImage, ok := html.OpenGraph["image"]; ok {
                    fmt.Println("OG Image:", ogImage)
                }
            }

            // Twitter from map
            if len(html.TwitterCard) > 0 {
                if card, ok := html.TwitterCard["card"]; ok {
                    fmt.Println("Twitter Card:", card)
                }
            }

            // New fields
            if html.Language != nil {
                fmt.Println("Language:", *html.Language)
            }

            // Headers and links are structured entries, so print them with %+v
            if len(html.Headers) > 0 {
                fmt.Printf("Headers: %+v\n", html.Headers)
            }

            for _, link := range html.Links {
                fmt.Printf("Link: %+v\n", link)
            }
        }
    }
    ```

### Ruby

=== "Before (v3.x)"

    ```ruby
    require 'kreuzberg'

    result = Kreuzberg.extract_file_sync('page.html')
    html_meta = result.metadata['html']

    # Keywords as single string
    if html_meta['keywords']
      keyword_array = html_meta['keywords'].split(',').map(&:strip)
      puts "Keywords: #{keyword_array}"
    end

    # Canonical as separate field
    if html_meta['canonical']
      puts "Canonical: #{html_meta['canonical']}"
    end

    # Open Graph as individual fields
    if html_meta['og_title']
      puts "OG Title: #{html_meta['og_title']}"
    end
    if html_meta['og_image']
      puts "OG Image: #{html_meta['og_image']}"
    end

    # Twitter as individual fields
    if html_meta['twitter_card']
      puts "Twitter Card: #{html_meta['twitter_card']}"
    end
    ```

=== "After (v4.0)"

    ```ruby
    require 'kreuzberg'

    result = Kreuzberg.extract_file_sync('page.html')
    html_meta = result.metadata['html']

    # Keywords as array
    if html_meta['keywords'] && !html_meta['keywords'].empty?
      puts "Keywords: #{html_meta['keywords']}"
    end

    # Canonical renamed
    if html_meta['canonical_url']
      puts "Canonical URL: #{html_meta['canonical_url']}"
    end

    # Open Graph from map
    open_graph = html_meta['open_graph'] || {}
    if open_graph['title']
      puts "OG Title: #{open_graph['title']}"
    end
    if open_graph['image']
      puts "OG Image: #{open_graph['image']}"
    end

    # Twitter from map
    twitter_card = html_meta['twitter_card'] || {}
    if twitter_card['card']
      puts "Twitter Card: #{twitter_card['card']}"
    end

    # New fields
    if html_meta['language']
      puts "Language: #{html_meta['language']}"
    end

    # Headers are structured entries; print the text of each one
    if html_meta['headers'] && !html_meta['headers'].empty?
      puts "Headers: #{html_meta['headers'].map { |h| h['text'] }.join(', ')}"
    end

    # Each link is a LinkMetadata entry with 'href' and 'text' keys
    if html_meta['links'] && !html_meta['links'].empty?
      html_meta['links'].each do |link|
        puts "Link: #{link['href']} (#{link['text']})"
      end
    end
    ```

## API Reference

For complete details on all HTML metadata fields and types, see:

- [HTML Metadata Type Reference](../reference/types.md#htmlmetadata)

## Structured Types Reference

### HeaderMetadata

Header elements extracted from the HTML document with hierarchy information.

```rust title="HeaderMetadata Struct Definition"
pub struct HeaderMetadata {
    pub level: u8,              // 1-6 (h1-h6)
    pub text: String,           // Normalized text content
    pub id: Option<String>,     // HTML id attribute
    pub depth: usize,           // Document tree depth
    pub html_offset: usize,     // Byte offset in original HTML
}
```

**Example:**

```json title="HeaderMetadata JSON Example"
{
  "level": 1,
  "text": "Welcome to Our Site",
  "id": "welcome-section",
  "depth": 2,
  "html_offset": 512
}
```

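The `level` field makes it straightforward to render a table of contents from `headers`. The sketch below assumes nothing beyond the `level` and `text` fields shown above; `HeaderMetadata` is redeclared locally with only those two fields, and `outline` is a hypothetical helper, not a library function.

```rust
/// Local stand-in for the v4.0 header type, reduced to the fields used here.
struct HeaderMetadata {
    level: u8,
    text: String,
}

/// Hypothetical helper: renders headers as an indented bullet outline,
/// two spaces of indentation per level below h1.
fn outline(headers: &[HeaderMetadata]) -> Vec<String> {
    headers
        .iter()
        .map(|h| {
            let indent = "  ".repeat(usize::from(h.level.saturating_sub(1)));
            format!("{indent}- {}", h.text)
        })
        .collect()
}

fn main() {
    // Stand-in for html_meta.headers.
    let headers = vec![
        HeaderMetadata { level: 1, text: "Welcome to Our Site".to_string() },
        HeaderMetadata { level: 2, text: "Getting Started".to_string() },
    ];
    for line in outline(&headers) {
        println!("{line}");
    }
}
```
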
### LinkMetadata

Link elements with type classification and detailed attributes.

```rust title="LinkMetadata Struct and LinkType Enum"
pub struct LinkMetadata {
    pub href: String,                         // The href URL value
    pub text: String,                         // Link text content
    pub title: Option<String>,                // Title attribute
    pub link_type: LinkType,                  // Classification enum
    pub rel: Vec<String>,                     // Rel attribute values
    pub attributes: HashMap<String, String>,  // Additional attributes
}

pub enum LinkType {
    Anchor,    // #section anchors
    Internal,  // Same domain links
    External,  // Different domain links
    Email,     // mailto: links
    Phone,     // tel: links
    Other,     // Other link types
}
```

**Example:**

```json title="LinkMetadata JSON Example"
{
  "href": "https://example.com",
  "text": "Visit Example",
  "title": "Example Website",
  "link_type": "external",
  "rel": ["nofollow"],
  "attributes": {
    "data-tracking": "yes"
  }
}
```

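The `link_type` classification lets callers select links by kind without parsing URLs themselves. The following sketch redeclares `LinkType` and a reduced `LinkMetadata` locally so that it compiles on its own; the derives on the enum and the helper `hrefs_of_type` are assumptions made for illustration.

```rust
/// Local copy of the classification enum shown above.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinkType {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
    Other,
}

/// Local stand-in for the v4.0 link type, reduced to the fields used here.
struct LinkMetadata {
    href: String,
    link_type: LinkType,
}

/// Hypothetical helper: returns the href of every link of the wanted type.
fn hrefs_of_type(links: &[LinkMetadata], wanted: LinkType) -> Vec<&str> {
    links
        .iter()
        .filter(|link| link.link_type == wanted)
        .map(|link| link.href.as_str())
        .collect()
}

fn main() {
    // Stand-in for html_meta.links.
    let links = vec![
        LinkMetadata { href: "https://example.com".to_string(), link_type: LinkType::External },
        LinkMetadata { href: "#intro".to_string(), link_type: LinkType::Anchor },
    ];
    println!("{:?}", hrefs_of_type(&links, LinkType::External));
}
```
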
### ImageMetadataType

Image elements with type classification and dimensions.

```rust title="ImageMetadataType Struct and ImageType Enum"
pub struct ImageMetadataType {
    pub src: String,                          // Image source (URL, data URI, or SVG)
    pub alt: Option<String>,                  // Alt text
    pub title: Option<String>,                // Title attribute
    pub dimensions: Option<(u32, u32)>,       // Width x Height
    pub image_type: ImageType,                // Classification enum
    pub attributes: HashMap<String, String>,  // Additional attributes
}

pub enum ImageType {
    DataUri,    // data: URI
    InlineSvg,  // Inline <svg> content
    External,   // External URL
    Relative,   // Relative path
}
```

**Example:**

```json title="ImageMetadataType JSON Example"
{
  "src": "https://cdn.example.com/image.jpg",
  "alt": "Product photo",
  "title": "Featured product",
  "dimensions": [400, 300],
  "image_type": "external",
  "attributes": {
    "loading": "lazy"
  }
}
```

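Since `alt` is optional, one use of `images` is a quick accessibility check. This is only a sketch: `ImageMetadataType` is redeclared locally with two fields, and `missing_alt` is a hypothetical helper that treats a whitespace-only `alt` the same as a missing one.

```rust
/// Local stand-in for the v4.0 image type, reduced to the fields used here.
struct ImageMetadataType {
    src: String,
    alt: Option<String>,
}

/// Hypothetical helper: returns the src of every image whose alt text
/// is absent or contains only whitespace.
fn missing_alt(images: &[ImageMetadataType]) -> Vec<&str> {
    images
        .iter()
        .filter(|img| img.alt.as_deref().map_or(true, |alt| alt.trim().is_empty()))
        .map(|img| img.src.as_str())
        .collect()
}

fn main() {
    // Stand-in for html_meta.images.
    let images = vec![
        ImageMetadataType {
            src: "https://cdn.example.com/image.jpg".to_string(),
            alt: Some("Product photo".to_string()),
        },
        ImageMetadataType { src: "/banner.png".to_string(), alt: None },
    ];
    println!("{:?}", missing_alt(&images));
}
```
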
### StructuredData

Extracted structured data blocks (JSON-LD, microdata, RDFa).

```rust title="StructuredData Struct and StructuredDataType Enum"
pub struct StructuredData {
    pub data_type: StructuredDataType,  // Classification enum
    pub raw_json: String,               // Raw JSON string
    pub schema_type: Option<String>,    // Schema type (e.g., "Article")
}

pub enum StructuredDataType {
    JsonLd,     // JSON-LD
    Microdata,  // microdata
    RDFa,       // RDFa
}
```

**Example:**

```json title="StructuredData JSON Example"
{
  "data_type": "json-ld",
  "raw_json": "{\"@context\": \"https://schema.org\", \"@type\": \"Article\", ...}",
  "schema_type": "Article"
}
```

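Because `raw_json` is kept as a string, callers can pick out the blocks they care about first and parse only those with a JSON library of their choice. The sketch below stops at the selection step; the types are redeclared locally, the derives are assumptions, and `json_ld_of_type` is a hypothetical helper.

```rust
/// Local copy of the classification enum shown above.
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq)]
enum StructuredDataType {
    JsonLd,
    Microdata,
    RDFa,
}

/// Local stand-in for the v4.0 structured data type.
struct StructuredData {
    data_type: StructuredDataType,
    raw_json: String,
    schema_type: Option<String>,
}

/// Hypothetical helper: returns the raw JSON of every JSON-LD block
/// whose schema type matches `schema` exactly.
fn json_ld_of_type<'a>(items: &'a [StructuredData], schema: &str) -> Vec<&'a str> {
    items
        .iter()
        .filter(|item| item.data_type == StructuredDataType::JsonLd)
        .filter(|item| item.schema_type.as_deref() == Some(schema))
        .map(|item| item.raw_json.as_str())
        .collect()
}

fn main() {
    // Stand-in for html_meta.structured_data.
    let items = vec![StructuredData {
        data_type: StructuredDataType::JsonLd,
        raw_json: r#"{"@type": "Article"}"#.to_string(),
        schema_type: Some("Article".to_string()),
    }];
    println!("{:?}", json_ld_of_type(&items, "Article"));
}
```
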
## Summary of Changes

| Field                                           | v3.x                               | v4.0                                                |
| ----------------------------------------------- | ---------------------------------- | --------------------------------------------------- |
| `keywords`                                      | `Option<String>`                   | `Vec<String>` with `#[serde(default)]`              |
| `canonical`                                     | `Option<String>`                   | Renamed to `canonical_url`                          |
| `og_*` fields (6 fields)                        | Individual `Option<String>` fields | `open_graph: BTreeMap<String, String>`              |
| `twitter_*` fields (6 fields)                   | Individual `Option<String>` fields | `twitter_card: BTreeMap<String, String>`            |
| `link_author`, `link_license`, `link_alternate` | Individual fields                  | Removed (use `links` field)                         |
| New: `language`                                 | N/A                                | `Option<String>`                                    |
| New: `text_direction`                           | N/A                                | `Option<TextDirection>`                             |
| New: `headers`                                  | N/A                                | `Vec<HeaderMetadata>` with `#[serde(default)]`      |
| New: `links`                                    | N/A                                | `Vec<LinkMetadata>` with `#[serde(default)]`        |
| New: `images`                                   | N/A                                | `Vec<ImageMetadataType>` with `#[serde(default)]`   |
| New: `structured_data`                          | N/A                                | `Vec<StructuredData>` with `#[serde(default)]`      |
| New: `meta_tags`                                | N/A                                | `BTreeMap<String, String>` with `#[serde(default)]` |

## Questions?

- See the [Types Reference](../reference/types.md) for complete API details
- Check [Working with Metadata](../getting-started/quickstart.md#read-document-metadata) for examples
- Open an issue on [GitHub](https://github.com/kreuzberg-dev/kreuzberg/issues)