322
docs/migration/from-unstructured.md
Normal file
@@ -0,0 +1,322 @@
|
||||
# Migrating from Unstructured to Kreuzberg
|
||||
|
||||
This guide helps you migrate from Unstructured.io to Kreuzberg for document intelligence workloads.
|
||||
|
||||
## Quick Start
|
||||
|
||||
**Unstructured API**:
|
||||
|
||||
```bash
|
||||
curl -X POST "https://api.unstructured.io/general/v0/general" \
|
||||
-F 'files=@document.pdf'
|
||||
```
|
||||
|
||||
**Kreuzberg API**:
|
||||
|
||||
```bash
|
||||
curl -X POST "http://localhost:8080/extract" \
|
||||
-F 'files=@document.pdf' \
|
||||
-F 'output_format=element_based'
|
||||
```
|
||||
|
||||
## Output Format Comparison
|
||||
|
||||
### Unified Output (Default)
|
||||
|
||||
Kreuzberg's default output provides richer metadata than Unstructured:
|
||||
|
||||
**Kreuzberg Unified**:
|
||||
|
||||
```json
|
||||
{
|
||||
"content": "Full document text...",
|
||||
"mime_type": "application/pdf",
|
||||
"metadata": {
|
||||
"title": "Document Title",
|
||||
"authors": ["Author Name"],
|
||||
"created_at": "2024-01-15T10:30:00Z",
|
||||
"format": {
|
||||
"format_type": "pdf",
|
||||
"page_count": 10,
|
||||
"version": "1.7"
|
||||
}
|
||||
},
|
||||
"tables": [...],
|
||||
"images": [...],
|
||||
"pages": [...]
|
||||
}
|
||||
```
|
||||
|
||||
### Element-Based Output
|
||||
|
||||
**Kreuzberg** (when `output_format=element_based`):
|
||||
|
||||
```json
|
||||
{
|
||||
"elements": [
|
||||
{
|
||||
"element_id": "elem-a3f2b1c4",
|
||||
"element_type": "title",
|
||||
"text": "Introduction",
|
||||
"metadata": {
|
||||
"page_number": 1,
|
||||
"filename": "Document Title",
|
||||
"coordinates": {
|
||||
"x0": 72.0,
|
||||
"y0": 100.0,
|
||||
"x1": 540.0,
|
||||
"y1": 130.0
|
||||
},
|
||||
"element_index": 0,
|
||||
"additional": {
|
||||
"level": "h1",
|
||||
"font_size": "24.0"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"element_type": "narrative_text",
|
||||
"text": "This is a paragraph...",
|
||||
"metadata": {
|
||||
"page_number": 1
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Unstructured**:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"type": "Title",
|
||||
"text": "Introduction",
|
||||
"metadata": {
|
||||
"page_number": 1,
|
||||
"filename": "document.pdf"
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "NarrativeText",
|
||||
"text": "This is a paragraph...",
|
||||
"metadata": {
|
||||
"page_number": 1
|
||||
}
|
||||
}
|
||||
]
|
||||
```
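If downstream code still expects Unstructured's list-of-elements shape, a thin adapter can ease the transition. The sketch below is built only from the two JSON samples above. The names `snake_to_pascal` and `to_unstructured_elements` are hypothetical and belong to neither library, and only the fields visible in those samples are handled.

```python
from typing import Any


def snake_to_pascal(name: str) -> str:
    """Convert a snake_case element type (e.g. "narrative_text") to PascalCase."""
    return "".join(part.capitalize() for part in name.split("_"))


def to_unstructured_elements(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Reshape a Kreuzberg element-based response into Unstructured-style elements.

    Hypothetical helper: it relies only on the field names shown in the
    JSON samples (elements, element_type, text, metadata).
    """
    converted = []
    for element in response.get("elements", []):
        converted.append(
            {
                "type": snake_to_pascal(element["element_type"]),
                "text": element["text"],
                # Copy the metadata so callers can mutate it safely.
                "metadata": dict(element.get("metadata", {})),
            }
        )
    return converted
```

Kreuzberg-specific metadata such as `coordinates` and `additional` passes through unchanged, so consumers that ignore unknown keys keep working.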
|
||||
|
||||
## API Endpoint Mapping
|
||||
|
||||
| Unstructured               | Kreuzberg          | Notes                             |
| -------------------------- | ------------------ | --------------------------------- |
| `POST /general/v0/general` | `POST /extract`    | Single/batch extraction           |
| N/A                        | `POST /embed`      | Built-in embeddings (ONNX models) |
| N/A                        | `GET /health`      | Health check                      |
| N/A                        | `GET /cache/stats` | Cache statistics                  |
|
||||
|
||||
## Element Type Mapping
|
||||
|
||||
| Unstructured    | Kreuzberg        | Notes                               |
| --------------- | ---------------- | ----------------------------------- |
| `Title`         | `title`          | PDF hierarchy (h1-h6) detection     |
| `NarrativeText` | `narrative_text` | Paragraphs split on double newlines |
| `ListItem`      | `list_item`      | Bullets, numbered, lettered         |
| `Table`         | `table`          | Tab-separated text representation   |
| `Image`         | `image`          | Format, dimensions in metadata      |
| `PageBreak`     | `page_break`     | Between pages in multi-page docs    |
| `Header`        | `header`         | Page header text                    |
| `Footer`        | `footer`         | Page footer text                    |
| N/A             | `heading`        | Section headings (beyond title)     |
| N/A             | `code_block`     | Code snippets                       |
| N/A             | `block_quote`    | Quoted text blocks                  |
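Code that branches on element type names needs to handle the three Kreuzberg-only types explicitly. The lookup below is a minimal sketch that copies the table into Python. The constant and function names are illustrative and not part of the Kreuzberg API.

```python
from typing import Optional

# Copied from the mapping table (Unstructured name -> Kreuzberg name).
UNSTRUCTURED_TO_KREUZBERG = {
    "Title": "title",
    "NarrativeText": "narrative_text",
    "ListItem": "list_item",
    "Table": "table",
    "Image": "image",
    "PageBreak": "page_break",
    "Header": "header",
    "Footer": "footer",
}

# Types the table lists as having no Unstructured counterpart.
KREUZBERG_ONLY = {"heading", "code_block", "block_quote"}

KREUZBERG_TO_UNSTRUCTURED = {v: k for k, v in UNSTRUCTURED_TO_KREUZBERG.items()}


def unstructured_equivalent(kreuzberg_type: str) -> Optional[str]:
    """Return the Unstructured type name, or None for Kreuzberg-only types.

    Raises KeyError for a type that appears in neither set, so unexpected
    values surface early instead of being silently dropped.
    """
    if kreuzberg_type in KREUZBERG_ONLY:
        return None
    return KREUZBERG_TO_UNSTRUCTURED[kreuzberg_type]
```

Returning `None` rather than guessing a fallback leaves the decision to the caller, for example treating `heading` like `Title` or skipping `code_block` entirely.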
|
||||
|
||||
## Code Examples
|
||||
|
||||
### Python
|
||||
|
||||
**Unstructured**:
|
||||
|
||||
```python
|
||||
from unstructured.partition.auto import partition
|
||||
|
||||
elements = partition(filename="document.pdf")
|
||||
for element in elements:
|
||||
print(f"{element.category}: {element.text}")
|
||||
```
|
||||
|
||||
**Kreuzberg**:
|
||||
|
||||
```python
from pathlib import Path

from kreuzberg import extract_bytes

# Read the document into memory; extract_bytes takes raw bytes plus a MIME type
pdf_bytes = Path("document.pdf").read_bytes()

# Option 1: Element-based output
config = {"output_format": "element_based"}
result = extract_bytes(pdf_bytes, "application/pdf", config)

for element in result.elements:
    print(f"{element.element_type}: {element.text}")
    if element.metadata.page_number:
        print(f"  Page: {element.metadata.page_number}")

# Option 2: Unified output (default, richer metadata)
result = extract_bytes(pdf_bytes, "application/pdf")
print(result.content)  # Full text
print(result.metadata.title)  # Document metadata
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:100]}")
```
|
||||
|
||||
### TypeScript
|
||||
|
||||
**Unstructured** (via API):
|
||||
|
||||
```typescript
|
||||
const formData = new FormData();
|
||||
formData.append("files", fileBlob);
|
||||
|
||||
const response = await fetch("https://api.unstructured.io/general/v0/general", {
|
||||
method: "POST",
|
||||
body: formData,
|
||||
});
|
||||
const elements = await response.json();
|
||||
```
|
||||
|
||||
**Kreuzberg**:
|
||||
|
||||
```typescript
import { extractBytes } from "kreuzberg";

// Option 1: Element-based output
const elementResult = await extractBytes(pdfBuffer, "application/pdf", {
  output_format: "element_based",
});

for (const element of elementResult.elements) {
  console.log(`${element.element_type}: ${element.text}`);
}

// Option 2: Unified output with pages
// (separate variable name: `const` cannot be redeclared in the same scope)
const pagedResult = await extractBytes(pdfBuffer, "application/pdf", {
  pages: { extract_pages: true },
});

for (const page of pagedResult.pages) {
  console.log(`Page ${page.page_number}:`, page.content);
}
```
|
||||
|
||||
### cURL
|
||||
|
||||
**Unstructured**:
|
||||
|
||||
```bash
|
||||
curl -X POST "https://api.unstructured.io/general/v0/general" \
|
||||
-H "unstructured-api-key: $API_KEY" \
|
||||
-F 'files=@document.pdf' \
|
||||
-F 'strategy=hi_res'
|
||||
```
|
||||
|
||||
**Kreuzberg**:
|
||||
|
||||
```bash
|
||||
# Element-based output
|
||||
curl -X POST "http://localhost:8080/extract" \
|
||||
-F 'files=@document.pdf' \
|
||||
-F 'output_format=element_based'
|
||||
|
||||
# With configuration JSON
|
||||
curl -X POST "http://localhost:8080/extract" \
|
||||
-F 'files=@document.pdf' \
|
||||
-F 'config={"output_format":"element_based","pages":{"extract_pages":true}}'
|
||||
```
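Hand-writing JSON inside a shell-quoted form field is error-prone once the configuration grows. As a small sketch, the `config` value used in the request above can be generated from a Python dictionary with the standard library. Only the keys already shown in the example are used.

```python
import json

config = {
    "output_format": "element_based",
    "pages": {"extract_pages": True},
}

# Compact separators reproduce the single-line JSON used in the curl example.
config_field = json.dumps(config, separators=(",", ":"))
print(config_field)
# prints {"output_format":"element_based","pages":{"extract_pages":true}}
```

The resulting string can be passed as the value of the `config` form field.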
|
||||
|
||||
## Feature Comparison
|
||||
|
||||
### What Kreuzberg Adds
|
||||
|
||||
1. **Richer Metadata**: Format-specific discriminated unions (PDF, Excel, Email, etc.)
2. **Native Per-Page**: `PageContent` with byte offsets, hierarchy, tables, images per page
3. **90+ Formats**: compared with roughly 30 formats in Unstructured
4. **Performance**: Rust-based native implementation (vs Python-based)
5. **10 Language Bindings**: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, Rust, WASM
6. **Built-in Embeddings**: ONNX models via `/embed` endpoint (no external API)
7. **Smart Hierarchy**: PDF font-size clustering for h1-h6 detection
8. **Bounding Boxes**: Preserved from PDF source in element coordinates
|
||||
|
||||
### What Unstructured Has
|
||||
|
||||
1. **Layout Detection Models**: ML-based layout analysis (GPU-accelerated)
2. **Cloud API**: Hosted service (Kreuzberg requires self-hosting)
3. **More Element Types**: More granular element classification
4. **Mature Ecosystem**: Larger community, more integrations
|
||||
|
||||
## Configuration Mapping
|
||||
|
||||
| Unstructured Parameter                | Kreuzberg Config                     | Notes                              |
| ------------------------------------- | ------------------------------------ | ---------------------------------- |
| `strategy=hi_res`                     | `pdf_options.hierarchy.enabled=true` | PDF hierarchy extraction           |
| `coordinates=true`                    | Always included when available       | Bounding boxes in element metadata |
| `languages=["eng"]`                   | `ocr.language="eng"`                 | OCR language                       |
| `extract_image_block_types=["image"]` | `images.extract_images=true`         | Image extraction                   |
| `chunking_strategy="by_title"`        | `chunking.max_chars=1000`            | Text chunking (basic)              |
| `embedding_model="..."`               | `chunking.embedding.model="..."`     | Embedding generation               |
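The table can be applied mechanically when many call sites pass Unstructured parameters. The function below is a sketch under stated assumptions: the nested key names are taken from the table and not verified against the Kreuzberg API, only the first entry of `languages` is carried over, and `1000` is the table's example value rather than a recommended setting. The name `translate_params` is hypothetical.

```python
from typing import Any


def translate_params(params: dict[str, Any]) -> dict[str, Any]:
    """Translate Unstructured request parameters into a Kreuzberg-style config dict."""
    config: dict[str, Any] = {}

    if params.get("strategy") == "hi_res":
        config.setdefault("pdf_options", {})["hierarchy"] = {"enabled": True}

    # `coordinates` needs no entry: the table says bounding boxes are
    # always included when available.

    languages = params.get("languages")
    if languages:
        # Assumption: a single OCR language, so only the first one is used.
        config["ocr"] = {"language": languages[0]}

    if params.get("extract_image_block_types"):
        config["images"] = {"extract_images": True}

    if params.get("chunking_strategy"):
        # The table maps title-based chunking to a size-based setting.
        config.setdefault("chunking", {})["max_chars"] = 1000

    if params.get("embedding_model"):
        config.setdefault("chunking", {})["embedding"] = {
            "model": params["embedding_model"]
        }

    return config
```

Because `chunking_strategy="by_title"` has no structural equivalent in the table, chunk boundaries will differ after migration and are worth checking on sample documents.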
|
||||
|
||||
## Migration Checklist
|
||||
|
||||
- [ ] Update API endpoint URLs (Unstructured → Kreuzberg)
- [ ] Add `output_format=element_based` if using element-based workflow
- [ ] Update element type references (`Title` → `title`, PascalCase → snake_case)
- [ ] Update metadata field references (Kreuzberg has richer metadata structure)
- [ ] Test with sample documents to verify output equivalence
- [ ] Update error handling (Kreuzberg uses HTTP 422 for validation errors)
- [ ] Configure caching if needed (Kreuzberg has built-in file-based cache)
- [ ] Set up embeddings if using RAG pipeline (Kreuzberg has built-in ONNX support)
|
||||
|
||||
## Advanced: Hybrid Approach
|
||||
|
||||
You can use **both formats** simultaneously:
|
||||
|
||||
```python
from kreuzberg import extract_bytes

result = extract_bytes(pdf_bytes, "application/pdf", {
    "output_format": "element_based",  # Get elements
    "pages": {"extract_pages": True},  # Also get per-page content
})

# Element-based processing
for element in result.elements:
    if element.element_type == "title":
        index_heading(element.text)  # index_heading: your own callback

# Page-based processing
for page in result.pages:
    if page.hierarchy:
        for block in page.hierarchy.blocks:
            if block.level == "h1":
                process_section(block.text)  # process_section: your own callback
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Enable Caching**: `use_cache: true` (default) for repeated extractions
2. **Disable OCR**: If documents are searchable PDFs, set `force_ocr: false`
3. **Limit Page Extraction**: Only enable `pages` if you need per-page content
4. **Batch Processing**: Send multiple files in a single request (up to 10MB total)
5. **Use Embeddings Wisely**: Enable only for chunked content destined for a vector DB
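For the batch-processing tip, files have to be grouped so that each request stays under the size limit. The helper below is a simple greedy sketch. It assumes "10MB" means 10 × 1024 × 1024 bytes, and the name `plan_batches` is illustrative rather than part of Kreuzberg.

```python
MAX_BATCH_BYTES = 10 * 1024 * 1024  # Assumption: "10MB" read as 10 MiB.


def plan_batches(
    file_sizes: dict[str, int], limit: int = MAX_BATCH_BYTES
) -> list[list[str]]:
    """Greedily group file names so each batch stays within the size limit."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for name, size in file_sizes.items():
        if size > limit:
            raise ValueError(f"{name} exceeds the batch limit on its own")
        # Start a new batch when adding this file would overflow the current one.
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

The greedy pass preserves input order and is not optimal in the bin-packing sense, which is usually acceptable when the goal is only to stay under a request limit.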
|
||||
|
||||
## Getting Help
|
||||
|
||||
- **Documentation**: <https://github.com/kreuzberg-dev/Kreuzberg>
|
||||
- **Issues**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
|
||||
- **API Reference**: See `docs/api/` for endpoint documentation
|
||||
|
||||
## Next Steps
|
||||
|
||||
After migration:
|
||||
|
||||
1. Review the [Kreuzberg vs Unstructured Comparison](../comparisons/kreuzberg-vs-unstructured.md)
|
||||
2. Explore Kreuzberg-specific features (hierarchy, per-page metadata, embeddings)
|
||||
3. Optimize your pipeline with native Rust performance
|
||||
644
docs/migration/v4.0-fonts.md
Normal file
@@ -0,0 +1,644 @@
|
||||
# Font Configuration Breaking Change (v4.0)
|
||||
|
||||
## Summary
|
||||
|
||||
Custom font provider is now **enabled by default** for improved PDF performance.
|
||||
|
||||
## Breaking Change
|
||||
|
||||
**Previous behavior** (v3.x):
|
||||
|
||||
- Font provider always enabled, not configurable
|
||||
- Used system fonts only
|
||||
- No user control over font loading
|
||||
|
||||
**New behavior** (v4.0):
|
||||
|
||||
- Font provider enabled by default
|
||||
- Configurable via `FontConfig` in `PdfConfig`
|
||||
- Can disable or add custom font directories
|
||||
- ~12-13% faster PDF processing with font caching
|
||||
|
||||
## Impact
|
||||
|
||||
**Who is affected?**
|
||||
|
||||
- Users who rely on the PDF extractor's default font fallback behavior
|
||||
- Users who want to disable the custom font provider
|
||||
- Users who need to add custom font directories
|
||||
|
||||
**What changes?**
|
||||
|
||||
- Default: Custom font provider now active (breaking change)
|
||||
- Performance: PDF extraction 12-13% faster
|
||||
- API: New `font_config` option in `PdfConfig`
|
||||
|
||||
## Migration
|
||||
|
||||
### No Action Required (Recommended)
|
||||
|
||||
Most users need no changes. The default behavior provides the performance improvement:
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust
|
||||
use kreuzberg::ExtractionConfig;
|
||||
|
||||
// Previous (v3.x) - no font configuration
|
||||
let config = ExtractionConfig::default();
|
||||
|
||||
// Current (v4.0) - same code, now with font provider enabled
|
||||
let config = ExtractionConfig::default();
|
||||
// Font provider automatically enabled with system fonts
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python
|
||||
from kreuzberg import ExtractionConfig
|
||||
|
||||
# Previous (v3.x)
|
||||
config = ExtractionConfig()
|
||||
|
||||
# Current (v4.0) - same code, now with font provider enabled
|
||||
config = ExtractionConfig()
|
||||
# Font provider automatically enabled with system fonts
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript
|
||||
import { ExtractionConfig } from 'kreuzberg';
|
||||
|
||||
// Previous (v3.x)
|
||||
const config: ExtractionConfig = {};
|
||||
|
||||
// Current (v4.0) - same code, now with font provider enabled
|
||||
const config: ExtractionConfig = {};
|
||||
// Font provider automatically enabled with system fonts
|
||||
```
|
||||
|
||||
=== "Java"
|
||||
|
||||
```java
|
||||
import dev.kreuzberg.config.*;
|
||||
|
||||
// Previous (v3.x)
|
||||
ExtractionConfig config = ExtractionConfig.builder().build();
|
||||
|
||||
// Current (v4.0) - same code, now with font provider enabled
|
||||
ExtractionConfig config = ExtractionConfig.builder().build();
|
||||
// Font provider automatically enabled with system fonts
|
||||
```
|
||||
|
||||
=== "Go"
|
||||
|
||||
```go
|
||||
import "github.com/kreuzberg-dev/kreuzberg/v4"
|
||||
|
||||
// Previous (v3.x)
|
||||
config := &kreuzberg.ExtractionConfig{}
|
||||
|
||||
// Current (v4.0) - same code, now with font provider enabled
|
||||
config := &kreuzberg.ExtractionConfig{}
|
||||
// Font provider automatically enabled with system fonts
|
||||
```
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
```ruby
|
||||
require 'kreuzberg'
|
||||
|
||||
# Previous (v3.x)
|
||||
config = Kreuzberg::ExtractionConfig.new
|
||||
|
||||
# Current (v4.0) - same code, now with font provider enabled
|
||||
config = Kreuzberg::ExtractionConfig.new
|
||||
# Font provider automatically enabled with system fonts
|
||||
```
|
||||
|
||||
=== "C#"
|
||||
|
||||
```csharp
|
||||
using Kreuzberg;
|
||||
|
||||
// Previous (v3.x)
|
||||
var config = new ExtractionConfig();
|
||||
|
||||
// Current (v4.0) - same code, now with font provider enabled
|
||||
var config = new ExtractionConfig();
|
||||
// Font provider automatically enabled with system fonts
|
||||
```
|
||||
|
||||
### Disable Font Provider
|
||||
|
||||
To fall back to the PDF extractor's default font handling, disable the font provider:
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust
|
||||
use kreuzberg::{ExtractionConfig, PdfConfig, FontConfig};
|
||||
|
||||
let config = ExtractionConfig {
|
||||
pdf_options: Some(PdfConfig {
|
||||
font_config: Some(FontConfig {
|
||||
enabled: false,
|
||||
custom_font_dirs: None,
|
||||
}),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python
|
||||
from kreuzberg import ExtractionConfig, PdfConfig, FontConfig
|
||||
|
||||
config = ExtractionConfig(
|
||||
pdf_options=PdfConfig(
|
||||
font_config=FontConfig(enabled=False)
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript
|
||||
import { ExtractionConfig } from 'kreuzberg';
|
||||
|
||||
const config: ExtractionConfig = {
|
||||
pdfOptions: {
|
||||
fontConfig: {
|
||||
enabled: false
|
||||
}
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
=== "Java"
|
||||
|
||||
```java
|
||||
import dev.kreuzberg.config.*;
|
||||
|
||||
FontConfig fontConfig = FontConfig.builder()
|
||||
.enabled(false)
|
||||
.build();
|
||||
|
||||
PdfConfig pdfConfig = PdfConfig.builder()
|
||||
.fontConfig(fontConfig)
|
||||
.build();
|
||||
|
||||
ExtractionConfig config = ExtractionConfig.builder()
|
||||
.pdfOptions(pdfConfig)
|
||||
.build();
|
||||
```
|
||||
|
||||
=== "Go"
|
||||
|
||||
```go
|
||||
import "github.com/kreuzberg-dev/kreuzberg/v4"
|
||||
|
||||
config := &kreuzberg.ExtractionConfig{
|
||||
PdfOptions: &kreuzberg.PdfConfig{
|
||||
FontConfig: &kreuzberg.FontConfig{
|
||||
Enabled: false,
|
||||
},
|
||||
},
|
||||
}
|
||||
```
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
```ruby
|
||||
require 'kreuzberg'
|
||||
|
||||
config = Kreuzberg::ExtractionConfig.new(
|
||||
pdf_options: Kreuzberg::PdfConfig.new(
|
||||
font_config: Kreuzberg::FontConfig.new(enabled: false)
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
=== "C#"
|
||||
|
||||
```csharp
|
||||
using Kreuzberg;
|
||||
|
||||
var fontConfig = new FontConfig { Enabled = false };
|
||||
var pdfConfig = new PdfConfig { FontConfig = fontConfig };
|
||||
var config = new ExtractionConfig { PdfOptions = pdfConfig };
|
||||
```
|
||||
|
||||
### Add Custom Font Directories
|
||||
|
||||
To use fonts from custom directories (in addition to system fonts):
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust
|
||||
use kreuzberg::{ExtractionConfig, PdfConfig, FontConfig};
|
||||
use std::path::PathBuf;
|
||||
|
||||
let config = ExtractionConfig {
|
||||
pdf_options: Some(PdfConfig {
|
||||
font_config: Some(FontConfig {
|
||||
enabled: true,
|
||||
custom_font_dirs: Some(vec![
|
||||
PathBuf::from("/usr/share/fonts/custom"),
|
||||
PathBuf::from("~/my-fonts"), // Tilde expanded automatically
|
||||
]),
|
||||
}),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python
|
||||
from kreuzberg import ExtractionConfig, PdfConfig, FontConfig
|
||||
|
||||
config = ExtractionConfig(
|
||||
pdf_options=PdfConfig(
|
||||
font_config=FontConfig(
|
||||
enabled=True,
|
||||
custom_font_dirs=[
|
||||
"/usr/share/fonts/custom",
|
||||
"~/my-fonts" # Tilde expanded automatically
|
||||
]
|
||||
)
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript
|
||||
import { ExtractionConfig } from 'kreuzberg';
|
||||
|
||||
const config: ExtractionConfig = {
|
||||
pdfOptions: {
|
||||
fontConfig: {
|
||||
enabled: true,
|
||||
customFontDirs: [
|
||||
'/usr/share/fonts/custom',
|
||||
'~/my-fonts' // Tilde expanded automatically
|
||||
]
|
||||
}
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
=== "Java"
|
||||
|
||||
```java
|
||||
import dev.kreuzberg.config.*;
|
||||
import java.nio.file.Paths;
import java.util.Arrays;
|
||||
|
||||
FontConfig fontConfig = FontConfig.builder()
|
||||
.enabled(true)
|
||||
.customFontDirs(Arrays.asList(
|
||||
Paths.get("/usr/share/fonts/custom"),
|
||||
Paths.get("~/my-fonts") // Tilde expanded automatically
|
||||
))
|
||||
.build();
|
||||
|
||||
PdfConfig pdfConfig = PdfConfig.builder()
|
||||
.fontConfig(fontConfig)
|
||||
.build();
|
||||
|
||||
ExtractionConfig config = ExtractionConfig.builder()
|
||||
.pdfOptions(pdfConfig)
|
||||
.build();
|
||||
```
|
||||
|
||||
=== "Go"
|
||||
|
||||
```go
|
||||
import "github.com/kreuzberg-dev/kreuzberg/v4"
|
||||
|
||||
config := &kreuzberg.ExtractionConfig{
|
||||
PdfOptions: &kreuzberg.PdfConfig{
|
||||
FontConfig: &kreuzberg.FontConfig{
|
||||
Enabled: true,
|
||||
CustomFontDirs: []string{
|
||||
"/usr/share/fonts/custom",
|
||||
"~/my-fonts", // Tilde expanded automatically
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
```
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
```ruby
|
||||
require 'kreuzberg'
|
||||
|
||||
config = Kreuzberg::ExtractionConfig.new(
|
||||
pdf_options: Kreuzberg::PdfConfig.new(
|
||||
font_config: Kreuzberg::FontConfig.new(
|
||||
enabled: true,
|
||||
custom_font_dirs: [
|
||||
'/usr/share/fonts/custom',
|
||||
'~/my-fonts' # Tilde expanded automatically
|
||||
]
|
||||
)
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
=== "C#"
|
||||
|
||||
```csharp
|
||||
using Kreuzberg;
|
||||
|
||||
var fontConfig = new FontConfig
|
||||
{
|
||||
Enabled = true,
|
||||
CustomFontDirs = new[]
|
||||
{
|
||||
"/usr/share/fonts/custom",
|
||||
"~/my-fonts" // Tilde expanded automatically
|
||||
}
|
||||
};
|
||||
|
||||
var pdfConfig = new PdfConfig { FontConfig = fontConfig };
|
||||
var config = new ExtractionConfig { PdfOptions = pdfConfig };
|
||||
```
|
||||
|
||||
## Configuration Files
|
||||
|
||||
### TOML Format
|
||||
|
||||
```toml title="Font Configuration in TOML"
|
||||
[pdf_options.font_config]
|
||||
enabled = true
|
||||
custom_font_dirs = ["/usr/share/fonts/custom", "~/my-fonts"]
|
||||
```
|
||||
|
||||
### YAML Format
|
||||
|
||||
```yaml title="Font Configuration in YAML"
|
||||
pdf_options:
|
||||
font_config:
|
||||
enabled: true
|
||||
custom_font_dirs:
|
||||
- /usr/share/fonts/custom
|
||||
- ~/my-fonts
|
||||
```
|
||||
|
||||
### JSON Format
|
||||
|
||||
```json title="Font Configuration in JSON"
|
||||
{
|
||||
"pdf_options": {
|
||||
"font_config": {
|
||||
"enabled": true,
|
||||
"custom_font_dirs": ["/usr/share/fonts/custom", "~/my-fonts"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Path Handling
|
||||
|
||||
The font configuration automatically handles:
|
||||
|
||||
- **Tilde expansion**: `~/fonts` → `/Users/username/fonts`
- **Relative paths**: `./fonts` → `/absolute/path/to/fonts`
- **Symlinks**: Resolved to canonical paths (security measure)
- **Validation**: Directories must exist; warnings logged if not found
- **Graceful degradation**: Missing directories don't cause failures
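The rules above can be approximated in a few lines of Python, which is handy for checking a directory list before passing it to `FontConfig`. This is an illustration of the described behavior and not Kreuzberg's actual implementation. The function name `normalize_font_dirs` is hypothetical, and the warning text is borrowed from the troubleshooting guidance.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def normalize_font_dirs(dirs: list[str]) -> list[Path]:
    """Expand, canonicalize, and validate candidate font directories."""
    usable = []
    for raw in dirs:
        # expanduser handles "~"; resolve makes the path absolute and
        # follows symlinks to the canonical location.
        path = Path(raw).expanduser().resolve()
        if path.is_dir():
            usable.append(path)
        else:
            # Missing directories are skipped with a warning, not an error.
            logger.warning("Custom font directory not found: %s", path)
    return usable
```

Running this over a configured list shows which entries would survive validation and which would only produce a warning.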
|
||||
|
||||
## Global Configuration
|
||||
|
||||
**Important**: Font configuration is global per process and must be set **before the first PDF extraction**.
|
||||
|
||||
=== "Rust"
|
||||
|
||||
```rust
|
||||
// CORRECT: Set config before first extraction
|
||||
let config = ExtractionConfig {
|
||||
pdf_options: Some(PdfConfig {
|
||||
font_config: Some(FontConfig {
|
||||
enabled: true,
|
||||
custom_font_dirs: Some(vec![
|
||||
PathBuf::from("/usr/share/fonts/custom"),
|
||||
]),
|
||||
}),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let result = kreuzberg::extract_file("document.pdf", &config)?;
|
||||
|
||||
// INCORRECT: Attempting to change config after first extraction
|
||||
let new_config = ExtractionConfig {
|
||||
pdf_options: Some(PdfConfig {
|
||||
font_config: Some(FontConfig {
|
||||
enabled: false,
|
||||
custom_font_dirs: None,
|
||||
}),
|
||||
..Default::default()
|
||||
}),
|
||||
..Default::default()
|
||||
};
|
||||
let result2 = kreuzberg::extract_file("document2.pdf", &new_config)?;
|
||||
// Warning logged: "Font config already initialized"
|
||||
```
|
||||
|
||||
=== "Python"
|
||||
|
||||
```python
|
||||
# CORRECT: Set config before first extraction
|
||||
config = ExtractionConfig(
|
||||
pdf_options=PdfConfig(
|
||||
font_config=FontConfig(
|
||||
enabled=True,
|
||||
custom_font_dirs=["/usr/share/fonts/custom"]
|
||||
)
|
||||
)
|
||||
)
|
||||
result = extract_file("document.pdf", config)
|
||||
|
||||
# INCORRECT: Attempting to change config after first extraction
|
||||
new_config = ExtractionConfig(
|
||||
pdf_options=PdfConfig(
|
||||
font_config=FontConfig(enabled=False)
|
||||
)
|
||||
)
|
||||
result2 = extract_file("document2.pdf", new_config)
|
||||
# Warning logged: "Font config already initialized"
|
||||
```
|
||||
|
||||
=== "TypeScript"
|
||||
|
||||
```typescript
|
||||
// CORRECT: Set config before first extraction
|
||||
const config: ExtractionConfig = {
|
||||
pdfOptions: {
|
||||
fontConfig: {
|
||||
enabled: true,
|
||||
customFontDirs: ['/usr/share/fonts/custom']
|
||||
}
|
||||
}
|
||||
};
|
||||
const result = await extractFile('document.pdf', config);
|
||||
|
||||
// INCORRECT: Attempting to change config after first extraction
|
||||
const newConfig: ExtractionConfig = {
|
||||
pdfOptions: {
|
||||
fontConfig: { enabled: false }
|
||||
}
|
||||
};
|
||||
const result2 = await extractFile('document2.pdf', newConfig);
|
||||
// Warning logged: "Font config already initialized"
|
||||
```
|
||||
|
||||
=== "Java"
|
||||
|
||||
```java
|
||||
// CORRECT: Set config before first extraction
|
||||
FontConfig fontConfig = FontConfig.builder()
|
||||
.enabled(true)
|
||||
.customFontDirs(Arrays.asList(Paths.get("/usr/share/fonts/custom")))
|
||||
.build();
|
||||
PdfConfig pdfConfig = PdfConfig.builder()
|
||||
.fontConfig(fontConfig)
|
||||
.build();
|
||||
ExtractionConfig config = ExtractionConfig.builder()
|
||||
.pdfOptions(pdfConfig)
|
||||
.build();
|
||||
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);
|
||||
|
||||
// INCORRECT: Attempting to change config after first extraction
|
||||
FontConfig newFontConfig = FontConfig.builder()
|
||||
.enabled(false)
|
||||
.build();
|
||||
PdfConfig newPdfConfig = PdfConfig.builder()
|
||||
.fontConfig(newFontConfig)
|
||||
.build();
|
||||
ExtractionConfig newConfig = ExtractionConfig.builder()
|
||||
.pdfOptions(newPdfConfig)
|
||||
.build();
|
||||
ExtractionResult result2 = Kreuzberg.extractFile("document2.pdf", newConfig);
|
||||
// Warning logged: "Font config already initialized"
|
||||
```
|
||||
|
||||
=== "Go"
|
||||
|
||||
```go
|
||||
// CORRECT: Set config before first extraction
|
||||
config := &kreuzberg.ExtractionConfig{
|
||||
PdfOptions: &kreuzberg.PdfConfig{
|
||||
FontConfig: &kreuzberg.FontConfig{
|
||||
Enabled: true,
|
||||
CustomFontDirs: []string{"/usr/share/fonts/custom"},
|
||||
},
|
||||
},
|
||||
}
|
||||
result, _ := kreuzberg.ExtractFile("document.pdf", config)
|
||||
|
||||
// INCORRECT: Attempting to change config after first extraction
|
||||
newConfig := &kreuzberg.ExtractionConfig{
|
||||
PdfOptions: &kreuzberg.PdfConfig{
|
||||
FontConfig: &kreuzberg.FontConfig{
|
||||
Enabled: false,
|
||||
},
|
||||
},
|
||||
}
|
||||
result2, _ := kreuzberg.ExtractFile("document2.pdf", newConfig)
|
||||
// Warning logged: "Font config already initialized"
|
||||
```
|
||||
|
||||
=== "Ruby"
|
||||
|
||||
```ruby
|
||||
# CORRECT: Set config before first extraction
|
||||
config = Kreuzberg::ExtractionConfig.new(
|
||||
pdf_options: Kreuzberg::PdfConfig.new(
|
||||
font_config: Kreuzberg::FontConfig.new(
|
||||
enabled: true,
|
||||
custom_font_dirs: ['/usr/share/fonts/custom']
|
||||
)
|
||||
)
|
||||
)
|
||||
result = Kreuzberg.extract_file('document.pdf', config)
|
||||
|
||||
# INCORRECT: Attempting to change config after first extraction
|
||||
new_config = Kreuzberg::ExtractionConfig.new(
|
||||
pdf_options: Kreuzberg::PdfConfig.new(
|
||||
font_config: Kreuzberg::FontConfig.new(enabled: false)
|
||||
)
|
||||
)
|
||||
result2 = Kreuzberg.extract_file('document2.pdf', new_config)
|
||||
# Warning logged: "Font config already initialized"
|
||||
```
|
||||
|
||||
=== "C#"
|
||||
|
||||
```csharp
|
||||
// CORRECT: Set config before first extraction
|
||||
var fontConfig = new FontConfig
|
||||
{
|
||||
Enabled = true,
|
||||
CustomFontDirs = new[] { "/usr/share/fonts/custom" }
|
||||
};
|
||||
var pdfConfig = new PdfConfig { FontConfig = fontConfig };
|
||||
var config = new ExtractionConfig { PdfOptions = pdfConfig };
|
||||
var result = Kreuzberg.ExtractFile("document.pdf", config);
|
||||
|
||||
// INCORRECT: Attempting to change config after first extraction
|
||||
var newFontConfig = new FontConfig { Enabled = false };
|
||||
var newPdfConfig = new PdfConfig { FontConfig = newFontConfig };
|
||||
var newConfig = new ExtractionConfig { PdfOptions = newPdfConfig };
|
||||
var result2 = Kreuzberg.ExtractFile("document2.pdf", newConfig);
|
||||
// Warning logged: "Font config already initialized"
|
||||
```
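The "first configuration wins" behavior shown in the examples above follows a common set-once pattern for process-wide state. The Python sketch below models that pattern in isolation. It is not Kreuzberg's code, the names are hypothetical, and warning only when the later config differs is an assumption.

```python
import logging
import threading
from typing import Any, Optional

logger = logging.getLogger(__name__)

_lock = threading.Lock()
_active_font_config: Optional[dict[str, Any]] = None


def init_font_config(requested: dict[str, Any]) -> dict[str, Any]:
    """Store the first font config seen; later, different configs are ignored."""
    global _active_font_config
    with _lock:
        if _active_font_config is None:
            # First caller wins and fixes the config for the whole process.
            _active_font_config = dict(requested)
        elif requested != _active_font_config:
            logger.warning("Font config already initialized")
        return _active_font_config


first = init_font_config({"enabled": True})
second = init_font_config({"enabled": False})  # ignored, logs a warning
```

Here `second` is the configuration from the first call, which mirrors why a changed `FontConfig` has no effect after the first PDF extraction.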
|
||||
|
||||
## Performance Impact
|
||||
|
||||
With default settings (enabled=true, system fonts):
|
||||
|
||||
- **PDF extraction**: ~12-13% faster
|
||||
- **Memory**: Minimal increase (~100KB for font cache)
|
||||
- **Startup**: Lazy initialization (no overhead for non-PDF workloads)
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Custom fonts not working
|
||||
|
||||
**Symptom**: PDF still uses fallback fonts
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. Verify directories exist and contain `.ttf`, `.otf`, or `.ttc` files
2. Check logs for "Custom font directory not found" warnings
3. Ensure paths are absolute or properly expanded
4. Verify font files are readable
|
||||
|
||||
### "Font config already initialized" warning
|
||||
|
||||
**Symptom**: Configuration changes ignored after first PDF extraction
|
||||
|
||||
**Solution**: Set `FontConfig` in the **first** `ExtractionConfig` used in the process. Later changes are ignored because font configuration is global per process.
|
||||
|
||||
### Performance regression
|
||||
|
||||
**Symptom**: PDF extraction slower after upgrade
|
||||
|
||||
**Solution**: This is unexpected. Please report as a bug with:
|
||||
|
||||
- PDF sample (if shareable)
|
||||
- Benchmark comparison (before/after)
|
||||
- Configuration used
|
||||
|
||||
## Questions?
|
||||
|
||||
- **Issue tracker**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
|
||||
- **Discussions**: <https://github.com/kreuzberg-dev/kreuzberg/discussions>
|
||||
799
docs/migration/v4.0-html-metadata.md
Normal file
@@ -0,0 +1,799 @@
|
||||
# HTML Metadata Structure Changes (v4.0)
|
||||
|
||||
## Summary
|
||||
|
||||
HTML metadata has been restructured for better organization and type safety. The changes consolidate individual Open Graph and Twitter Card fields into maps, and convert keywords from a single string to an array.
|
||||
|
||||
## Breaking Changes
|
||||
|
||||
### 1. Keywords: String to Array
|
||||
|
||||
**Before (v3.x):**
|
||||
|
||||
```rust title="Keywords as Comma-Separated String"
|
||||
// Option<String> - comma-separated or space-separated
|
||||
html_meta.keywords // "seo, metadata, html"
|
||||
```
|
||||
|
||||
**After (v4.0):**
|
||||
|
||||
```rust title="Keywords as Structured Array"
|
||||
// Vec<String> - structured array
|
||||
html_meta.keywords // vec!["seo", "metadata", "html"]
|
||||
```
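Stored v3.x results (for example, cached or indexed metadata) still carry keywords as a single string. A minimal sketch for converting such values to the v4.0 list shape follows. It assumes commas take precedence over whitespace as the separator, and `split_keywords` is a hypothetical helper, not a Kreuzberg function.

```python
from typing import Optional


def split_keywords(raw: Optional[str]) -> list[str]:
    """Split a v3.x keywords string into a v4.0-style list."""
    if not raw:
        return []
    # Prefer commas when present; otherwise fall back to whitespace.
    parts = raw.split(",") if "," in raw else raw.split()
    return [part.strip() for part in parts if part.strip()]


print(split_keywords("seo, metadata, html"))
# prints ['seo', 'metadata', 'html']
```

Multi-word keywords such as "open graph" survive only in the comma-separated form, since whitespace splitting would break them apart.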
|
||||
|
||||
### 2. Canonical URL: Field Rename
|
||||
|
||||
**Before (v3.x):**
|
||||
|
||||
```rust title="Canonical Field (v3.x)"
|
||||
html_meta.canonical // Option<String>
|
||||
```
|
||||
|
||||
**After (v4.0):**
|
||||
|
||||
```rust title="Canonical URL Field (v4.0)"
|
||||
html_meta.canonical_url // Option<String>
|
||||
```
|
||||
|
||||
### 3. Open Graph: Individual Fields to Map
|
||||
|
||||
**Before (v3.x):**
|
||||
|
||||
```rust title="Open Graph as Individual Fields"
|
||||
html_meta.og_title // Option<String>
|
||||
html_meta.og_description // Option<String>
|
||||
html_meta.og_image // Option<String>
|
||||
html_meta.og_url // Option<String>
|
||||
html_meta.og_type // Option<String>
|
||||
html_meta.og_site_name // Option<String>
|
||||
```
|
||||
|
||||
**After (v4.0):**
|
||||
|
||||
```rust title="Open Graph as Map Structure"
|
||||
html_meta.open_graph // BTreeMap<String, String>
|
||||
html_meta.open_graph.get("title") // Option<&String>
|
||||
html_meta.open_graph.get("description") // Option<&String>
|
||||
html_meta.open_graph.get("image") // Option<&String>
|
||||
html_meta.open_graph.get("url") // Option<&String>
|
||||
html_meta.open_graph.get("type") // Option<&String>
|
||||
html_meta.open_graph.get("site_name") // Option<&String>
|
||||
```
|
||||
|
||||
### 4. Twitter Card: Individual Fields to Map

**Before (v3.x):**

```rust title="Twitter Card as Individual Fields"
html_meta.twitter_card        // Option<String>
html_meta.twitter_title       // Option<String>
html_meta.twitter_description // Option<String>
html_meta.twitter_image       // Option<String>
html_meta.twitter_site        // Option<String>
html_meta.twitter_creator     // Option<String>
```

**After (v4.0):**

```rust title="Twitter Card as Map Structure"
html_meta.twitter_card                    // BTreeMap<String, String>
html_meta.twitter_card.get("card")        // Option<&String>
html_meta.twitter_card.get("title")       // Option<&String>
html_meta.twitter_card.get("description") // Option<&String>
html_meta.twitter_card.get("image")       // Option<&String>
html_meta.twitter_card.get("site")        // Option<&String>
html_meta.twitter_card.get("creator")     // Option<&String>
```

### 5. Removed Fields

The following link-related fields have been removed:

- `link_author`
- `link_license`
- `link_alternate`

Use the new `links` field instead for comprehensive link extraction.

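One way to recover the information the removed fields used to carry is to filter `links` by `rel` value. The sketch below assumes that the relevant elements appear in `links` with their `rel` values populated, which this guide does not confirm. `Link` is a local stand-in carrying two of the documented `LinkMetadata` fields, and `hrefs_with_rel` is a hypothetical helper.

```rust
/// Local stand-in with the two `LinkMetadata` fields this sketch needs.
struct Link {
    href: String,
    rel: Vec<String>,
}

/// Return the hrefs of all links whose `rel` list contains `wanted`
/// (compared case-insensitively).
fn hrefs_with_rel<'a>(links: &'a [Link], wanted: &str) -> Vec<&'a str> {
    links
        .iter()
        .filter(|link| link.rel.iter().any(|r| r.eq_ignore_ascii_case(wanted)))
        .map(|link| link.href.as_str())
        .collect()
}
```

Under that assumption, `hrefs_with_rel(&links, "author")` would stand in for the old `link_author`, and likewise `"license"` and `"alternate"` for the other two fields.
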
### 6. New Fields

HTML metadata now includes rich metadata about page content:

- **`language`**: Document language (for example, "en", "fr")
- **`text_direction`**: Text direction ("ltr", "rtl")
- **`headers`**: List of page headers/headings with structured metadata
- **`links`**: List of links with detailed metadata and type classification
- **`images`**: List of images with alt text, dimensions, and type classification
- **`structured_data`**: Parsed JSON-LD, microdata, and RDFa data
- **`meta_tags`**: All meta tags as a map

## Migration Guide

### Rust

=== "Before (v3.x)"

    ```rust
    use kreuzberg::{extract_file_sync, ExtractionConfig};

    let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
    if let Some(html_meta) = result.metadata.html {
        // Keywords as single string
        if let Some(keywords) = html_meta.keywords {
            let keyword_vec: Vec<&str> = keywords.split(',').map(|s| s.trim()).collect();
            println!("Keywords: {:?}", keyword_vec);
        }

        // Canonical as separate field
        if let Some(canonical) = html_meta.canonical {
            println!("Canonical: {}", canonical);
        }

        // Open Graph as individual fields
        if let Some(og_title) = html_meta.og_title {
            println!("OG Title: {}", og_title);
        }
        if let Some(og_image) = html_meta.og_image {
            println!("OG Image: {}", og_image);
        }

        // Twitter as individual fields
        if let Some(twitter_card) = html_meta.twitter_card {
            println!("Twitter Card: {}", twitter_card);
        }
    }
    ```

=== "After (v4.0)"

    ```rust
    use kreuzberg::{extract_file_sync, ExtractionConfig};

    let result = extract_file_sync("page.html", None, &ExtractionConfig::default())?;
    if let Some(html_meta) = result.metadata.html {
        // Keywords as array
        if !html_meta.keywords.is_empty() {
            println!("Keywords: {:?}", html_meta.keywords);
        }

        // Canonical renamed
        if let Some(canonical_url) = html_meta.canonical_url {
            println!("Canonical URL: {}", canonical_url);
        }

        // Open Graph from map
        if let Some(og_title) = html_meta.open_graph.get("title") {
            println!("OG Title: {}", og_title);
        }
        if let Some(og_image) = html_meta.open_graph.get("image") {
            println!("OG Image: {}", og_image);
        }

        // Twitter from map
        if let Some(twitter_card) = html_meta.twitter_card.get("card") {
            println!("Twitter Card: {}", twitter_card);
        }

        // New fields
        if let Some(lang) = html_meta.language {
            println!("Language: {}", lang);
        }
        // `headers` is a Vec<HeaderMetadata>, not an Option
        if !html_meta.headers.is_empty() {
            println!("Headers: {:?}", html_meta.headers);
        }
        // `links` is a Vec<LinkMetadata>; read the struct fields
        for link in &html_meta.links {
            println!("Link: {} ({})", link.href, link.text);
        }
    }
    ```

### Python

=== "Before (v3.x)"

    ```python
    from kreuzberg import extract_file_sync, ExtractionConfig

    result = extract_file_sync("page.html", config=ExtractionConfig())
    html_meta = result.metadata.get("html", {})

    # Keywords as single string
    if html_meta.get('keywords'):
        keyword_list = html_meta['keywords'].split(',')
        print(f"Keywords: {keyword_list}")

    # Canonical as separate field
    if html_meta.get('canonical'):
        print(f"Canonical: {html_meta['canonical']}")

    # Open Graph as individual fields
    if html_meta.get('og_title'):
        print(f"OG Title: {html_meta['og_title']}")
    if html_meta.get('og_image'):
        print(f"OG Image: {html_meta['og_image']}")

    # Twitter as individual fields
    if html_meta.get('twitter_card'):
        print(f"Twitter Card: {html_meta['twitter_card']}")
    ```

=== "After (v4.0)"

    ```python
    from kreuzberg import extract_file_sync, ExtractionConfig

    result = extract_file_sync("page.html", config=ExtractionConfig())
    html_meta = result.metadata.get("html", {})

    # Keywords as array
    if html_meta.get('keywords'):
        print(f"Keywords: {html_meta['keywords']}")

    # Canonical renamed
    if html_meta.get('canonical_url'):
        print(f"Canonical URL: {html_meta['canonical_url']}")

    # Open Graph from map
    open_graph = html_meta.get('open_graph', {})
    if open_graph.get('title'):
        print(f"OG Title: {open_graph['title']}")
    if open_graph.get('image'):
        print(f"OG Image: {open_graph['image']}")

    # Twitter from map
    twitter_card = html_meta.get('twitter_card', {})
    if twitter_card.get('card'):
        print(f"Twitter Card: {twitter_card['card']}")

    # New fields
    if html_meta.get('language'):
        print(f"Language: {html_meta['language']}")

    if html_meta.get('headers'):
        print(f"Headers: {html_meta['headers']}")

    # Each link is a structured entry (see LinkMetadata), not a (url, text) pair
    for link in html_meta.get('links', []):
        print(f"Link: {link['href']} ({link['text']})")
    ```

### TypeScript

=== "Before (v3.x)"

    ```typescript
    import { extractFileSync } from '@kreuzberg/node';

    const result = extractFileSync('page.html');
    const htmlMeta = result.metadata;

    // Keywords as single string
    if (htmlMeta.keywords) {
      const keywordArray = htmlMeta.keywords.split(',');
      console.log('Keywords:', keywordArray);
    }

    // Canonical as separate field
    if (htmlMeta.canonical) {
      console.log('Canonical:', htmlMeta.canonical);
    }

    // Open Graph as individual fields
    if (htmlMeta.ogTitle) {
      console.log('OG Title:', htmlMeta.ogTitle);
    }
    if (htmlMeta.ogImage) {
      console.log('OG Image:', htmlMeta.ogImage);
    }

    // Twitter as individual fields
    if (htmlMeta.twitterCard) {
      console.log('Twitter Card:', htmlMeta.twitterCard);
    }
    ```

=== "After (v4.0)"

    ```typescript
    import { extractFileSync } from '@kreuzberg/node';

    const result = extractFileSync('page.html');
    const htmlMeta = result.metadata;

    // Keywords as array
    if (htmlMeta.keywords?.length) {
      console.log('Keywords:', htmlMeta.keywords);
    }

    // Canonical renamed
    if (htmlMeta.canonicalUrl) {
      console.log('Canonical URL:', htmlMeta.canonicalUrl);
    }

    // Open Graph from map
    if (htmlMeta.openGraph) {
      if (htmlMeta.openGraph['title']) {
        console.log('OG Title:', htmlMeta.openGraph['title']);
      }
      if (htmlMeta.openGraph['image']) {
        console.log('OG Image:', htmlMeta.openGraph['image']);
      }
    }

    // Twitter from map
    if (htmlMeta.twitterCard) {
      if (htmlMeta.twitterCard['card']) {
        console.log('Twitter Card:', htmlMeta.twitterCard['card']);
      }
    }

    // New fields
    if (htmlMeta.language) {
      console.log('Language:', htmlMeta.language);
    }

    if (htmlMeta.headers?.length) {
      console.log('Headers:', htmlMeta.headers);
    }

    // Each link is a structured entry (see LinkMetadata), not a [url, text] tuple
    if (htmlMeta.links?.length) {
      htmlMeta.links.forEach((link) => {
        console.log(`Link: ${link.href} (${link.text})`);
      });
    }
    ```

### Java

=== "Before (v3.x)"

    ```java
    import dev.kreuzberg.Kreuzberg;
    import dev.kreuzberg.ExtractionResult;
    import java.util.Arrays;
    import java.util.Map;

    ExtractionResult result = Kreuzberg.extractFileSync("page.html");
    Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

    // Keywords as single string
    String keywords = (String) htmlMeta.get("keywords");
    if (keywords != null) {
        String[] keywordArray = keywords.split(",");
        System.out.println("Keywords: " + Arrays.toString(keywordArray));
    }

    // Canonical as separate field
    String canonical = (String) htmlMeta.get("canonical");
    if (canonical != null) {
        System.out.println("Canonical: " + canonical);
    }

    // Open Graph as individual fields
    String ogTitle = (String) htmlMeta.get("og_title");
    if (ogTitle != null) {
        System.out.println("OG Title: " + ogTitle);
    }

    // Twitter as individual fields
    String twitterCard = (String) htmlMeta.get("twitter_card");
    if (twitterCard != null) {
        System.out.println("Twitter Card: " + twitterCard);
    }
    ```

=== "After (v4.0)"

    ```java
    import dev.kreuzberg.Kreuzberg;
    import dev.kreuzberg.ExtractionResult;
    import java.util.Map;
    import java.util.List;

    ExtractionResult result = Kreuzberg.extractFileSync("page.html");
    Map<String, Object> htmlMeta = (Map<String, Object>) result.getMetadata().get("html");

    // Keywords as array
    @SuppressWarnings("unchecked")
    List<String> keywords = (List<String>) htmlMeta.get("keywords");
    if (keywords != null && !keywords.isEmpty()) {
        System.out.println("Keywords: " + keywords);
    }

    // Canonical renamed
    String canonicalUrl = (String) htmlMeta.get("canonical_url");
    if (canonicalUrl != null) {
        System.out.println("Canonical URL: " + canonicalUrl);
    }

    // Open Graph from map
    @SuppressWarnings("unchecked")
    Map<String, String> openGraph = (Map<String, String>) htmlMeta.get("open_graph");
    if (openGraph != null) {
        String ogTitle = openGraph.get("title");
        if (ogTitle != null) {
            System.out.println("OG Title: " + ogTitle);
        }
    }

    // Twitter from map
    @SuppressWarnings("unchecked")
    Map<String, String> twitterCard = (Map<String, String>) htmlMeta.get("twitter_card");
    if (twitterCard != null) {
        String card = twitterCard.get("card");
        if (card != null) {
            System.out.println("Twitter Card: " + card);
        }
    }

    // New fields
    String language = (String) htmlMeta.get("language");
    if (language != null) {
        System.out.println("Language: " + language);
    }

    // Headers are structured entries (see HeaderMetadata), not plain strings
    @SuppressWarnings("unchecked")
    List<Map<String, Object>> headers = (List<Map<String, Object>>) htmlMeta.get("headers");
    if (headers != null && !headers.isEmpty()) {
        System.out.println("Headers: " + headers);
    }
    ```

### Go

=== "Before (v3.x)"

    ```go
    package main

    import (
    	"fmt"
    	"log"
    	"strings"
    	"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
    )

    func main() {
    	result, err := kreuzberg.ExtractFileSync("page.html", nil)
    	if err != nil {
    		log.Fatalf("extract: %v", err)
    	}

    	if html, ok := result.Metadata.HTMLMetadata(); ok {
    		// Keywords as single string
    		if html.Keywords != nil {
    			keywordSlice := strings.Split(*html.Keywords, ",")
    			fmt.Println("Keywords:", keywordSlice)
    		}

    		// Canonical as separate field
    		if html.Canonical != nil {
    			fmt.Println("Canonical:", *html.Canonical)
    		}

    		// Open Graph as individual fields
    		if html.OGTitle != nil {
    			fmt.Println("OG Title:", *html.OGTitle)
    		}
    		if html.OGImage != nil {
    			fmt.Println("OG Image:", *html.OGImage)
    		}

    		// Twitter as individual fields
    		if html.TwitterCard != nil {
    			fmt.Println("Twitter Card:", *html.TwitterCard)
    		}
    	}
    }
    ```

=== "After (v4.0)"

    ```go
    package main

    import (
    	"fmt"
    	"log"
    	"strings"
    	"github.com/kreuzberg-dev/kreuzberg/packages/go/v5"
    )

    func main() {
    	result, err := kreuzberg.ExtractFileSync("page.html", nil)
    	if err != nil {
    		log.Fatalf("extract: %v", err)
    	}

    	if html, ok := result.Metadata.HTMLMetadata(); ok {
    		// Keywords as array
    		if len(html.Keywords) > 0 {
    			fmt.Println("Keywords:", strings.Join(html.Keywords, ", "))
    		}

    		// Canonical renamed
    		if html.CanonicalURL != nil {
    			fmt.Println("Canonical URL:", *html.CanonicalURL)
    		}

    		// Open Graph from map
    		if len(html.OpenGraph) > 0 {
    			if ogTitle, ok := html.OpenGraph["title"]; ok {
    				fmt.Println("OG Title:", ogTitle)
    			}
    			if ogImage, ok := html.OpenGraph["image"]; ok {
    				fmt.Println("OG Image:", ogImage)
    			}
    		}

    		// Twitter from map
    		if len(html.TwitterCard) > 0 {
    			if card, ok := html.TwitterCard["card"]; ok {
    				fmt.Println("Twitter Card:", card)
    			}
    		}

    		// New fields
    		if html.Language != nil {
    			fmt.Println("Language:", *html.Language)
    		}

    		// Headers and links are structured entries, not strings or pairs.
    		// The field names below are assumed to mirror the Rust structs
    		// (HeaderMetadata.level/text, LinkMetadata.href/text).
    		for _, header := range html.Headers {
    			fmt.Printf("Header (h%d): %s\n", header.Level, header.Text)
    		}

    		for _, link := range html.Links {
    			fmt.Printf("Link: %s (%s)\n", link.Href, link.Text)
    		}
    	}
    }
    ```

### Ruby

=== "Before (v3.x)"

    ```ruby
    require 'kreuzberg'

    result = Kreuzberg.extract_file_sync('page.html')
    html_meta = result.metadata['html']

    # Keywords as single string
    if html_meta['keywords']
      keyword_array = html_meta['keywords'].split(',').map(&:strip)
      puts "Keywords: #{keyword_array}"
    end

    # Canonical as separate field
    if html_meta['canonical']
      puts "Canonical: #{html_meta['canonical']}"
    end

    # Open Graph as individual fields
    if html_meta['og_title']
      puts "OG Title: #{html_meta['og_title']}"
    end
    if html_meta['og_image']
      puts "OG Image: #{html_meta['og_image']}"
    end

    # Twitter as individual fields
    if html_meta['twitter_card']
      puts "Twitter Card: #{html_meta['twitter_card']}"
    end
    ```

=== "After (v4.0)"

    ```ruby
    require 'kreuzberg'

    result = Kreuzberg.extract_file_sync('page.html')
    html_meta = result.metadata['html']

    # Keywords as array
    if html_meta['keywords'] && !html_meta['keywords'].empty?
      puts "Keywords: #{html_meta['keywords']}"
    end

    # Canonical renamed
    if html_meta['canonical_url']
      puts "Canonical URL: #{html_meta['canonical_url']}"
    end

    # Open Graph from map
    open_graph = html_meta['open_graph'] || {}
    if open_graph['title']
      puts "OG Title: #{open_graph['title']}"
    end
    if open_graph['image']
      puts "OG Image: #{open_graph['image']}"
    end

    # Twitter from map
    twitter_card = html_meta['twitter_card'] || {}
    if twitter_card['card']
      puts "Twitter Card: #{twitter_card['card']}"
    end

    # New fields
    if html_meta['language']
      puts "Language: #{html_meta['language']}"
    end

    # Headers are structured entries (see HeaderMetadata), so join their text
    if html_meta['headers'] && !html_meta['headers'].empty?
      puts "Headers: #{html_meta['headers'].map { |h| h['text'] }.join(', ')}"
    end

    # Each link is a structured entry (see LinkMetadata), not a [url, text] pair
    if html_meta['links'] && !html_meta['links'].empty?
      html_meta['links'].each do |link|
        puts "Link: #{link['href']} (#{link['text']})"
      end
    end
    ```

## API Reference

For complete details on all HTML metadata fields and types, see:

- [HTML Metadata Type Reference](../reference/types.md#htmlmetadata)

## Structured Types Reference

### HeaderMetadata

Header elements extracted from the HTML document with hierarchy information.

```rust title="HeaderMetadata Struct Definition"
pub struct HeaderMetadata {
    pub level: u8,          // 1-6 (h1-h6)
    pub text: String,       // Normalized text content
    pub id: Option<String>, // HTML id attribute
    pub depth: usize,       // Document tree depth
    pub html_offset: usize, // Byte offset in original HTML
}
```

**Example:**

```json title="HeaderMetadata JSON Example"
{
  "level": 1,
  "text": "Welcome to Our Site",
  "id": "welcome-section",
  "depth": 2,
  "html_offset": 512
}
```

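Because each header carries its `level`, a flat list of headers is enough to rebuild a simple outline. The following is a minimal sketch, not library code: `Header` is a local stand-in with two of the fields documented above, and `render_outline` is a hypothetical helper that indents by two spaces per level below `h1`.

```rust
/// Local stand-in with the two `HeaderMetadata` fields this sketch needs.
struct Header {
    level: u8,
    text: String,
}

/// Render headers as an indented bullet outline, one line per header.
fn render_outline(headers: &[Header]) -> String {
    headers
        .iter()
        .map(|h| {
            // h1 gets no indent, h2 gets two spaces, and so on.
            let indent = "  ".repeat(usize::from(h.level.saturating_sub(1)));
            format!("{}- {}", indent, h.text)
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Indenting by `level` rather than `depth` keeps the outline tied to heading semantics; `depth` reflects position in the document tree, which may differ.
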
### LinkMetadata

Link elements with type classification and detailed attributes.

```rust title="LinkMetadata Struct and LinkType Enum"
pub struct LinkMetadata {
    pub href: String,                        // The href URL value
    pub text: String,                        // Link text content
    pub title: Option<String>,               // Title attribute
    pub link_type: LinkType,                 // Classification enum
    pub rel: Vec<String>,                    // Rel attribute values
    pub attributes: HashMap<String, String>, // Additional attributes
}

pub enum LinkType {
    Anchor,   // #section anchors
    Internal, // Same domain links
    External, // Different domain links
    Email,    // mailto: links
    Phone,    // tel: links
    Other,    // Other link types
}
```

**Example:**

```json title="LinkMetadata JSON Example"
{
  "href": "https://example.com",
  "text": "Visit Example",
  "title": "Example Website",
  "link_type": "external",
  "rel": ["nofollow"],
  "attributes": {
    "data-tracking": "yes"
  }
}
```

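The `link_type` classification makes it straightforward to pick out one category of link without inspecting URLs yourself. As a rough sketch under stated assumptions, the code below redefines a local `LinkType` and a trimmed-down `Link` that mirror the definitions above (they are stand-ins, not the library types), and `external_hrefs` is a hypothetical helper.

```rust
/// Local mirror of the documented `LinkType` variants.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinkType {
    Anchor,
    Internal,
    External,
    Email,
    Phone,
    Other,
}

/// Local stand-in with the two `LinkMetadata` fields this sketch needs.
struct Link {
    href: String,
    link_type: LinkType,
}

/// Collect the hrefs of links classified as external, in document order.
fn external_hrefs(links: &[Link]) -> Vec<&str> {
    links
        .iter()
        .filter(|link| link.link_type == LinkType::External)
        .map(|link| link.href.as_str())
        .collect()
}
```

Swapping `LinkType::External` for another variant gives the equivalent filter for anchors, email links, and so on.
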
### ImageMetadataType

Image elements with type classification and dimensions.

```rust title="ImageMetadataType Struct and ImageType Enum"
pub struct ImageMetadataType {
    pub src: String,                         // Image source (URL, data URI, or SVG)
    pub alt: Option<String>,                 // Alt text
    pub title: Option<String>,               // Title attribute
    pub dimensions: Option<(u32, u32)>,      // Width x Height
    pub image_type: ImageType,               // Classification enum
    pub attributes: HashMap<String, String>, // Additional attributes
}

pub enum ImageType {
    DataUri,   // data: URI
    InlineSvg, // Inline <svg> content
    External,  // External URL
    Relative,  // Relative path
}
```

**Example:**

```json title="ImageMetadataType JSON Example"
{
  "src": "https://cdn.example.com/image.jpg",
  "alt": "Product photo",
  "title": "Featured product",
  "dimensions": [400, 300],
  "image_type": "external",
  "attributes": {
    "loading": "lazy"
  }
}
```

### StructuredData

Extracted structured data blocks (JSON-LD, microdata, RDFa).

```rust title="StructuredData Struct and StructuredDataType Enum"
pub struct StructuredData {
    pub data_type: StructuredDataType, // Classification enum
    pub raw_json: String,              // Raw JSON string
    pub schema_type: Option<String>,   // Schema type (e.g., "Article")
}

pub enum StructuredDataType {
    JsonLd,    // JSON-LD
    Microdata, // microdata
    RDFa,      // RDFa
}
```

**Example:**

```json title="StructuredData JSON Example"
{
  "data_type": "json-ld",
  "raw_json": "{\"@context\": \"https://schema.org\", \"@type\": \"Article\", ...}",
  "schema_type": "Article"
}
```

## Summary of Changes

| Field                                           | v3.x                               | v4.0                                                |
| ----------------------------------------------- | ---------------------------------- | --------------------------------------------------- |
| `keywords`                                      | `Option<String>`                   | `Vec<String>` with `#[serde(default)]`              |
| `canonical`                                     | `Option<String>`                   | Renamed to `canonical_url`                          |
| `og_*` fields (7 fields)                        | Individual `Option<String>` fields | `open_graph: BTreeMap<String, String>`              |
| `twitter_*` fields (6 fields)                   | Individual `Option<String>` fields | `twitter_card: BTreeMap<String, String>`            |
| `link_author`, `link_license`, `link_alternate` | Individual fields                  | Removed (use `links` field)                         |
| New: `language`                                 | N/A                                | `Option<String>`                                    |
| New: `text_direction`                           | N/A                                | `Option<TextDirection>`                             |
| New: `headers`                                  | N/A                                | `Vec<HeaderMetadata>` with `#[serde(default)]`      |
| New: `links`                                    | N/A                                | `Vec<LinkMetadata>` with `#[serde(default)]`        |
| New: `images`                                   | N/A                                | `Vec<ImageMetadataType>` with `#[serde(default)]`   |
| New: `structured_data`                          | N/A                                | `Vec<StructuredData>` with `#[serde(default)]`      |
| New: `meta_tags`                                | N/A                                | `BTreeMap<String, String>` with `#[serde(default)]` |

## Questions?

- See the [Types Reference](../reference/types.md) for complete API details
- Check [Working with Metadata](../getting-started/quickstart.md#read-document-metadata) for examples
- Open an issue on [GitHub](https://github.com/kreuzberg-dev/kreuzberg/issues)
52
docs/migration/v5.0-image-indices.md
Normal file
@@ -0,0 +1,52 @@
# Image Index References (v5.0)

## Summary

`PageContent.images: Vec<Arc<ExtractedImage>>` is removed. Pages now carry `image_indices: Vec<u32>`, which holds zero-based indices into `ExtractionResult.images`.

## Breaking Change

**Previous behavior** (v4.x):

```rust
let result = extractor.extract(path, &config).await?;
for page in result.pages.unwrap_or_default() {
    for image in &page.images {
        println!("{:?}", image.data);
    }
}
```

**New behavior** (v5.0):

```rust
let result = extractor.extract(path, &config).await?;
let images = result.images.as_deref().unwrap_or(&[]);
for page in result.pages.unwrap_or_default() {
    for &idx in &page.image_indices {
        println!("{:?}", images[idx as usize].data);
    }
}
```

`ChunkMetadata` gains the same `image_indices: Vec<u32>` field, populated post-chunking by matching each image's `page_number` against `[first_page, last_page]`.

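The page-range matching described above, plus a lookup that tolerates stale indices, can be sketched as follows. This is an illustration only: `Image` is a local stand-in, the `Option<u32>` type of `page_number` and the inclusive range are assumptions, and `image_indices_for_range` and `resolve_images` are hypothetical helper names rather than Kreuzberg functions.

```rust
/// Local stand-in for the one image field this sketch relies on.
/// Assumption: `page_number` is optional and 1-based page numbering is not required.
struct Image {
    page_number: Option<u32>,
}

/// Indices of images whose page lies inside `[first_page, last_page]` (inclusive).
/// Images without a page number never match.
fn image_indices_for_range(images: &[Image], first_page: u32, last_page: u32) -> Vec<u32> {
    images
        .iter()
        .enumerate()
        .filter(|(_, img)| {
            img.page_number
                .map_or(false, |p| p >= first_page && p <= last_page)
        })
        .map(|(i, _)| i as u32)
        .collect()
}

/// Resolve indices to images, silently skipping any index that is out of range
/// instead of panicking the way `images[idx as usize]` would.
fn resolve_images<'a>(images: &'a [Image], indices: &[u32]) -> Vec<&'a Image> {
    indices
        .iter()
        .filter_map(|&i| images.get(i as usize))
        .collect()
}
```

Using `get` instead of direct indexing is a defensive choice for cases where `image_indices` and the image list might come from different extraction results.
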
## Impact

**Who is affected?**

- Users reading `page.images` directly
- Users passing `PageContent` values across FFI boundaries
- All polyglot bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir); bindings are regenerated automatically

**What changes?**

| Before                   | After                                                         |
| ------------------------ | ------------------------------------------------------------- |
| `page.images[i].data`    | `result.images.unwrap()[page.image_indices[i] as usize].data` |
| `page.images.len()`      | `page.image_indices.len()`                                    |
| `page.images.is_empty()` | `page.image_indices.is_empty()`                               |

## Known Limitation

`YamlSectionChunker` does not track page provenance (`first_page`/`last_page` are always `None`), so its chunks always produce empty `image_indices`. Tracked in a separate issue.