fil/docs/migration/v4.0-fonts.md

2026-06-01 23:40:55 +02:00
# Font Configuration Breaking Change (v4.0)
## Summary
In v4.0 the custom font provider is **enabled by default** and configurable through `FontConfig`, which makes PDF extraction roughly 12-13% faster.
## Breaking Change
**Previous behavior** (v3.x):
- Font provider always enabled, not configurable
- Used system fonts only
- No user control over font loading
**New behavior** (v4.0):
- Font provider enabled by default
- Configurable via `FontConfig` in `PdfConfig`
- Can disable or add custom font directories
- ~12-13% faster PDF processing with font caching
## Impact
**Who is affected?**
- Users who rely on the PDF extractor's default font fallback behavior
- Users who want to disable the custom font provider
- Users who need to add custom font directories
**What changes?**
- Default: Custom font provider now active (breaking change)
- Performance: PDF extraction ~12-13% faster
- API: New `font_config` option in `PdfConfig`
## Migration
### No Action Required (Recommended)
Most users need no changes. The same code keeps working, and the new default provides the performance improvement:
=== "Rust"
```rust
use kreuzberg::ExtractionConfig;
// Previous (v3.x) - no font configuration
let config = ExtractionConfig::default();
// Current (v4.0) - same code, now with font provider enabled
let config = ExtractionConfig::default();
// Font provider automatically enabled with system fonts
```
=== "Python"
```python
from kreuzberg import ExtractionConfig
# Previous (v3.x)
config = ExtractionConfig()
# Current (v4.0) - same code, now with font provider enabled
config = ExtractionConfig()
# Font provider automatically enabled with system fonts
```
=== "TypeScript"
```typescript
import { ExtractionConfig } from 'kreuzberg';
// Previous (v3.x)
const config: ExtractionConfig = {};
// Current (v4.0) - same code, now with font provider enabled
const config: ExtractionConfig = {};
// Font provider automatically enabled with system fonts
```
=== "Java"
```java
import dev.kreuzberg.config.*;
// Previous (v3.x)
ExtractionConfig config = ExtractionConfig.builder().build();
// Current (v4.0) - same code, now with font provider enabled
ExtractionConfig config = ExtractionConfig.builder().build();
// Font provider automatically enabled with system fonts
```
=== "Go"
```go
import "github.com/kreuzberg-dev/kreuzberg/v4"
// Previous (v3.x)
config := &kreuzberg.ExtractionConfig{}
// Current (v4.0) - same code, now with font provider enabled
config := &kreuzberg.ExtractionConfig{}
// Font provider automatically enabled with system fonts
```
=== "Ruby"
```ruby
require 'kreuzberg'
# Previous (v3.x)
config = Kreuzberg::ExtractionConfig.new
# Current (v4.0) - same code, now with font provider enabled
config = Kreuzberg::ExtractionConfig.new
# Font provider automatically enabled with system fonts
```
=== "C#"
```csharp
using Kreuzberg;
// Previous (v3.x)
var config = new ExtractionConfig();
// Current (v4.0) - same code, now with font provider enabled
var config = new ExtractionConfig();
// Font provider automatically enabled with system fonts
```
### Disable Font Provider
If you prefer the PDF extractor's built-in font fallback behavior, disable the custom font provider:
=== "Rust"
```rust
use kreuzberg::{ExtractionConfig, PdfConfig, FontConfig};
let config = ExtractionConfig {
    pdf_options: Some(PdfConfig {
        font_config: Some(FontConfig {
            enabled: false,
            custom_font_dirs: None,
        }),
        ..Default::default()
    }),
    ..Default::default()
};
```
=== "Python"
```python
from kreuzberg import ExtractionConfig, PdfConfig, FontConfig
config = ExtractionConfig(
    pdf_options=PdfConfig(
        font_config=FontConfig(enabled=False)
    )
)
```
=== "TypeScript"
```typescript
import { ExtractionConfig } from 'kreuzberg';
const config: ExtractionConfig = {
  pdfOptions: {
    fontConfig: {
      enabled: false
    }
  }
};
```
=== "Java"
```java
import dev.kreuzberg.config.*;
FontConfig fontConfig = FontConfig.builder()
    .enabled(false)
    .build();
PdfConfig pdfConfig = PdfConfig.builder()
    .fontConfig(fontConfig)
    .build();
ExtractionConfig config = ExtractionConfig.builder()
    .pdfOptions(pdfConfig)
    .build();
```
=== "Go"
```go
import "github.com/kreuzberg-dev/kreuzberg/v4"
config := &kreuzberg.ExtractionConfig{
	PdfOptions: &kreuzberg.PdfConfig{
		FontConfig: &kreuzberg.FontConfig{
			Enabled: false,
		},
	},
}
```
=== "Ruby"
```ruby
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    font_config: Kreuzberg::FontConfig.new(enabled: false)
  )
)
```
=== "C#"
```csharp
using Kreuzberg;
var fontConfig = new FontConfig { Enabled = false };
var pdfConfig = new PdfConfig { FontConfig = fontConfig };
var config = new ExtractionConfig { PdfOptions = pdfConfig };
```
### Add Custom Font Directories
To use fonts from custom directories (in addition to system fonts):
=== "Rust"
```rust
use kreuzberg::{ExtractionConfig, PdfConfig, FontConfig};
use std::path::PathBuf;
let config = ExtractionConfig {
    pdf_options: Some(PdfConfig {
        font_config: Some(FontConfig {
            enabled: true,
            custom_font_dirs: Some(vec![
                PathBuf::from("/usr/share/fonts/custom"),
                PathBuf::from("~/my-fonts"), // Tilde expanded automatically
            ]),
        }),
        ..Default::default()
    }),
    ..Default::default()
};
```
=== "Python"
```python
from kreuzberg import ExtractionConfig, PdfConfig, FontConfig
config = ExtractionConfig(
    pdf_options=PdfConfig(
        font_config=FontConfig(
            enabled=True,
            custom_font_dirs=[
                "/usr/share/fonts/custom",
                "~/my-fonts"  # Tilde expanded automatically
            ]
        )
    )
)
```
=== "TypeScript"
```typescript
import { ExtractionConfig } from 'kreuzberg';
const config: ExtractionConfig = {
  pdfOptions: {
    fontConfig: {
      enabled: true,
      customFontDirs: [
        '/usr/share/fonts/custom',
        '~/my-fonts' // Tilde expanded automatically
      ]
    }
  }
};
```
=== "Java"
```java
import dev.kreuzberg.config.*;
import java.nio.file.Paths;
import java.util.Arrays; // needed for Arrays.asList below
FontConfig fontConfig = FontConfig.builder()
    .enabled(true)
    .customFontDirs(Arrays.asList(
        Paths.get("/usr/share/fonts/custom"),
        Paths.get("~/my-fonts") // Tilde expanded automatically
    ))
    .build();
PdfConfig pdfConfig = PdfConfig.builder()
    .fontConfig(fontConfig)
    .build();
ExtractionConfig config = ExtractionConfig.builder()
    .pdfOptions(pdfConfig)
    .build();
```
=== "Go"
```go
import "github.com/kreuzberg-dev/kreuzberg/v4"
config := &kreuzberg.ExtractionConfig{
	PdfOptions: &kreuzberg.PdfConfig{
		FontConfig: &kreuzberg.FontConfig{
			Enabled: true,
			CustomFontDirs: []string{
				"/usr/share/fonts/custom",
				"~/my-fonts", // Tilde expanded automatically
			},
		},
	},
}
```
=== "Ruby"
```ruby
require 'kreuzberg'
config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    font_config: Kreuzberg::FontConfig.new(
      enabled: true,
      custom_font_dirs: [
        '/usr/share/fonts/custom',
        '~/my-fonts' # Tilde expanded automatically
      ]
    )
  )
)
```
=== "C#"
```csharp
using Kreuzberg;
var fontConfig = new FontConfig
{
    Enabled = true,
    CustomFontDirs = new[]
    {
        "/usr/share/fonts/custom",
        "~/my-fonts" // Tilde expanded automatically
    }
};
var pdfConfig = new PdfConfig { FontConfig = fontConfig };
var config = new ExtractionConfig { PdfOptions = pdfConfig };
```
## Configuration Files
### TOML Format
```toml title="Font Configuration in TOML"
[pdf_options.font_config]
enabled = true
custom_font_dirs = ["/usr/share/fonts/custom", "~/my-fonts"]
```
### YAML Format
```yaml title="Font Configuration in YAML"
pdf_options:
  font_config:
    enabled: true
    custom_font_dirs:
      - /usr/share/fonts/custom
      - ~/my-fonts
```
### JSON Format
```json title="Font Configuration in JSON"
{
  "pdf_options": {
    "font_config": {
      "enabled": true,
      "custom_font_dirs": ["/usr/share/fonts/custom", "~/my-fonts"]
    }
  }
}
```
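To check a configuration file before handing it to the library, the nested keys can be read with the standard library alone. The following is a minimal sketch, not part of the kreuzberg API: `read_font_config` and `DEFAULT_FONT_CONFIG` are hypothetical names, and the empty-list default for `custom_font_dirs` is an assumption (only `enabled = true` as the default is stated in this guide).

```python
import json

# Assumed defaults: "enabled" is true per this guide; the empty list is a guess.
DEFAULT_FONT_CONFIG = {"enabled": True, "custom_font_dirs": []}


def read_font_config(config_text):
    """Extract the font settings from a JSON config string, applying defaults."""
    data = json.loads(config_text)
    # Tolerate a missing or null "pdf_options" / "font_config" section.
    section = (data.get("pdf_options") or {}).get("font_config") or {}
    merged = {**DEFAULT_FONT_CONFIG, **section}
    if not isinstance(merged["enabled"], bool):
        raise TypeError("font_config.enabled must be a boolean")
    dirs = merged["custom_font_dirs"]
    if not isinstance(dirs, list) or not all(isinstance(d, str) for d in dirs):
        raise TypeError("font_config.custom_font_dirs must be a list of strings")
    return merged
```

Calling `read_font_config` on the JSON example above yields a dictionary with `enabled` set to `True` and both directories listed; an empty `{}` document falls back to the assumed defaults.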
## Path Handling
The font configuration automatically handles:
- **Tilde expansion**: `~/fonts` → `/Users/username/fonts`
- **Relative paths**: `./fonts` → `/absolute/path/to/fonts`
- **Symlinks**: Resolved to canonical paths (security measure)
- **Validation**: Directories must exist; warnings logged if not found
- **Graceful degradation**: Missing directories don't cause failures
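
The rules above can be approximated in a few lines of Python to preview how a list of directories would be interpreted. This is an illustrative sketch under stated assumptions, not the library's implementation: `normalize_font_dirs` is a hypothetical helper, and the warning text is modeled on the "Custom font directory not found" message mentioned under Troubleshooting.

```python
import logging
from pathlib import Path

logger = logging.getLogger("font_dirs_sketch")


def normalize_font_dirs(raw_dirs):
    """Expand, absolutize, and validate font directories (illustrative only)."""
    resolved = []
    for raw in raw_dirs:
        # expanduser() handles "~"; resolve() makes the path absolute
        # and follows symlinks to the canonical location.
        path = Path(raw).expanduser().resolve()
        if path.is_dir():
            resolved.append(path)
        else:
            # Missing directories are skipped with a warning instead of raising.
            logger.warning("Custom font directory not found: %s", path)
    return resolved
```

Passing a mix of existing and missing directories returns only the existing ones as canonical absolute paths, which mirrors the graceful-degradation behavior described in the list.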
## Global Configuration
**Important**: Font configuration is global per process and must be set **before the first PDF extraction**.
=== "Rust"
```rust
// CORRECT: Set config before first extraction
let config = ExtractionConfig {
    pdf_options: Some(PdfConfig {
        font_config: Some(FontConfig {
            enabled: true,
            custom_font_dirs: Some(vec![
                PathBuf::from("/usr/share/fonts/custom"),
            ]),
        }),
        ..Default::default()
    }),
    ..Default::default()
};
let result = kreuzberg::extract_file("document.pdf", &config)?;

// INCORRECT: Attempting to change config after first extraction
let new_config = ExtractionConfig {
    pdf_options: Some(PdfConfig {
        font_config: Some(FontConfig {
            enabled: false,
            custom_font_dirs: None,
        }),
        ..Default::default()
    }),
    ..Default::default()
};
let result2 = kreuzberg::extract_file("document2.pdf", &new_config)?;
// Warning logged: "Font config already initialized"
```
=== "Python"
```python
# CORRECT: Set config before first extraction
config = ExtractionConfig(
    pdf_options=PdfConfig(
        font_config=FontConfig(
            enabled=True,
            custom_font_dirs=["/usr/share/fonts/custom"]
        )
    )
)
result = extract_file("document.pdf", config)

# INCORRECT: Attempting to change config after first extraction
new_config = ExtractionConfig(
    pdf_options=PdfConfig(
        font_config=FontConfig(enabled=False)
    )
)
result2 = extract_file("document2.pdf", new_config)
# Warning logged: "Font config already initialized"
```
=== "TypeScript"
```typescript
// CORRECT: Set config before first extraction
const config: ExtractionConfig = {
  pdfOptions: {
    fontConfig: {
      enabled: true,
      customFontDirs: ['/usr/share/fonts/custom']
    }
  }
};
const result = await extractFile('document.pdf', config);

// INCORRECT: Attempting to change config after first extraction
const newConfig: ExtractionConfig = {
  pdfOptions: {
    fontConfig: { enabled: false }
  }
};
const result2 = await extractFile('document2.pdf', newConfig);
// Warning logged: "Font config already initialized"
```
=== "Java"
```java
// CORRECT: Set config before first extraction
FontConfig fontConfig = FontConfig.builder()
    .enabled(true)
    .customFontDirs(Arrays.asList(Paths.get("/usr/share/fonts/custom")))
    .build();
PdfConfig pdfConfig = PdfConfig.builder()
    .fontConfig(fontConfig)
    .build();
ExtractionConfig config = ExtractionConfig.builder()
    .pdfOptions(pdfConfig)
    .build();
ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);

// INCORRECT: Attempting to change config after first extraction
FontConfig newFontConfig = FontConfig.builder()
    .enabled(false)
    .build();
PdfConfig newPdfConfig = PdfConfig.builder()
    .fontConfig(newFontConfig)
    .build();
ExtractionConfig newConfig = ExtractionConfig.builder()
    .pdfOptions(newPdfConfig)
    .build();
ExtractionResult result2 = Kreuzberg.extractFile("document2.pdf", newConfig);
// Warning logged: "Font config already initialized"
```
=== "Go"
```go
// CORRECT: Set config before first extraction
config := &kreuzberg.ExtractionConfig{
	PdfOptions: &kreuzberg.PdfConfig{
		FontConfig: &kreuzberg.FontConfig{
			Enabled:        true,
			CustomFontDirs: []string{"/usr/share/fonts/custom"},
		},
	},
}
result, _ := kreuzberg.ExtractFile("document.pdf", config)

// INCORRECT: Attempting to change config after first extraction
newConfig := &kreuzberg.ExtractionConfig{
	PdfOptions: &kreuzberg.PdfConfig{
		FontConfig: &kreuzberg.FontConfig{
			Enabled: false,
		},
	},
}
result2, _ := kreuzberg.ExtractFile("document2.pdf", newConfig)
// Warning logged: "Font config already initialized"
```
=== "Ruby"
```ruby
# CORRECT: Set config before first extraction
config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    font_config: Kreuzberg::FontConfig.new(
      enabled: true,
      custom_font_dirs: ['/usr/share/fonts/custom']
    )
  )
)
result = Kreuzberg.extract_file('document.pdf', config)

# INCORRECT: Attempting to change config after first extraction
new_config = Kreuzberg::ExtractionConfig.new(
  pdf_options: Kreuzberg::PdfConfig.new(
    font_config: Kreuzberg::FontConfig.new(enabled: false)
  )
)
result2 = Kreuzberg.extract_file('document2.pdf', new_config)
# Warning logged: "Font config already initialized"
```
=== "C#"
```csharp
// CORRECT: Set config before first extraction
var fontConfig = new FontConfig
{
    Enabled = true,
    CustomFontDirs = new[] { "/usr/share/fonts/custom" }
};
var pdfConfig = new PdfConfig { FontConfig = fontConfig };
var config = new ExtractionConfig { PdfOptions = pdfConfig };
var result = Kreuzberg.ExtractFile("document.pdf", config);

// INCORRECT: Attempting to change config after first extraction
var newFontConfig = new FontConfig { Enabled = false };
var newPdfConfig = new PdfConfig { FontConfig = newFontConfig };
var newConfig = new ExtractionConfig { PdfOptions = newPdfConfig };
var result2 = Kreuzberg.ExtractFile("document2.pdf", newConfig);
// Warning logged: "Font config already initialized"
```
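
The "first configuration wins" behavior shown in the tabs above can be pictured as a process-wide value that is set once. The sketch below is a conceptual model only, assuming a set-once global guarded by a lock; `init_font_config` and `_active_font_config` are hypothetical names, and the library's internals may differ (for example, in whether an identical second config also triggers the warning).

```python
import logging
import threading

logger = logging.getLogger("font_config_sketch")

_lock = threading.Lock()
_active_font_config = None  # set at most once per process


def init_font_config(font_config):
    """Store the first font config; ignore later ones with a warning.

    Returns the config that is actually in effect.
    """
    global _active_font_config
    with _lock:
        if _active_font_config is None:
            # First caller decides the configuration for the whole process.
            _active_font_config = dict(font_config)
        elif dict(font_config) != _active_font_config:
            # Later, different configs are not applied.
            logger.warning("Font config already initialized")
        return _active_font_config
```

In this model a second call with `{"enabled": False}` returns the original configuration unchanged, which is why the font settings belong in the first extraction call of the process.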
## Performance Impact
With default settings (enabled=true, system fonts):
- **PDF extraction**: ~12-13% faster
- **Memory**: Minimal increase (~100KB for font cache)
- **Startup**: Lazy initialization (no overhead for non-PDF workloads)
## Troubleshooting
### Custom fonts not working
**Symptom**: PDF still uses fallback fonts
**Solutions**:
1. Verify directories exist and contain `.ttf`, `.otf`, or `.ttc` files
2. Check logs for "Custom font directory not found" warnings
3. Ensure paths are absolute or properly expanded
4. Verify font files are readable
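
The checks in this list can be scripted. The following is a small diagnostic sketch using only the Python standard library; `check_font_dir` is a hypothetical helper, and searching subdirectories recursively is an assumption, since this guide does not say whether the font provider descends into them.

```python
import os
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf", ".ttc"}


def check_font_dir(directory):
    """Return a list of problems found for one font directory (empty if none)."""
    path = Path(directory).expanduser()
    if not path.is_dir():
        return [f"directory not found: {path}"]
    # Collect font files by extension, case-insensitively (assumed recursive).
    fonts = [
        p for p in path.rglob("*")
        if p.is_file() and p.suffix.lower() in FONT_SUFFIXES
    ]
    if not fonts:
        return [f"no .ttf/.otf/.ttc files in {path}"]
    # Report any font file the current user cannot read.
    return [f"not readable: {p}" for p in fonts if not os.access(p, os.R_OK)]
```

Running `check_font_dir` on each entry of `custom_font_dirs` and printing the returned messages covers steps 1, 3, and 4; an empty list means the directory passed all three checks.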
### "Font config already initialized" warning
**Symptom**: Configuration changes ignored after first PDF extraction
**Solution**: Set `FontConfig` in the **first** `ExtractionConfig` used in the process. Later changes are ignored because font configuration is global per process.
### Performance regression
**Symptom**: PDF extraction slower after upgrade
**Solution**: This is unexpected. Please report as a bug with:
- PDF sample (if shareable)
- Benchmark comparison (before/after)
- Configuration used
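
For the before/after comparison, a simple timing harness is usually enough. This is a minimal sketch, assuming you wrap your own extraction call in a zero-argument function; `median_seconds` and `percent_change` are hypothetical helper names, not part of kreuzberg.

```python
import statistics
import time


def median_seconds(func, repeats=5):
    """Run func() several times and return the median wall-clock duration."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def percent_change(before, after):
    """Relative change from before to after; negative means faster."""
    return (after - before) / before * 100.0


# Example with hypothetical measurements: 1.0 s before, 0.875 s after.
print(f"{percent_change(1.0, 0.875):+.1f}%")  # prints "-12.5%"
```

Measure the same PDF on both versions with `median_seconds(lambda: ...)` around your extraction call, then include both medians and the `percent_change` value in the bug report.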
## Questions?
- **Issue tracker**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
- **Discussions**: <https://github.com/kreuzberg-dev/kreuzberg/discussions>