# Font Configuration Breaking Change (v4.0)

## Summary

The custom font provider is now **enabled by default** for improved PDF performance.

## Breaking Change

**Previous behavior** (v3.x):

- Font provider always enabled, not configurable
- Used system fonts only
- No user control over font loading

**New behavior** (v4.0):

- Font provider enabled by default
- Configurable via `FontConfig` in `PdfConfig`
- Can disable or add custom font directories
- ~12-13% faster PDF processing with font caching

## Impact

**Who is affected?**

- Users who rely on the PDF extractor's default font fallback behavior
- Users who want to disable the custom font provider
- Users who need to add custom font directories

**What changes?**

- Default: Custom font provider now active (breaking change)
- Performance: PDF extraction ~12-13% faster
- API: New `font_config` option in `PdfConfig`

## Migration

### No Action Required (Recommended)

Most users do not need to change anything. The default behavior already provides the performance improvement:

=== "Rust"

    ```rust
    use kreuzberg::ExtractionConfig;

    // Previous (v3.x) - no font configuration
    let config = ExtractionConfig::default();

    // Current (v4.0) - same code, now with font provider enabled
    let config = ExtractionConfig::default();
    // Font provider automatically enabled with system fonts
    ```

=== "Python"

    ```python
    from kreuzberg import ExtractionConfig

    # Previous (v3.x)
    config = ExtractionConfig()

    # Current (v4.0) - same code, now with font provider enabled
    config = ExtractionConfig()
    # Font provider automatically enabled with system fonts
    ```

=== "TypeScript"

    ```typescript
    import { ExtractionConfig } from 'kreuzberg';

    // Previous (v3.x)
    const config: ExtractionConfig = {};

    // Current (v4.0) - same code, now with font provider enabled
    const config: ExtractionConfig = {};
    // Font provider automatically enabled with system fonts
    ```

=== "Java"

    ```java
    import dev.kreuzberg.config.*;

    // Previous (v3.x)
    ExtractionConfig config = ExtractionConfig.builder().build();

    // Current (v4.0) - same code, now with font provider enabled
    ExtractionConfig config = ExtractionConfig.builder().build();
    // Font provider automatically enabled with system fonts
    ```

=== "Go"

    ```go
    import "github.com/kreuzberg-dev/kreuzberg/v4"

    // Previous (v3.x)
    config := &kreuzberg.ExtractionConfig{}

    // Current (v4.0) - same code, now with font provider enabled
    config := &kreuzberg.ExtractionConfig{}
    // Font provider automatically enabled with system fonts
    ```

=== "Ruby"

    ```ruby
    require 'kreuzberg'

    # Previous (v3.x)
    config = Kreuzberg::ExtractionConfig.new

    # Current (v4.0) - same code, now with font provider enabled
    config = Kreuzberg::ExtractionConfig.new
    # Font provider automatically enabled with system fonts
    ```

=== "C#"

    ```csharp
    using Kreuzberg;

    // Previous (v3.x)
    var config = new ExtractionConfig();

    // Current (v4.0) - same code, now with font provider enabled
    var config = new ExtractionConfig();
    // Font provider automatically enabled with system fonts
    ```

### Disable Font Provider

If you prefer the PDF extractor's default font handling, disable the font provider:

=== "Rust"

    ```rust
    use kreuzberg::{ExtractionConfig, PdfConfig, FontConfig};

    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            font_config: Some(FontConfig {
                enabled: false,
                custom_font_dirs: None,
            }),
            ..Default::default()
        }),
        ..Default::default()
    };
    ```

=== "Python"

    ```python
    from kreuzberg import ExtractionConfig, PdfConfig, FontConfig

    config = ExtractionConfig(
        pdf_options=PdfConfig(
            font_config=FontConfig(enabled=False)
        )
    )
    ```

=== "TypeScript"

    ```typescript
    import { ExtractionConfig } from 'kreuzberg';

    const config: ExtractionConfig = {
      pdfOptions: {
        fontConfig: {
          enabled: false
        }
      }
    };
    ```

=== "Java"

    ```java
    import dev.kreuzberg.config.*;

    FontConfig fontConfig = FontConfig.builder()
        .enabled(false)
        .build();

    PdfConfig pdfConfig = PdfConfig.builder()
        .fontConfig(fontConfig)
        .build();

    ExtractionConfig config = ExtractionConfig.builder()
        .pdfOptions(pdfConfig)
        .build();
    ```

=== "Go"

    ```go
    import "github.com/kreuzberg-dev/kreuzberg/v4"

    config := &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            FontConfig: &kreuzberg.FontConfig{
                Enabled: false,
            },
        },
    }
    ```

=== "Ruby"

    ```ruby
    require 'kreuzberg'

    config = Kreuzberg::ExtractionConfig.new(
      pdf_options: Kreuzberg::PdfConfig.new(
        font_config: Kreuzberg::FontConfig.new(enabled: false)
      )
    )
    ```

=== "C#"

    ```csharp
    using Kreuzberg;

    var fontConfig = new FontConfig { Enabled = false };
    var pdfConfig = new PdfConfig { FontConfig = fontConfig };
    var config = new ExtractionConfig { PdfOptions = pdfConfig };
    ```

### Add Custom Font Directories

To use fonts from custom directories (in addition to system fonts):

=== "Rust"

    ```rust
    use kreuzberg::{ExtractionConfig, PdfConfig, FontConfig};
    use std::path::PathBuf;

    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            font_config: Some(FontConfig {
                enabled: true,
                custom_font_dirs: Some(vec![
                    PathBuf::from("/usr/share/fonts/custom"),
                    PathBuf::from("~/my-fonts"),  // Tilde expanded automatically
                ]),
            }),
            ..Default::default()
        }),
        ..Default::default()
    };
    ```

=== "Python"

    ```python
    from kreuzberg import ExtractionConfig, PdfConfig, FontConfig

    config = ExtractionConfig(
        pdf_options=PdfConfig(
            font_config=FontConfig(
                enabled=True,
                custom_font_dirs=[
                    "/usr/share/fonts/custom",
                    "~/my-fonts"  # Tilde expanded automatically
                ]
            )
        )
    )
    ```

=== "TypeScript"

    ```typescript
    import { ExtractionConfig } from 'kreuzberg';

    const config: ExtractionConfig = {
      pdfOptions: {
        fontConfig: {
          enabled: true,
          customFontDirs: [
            '/usr/share/fonts/custom',
            '~/my-fonts'  // Tilde expanded automatically
          ]
        }
      }
    };
    ```

=== "Java"

    ```java
    import dev.kreuzberg.config.*;
    import java.nio.file.Paths;
    import java.util.Arrays;

    FontConfig fontConfig = FontConfig.builder()
        .enabled(true)
        .customFontDirs(Arrays.asList(
            Paths.get("/usr/share/fonts/custom"),
            Paths.get("~/my-fonts")  // Tilde expanded automatically
        ))
        .build();

    PdfConfig pdfConfig = PdfConfig.builder()
        .fontConfig(fontConfig)
        .build();

    ExtractionConfig config = ExtractionConfig.builder()
        .pdfOptions(pdfConfig)
        .build();
    ```

=== "Go"

    ```go
    import "github.com/kreuzberg-dev/kreuzberg/v4"

    config := &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            FontConfig: &kreuzberg.FontConfig{
                Enabled: true,
                CustomFontDirs: []string{
                    "/usr/share/fonts/custom",
                    "~/my-fonts",  // Tilde expanded automatically
                },
            },
        },
    }
    ```

=== "Ruby"

    ```ruby
    require 'kreuzberg'

    config = Kreuzberg::ExtractionConfig.new(
      pdf_options: Kreuzberg::PdfConfig.new(
        font_config: Kreuzberg::FontConfig.new(
          enabled: true,
          custom_font_dirs: [
            '/usr/share/fonts/custom',
            '~/my-fonts'  # Tilde expanded automatically
          ]
        )
      )
    )
    ```

=== "C#"

    ```csharp
    using Kreuzberg;

    var fontConfig = new FontConfig
    {
        Enabled = true,
        CustomFontDirs = new[]
        {
            "/usr/share/fonts/custom",
            "~/my-fonts"  // Tilde expanded automatically
        }
    };

    var pdfConfig = new PdfConfig { FontConfig = fontConfig };
    var config = new ExtractionConfig { PdfOptions = pdfConfig };
    ```

## Configuration Files

### TOML Format

```toml title="Font Configuration in TOML"
[pdf_options.font_config]
enabled = true
custom_font_dirs = ["/usr/share/fonts/custom", "~/my-fonts"]
```

### YAML Format

```yaml title="Font Configuration in YAML"
pdf_options:
  font_config:
    enabled: true
    custom_font_dirs:
      - /usr/share/fonts/custom
      - ~/my-fonts
```

### JSON Format

```json title="Font Configuration in JSON"
{
  "pdf_options": {
    "font_config": {
      "enabled": true,
      "custom_font_dirs": ["/usr/share/fonts/custom", "~/my-fonts"]
    }
  }
}
```
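
If an application wants to inspect these settings itself before building a `FontConfig`, the nested layout can be read with Python's standard `json` module. The following is only a sketch: `read_font_settings` is a hypothetical helper, not part of the Kreuzberg API, and the fallback values are assumptions that mirror the defaults described in this guide (provider enabled, no extra directories).

```python
import json


def read_font_settings(config_text: str) -> tuple[bool, list[str]]:
    """Return (enabled, custom_font_dirs) from a JSON config document."""
    config = json.loads(config_text)
    font_config = config.get("pdf_options", {}).get("font_config", {})
    # Assumed defaults: provider on, no extra directories.
    enabled = font_config.get("enabled", True)
    dirs = font_config.get("custom_font_dirs") or []
    return enabled, list(dirs)


example = """
{
  "pdf_options": {
    "font_config": {
      "enabled": true,
      "custom_font_dirs": ["/usr/share/fonts/custom", "~/my-fonts"]
    }
  }
}
"""

enabled, dirs = read_font_settings(example)
print(enabled, dirs)
```

The same key path (`pdf_options` then `font_config`) applies to the TOML and YAML layouts, so a parser for either format could feed the same helper logic.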

## Path Handling

The font configuration automatically handles:

- **Tilde expansion**: `~/fonts` → `/Users/username/fonts`
- **Relative paths**: `./fonts` → `/absolute/path/to/fonts`
- **Symlinks**: Resolved to canonical paths (security measure)
- **Validation**: Directories must exist; warnings logged if not found
- **Graceful degradation**: Missing directories don't cause failures
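
The rules above can be approximated with Python's `pathlib`. This is a minimal sketch of the described behavior under stated assumptions, not Kreuzberg's actual implementation: `normalize_font_dirs` is a hypothetical helper, and the warning text is taken from the troubleshooting section of this guide.

```python
import logging
from pathlib import Path

logger = logging.getLogger("font_paths")


def normalize_font_dirs(raw_dirs):
    """Expand, absolutize, and canonicalize directories; skip missing ones."""
    usable = []
    for raw in raw_dirs:
        # expanduser() handles "~"; resolve() makes the path absolute
        # and follows symlinks to the canonical location.
        candidate = Path(raw).expanduser().resolve()
        if candidate.is_dir():
            usable.append(candidate)
        else:
            # Graceful degradation: warn and continue instead of raising.
            logger.warning("Custom font directory not found: %s", raw)
    return usable


print(normalize_font_dirs([".", "./no-such-font-dir"]))
```

With this approach a missing directory only produces a log entry, so one bad path does not prevent the remaining directories from being used.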

## Global Configuration

**Important**: Font configuration is global per process and must be set **before the first PDF extraction**.

=== "Rust"

    ```rust
    use kreuzberg::{ExtractionConfig, PdfConfig, FontConfig};
    use std::path::PathBuf;

    // CORRECT: Set config before first extraction
    let config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            font_config: Some(FontConfig {
                enabled: true,
                custom_font_dirs: Some(vec![
                    PathBuf::from("/usr/share/fonts/custom"),
                ]),
            }),
            ..Default::default()
        }),
        ..Default::default()
    };

    let result = kreuzberg::extract_file("document.pdf", &config)?;

    // INCORRECT: Attempting to change config after first extraction
    let new_config = ExtractionConfig {
        pdf_options: Some(PdfConfig {
            font_config: Some(FontConfig {
                enabled: false,
                custom_font_dirs: None,
            }),
            ..Default::default()
        }),
        ..Default::default()
    };
    let result2 = kreuzberg::extract_file("document2.pdf", &new_config)?;
    // Warning logged: "Font config already initialized"
    ```

=== "Python"

    ```python
    from kreuzberg import ExtractionConfig, PdfConfig, FontConfig, extract_file

    # CORRECT: Set config before first extraction
    config = ExtractionConfig(
        pdf_options=PdfConfig(
            font_config=FontConfig(
                enabled=True,
                custom_font_dirs=["/usr/share/fonts/custom"]
            )
        )
    )
    result = extract_file("document.pdf", config)

    # INCORRECT: Attempting to change config after first extraction
    new_config = ExtractionConfig(
        pdf_options=PdfConfig(
            font_config=FontConfig(enabled=False)
        )
    )
    result2 = extract_file("document2.pdf", new_config)
    # Warning logged: "Font config already initialized"
    ```

=== "TypeScript"

    ```typescript
    import { ExtractionConfig, extractFile } from 'kreuzberg';

    // CORRECT: Set config before first extraction
    const config: ExtractionConfig = {
      pdfOptions: {
        fontConfig: {
          enabled: true,
          customFontDirs: ['/usr/share/fonts/custom']
        }
      }
    };
    const result = await extractFile('document.pdf', config);

    // INCORRECT: Attempting to change config after first extraction
    const newConfig: ExtractionConfig = {
      pdfOptions: {
        fontConfig: { enabled: false }
      }
    };
    const result2 = await extractFile('document2.pdf', newConfig);
    // Warning logged: "Font config already initialized"
    ```

=== "Java"

    ```java
    import dev.kreuzberg.config.*;
    import java.nio.file.Paths;
    import java.util.Arrays;

    // CORRECT: Set config before first extraction
    FontConfig fontConfig = FontConfig.builder()
        .enabled(true)
        .customFontDirs(Arrays.asList(Paths.get("/usr/share/fonts/custom")))
        .build();
    PdfConfig pdfConfig = PdfConfig.builder()
        .fontConfig(fontConfig)
        .build();
    ExtractionConfig config = ExtractionConfig.builder()
        .pdfOptions(pdfConfig)
        .build();
    ExtractionResult result = Kreuzberg.extractFile("document.pdf", config);

    // INCORRECT: Attempting to change config after first extraction
    FontConfig newFontConfig = FontConfig.builder()
        .enabled(false)
        .build();
    PdfConfig newPdfConfig = PdfConfig.builder()
        .fontConfig(newFontConfig)
        .build();
    ExtractionConfig newConfig = ExtractionConfig.builder()
        .pdfOptions(newPdfConfig)
        .build();
    ExtractionResult result2 = Kreuzberg.extractFile("document2.pdf", newConfig);
    // Warning logged: "Font config already initialized"
    ```

=== "Go"

    ```go
    import "github.com/kreuzberg-dev/kreuzberg/v4"

    // CORRECT: Set config before first extraction
    config := &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            FontConfig: &kreuzberg.FontConfig{
                Enabled: true,
                CustomFontDirs: []string{"/usr/share/fonts/custom"},
            },
        },
    }
    result, _ := kreuzberg.ExtractFile("document.pdf", config)

    // INCORRECT: Attempting to change config after first extraction
    newConfig := &kreuzberg.ExtractionConfig{
        PdfOptions: &kreuzberg.PdfConfig{
            FontConfig: &kreuzberg.FontConfig{
                Enabled: false,
            },
        },
    }
    result2, _ := kreuzberg.ExtractFile("document2.pdf", newConfig)
    // Warning logged: "Font config already initialized"
    ```

=== "Ruby"

    ```ruby
    require 'kreuzberg'

    # CORRECT: Set config before first extraction
    config = Kreuzberg::ExtractionConfig.new(
      pdf_options: Kreuzberg::PdfConfig.new(
        font_config: Kreuzberg::FontConfig.new(
          enabled: true,
          custom_font_dirs: ['/usr/share/fonts/custom']
        )
      )
    )
    result = Kreuzberg.extract_file('document.pdf', config)

    # INCORRECT: Attempting to change config after first extraction
    new_config = Kreuzberg::ExtractionConfig.new(
      pdf_options: Kreuzberg::PdfConfig.new(
        font_config: Kreuzberg::FontConfig.new(enabled: false)
      )
    )
    result2 = Kreuzberg.extract_file('document2.pdf', new_config)
    # Warning logged: "Font config already initialized"
    ```

=== "C#"

    ```csharp
    using Kreuzberg;

    // CORRECT: Set config before first extraction
    var fontConfig = new FontConfig
    {
        Enabled = true,
        CustomFontDirs = new[] { "/usr/share/fonts/custom" }
    };
    var pdfConfig = new PdfConfig { FontConfig = fontConfig };
    var config = new ExtractionConfig { PdfOptions = pdfConfig };
    var result = Kreuzberg.ExtractFile("document.pdf", config);

    // INCORRECT: Attempting to change config after first extraction
    var newFontConfig = new FontConfig { Enabled = false };
    var newPdfConfig = new PdfConfig { FontConfig = newFontConfig };
    var newConfig = new ExtractionConfig { PdfOptions = newPdfConfig };
    var result2 = Kreuzberg.ExtractFile("document2.pdf", newConfig);
    // Warning logged: "Font config already initialized"
    ```
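
The set-once behavior can be modeled as a process-wide value that only the first caller gets to write. The sketch below is a simplified model of that behavior, not Kreuzberg's internals: `init_font_config` and the plain `dict` configuration are hypothetical stand-ins chosen so the example runs without the library.

```python
import logging
import threading

logger = logging.getLogger("font_config")

_lock = threading.Lock()
_active_font_config = None  # written at most once per process


def init_font_config(font_config: dict) -> dict:
    """Install font_config if none is active; otherwise keep the first one."""
    global _active_font_config
    with _lock:
        if _active_font_config is None:
            _active_font_config = dict(font_config)
        elif font_config != _active_font_config:
            # Later, different settings are ignored with a warning.
            logger.warning("Font config already initialized")
        return _active_font_config


first = init_font_config(
    {"enabled": True, "custom_font_dirs": ["/usr/share/fonts/custom"]}
)
second = init_font_config({"enabled": False})
print(second["enabled"])  # prints True: the first configuration is still active
```

In practice this means an application should build one configuration at startup and reuse it for every extraction, rather than constructing font settings per call.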

## Performance Impact

With default settings (`enabled = true`, system fonts only):

- **PDF extraction**: ~12-13% faster
- **Memory**: Minimal increase (~100 KB for font cache)
- **Startup**: Lazy initialization (no overhead for non-PDF workloads)

## Troubleshooting

### Custom fonts not working

**Symptom**: PDF still uses fallback fonts

**Solutions**:

1. Verify that the directories exist and contain `.ttf`, `.otf`, or `.ttc` files
2. Check logs for "Custom font directory not found" warnings
3. Ensure paths are absolute or properly expanded
4. Verify font files are readable
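
The directory and readability checks can be scripted. The following sketch uses only the Python standard library; `check_font_dir` is a hypothetical diagnostic helper, and it inspects top-level files only, since this guide does not state whether subdirectories are scanned.

```python
import os
from pathlib import Path

FONT_SUFFIXES = {".ttf", ".otf", ".ttc"}


def check_font_dir(raw_dir: str) -> dict:
    """Report whether a directory exists and which font files are readable."""
    path = Path(raw_dir).expanduser()
    report = {
        "path": str(path),
        "exists": path.is_dir(),
        "readable": [],
        "unreadable": [],
    }
    if not report["exists"]:
        return report
    for entry in path.iterdir():
        # Match font files by extension, ignoring case.
        if entry.is_file() and entry.suffix.lower() in FONT_SUFFIXES:
            key = "readable" if os.access(entry, os.R_OK) else "unreadable"
            report[key].append(entry.name)
    return report


print(check_font_dir("/usr/share/fonts/custom"))
```

An empty `readable` list for a directory that exists suggests that the fonts live in a subdirectory or use an extension other than the three listed above.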

### "Font config already initialized" warning

**Symptom**: Configuration changes ignored after first PDF extraction

**Solution**: Set `FontConfig` in the **first** `ExtractionConfig` used. Because font configuration is global per process, later changes are not supported and are ignored.

### Performance regression

**Symptom**: PDF extraction slower after upgrade

**Solution**: This is unexpected. Please report it as a bug and include:

- PDF sample (if shareable)
- Benchmark comparison (before/after)
- Configuration used
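
For the benchmark comparison, a small timing harness is usually enough. The sketch below is one possible approach, not an official benchmarking tool: `time_extraction` and `percent_change` are hypothetical helpers, and `extract` stands for whatever callable wraps your extraction call in each version.

```python
import statistics
import time


def time_extraction(extract, path, runs=5):
    """Return the median wall-clock seconds for calling extract(path)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        extract(path)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def percent_change(before, after):
    """Relative change in percent; negative means faster after the upgrade."""
    return (after - before) / before * 100.0


# Example with placeholder timings (seconds) measured on each version:
print(percent_change(2.0, 1.75))  # prints -12.5
```

Run the harness against the same PDF on both versions, in a fresh process each time, so that the first-call font initialization is measured consistently.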

## Questions?

- **Issue tracker**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
- **Discussions**: <https://github.com/kreuzberg-dev/kreuzberg/discussions>