2026-06-01 23:40:55 +02:00
# Image Index References (v5.0)
## Summary
`PageContent.images: Vec<Arc<ExtractedImage>>` is removed. Pages now carry `image_indices: Vec<u32>` — zero-based indices into `ExtractionResult.images`.
## Breaking Change
**Previous behavior** (v4.x):
```rust
let result = extractor.extract(path, &config).await?;
for page in result.pages.unwrap_or_default() {
    for image in &page.images {
        println!("{:?}", image.data);
    }
}
```
**New behavior** (v5.0):
```rust
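let result = extractor.extract(path, &config).await?;
// All images now live on the result; pages only hold indices into this slice.
let images = result.images.as_deref().unwrap_or(&[]);
for page in result.pages.unwrap_or_default() {
    for &idx in &page.image_indices {
        // Indexing panics on an out-of-range index; use `images.get(..)` to handle that case.
        println!("{:?}", images[idx as usize].data);
    }
}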
```
`ChunkMetadata` gains the same `image_indices: Vec<u32>` field. It is populated after chunking by matching each image's `page_number` against the chunk's `[first_page, last_page]` range.
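
The page-range matching described above can be sketched roughly as follows. This is an illustration, not the crate's implementation: `ImageStub` and `indices_for_page_range` are hypothetical names, and it assumes `page_number` is an `Option<u32>` and that the range is inclusive on both ends.

```rust
/// Hypothetical stand-in for the real image type; only the field used here.
/// Assumption: `page_number` is optional.
#[derive(Debug, Clone)]
pub struct ImageStub {
    pub page_number: Option<u32>,
}

/// Returns the zero-based indices of all images whose page number falls
/// inside the inclusive range `[first_page, last_page]`.
pub fn indices_for_page_range(
    images: &[ImageStub],
    first_page: Option<u32>,
    last_page: Option<u32>,
) -> Vec<u32> {
    // Without page provenance there is nothing to match against,
    // so the chunk gets no image indices.
    let (first, last) = match (first_page, last_page) {
        (Some(first), Some(last)) => (first, last),
        _ => return Vec::new(),
    };

    images
        .iter()
        .enumerate()
        // Keep images with a known page inside the chunk's range.
        .filter(|(_, img)| matches!(img.page_number, Some(p) if first <= p && p <= last))
        .map(|(i, _)| i as u32)
        .collect()
}
```

Under these assumptions, a chunk whose `first_page` or `last_page` is `None` ends up with an empty `image_indices`.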
## Impact
**Who is affected?**
- Users reading `page.images` directly
- Users passing `PageContent` values across FFI boundaries
- All polyglot bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir) — bindings are regenerated automatically

**What changes?**

| Before | After |
| ------------------------ | ------------------------------------------------------------- |
| `page.images[i].data` | `result.images.unwrap()[page.image_indices[i] as usize].data` |
| `page.images.len()` | `page.image_indices.len()` |
| `page.images.is_empty()` | `page.image_indices.is_empty()` |
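
If the same lookup is needed in several places, a small helper can keep call sites short and avoid panics on missing images or stale indices. The sketch below is one possible shape; `resolve_images` is a hypothetical name and is not part of the library. It is generic so that it does not depend on the exact image type.

```rust
/// Resolves a list of image indices against the shared image collection.
/// Hypothetical helper: out-of-range indices are skipped instead of
/// panicking, and a missing collection yields an empty result.
pub fn resolve_images<'a, T>(images: Option<&'a [T]>, indices: &[u32]) -> Vec<&'a T> {
    let images = images.unwrap_or(&[]);
    indices
        .iter()
        // `get` returns `None` for indices past the end of the slice.
        .filter_map(|&idx| images.get(idx as usize))
        .collect()
}
```

With the types shown earlier, a call might look like `resolve_images(result.images.as_deref(), &page.image_indices)`, assuming `result.images` is an `Option` wrapping a `Vec`, as the v5.0 example suggests. Silently skipping invalid indices is a design choice; returning an error may be preferable where a stale index indicates a bug.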
## Known Limitation
`YamlSectionChunker` does not track page provenance (`first_page`/`last_page` are always `None`), so its chunks always produce empty `image_indices`. Tracked in a separate issue.