# Image Index References (v5.0)
## Summary
`PageContent.images: Vec<Arc<ExtractedImage>>` is removed. Pages now carry `image_indices: Vec<u32>`, a list of zero-based indices into `ExtractionResult.images`.
## Breaking Change
**Previous behavior** (v4.x):
```rust
let result = extractor.extract(path, &config).await?;
for page in result.pages.unwrap_or_default() {
    for image in &page.images {
        println!("{:?}", image.data);
    }
}
```
**New behavior** (v5.0):
```rust
let result = extractor.extract(path, &config).await?;
let images = result.images.as_deref().unwrap_or(&[]);
for page in result.pages.unwrap_or_default() {
    for &idx in &page.image_indices {
        println!("{:?}", images[idx as usize].data);
    }
}
```
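
Direct indexing panics if an index is out of range, and `unwrap()` panics if `images` is `None`. Callers who prefer a non-panicking lookup could resolve indices with `slice::get` instead. The following is a minimal sketch under stated assumptions: `ExtractedImage` here is a simplified stand-in with only a `data` field, and `images_for_page` is a hypothetical helper, not part of the library.

```rust
/// Simplified stand-in for the library's image type (assumption: only `data` is modeled).
struct ExtractedImage {
    data: Vec<u8>,
}

/// Hypothetical helper: resolves a page's `image_indices` against the
/// result-level image list, silently skipping indices that are out of range.
fn images_for_page<'a>(
    all_images: Option<&'a [ExtractedImage]>,
    image_indices: &[u32],
) -> Vec<&'a ExtractedImage> {
    // Treat a missing image list as an empty one.
    let images = all_images.unwrap_or(&[]);
    image_indices
        .iter()
        // `get` returns `None` instead of panicking on a bad index.
        .filter_map(|&idx| images.get(idx as usize))
        .collect()
}
```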
`ChunkMetadata` gains the same `image_indices: Vec<u32>` field, populated post-chunking by matching each image's `page_number` against `[first_page, last_page]`.
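
The matching step described above can be sketched as follows. This is an illustration only, not the library's implementation: `ImageRef` and `image_indices_for_chunk` are hypothetical names, page numbers are assumed to be `Option<u32>`, and the bracket notation `[first_page, last_page]` is read as an inclusive range.

```rust
/// Hypothetical stand-in carrying only the field needed for matching.
struct ImageRef {
    page_number: Option<u32>,
}

/// Returns the zero-based indices of images whose page falls inside the
/// chunk's page range (assumed inclusive on both ends).
/// A chunk without page provenance yields no indices.
fn image_indices_for_chunk(
    images: &[ImageRef],
    first_page: Option<u32>,
    last_page: Option<u32>,
) -> Vec<u32> {
    let (Some(first), Some(last)) = (first_page, last_page) else {
        return Vec::new();
    };
    images
        .iter()
        .enumerate()
        // Images with no page number never match.
        .filter(|(_, img)| img.page_number.is_some_and(|p| (first..=last).contains(&p)))
        .map(|(i, _)| i as u32)
        .collect()
}
```

Under these assumptions, a chunk whose `first_page` or `last_page` is `None` always receives an empty list, which is consistent with the limitation noted for `YamlSectionChunker`.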
## Impact
**Who is affected?**
- Users reading `page.images` directly
- Users passing `PageContent` values across FFI boundaries
- All polyglot bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir); these bindings are regenerated automatically

**What changes?**

| Before                   | After                                                         |
| ------------------------ | ------------------------------------------------------------- |
| `page.images[i].data`    | `result.images.unwrap()[page.image_indices[i] as usize].data` |
| `page.images.len()`      | `page.image_indices.len()`                                    |
| `page.images.is_empty()` | `page.image_indices.is_empty()`                               |
## Known Limitation
`YamlSectionChunker` does not track page provenance (`first_page`/`last_page` are always `None`), so its chunks always produce empty `image_indices`. Tracked in a separate issue.