Image Index References (v5.0)
Summary
PageContent.images: Vec<Arc<ExtractedImage>> is removed. Pages now carry image_indices: Vec<u32> — zero-based indices into ExtractionResult.images.
Breaking Change
Previous behavior (v4.x):
let result = extractor.extract(path, &config).await?;
for page in result.pages.unwrap_or_default() {
    for image in &page.images {
        println!("{:?}", image.data);
    }
}
New behavior (v5.0):
let result = extractor.extract(path, &config).await?;
let images = result.images.as_deref().unwrap_or(&[]);
for page in result.pages.unwrap_or_default() {
    for &idx in &page.image_indices {
        println!("{:?}", images[idx as usize].data);
    }
}
ChunkMetadata gains the same image_indices: Vec<u32> field, populated post-chunking by matching each image's page_number against [first_page, last_page].
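The matching rule can be sketched roughly as follows. This is an illustration only, not the library's implementation: `Image` and `Chunk` are hypothetical stand-ins that carry just the fields named in this document, and the field types and the inclusive treatment of `[first_page, last_page]` are assumptions.

```rust
/// Hypothetical stand-in for an extracted image; only the page number matters here.
struct Image {
    page_number: Option<u32>,
}

/// Hypothetical stand-in for chunk metadata (field types are assumed).
struct Chunk {
    first_page: Option<u32>,
    last_page: Option<u32>,
    image_indices: Vec<u32>,
}

/// Fill `image_indices` for every chunk by testing each image's page number
/// against the chunk's page range, treated here as inclusive on both ends.
fn assign_image_indices(chunks: &mut [Chunk], images: &[Image]) {
    for chunk in chunks.iter_mut() {
        chunk.image_indices.clear();
        // A chunk without page provenance cannot match any image,
        // so its index list stays empty.
        if let (Some(first), Some(last)) = (chunk.first_page, chunk.last_page) {
            for (idx, image) in images.iter().enumerate() {
                if let Some(page) = image.page_number {
                    if (first..=last).contains(&page) {
                        // Indices refer to positions in the shared image list.
                        chunk.image_indices.push(idx as u32);
                    }
                }
            }
        }
    }
}
```

In this sketch, a chunk whose `first_page` and `last_page` are both `None` ends up with an empty `image_indices`, and an image with no page number is never assigned to any chunk.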
Impact
Who is affected?
- Users reading `page.images` directly
- Users passing `PageContent` values across FFI boundaries
- All polyglot bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir); bindings are regenerated automatically
What changes?
| Before | After |
|---|---|
| `page.images[i].data` | `result.images.unwrap()[page.image_indices[i] as usize].data` |
| `page.images.len()` | `page.image_indices.len()` |
| `page.images.is_empty()` | `page.image_indices.is_empty()` |
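When migrating many call sites, a small helper can keep the index lookup in one place. The sketch below is not part of the library: `resolve_images` is a hypothetical name, and it is generic over the element type so it makes no assumption about how `ExtractionResult.images` stores its items. It uses the standard slice method `get` instead of direct indexing, so an out-of-range index is skipped rather than causing a panic.

```rust
/// Hypothetical helper: resolve a page's (or chunk's) `image_indices`
/// into references to entries of the shared image list.
/// Indices that fall outside the list are skipped.
fn resolve_images<'a, T>(
    all_images: &'a [T],
    image_indices: &'a [u32],
) -> impl Iterator<Item = &'a T> + 'a {
    image_indices
        .iter()
        // `get` returns `None` for an out-of-range index instead of panicking.
        .filter_map(move |&idx| all_images.get(idx as usize))
}
```

A call site would then look something like `for image in resolve_images(images, &page.image_indices) { ... }`, where `images` is the slice obtained from `result.images`. Whether to skip a bad index or treat it as an error is a design choice; skipping is used here only to keep the sketch short.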
Known Limitation
YamlSectionChunker does not track page provenance (first_page/last_page are always None), so its chunks always produce empty image_indices. Tracked in a separate issue.