Image Index References (v5.0)

Summary

The PageContent.images field (Vec<Arc<ExtractedImage>>) is removed. Each page now carries image_indices: Vec<u32>, a list of zero-based indices into ExtractionResult.images.

Breaking Change

Previous behavior (v4.x):

let result = extractor.extract(path, &config).await?;
for page in result.pages.unwrap_or_default() {
    for image in &page.images {
        println!("{:?}", image.data);
    }
}

New behavior (v5.0):

let result = extractor.extract(path, &config).await?;
let images = result.images.as_deref().unwrap_or(&[]);
for page in result.pages.unwrap_or_default() {
    for &idx in &page.image_indices {
        println!("{:?}", images[idx as usize].data);
    }
}
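
The loop above indexes images directly, which panics if an index is out of range. A minimal sketch of a bounds-checked lookup follows. The struct definitions are simplified stand-ins reduced to the fields named in this document (the real types carry more), and images_for_page is a hypothetical helper, not part of the library API.

```rust
use std::sync::Arc;

// Simplified stand-ins for the library types (assumption: only the
// fields mentioned in this document are modelled here).
#[derive(Debug)]
pub struct ExtractedImage {
    pub data: Vec<u8>,
}

pub struct PageContent {
    pub image_indices: Vec<u32>,
}

pub struct ExtractionResult {
    pub images: Option<Vec<Arc<ExtractedImage>>>,
    pub pages: Option<Vec<PageContent>>,
}

/// Hypothetical helper: resolve a page's indices to borrowed images,
/// silently skipping any index that is out of range.
pub fn images_for_page<'a>(
    result: &'a ExtractionResult,
    page: &PageContent,
) -> Vec<&'a ExtractedImage> {
    // A missing image list is treated as an empty slice.
    let images: &[Arc<ExtractedImage>] = result.images.as_deref().unwrap_or(&[]);
    page.image_indices
        .iter()
        .filter_map(|&idx| images.get(idx as usize))
        .map(|arc| &**arc) // &Arc<ExtractedImage> -> &ExtractedImage
        .collect()
}
```

Skipping invalid indices is one possible policy; callers that prefer to surface the problem could return a Result instead.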

ChunkMetadata gains the same image_indices: Vec<u32> field. It is populated after chunking by matching each image's page_number against the chunk's [first_page, last_page] range.
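
The page-range matching can be sketched roughly as below. This is an illustration under stated assumptions, not the library's implementation: the structs are reduced stand-ins, page_number, first_page and last_page are assumed to be Option<usize>, the range is assumed to be inclusive at both ends, and assign_image_indices is a hypothetical name.

```rust
// Reduced stand-ins (assumption: field types are guesses for illustration).
pub struct ExtractedImage {
    pub page_number: Option<usize>,
}

pub struct ChunkMetadata {
    pub first_page: Option<usize>,
    pub last_page: Option<usize>,
    pub image_indices: Vec<u32>,
}

/// Hypothetical post-chunking pass: for each chunk, collect the indices of
/// all images whose page falls inside the chunk's page range.
pub fn assign_image_indices(chunks: &mut [ChunkMetadata], images: &[ExtractedImage]) {
    for chunk in chunks.iter_mut() {
        chunk.image_indices = match (chunk.first_page, chunk.last_page) {
            (Some(first), Some(last)) => images
                .iter()
                .enumerate()
                // Keep images with a known page inside first..=last.
                .filter(|(_, img)| {
                    img.page_number
                        .map_or(false, |p| (first..=last).contains(&p))
                })
                .map(|(i, _)| i as u32)
                .collect(),
            // No page provenance on the chunk: nothing can be matched.
            _ => Vec::new(),
        };
    }
}
```

Under these assumptions, a chunk with no page provenance always ends up with an empty image_indices, and an image without a page_number is never assigned to any chunk.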

Impact

Who is affected?

Users reading page.images directly
Users passing PageContent values across FFI boundaries
All polyglot bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir), which are regenerated automatically

What changes?

| Before | After |
| --- | --- |
| `page.images[i].data` | `result.images.unwrap()[page.image_indices[i] as usize].data` |
| `page.images.len()` | `page.image_indices.len()` |
| `page.images.is_empty()` | `page.image_indices.is_empty()` |
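
For code that still expects the old per-page image lists, a small shim can rebuild them from the indices. The sketch below is a hypothetical migration aid, not something the library provides: the structs are simplified stand-ins, and legacy_page_images is an invented name.

```rust
use std::sync::Arc;

// Simplified stand-ins for the library types (assumption: only the
// fields mentioned in this document are modelled here).
pub struct ExtractedImage {
    pub data: Vec<u8>,
}

pub struct PageContent {
    pub image_indices: Vec<u32>,
}

pub struct ExtractionResult {
    pub images: Option<Vec<Arc<ExtractedImage>>>,
    pub pages: Option<Vec<PageContent>>,
}

/// Hypothetical shim: rebuild the v4.x-style list of images for every page.
/// Cloning an Arc only bumps a reference count; image data is not copied.
pub fn legacy_page_images(result: &ExtractionResult) -> Vec<Vec<Arc<ExtractedImage>>> {
    let images = result.images.as_deref().unwrap_or(&[]);
    let pages = result.pages.as_deref().unwrap_or(&[]);
    pages
        .iter()
        .map(|page| {
            page.image_indices
                .iter()
                // Out-of-range indices are skipped rather than panicking.
                .filter_map(|&idx| images.get(idx as usize))
                .map(Arc::clone)
                .collect()
        })
        .collect()
}
```

The outer Vec has one entry per page, in page order, so `legacy_page_images(&result)[n]` plays the role that `pages[n].images` did in v4.x.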

Known Limitation

YamlSectionChunker does not track page provenance (first_page/last_page are always None), so its chunks always produce empty image_indices. Tracked in a separate issue.
