fil/docs/migration/v5.0-image-indices.md
Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00

Image Index References (v5.0)

Summary

The PageContent.images: Vec<Arc<ExtractedImage>> field is removed. Each page now carries image_indices: Vec<u32>, a list of zero-based indices into ExtractionResult.images.

Breaking Change

Previous behavior (v4.x):

let result = extractor.extract(path, &config).await?;
for page in result.pages.unwrap_or_default() {
    for image in &page.images {
        println!("{:?}", image.data);
    }
}

New behavior (v5.0):

let result = extractor.extract(path, &config).await?;
let images = result.images.as_deref().unwrap_or(&[]);
for page in result.pages.unwrap_or_default() {
    for &idx in &page.image_indices {
        println!("{:?}", images[idx as usize].data);
    }
}
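
The loop above indexes images directly, which panics if an index has no matching entry. As a minimal sketch of a more defensive lookup, the helper below resolves a page's indices with slice::get and skips anything out of range. The struct definitions are hypothetical stand-ins that model only the fields used here, and images_for_page is an illustrative name, not part of the library.

```rust
// Hypothetical stand-ins for the library types; only the fields this sketch needs.
#[derive(Debug)]
pub struct ExtractedImage {
    pub data: Vec<u8>,
}

pub struct PageContent {
    pub image_indices: Vec<u32>,
}

pub struct ExtractionResult {
    pub images: Option<Vec<ExtractedImage>>,
    pub pages: Option<Vec<PageContent>>,
}

/// Resolve a page's image indices against the result-level image list.
/// Indices without a matching entry are skipped instead of panicking.
/// (Illustrative helper; not a function provided by the library.)
pub fn images_for_page<'a>(
    result: &'a ExtractionResult,
    page: &PageContent,
) -> Vec<&'a ExtractedImage> {
    // A missing image list is treated as an empty one.
    let images = result.images.as_deref().unwrap_or(&[]);
    page.image_indices
        .iter()
        .filter_map(|&idx| images.get(idx as usize))
        .collect()
}
```

Calling images_for_page(&result, page) inside the page loop gives a Vec of references that can be iterated much like the old page.images. The helper borrows the result, so iterate over result.pages by reference rather than consuming it.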

ChunkMetadata gains the same image_indices: Vec<u32> field. It is populated after chunking: an image's index is added to a chunk when the image's page_number falls within the chunk's page range [first_page, last_page].
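
To make the matching rule concrete, here is a rough sketch of how such a post-chunking pass could look. It is not the library's implementation: the struct definitions are hypothetical stand-ins, the Option<usize> page types are an assumption, and assign_image_indices is an illustrative name.

```rust
// Hypothetical stand-ins; the real types carry more fields and the
// page-number types (Option<usize>) are an assumption of this sketch.
pub struct ExtractedImage {
    pub page_number: Option<usize>,
}

#[derive(Default)]
pub struct ChunkMetadata {
    pub first_page: Option<usize>,
    pub last_page: Option<usize>,
    pub image_indices: Vec<u32>,
}

/// For every chunk, collect the indices of images whose page number lies in
/// the inclusive range [first_page, last_page].
/// (Illustrative helper; not a function provided by the library.)
pub fn assign_image_indices(chunks: &mut [ChunkMetadata], images: &[ExtractedImage]) {
    for chunk in chunks.iter_mut() {
        chunk.image_indices.clear();
        // Without a complete page range there is nothing to match against,
        // so the chunk keeps an empty index list.
        let (Some(first), Some(last)) = (chunk.first_page, chunk.last_page) else {
            continue;
        };
        for (idx, image) in images.iter().enumerate() {
            // Images without a page number never match any chunk.
            if let Some(page) = image.page_number {
                if (first..=last).contains(&page) {
                    chunk.image_indices.push(idx as u32);
                }
            }
        }
    }
}
```

Under this rule an image on a page shared by two chunks is referenced by both, and a chunk whose first_page or last_page is None ends up with an empty image_indices list.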

Impact

Who is affected?

  • Users reading page.images directly
  • Users passing PageContent values across FFI boundaries
  • All polyglot bindings (Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir), which are regenerated automatically

What changes?

| Before | After |
|---|---|
| page.images[i].data | result.images.unwrap()[page.image_indices[i] as usize].data |
| page.images.len() | page.image_indices.len() |
| page.images.is_empty() | page.image_indices.is_empty() |

Known Limitation

YamlSectionChunker does not track page provenance (first_page/last_page are always None), so its chunks always produce empty image_indices. Tracked in a separate issue.