# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [Unreleased]

### Security

- **config**: `ExtractionConfig::extraction_timeout_secs` now defaults to `Some(60)` instead of `None`. Pathological inputs (deeply nested archives, sheets with millions of cells, adversarial PDFs) would otherwise run indefinitely. The 60 s limit is enforced via `tokio::time::timeout` in `extract_bytes`. Set to `None` to disable for trusted input.
- **config**: New `ExtractionConfig::max_embedded_file_bytes` field (default `Some(52_428_800)`, 50 MiB) caps the uncompressed size of any single embedded file before recursive extraction is attempted. Applies to OOXML embedded objects (DOCX/PPTX) and email attachments. Files exceeding the cap are skipped with a `ProcessingWarning`. Set to `None` to disable.
- **security**: Component-based path traversal detection. The DOCX image extractor previously rejected archive paths with a string search (`!target.contains("..")`), which can be bypassed with normalised paths on some platforms. A new `has_path_traversal(path_str)` helper in `extractors::security` uses `std::path::Path::components()` and checks for `Component::ParentDir`, which is always correct regardless of path encoding or OS conventions. The docx extractor now calls this helper, and the `"1..2"` false-positive in the list-numbering check is correctly untouched (it never dealt with file-system paths).
- **pdf/email**: `SecurityBudget` wired into the PDF extractor and the email extractor. Both extractors now enforce `SecurityLimits.max_content_size` (default 100 MiB) on the total accumulated element text after extraction. A crafted document that produces more than the limit via repeated/synthetic content streams returns `KreuzbergError::Security` instead of silently exhausting heap.
  The check uses the same `From` conversion used by every other budget-aware extractor so bindings observe a unified `Security` error variant.
- **pdf**: Decompression ratio guard for PDF embedded file streams. lopdf's `decompressed_content()` was called unconditionally on embedded-file streams in `pdf/embedded_files.rs`, allowing a crafted PDF with a highly-compressed stream to expand to gigabytes of data in memory. The fix records `compressed_size` before decompression and checks the ratio against `SecurityLimits.max_compression_ratio` (default 100×) in `extract_and_process_embedded_files`. Streams exceeding the limit are skipped with a `ProcessingWarning`. The `max_embedded_file_bytes` cap (default 50 MiB) is also enforced for PDF embedded files on the same code path.
- **excel**: DDE / external-call formula scanner. `ExcelExtractor` now scans every cell for formula strings matching `=DDE(`, `=WEBSERVICE(`, `=HYPERLINK(`, and `=cmd|` (case-insensitive, anchored). Each match produces a `ProcessingWarning` on the returned `InternalDocument` carrying the sheet name, cell coordinate (R*C* notation), and the classified formula kind. Warnings are capped at 100 per document to bound output on adversarial sheets. Calamine resolves most formulas to their cached result, so the scanner only fires when the raw formula string is stored verbatim — but that is exactly the case for the highest-risk DDE injection payloads.

### Changed

- **api/openapi**: Tuple-typed fields that derive `ToSchema` now emit codegen-compatible schemas. Fields previously emitted as OpenAPI 3.1 `prefixItems` arrays with `items: false` (which crashes openapi-python-client 0.28/0.29 and swagger_parser 1.43) now emit as fixed-length homogeneous arrays (`[String; 2]`, `[f64; 2]`, etc.).
  Affected types: `Attributes.key_values`, `TextMetadata.links`, `TextMetadata.code_blocks`, `LinkMetadata.attributes`, `ImageMetadataType.attributes/dimensions`, `PageInfo.dimensions`, `HierarchicalBlock.bbox`, `ImagePreprocessingMetadata.original_dimensions/original_dpi/new_dimensions`, and `OcrBoundingGeometry::Quadrilateral.points`. The `DjotContent.attributes` field (heterogeneous tuple `(String, Attributes)`) now emits as `serde_json::Value`. Wire format is unchanged.
- **api/openapi**: `FormatMetadata` discriminated union now emits a flat `oneOf` with direct `$ref` items and a `discriminator` object (property `format_type`, with per-variant mapping). Previously emitted as `oneOf` of `allOf[$ref + inline property]` wrappers, which openapi-python-client rejected with "Invalid property in union". The change requires a manual `PartialSchema + ToSchema` impl in place of the derived one. Wire format and serde behavior are unchanged.

### Added

- **tools/generate_test_fixtures**: Python-based, deterministic fixture-generation toolkit at `tools/generate_test_fixtures/` that produces real on-disk DOCX, ODT, XLSX, PPTX, and PDF documents exercising track-changes, revisions, comments, incremental-update, diff, and security paths. Each binary fixture ships with a `.gt.json` ground-truth sidecar consumed by the new integration-test scaffold under `crates/kreuzberg/tests/`. Driven via `task fixtures:generate` / `task fixtures:test`.
- **odt/revisions**: ODT extraction now surfaces internal track-changes in `ExtractionResult.revisions` — insertions, deletions, and format changes from ``. Author + timestamp captured from ``. Anchor is `RevisionAnchor::Paragraph { index }` matching DOCX ergonomics. Extracted text follows the accepted-changes view (insertions live, deletions removed). Orphan body markers referencing unknown change-ids are logged and skipped, not fatal.
- **pptx/revisions**: PPTX extraction now surfaces slide comments in `ExtractionResult.revisions` — one `DocumentRevision { kind: Comment }` per `` element with author (resolved via `commentAuthors.xml`), timestamp, and slide-index anchor. The pre-existing document-level revision counter stays in `metadata.additional["revision"]`; this adds the per-comment structured surface. `revisions` is `None` when no `ppt/comments/comment{N}.xml` parts exist.
- **excel/revisions**: Excel extraction now parses `xl/revisions/revisionHeaders.xml` when present (legacy shared-workbook collaborative-edit headers) and surfaces each `` as a `DocumentRevision` with guid, author, and timestamp. Modern xlsx files without `xl/revisions/` continue to produce `revisions = None`. Per-cell change extraction from `revisionLog*.xml` is deferred to a follow-up.
- **pdf/revisions**: PDF extraction now exposes incremental-update history in `ExtractionResult.revisions`. Each historical `xref` section in the file's xref chain becomes one `DocumentRevision` carrying the byte offset as `revision_id` (`xref-offset-`), and (when present) the `/Info` dictionary's author and creation/modification timestamps. The `/Prev` chain is walked from the final xref backwards through raw bytes using the already-loaded `lopdf::Document` as the chain anchor. Single-save PDFs (no `/Prev`) yield `revisions = None`. Per-revision content extraction is deferred — `RevisionDelta` is empty for now; `RevisionKind::Insertion` is used as a placeholder (the enum is not `#[non_exhaustive]`, so a typed `Snapshot` variant is a future breaking-change candidate).
- **docx/revisions**: DOCX extraction now populates `ExtractionResult.revisions` from OOXML track-changes markup. `w:ins` elements produce `RevisionKind::Insertion` with the inserted text in `RevisionDelta.content` as `DiffLine::Added`; `w:del` / `w:delText` produce `RevisionKind::Deletion` with `DiffLine::Removed`; `w:rPrChange` produces `RevisionKind::FormatChange` with an empty delta (property diff is a future TODO). Each revision carries `w:id` (→ `revision_id`), `w:author`, and `w:date` (ISO-8601 string). Anchor is `RevisionAnchor::Paragraph { index }` pointing at the containing paragraph. Accepted-changes convention is preserved: inserted text appears in `content`, deleted text does not. `revisions` is `None` when the document contains no track-changes markup.
- **types**: new `revisions: Option>` field on `ExtractionResult` plus the `DocumentRevision` / `RevisionKind` / `RevisionAnchor` / `RevisionDelta` types.
  Every extractor defaults to `None` for now; per-format population lands in follow-up commits (DOCX next). `DocumentRevision` is part of the unconditional public surface — works without the `diff` feature. `DiffLine` and `CellChange` are now canonical in `types::revisions` and re-exported from `diff` for backward compat; the `diff` feature's public paths are unchanged.
- **diff**: new optional `diff` Cargo feature exposes `kreuzberg::diff::compare(a, b, opts) -> ExtractionDiff` over two `ExtractionResult` values. Backed by `similar` (Myers/LCS). Surfaces content hunks (unified-diff format), table cell-level diffs, metadata changes (add/remove/change map), and embedded-children add/remove/change (recursive). Aggregate features `analysis` and `full` now include `diff`.
- **types**: bidirectional `From` impls between `InternalDocument` (rich pipeline type) and `ExtractionResult` (public output type). Lossy conversions used at FFI/trait-bridge boundaries where foreign-language plugins return `ExtractionResult` but the canonical Rust trait signature requires `InternalDocument`. `ExtractionResult → InternalDocument` stashes content in `pre_rendered_content`; `InternalDocument → ExtractionResult` runs `derive_extraction_result` with `OutputFormat::Plain`.

### Fixed

- **java**: JVM SIGSEGV (exit 134) in `PluginApiTest.testRegisterDocumentExtractorTraitBridge`. Two Panama FFM bugs in the alef-generated trait bridges: (1) `free_user_data` vtable slot was set to `MemorySegment.NULL` instead of a proper upcall stub — Rust dereferenced NULL on drop causing SIGBUS; (2) the `DocumentExtractor` vtable's `MemoryLayout.structLayout` declared 10 ADDRESS fields where the Rust `#[repr(C)]` struct has 11, mis-aligning every offset past the missing slot. Both fixed at the alef template level + regenerated bindings. Java e2e now 100/100.
- **wasm**: `task wasm:e2e` failed before tests ran with `getrandom 0.2: the wasm32-unknown-unknown targets are not supported by default, you may need to enable the "js" feature`. Transitive deps (`const-random-macro`, `redox_users`, `ring`) pull `getrandom 0.2.17` into the wasm build; the pre-existing `getrandom 0.3`/`0.4` pins in the wasm crate manifest don't unify features back to 0.2. alef now emits a third `getrandom_02` alias pinning the 0.2 entry with the `js` feature.
- **wasm**: `Swift{Trait}Bridge`-style JSON deserialization in the wasm trait-bridge sync/async method bodies. Four template sites used `.as_string().and_then(|s| serde_json::from_str(&s).ok()).unwrap_or_default()` for enum and struct return decoding, silently swallowing decode errors and returning `Default::default()`. Replaced `.ok()` with `.map_err(|_| {{ error_deser }})` so decode failures surface during testing while the no-error return contract is preserved.
- **excel**: split `PageContent.sheet_name` from `PageContent.section_name` (the latter is PPTX-only) so the type's doc-comment matches the data again. Escape markdown-significant characters in sheet names when rendering per-page headings, preventing double-heading or broken-link rendering on adversarial sheet names (e.g. `## Profit` or `[Sales](evil)`). Normalise the empty-sheet content shape to `## \n\n` so per-page concatenation produces consistent separation between headings.
- **core/extractor**: fix `let validated_mime = if ... else { ... }` arm-type mismatch when `tree-sitter` feature is disabled. The octet-stream branch was wrapped in `#[cfg(feature = "tree-sitter")]` with no `#[cfg(not(...))]` fallback, causing the arm to evaluate to `()` under default features (which don't include tree-sitter) while the else arm produced `String`. Added an explicit `#[cfg(not(feature = "tree-sitter"))]` fallback that calls `mime::detect_mime_type_from_bytes`.

### Changed

- **kotlin-android (e2e)**: added trait-bridge e2e test generation.
  alef.toml now includes `class` field overrides for register/unregister/clear trait-bridge calls (e.g., `class = "OcrBackendBridge"`) to properly route e2e tests through the Bridge object methods instead of the Kreuzberg main class. kotlin_android e2e test count increased from 21 to 22 files (PluginApiTest.kt added).
- **rust**: `Uri` struct renamed to `ExtractedUri` to avoid collision with `dart:core.Uri` in Dart bindings. This is a breaking change for Rust consumers who reference `kreuzberg::Uri` directly; import `use kreuzberg::ExtractedUri` instead. All language bindings (`dart`, `python`, `node`, `ruby`, `php`, `go`, `java`, `csharp`, `kotlin`, `swift`, `r`, `elixir`, `zig`) automatically inherit the new struct name.
- **kreuzberg-ffi**: crate-types extended with `rlib` so downstream Rust crates (e.g. `kreuzberg-jni`) can take an in-process Rust dep on it without re-exporting the C ABI through a separate dylib.

### Fixed

- **diff**: hoist `similar` dep to workspace root and reference via `workspace = true` in the kreuzberg crate (the prior direct leaf pin would have failed cargo-sort and risked duplicate-version compiles). Doc-comment fixes: `tables_same_shape` now documents the dimensions-only behaviour explicitly (no claim of header-order matching), and the metadata-diff shape doc drops the misleading RFC 6902 reference in favour of describing the actual `{added, removed, changed}` envelope.
- **mime**: accept legacy `application/docx` as an alias for the RFC docx MIME so callers using the non-standard form aren't rejected. (cc kreuzberg-cloud sandbox + production traffic.)
- **swift**: extract bytes/sync overloads now resolve the first argument against the test-documents/fixtures directories before falling back to UTF-8 string content. Add Bridge-protocol `register*` overloads so trait-stub e2e fixtures compile against the typed Box surface. Add `name:` argument-label overloads for every `unregister*` function.
- **wasm**: feature-gate `LayoutDetectionConfig`, `TreeSitterConfig`, `FormatMetadata::Code`, and `ExtractionResult.code_intelligence` references that aren't part of `wasm-target`. Fix `WasmEmbeddingBackendBridge::dimensions` to parse JS numbers as `f64`/usize rather than JSON strings. Add explicit cfg-guarded match arm for `FormatMetadata::Code` in the `From` impl to resolve non-exhaustive match error when compiling with `--all-features` (code variant is feature-gated and only visible when `tree-sitter` or `tree-sitter-wasm` is enabled). Reorder the e2e setup to install the `require('env')` and WASI shims before the wasm bundle is pre-imported. Gate `tempfile`-based PST extraction behind `not(target_arch = "wasm32")` (WASI doesn't expose `mkstemp`).
- **kotlin-android**: force-link every `kreuzberg-ffi` C export into `libkreuzberg_jni.dylib` via the `use kreuzberg_ffi::{…}` import block (no `#[used]` shim needed once we depend on the typed surface directly). Migrate the JNI shim off c_void extern-C forwards onto typed kreuzberg-ffi paths — the Rust compiler then caught three signature-drift bugs (`kreuzberg_embed_texts` takes `*const EmbeddingConfig`, not a JSON c_char; `kreuzberg_get_embedding_preset` returns a typed `*mut EmbeddingPreset`; `kreuzberg_render_pdf_page_to_png` takes 8 args including `out_ptr/out_len/out_cap`). Close the last 12 e2e gaps to land 82/82 green: empty-mime → null FFI pointer, Jackson `ByteArray` (de)serializer that emits JSON int arrays to match Rust `Vec`, `KotlinModule` configured with `NullIsSameAsDefault` / `NullToEmptyCollection` / `SerializationInclusion.NON_EMPTY` / `FAIL_ON_UNKNOWN_PROPERTIES=false` so Rust serde defaults survive the round-trip, custom (de)serializer for the `OutputFormat` sealed class, JSON-text capture for `FormatMetadata.Code`, nullable `DocumentNode.contentLayer` and `ChunkingConfig.sizing`, path-or-bytes resolution in `renderPdfPageToPng`, and `nativeGetExtensionsForMimeImpl` wired to the real FFI.
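
The `FormatMetadata::Code` fix in the **wasm** entry above comes down to keeping a feature-gated enum variant and every match arm that names it under the same `cfg` condition. A minimal sketch of that pattern follows; the `Metadata` enum and `format_label` function are hypothetical stand-ins for illustration, not kreuzberg's actual types:

```rust
// Hypothetical stand-in for an enum with one feature-gated variant.
// These names are illustrative only; they are not kreuzberg's real types.
#[derive(Debug)]
pub enum Metadata {
    Pdf { pages: u32 },
    // Compiled in only when the (assumed) `tree-sitter` feature is enabled.
    #[cfg(feature = "tree-sitter")]
    Code { language: String },
}

pub fn format_label(meta: &Metadata) -> String {
    match meta {
        Metadata::Pdf { pages } => format!("pdf ({pages} pages)"),
        // The arm carries the same cfg as the variant. Without the feature,
        // both disappear and the match stays exhaustive; with the feature,
        // both are present, so `--all-features` builds also compile.
        #[cfg(feature = "tree-sitter")]
        Metadata::Code { language } => format!("code ({language})"),
    }
}
```

Calling `format_label(&Metadata::Pdf { pages: 3 })` yields `"pdf (3 pages)"` whether or not the feature is enabled. Dropping the `cfg` from the arm alone breaks builds without the feature, and dropping the arm entirely breaks builds with it, which is the failure mode the entry describes.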
### Added

- **extraction**: `ImageExtractionConfig` gains three new fields for controlling image-OCR output:
  - `run_ocr_on_images` (default `true`) — when `false`, suppresses per-image OCR even if an OCR backend is configured. Useful for extracting images without OCR overhead.
  - `ocr_text_only` (default `false`) — replaces the `![alt](url)` image placeholder with the OCR text. Only takes effect when `run_ocr_on_images` is `true`.
  - `append_ocr_text` (default `false`) — appends an OCR text paragraph after the image placeholder. Takes effect when `run_ocr_on_images` is `true` and `ocr_text_only` is `false`.

  All three fields serialize as snake_case (`run_ocr_on_images`, `ocr_text_only`, `append_ocr_text`) in config files and JSON. (#1017)
- **extraction**: image OCR now routes through the plugin backend registry instead of hard-coding `OcrProcessor`. Custom OCR backends registered via `registerOcrBackend` (Node.js) or the Rust registry API are used for per-image OCR, enabling mixed-mode extraction: native text from the document layer, custom-backend OCR on embedded images. (#1017)
- **extraction (pdf)**: when `images.run_ocr_on_images` is `true`, the PDF document-level OCR fallback (`RunFallback`) is skipped in favour of per-image OCR, preserving native PDF text extraction while still OCR-ing embedded images. (#1017)
- **node**: `JsOcrBackendBridge` now uses a persistent `napi_ref` to keep the JS object alive after `registerOcrBackend` returns, and a `ThreadsafeFunction` for `process_image` so the callback is invokable from tokio worker threads with the actual image bytes (previously the bytes were passed as a debug string, making the bridge non-functional for real binary payloads). (#1017)
- **pptx**: `PageContent` gains two optional fields populated during PPTX per-page extraction (requires `page_config` to be set): `speaker_notes` (text from `ppt/notesSlides/notesSlideN.xml`) and `section_name` (from `` in `ppt/presentation.xml`).
  Both serialize with `skip_serializing_if = "Option::is_none"` and are `None` for all non-PPTX formats. (#960)
- **excel**: XLSX extraction now populates `ExtractionResult.pages` with one `PageContent` per sheet. Top-level `content` remains the concatenation of all sheets, preserving backward compat for callers that do not read `pages`. Sheet name surfaces as `PageContent.section_name`. Empty sheets still emit a `PageContent` so the page index aligns with the sheet index; they are marked `is_blank = true` and carry no tables. Mirrors the PDF/PPTX per-page model and enables cloud billing per sheet (`billable_pages = sheet_count`).
- **pdf**: opt-in capture of full-page renders produced during PDF OCR preprocessing. When `ImageExtractionConfig.include_page_rasters = true`, per-page PNG renders are returned as `ImageKind::PageRaster` entries in `ExtractionResult.images`, enabling citation thumbnails and visual grounding downstream. Capture covers all three OCR entry paths (`force_ocr`, `force_ocr_pages`, `RunFallback`); document-level backend bypass emits a `ProcessingWarning`. (#1018)

### Changed

- **deps**: bump alef pin v0.19.14 → v0.19.20. v0.19.15-v0.19.20 ship generator fixes addressing the systemic trait-bridge stub regressions surfaced by v5.0.0-rc.3 CI E2E (11 of 14 lang jobs failing on plugin_api stubs missing super-trait methods, wrong return types, internal type leakage, missing PHP interface emission, etc.) — plus the v0.19.20 hotfix that registers the new PHP interface Jinja templates in alef's embedded `TEMPLATES` array. Affects every binding (`crates/kreuzberg-{node,wasm,ffi}`, `packages/*`) and every e2e suite under `e2e/`.
- **deps**: bump `html-to-markdown-rs` from `3.4.1` to `3.5.2` — adopts the upstream fix for nested mixed-list (`ul > li > ul > li > ol`) content duplication (kreuzberg-dev/html-to-markdown#385).
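
The precedence between the three `ImageExtractionConfig` flags listed under Added above (`run_ocr_on_images`, `ocr_text_only`, `append_ocr_text`) can be read as a small decision function. The sketch below is only an interpretation of that entry: `ImageOcrFlags`, `ImageOutput`, and `resolve_image_output` are hypothetical names, and the entry does not say what happens to OCR results when OCR runs but neither output flag is set, so that case is modelled here as leaving the placeholder untouched.

```rust
/// Hypothetical mirror of the three flags; not the real `ImageExtractionConfig`.
#[derive(Debug, Clone, Copy)]
pub struct ImageOcrFlags {
    pub run_ocr_on_images: bool,
    pub ocr_text_only: bool,
    pub append_ocr_text: bool,
}

impl Default for ImageOcrFlags {
    fn default() -> Self {
        // Defaults as stated in the changelog entry.
        Self {
            run_ocr_on_images: true,
            ocr_text_only: false,
            append_ocr_text: false,
        }
    }
}

/// What ends up in the rendered content for one embedded image.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageOutput {
    /// The `![alt](url)` placeholder is left as-is.
    PlaceholderOnly,
    /// The placeholder is replaced by the OCR text.
    OcrTextOnly,
    /// The placeholder is kept and an OCR text paragraph follows it.
    PlaceholderThenOcrText,
}

pub fn resolve_image_output(flags: ImageOcrFlags) -> ImageOutput {
    if !flags.run_ocr_on_images {
        // Per-image OCR is suppressed, so the other two flags have no effect.
        return ImageOutput::PlaceholderOnly;
    }
    if flags.ocr_text_only {
        // `ocr_text_only` wins over `append_ocr_text` when both are set.
        return ImageOutput::OcrTextOnly;
    }
    if flags.append_ocr_text {
        return ImageOutput::PlaceholderThenOcrText;
    }
    // Assumption: OCR may still run, but the rendered placeholder is unchanged.
    ImageOutput::PlaceholderOnly
}
```

With the defaults, `resolve_image_output(ImageOcrFlags::default())` returns `ImageOutput::PlaceholderOnly`; setting both `ocr_text_only` and `append_ocr_text` returns `ImageOutput::OcrTextOnly`, matching the rule that `append_ocr_text` only applies when `ocr_text_only` is `false`.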
### Changed (breaking)

- **ffi (ABI)**: `KreuzbergOcrBackendVTable.process_image` and `KreuzbergDocumentExtractorVTable.extract_bytes` now include a `uintptr_t image_bytes_len` / `content_len` parameter **immediately after** the `*const uint8_t` pointer and before the `config` string:

  ```c
  // Before (alef < 0.19.21)
  int32_t (*process_image)(const void *user_data, const uint8_t *image_bytes, const char *config, ...);

  // After (alef >= 0.19.21)
  int32_t (*process_image)(const void *user_data, const uint8_t *image_bytes, uintptr_t image_bytes_len, const char *config, ...);
  ```

  The old signature caused silent truncation of binary payloads at the first embedded NUL byte. Any pre-compiled C, Go, Java, or C# callback registered against the old vtable ABI must be recompiled. Fixes #1056.
- **types**: `ImageKind` gains a new `PageRaster` variant. Any exhaustive `match` on `ImageKind` in downstream code must add a `PageRaster` arm. (#1018)

### Fixed

- **email (RTF)**: `decompress_rtf_compressed` no longer performs a 4 GiB `Vec::with_capacity` on the attacker-controlled `raw_size` field from the OXRTFCP header in `.msg` files. The pre-allocation hint is now capped at 16 MiB; the `Vec` still grows freely beyond that limit for legitimate large payloads, so correctness is unaffected. (#1058)
- **table**: `detect_rows` sort comparator changed from `.unwrap()` to `.unwrap_or(Ordering::Equal)` on `partial_cmp`, eliminating a latent panic if a NaN y-center value were ever introduced. Defensive improvement; the `u32` fields of `HocrWord` cannot produce NaN today. (#1057)
- **chunking (pptx)**: skip `` elements whose `` lacks `r:embed` instead of aborting the entire slide; logs a warning and preserves the rest of the slide content. (#1016)
- **pdf/chunking**: populate `first_page`/`last_page` on every chunk from multi-page PDFs by normalising trailing whitespace in page content before locating it inside `result.content`; previously caused 7 of 24 chunks to have null page metadata.
  (#1013, #1004)
- **chunking (yaml/json)**: populate `first_page`/`last_page` on YAML/JSON section chunks by threading `page_boundaries` into `chunk_yaml_by_sections`. This unblocks the image-indices population step in `features.rs` for YAML chunks. (#963)
- **html/chunking**: nested mixed lists (`ul > li > ul > li > ol`) no longer duplicate content in extracted Markdown, and the Markdown chunker no longer panics with an integer underflow on the previously malformed output. (#1004)

## [5.0.0-rc.3] - 2026-05-26

### Changed

- **deps**: bump alef pin to v0.19.14 — bundles all v0.19.12 + v0.19.13 + v0.19.14 generator fixes across every language emitter (Rust e2e unused result, Go missing `encoding/json` import, Zig undeclared types + `callconv(.c)`, Python e2e `initialize()` stub, PHP e2e interface FQN, C# `Bridge.Register(impl)`, Elixir `unregister_*` public wrappers, dart `.as_ref()` for `Option`, dart unit-variant default ctor parens, Swift `RustStr` typedef in placeholder, Ruby cargo-machete metadata, Java duplicate-method dedup, Kotlin Android e2e ExtractionConfig inference + `.size` for ByteArray + EmbeddingConfig model, dart RID-aware FRB native lib loader, kotlin streaming virtual fields gating, csharp per-binding exception class, go arch name mapping for FFI download, elixir always-include `:jason` dep).
- **e2e**: add `homebrew` to `[crates.e2e].languages` — registry-mode test_app validates CLI + FFI Homebrew formulae post-publish.

### Fixed

- **publish (workflow)**: `release-finalize` job now declares `prepare` in its `needs:` list. The job referenced `needs.prepare.outputs.is_tag`, `dry_run`, `tag`, `is_prerelease` but did not depend on `prepare`, producing an `actionlint` failure that gated CI Lint.
- **publish (ruby)**: exclude `ext/*/native/target/` and `ext/*/native/tmp/` directories from gem files list.
  The gemspec was including the entire Rust build artifact tree (compiled `.rlib`, `.rmeta`, `.dylib` files), bloating the .gem from 33 MB to 388 MB, causing `gem spec` to reject it as invalid due to size constraints. Now filters out build directories via `reject { |f| f.include?("/native/target/") || f.include?("/native/tmp/") }`.
- **publish (workflow ordering)**: add new `finalize-github-release-after-uploads` job that publishes the GitHub Release from draft to final state immediately after all native artifact uploads complete, but before `publish-hex` and `publish-zig` jobs run. Hex.pm and Zig package registries fetch asset URLs from the release page, which fails with 404 when the release is in DRAFT state. Now, `publish-hex` and `publish-zig` depend on the early finalize job, ensuring the release is already published before they attempt to download assets. The final `release-finalize` job runs after all publishes for post-publish validation and idempotently handles the Go module tag creation.
- **dart (ios)**: use `wasm-target` feature set for iOS x86_64 simulator build instead of `android-target`. The x86_64-apple-ios target was pulling in Android ABI pre-built binaries (pdfium, ort) with mismatched CPU types, causing lipo failures during XCFramework assembly. `wasm-target` is ORT-free and cross-compiles cleanly for all iOS simulators.

## [5.0.0-rc.2] (skipped) - 2026-05-25

rc.2 was tagged + partially published (5/13 registries) on 2026-05-25 then superseded by rc.3. The fixes listed below were carried forward.
### Changed

- **deps**: bump alef pin to v0.19.10 (C# `KreuzbergLib.Register*`/`Unregister*` plugin forwarders + `*Bridge.Register*(impl)` static factories; C# trait-bridge stubs emit sync methods with real return types; Kotlin Android test-app switched to JUnit unit tests + correct Maven coordinate `dev.kreuzberg:kreuzberg-android`; Swift dead-code removal of legacy e2e helpers; PHP/Elixir `ahash` machete-ignore; Java output path aligned with pom.xml; Go test-app `go.mod` registry pin; pnpm-workspace allow-builds; Dart `FRB_DART_LOAD_EXTERNAL_LIBRARY_NATIVE_LIB_DIR` env var support; homebrew + php-ext codegen generalized from kreuzberg-specific defaults).

### Fixed

- **publish (zig)**: mirror html-to-markdown's `check-zig` + `publish-zig` pattern (uses `registry: github-release` with full tarball asset name + `package-name` override matching alef-emitted test_app URL pattern + dependency on `check-zig` for idempotent reruns on already-published versions).
- **publish (pub.dev)**: add `force_republish` check to `trigger-pubdev` condition matching html-to-markdown pattern so already-published versions reroll cleanly when needed.
- **publish (ruby + elixir + r + dart)**: explicit `vendor_mode = "registry"` in `alef.toml` so `path = "../../crates/kreuzberg"` deps inside inner Rust crates are rewritten to crates.io version deps at publish time. rc.1 ruby gem failed at install with `failed to read .../kreuzberg/Cargo.toml` for exactly this reason.
- **publish (pypi)**: verify non-strict (sdist/wheel allowlist mismatch was aborting publish in rc.1).
- **publish (node-platform)**: verify non-strict (NAPI stub sub-packages don't contain `.node` binaries for unbuilt platforms; was aborting node publish in rc.1).
- **publish (wasm)**: bound build timeout (90 min) + dedicated cache key to survive GitHub runner preemption that cancelled rc.1's WASM build at ~12 min.
- **publish (elixir)**: NIF matrix expansion (Elixir NIF matrix `include` without standalone `os` key was silently overwriting and only producing the Windows NIF; now produces all 12 NIFs across linux + macOS + windows + arm).
- **java**: align alef output path to pom.xml ``. The alef backend appends the package path (`dev/kreuzberg/`) to the configured output; the previous setting produced files at `packages/java/src/main/java/dev/kreuzberg/*.java` but pom.xml's `sourceDirectory=${project.basedir}` expects `packages/java/dev/kreuzberg/*.java`. This fixes Maven compilation failures where the rc.1 JAR was missing generated classes like `JsonUtil` and the plugin registration methods. The pom.xml layout already matches the FFI-style bindings (Go, C#, Dart, etc.), so aligning Java to the same convention restores consistency. (`alef.toml`)

### Changed

- **deps**: bump alef pin to v0.19.5 (bundles AHashMap binding fixes for FFI/Dart/Swift/Elixir/Ruby/PHP and WASM emitter dedup + collect + From-impl + sub-config deserialize fixes).

### Fixed

- **bindings**: Stop emitting `calculate_quality_score` in language bindings — was made `pub(crate)` in commit `088168665a`, exposing an internal detail of `text::quality` module that was never part of the public API. The function's signature (`Option<&AHashMap, Value>>` → `f64`) references non-public generic types that don't FFI-convert cleanly (specifically, `as_deref()` failed on `Option` which doesn't implement `Deref`).
  Added `calculate_quality_score` to `exclude_functions` in `alef.toml` for all languages (Python, Node, Ruby, PHP, Go, Java, Dart, Kotlin Android, Swift, C#, R, Zig, Elixir, FFI) so the function is not exposed in any generated binding. All library builds now succeed without type-checking failures. (`alef.toml`)
- **swift**: Regenerate `packages/swift/rust/src/lib.rs` and `packages/swift/rust/Cargo.toml` with alef v0.19.5 fixes for `AHashMap` param types. The prior generated code for `calculate_quality_score` attempted `&serde_json::from_str::>(&metadata).expect("...")` on an `Option` parameter, producing two compile errors: (1) `HashMap` does not implement `Deref` so `.as_deref()` was not available, and (2) the deserialized `HashMap` did not match the core function's expected `Option<&AHashMap, Value>>`. Fixed by alef v0.19.5: the Swift shim now emits a pre-call `let __metadata_ahash` binding that deserializes the JSON string into `HashMap`, converts to `AHashMap`, and passes `.as_ref()` to the core. Added unconditional `ahash = "0.8"` to `packages/swift/rust/Cargo.toml` so generated Swift crates reference `ahash::AHashMap` without manual additions. (`packages/swift/rust/src/lib.rs`, `packages/swift/rust/Cargo.toml`)
- **ffi + dart**: Regenerate `crates/kreuzberg-ffi/src/lib.rs` and `packages/dart/rust/src/lib.rs` with alef v0.19.5 fixes for `AHashMap` param types. In the FFI shim, the prior generated code emitted `serde_json::from_str::>` and `.as_deref()` for the `metadata` parameter of `kreuzberg_calculate_quality_score`, producing two compile errors: `HashMap` does not implement `Deref` (so `.as_deref()` fails) and the deserialized `HashMap` did not match the core function's expected `Option<&AHashMap, Value>>`. In the Dart FRB bridge, the prior code attempted `&metadata` where `metadata: Option>`, which also fails to type-check against the core.
  Both are fixed by alef v0.19.5: the FFI emitter now deserializes into `ahash::AHashMap, Value>` and uses `.as_ref()`; the Dart bridge emits a pre-call `let __metadata_ahash` binding that converts `HashMap` to `AHashMap` before borrowing. Added `ahash = "0.8"` to `crates/kreuzberg-ffi/Cargo.toml` and `packages/dart/rust/Cargo.toml` since both generated files reference `ahash::AHashMap` directly. (`crates/kreuzberg-ffi/src/lib.rs`, `crates/kreuzberg-ffi/Cargo.toml`, `packages/dart/rust/src/lib.rs`, `packages/dart/rust/Cargo.toml`)
- **dart**: Regenerated `packages/dart/rust/src/frb_generated.rs` via `flutter_rust_bridge_codegen generate` to pick up type changes from alef v0.19.4 regen. The stale codegen file expected `Option` for bbox fields (`DocumentNode.bbox`, `ElementMetadata.coordinates`, `ExtractedImage.bounding_box`, `GridCell.bbox`, `LayoutRegion.bounding_box`, `PdfAnnotation.bounding_box`, `Table.bounding_box`) and `Vec` for attribute fields, but kreuzberg's recent type updates now expose `BoundingBox` as a concrete struct and attributes as nested vectors. The regenerated file now correctly type-checks against the current Rust core. Also fixed a type mismatch in the `calculate_quality_score` wrapper: the FFI-friendly signature takes `Option>` but the core function expects `Option<&AHashMap>`, so the wrapper now converts the Dart-friendly types to the Rust core format. Added `ahash = "0.8"` dependency to `packages/dart/rust/Cargo.toml` to support the conversion. Fixes macOS arm64 CI build failures. (`packages/dart/rust/src/frb_generated.rs`, `packages/dart/rust/src/lib.rs`, `packages/dart/rust/Cargo.toml`)
- **ci**: `.github/workflows/ci-docs.yaml` `lint-docs` job pinned `alef-ref` to `v0.16.69`, so the docs linter ran against a stale alef while generated bindings were produced by the current version (v0.19.4). The version skew caused the docs validation to fail on the first regen push. Bumped to `v0.19.4` to match `alef.toml`.
(`.github/workflows/ci-docs.yaml`)
- **python**: `_to_rust_chunking_config` no longer passes `sizing=None` to the PyO3 `ChunkingConfig` constructor, which raised `TypeError: argument 'sizing': 'None' is not an instance of 'ChunkSizing'`. The constructor's PyO3 signature provides `ChunkSizing::default()` when `sizing` is omitted, so the Python wrapper now omits the key from kwargs when the user-facing field is `None`, allowing the Rust default to apply. Fixes `test_config_chunking_prepend_heading_context` in the Python e2e suite. (`packages/python/kreuzberg/api.py`)
- **bindings**: Expose `BoundingBox` in all alef-generated bindings (Java, C#, Go, Dart, Swift, Kotlin Android, etc.) by removing `#[cfg_attr(alef, alef(skip))]` from `crates/kreuzberg/src/types/extraction.rs`. The type appears as a field on `Table`, `OcrElement`, `ElementMetadata`, `Annotation`, and `Page`; with the type skipped alef fell back to opaque `String`/`Object` mappings, breaking Jackson/JSON deserialization (Java e2e `testConfigSecurityLimits` failed with `Cannot deserialize value of type 'java.lang.String' from Object value` for `Table.boundingBox`). Exposing the type generates a proper record/struct everywhere and unblocks the full Java e2e suite (88/88). (`crates/kreuzberg/src/types/extraction.rs`)
- **embeddings**: Serialize concurrent first-time model downloads across processes. `hf-hub`'s own download lock is non-blocking and retries for only ~5s, far shorter than a 100MB+ sentence-transformer model download, so racing processes (e.g. parallel e2e workers) failed outright with `ApiError::LockAcquisition`. `download_model_files` now holds a blocking `flock(LOCK_EX)` on a kreuzberg-owned lock file (`/models--/.kbz-download.lock`) for the full download; other processes block until release, then find the model already cached. The lock is released on drop or by the OS on process exit. Non-unix targets fall back to the prior `hf-hub` behavior.
(`crates/kreuzberg/src/embeddings/mod.rs`)
- **core**: The global OCR backend registry now self-heals after `clear_ocr_backends`. `OcrBackendRegistry` is the only plugin registry seeded with built-in backends (Tesseract/PaddleOCR/VLM) at construction, and the image extractor looks them up by name with no fallback, so calling `clear_ocr_backends()` permanently broke OCR for the rest of the process. This caused a test-pollution failure in the cross-language e2e suites: the `ocr_backend_management` category (which runs `clear_ocr_backends`) sorts before `smoke`, so the OCR smoke test then failed with `OCR backend 'tesseract' not registered. Available backends: []`. The built-in registration logic is now factored into `OcrBackendRegistry::register_defaults`, and a new crate-internal `ensure_ocr_backends_initialized` re-seeds the registry when it is empty, mirroring the existing `extractors::ensure_initialized` self-heal for the document extractor registry. The image extractor calls it before every OCR dispatch. (`crates/kreuzberg/src/plugins/registry/ocr.rs`, `crates/kreuzberg/src/plugins/ocr.rs`, `crates/kreuzberg/src/plugins/mod.rs`, `crates/kreuzberg/src/extractors/image.rs`)
- **pdf/chunking**: Fix `firstPage`/`lastPage` null on chunks extracted from multi-page PDFs (#1013). PDF text extraction produces page content with trailing spaces before `"\n\n"` paragraph separators (a PDF rendering artifact). `render_plain` trims each paragraph via `paragraph.trim()`, so `result.content` lacks those trailing spaces while `PageContent.content` retains them, causing every page's exact-match in `recompute_boundaries_from_pages` to fail. The first-line fallback then used `page.content.len()` (with trailing spaces) as `byte_end`, pushing `search_offset` past subsequent pages and causing a cascade of null-metadata chunks.
Fix: normalise page content before searching by splitting on `"\n\n"`, trimming each segment, and rejoining (mirroring what `render_plain` does), so both the search target and the resulting `byte_end` are correct. Also stores `cleaned.into_owned()` in `PageContent.content` (so `is_blank` detection operates on cleaned text) and applies `fix_pdf_control_chars` on OCR-overwrite paths for consistency.
- **kotlin-android**: Exclude trait bridge interfaces from code generation by adding `exclude_languages = ["kotlin_android"]` to all trait bridge definitions in `alef.toml` (`OcrBackend`, `PostProcessor`, `Validator`, `EmbeddingBackend`, `DocumentExtractor`, `Renderer`). The alef-backend-java emits trait bridge classes with Java Panama FFM imports (`java.lang.foreign.*`), which are not available on Android. Kotlin-Android is JNI-only (no trait bridge support yet) and doesn't need these interfaces. This prevents compilation errors in `packages/kotlin-android/src/main/java/` from missing Panama FFM API classes.
- **swift**: Added `render_pdf_page_to_png` Swift call override in `alef.toml` with explicit argument definitions including optional `dpi` and `password` parameters, plus updated PDF fixtures to include null values for these fields. Resolved compilation errors in `e2e/swift_e2e/Tests/KreuzbergE2ETests/PdfTests.swift` where the generated function calls were missing the required named arguments. Also added `swift` to `skip_languages` for the `embed_texts_async` call, since async function naming conflicts prevent the binding from generating `embedTextsAsync()`. Callers should use the async wrapper from `embed_texts` instead. This resolves all 7 Swift compilation errors and reduces test failures to 1 (OCR backend registration runtime issue). (`alef.toml`, `fixtures/pdf/`, `e2e/swift_e2e/`)
- **kotlin-android**: Added `crates/kreuzberg/src/types/internal.rs` to the `alef.toml` sources list so `InternalDocument` is extracted and generated for Kotlin bindings.
The type is used in trait bridge method signatures (e.g., `IRenderer::render(doc: InternalDocument)`) and must be available for the generated interface to compile. Previously the generated Kotlin code referenced `InternalDocument` but the type was never extracted (missing source file), causing "Unresolved reference" errors. The skip attribute was already removed in pass-2 (`bf80c2fce7`); this completes the fix by ensuring the type is extracted. (`alef.toml`)
- **kotlin-android**: Added `embed_texts_async` to `exclude_functions` in `alef.toml`. The function creates a naming conflict with the suspend wrapper of `embed_texts`; both generate `suspend fun embedTextsAsync()` in Kotlin, causing overload ambiguity. Callers should use the suspend function from `embed_texts` instead. This resolves duplicate function declaration errors in the generated `Kreuzberg.kt`. (`alef.toml`)

### Changed

- **alef**: bumped to v0.17.24. Regenerated all bindings and e2e tests. v0.17.24 includes: conditional `#[php(prop)]` for Prop-compatible types only (fixes E0277 errors for non-Prop fields), Kotlin/Android codegen fixes (named struct field default-construction, sealed-class field annotations, JNI single-param `is_optional` propagation), Dart pubspec single-caret version constraint, alef-e2e/zig module_name path fix, valid `build.zig.zon` declarations, scaffold wasm filename underscore conversion, alef-e2e/csharp synthetic chunk assertion inline predicates (`chunks_have_heading_context`, `first_chunk_starts_with_heading`), alef-e2e/python equivalent. Known regression: `e2e/csharp/tests/ContractTests.cs` is no longer emitted by the C# e2e generator; other languages still emit their contract test file (tracked for upstream fix). (`alef.toml`, ~935 regenerated files)
- **ci**: Split pub.dev publishing into a dedicated `publish-pubdev.yaml` workflow. pub.dev OIDC trusted publishing rejects tokens from `release` events; only `push` and `workflow_dispatch` are accepted.
The Dart package embeds platform-specific native binaries (Android JNI, iOS XCFramework, server libs), so the main workflow now assembles them into a `dart-package-assembled` artifact and dispatches `publish-pubdev.yaml` via `workflow_dispatch` with the run ID; that workflow downloads the artifact and publishes. One-time setup required: configure pub.dev → kreuzberg package → Admin → Automated publishing with workflow path `.github/workflows/publish-pubdev.yaml`.

### Fixed

- **demo**: WASM extraction now runs in a per-file Web Worker, keeping the main thread unblocked throughout. Fixes page freeze on repeated use, stale output visible during a new extraction, and the 30s timeout that could never fire when the main-thread event loop was blocked. The `ArrayBuffer` is transferred (not copied) to halve peak memory on large files. Dead `importmap` entries (`tesseract-wasm`, `comlink`) and the top-level `initWasm`/`enableOcr` call were removed; WASM initialisation now happens inside the Worker. All `console.error` calls are replaced by the `[kreuzberg/wasm]`-prefixed `log` helper so DevTools filtering works. (`docs/demo.html`, #992)
- **e2e/csharp**: Added `nested_types` mappings and `options_via = "from_json"` overrides to `alef.toml` C# call configurations. When fixture values are sealed-union or complex config types (EmbeddingModelType, EmbeddingConfig, ChunkingConfig, KeywordConfig, etc.), the codegen now emits `JsonSerializer.Deserialize(json, ConfigOptions)!` instead of raw string literals, eliminating type-mismatch compile errors. Top-level nested_types apply across all calls; per-call overrides refine for specific functions. Reduces e2e/csharp compile errors from 11 to 3 (remaining: `metadata.Format.Trim()` sealed-union bug and `chunks_have_heading_context` synthetic predicate routing; tracked for upstream alef codegen fixes).
- **Taskfile**: `kotlin:e2e` now passes `--lang kotlin_android` to `alef test`, matching the language key declared in `alef.toml`.
Previously the task invoked `alef test --e2e --lang kotlin` which produced `Language 'kotlin' not in config languages list or test configuration`. (`Taskfile.yml`)
- **core**: `EmbeddingModelType::default()` now returns `Preset { name: "balanced" }` instead of `Preset { name: "" }`. Language binding mirror structs (Ruby, PHP, and others) have their own `EmbeddingModelType` with a derived or hand-written `Default` that calls `String::default()` (empty) for the `name` field; when that flows through `From` into `kreuzberg::embed_texts_async`, `get_preset("")` returns `None`, causing "Unknown embedding preset: " errors. All defaults across the codebase converge on "balanced", so the `Default` impl is now consistent with `default_model()` and `EmbeddingConfig::default()`. Added a unit test `test_embedding_model_type_default_is_balanced` to lock this in. (`crates/kreuzberg/src/core/config/processing.rs`)
- **e2e/elixir**: Regenerated with alef v0.17.19, which fixes a keyword-opts threshold bug in the Elixir e2e codegen. When a call had 2+ trailing optional parameters (e.g., `mime_type`, `config`), the codegen now emits all optional args in keyword form (`mime_type: "...", config: "..."`), not mixed positional and keyword (`mime_type: "...", "{}"`). This respects Elixir's syntax requirement that all positional args come before keyword args. Fixes smoke_test and other e2e test compile errors. (`alef.toml`, `e2e/elixir/test/*_test.exs`)
- **e2e/r**: Marked the `config` argument of `embed_texts_async` as `optional = true` in `alef.toml` so generated e2e tests for languages whose fixtures omit a config (R, Python, Node) no longer call the binding without it. Previously the R wrapper signature `function(texts, config)` had no default and the empty/happy fixtures failed with `argument "config" is missing, with no default`. Regenerated all per-language `embed_async_pending` test suites via `alef generate`.
Brings R e2e from 155/158 PASS to 159/160 PASS; the only remaining failure (`test_smoke.R` tesseract OCR backend) is environmental. (`alef.toml`, `e2e/*/...embed_async_pending*`)
- **deps**: Loosened `tar` requirement in `crates/kreuzberg/Cargo.toml` from `^0.4.46` back to `^0.4` so it remains compatible with `tree-sitter-language-pack v1.8.1`, which locks `tar` to `0.4.45`. The pinned floor prevented the e2e/rust crate from resolving a consistent version. (`crates/kreuzberg/Cargo.toml`)

### Fixed (core: code_intelligence Go type mismatch)

- **core**: Changed `ExtractionResult.code_intelligence` field type from `Option` to `Option`. The `ProcessResult` type lives in `tree_sitter_language_pack` (an external crate whose struct layout alef cannot resolve), so the Go backend emitted `*string`, causing `cannot unmarshal object into string` at runtime. Using `serde_json::Value` maps to `json.RawMessage` in Go, preserving full JSON fidelity while being opaque to alef. Updated `extraction/derive.rs` to serialize the `ProcessResult` before assignment; a `tracing::warn!` guards the rare serialization failure. (`crates/kreuzberg/src/types/extraction.rs`, `crates/kreuzberg/src/extraction/derive.rs`, `packages/go/v5/binding.go`)

### Fixed (core: chunk_size=0 panic)

- **core**: `build_chunk_config` in `crates/kreuzberg/src/chunking/builder.rs` now clamps `max_characters == 0` to 1000 (the `default_chunk_size` value) instead of forwarding zero to `text-splitter`, which panics with `chunk size must be non-zero`. Binding mirror structs whose `serde(default)` zeroes the field no longer abort the host process. A `tracing::warn!` line logs the clamp. (`crates/kreuzberg/src/chunking/builder.rs`)

### Fixed (core: null-tolerant serde for ChunkSizing, EmbeddingModelType, FormatMetadata::Image uppercase)

- **core**: `ChunkingConfig.sizing` and `EmbeddingConfig.model` fields now tolerate explicit JSON `null` (in addition to missing fields).
Language binding mirror structs emit `"sizing": null` / `"model": null` from zero-valued structs; the previous `#[serde(default)]` annotation handled a missing key but not an explicit null for internally-tagged enums. Added `deserialize_null_default` and `deserialize_null_model` helpers in `processing.rs` and switched both fields to use them. (`crates/kreuzberg/src/core/config/processing.rs`)
- **core**: `FormatMetadata::Image` `Display` impl now emits the format string in uppercase (e.g., `"PNG"` instead of `"png"`), matching the fixture assertion convention used across all language e2e suites. (`crates/kreuzberg/src/types/metadata.rs`)

### Fixed (e2e/php: hermetic ini)

- **scripts/setup-php-ext-ini.sh + alef.toml**: PHP e2e runs were failing system-wide because a sibling project (tree-sitter-language-pack) had left a stale `/opt/homebrew/etc/php/8.4/conf.d/ext-kreuzberg.ini` pointing at a non-existent `ts_pack_core_php.so`. The local `php -c php.ini` flag only overrides the main `php.ini` path, not the scan-dir, so the stale entry kept being loaded. Hardened the e2e launcher to set `PHP_INI_SCAN_DIR=` (disabling conf.d scanning entirely) and made the generated `e2e/php/php.ini` set `extension_dir` explicitly so the hermetic config still finds the built extension. PHP e2e is now reproducible without depending on the host's conf.d state.

### Fixed (R e2e: 159/160, only env tesseract failure remains)

- **e2e/r**: Regenerated against alef with four R codegen fixes (no-arg-wrapper input leakage, empty `Vec` → `character(0)`, R-side `extra_args` support, FormatMetadata tagged-enum collapse helper + `result_is_bytes`-aware length assertions). Added an R override on `render_pdf_page_to_png` with `extra_args = ["NULL", "NULL"]` to fill in the extendr-required `dpi`/`password` positionals when the fixture omits them.
Combined, kreuzberg R e2e moves from 153/158 to 159/160; the remaining failure (`test_smoke.R:9:3` PNG-with-OCR) is the pre-existing `tesseract not registered` env condition. (`alef.toml`, `e2e/r/`)

### Fixed (Zig e2e: all 88 tests passing)

- **e2e/zig**: Bumped pinned `alef_version` to 0.17.17 and regenerated `e2e/zig/` against the upstream codegen fix that skips the `chunks_have_heading_context` synthetic assertion. Combined with pre-existing zig-specific overrides (`extract_file` and `extract_bytes` redirected to their `_sync` variants with JSON-struct result parsing, `render_pdf_page_to_png` extra `null, null` args for dpi/password, and FormatMetadata-as-display inline accessor on the `metadata.format` assertion), `task zig:e2e` now reports `Build Summary: 43/43 steps succeeded; 88/88 tests passed`. (`alef.toml`, `e2e/zig/`)

### Fixed (Go bindings: re-exported ProcessResult path)

- **core**: Switched `ExtractionResult.code_intelligence` from `Option` to `Option` so alef's type visitor resolves the re-exported name and emits the correct `*ProcessResult` Go binding type. Previously alef fell back to `*string`, producing `cannot unmarshal object into string` errors at runtime.

### Fixed (DrawingType alef-skip, zig binding ffi_path)

- **core**: Annotate `DrawingType` enum in `crates/kreuzberg/src/extraction/docx/drawing.rs` with `#[cfg_attr(alef, alef(skip))]`. The enclosing `Drawing` struct was already skipped but its inner enum was not, so alef R binding gen emitted `impl From for ...` and broke `packages/r/src/rust/src/lib.rs` with five `kreuzberg::DrawingType not found` errors.
- **packages/zig/build.zig**: Default `ffi_path` updated from `../../target/debug` to `../../target/release` so `zig build test` finds the FFI artifact built by the alef test before-hook (`cargo build --release -p kreuzberg-ffi`). Mirrors the alef-scaffold/zig fix.
### Fixed (FormatMetadata Display impl)

- **core**: Added `impl std::fmt::Display for FormatMetadata` so generated rust e2e assertions on `metadata.format` can render the variant as a short string ("pdf", "docx", "image"…). The Image variant emits the inner `format` (e.g., "PNG"), matching the fixture's `field_equals` value. Required after alef-e2e/rust switched to Display formatting in `field_access::rust_unwrap_binding`.

### Fixed (java e2e clear_* methods)

- **java e2e**: `clear_document_extractors()`, `clear_ocr_backends()`, `clear_embedding_backends()`, `clear_post_processors()`, `clear_renderers()`, `clear_validators()` now generate in the `Kreuzberg` facade class. Root cause: `#[cfg_attr(alef, alef(skip))]` annotations on these functions prevented alef from picking them up during Rust API extraction. Removed the annotations from the 6 `clear_*()` functions (only; `register_*/unregister_*` still skipped since they take `Arc` params unsuitable for FFI). Re-generated all language bindings.

### Fixed (kotlin-android e2e parity sweep)

- **kotlin-android**: now honours `#[cfg_attr(alef, alef(skip))]` annotations during DTO and enum emission (alef-backend-kotlin-android filter fix).
- **alef-extract disambiguation**: when two types collide and one carries `binding_excluded=true`, the non-excluded one now keeps the original name (instead of getting a "2" suffix). Eliminates spurious `EmbeddingPreset2`, `NodeContent2`, `LayoutClass2`, `PaddleLanguage2`, `FormatMetadata2`, `ExtractionMethod2` from the IR.
- **`pdf/metadata.rs::PdfMetadata`**: removed incorrect `alef(skip)` annotation; there is only one definition (no duplicate to disambiguate), so the annotation was hiding the only public type from bindings.
- **stale kotlin-android wrappers**: 28+ leftover `*2.kt`, `OcrTesseractConfig.kt`, `PdfMetadata.kt`, `ListType.kt`, `TableTableGrid.kt`, `ArchiveArchive*.kt` files removed after the disambiguation fix.
- **alef-e2e trait-bridge fixture support**: lifted the deferred-to-v0.14.5+ gate; trait-bridge fixtures (register/unregister/clear × 6 traits) now generate per-language test calls.
- **alef-extract embed_texts_async**: confirmed the cfg-gated generic signature extracts cleanly; cleared the 3 stale skip blocks on `embed_async_pending/embed_texts_async_*.json`.
- **e2e generation surface**: regenerated all 16 language e2e suites. All trait-bridge + plugin_api categories now exercise the full public API. Only legitimate WASM/Rust platform-imposed skips remain (7× WASM-async, 1× Rust-missing-file-path).

### Fixed (#985 pdf image extraction hang)

- **PDF image extraction hang on dense pages (#985)**: `extract_image_positions` previously ran a full decompression pass over every page unconditionally, even when `extract_images=false`, causing multi-minute hangs on PDFs with many image objects per page (e.g. InDesign/Acrobat exports). The pre-pass is eliminated; image positions are now derived from the capped extraction result, so the decompression path is skipped entirely when images are not requested. Also adds cancellation-token support inside `extract_images_with_data` so `extraction_timeout_secs` can interrupt multi-page extraction between pages, and inside the inline-OCR image loop so a timeout can interrupt per-image OCR. `max_images_per_page` now bounds decompression for XObject images: the pdf_oxide backend enumerates the page XObject resource dictionary and stops decompressing after `limit` images, avoiding the eager cost of `extract_images()` for images beyond the cap. Inline images (`BI`/`EI` content-stream operators) fall back to the eager path with a `.take(limit)` guard; full inline-image support is tracked in #989.
- **`kreuzberg::utils::pool` API simplified**: `Pool::acquire()` is now infallible: it returns `PoolGuard` directly instead of `Result, PoolError>`. `PoolError` is removed.
`parking_lot::Mutex` cannot poison, so the `Result` wrapper was dead code. `Pool`, `PoolGuard`, `Recyclable`, `StringBufferPool`, and `ByteBufferPool` are marked `#[doc(hidden)]`; they were technically reachable via `kreuzberg::utils::pool` but were never part of the stable public API.

### Added

- **`kreuzberg` crate root**: plugin_api surface (`list_*`, `clear_*`, `register_*`, `unregister_*` for all six plugin types; `detect_mime_type_from_bytes`, `get_extensions_for_mime`) was already present in the internal module tree but not wired up as top-level `pub use` items. The re-exports are now confirmed in place and the 9 e2e fixtures that carried a stale "not re-exported" skip block have had that block removed; those fixtures are now active across all language targets.

### Fixed

- **alef-backend-php**: emit reverse `From for kreuzberg::Core` impls for metadata DTOs (ArchiveMetadata, BibtexMetadata, DocxMetadata, EmailMetadata, EpubMetadata, ExcelMetadata, FictionBookMetadata, HtmlMetadata, ImageMetadata, JatsMetadata, OcrMetadata, PptxMetadata, PstMetadata, TextMetadata, XmlMetadata, CitationMetadata, CsvMetadata, DbfMetadata, TableGrid). Reverse From impls are required for flat enum binding→core conversion which calls `.into()` on optional variant fields. Previously only forward impls were generated.
- **Java `UnsatisfiedLinkError` on `kreuzberg_list_embedding_presets` / `kreuzberg_get_embedding_preset` (#998)**: the alef-generated FFI surface references `kreuzberg::EmbeddingPreset` unconditionally in function return-type positions, so any build of `kreuzberg-ffi` that omitted the `embedding-presets` feature would silently drop these two C symbols, causing the Java static initialiser (which uses `.orElseThrow()`) to crash the entire JVM at class-load time. Added `#[cfg(not(feature = "embedding-presets"))]` stubs in `crates/kreuzberg/src/lib.rs`: a no-op `EmbeddingPreset` struct and empty-return implementations of both functions.
The stubs ensure the symbols are always present and callers degrade gracefully (empty list / `null` handle) instead of crashing.
- **#962**: PDF text extracted one character per line on glyph-spaced documents (programmatic reproducer in `crates/kreuzberg/tests/pdf_glyph_spacing_issue_962.rs`). When a Word-exported PDF places each glyph via its own `BT…ET` block with a sinusoidal y-jitter, pdf_oxide's ColumnAware reading order groups spans by y-level rather than reading order, yielding single-character spans out of sequence. Detection: ≥ 3 same-line x-disorder events among short (≤ 3 char) spans, where "same-line" uses a 5 pt absolute ceiling (`MAX_GLYPH_JITTER_PT`) instead of a font-size fraction, preventing false positives on height-zero span lists from normal documents. Reconstruction: sort by y-descending, group by 5 pt y-proximity, re-sort each group by x-ascending, insert spaces at word gaps (x-gap > `font_size × 0.5`). Heuristic constants live in `crate::pdf::structure::constants` with measurement justification; upstream fix shipped in pdf_oxide v0.3.51 (issue #518, closed 2026-05-19); heuristic removable when kreuzberg upgrades to ≥ 0.3.51. Dutch word `relatie` (from "relatie-id" in the Word broken-image placeholder string) added to `.typos.toml`.

### Changed

- **API surface lockdown via `#[cfg_attr(alef, alef(skip))]`**: 41 internal types that were leaking into the public polyglot binding surface are now hidden from every binding.
Skipped types: REST/MCP wire-envelope DTOs (`ApiDoc`, `InfoResponse`, `ExtractResponse`, `EmbedRequest`/`EmbedResponse`, `ChunkRequest`/`ChunkResponse`, `DetectMimeTypeParams`, `CacheWarmParams`, `EmbedTextParams`, `ExtractStructuredParams`, `ChunkTextParams`, `ManifestEntryResponse`/`ManifestResponse`, `WarmResponse`, `OpenWebDocumentResponse`, `DoclingCompatResponse`, `StructuredExtractionResponse`); HWP/DOCX format parser helpers (`StreamReader`, `CharShape`, `HwpImage`, `Drawing`, `AnchorProperties`, `PageMarginsPoints`, `StyleDefinition`, `ResolvedStyle`); office_metadata internals (`CoreProperties`, `CustomProperties`, `OdtProperties`, `DocxAppProperties`, `XlsxAppProperties`, `PptxAppProperties`); pool/service infrastructure (`TracingLayer`, `SyncExtractor`, `Recyclable`); OCR/chunking intermediates (`ImageOcrResult`, `HtmlExtractionResult`, `ExtractedInlineImage`, `DetectedBoundary`, `ChunkingResult`, `MergedChunk`). These types were never part of the polyglot SDK contract; `extract_file` / `extract_bytes` / `Metadata` / `OcrConfig` remain unchanged.
- **Migrated `alef.toml [crates.exclude]` to source-level annotations**: 100+ type/function/method entries previously listed in `alef.toml`'s global exclude block now carry `#[cfg_attr(alef, alef(skip))]` at their definition sites in `crates/kreuzberg/src/**`. Behaviour-neutral migration: the same items remain excluded, but the exclusion declaration lives next to the defined item and stays in sync across refactors instead of drifting away from the source. `alef.toml [crates.exclude]` now retains only the ~20 entries that cannot be source-annotated (fully-qualified path exclusions where the short name collides with a kept public type, generic structs whose bounds prevent `cfg_attr` propagation, trait-bridge `register_*` / `clear_*` functions that need deduplication against the bridge emitter, and trait impl methods like `From::from` / `Display::fmt` / `Deref::deref` auto-emitted by derive).
- **`kreuzberg::CacheStats` now refers to `cache::core::CacheStats`** (was previously aliased to `paddle_ocr::CacheStats`). The paddle-OCR variant is renamed to `ModelCacheStats` and re-exported as `kreuzberg::ModelCacheStats`. Breaking change for Python/TypeScript/Ruby/PHP/Go/Java/C#/Elixir/Dart/Swift bindings: consumers of the previous `kreuzberg.CacheStats` (paddle model cache variant) must migrate to `kreuzberg.ModelCacheStats`.
- **`kreuzberg::extraction::image::ImageMetadata` renamed to `ExtractedImageMetadata`** to disambiguate from `kreuzberg::types::metadata::ImageMetadata`. Internal use only; no public binding surface impact.

### Removed

- **Orphan `kreuzberg::types::formats::CacheStats`** (unused duplicate of `kreuzberg::cache::core::CacheStats`; superseded by the canonical re-export at the crate root).

### Added

- **`Serialize`/`Deserialize` derives on DOCX and HWP parser types**: `Table`, `TableRow`, `TableCell`, `Paragraph`, and `Run` in `extraction::docx::parser`, and `Section`, `Paragraph`, and `ParaText` in `extraction::hwp::model`. These types remain internal (still annotated `#[cfg_attr(alef, alef(skip))]`, out of binding surface) but can now flow through Rust-side caching, snapshot tests, and any internal JSON-based pipelines. No change to public binding API.
- **Plugin registry functions now exposed in every binding**: `register_ocr_backend`, `register_post_processor`, `register_validator`, `register_embedding_backend`, `register_renderer`, `register_document_extractor`, their `unregister_*` siblings, and `clear_*` group counterparts. Bindings previously hid these because the alef codegen emitted duplicate definitions; alef ≥ v0.16.65 auto-deduplicates trait-bridge registrations, so the kreuzberg `alef.toml` global function exclusions are dropped.
- **`kreuzberg-ffi` `register_ocr_backend` / `unregister_ocr_backend` are now callable from C, Go, Java, C#**.
The previous `*const c_void` Send issue was resolved by the alef-backend-ffi Jinja migration; the binding now compiles with `unsafe impl Send + Sync` on the bridge struct.
- **WASM bindings now expose `HwpxExtractor`, `process_images_with_ocr`, and the canonical `CacheStats`**. alef ≥ v0.16.65 auto-excludes feature-gated types, so the explicit WASM exclusions for these are redundant.

### Added

- **`task demo:dev:setup` for WASM demo prerequisites (#1007)**: new task installs `wasm-pack` 0.13.1 (matches CI pin), `wasm-bindgen-cli` at the version pinned in `Cargo.lock`, and WASI SDK 25, all idempotent with exact version checks and SHA256 verification on the WASI SDK download. Run once before `task demo:dev`.

### Fixed

- **`task demo:dev` broken after native WASM OCR migration (#1006)**: commit `198c9e99e` removed the JS OCR worker bridge files without updating `demo.html`, leaving the asset server with nothing to serve for the CDN imports. Restores `dist/index.js`, `dist/extraction/files.js`, and `dist/ocr/enabler.js` for the new architecture where `TesseractWasmBackend` auto-registers at WASM init time and `enableOcr()` is a no-op.
- **Isolated numbered headings in untagged PDFs (#961)**: Lines matching `\d{1,2}\. [A-Z]` that appear alone in their paragraph block are now promoted to `Heading` elements instead of being misclassified as `ListItem`. Affects the `ElementBased` transform fallback path (e.g. ReportLab-generated PDFs). Real numbered lists with sibling items are unaffected. Lone lowercase-initial lines (`1. go there alone`) remain `ListItem` and are not affected.
- **Bordered 2-column PDF tables now detected (#964)**: Added `extract_tables_bordered` as a second detection tier between the strict native pass and the text-heuristic fallback. Tables whose cells are drawn with stroke lines or rectangles but have only 2 columns are now recovered.
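
The isolated-numbered-heading rule from #961 can be illustrated with a minimal, std-only sketch. It is not kreuzberg's implementation: the `Kind` enum and the `starts_like_numbered_heading`, `starts_like_numbered_item`, and `classify_block` helpers are hypothetical names, and the pattern `\d{1,2}\. [A-Z]` is assumed to be anchored at the start of the line.

```rust
/// Element kinds used by this sketch (hypothetical, not kreuzberg's real enum).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Kind {
    Heading,
    ListItem,
    Paragraph,
}

/// Counts the leading ASCII digits of `line`.
fn leading_digits(line: &str) -> usize {
    line.bytes().take_while(|b| b.is_ascii_digit()).count()
}

/// True when `line` starts with 1-2 digits, then ". ", then an uppercase
/// ASCII letter: a std-only stand-in for the assumed `^\d{1,2}\. [A-Z]`.
fn starts_like_numbered_heading(line: &str) -> bool {
    let digits = leading_digits(line);
    if digits == 0 || digits > 2 {
        return false;
    }
    let rest = &line.as_bytes()[digits..];
    rest.len() >= 3 && rest[0] == b'.' && rest[1] == b' ' && rest[2].is_ascii_uppercase()
}

/// True for any "N. text" line, the generic numbered-list shape.
fn starts_like_numbered_item(line: &str) -> bool {
    let digits = leading_digits(line);
    let rest = &line.as_bytes()[digits..];
    digits > 0 && rest.len() >= 2 && rest[0] == b'.' && rest[1] == b' '
}

/// Classifies every line of one paragraph block.
fn classify_block(lines: &[&str]) -> Vec<Kind> {
    // Only a numbered, uppercase-initial line that is alone in its block
    // is promoted to a heading.
    if lines.len() == 1 && starts_like_numbered_heading(lines[0]) {
        return vec![Kind::Heading];
    }
    // Everything else keeps the ordinary list/paragraph classification.
    lines
        .iter()
        .map(|line| {
            if starts_like_numbered_item(line) {
                Kind::ListItem
            } else {
                Kind::Paragraph
            }
        })
        .collect()
}

fn main() {
    println!("{:?}", classify_block(&["1. Introduction"]));
    println!("{:?}", classify_block(&["1. go there alone"]));
    println!("{:?}", classify_block(&["1. First step", "2. Second step"]));
}
```

In this sketch the block size is the deciding signal: the same `1. Introduction` line stays a `ListItem` as soon as it has a sibling item, which matches the changelog's statement that real numbered lists are unaffected.
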
## [5.0.0-rc.1] - 2026-05-16

### Changed

- **Alef binding generator v0.16.14**: payload-derived sealed variant parameter names in Kotlin Android (`Pdf(PdfMetadata)` → `val metadata` instead of `val field0`), C# record properties marked `required` on non-nullable refs without defaults, Java records for fields-only DTOs without builder pattern.

### Fixed

- **VLM / force_ocr path omitted image placeholders from Markdown output (#987)**: when `force_ocr=true` (e.g., a VLM OCR backend) with `extract_images=true` and `inject_placeholders=true`, the rendered Markdown contained no `![](image_N.ext)` references. Root cause: the structured `pre_rendered_doc` path (which injects `ElementKind::Image` elements via `assemble_internal_document`) is skipped when `force_ocr=true`; the OCR-path document built from raw OCR text carried no image elements. Fixed by injecting image elements (with correct page attribution from `ExtractedImage.page_number`) into the OCR-path document when images are extracted and `inject_placeholders` is enabled.
- **MCID-tagged PDF content dropped in markdown/html output**: two independent failure chains in the structured-PDF path caused content loss. (1) `mark_cross_page_repeating_text` and `mark_cross_page_repeating_short_text` had no first-occurrence exemption: a title block whose text also appeared as a running header on later pages was furniture-marked on every page including the first, silently dropping it from output. Fixed by preserving the first occurrence of any repeating span, matching the exemption already present in pdf_oxide's own `mark_running_artifact_spans`. (2) The OCR skip gate incorrectly suppressed OCR when `fallback=true` (a per-page quality check detected a scanned page): the gate now runs OCR whenever `fallback=true`, regardless of aggregate character count. OCR is still skipped when the pre-rendered doc is substantive and no per-page fallback is needed, or when the content is non-textual.
- **Djot renderer emitted dangling `![]()` for orphaned image elements (#914)**: when an `ElementKind::Image` referenced an index outside `doc.images`, the Djot renderer fell through to its normal output path and produced empty image markup. Aligned with the existing `comrak_bridge` behavior: skip the element only when both the image lookup is `None` and the description text is empty. When alt text is present, it is still emitted so user-visible content is preserved. - **Email encoding data corruption (#910)**: replaced brittle 4-byte heuristic for UTF-16 detection in `EmailExtractor` with a robust statistical approach using `chardetng`. Tiny 4-byte ASCII files (e.g., binary sequences with nulls) are no longer incorrectly transcoded as UTF-16LE, preventing silent data corruption. The fix maintains support for legitimate non-BOM UTF-16 messages by requiring a 16-byte minimum sample and verifying the encoding guess against `chardetng` when alternating null patterns are detected. ### Removed - **Gleam binding** (`kreuzberg_gleam` Hex package): dropped entirely. Gleam targets the BEAM and Gleam consumers can keep using the Elixir binding via Erlang interop, so the dedicated package added negligible audience reach at a real maintenance cost (regen on every alef bump, dedicated CI workflow, Hex publish job, e2e fixtures, 92 doc snippets). Existing published versions of `kreuzberg_gleam` on Hex remain available for anyone still pinning them — no further releases will be made. - **`PageContent.images: Vec>` (#777)**: removed and replaced by `PageContent.image_indices: Vec`. All image data lives once in `ExtractionResult.images`; pages and chunks now carry indices into that collection. See `docs/migration/v5.0-image-indices.md` for migration guidance. ### Security - **[GHSA-gg9g-p963-p7x4]**: `HwpxExtractor` now validates the ZIP container against `SecurityLimits` before passing bytes to `unhwp::parse_bytes`. 
Previously the `ExtractionConfig` was silently discarded (`_config`), allowing a crafted HWPX file with a >100:1 DEFLATE ratio to exhaust process memory (CWE-409). The fix adds an upfront byte-count check (`max_archive_size`) and a `ZipBombValidator` pass over the central directory before any decompression occurs. Affects all builds with the `hwpx`, `formats`, or `full` feature enabled since `5.0.0-rc.1`. ### Added - **Heuristic PDF table extraction on the default path (#897)**: the default text-layer PDF extraction now falls back from pdf_oxide's native grid detector to a heuristic reconstruction when the native detector returns empty. The fallback clusters words into vertically-contiguous regions by abnormally-large row gaps, runs the existing `reconstruct_table → post_process_table → is_well_formed_table` chain (the same one used by the OCR pipeline and the layout-detection path), and emits the surviving grids as `Table` entries with bounding boxes. This recovers tables on text-layer PDFs (invoices, financial statements, scientific tables) that lack the explicit ruling lines pdf_oxide's grid detector requires — without needing the 12 GB ONNX layout-detection models. New config: `PdfConfig.extract_tables: bool` (default `true`) and CLI flag `--pdf-extract-tables`. Set to `false` to skip table extraction entirely. - **Inline-image OCR for PDFs**: Added `ocr_inline_images` field to `PdfConfig` and `--pdf-ocr-inline-images` CLI flag. When enabled, this performs OCR on inline images found within PDF pages and injects the results into the extracted document. - **InternalDocument is serde-bridgeable**: `InternalDocument` and its four previously-non-serde sub-types (`InternalElement`, `ElementKind`, `Relationship`, `RelationshipTarget`) gained `Serialize` + `Deserialize` derives. 
Combined with the `Cow<'static, str>` → `String` migration on `source_format` and `mime_type`, foreign-language plugin authors can now construct `InternalDocument` values that round-trip through JSON at the FFI boundary — unblocking `DocumentExtractor` and `Renderer` trait bridges in alef 0.15.25+. - **DocumentExtractor + Renderer cross-language plugins**: both trait_bridges are now active in `alef.toml`. All 16 language bindings expose `register_document_extractor` / `unregister_document_extractor` / `clear_document_extractors` / `list_document_extractors` and the matching Renderer lifecycle. Foreign-language plugin authors can now implement arbitrary document extractors and renderers in their host language. - **#619 follow-up**: `POST /extract-async` now returns HTTP 429 when more than 100 jobs are active simultaneously, preventing unbounded memory growth under load. - **`backend_options` passthrough on `OcrConfig` and `OcrPipelineStage`**: both types now carry an `Option` field `backend_options`. Custom and built-in backends that support runtime tuning (mode switching, preprocessing flags, inference parameters, etc.) can read this value via `config.backend_options` inside `OcrBackend::process_image` and deserialize only the keys they care about — unknown keys are silently ignored, so options from different backends coexist in the same config without conflict. When `OcrConfig` auto-constructs a two-stage pipeline (tesseract + paddleocr), `backend_options` is propagated to the primary (tesseract) stage only; the paddleocr fallback stage continues to use `paddle_ocr_config`. - **WASM OCR backend**: `TesseractWasmBackend` registered for the `ocr-wasm` feature set, exposing OCR on the WASM target via `tesseract-wasm` while the native path continues to use leptess. - **Renderer plugin**: `Renderer` now extends `Plugin`, picking up the shared `name/version/initialize/shutdown` lifecycle (with no-op defaults on `Plugin` so stateless renderers stay boilerplate-free). 
Public helpers `register_renderer`, `unregister_renderer`, `list_renderers`, `clear_renderers` match the convention of the other plugin registries and now all return `Result` for symmetric cross-language codegen. ### Changed - **BREAKING: `PageContent.images` replaced by `PageContent.image_indices` (#777)**: `PageContent.images: Vec>` is removed. Pages now carry `image_indices: Vec` — zero-based indices into the top-level `ExtractionResult.images` collection. `ChunkMetadata` gains the same `image_indices: Vec` field (populated post-chunking by matching each image's `page_number` against the chunk's `[first_page, last_page]` range). **Migration:** replace `page.images[i].data` with `result.images.as_ref().unwrap()[page.image_indices[i] as usize].data`. All image data still lives in `ExtractionResult.images`; pages and chunks reference it by index rather than carrying full copies. - **BREAKING (wire format)**: `LayoutClass` now serializes as snake_case in JSON output (e.g. `"list_item"` instead of `"ListItem"`). `LayoutDetection.class_name` returned by HTTP/MCP/Python APIs flips PascalCase → snake_case, matching the documentation that already used snake_case. The internal `LayoutClass::name()` accessor is renamed to `as_str()` and remains available for callers that need a `&'static str`. - `TableModel` serialization is now symmetric snake_case: previously `serde_json::to_string(&TableModel::SlanetWired)` returned `"SlanetWired"` (PascalCase) but `from_str` only accepted `"slanet_wired"` (snake_case), so JSON config round-trips were impossible. Both directions now use snake_case via `#[serde(rename_all = "snake_case")]`. The hand-rolled `Deserialize` impl is removed. - **Plugin lifecycle defaults**: `Plugin::version()`, `initialize()`, and `shutdown()` now have no-op default implementations (`name()` stays abstract). Existing `impl Plugin` blocks that returned `Ok(())` / `"1.0.0"` continue to work; new stateless plugins can omit the boilerplate. 
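The `image_indices` migration above can be sketched without the `unwrap()` from the one-line migration hint, which panics when `images` is `None`. The structs below are simplified stand-ins, not the real kreuzberg types: only the fields needed for the lookup are modelled, and the `u32` index type is an assumption inferred from the `as usize` cast in the migration note. `page_images` is a hypothetical helper.

```rust
/// Stand-in for the crate's image type (assumption: only `data` is modelled).
struct ExtractedImage {
    data: Vec<u8>,
}

/// Stand-in for a page; carries indices instead of image copies.
struct PageContent {
    /// Zero-based indices into `ExtractionResult::images`.
    /// `u32` is an assumption; the changelog only shows an `as usize` cast.
    image_indices: Vec<u32>,
}

/// Stand-in for the top-level result that owns all image data once.
struct ExtractionResult {
    images: Option<Vec<ExtractedImage>>,
}

/// Resolve a page's images by index.
/// Returns an empty list when `images` is `None` and silently skips
/// indices that fall outside the collection instead of panicking.
fn page_images<'a>(result: &'a ExtractionResult, page: &PageContent) -> Vec<&'a ExtractedImage> {
    let images = result.images.as_deref().unwrap_or(&[]);
    page.image_indices
        .iter()
        .filter_map(|&i| images.get(i as usize))
        .collect()
}
```

Callers that previously read `page.images[i].data` would, in this sketch, read `page_images(&result, &page)[i].data`. Whether out-of-range indices should be skipped or reported is a design choice left to the caller.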
### Fixed - **Performance (#908)**: Removed a redundant `config.normalized()` call in the internal `extract_file_uncached` helper that caused unnecessary heap allocations when bypassing the extraction cache. - **VLM OCR per-page propagation (#928)**: VLM OCR results now populate individual `PageContent` structures in PDF extractions, not just `ExtractionResult.content`. Per-page text segments are mapped to document pages, and stale native text is cleared on secondary pages when a VLM returns a single whole-document string. - **LLM base URL normalization**: fixed a regression where a custom `KREUZBERG_LLM_BASE_URL` with a trailing slash caused authentication (401) or 404 errors due to double-slash path construction (e.g. `/v1//chat/completions`). The LLM client now automatically trims trailing slashes from the base URL. - **PDF layout-classify regression on text-heavy structure-tree PDFs**: three coupled fixes to the layout-for-markdown pipeline that recover quality lost when RT-DETR hints were applied too aggressively. - `apply_spatial_overrides` now content-gates promotion-class hints (Title / SectionHeader / Caption / Footnote / ListItem). A hint is rejected when paragraph text doesn't match the hint type — short text (≤200 chars) for heading/caption/footnote classes; list-marker prefix for ListItem. Prevents promoting long body paragraphs whose bbox happens to overlap a heading hint. - `apply_proportional_overrides` is now bypassed for structure-tree pages (paragraphs without positional data). The fractional-position matching misaligned when paragraphs reordered between font-clustering and RT-DETR. The structure tree's font-size classification is the more reliable signal here; hints are skipped silently. - Heading-promotion thresholds tightened: `MAX_BOLD_HEADING_WORD_COUNT` 15 → 12, body→heading font-size delta `+0.5pt → +1.0pt` in three call sites of `classify.rs`. 
- **PDF Markdown / Djot / Plain quality gates now run by default**: `test_pdf_quality_gate`, `test_pdf_djot_quality_gate`, and `test_pdf_plain_quality_gate` in `crates/kreuzberg/tests/pdf_markdown_regression.rs` no longer carry `#[ignore = "TODO: pdf_oxide upstream"]`. The upstream issue (#484) hasn't been demonstrated to currently trigger these failures, and the calibrated `PDFIUM_GROUND_TRUTH` thresholds are the binding contract for layout-pipeline changes. - **#911**: `extraction_timeout_secs` now explicitly returns a `KreuzbergError::Validation` error when configured in non-tokio or WASM builds. Previously, timeouts in these environments were silently ignored, leading to unexpected hangs. - **swift e2e command**: `[crates.test.swift].e2e` now runs from `packages/swift` (was `e2e/swift`). The generated XCTest cases live inside `packages/swift/Tests/Tests/` because SwiftPM 6.0 forbids inter-package `.package(path:)` references in a monorepo, so `e2e/swift/Package.swift` is a documentation-only stub with no buildable target. The previous setting failed with `error: The package does not contain a buildable target.` on every `task swift:e2e` invocation. - **Java FFI compile error**: `readJsonList` now wraps the null-check and `checkLastError()` call inside try-catch, resolving an unreported `Throwable` exception that blocked Java e2e test compilation. - **alef.toml**: `TesseractWasmBackend` added to `[crates.exclude].types` so non-WASM bindings (kreuzberg-py, kreuzberg-nif, etc.) no longer reference the WASM-only OCR backend (which is `#[cfg(feature = "ocr-wasm")]`-gated) and break the build under default features. - **dart e2e**: `extract_file` overridden to `extract_bytes` in `alef.toml` (dart cannot pass file paths through the FRB bridge); e2e generator regenerated to forward actual `[BatchBytesItem]` / `Uint8List` arguments rather than empty parameter lists. 
- **e2e/gleam**: regenerated against alef gleam codegen with `contains_any` OR logic + `gleam/list` import; full FFI shim (`packages/gleam/src/kreuzberg_gleam_ffi.erl`) wraps the `Elixir.Kreuzberg.Native` `*_sync` NIFs and converts the Erlang map results into Gleam-typed tagged tuples (`extraction_result`, `metadata`, `document_structure`, `format_metadata`, `excel_metadata`). - **e2e/zig**: regenerated for Zig 0.16 API (allocator + IO surface) and `FormatMetadata` internally-tagged enum path lookups now skip the variant-name segment. - **Gleam dependency manifest**: restored canonical hex version ranges in `packages/gleam/gleam.toml` (`gleam_stdlib = ">= 0.34.0 and < 2.0.0"`, `gleeunit = ">= 1.0.0 and < 2.0.0"`). An earlier `alef sync-versions` had routed `gleam.toml` through the catch-all SEMVER replace path and overwrote both ranges with `">= 5.0.0-rc.1 and < 5.0.0-rc.1"` (an empty range gleam refuses to resolve), wedging `gleam test`. The `restore_gleam_dep_ranges` helper in alef now keeps these stable on future syncs. - **#853**: HWP structured extraction now returns an error instead of silently returning an empty document when no BodyText sections are found. Fixes a regression introduced in the structured extraction refactor. - **#619 follow-up**: `POST /extract-async` handler no longer panics on mutex poison — returns HTTP 500 and marks the job as Failed instead. - Fixed dead conditional-import warning on `KreuzbergError` in `plugins/registry/ocr.rs` under non-OCR feature sets. - **Zig e2e tests**: Added `default_extraction_config` constant and `extract_file_sync_default`, `extract_bytes_sync_default` overloads for e2e test generation. Configured alef codegen to emit these default variants. 
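The LLM base URL normalization listed in the Fixed entries above comes down to trimming trailing slashes before joining the request path. The sketch below is illustrative only: `endpoint_url` is a hypothetical helper, the host name is made up, and it assumes the request path always starts with a single `/`.

```rust
/// Join a base URL and a request path without producing `//` at the seam.
/// Hypothetical helper; assumes `path` begins with exactly one '/'.
fn endpoint_url(base_url: &str, path: &str) -> String {
    // Trim every trailing slash so "…/v1/" and "…/v1" behave the same.
    format!("{}{}", base_url.trim_end_matches('/'), path)
}
```

With this helper, both `https://llm.example.com/v1/` and `https://llm.example.com/v1` combine with `/chat/completions` into the same URL, avoiding the `/v1//chat/completions` form described in the entry.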
## [5.0.0-rc.1] - 2026-05-05 ### Breaking Changes - **`Metadata.format` is now a nested object, not flattened**: The `format` field (and its `format_type` discriminator) moved from the root of the `Metadata` JSON object into a dedicated `format` key: `{"format": {"format_type": "pdf", ...}}` instead of `{"format_type": "pdf", ...}` at root. The `additional` postprocessor map is likewise nested under an `"additional"` key. The top-level `sheet_count` / `sheet_names` mirror fields are gone; access them via `metadata.format.excel.sheet_count` / `.sheet_names`. Affects REST, MCP, CLI (`--output-format json`), and every binding. - **Go module path changed from `v4` to `v5`**: Import path is now `github.com/kreuzberg-dev/kreuzberg/v5`. Update your `go.mod` and all import statements. - **PHP binding parameter names are now lowerCamelCase**: Function parameters such as `$mime_type` are now `$mimeType`, `$page_index` → `$pageIndex`, etc., matching PHP naming conventions. - **Python `_to_rust_extraction_config` dict-coercion refactored**: The `isinstance(value, dict)` branch now delegates to `_coerce_dict_extraction_config()`. No public API change; internal helper is not part of the public surface. ### Changed - Version bump from `4.x` to `5.0.0-rc.1` reflecting accumulated breaking changes across bindings since the `4.10.0` series. - All binding manifests (Node, Ruby, PHP, Java, C#, Go, Python, Elixir, Gleam, R, Dart, Swift, Zig) updated to `5.0.0-rc.1`. - **Inlined `text-splitter` into `crates/kreuzberg/src/chunking/text_splitter/`.** The upstream crate pins `tokenizers = "0.22"`, which conflicted with kreuzberg's direct `tokenizers 0.23` dependency and produced a duplicate copy of `tokenizers` in the build graph plus a `Tokenizer: ChunkSizer` trait-bound failure in `chunking::core`. The inlined fork drops the unused `code` (tree-sitter) and `tiktoken-rs` features and rebuilds against `tokenizers 0.23`.
Kreuzberg's own tree-sitter–based code splitter is unaffected. See `ATTRIBUTIONS.md` for full provenance and license terms. ### Added - **#619**: Async extraction API — `POST /extract-async` accepts the same multipart form as `POST /extract` and immediately returns `AsyncJobResponse` (`job_id`) with HTTP 202. Background task runs the extraction pipeline (with a configurable timeout) and stores the result in an in-memory `JobStore` (5-minute TTL, evicted on restart). `GET /jobs/{job_id}` returns `JobStatus` (`job_id`, `state`, `created_at`, `updated_at`, `result`, `error`) allowing clients to poll for `Pending` → `Running` → `Completed` / `Failed` transitions. Both endpoints are gated behind the `api` feature flag. - **HWPX Extraction Support (#875)**: Integrated the `unhwp` crate to natively extract text, structure, and comprehensive document metadata from modern Hangul Word Processor XML (`application/haansofthwpx`) documents. Features dedicated MIME routing distinct from legacy HWP5. - **#761**: `ExtractionResult.extraction_method` — new field exposing how text was extracted (`native`, `ocr`, `mixed`). Populated by PDF (native vs OCR vs `force_ocr_pages` mixed) and image (always `ocr`) extractors. Surfaced across every binding (Python, Node, PHP, Ruby, Java, C#, Go, R, Dart, Swift, Elixir, Gleam, Zig, WASM, C FFI). - **#788 follow-up**: Image classification + tile clustering on `ExtractedImage`. New optional fields `image_kind` (a public `ImageKind` enum: `Photograph`, `Diagram`, `Chart`, `Drawing`, `TextBlock`, `Decoration`, `Logo`, `Icon`, `TileFragment`, `Mask`, `Unknown`), `kind_confidence` (`f32` ∈ [0.0, 1.0]), and `cluster_id` (`u32`). The classifier is offline, deterministic, and uses already-captured signals: dimensions, aspect ratio, colorspace, bits-per-component, format, plus Shannon entropy of a 64×64 thumbnail. 
The clusterer groups same-page images with similar dimensions whose bounding boxes sit within half a tile-side of each other and reclassifies the members to `TileFragment`, so a technical drawing composed of N raster fragments surfaces as one `cluster_id`. Wired through every extractor that produces `ExtractedImage`: PDF (lopdf and pdfium fallback), DOCX, PPTX, HTML, ODT, EPUB, FictionBook, Jupyter, RTF, and the standalone image extractor. Toggle via `ImageExtractionConfig.classify` (default `true`). - **#784**: `PageInfo.has_vector_graphics: bool` — page-level flag indicating that a PDF page contains non-trivial vector drawing content (paths, shapes, curves) that is otherwise invisible to `ExtractionResult.images` because it isn't embedded as a raster XObject. Populated by counting `PdfPageObject::Path` instances on the pdfium-rendered page; the flag flips when more than 8 paths are present. Lets downstream consumers (RAG / VLM pipelines) decide per-page whether to rasterise the page themselves to capture vector charts and diagrams produced by Adobe InDesign, LaTeX TikZ, etc. Bounding-box detection of vector regions and auto-rasterisation are deferred to a later minor. ### Fixed - **#799**: Extract images nested inside PDF Form XObjects across XObject references — recursive Form XObject descent (depth-limited to 8) now follows indirect references through the resource chain, with cycle detection via a visited-set so self-referential XObject DAGs no longer hang. Both the lopdf and pdfium image-decoding paths benefit. - **#824**: Robust PDF image extraction across XObject references — fixes silent image drops when documents reference XObjects through indirect chains. Combined with the depth limit and cycle guard from #799 to harden the recursive walker against malformed structures. - **#826**: WASM loading on Next.js / Turbopack — `@kreuzberg/wasm` now bundles cleanly under webpack 5, Turbopack, and Next.js's app router. 
Dynamic imports of Node built-ins and the pdfium-js subsystem carry `/* webpackIgnore: true */` markers so bundlers stop trying to inline platform-specific binaries. - **#834**: DOCX `inject_placeholders` flag honored end-to-end — image placeholders now appear in markdown/plain/djot output when `ImageExtractionConfig.inject_placeholders = true`, and the DOCX OCR pipeline runs before rendering so OCR text reaches the final document. Adds extractor-level security regression tests for LaTeX, EPUB, ODT, Jupyter, RST, and RTF inputs (deeply nested envs, unclosed math, oversized control words, entity bombs, depth bombs, large item lists). - **#836**: Prevent base64 image data leaking into structured PDF output when image extraction is disabled. The structure pipeline now suppresses `populate_images_from_pdfium` and `inject_placeholders` whenever neither `ImageExtractionConfig.extract_images` nor `pdf_options.extract_images` is enabled, so disabled-by-default users no longer see embedded raster blobs in their results. - **#838**: OCR `elements` are now propagated through the extraction pipeline — image and PDF OCR backends populate `ExtractionResult.ocr_elements` consistently, fixing downstream consumers that relied on per-token bounding boxes. - **#839**: `extraction_timeout_secs` now applies to the single-file `extract_file` / `extract_bytes` paths. Previously the timeout only fired in batch and async wrappers, so a hostile single document could hang past the configured limit. The timeout is gated on `tokio-runtime`; non-tokio builds remain timeout-less by design. - **#797**: Chunking presets no longer auto-inject an `EmbeddingConfig`. Setting `ChunkingConfig.preset` (e.g. 
`"multilingual"`) without an explicit `embedding` field previously caused `resolve_preset()` to silently inject `Some(EmbeddingConfig { model: Preset { name }, ..default() })`, which made the extraction pipeline run `generate_embeddings_for_chunks()` and populate every `chunk.embedding` with vectors the caller never asked for. Presets now configure chunking parameters only; opt into embedding generation by providing an `EmbeddingConfig` explicitly. - **#782**: `result_format = "element_based"` now classifies headings and image placeholders correctly. `process_hierarchy` maps `h1` → `ElementType::Title` and `h2..h6` → `ElementType::Heading`, with the numeric level stored as `metadata.additional["heading_level"]`. `process_content` detects single-line markdown ATX headings (`# Title`, `## Section`, …) and emits them as `Title`/`Heading` instead of `NarrativeText`. `[Image: …]` placeholder lines are now emitted as `ElementType::Image` carrying the description in `metadata.additional["image_description"]`. `process_images` writes `metadata.additional["image_index"]` so consumers can join elements back to the `ExtractionResult.images` array by index. - **#844**: Python wrapper `extract_*` functions no longer crash on every call. Picked up upstream alef 0.11.24+ codegen fixes (kreuzberg-dev/alef#44): `#[serde(skip)]` is propagated to wrapper structs (no more `unknown field 'cancel_token'`), `api.py` wrappers forward arguments by keyword (no more `extract_file` `mime_type`/`config` arg-reorder `TypeError`), async pyo3 functions emit `async def` + `await`, and trait-bridge `register_*` helpers are re-exported through `api.py` / `__init__.py` `__all__`. - **`force_ocr_pages` now reliably yields `ExtractionMethod::Mixed`** even when `pages` config is not explicitly set. The PDF text path now synthesizes a default `PageConfig` when `force_ocr_pages` is non-empty so byte boundaries are always available for splicing OCR text into the right ranges.
- **PDF page extraction strategy enum renamed `ExtractionMethod` → `PageExtractionMethod`** in `kreuzberg-pdfium-render` to disambiguate from `kreuzberg::ExtractionMethod` (the new native/ocr/mixed strategy enum from #761). The pdfium variant remains exported via `pdfium_render::prelude` under the new name. ### Changed - **#787 (log hygiene)**: Default tracing subscriber now layers `ureq=warn`, `ureq_proto=warn`, `rustls=warn`, `hyper_util=warn`, `hf_hub=info`, `tower_http=info` on top of any user-supplied `RUST_LOG`, and the API router's `TraceLayer` demotes per-request/response events to DEBUG (failures stay at WARN). Default `info` log level no longer produces a wall of HTTP/TLS/transport DEBUG lines around HuggingFace model downloads. Per-target `RUST_LOG` rules continue to win, so `RUST_LOG=ureq=debug` still surfaces full transport detail when needed. The HuggingFace fetch chatter itself was never a re-fetch bug — models are cached on disk under `HF_HOME` and in-process via `LazyLock` engine caches; the noise was purely third-party transport DEBUG bleeding through. - **Dependabot bumps**: `swift-actions/setup-swift` 2 → 3 (#841), `gradle/actions` 4 → 6 (#840), `docker/setup-qemu-action` 3 → 4 (#833), `mlugg/setup-zig` 1 → 2 (#832). ## [4.10.0-rc.4] - 2026-04-28 Cycle 4 of the alef-backed publish-pipeline iteration. Cycle 3 surfaced fourteen build-stage publish failures across Elixir, Ruby, Python, PHP, Node, and C#; this RC bundles the targeted fixes for each. ### Fixed - **Elixir native libs build now uses the actual NIF directory name (`kreuzberg_nif`).** The publish workflow had references to `packages/elixir/native/kreuzberg_rustler` that never existed in this repo (the crate is named `kreuzberg_nif`). All five Elixir matrix targets failed in cycle 3 with `start process … working directory … invalid`. Replaced 16 `kreuzberg_rustler` references in `.github/workflows/publish.yaml`. 
- **Ruby gem build now declares its `async-trait` dependency.** `packages/ruby/ext/kreuzberg_rb/src/lib.rs` (alef-generated) imports `async_trait::async_trait` for trait bridges, but the matching `Cargo.toml` was missing the dep. Both Ruby gem matrix targets failed in cycle 3 with `unresolved import async_trait`. - **Ruby `batch_reduce_tokens`, `chunk_text`, `chunk_texts_batch`, `chunk_semantic` are now excluded from the alef-generated Magnus binding.** alef's Magnus codegen referenced undeclared local bindings (`texts_refs`, `page_boundaries_core`) for these functions; the codegen fix is tracked upstream. - **`task php:build` now exists** (`cargo build --release -p kreuzberg-php`). The publish workflow invoked this task name; without it both PHP PIE matrix targets failed in cycle 3 with `task: Task "php:build" does not exist`. - **Python wheel build on `linux-aarch64` symlinks `aarch64-linux-gnu-gcc` to `gcc`** when running natively in the manylinux container. The repo's `.cargo/config.toml` pins `aarch64-unknown-linux-gnu`'s linker to the cross-compiler binary name, but that binary doesn't exist in the `manylinux_2_28` container on a native ARM runner. Added a symlink in the cibuildwheel `before-script-linux` step. - **C# native assets `linux-musl-arm64` build no longer trips on `packages/dart/rust`.** `docker/Dockerfile.musl-ffi`'s sed pattern was missing the dart/swift workspace-member exclusions (added in earlier cycles to other Dockerfiles); cargo failed with `failed to load manifest for workspace member /build/packages/dart/rust`. - **Node bindings build uses a path-based pnpm filter** (`pnpm --filter ./crates/kreuzberg-node`) instead of the never-resolved `pnpm --filter @kreuzberg/node`. `crates/kreuzberg-node/package.json`'s name is `kreuzberg`, so the scoped filter never matched. Updated three sites (workflow + two scripts). 
- **alef bumped to v0.11.7.** Carries two codegen fixes the regenerated bindings depend on: (1) optional `string`/`bytes` arguments in Rust e2e tests now bind to a typed `Option`/`Option>` and pass via `.as_deref()` so signatures expecting `Option<&str>` no longer receive `&Option<_>`; (2) PHP backend correctly threads `data_enum_names` through the type mapper for tagged data enums. ## [4.10.0-rc.2] - 2026-04-28 Cycle 2 of the alef-backed publish-pipeline iteration. RC1 surfaced two failures: the `actions/check-registry@v1` and `actions/prepare-release-metadata@v1` shims now require alef ≥ 0.11.0 for the `check-registry` and `release-metadata` subcommands, but `alef.toml`'s top-level `version` field still pinned 0.10.4 (which `install-alef@v1` resolves "latest" against). Bump the alef pin to 0.11.0 so all kreuzberg jobs install an alef binary that has the new subcommands. Also fix the `release-metadata.json` artifact upload that was being wiped by the `prepare` job's re-checkout step (stash to /tmp before re-checkout, restore after). ## [4.10.0-rc.1] - 2026-04-28 First release candidate of v4.10.0. The release pipeline itself is the headline feature: this RC kicks off the iteration loop that proves out the alef-backed publish workflow against real registry endpoints. Substantive functional changes from v4.9.5 are listed below. ### Changed (release pipeline) - **The publish workflow now runs end-to-end on prerelease tags.** Previously, `if: !github.event.release.prerelease` on the `prepare` job blocked every RC tag from triggering CI. The gate is removed; RCs publish for real with prerelease dist-tags (npm `next`, gemspec `.pre.rc.N`, PyPI `rc{N}`). Homebrew formula updates remain gated on stable releases via a new `is_prerelease` metadata flag. - **`task version:set -- `** is the canonical way to set a release version (wraps `alef sync-versions --set`). Use it for both stable releases and RCs. 
- **Cross-manifest version validation in `scripts/publish/validate-version-consistency.sh` now glob-discovers the Ruby `version.rb` file** across all three rb-sys-style layouts and silently skips Go/PHP whose version-bearing manifest is absent (those ecosystems version via git tags, not in source). ### Fixed - **C FFI, PHP, Ruby, R bindings shipped 40+ stub functions that returned `Not implemented` at runtime**. Every batch API (`batch_extract_file_sync`, `batch_extract_bytes_sync`, plus async variants), `extract_file`, `extract_file_sync`, and most of the Ruby gem's surface (21 functions) silently failed with error code 99. Root cause was in the alef binding generator: bare `Path` was misresolved to `Named("Path")` and sanitized to `String`; sanitized batch-tuple params (`Vec<(PathBuf, Option)>`) were never handled by the PHP/Magnus/FFI codegen even though the IR carried the original type for JSON-roundtrip; Magnus rejected every extraction function via an over-strict `is_named_ref_param` check; and the R backend panicked on every async function. Fixed in alef and regenerated all bindings — Python, Node, FFI all build clean. Only `kreuzberg_get_preset` remains stubbed (return-type sanitization edge case; tracked separately). - **#788**: Extract images nested inside PDF Form XObjects — `lopdf::get_page_images` only scanned the page-level `Resources` dictionary and silently skipped images stored inside `Subtype=Form` XObjects, which is the structure used by technical drawings composed of tiled raster tiles. PDF image extraction now recursively descends into Form XObjects (up to 8 levels deep) in both the lopdf and pdfium code paths, so all constituent images are collected. - **Security limits actually enforce now**. 
`SecurityLimits` config fields (`max_nesting_depth`, `max_entity_length`, `max_content_size`, `max_iterations`, `max_xml_depth`, `max_table_cells`) and the matching `SecurityError` variants previously advertised protection that no extractor invoked — the validator helpers were `#[cfg(test)]`-gated and removed in commit `c58069201` as dead code, leaving only the config knobs. Five internal validators (`StringGrowthValidator`, `IterationValidator`, `DepthValidator`, `EntityValidator`, `TableValidator`) are restored and now run on every extraction path that ingests user-controlled bytes — XML-class formats (DOCX/PPTX/XLSX/ODT/EPUB/SVG/JATS/DocBook/FictionBook/OPML), HTML, JSON/YAML/TOML, tabular extraction (CSV/Excel/HTML tables/DOCX cells), and final text accumulation for plain-text formats (Markdown/Org/RST/LaTeX/Jupyter/RTF). Hostile inputs (billion-laughs entity expansion, depth bombs, cell bombs, quadratic string growth, iteration bombs) now fail with a structured `KreuzbergError::Security` instead of OOMing or hanging. The validators are internal core-only types; bindings observe the protection through the new unified `Security` error variant returned from every `extract_*` entry point. Defaults relaxed where the previous values false-positived on legitimate documents: `max_nesting_depth` 100 → 1024, `max_xml_depth` 100 → 1024, `max_entity_length` 32 → 1 048 576 (per-token cap; cumulative size remains bounded by `max_content_size`). - **#789**: PDF image extraction would hang indefinitely on documents with thousands of image objects on a single page (observed: 2487 images). The `max_images_per_page` cap was added to `ImageExtractionConfig` in #766 but only wired to the structure pipeline's position counting, never to the byte-decoding path; pages exceeding the cap are now skipped with a `WARN` log before the FlateDecode loop runs. 
Both `extract_images_from_pdf` and the pdfium fallback now run inside `tokio::task::spawn_blocking`, so `extraction_timeout_secs` can interrupt them. (#800) - **#794**: Fix Helm chart default install broken by two conflicts: (1) the cache init container ran as `root` while `podSecurityContext.runAsNonRoot: true` is the default, causing kubelet to reject the pod; (2) Kubernetes service discovery injects `KREUZBERG_PORT=tcp://...` when the release is named `kreuzberg`, which the binary parses as a `u16` and panics. Fixed by adding `runAsNonRoot: false` to the init container's `securityContext`, a new `cache.initChown` toggle (default `true`, set to `false` on fsGroup-aware storage to skip the init container entirely), and defaulting `enableServiceLinks: false` in the pod spec. (#822) - **#825**: `kreuzberg cache manifest` no longer fails with `E0282` when `kreuzberg-cli` is built without `paddle-ocr` or `layout-detection` (e.g. `--no-default-features --features bundled-pdfium`). The command now bails with a clear actionable error if invoked at runtime in such a build. - **`@kreuzberg/node` prebuilt bindings fail to load on RHEL 8 / AlmaLinux 8 / Rocky 8 / RHEL 9**: the Linux x64/arm64 GNU prebuilds are now built via `cargo-zigbuild`, which caps the glibc floor at link time. Fixes the `GLIBC_2.38 not found` / `GLIBCXX_3.4.31 not found` / `undefined symbol: __isoc23_strtoll` load errors on RHEL 8, AlmaLinux 8, Rocky 8 (glibc 2.28) and RHEL 9 (glibc 2.34). Verified locally: the prebuilt `.node` drops from `GLIBC_2.38` / `GLIBCXX_3.4.31` down to `GLIBC_2.28` / no `GLIBCXX` dependency. `kreuzberg-tesseract/build.rs` auto-detects the zigbuild toolchain and (1) disables tesseract's AVX512 codepath (zig/clang requires an explicit `evex512` feature that tesseract's CMake doesn't pass) and (2) skips linking `stdc++fs` (zig's libstdc++ has `std::filesystem` inline). 
The publish pipeline now (a) runs `objdump -T` against each linux-gnu prebuild and rejects any artifact requiring `GLIBC_*` > 2.28, any `GLIBCXX_*` symbol, or any `__isoc23_*` symbol, and (b) loads the prebuilt `.node` inside `redhat/ubi8` (glibc 2.28) and exercises the napi surface before publishing to npm. Refs #352. - **#781**: Fix DOCX OCR pipeline integration — reordered the extraction pipeline to ensure OCR processing runs before document rendering. Markdown, Plain, and Djot renderers now correctly receive and inject OCR text output instead of dropping it or omitting images. - **#823**: Fix WASM loading in Next.js / Turbopack by adding `/* webpackIgnore: true */` to dynamic imports of Node.js built-ins and resolving bundling issues in the pdfium-js subsystem. ### Added - **#788**: Extract images nested inside PDF Form XObjects — PDF image extraction now recursively descends into Form XObjects (up to 8 levels deep) in both the lopdf and pdfium code paths. ### Changed - Fix `use use` duplicate-import syntax error in alef-generated elixir NIF binding (`kreuzberg_nif/src/lib.rs`). - Apply `cargo fmt` uniformly across all workspace crates (formatting only, no logic changes). - Fix typo `entrys` → `entries` in auto-generated API reference docs. --- ## [4.9.5] - 2026-04-23 ### Fixed - **#790**: Fix GPU acceleration — kreuzberg now bundles CPU-only ONNX Runtime by default (zero-config).
When a GPU execution provider (`cuda`, `tensorrt`, `coreml`) is explicitly requested via `AccelerationConfig` but unavailable, kreuzberg returns an error with setup instructions instead of silently falling back to CPU. `Auto` mode gracefully falls back to CPU with an info log. For GPU support, set `ORT_DYLIB_PATH` to a GPU-enabled ONNX Runtime. - **#791**: Fix DOCX OCR extraction — OCR now runs on embedded images before document rendering, and OCR text is injected into the rendered output. Previously, OCR results were discarded and replaced with placeholder text. - **#783**: PaddleOCR backend not utilizing GPU (CUDA) despite `AccelerationConfig` — `AccelerationConfig` from `ExtractionConfig` was never reaching PaddleOCR ONNX sessions, silently falling back to CPU. Acceleration is now propagated through `OcrConfig` to all OCR call sites (image extractor, PDF OCR). - **#779**: Expose `PaddleOcrConfig` in Python bindings and update `OcrConfig` for backward compatibility. - **#792**: Fix Ruby gem packaging — exclude staged `libpdfium.dylib` from gem artifacts by narrowing the native extension glob to only include the compiled `kreuzberg_rb.*` extension. ### Added - GPU CI workflow (`ci-gpu.yaml`) targeting self-hosted GPU runners with NVIDIA GPUs. - Comprehensive GPU integration tests covering all ORT-accelerated paths: PaddleOCR (det/cls/rec), layout detection (RT-DETR), embeddings, document orientation detection, and end-to-end extraction. Tests use tracing log capture to verify CUDA EP is actually invoked. --- ## [4.9.4] - 2026-04-22 ### Fixed - **Ruby gem build failure** — add missing `max_images_per_page` field to `ImageExtractionConfig` initializer in Ruby binding (`kreuzberg-rb`), fixing compilation error E0063 on all platforms. - **Node binding build failure on Linux** — stop removing `/usr/local/lib/node_modules` in CI disk cleanup script; npm was being deleted before `pnpm/action-setup` could use it, causing `spawn npm ENOENT`. 
- **Homebrew formula publish failure** — grant `contents: write` permission to the `publish-homebrew` job so `gh release upload` can attach bottle artifacts (was `contents: read`). - **#783**: PaddleOCR now correctly utilises the GPU (CUDA) when `AccelerationConfig(provider="cuda")` is set. Previously `self.acceleration` on `PaddleOcrBackend` was always `None` (hardcoded at construction time), so the ONNX session builder never received the requested execution provider and silently fell back to CPU. `AccelerationConfig` is now threaded from `ExtractionConfig` into the ephemeral `OcrConfig` at each `process_image` call site (image extractor and both PDF OCR paths), and `PaddleOcrBackend::process_image` sets the module-level thread-local before the engine-pool slow path — so ONNX sessions are created with the correct provider on first use. --- ## [4.9.3] - 2026-04-22 ### Added - **Layout detection regions on PageContent** — new `layout_regions` field exposes detected layout regions (class, confidence, bounding box, area fraction) from the RT-DETR model when layout detection is enabled. Enables programmatic detection of diagrams, figures, tables, and other content types per page. Available across all 10 bindings. (#579) - **LayoutRegion type files** for Java, PHP, and Elixir bindings (were referenced but missing). - **E2E assertions for layout regions** — `has_layout_regions` and `layout_classes_include` assertion types in all 12 language generators. ### Fixed - **#779**: Fix `PaddleOcrConfig` not bound in Python API — exposed `PaddleOcrConfig` as a first-class class in the Python bindings. Updated `OcrConfig` to accept both `PaddleOcrConfig` objects and raw dictionaries for backward compatibility. Added `paddle_ocr_config` property (getter/setter) to `OcrConfig`. - **#770**: DOCX page extraction (`extract_pages=True`) now works correctly — `result.pages` and `result.get_page_count()` are no longer always `None`/`0`. 
Two bugs fixed: (1) computed `PageContent` blocks were never stored on `InternalDocument.prebuilt_pages`, so the derivation pipeline always fell back to `None`; (2) page-break markers inside table cells were incorrectly added to the top-level element list, creating phantom page boundaries before tables and corrupting `table_page_numbers`. Page breaks (`w:br[w:type="page"]` and `w:lastRenderedPageBreak`) in body text are stored as `DocumentElement::PageBreak` and mapped to precise character offsets; breaks inside table cells are intentionally ignored at document level (tables spanning multiple pages remain a known limitation). - **#773**: `serve` and `mcp` CLI subcommands now correctly apply `KREUZBERG_*` environment variable overrides. Previously, variables such as `KREUZBERG_OCR_LANGUAGE`, `KREUZBERG_LLM_MODEL`, and `KREUZBERG_LLM_API_KEY` were silently ignored when starting the API or MCP server — only the `extract` command honoured them. Also fixes the provider env-var fallback in the LLM client: `MISTRAL_API_KEY` is now picked up for bare `mistral-*` model names (e.g. `mistral-large-latest`), not only for the `mistral/` prefix form. - **#774**: Tagged-PDF structure tree dropped paragraph body text when a block had both own text and children, and wrapped numbered section headings in an invalid `List → Heading` AST (panics comrak in debug, emits malformed markdown in release). `flatten_blocks` now emits parent text alongside children; text-pattern list detection in `element_to_paragraph` is gated on `heading_level.is_none()`. - **Semantic chunker fallback path now respects `max_characters`** — previously the non-embedding fallback hardcoded a 4000-char ceiling and silently ignored the caller's `max_characters`. A warning is also emitted when `chunker_type='semantic'` is used without an `EmbeddingConfig` so the fallback mode is discoverable. The `ChunkerType::Semantic` docstring has been corrected to describe both paths accurately. 
- **OCR backend dispatch**: `OcrConfig(backend=...)` with a non-default backend no longer silently falls back to paddleocr when the chosen backend errors — auto-fallback is limited to the default tesseract backend; users who want multi-backend fallback configure it via `OcrConfig.pipeline` (unchanged). - **EasyOCR on PDFs**: `EasyOCRBackend.supports_document_processing()` returns `False` so Rust's `PdfRenderer` handles page rendering, removing the implicit `pdf2image`/`pymupdf` requirement that was never declared in the `[easyocr]` extra. - **Cross-format parity test failure** — HTML extractor now normalizes setext headings to ATX and strips trailing whitespace from html-to-markdown-rs output. - **Broken wasm-deno/wasm-workers e2e tasks** — removed non-functional deno and workers e2e generate/lint/test tasks that referenced invalid generator lang values. - **oxlint path in node e2e lint** — `oxlint --fix typescript` changed to `oxlint --fix .` (was looking for nonexistent `typescript/` directory). - **Clippy warnings in benchmark-harness** — `sort_by` replaced with `sort_by_key` + `Reverse`. - **Clippy warnings and compilation errors across workspace** — added missing `max_images_per_page` field to `ImageExtractionConfig` in node and Python bindings; added missing `vlm_prompt` argument to VLM OCR test calls; collapsed nested `if-let` in WASM embeddings; added `embeddings` and `tree-sitter` passthrough features to `kreuzberg-ffi` to silence `unexpected_cfgs` warnings. - **Cancellation token not wired in oxide segment structure pipeline** — `cancel_token` was passed into `SegmentStructureConfig` but never checked, meaning cancellation/timeout had no effect during pdf-oxide table extraction or paragraph building. Added cancellation checks at table page prep, heuristic table extraction loops, and a pre-flight guard before parallel paragraph extraction. - **#771**: `OcrConfig.vlm_prompt` is now correctly honored in VLM OCR requests. 
Previously, it was documented but never forwarded to the underlying VLM calls, causing the default template to be used regardless of configuration. - **#762**: PDF image links are no longer silently dropped from markdown output. Image extraction now correctly preserves correspondence between pdfium objects and lopdf data, and respects the `inject_placeholders` configuration. - **#769**: Downgraded `pre-commit-shfmt` to `v3.13.1-1` (fixes broken CI due to non-existent version in `main`). - **#766**: PDF extraction with large numbers of image fragments no longer hangs indefinitely — added `ImageExtractionConfig.max_images_per_page` (default `None`) to cap images processed per page. Batch-level `extraction_timeout_secs` now interrupts blocking pdfium threads at the next inter-page checkpoint via a `CancellationToken`, preventing the timeout from being silently bypassed. - **#764**: PST extractor now populates email attachments — `attachments` was hardcoded to an empty list and never read from the message; now reads attachment name, filename, MIME type, size, and binary data via the attachment table. PST entry IDs are now formatted as proper 48-char MAPI hex strings instead of Rust Debug output. ### Added - `ImageExtractionConfig.max_images_per_page` — optional cap on images decoded per page; prevents hangs on PDFs with thousands of inline image fragments. ### Changed - Removed redundant `.task/workflows/e2e.yml` — e2e tasks consolidated in top-level `Taskfile.yml`. 
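
The `max_images_per_page` cap listed under Added above can be pictured roughly as follows. This is a minimal sketch, not the extractor's actual code: `pages_to_decode` and its signature are hypothetical, and it assumes that a page whose image count exceeds the cap is skipped outright (the behaviour the Unreleased #789 entry describes) and that `None` means no cap.

```rust
/// Hypothetical helper (not part of the kreuzberg API): returns the indices of
/// the pages whose images would still be decoded under an optional per-page cap.
///
/// `image_counts[i]` is the number of image objects found on page `i`.
/// `max_images_per_page` mirrors the config field; `None` disables the cap.
pub fn pages_to_decode(image_counts: &[usize], max_images_per_page: Option<usize>) -> Vec<usize> {
    let mut kept = Vec::new();
    for (page_index, &count) in image_counts.iter().enumerate() {
        // Assumption: a page over the cap is skipped entirely, not truncated.
        let over_cap = match max_images_per_page {
            Some(cap) => count > cap,
            None => false,
        };
        if !over_cap {
            kept.push(page_index);
        }
    }
    kept
}
```

Under these assumptions, a cap of 1000 applied to per-page counts of 3, 2487, and 0 keeps pages 0 and 2 and skips the 2487-image page; with `None`, all three pages are decoded.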
--- ## [4.9.2] - 2026-04-19 ### Fixed - Fix cancellation token not checked in WASM (non-tokio) path for Excel, DOC, PPT, Pages, Keynote, and Numbers extractors — cancellation was silently ignored in WASM builds - Propagate `Cancelled` error code (9) to all bindings — Go, C FFI, Python, TypeScript, and C API docs now include the new code - Fix PHP e2e embed tests calling instance methods statically — use procedural `\Kreuzberg\embed()` functions - Fix TypeScript e2e embed tests using wrong field names (`type`/`name` → `modelType`/`value`) for embedding model config - Fix Elixir e2e embed tests calling non-existent `embed_async/2` — use sync `embed/2` - Fix TypeScript e2e generator missing `html_output` config mapping for styled HTML tests - Fix `ORT_DYLIB_PATH` on Windows CI pointing to `lib/` instead of the actual DLL location - Fix C# CI build conditional to require successful FFI build - Add `libuv1-dev` to Linux CI system dependencies for R package builds --- ## [4.9.1] - 2026-04-19 ### Fixed - **#754**: Preserve `_internal_bindings.pyi` type stub during wheel artifact cleanup — published wheels now include inline type information for the core binding module - Add missing `Default` impl for `PyCancellationToken` to satisfy clippy `new_without_default` lint - Improve download resilience for `eng.traineddata` in build script — increase retries from 3 to 5, add fallback URL via `raw.githubusercontent.com`, and increase timeout to 300s - Increase Task installer retry resilience in CI — 5 attempts with `--retry-all-errors` curl flag --- ## [4.9.0] - 2026-04-18 ### Fixed - **#588**: Suppress C23 glibc symbols (`__isoc23_strtoll` etc.) 
in manylinux wheels — added CMake flag propagation and CI verification step to prevent incompatible symbols on glibc < 2.38 (Debian 12, Ubuntu 22.04) - **#748**: Remove `kreuzberg-cli` from Python wheel to fix `libonnxruntime.so.1` loading failure — CLI is available as standalone release - **#749**: Add cancellation token support — cancelled extractions no longer block subsequent calls via `PDFIUM_OPERATION_LOCK`; wired across Python, Node.js, Ruby, WASM, and C FFI bindings - **#750**: Fix `kreuzberg[easyocr]` extra silently installing nothing on Python 3.14+; clean up stale `[paddleocr]` references in docs - **#752**: Fix ~1000x slowdown on Ghostscript-produced PDFs with structured output — replace O(N²) `Vec::contains` with O(1) `AHashSet` lookup, add minimum dimension filter for tiny inline images - **#753**: Fix `llm_usage` returning `None` when using VLM-based OCR — propagate usage through PDF OCR, image OCR, and `force_ocr_pages` paths ### Added - Cancellation token API available in all language bindings (`CancellationToken` in Python/Node/Ruby/WASM/FFI) ### Changed - **Breaking**: `kreuzberg-cli` binary is no longer bundled in the Python wheel — install the standalone CLI from GitHub releases --- ## [4.8.6] - 2026-04-17 ### Added - **PST message EntryID in extracted metadata** — the `entry_id` field from Outlook PST message entries is now included in the `metadata` HashMap of `EmailExtractionResult`, enabling callers to unambiguously link extracted data back to its source message. (#739) - **AccelerationConfig wired through all ORT model loading** — `AccelerationConfig` (CUDA, CoreML, TensorRT, Auto) is now propagated to all ONNX Runtime sessions: layout detection (RT-DETR, YOLO, SLANeT, TATR, TableClassifier), embeddings, document orientation, and PaddleOCR. Previously, GPU acceleration was silently ignored and all models used CPU. 
The `acceleration` field is also added to `LayoutDetectionConfig` and `EmbeddingConfig` across all 11 bindings (Python, TypeScript, Ruby, Go, Java, C#, PHP, R, Elixir, FFI, WASM). (#740) ### Added - Semantic chunker (`ChunkerType::Semantic`) for topic-aware document splitting - `topic_threshold` configuration field for embedding-based topic detection - `utils/markdown_utils` shared utility for ATX heading detection - `preset_chunk_size()` helper in embeddings module - E2e contract fixtures for semantic chunking ### Fixed - **Batch extraction panics with "Lazy instance has previously been poisoned" on ARM64 Linux** — OCR backend registry initialization used `panic!()` on Tesseract/PaddleOCR init failures, poisoning the `Lazy` static and cascading to all concurrent batch tasks. Replaced with `tracing::warn!()` + graceful skip. Also converted `GLOBAL_RUNTIME`, `EXTRACTORS_INITIALIZED`, and 3 `PROCESSOR_INITIALIZED` statics from `once_cell::sync::Lazy` to `once_cell::sync::OnceCell` (retry on failure instead of permanent poisoning). Migrated ~15 collection/cache `Lazy` statics to `std::sync::LazyLock`. (#741) - **PaddleOCR `model_tier` from TOML config ignored by API server** — the singleton PaddleOcrBackend always used `self.config.model_tier` (default "mobile") to resolve models, ignoring the per-request `paddle_ocr_config.model_tier` from the user's TOML/API config. Engine initialization now uses the effective per-request config. (#725) - **VLM OCR backend ignored when paddle-ocr feature enabled** — the auto-constructed OCR pipeline hardcoded `vlm_config: None` on pipeline stages, silently discarding the user's VLM configuration. Users who configured `OcrConfig(backend="vlm", vlm_config=LlmConfig(...))` got tesseract/paddleocr output instead of VLM. The pipeline now propagates `vlm_config` from the parent `OcrConfig`. 
(#738) - **Doubled OCR content and corrupted page text in image extraction** — OCR elements were injected into the rendering pipeline as `OcrText` internal elements, causing `render_plain` to append every raw word token after the coherent HOCR string. `ExtractionResult.content` was effectively duplicated and `pages[*].content` contained a word-by-word dump instead of the readable text. OCR elements are now stored directly via `prebuilt_ocr_elements`, bypassing the rendering pipeline. (#706) - **Image OCR pages[] empty** — `include_elements` was not forced true for image extraction, so backends that gate element output (e.g. paddle-ocr) returned `None`, leaving `pages[]` empty. (#723) - **`LlmConfig` missing `Default` trait** — the documented `..Default::default()` struct-update pattern failed to compile with "trait not satisfied". Added `Default` to the derive macro; all optional fields default to `None`, `model` to `""`. (#716) - **Incorrect `llm` Cargo feature name in docs** — `llm-integration.md`, `api-rust.md`, and `configuration.md` referenced a `llm` feature that does not exist; the correct name is `liter-llm`. (#717) - **LLM embedding provider panics in server mode** — `embed_texts` called `block_on` inside a new runtime, which panics when already inside tokio (HTTP server, MCP). Uses `block_in_place` with the current runtime handle when available, falls back to a new runtime for standalone sync callers. (#713, #714) - **Duplicate `output_format` key in OCR metadata** — stale `additional` HashMap insert caused a duplicate JSON key violating RFC 8259. The value is already on the typed `Metadata::output_format` field. (#712) - **OCR table metadata serialized as strings instead of numbers** — `table_count`, `tables_detected`, `table_rows`, and `table_cols` were `"0"` instead of `0`, breaking numeric comparisons in all bindings. 
(#712) - **Ruby `structured_output` not exposed on Result** — the field was missing from the Ruby binding's `Result` class and not serialized from the native extension. (#736) - **Stale hf-hub lock files block embedding model downloads** — cleaned up orphaned lock files before downloading. (#721) - **WASM live demo `enableOcr()` not called** — OCR was silently unavailable in the demo; also throws on missing Rust registry export. (#719, #720) - **DOCX tables assigned wrong page numbers** — tables were numbered by index instead of by their actual document position based on page breaks. (#718) - **`ocr.enabled=false` config ignored** — OCR ran even when explicitly disabled; also dropped trailing newline in `--format text` output. (#715) - **Go module tag push fallback** — added `git push` fallback when tag push fails. - **Go E2E `LlmUsage` type mismatch** — generated Go test helper used `[]interface{}{}` instead of `[]kreuzberg.LlmUsage{}`. - **Rust E2E `extractMetadata` field name** — html_options fixture used camelCase `extractMetadata` instead of snake_case `extract_metadata` expected by html-to-markdown-rs v3.2. - **R package documentation stale** — 14 exported functions lacked `.Rd` man pages and `extraction_config.Rd` was missing 13 parameters added in v4.8.0–4.8.5. Regenerated all roxygen2 documentation. ### Changed - Updated all dependencies including html-to-markdown-rs 3.1→3.2, pdf_oxide 0.3.30→0.3.32, tokio 1.51→1.52. --- ## [4.8.5] - 2026-04-14 ### Added - **LLM usage tracking** — new `llm_usage` field on `ExtractionResult` captures token counts, estimated cost (USD), model identifier, and finish reason for every LLM call (VLM OCR, structured extraction, LLM embeddings). Multiple entries are produced when multiple LLM calls occur in a single extraction. Exposed across all bindings: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, R, C FFI, and WASM. 
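
As a rough illustration of how a caller might aggregate multiple `llm_usage` entries from one extraction, the sketch below sums tokens and estimated cost over a list of records. The `LlmUsage` struct here is a hypothetical stand-in: its field names and types are assumptions made for the example, not the actual definition exposed by the bindings.

```rust
/// Hypothetical stand-in for one `llm_usage` entry; field names are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct LlmUsage {
    pub model: String,
    pub input_tokens: u64,
    pub output_tokens: u64,
    /// Estimated cost in USD; `None` when no estimate is available.
    pub estimated_cost_usd: Option<f64>,
    pub finish_reason: Option<String>,
}

/// Total tokens (input + output) across every LLM call in one extraction.
pub fn total_tokens(usage: &[LlmUsage]) -> u64 {
    usage.iter().map(|u| u.input_tokens + u.output_tokens).sum()
}

/// Sum of the known cost estimates; entries without an estimate are ignored.
pub fn total_cost_usd(usage: &[LlmUsage]) -> f64 {
    usage.iter().filter_map(|u| u.estimated_cost_usd).sum()
}
```

Keeping the cost optional in the sketch avoids reporting a misleading zero for calls where no estimate exists; whether the real field is optional is not stated in this entry.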
### Fixed - **Markdown chunker duplicates heading when `prepend_heading_context` is enabled** — the heading was prepended twice when a chunk boundary aligned with a heading node, producing repeated heading text in the output. (#701) - **Helm chart icon 404 on Artifact Hub** — `Chart.yaml` referenced `logo.png` but the file is `logo.svg`. - **Python wheel manylinux compliance failure** — bumped manylinux from `2_38` to `2_39` to allow `GLIBCXX_3.4.31` symbols from the build toolchain, matching the v4.6.x baseline that worked. - **Python wheel requires glibc ≥ 2.38 (breaks Debian 12, Ubuntu 22.04)** — GCC 14 in the `manylinux_2_39` build container emitted C23-versioned glibc symbols (`__isoc23_strtoll`, `__isoc23_sscanf`, etc.), making the wheel uninstallable on systems with glibc < 2.38. Downgraded to `manylinux_2_28` and added `-std=gnu11`/`-std=gnu++17` CFLAGS to suppress C23 symbol emission. (#588) - **FFI memory leak** — `kreuzberg_free_result` was not freeing `djot_content_json`, `structured_output_json`, and `llm_usage_json` pointers. - **R e2e embed tests fail** — generated R embedding config was missing the `type` discriminator field required by Rust's tagged enum deserialization. - **Elixir parity test fails** — `ExtractionConfig` struct was missing the `:html_output` field. - **Go LLM e2e tests fail** — `EmbeddingModelType` struct was missing `Llm` nested config, `ExtractionConfig` was missing `StructuredExtraction` field. - **WASM tree-sitter build fails** — `tree-sitter-language-pack` 1.6.0 removed the `wasm` feature; removed stale feature gate from wasm32 target dependency. --- ## [4.8.4] - 2026-04-13 ### Added - **Helm chart for Kubernetes deployment** — minimal, security-hardened Helm chart with Deployment, Service, Ingress, PVC, HPA, PDB, and ServiceAccount templates. Publishes to GHCR as an OCI artifact. 
(#695) - **Helm lint and kubeconform pre-commit hooks** — added `helm lint --strict` and `kubeconform` (k8s 1.28.0 schema validation) to pre-commit and CI pipeline. - **Helm chart publish workflow** — new `publish-helm.yaml` GitHub Actions workflow pushes versioned chart to `oci://ghcr.io/kreuzberg-dev/charts`. ### Fixed - **Helm chart: init container cannot chown as non-root** — the `init-cache` container needs root to `chown` the PVC mount. Added `securityContext.runAsUser: 0` to the init container. - **Helm chart: unpinned busybox image tags** — pinned `busybox:latest` to `busybox:1.37-glibc` in init container and test pod for reproducibility. - **Comrak bridge panics on multi-byte UTF-8 boundaries** — annotation byte offsets landing inside multi-byte characters (e.g. Cyrillic, `\u00ab\u00bb`) caused panics in `build_inlines()`. Snaps offsets to valid char boundaries using `ceil_char_boundary()`/`floor_char_boundary()`. (#696) --- ## [4.8.3] - 2026-04-12 ### Fixed - **ONNX session creation fails on Linux x86-64 with "graph_optimization_level is not valid"** — `GraphOptimizationLevel::Level3` maps to `ORT_ENABLE_LAYOUT` (value 3), only valid in ORT >= 1.21. The Linux wheel bundled ORT 1.20.1 due to a hardcoded version override in the publish workflow. Fixed by switching to `GraphOptimizationLevel::All` (ORT_ENABLE_ALL = 99, valid across all ORT 1.x) and aligning all ORT versions to 1.24.2 (matching ort-sys 2.0.0-rc.12). Also upgraded manylinux target from `manylinux_2_28` to `manylinux_2_35` to support the newer ORT binaries. (#683) ### Documentation - **Documented AVX/AVX2 CPU requirement for ONNX Runtime features** — CPUs without AVX support (e.g. Intel Atom, Celeron N5105/Jasper Lake) cannot use PaddleOCR, layout detection, or embeddings. Added warning and system requirements entry to installation docs. 
(#691) --- ## [4.8.2] - 2026-04-10 ### Added - **`HtmlOutputConfig` typed in all bindings** — `html_output` config field (themes, CSS classes, embed CSS, custom CSS, class prefix) now fully typed in Python, TypeScript/Node, Go, Ruby, Elixir, PHP, Java, C#, R, and FFI. Previously only available in Rust core. ### Fixed - **PDF: legitimate repeated content stripped during page merging regardless of `strip_repeating_text` flag** — `deduplicate_paragraphs()` in the PDF merge pipeline runs unconditionally after per-page extraction, removing consecutive identical paragraphs (≥5 chars) and non-consecutive body-text duplicates (≥15 chars) via HashSet dedup. This strips brand names and other legitimately repeated content even when `ContentFilterConfig.strip_repeating_text` is set to `false`. Gated both deduplication passes behind the `strip_repeating_text` flag so they are skipped when content filtering is disabled (#670, #681) - **R package build failure** — R binding Cargo.toml version was stuck at 4.6.3 while core was at 4.8.1, causing tokio version resolution failure. Version sync script now includes the R native extension Cargo.toml. - **CI: PyPI publish action failure** — pinned `pypa/gh-action-pypi-publish` to v1.13.0 (v1.14.0 has broken Docker image on GHCR) - **E2E: Elixir generator emitted undefined `is_nan/1` function** — added helper function definition to the generated Elixir test helpers --- ## [4.8.1] - 2026-04-09 ### Added - **Styled HTML output** — New `HtmlOutputConfig` on `ExtractionConfig` with 5 built-in themes (`default`, `github`, `dark`, `light`, `unstyled`), semantic `kb-*` CSS class hooks on every structural element, CSS custom properties (`--kb-*`), custom CSS injection (inline or file), and configurable class prefix. 
The existing `Html` output format is upgraded in-place when `html_output` is set (#633, #665) - 5 new CLI flags: `--html-theme`, `--html-css`, `--html-css-file`, `--html-class-prefix`, `--html-no-embed-css` — any flag implicitly sets `--content-format html` - `HtmlOutputConfig` and `HtmlTheme` types exposed in Rust public API ### Changed - **Vendored yake-rust 1.0.3** into kreuzberg core, removing external dependency - Fixes #676: `BacktrackLimitExceeded` panic on large files (10+ MB) by replacing regex-based sentence splitting with memchr-based approach - Expanded YAKE stopwords from 34 to 64 languages using kreuzberg's unified stopwords module - Removed 6 transitive dependencies (yake-rust, segtok, fancy-regex, streaming-stats, hashbrown, levenshtein) - Styled HTML renderer included in the `html` feature (no separate `html-styled` feature gate) ### Fixed - **PPTX: panic on non-char-boundary during page boundary recomputation** — byte offsets could land inside multi-byte UTF-8 characters (e.g. `…` U+2026), causing a panic when slicing content (#674) - **PDF: `include_headers` / `include_footers` flags ignored by layout-model furniture stripping** — when a layout-detection model classified paragraphs as `PageHeader` or `PageFooter`, they were unconditionally stripped as furniture regardless of `ContentFilterConfig` flag values. Setting `strip_repeating_text=false` with `include_headers=true` now correctly preserves those regions (#670) - **PDF: heuristic table detector misclassifies body text as tables on slide-like PDFs** — PowerPoint-exported PDFs with column-like text gaps produced false-positive 2–3 row "tables" whose bounding boxes covered the entire page, suppressing all body text from the structured extraction pipeline. 
Tables with ≤3 rows spanning >50% of the page height are now rejected as false positives - **PPTX: `ImageExtractionConfig.inject_placeholders` silently ignored** — setting `inject_placeholders=false` now correctly suppresses `![alt](target)` image references in PPTX markdown output (#671, #677) - **DOCX/HTML/DocBook/LaTeX/RST: `inject_placeholders` config ignored** — all extractors now honour `ImageExtractionConfig.inject_placeholders` to suppress image reference injection when set to `false` - **PPTX public API cleanup** — `extract_pptx_from_path` and `extract_pptx_from_bytes` now accept `&PptxExtractionOptions` instead of 6 positional parameters --- ## [4.8.0] - 2026-04-08 ### Added - **Cross-extractor content filtering configuration** — New `ContentFilterConfig` on `ExtractionConfig` with `include_headers`, `include_footers`, `strip_repeating_text`, and `include_watermarks` flags. Controls header/footer/furniture inclusion across PDF, DOCX, RTF, ODT, HTML, EPUB, and PPT extractors. Typed in all bindings (Python, TypeScript, Ruby, Go, Elixir, PHP, Java, C#, WASM). - **Local LLM support** via liter-llm 1.2 — use Ollama, LM Studio, vLLM, llama.cpp, LocalAI, or llamafile as VLM OCR, embedding, or structured extraction backends with zero API key configuration - **LLM-powered document intelligence via liter-llm** — Integrates with 146 LLM providers (including local inference engines) for three new capabilities: - **VLM OCR**: Vision language models as OCR backend (OpenAI GPT-4o, Anthropic Claude, Google Gemini, etc.). Superior accuracy for low-quality scans, handwriting, Arabic/Farsi, and complex layouts. Configure via `ocr.backend = "vlm"` with `ocr.vlm_config`. - **Structured Extraction**: Extract structured JSON data from documents using a JSON schema constraint. Users provide a schema and optional Jinja2 prompt template; the LLM returns conforming data. Supports strict mode (OpenAI) with automatic schema sanitization for cross-provider compatibility. 
- **VLM Embeddings**: Provider-hosted embedding models (e.g., `openai/text-embedding-3-small`, `mistral/mistral-embed`) as alternative to local ONNX models. Works through existing `/embed` API, `embed_text` MCP tool, and `embed` CLI command. - **New CLI command**: `kreuzberg extract-structured` for schema-guided LLM extraction - **New API endpoint**: `POST /extract-structured` with multipart file upload - **New MCP tool**: `extract_structured` for AI assistant integration - **Minijinja template engine** for customizable LLM prompts — structured extraction supports `{{ content }}`, `{{ schema }}`, `{{ schema_name }}`, `{{ schema_description }}`; VLM OCR supports `{{ language }}` - **5 new environment variables**: `KREUZBERG_LLM_MODEL`, `KREUZBERG_LLM_API_KEY`, `KREUZBERG_LLM_BASE_URL`, `KREUZBERG_VLM_OCR_MODEL`, `KREUZBERG_VLM_EMBEDDING_MODEL` - `LlmConfig` and `StructuredExtractionConfig` types exposed in Python, Node.js, and PHP bindings - `structured_output` field on `ExtractionResult` across all languages - `structured_output_json` field in C FFI `CExtractionResult` struct - `EmbeddingModelType::Llm` variant for provider-hosted embeddings - VLM OCR registered as plugin backend in OCR registry - Standalone text embedding API (#599, #614) with `/embed` endpoint, `embed_text` MCP tool, and `embed` CLI command ### Changed - **License changed from MIT to Elastic License 2.0 (ELv2)** — copyright holder changed to Kreuzberg, Inc. Forked upstream crates (kreuzberg-paddle-ocr, kreuzberg-tesseract, kreuzberg-pdfium-render) retain their original MIT licenses. 
- All `ExtractionResult` constructors refactored to use `..Default::default()` for forward compatibility - Embed CLI command extended with `--provider llm` and `--model` flags - Embed MCP tool extended with `model` and `api_key` parameters - Extract CLI overrides extended with `--vlm-model`, `--vlm-api-key`, `--vlm-prompt` - API returns 501 Not Implemented (instead of 500) when liter-llm feature is disabled - JSON schema `additionalProperties` automatically stripped for non-OpenAI providers ### Fixed - FFI error code tests updated for Embedding variant - Flaky FFI string_intern tests serialized with `serial_test` - TypeScript `NativeBinding` interface updated with `embedSync`/`embed` declarations - E2E generator emits minimal `cfg` (no `any()` wrapper for single conditions) - **PDF: brand names stripped by repeating text detection** — `ContentFilterConfig.strip_repeating_text = false` disables cross-page repeating text removal that incorrectly strips brand names from PowerPoint-exported decks (#667) - **PPTX: slide order scrambled for decks with 10+ slides** — Fixed lexicographic sort of slide paths (`slide10.xml` before `slide2.xml`) to use numeric ordering (#669) - **UTF-8 panic in arXiv watermark stripping** — `strip_arxiv_watermark_noise` panics when a multi-byte character spans the 6000-byte search limit. Fixed with `floor_char_boundary` (#663) - **DOC: garbled text from old Word files** — CP1252 text misread as UTF-16LE when the fCompressed bit is unreliable. Added heuristic to detect and re-decode garbled output (#666) - **WASM: table extraction returns empty array** — TypeScript validation silently drops tables when `pageNumber` is null. 
Fixed to default to page 0 (#655) --- ## [4.7.4] - 2026-04-06 ### Added - Re-added `--layout` boolean CLI flag for easy layout detection enablement (use `--layout` to enable with model defaults, `--layout false` to explicitly disable) - arXiv watermark/sidebar noise filtering for academic PDFs — strips LaTeX sidebar identifiers from extracted text - Second-tier cross-page repeating text detection — catches conference headers and journal running titles that repeat on >70% of pages but appear outside the margin zone - Figure/picture text suppression — text inside layout-detected Picture regions is now marked as page furniture and excluded from body output ### Fixed - **Figure-internal text leaking into body output** — Text from inside figures and diagrams (e.g., diagram labels, axis text) was incorrectly included in the extracted body content, sometimes promoted to headings. The layout detection pipeline now suppresses text paragraphs classified as Picture regions. - CLI tests now correctly reference `--content-format` instead of deprecated `--output-format` - **Empty image references in PDF markdown/HTML output** — PDFs with embedded images produced empty `![]()` references in markdown and empty image tags in HTML output. The PDF structure pipeline now extracts actual image pixel data via pdfium and populates document images, producing proper `![](image_N.png)` references. - **Invalid `extractFromFile` config in documentation** — Demo code in the TypeScript API reference included invalid configuration parameters that caused runtime errors. - **WASM build failure with `extern "C-unwind"`** — The LLVM WASM backend does not support `cleanupret` instructions generated by `extern "C-unwind"` FFI blocks. Added an `ffi_extern!` macro that uses `extern "C-unwind"` on native targets (for C++ exception safety) and `extern "C"` on WASM.
- **Go module tag format** — Go module tags now use the correct `packages/go/v4/vX.Y.Z` format matching the module path in `go.mod`, plus the legacy `packages/go/vX.Y.Z` format for backwards compatibility. Backfilled tags for all stable releases. ### Changed - CLI documentation updated with all missing extraction override flags (`--layout-table-model`, `--disable-ocr`, `--cache-namespace`, `--cache-ttl-secs`) --- ## [4.7.3] - 2026-04-05 ### Fixed - **Archive extraction SIGBUS crash on macOS ARM64** — ZIP, 7Z, TAR, and GZIP archive extraction crashed with SIGBUS (signal 10) in release builds due to miscompilation of unsafe code in `sevenz-rust2` and `zip` crates under `opt-level=3`. Reduced optimization level to 2 for these crates. This also fixes Elixir, R, Go, and C benchmark crashes when processing archive files. - **Native-text PDF extraction fails when OCR backend unavailable** (#646) — PDFs with extractable native text hard-failed with `ParsingError: All OCR pipeline backends failed` when no OCR backend (PaddleOCR/Tesseract) was installed, even though pdfium already extracted text successfully. The automatic OCR quality-enhancement pass now gracefully falls back to the native extraction result when OCR backends are unavailable, emitting a warning instead of failing. - **Elixir Logger pollutes stdout** — Elixir benchmark scripts produced `[debug] Initialized Kreuzberg.Plugin.Registry` on stdout, corrupting JSON output. Logger default handler now configured to write to stderr via `config :logger, :default_handler`. - **WASM benchmark module resolution** — WASM benchmark script failed to load `@kreuzberg/wasm` through pnpm virtual store due to `import.meta.url` resolution issues in tsx. Changed to direct import from local build path. - **CI: FFI-dependent tests fail when FFI build skipped** — Go, Elixir, R, C FFI, and CLI test jobs ran and failed when `build-ffi` was skipped by paths-filter. Added `needs.build-ffi.result == 'success'` guard. 
- **Rust cannot catch foreign exceptions crash** (#606) — C++ exceptions from Tesseract or Leptonica (e.g. on corrupted images or edge-case inputs) propagated across the FFI boundary unhandled, causing `fatal runtime error: Rust cannot catch foreign exceptions, aborting`. All Tesseract/Leptonica FFI declarations now use `extern "C-unwind"` to allow foreign exceptions to unwind safely, and OCR processing is wrapped with `catch_unwind` to convert them to recoverable errors. --- ## [4.7.2] - 2026-04-04 ### Added - **E2E generator published mode** — `cargo run -p kreuzberg-e2e-generator -- generate --mode published --version` (followed by the target version) generates standalone test apps against published registry versions (PyPI, npm, Maven, NuGet, crates.io, Hex, RubyGems). All 12 language generators now also produce their project/dependency files (pyproject.toml, package.json, composer.json, etc.). ### Changed - **Global model cache** (#641) — Models now download to a platform-appropriate global cache (`~/.cache/kreuzberg/` on Linux, `~/Library/Caches/kreuzberg/` on macOS, `%LOCALAPPDATA%/kreuzberg/` on Windows) instead of per-directory `.kreuzberg/` folders. Override with the `KREUZBERG_CACHE_DIR` env var. Consolidates 7 duplicate cache-dir resolution implementations into a single `cache_dir::resolve_cache_dir()` function. ### Fixed - **Embedded HTML in PDF text layers** — PDFs with raw HTML tags in their text layer produced backslash-escaped garbage in output. Such markup is now detected and converted to clean markdown using `html-to-markdown-rs`, the same crate and config used by the HTML extractor. Comrak-generated HTML comments are also stripped from output. - **Code classification false positives** — Layout model sometimes classified regular prose as Code blocks. Added a prose guard that rejects Code classification for text with sentence punctuation, low syntax density, and many words. - **PageBreak rendering as `-----` separators** — PageBreak elements in InternalDocument were rendered as ThematicBreak: `-----` in markdown and a horizontal rule in HTML output. This polluted extraction output with separators that don't exist in the source document. PageBreak is now treated as structural metadata — paragraph breaks between elements provide sufficient page separation, matching the pdfium baseline behavior. - **Leptonica DPI crash** (#606) — Images with resolution 0 DPI caused Leptonica preprocessing (background normalization, unsharp mask, grayscale conversion) to trigger a C++ exception that Rust cannot catch, aborting the process. Now validates and fixes DPI to 72 before preprocessing. Also disabled C++ exception handling on Windows MSVC builds (`/EHsc` removed). - **Node.js `ExtractionResult.children` missing at runtime** — The `children` field was declared in TypeScript definitions but missing from the runtime NAPI object in the published v4.7.1 binary, causing parity test failures. - **Layout detection fixture stale `preset` field** — E2E fixture `layout_detection.json` included removed `preset` field, causing Python test failures. Removed from fixture. - **Node.js `disable_ocr` config not respected** — Setting `disableOcr: true` in the Node.js binding still produced OCR content for images instead of returning empty content. - **C# `Serialization` class inaccessible** — Generated e2e tests referenced `Serialization` class with insufficient access level in the published NuGet package. - **Java `PdfAnnotation` missing getters** — `getContent()` and `getPageNumber()` methods were missing from the Java record, causing parity test failures. Added JavaBean-style getters to match `getAnnotationType()` and `getBoundingBox()`. - **Java `Table` missing getters** — `getCells()`, `getMarkdown()`, and `getPageNumber()` methods were missing from the Java record. Added JavaBean-style getters to match existing `getBoundingBox()`. - **Go test_app module conflict** — Generated Go test_apps used the same module name as e2e/go, causing workspace conflicts. Published mode now uses a distinct module path.
- **PaddleOCR angle classification crash** (#643) — V2 angle classifier model (`PP-LCNet_x1_0_textline_ori`) expects `[N, 3, 80, 160]` input but preprocessing resized to `[N, 3, 48, 192]` (old mobile cls dimensions). Fixed input dimensions to match the v2 model. - **Centralized concurrency controls** — Fixed 5 places bypassing `resolve_thread_budget()`: embeddings ONNX session (no thread config at all), image OCR (hardcoded 8 tasks), batch extraction fallback (`num_cpus * 1.5`), doc orientation (`.min(4)` cap), PaddleOCR BaseNet (`inter_threads` set to `num_thread` instead of `1`). - **Chunk page numbers missing** (#636) — Chunks produced with `first_page: null, last_page: null` when chunking was configured without explicit `pages` config. Three fixes: (1) auto-enable page tracking when chunking is configured, so the PDF extractor always produces per-page boundaries; (2) improved page boundary recomputation with first-line fallback when exact content match fails due to rendering transformations; (3) allow zero-length boundaries for blank pages instead of failing validation. --- ## [4.7.1] - 2026-04-03 ### Added - **Tree-sitter grammar management CLI** — New `kreuzberg tree-sitter` subcommand with `download`, `list`, `cache-dir`, and `clean` sub-commands for managing tree-sitter grammar parsers. Supports downloading by language name, group (`--groups web,systems,scripting`), or all (`--all`). Reads `[tree_sitter]` config from `kreuzberg.toml` with `--from-config`. - **Tree-sitter grammar management API** — New REST endpoints: `POST /grammars/download`, `GET /grammars/list`, `GET /grammars/cache`, `DELETE /grammars/cache` for programmatic grammar management. - **Tree-sitter grammar management MCP tools** — New MCP tools: `download_grammars`, `list_grammars`, `grammar_cache_info`, `clean_grammar_cache` for AI assistant-driven grammar management. 
- **Tree-sitter config startup initialization** — API and MCP servers auto-download tree-sitter grammars on startup when `[tree_sitter]` config specifies `languages` or `groups`. ### Changed - **Normalized OCR+layout pipeline** — Tesseract+layout path now follows the same architecture as pdfium+layout: hOCR → PdfParagraph → `apply_layout_overrides` → `assemble_internal_document` → comrak. Replaces the broken custom `apply_layout_to_ocr_document` path that destroyed paragraph structure and reading order. - **Elixir NIF crash protection** — All extraction and batch NIFs now wrapped with `catch_unwind` to prevent panics in native C libraries (pdfium, tesseract) from crashing the BEAM VM. Panics are caught and returned as `{:error, reason}` tuples with error-level tracing including backtraces. ### Fixed - **hOCR parser depth tracking** — Fixed paragraph boundary detection in the hOCR parser, which used one generic depth counter shared across three different HTML tag types. Closing tags from inner word spans could prematurely terminate a paragraph, causing content after that point to be silently dropped. Now uses tag-name-specific depth tracking. - **hOCR multi-page content loss** — Per-page hOCR documents from tesseract always report `ppageno=0` (page=1), but the paragraph conversion filtered by the actual page index, silently dropping all content on pages 2+. Removed the per-page filter since each hOCR document is independently extracted per page. - **OCR batch parallelization** — OCR page processing was hardcoded to 4 concurrent pages regardless of available CPUs. Now uses `resolve_thread_budget()` (auto-detects CPUs, capped at 8) for significantly faster multi-page document processing. - **Benchmark workflow** — Removed reference to deleted `kreuzberg-extract` binary target. - **Ruby OCR backend** — Added missing `ocr_internal_document` field to `ExtractionResult` construction. - **Keyword extraction tests** — Updated test assertions to use new `extracted_keywords` field instead of deprecated `metadata.additional["keywords"]`. - **PaddleOCR cache dir test** — Fixed test failure when `KREUZBERG_CACHE_DIR` environment variable is set by CI setup actions. - **API `pdf_password` handler** — Added `#[cfg(feature = "pdf")]` gate to prevent compile error when `api` feature is enabled without `pdf`. - **Chunking page boundary regression** (#636): Page boundaries were computed against raw extractor text but `result.content` uses rendered text with different byte lengths. Chunks now recompute boundaries from per-page content, fixing `first_page`/`last_page` being null and the "Page boundary byte_end exceeds text length" validation warning. - **HF Hub environment variables** (#634): Use `ApiBuilder::from_env()` instead of `ApiBuilder::new()` for Hugging Face model downloads, respecting `HF_HOME` and `HF_ENDPOINT` environment variables. Fixes permission errors on Kubernetes when running as non-root.
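The OCR batch parallelization entry above describes a budget that is derived from the detected CPU count and capped at 8. A minimal standard-library sketch of that idea follows; the name `thread_budget_sketch` and the fallback to a single thread are assumptions for illustration, not the actual `resolve_thread_budget()` implementation.

```rust
use std::thread;

/// Hypothetical stand-in for a CPU-derived concurrency budget with an upper cap.
fn thread_budget_sketch(cap: usize) -> usize {
    // available_parallelism() can fail (for example in restricted sandboxes),
    // so this sketch falls back to one thread in that case (an assumption).
    let detected = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // Clamp to the cap, but never return zero.
    detected.min(cap).max(1)
}
```

Calling `thread_budget_sketch(8)` yields a value between 1 and 8, so a machine with 4 cores processes 4 pages at a time and a 64-core machine stops at 8.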
- **PDF bridge tracing panic on multibyte characters** (#635): Use `.chars().take()` instead of byte indexing for `text_preview` in PDF structure bridge tracing, preventing panics on multibyte UTF-8 characters (e.g., `•`). - **Go FFI struct layout** — vendored C header was missing `children_json` field, causing 8-byte offset shift. All FFI fields after `chunks_json` read wrong memory (e.g., `ocr_elements_json` read `mime_type` instead). - **Java FFI struct layout** — `CExtractionResult` layout was missing `code_intelligence_json` field, causing `success` flag to read from wrong offset. All Java extractions returned `success=false`. - **PHP `__get` magic method bypass** — six JSON fields (`elements`, `djotContent`, `document`, `ocrElements`, `children`, `uris`) returned raw JSON strings instead of deserialized arrays because `#[php(prop)]` intercepted property access before `__get`. - **Ruby `disable_ocr` config** — `disable_ocr` keyword was not parsed in Ruby config handler, causing OCR to run even when explicitly disabled. - **Node.js `ExtractionResult` parity** — `document`, `djotContent`, and `ocrElements` fields were `Option` which NAPI-RS omitted from JS objects when `None`. Changed to `Value` defaulting to `null`. - **Node.js `convertChunk` missing `chunkType`** — TypeScript type converter did not forward the `chunk_type` field from NAPI bindings. - **ODT caption text extraction** — text inside `draw:frame > draw:text-box > text:p` (e.g., image captions) was not extracted. The ODT extractor now recurses into text-box content. - **OCR InternalDocument propagation** — `run_ocr_pipeline` discarded the structured InternalDocument built by `extract_with_ocr`, causing OCR results to fall back to naive `\n\n` paragraph splitting. Now propagated through the full pipeline. - **OCR table cells** — OCR-detected tables (via TATR) had empty `cells` vectors, causing comrak to render them as paragraphs instead of proper tables. 
Now populated from the cell grid, matching the native text path fix. - **OCR non-layout InternalDocument** — When layout detection is not active, the OCR path now builds an InternalDocument from results instead of returning None. Ensures structured output regardless of layout detection availability. - **Italian/European PDF ligature corruption** — Extended contextual ligature repair to handle `tt`, `ti`, `tti` ligatures common in Italian fonts. Fixes garbled text like `Dire*ore` → `Direttore`, `ges:one` → `gestione`, `progeM` → `progetti`. - **OCR layout false heading classification** — Tesseract+layout pipeline was worse than pure tesseract (33% vs 41% SF1) because layout confidence threshold was too low (0.5). Raised to 0.7 for OCR path where font-size validation is unavailable. - **OCR table rendering** — OCR-detected tables were not linked to InternalDocument elements, causing comrak to skip them entirely. Tables now properly registered via `push_table()` with corresponding `ElementKind::Table` elements. - **Spurious table detection** — Multi-column prose with short cells (like nougat_008) bypassed the prose row check due to a 30-char minimum row length. Lowered to 15 chars so short-cell prose tables are correctly rejected. - **PHP enum registration** — PHP enums (ContentLayer, ElementType, etc.) were registered with `.class()` instead of `.enumeration()`, causing empty case lists. Virtual properties on ExtractionResult and ArchiveEntry now declared via builder modifiers for reflection visibility. - **Go macOS FFI linking** — monorepo dev build (`ffi_dev.go`) was missing `-framework Foundation` in CGO LDFLAGS, causing linker failures on macOS with CoreML-enabled ONNX Runtime. - **Unified WASM e2e tests** — replaced broken separate Deno/Workers e2e generators with a single vitest-based WASM generator. ORT-dependent features (embeddings, layout, paddle-ocr) gracefully skip. 
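The PDF bridge tracing fix (#635) in this section replaces byte indexing with `.chars().take()`, and the arXiv watermark fix (#663) under [Unreleased] addresses the same class of panic with `floor_char_boundary`. The sketch below shows both approaches using only stable standard-library calls; the helper names `preview` and `truncate_at_boundary` are hypothetical, and the second function is a hand-written approximation of what `floor_char_boundary` does rather than the code used in the project.

```rust
/// Returns at most `max_chars` characters of `text`. Iterating over chars
/// can never split a multi-byte UTF-8 sequence, unlike `&text[..n]`.
fn preview(text: &str, max_chars: usize) -> String {
    text.chars().take(max_chars).collect()
}

/// Byte-budget variant: backs off to the nearest char boundary at or below
/// `max_bytes`, so slicing is always valid.
fn truncate_at_boundary(text: &str, max_bytes: usize) -> &str {
    if max_bytes >= text.len() {
        return text;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}
```

With the input `"a•b"` (where `•` occupies three bytes), a byte limit of 2 falls inside the bullet character; `truncate_at_boundary` backs off to `"a"` instead of panicking.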
- **WASM Rayon thread pool panic** — Rayon's `par_iter()` / `into_par_iter()` and `ThreadPoolBuilder::build_global()` panicked in WASM (`RuntimeError: unreachable`) because WASM has no threading support. All Rayon usages now fall back to sequential iteration on `wasm32` target. - **PHP virtual property reflection** — `ClassBuilder::property()` declarations for `__get`-backed fields (metadata, chunks, document, etc.) shadowed the magic method, returning null. Replaced with getter methods that don't interfere with `__get`. Parity test updated to check both `hasProperty()` and getter methods. --- ## [4.7.0] - 2026-03-30 ### Added - **Semantic chunk labeling** (#600): Chunks now include a `chunk_type` field identifying the semantic nature of the content (e.g., `paragraph`, `heading`, `list_item`, `table_cell`, `code_block`). Supported across all 11 language bindings with updated E2E test parity. - **Unified InternalDocument architecture**: All extractors now return a canonical `InternalDocument` with typed elements, relationships, images, and tables. Replaces format-specific intermediate representations. - **Unified rendering layer**: New `new_markdown.rs` renderer produces CommonMark from `InternalDocument`, supporting headings, lists, tables, code blocks, formulas, footnotes, images, and inline annotations (bold, italic, links). - **PDF structure pipeline**: Full rewrite of PDF extraction using `page.text().all()` for clean text, char-indexed font metadata for heading/bold detection, segment-based paragraph gap detection, and pdfium segment bounding boxes for precise paragraph regions. - **Image extraction across 8 formats**: Embedded images now extracted as `ExtractedImage` with binary data, format, dimensions, and alt text. Supported for DOCX, PPTX, PDF, EPUB, ODT, HTML (data URIs), RTF (hex-decoded), and Markdown/MDX/Jupyter. Markdown output renders as `![alt](image_N.ext)` with binary data in `ExtractionResult.images`. 
- **Recursive OCR on embedded images**: When OCR is configured, extracted images from EPUB, ODT, HTML, and RTF are processed through `process_images_with_ocr()`, producing nested `ExtractionResult` in `ExtractedImage.ocr_result`. - **PDF watermark artifact filtering**: Uses pdfium's `/Artifact` content marks (PDF tagged content spec) to identify and filter watermark text from output. - **Vertical table header reconstruction**: Detects and fixes rotated column headers in PDF tables where pdfium extracts characters as spaced single characters in reverse order (e.g., "y t i r o h t u A o N" → "NoAuthority"). - **Position-based page furniture detection**: Cross-page repeating text detection now uses actual page margins (top/bottom 10%) and page heights instead of word-count heuristics. - **html-to-markdown v3 migration**: Switched to html-to-markdown v3 with unified `convert()` API returning `ConversionResult` (content, metadata, tables, images, document structure in a single call). Uses visitor-based table collection. hOCR module vendored as `table_core`. - **Markdown ground truth for 336 documents**: Pandoc-generated GT across 10 formats (DOCX, HTML, RTF, PPTX, EPUB, ODT, XLSX, XLS, CSV, DOC) for structural quality benchmarking. All 371 markdown GT files cleaned of HTML remnants (415 tables converted to GFM pipe tables, 28 inline tags fixed). - **Multi-format benchmark support**: Pipeline benchmark now scores all document formats (not just PDF), shows file type per document, replaces NaN with "—", and reports ground truth loading errors. - **Comprehensive PDF pipeline tracing**: Trace-level logging across heading lifecycle (layout overrides, demotion passes, furniture detection, render layer) for debugging. - **Pages API for PDF extraction**: Per-page content now properly wired through the extraction pipeline via `prebuilt_pages` on `InternalDocument`, making `result.pages` available for PDF documents. 
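The vertical table header entry above gives the example "y t i r o h t u A o N" → "NoAuthority". The following sketch covers only that final string step, assuming a rotated header has already been detected; `unrotate_header` is a hypothetical name, and the detection heuristics themselves are not described in this changelog.

```rust
/// Reverses the character order and drops the whitespace that separates
/// the individually extracted characters.
fn unrotate_header(spaced: &str) -> String {
    spaced
        .chars()
        .rev()
        .filter(|c| !c.is_whitespace())
        .collect()
}
```

Because every space is removed, this simplified version cannot recover word boundaries in multi-word headers; it reproduces the single example from the entry and nothing more.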
- **TOON wire format**: Token-Oriented Object Notation support across CLI (`--format toon`), API (`Accept: application/toon`), MCP (`response_format: "toon"`), and all 11 language bindings (Python, Node.js, WASM, C FFI, PHP, Ruby, Elixir, Go, Java, C#, R). TOON is a token-efficient alternative to JSON for LLM prompts — losslessly convertible to/from JSON but uses ~30-50% fewer tokens. Core functions `serialize_to_toon()` and `serialize_to_json()` exposed as public API. - **Renderer registry**: Trait-based `Renderer` and `RendererRegistry` for custom output format plugins. Built-in renderers (markdown, HTML, djot, plain) registered at startup. External crates can register custom renderers (e.g., DOCX output) via `register_renderer()`. - **comrak-based rendering**: Markdown and HTML rendering now uses comrak AST bridge instead of hand-rolled string building. Produces GFM-compliant markdown and semantic HTML5. Paragraph consolidation merges consecutive same-format paragraphs at sentence boundaries (fixes DOCX CV fragmentation where each visual line was a separate `*...*` italic block). - **Benchmark quality scoring improvements**: Content normalization for HTML blocks in markdown scoring, Image↔Paragraph and Table↔ListItem type compatibility, `correct` field in `QualityMetrics`, HTML detection in ground truth validation. - **Benchmark harness overhaul**: Per-format SF1/TF1 aggregation, noise detection (10 heuristics for HTML remnants, garbled text, broken tables, page artifacts), diagnostic diff mode (`--diagnose`), JSON output (`--json-output`), ground truth validation subcommand (`validate-gt`). Comprehensive tracing across all extractors and the rendering layer. - **Markdown ground truth for 23 formats**: 350+ benchmark fixtures across CSV, DOCX, HTML, EPUB, LaTeX, RST, RTF, PPTX, ODT, XLSX, XLS, OPML, ORG, JATS, IPYNB, FictionBook, DocBook, Typst, DOC, PPT, and more. GT generated via pandoc and verified against source documents. 
- **OpenWebUI integration**: Kreuzberg serves as a document extraction backend for Open WebUI chat interfaces. - **URI extraction**: New `Uri` type with `UriKind` classification (Hyperlink, Image, Anchor, Citation, Reference, Email) extracted from 20+ document formats. URIs are always-on, deduplicated by (url, kind) pair, and capped at 100k per document. Available in `ExtractionResult.uris`. - **Recursive email attachment extraction**: EML/MSG/PST attachments are now recursively extracted as `ArchiveEntry` children using the same pattern as archive extractors. Nested `message/rfc822` parts are also extracted as children. Respects `max_archive_depth`. - **PDF embedded file extraction**: PDF file attachments (portfolios) are now recursively extracted as `ArchiveEntry` children via lopdf. Includes filename sanitization, decompression size limits, and name tree depth guards. - **PDF bookmark/outline extraction**: Document outlines (bookmarks) extracted as URIs — page destinations as `UriKind::Anchor`, external links as `UriKind::Hyperlink`. - **DOCX/PPTX embedded object extraction**: OLE objects and embedded files from `word/embeddings/` and `ppt/embeddings/` directories are now recursively extracted as children. - **PPTX hyperlink extraction**: Hyperlinks from slide XML (hyperlink elements in run properties) are now resolved via relationship files and extracted as URIs. - **Image path resolution for markup formats**: When using `extract_file()`, relative image paths in Markdown, MDX, LaTeX, RST, OrgMode, Typst, Djot, and DocBook are resolved from the filesystem and extracted as `ExtractedImage` data. OS-agnostic with path traversal prevention. - **Unified image OCR pipeline stage**: Image OCR moved from per-extractor calls to a single pipeline stage after derivation. All extracted images (including path-resolved markup images) are now OCR'd uniformly when OCR is configured. Concurrency limited to 8 concurrent tasks.
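The URI extraction entry above states that URIs are deduplicated by (url, kind) pair and capped per document. One way this could look is sketched below. The variant names come from the entry itself; the struct layout, the derive lists, and the function `dedup_uris` are assumptions made for this example and may differ from the real `Uri` type.

```rust
use std::collections::HashSet;

// Variant names are taken from the changelog entry; everything else
// in these type definitions is assumed for the sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum UriKind {
    Hyperlink,
    Image,
    Anchor,
    Citation,
    Reference,
    Email,
}

#[derive(Clone, Debug, PartialEq, Eq)]
struct Uri {
    url: String,
    kind: UriKind,
}

/// Keeps the first occurrence of each (url, kind) pair in input order and
/// stops once `cap` unique entries have been collected.
fn dedup_uris(input: Vec<Uri>, cap: usize) -> Vec<Uri> {
    let mut seen: HashSet<(String, UriKind)> = HashSet::new();
    let mut out = Vec::new();
    for uri in input {
        if out.len() >= cap {
            break;
        }
        // insert() returns false when the pair was already present.
        if seen.insert((uri.url.clone(), uri.kind)) {
            out.push(uri);
        }
    }
    out
}
```

Keying on the pair rather than on the URL alone means the same address can legitimately appear once as a `Hyperlink` and once as an `Image`.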
- **FictionBook image and link extraction**: Base64-encoded embedded images and inline hyperlinks are now extracted from FB2 documents. - **Apple iWork extractor improvements**: Numbers outputs tables instead of paragraphs, Keynote has improved slide structure, Pages has heading detection. All three extract metadata from ZIP plist. - **`code_intelligence` field on ExtractionResult**: Top-level access to tree-sitter `ProcessResult` with full structure, imports, exports, chunks, symbols, diagnostics, and docstrings. Previously only available inside `FormatMetadata::Code` metadata. - **`CodeContentMode` config**: Controls the code extraction content mode: `chunks` (semantic TSLP chunks, default), `raw` (source as-is), `structure` (headings + docstrings only). Configured via `TreeSitterProcessConfig.content_mode`. - **TSLP semantic chunking for code**: Code files bypass the text-splitter entirely. TSLP's `CodeChunks` (function/class-aware) map directly to kreuzberg `Chunk`s with semantic types and heading context. - **Cross-format output parity tests**: 36 tests verifying Markdown, HTML, Djot, and Plain produce equivalent text content. GFM lint validation, bracket escaping checks, structural block comparison. - **HTML input markdown passthrough**: HTML files extracted as Markdown now use html-to-markdown output directly via `pre_rendered_content`, bypassing the lossy InternalDocument to comrak round-trip.
### Code Intelligence - **Tree-sitter integration** for 248 programming languages via [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) - Extract functions, classes, imports, exports, symbols, docstrings, diagnostics - Syntax-aware code chunking - Language detection from file extension and shebang - Dynamic grammar download (native) / 30-language static subset (WASM) - New `tree-sitter` and `tree-sitter-wasm` feature flags (included in `full` and `wasm-target`) - `TreeSitterConfig` and `TreeSitterProcessConfig` in `ExtractionConfig` - Re-exported TSLP types (`ProcessResult`, `StructureItem`, `FileMetrics`, etc.) - [TSLP documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev) ### Typed Metadata - New `FormatMetadata` variants: `Code`, `Csv`, `Bibtex`, `Citation`, `FictionBook`, `Dbf`, `Jats`, `Epub`, `Pst` - Extended `PptxMetadata` with `image_count` and `table_count` - Migrated deprecated `metadata.additional` writes to typed fields across all extractors - Strong types for all new metadata variants across all 11 language bindings ### Breaking Changes - **Layout detection preset removed**: The `preset` field on `LayoutDetectionConfig` has been removed across all bindings. Layout detection now uses the RT-DETR v2 model unconditionally — no "fast" vs "accurate" distinction. The `--layout-preset` CLI flag is removed. Old configs with `"preset": "..."` are silently ignored for backward compatibility. - **Table model config typed**: `table_model` on `LayoutDetectionConfig` changed from `Option` to a `TableModel` enum (`tatr`, `slanet_wired`, `slanet_wireless`, `slanet_plus`, `slanet_auto`, `disabled`). Defaults to `tatr`. String values still accepted in JSON/TOML configs. ### Fixed - **PDF table rendering**: Populate `Table.cells` from TATR/SLANeXT grid so comrak renders proper Table nodes instead of wrapping markdown in a Paragraph. Table SF1 improved from 15.5% to 53.7%. 
- **Markdown GFM quality**: Enable `prefer_fenced` for code blocks, un-escape brackets/parens (`\[` to `[`), fix code block language spacing in djot. - **Semantic HTML output**: Enable `github_pre_lang` and `full_info_string` for code blocks with `class="language-X"`. - **Djot text normalization**: Shared `normalize_inline_text()` for consistent whitespace handling. MD-to-Djot TF1 now 1.0000. - **PDF structural extraction quality**: Improved heading detection (font-size-ratio H2/H3 differentiation, section numbering patterns, ALL-CAPS detection, paragraph-to-heading rescue pass), table discrimination (reject multi-column prose misclassified as tables via flow-through detection, row-count/column-count ratio, and table quality validation), list detection (multi-token prefix patterns), image scoring (normalize image block matching), and formula detection (math character density heuristic). Layout SF1 improved from 40.7% to 43.7% across 157 verified PDF fixtures. - **PDF ground truth verified**: All 157 PDF benchmark fixtures verified using vision (rendered page images vs GT markdown). 7 broken Mistral OCR GTs with hallucinated content replaced with vision-verified markdown. - **LaTeX extraction**: Convert `\href`, `\emph`, `\textbf`, `\textgreater`, `\verb`, `\sout`, blockquotes, lists, special characters, and typographic ligatures to markdown. - **XLSX/XLS sheet name headings**: Emit `## SheetName` heading before each sheet's table, matching pandoc convention. - **OPML outline headings**: All outline nodes now emit headings at appropriate depth, not just parent outlines. Inline HTML in text attributes converted to markdown. - **IPYNB heading detection**: Markdown cells now detect ATX headings and emit proper heading elements. Code cell outputs (stdout, execute_result) included in extraction. - **JATS abstract and references**: Abstract section with sub-headings now included. References rendered as numbered list with structured citation formatting. 
- **ODT formula extraction**: Embedded MathML formula objects extracted as formula content instead of empty image placeholders. Image alt text and captions now extracted from `draw:frame` elements. - **PPTX slide titles**: Title placeholders detected via OOXML placeholder type and emitted as H2 headings. Bulleted/numbered lists in slides extracted with proper ListStart/ListEnd wrapping. - **ORG source blocks**: `#+BEGIN_SRC` blocks converted to fenced code blocks with language annotation. `#+BEGIN_EXAMPLE` blocks converted to unfenced code blocks. Inline code `~text~` converted to backtick spans. Paragraph line wrapping joined. - **RST heading levels**: Overline+underline document titles assigned H1. Code block language hints preserved from `.. highlight::` and `.. code::` directives. `::` literal block shorthand handled. - **RTF formatting**: Bold/italic/strikethrough formatting now uses exact byte offsets from a unified text+formatting extraction pass, eliminating bold bleeding across paragraphs. Hidden text (`\v`) suppressed. Hyperlink field parsing fixed. Strikethrough support added. Table row rendering fixed for multi-row tables. Ordered list detection from `\listtext` markers. - **HTML preprocessing**: Navigation elements, forms, and sidebars now stripped by default. Previously disabled, causing page chrome to appear in extraction output. - **PDF table detection**: Reject false table detections where >70% of cells contain single-word fragments (justified prose incorrectly classified as multi-column table). - **DocBook root element handling**: XML fragments without a root element now wrapped automatically, fixing extraction of multi-element DocBook files. - **FictionBook poem support**: Verse lines, subtitles, text-author, and date elements within poem blocks now extracted. Heading levels aligned with pandoc conventions.
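The PDF table detection entry above rejects candidate tables in which more than 70% of cells hold single-word fragments. A minimal sketch of such a check is shown here, assuming cells arrive as plain strings; the function name `looks_like_justified_prose` and the choice to ignore empty cells are assumptions, not details confirmed by the changelog.

```rust
/// Returns true when more than `threshold` (for example 0.7) of the
/// non-empty cells contain exactly one whitespace-delimited word.
fn looks_like_justified_prose(cells: &[&str], threshold: f64) -> bool {
    // Ignoring empty cells is an assumption of this sketch.
    let non_empty: Vec<&str> = cells
        .iter()
        .copied()
        .filter(|c| !c.trim().is_empty())
        .collect();
    if non_empty.is_empty() {
        return false;
    }
    let single_word = non_empty
        .iter()
        .filter(|c| c.split_whitespace().count() == 1)
        .count();
    (single_word as f64) / (non_empty.len() as f64) > threshold
}
```

For the cells `["the", "quick", "brown", "fox jumps"]` the single-word ratio is 3/4 = 0.75, which exceeds 0.7, so the candidate would be rejected as prose.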
- **PDF image FlateDecode fallback**: When `decode_flate_to_png()` fails for FlateDecode, CCITT, or JBIG2 streams, images are now re-extracted via pdfium's bitmap rendering pipeline, producing valid PNG output instead of unusable raw bytes (#615). - **Metadata standardization**: Metadata from PPTX, Excel, ODT, RST, OrgMode, Typst, RTF, JATS, DOC, PPT, HTML, Email, BibTeX, and Citation extractors now mapped to standard `Metadata` struct fields (title, authors, dates, keywords, language) instead of only `additional` map. - **MDX link parity with Markdown**: Links and annotations in headings and list items now extracted (was silently dropped). - **RST hyperlink extraction**: Inline hyperlinks (`` `text `_ ``) and reference targets now extracted. - **LaTeX `\url{}` extraction**: `\url{...}` commands now extracted as URIs alongside `\href`. - **OrgMode image detection**: Added .webp, .bmp, .tiff, .avif to recognized image extensions. - **BibTeX URI classification**: URL fields now correctly classified as Hyperlink (was Citation). Entry title used as label instead of BibTeX key. - **JATS title field**: Article title now stored in `metadata.title` (was only in `subject`). - **PDF bookmark stack safety**: Sibling traversal converted from recursion to iterative loop preventing stack overflow on wide outlines. - **PDF embedded file security**: Filename sanitization (strip directory components), decompressed size limit (50MB), name tree depth limit (50 levels). - **Tesseract C++ exception crash** (#606): Fixed fatal runtime error where C++ exceptions from Tesseract unwound through Rust FFI frames, triggering `std::terminate()`. Now compiles Tesseract with `-fno-exceptions` on macOS, Linux, and MinGW. The Tesseract CLI executable target (which uses `try`/`catch`) is patched out of CMakeLists.txt at build time since only the library is needed. - **ExtractionConfig rejects unknown fields**: `#[serde(deny_unknown_fields)]` added to `ExtractionConfig`. 
Previously, typos or invalid fields (e.g., `layout_analysis` instead of `layout`) were silently ignored. - **RTF delimiter space consumption**: Fixed space-in-word bug where font encoding directives (`\loch`, `\hich`, `\dbch`) caused spaces mid-word ("H eading" → "Heading"). Root cause: RTF spec requires consuming trailing delimiter space after control words. - **PPTX markdown mode**: Derive plain/markdown mode from `output_format` config instead of hardcoding `plain=true`. Tables now render as markdown tables, lists get bullet markers, text elements get newline separation. - **EPUB test compilation**: Added `InternalDocument::content()` method and fixed `epub_spine_semantics_tests` to use it instead of removed `.content` field. - **HTML extraction rewrite**: Replaced ~400-line manual HTML tag parser with html-to-markdown v3's `DocumentStructure` mapping. Single-pass conversion eliminates CSS/script content leakage and `[image: X]` placeholder artifacts. - **Chunking heading context with plain output**: Fixed `heading_context` always returning `None` when using plain text output format. The markdown chunker now receives the original markdown for heading map building even when content is rendered as plain text. - **WASM build compatibility**: Inlined workspace-inherited fields (`version`, `edition`, `authors`) in kreuzberg-wasm Cargo.toml because wasm-pack 0.14.0 cannot resolve `field.workspace = true` references. - **Pre-commit hooks**: Fixed rumdl hook config (use `rumdl-fmt` from official repo), wasm build (feature-gate layout config access), kreuzberg-node build (missing `formatted_content` field), broken relative links in READMEs and CHANGELOG. - **Binding compilation**: Added missing `formatted_content` field to kreuzberg-py and kreuzberg-php binding crates. - **PDF heading body_size_guard**: Narrowed guard range from `≤ body+0.5` to `body±1.5pt` so headings well below body font size (e.g., 8pt in 12pt body) pass through. 
- **RTF table extraction**: Fixed critical bug where table cell content was written to both result string and TableState, causing cells to appear as individual lines instead of proper markdown tables. - **DOCX merged cells**: Repeat content across gridSpan (horizontal) and vMerge (vertical) spans. Added `source_path` field to `ExtractedImage` for DOCX image relationship paths. - **DOCX formatting**: Merge adjacent runs with identical formatting to prevent spurious `****` sequences. Strip `` underline HTML tags. - **Python wheel `__isoc23_strtoll` error on older Linux distributions** (#588): Downgraded the Linux build environment `manylinux` target from `manylinux_2_39` to `manylinux_2_28` for pre-compiled Python wheels to ensure compatibility with systems using glibc versions prior to 2.39 (e.g., Ubuntu 20.04/22.04, Debian 11/12). - **`clear_ocr_backends` now fully clears the registry**: Calls `shutdown_all()` instead of `reset_to_defaults()`, so the backend list is empty after clearing as expected by the API contract. - **Go macOS link failure**: Added missing `-framework Foundation` to CGO LDFLAGS. ORT's CoreML provider uses Foundation for NSLog/NSFileManager, causing undefined symbol errors on macOS. - **Tesseract Windows MinGW build (Elixir/Go/C FFI publish)**: CMake resolved bare `g++` to MSVC `cl.exe` on CI runners with both toolchains. Added `resolve_mingw_compiler()` to find absolute paths from MSYS2 subsystem dirs. Bumped Tesseract cache key to invalidate stale MSVC-compiled artifacts. - **Windows GNU ORT linking**: `bundled` strategy on Windows GNU now uses dynamic linking with pre-downloaded Microsoft ORT (pyke.io has no static binaries for `x86_64-pc-windows-gnu`). Documented ONNX Runtime DLL requirement for Go, Elixir, and C/C++ on Windows. ### Changed - **PDF text extraction**: Full rewrite from segment-indexed assembly to `page.text().all()` + char-indexed font metadata. Produces cleaner text with correct word spacing. 
- **hOCR table reconstruction vendored**: `HocrWord`, `reconstruct_table`, `table_to_markdown` moved from `html-to-markdown-rs::hocr` to `kreuzberg::table_core` module. - **CLI format flags**: `--format` (`-f`) now supports `text`, `json`, and `toon` wire formats. `--output-format` renamed to `--content-format` (deprecated alias kept with warning). `OutputFormat` enum gains `Custom(String)` variant for extensible format plugins. - **html-to-markdown-rs v3.0.0**: Switched from git dependency to crates.io release. - **License policy**: MPL-2.0 and LGPL-2.1 no longer globally allowed — pinned to specific crate exceptions (cbindgen, option-ext, r-efi). Unicode-DFS-2016 allowed for comrak dependency. ### Removed - **`max_upload_mb` server config field**: Use `max_multipart_field_bytes` (in bytes) instead. The `KREUZBERG_MAX_UPLOAD_SIZE_MB` environment variable is also removed — use `KREUZBERG_MAX_MULTIPART_FIELD_BYTES`. - **`metadata.additional` legacy insertions**: Pipeline features (chunking, embeddings, language detection, keywords) no longer insert error/status keys into `metadata.additional`. Errors are available via `processing_warnings`. Keywords are in `extracted_keywords`. Embedding status is derivable from chunk embeddings. - **`derive_content_string` function**: Replaced by `render_plain()` in the rendering module. --- ## [4.6.3] - 2026-03-27 ### Added - **Tower service layer** (`service` module): Composable `ExtractionService` implementing `tower::Service` with configurable middleware layers (tracing, metrics, timeout, concurrency limit). New `tower-service` feature flag, auto-enabled by `api` and `mcp`. `ExtractionServiceBuilder` provides ergonomic layer composition. - **Semantic OpenTelemetry conventions** (`telemetry` module): Formal `kreuzberg.*` attribute namespace with 30+ span attributes, metric names, and operation/stage constants. Documented conventions for document extraction, pipeline stages, OCR, and model inference telemetry. 
- **Extraction metrics**: 11 OTel metric instruments (counters, histograms, gauge) covering extraction totals, durations, cache hits/misses, pipeline stages, OCR, and concurrent extractions. Feature-gated behind `otel`. - **InstrumentedExtractor wrapper**: Automatic per-extractor tracing spans and metrics without per-extractor annotations. Injected at registry dispatch when `otel` feature is enabled. ### Improved - **Deeper instrumentation**: Pipeline post-processing stages (Early/Middle/Late), individual processor execution, OCR operations, and RT-DETR layout model inference now have semantic spans and duration metrics. - **API and MCP servers use ExtractionService**: Both consumers now route extractions through the Tower service stack, getting unified tracing, metrics, and middleware for free. - **Unified config merge**: JSON config merge logic deduplicated between CLI and MCP into a shared function. - **API server hardening**: Added response compression (gzip/brotli/zstd), panic recovery, request-ID correlation, and sensitive header redaction via tower-http middleware. ### Changed - **Removed per-extractor `#[instrument]` annotations**: 29 manual `#[cfg_attr(feature = "otel", tracing::instrument(...))]` annotations replaced by the automatic `InstrumentedExtractor` wrapper. - **Span attribute names migrated to `kreuzberg.*` namespace**: `extraction.filename` -> `kreuzberg.document.filename`, `extraction.mime_type` -> `kreuzberg.document.mime_type`, etc. ### Fixed - **EPUB spine semantics refactor** (#594): Richer OPF package model preserves manifest fallback chains, guide references, and non-linear spine items. Navigation chrome stripped from output. Malformed guide references now produce warnings instead of hard failures. Tested for fallback cycles and empty spines. - **DOCX image extraction for `` with child elements** (#591): Images with high-quality settings (containing `` children) were not extracted because only `Event::Empty` was handled. 
Now also handles `Event::Start` for ``. - **OCR table extraction returned empty results via pipeline path** (#593): Layout detection was gated behind a `needs_structured` check, skipping it for the default `Plain` output format. Tables from `run_ocr_pipeline` were discarded. Both paths now propagate tables correctly. - **Missing `chunker_type` field in bindings** (#592): Exposed `chunker_type`, `sizing_cache_dir`, and `prepend_heading_context` fields across Python, TypeScript/WASM, Go, C#, PHP bindings. - **Full API parity across all 10 bindings**: Added `max_archive_depth` to all bindings. Added missing `acceleration`, `email` to Ruby/R. Added `layout` to PHP. Added 7 missing fields to WASM. Fixed parity script regex for Go slice types. - **`test_pipeline_with_all_features` assertion without `quality` feature**: `quality_score` assertion now gated behind `#[cfg(feature = "quality")]`. - **Node Windows publish failure**: Prepare script fallback used bash-specific `mkdir -p` and `echo >` which fail on Windows. Replaced with cross-platform `node -e` fallback. - **CI Validate path triggers too narrow**: Broadened glob patterns to cover `docs/**`, `biome.json`, `.task/**`, and other lintable paths that prek hooks check. - **Publish pipeline ORT bundling**: Added configurable `strategy` input (`system`/`bundled`) to `setup-onnx-runtime` action. Set `strategy: bundled` for all publish jobs so `ort-bundled` cargo feature takes effect, producing self-contained binaries. --- ## [4.6.2] - 2026-03-26 ### Added - **PDF page rendering API** (#583): New `render_pdf_page` function and `PdfPageIterator` for rendering individual PDF pages as PNG images. Available across all 11 language bindings with idiomatic patterns (Python context manager, Go Close(), Java AutoCloseable, C# IDisposable, Elixir Stream, etc.). Default 150 DPI, configurable per call. 
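
To give a sense of what the 150 DPI default means for output size, the sketch below converts a page size in PDF points (1 pt = 1/72 inch) into pixel dimensions. `page_pixels` is a hypothetical helper written for illustration and is not part of the `render_pdf_page` API; it assumes the renderer scales linearly with DPI and rounds to the nearest pixel.

```rust
/// Convert a page size in PDF points (1 pt = 1/72 inch) into pixel
/// dimensions at the given DPI.
///
/// Hypothetical helper for illustration only: it assumes linear scaling
/// and rounding to the nearest whole pixel.
pub fn page_pixels(width_pt: f64, height_pt: f64, dpi: f64) -> (u32, u32) {
    // One point is 1/72 inch, so pixels = points * dpi / 72.
    let scale = dpi / 72.0;
    (
        (width_pt * scale).round() as u32,
        (height_pt * scale).round() as u32,
    )
}
```

Under these assumptions a US Letter page (612 × 792 pt) comes out at 1275 × 1650 px at 150 DPI, and doubling the DPI doubles both dimensions.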
### Fixed - **Table recognition coordinate mismatch on scanned PDFs** (#582): Layout detection bboxes (640x640 model space) are now scaled to OCR render resolution before TATR table recognition. Previously, coordinate space mismatch caused zero tables to be found. - **OCR elements report `page_number: 1` for all pages** (#582): Tesseract resets page numbers per single-page render. Page numbers are now correctly stamped after OCR in the batch loop. - **Rust E2E tests missing PDF feature**: Added `pdf` feature to the e2e-generator Rust template, fixing 41 `UnsupportedFormat("application/pdf")` failures. - **HWP styled extraction empty on ARM**: Added `skip_on_platform` support to Python and Java e2e generators, skipping the `hwp_styled` fixture on `aarch64-unknown-linux-gnu`. - **WASM CI build failure**: Made `kreuzberg-node` prepare script resilient to missing native addon, preventing `ENOENT: dist/cli.js` during pnpm workspace install. - **Go C header stale at 4.5.0**: Synced header and `DefaultVersion` constant to match current version. - **Ruby gem missing ONNX Runtime**: Added `ort-bundled` feature to Ruby native Cargo.toml. - **Elixir doctest failures**: Updated `ExtractionConfig.to_map/1` doctests for `force_ocr_pages` field. - **WASM benchmark timeout**: Reduced per-extraction timeout from 600s to 120s and job timeout from 6h to 2h. ### Improved - **`version:sync` now syncs Go C header, DefaultVersion, and Docker compose tags**: Prevents version drift across language bindings. - **Publish pipeline commits Elixir NIF checksums back to main**: Prevents stale checksums after releases. - **WASM test app migrated to Deno**: Replaced Node.js/vitest with Deno test runner, fixing `fetch()` unavailability. - **Docs migrated from MkDocs to Zensical**: 4-5x faster incremental builds. 
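
The coordinate fix for scanned PDFs (#582) comes down to rescaling each layout box from model space into the OCR render space before it is compared with word positions. A minimal sketch follows; `BBox` and `scale_bbox` are hypothetical names, and the code assumes the page was stretched independently on each axis to fit the model input (no letterbox padding), which may differ from the actual implementation.

```rust
/// Axis-aligned bounding box in pixel coordinates.
/// Hypothetical type used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x1: f32,
    pub y1: f32,
    pub x2: f32,
    pub y2: f32,
}

/// Rescale a box from layout-model space (e.g. 640x640) to the resolution
/// of the OCR-rendered page image.
///
/// Assumes a plain per-axis stretch with no padding offset.
pub fn scale_bbox(b: BBox, model: (f32, f32), render: (f32, f32)) -> BBox {
    let sx = render.0 / model.0; // horizontal scale factor
    let sy = render.1 / model.1; // vertical scale factor
    BBox {
        x1: b.x1 * sx,
        y1: b.y1 * sy,
        x2: b.x2 * sx,
        y2: b.y2 * sy,
    }
}
```

Without this step, a box detected at 640 × 640 covers only a small corner of a 2480 × 3508 px render, which is why it never overlapped the OCR words it was meant to contain.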
--- ## [4.6.1] - 2026-03-25 ### Added - **Per-file batch extraction timeouts** (#546): New `extraction_timeout_secs` on `ExtractionConfig` (batch-level default) and `timeout_secs` on `FileExtractionConfig` (per-file override). Timeouts apply after semaphore acquisition. New `KreuzbergError::Timeout` variant with `elapsed_ms` and `limit_ms` fields. All binding layers updated. - **Page-level OCR overrides** (#432): New `force_ocr_pages` option (1-indexed) on both `ExtractionConfig` and `FileExtractionConfig`. Enables selective OCR on specific pages of mixed-quality PDFs while preserving native text on others. - **PST extraction support** (#502): Extract emails from Microsoft Outlook PST archives via the `outlook-pst` crate. Iterative depth-first folder traversal with depth cap of 50. Feature-gated under `email`. - **JSONL/NDJSON extraction** (#575): Native `.jsonl`/`.ndjson` extraction via `StructuredExtractor`. Registered as `application/x-ndjson` MIME type. ### Fixed - **OCR elements now propagated to ExtractionResult** (#566): OCR elements with geometry data are collected during extraction and set on `ExtractionResult.ocr_elements`. Hierarchy transformer emits body-level blocks as `NarrativeText` elements with coordinates. OpenAPI schema registers OCR-related types. - **OOM crash on multi-page scanned PDFs** (#570): Replaced pre-rendering all PDF pages into memory with batched rendering. Pages are now rendered and OCR'd in bounded batches, capping peak memory to `batch_size * page` instead of `page_count * page`. - **OCR memory usage reduced 60-78%**: Restructured the OCR batch rendering loop to render-and-encode one page at a time instead of holding all decoded RGB buffers simultaneously. A 98-page scanned PDF dropped from 4.6GB to 1.9GB peak RSS (batch_size=4), and from 3.3GB to 713MB (batch_size=1). Batch size now adapts to available system memory on Linux and macOS. 
- **PDF control character encoding artifacts**: PDFs with broken ToUnicode font mappings that produce U+0002 (STX) and other control characters where hyphens should appear now have these replaced with hyphens when between word characters, or stripped otherwise. Fixes garbled output like `re\x02labelling` → `re-labelling`. - **DocumentStructure missing Heading nodes for PDFs**: `push_heading_group` now inserts a `Heading` child inside each `Group` node (matching DOCX builder behavior). Fallback `add_paragraphs` now detects markdown heading markers and creates heading groups instead of flat paragraphs. - **Layout detection returns empty tables on scanned PDFs** (#574): Three independent bugs caused `result.tables` to always be `[]` for scanned/image-based PDFs: (1) layout detection was gated behind a `needs_structured` output-format check, silently skipping detection for `Plain` (the default); (2) TATR-recognized tables in the OCR path were inlined as markdown text but never converted to `Table` structs; (3) `run_ocr_with_layout` returned only text, discarding table data. All three paths now propagate tables correctly. - **Table recognition coordinate mismatch on scanned PDFs** (#582): Layout detection operates at 640×640 pixels but TATR table recognition and layout-hint classification consumed those coordinates verbatim against OCR-rendered images (e.g. 2480×3508 px at 300 DPI). Bounding boxes never overlapped OCR word positions, producing zero recognized tables and incorrect paragraph-class overrides. Bounding boxes are now scaled from layout-model resolution to the actual OCR render resolution before both `recognize_page_tables` and `detection_to_layout_hints` are called. - **OCR elements report `page_number: 1` for all pages** (#582): The Tesseract backend resets `page_number` to 1 for every single-page render. The page-number is now stamped with the correct 1-indexed page index after collecting each batch page's OCR elements. 
- **PDF layout engine panic on malformed input** (#544): Replaced the panicking `.expect()` inside the thread-local `LayoutEngine` initializer in `layout_runner.rs` with proper `Result`-based error propagation. A failure to initialise the layout engine now returns a descriptive error instead of crashing the host process via FFI (Python, Node, etc.). --- ## [4.6.0] - 2026-03-24 ### Added - **Recursive archive extraction**: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own `ExtractionResult` including `DocumentStructure`, annotations, and metadata. New `ArchiveEntry` type with path, mime type, and nested result. Configurable via `max_archive_depth` (default: 3, set to 0 for legacy single-text behavior). - **YAML/JSON section chunker**: New `ChunkerType::Yaml` variant that splits structured files by keys with full hierarchy paths (e.g., `database > primary > host`). Auto-inferred from extraction metadata — no explicit `chunker_type` needed for YAML/JSON files. - **Unified DocumentStructure DTO**: Extended the `DocumentStructure` model with 7 new node types (`Slide`, `DefinitionList`, `DefinitionItem`, `Citation`, `Admonition`, `RawBlock`, `MetadataBlock`), 4 new annotation kinds (`Highlight`, `Color`, `FontSize`, `Custom`), and format-specific `attributes` bag on every node. - **DocumentStructureBuilder**: Ergonomic builder with heading-driven section nesting, container stack (Quote/Admonition/Slide auto-parenting), and annotation helpers. Replaces hand-constructed `DocumentNode` structs across all extractors. - **Unified rendering module**: `render_to_markdown()` and `render_to_plain()` renderers that walk a `DocumentStructure` tree to produce consistent output with inline annotation rendering, table pipe escaping, and nested list depth support. 
- **DocumentStructure support for all extractors**: Every extractor (35 formats) now natively produces a `DocumentStructure` when `include_document_structure` is enabled: - Office: DOCX (with TextAnnotation from Run formatting, Formula from OMML), PPTX (Slide containers), ODT, DOC, PPT - Markup: HTML (1,100-line tag parser with inline annotations), LaTeX, RST (admonitions, definition lists), OrgMode, Markdown, MDX, Djot, Typst - Books: EPUB (chapter structure from spine), FictionBook (inline formatting annotations) - Scientific: JATS (article structure), DocBook (section hierarchy) - Data: Excel (sheet headings + tables), CSV, DBF, JSON/YAML/TOML, BibTeX (citations), Jupyter (code + markdown cells) - Other: Email (metadata headers), RTF, OPML (outline hierarchy), HWP, iWork (Keynote/Numbers/Pages), XML, Image (OCR text) - **DocBook/JATS inline annotations**: Semantic inline formatting for academic/technical documents — emphasis, bold, code, links, subscript/superscript mapped to `AnnotationKind` variants. - **Document-level OCR**: `OcrBackend` trait supports `process_document()` for whole-file extraction without per-page rasterization. Up to 30% faster on multi-page documents with better context. ### Changed - **CSV extraction for embedding quality**: Produces `Row N: Header: Value` format instead of space-separated when a header row is detected. Programmatic `tables` field unchanged. - **XML extraction for embedding quality**: Indented hierarchical output preserving element tree with attributes inline, blank lines between top-level siblings, and `xmlns:*` filtering. ### Improved - **Zero-copy file I/O**: Automatic memory-mapping for files >1MB via `memmap2` with SIMD-accelerated UTF-8 validation (`simdutf8`). Measurable speed improvement for large PDFs and archives. WASM falls back to heap allocation. - **Unified concurrency management**: Centralized thread budget for Rayon, ONNX, and PaddleOCR with configurable `ConcurrencyConfig`. 
PDF OCR batched in chunks instead of all-at-once, reducing memory footprint on large documents. ### Fixed - **Incorrect page numbers in element-based output** (#557): When `result_format="element_based"` was used without `PageConfig(extract_pages=True)`, all elements received `page_number=1`. Now auto-enables `extract_pages` when element-based output is requested. - **Misleading `PageConfig` docstring** (#558): Updated docstring and type stub to show default constructor first and document interaction with `result_format="element_based"`. - **MSG extraction misses compressed RTF bodies** (#560): Added PR_RTF_COMPRESSED (0x1009) fallback for `.msg` files that store the body only in compressed RTF format. Implements MS-OXRTFCP decompression and RTF-to-plain-text stripping. - **Indexed colour PDF images returned as raw** (#561): Palette-based PDF images now decode correctly. Extracts the colour palette from the PDF dictionary and applies palette lookup to produce valid PNG output instead of unusable raw bytes. - **ODT extraction robustness**: Replaced unwraps with safe fallbacks in ODT parsing. --- ## [4.5.4] - 2026-03-23 ### Added - **Document-level OCR optimization**: The `OcrBackend` trait now supports native `process_document()` for efficient whole-file extraction without rasterizing PDF pages to images when the backend supports it (e.g., Python's EasyOCR backend). ### Changed - **OCR protocol clarity**: Renamed `process_file` to `process_image_file` in the OCR backend trait for clearer protocol semantics. - **Python refactoring**: Removed unused loop variable in EasyOCR implementation. - **Dependency optimization**: Dropped redundant tokio multi-thread feature flag. ### Tests - **Backend registry robustness**: Hardened backend registry tests with drop guards and comprehensive mock coverage. ### Added - **PST (Outlook Personal Folders) extraction**: New `PstExtractor` backed by the `outlook-pst` crate.
Traverses the full IPM folder hierarchy iteratively, extracts subject, sender, recipients (TO/CC/BCC), body, and date from every message in the archive. Enabled via the existing `email` feature flag. MIME type: `application/vnd.ms-outlook-pst`. ### Fixed - **PDF image extraction panic on mismatched buffer lengths** (#552): Replaced `assert!` in `pdf/images.rs` with graceful error handling. Malformed PDF images with wrong buffer sizes are now skipped instead of panicking. Regression from v4.5.0. - **`pdf` feature compilation without `layout-detection`** (#550): `config.layout` reference in `extraction.rs` was not behind a `#[cfg(feature = "layout-detection")]` gate, causing compilation errors when `pdf` was enabled without `layout-detection`. - **Unused `table_model` variable warning**: Fixed cfg-gating in `pipeline.rs` so `table_model` parameter is properly handled when `layout-detection` feature is disabled. - **Clippy `too_many_arguments` on `recognize_tables_slanet`**: Added allow attribute for the 8-parameter function in `table_recognition.rs`. - **Ruby binding missing `table_model` field**: Added `table_model` parsing to `LayoutDetectionConfig` initializer in Ruby native extension. - **WASM module resolution in Supabase/Deno edge functions** (#551): Added explicit `package.json` exports for `pkg/kreuzberg_wasm.js` and WASM binary. Extended `wasm-loader.ts` with Deno detection and clear error messaging for restricted edge runtimes. - **`zip` dependency pinned below 7.4**: Avoids let-chain build failures on some stable Rust toolchains (#549). - **Vendored HWP text extraction**: Replaced external `hwpers` crate with vendored subset (~1,650 lines). Eliminates `zip 2.x` transitive dependency that caused WASM and CI Validate build failures. ### Added - **`prepend_heading_context` chunking option**: When `true` and `chunker_type` is `Markdown`, prepends the heading hierarchy path (e.g. `# Title > ## Section`) to each chunk's content string. 
Useful for RAG pipelines where chunks need self-contained structural context. Available across all 10 language bindings, CLI, and WASM. Includes fixture-driven e2e tests and documentation for all languages. --- ## [4.5.3] - 2026-03-22 ### Added - **Apple iWork Format Support**: Native parsing for modern (2013+) `.pages`, `.numbers`, and `.key` files via a new `iwork` feature flag. Uses zero-allocation protobuf text extraction from Snappy-compressed IWA containers. - **SLANeXT table structure recognition models**: Alternative table structure backends alongside TATR. New `table_model` field on `LayoutDetectionConfig` selects the backend. Options: `"tatr"` (default, 30MB), `"slanet_wired"` (365MB, bordered tables), `"slanet_wireless"` (365MB, borderless tables), `"slanet_plus"` (7.78MB, lightweight), `"slanet_auto"` (classifier-routed, ~737MB). Available across all 12 language bindings and CLI (`--layout-table-model`). - **PP-LCNet table classifier**: Automatic wired/wireless table detection for SLANeXT auto mode. Uses center-crop preprocessing with BGR channel order matching PaddleOCR convention. - **CLI `cache warm --all-table-models`**: Opt-in download of SLANeXT model variants (~730MB). Default warm downloads only RT-DETR + TATR. - **ISO 21111-10 benchmark fixture**: Table-heavy ISO standard document with MinerU ground truth for table extraction benchmarking. --- ## [4.5.2] - 2026-03-21 ### Fixed - **PDF word splitting in extracted text**: Pdfium's text extraction inserted spurious spaces mid-word (e.g. `"s hall a b e active"` instead of `"shall be active"`). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (`font_size × 0.33` threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact. - **Markdown underscore escaping**: Underscores in extracted text (e.g. 
`CTC_ARP_01`) were incorrectly escaped as `CTC\_ARP\_01` throughout the markdown output. Underscore escaping has been removed entirely since extracted PDF text contains literal identifiers, not markdown formatting. - **Page header/footer leakage**: Running headers like `ISO 21111-10:2021(E)` and copyright footers leaked into the document body. Added fuzzy alphanumeric matching to detect repeated header/footer text even when spacing or character extraction varies across pages. - **R batch function spurious NULL argument**: R wrapper batch functions passed an extra `NULL` positional argument to native Rust functions, causing "unused argument" errors on all batch operations. - **Elixir Windows ORT DLL staging**: ONNX Runtime DLL was only staged in `target/release/` but not in `priv/native/` where the BEAM VM loads NIFs. OCR/layout/embedding features now work correctly on Windows CI. ### Added - **General extraction result caching**: All file types (PDF, Office, HTML, archives, etc.) are now cached — not just OCR results. Repeated extractions of the same file with the same config return instantly from cache. - **Cache namespace isolation**: New `cache_namespace` field on `ExtractionConfig` enables multi-tenant cache isolation on shared filesystems. Available via `--cache-namespace` CLI flag and across all language bindings. - **Per-request cache TTL**: New `cache_ttl_secs` field on `ExtractionConfig` overrides the global TTL for individual extractions. Set to `0` to skip cache entirely. Available via `--cache-ttl-secs` CLI flag. - **Cache namespace deletion**: `delete_namespace()` removes all cache entries under a namespace. `get_stats_filtered()` returns per-namespace statistics. - **Multi-worker cleanup safety**: Cache cleanup no longer triggers excessively when multiple worker pods share the same cache directory. - **Bundled eng.traineddata**: English OCR works out of the box with zero runtime configuration (~4MB bundled at build time). 
- **Tessdata in `cache warm`**: `kreuzberg-cli cache warm` now downloads all tessdata_fast language files (~120 languages) to `KREUZBERG_CACHE_DIR/tessdata/`, giving full Tesseract language support without system packages. - **Tessdata in `cache manifest`**: `kreuzberg-cli cache manifest` now includes all tessdata files with source URLs, enabling `--sync-cache` to download tessdata alongside models. - **`KREUZBERG_CACHE_DIR/tessdata` resolution**: `resolve_tessdata_path()` now checks `KREUZBERG_CACHE_DIR/tessdata` and the bundled build path before falling back to system paths. Resolution order: `TESSDATA_PREFIX` env → `KREUZBERG_CACHE_DIR/tessdata` → bundled tessdata → system paths. - **CLI `embed` command**: Generate vector embeddings from text via `kreuzberg embed --text "..." --preset balanced`. Supports stdin, multiple texts, JSON/text output. Feature-gated on `embeddings`. - **CLI `chunk` command**: Split text into chunks via `kreuzberg chunk --text "..." --chunk-size 512`. Configurable size, overlap, chunker type, tokenizer model. - **CLI `completions` command**: Generate shell completions for bash, zsh, fish, powershell via `kreuzberg completions `. - **CLI `--log-level` global flag**: Override `RUST_LOG` via `kreuzberg --log-level debug extract doc.pdf`. - **CLI extraction overrides**: 27 flags exposed via `ExtractionOverrides` struct with `#[command(flatten)]`. New flags: `--layout-preset`, `--layout-confidence`, `--acceleration`, `--extract-pages`, `--page-markers`, `--extract-images`, `--target-dpi`, `--pdf-extract-images`, `--pdf-extract-metadata`, `--token-reduction`, `--include-structure`, `--max-concurrent`, `--max-threads`, `--msg-codepage`, `--ocr-auto-rotate`. - **CLI colored output**: Text output uses `anstyle` for colored headers, labels, success values, and dim separators. Respects `NO_COLOR` env var. - **API `POST /detect`**: MIME type detection endpoint via multipart file upload. - **API `GET /version`**: Version info endpoint. 
- **API `GET /cache/manifest`**: Model manifest with checksums and sizes. - **API `POST /cache/warm`**: Eager model download endpoint with embedding preset support. - **MCP `get_version` tool**: Query server version from MCP clients. - **MCP `cache_manifest` tool**: Get model manifest via MCP. - **MCP `cache_warm` tool**: Pre-download models via MCP. - **MCP `embed_text` tool**: Generate embeddings via MCP (feature-gated). - **MCP `chunk_text` tool**: Text chunking via MCP. - **Pipeline table extraction tracing**: Added zero-cost `tracing::trace!` and `tracing::debug!` logging throughout the layout detection and table extraction pipeline for easier debugging. - **TATR model availability check**: Layout detection now returns an error if table regions are detected but the TATR model is unavailable, instead of silently falling back to degraded extraction. - **Publish idempotency checks**: All publish jobs now have re-check steps using `check-registry@v1` before publishing. Added `check-elixir-release` job for GitHub release asset verification. - **ARM benchmark runners**: Benchmark workflows switched to `runner-medium-arm64` for ARM-native performance testing. - **Registry check tool**: `python3 scripts/publish/check_all_registries.py ` checks all 10+ registries and GitHub release assets locally. ### Changed - **CLI batch flags**: Batch command now supports all extraction override flags (chunking, layout, acceleration, etc.) via shared `ExtractionOverrides` struct, matching extract command parity. - **CLI config architecture**: Replaced 13-parameter `apply_extraction_overrides` function with `ExtractionOverrides` struct using `#[command(flatten)]`. Config fields auto-scale as `ExtractionConfig` evolves. - **MCP tool architecture**: Removed dead `tools/` trait-based duplicates; all tools implemented directly in `server.rs`. ### Improved - **CLI validation**: OCR backend values validated (tesseract, paddle-ocr, easyocr). Chunk size/overlap bounds checked. 
DPI range (36-2400) and layout confidence (0.0-1.0) validated. Zero-value `max_concurrent`/`max_threads` rejected. `--chunking-tokenizer` errors when the feature is disabled. - **API validation**: Embedding preset names validated in `/embed`. Chunk `max_characters` bounds checked (1-1M) in `/chunk`. - **MCP validation**: Empty paths rejected in `batch_extract_files`. Chunk `max_characters` bounds checked in `chunk_text`. Embedding preset validated in `embed_text`. - **Chunk overlap auto-clamping**: When `--chunk-size` is smaller than the default overlap, overlap is automatically clamped to `size/4` instead of producing a confusing error. --- ## [4.5.1] - 2026-03-20 ### Fixed - **Java FFI `CBatchResult` struct layout mismatch**: The `count` and `results` fields were swapped in the Java Panama FFM layout, causing all batch extraction operations to fail with memory access errors. - **Go FFI stale C header**: The `CExtractionResult` struct field order in the Go binding's C header did not match the Rust `#[repr(C)]` layout (reordered alphabetically in 4.5.0, added `djot_content_json`). Go read fields at wrong offsets, causing `pages_json` to deserialize `metadata_json` instead. - **FFI `LayoutDetectionConfig` not feature-gated**: The FFI crate unconditionally imported `LayoutDetectionConfig` and exposed `kreuzberg_config_builder_set_layout`, causing compilation failures on targets without the `layout-detection` feature (e.g., `x86_64-pc-windows-gnu`). - **Python wheel builds on Linux aarch64**: OpenSSL library path was hardcoded to `x86_64-linux-gnu` in the manylinux build script, failing on aarch64 runners. Now detects architecture via `uname -m`. - **R batch function signature mismatch**: R wrapper functions were missing the `file_configs` parameter when calling native Rust functions, causing "Expected Scalar, got Language" errors on all batch operations.
- **R package ORT linking**: The R build configuration (`config.R`) did not link against ONNX Runtime when `ORT_LIB_LOCATION` was set, causing `undefined symbol: OrtGetApiBase` at load time. --- ## [4.5.0] - 2026-03-20 ### Added - **ONNX-based document layout detection**: New `layout` config field enables document layout analysis using RT-DETR v2 with 17 element classes. Supports `"fast"` and `"accurate"` presets with auto-downloaded models. Available across all language bindings. - **SLANet table structure recognition**: Detected Table regions are processed by SLANet-plus for neural HTML structure recovery, producing markdown tables with colspan/rowspan support. Now runs on all pages including structure-tree pages (previously skipped). - **Layout-enhanced heading detection**: Layout model SectionHeader and Title regions guide heading detection in both structure tree and heuristic extraction. High-confidence hints (>=0.7) can override font-size-based classification. - **Multi-backend OCR pipeline**: New `OcrPipelineConfig` enables quality-based fallback across OCR backends (e.g., Tesseract then PaddleOCR) with configurable priority, language, and backend-specific settings. - **OCR quality thresholds**: New `OcrQualityThresholds` config with 16 tunable parameters for OCR output quality assessment and fallback decisions. - **OCR auto-rotate**: New `OcrConfig.auto_rotate` flag (default: false) for automatic page rotation detection. Handles 0/90/180/270 degree rotations. - **PaddleOCR v2 model tier system**: New `model_tier` field with `"mobile"` (default, ~21MB, fast) and `"server"` (~172MB, highest accuracy). Both use unified multilingual models (CJK+English in one model). Available across all bindings. - **`AccelerationConfig` for GPU/execution provider control**: Fine-grained control over ONNX execution providers (CPU, CoreML, CUDA, TensorRT) for layout detection and table recognition. Typed across all bindings. 
- **`ConcurrencyConfig` for thread limiting** (#503): New `max_threads` field caps Rayon, ONNX intra-op threads, and batch concurrency to a single limit. Typed across all bindings. - **`EmailConfig` for MSG fallback codepage** (#505): Configurable fallback codepage for MSG files lacking a codepage property (default: windows-1252). Set e.g. `1251` for Cyrillic. Typed across all bindings. - **Per-file extraction configuration (`FileExtractionConfig`)**: Per-file config overrides in batch operations. Each file can specify its own OCR, chunking, output format settings. CLI supports `--file-configs`, MCP supports `file_configs` parameter. - **Opt-in single-column pseudo tables** (#449): New `allow_single_column_tables` on `PdfConfig` (default: false). Allows single-column structured data (glossaries, itemized lists) to be emitted as tables. - **Experimental: `pdf_oxide` text extraction backend** (`pdf-oxide` feature): Pure Rust PDF text extraction as an alternative to pdfium. Opt-in only, not included in `full` feature set. - **CLI `cache warm` command**: Eagerly downloads all PaddleOCR and layout detection models. Supports `--all-embeddings` or `--embedding-model `. Useful for containerized or offline deployments. - **CLI `cache manifest` command**: Outputs a JSON manifest of all expected model files with SHA256 checksums, sizes, and source URLs for scripted cache verification. - **ChunkSizing configuration**: `sizing_type`, `sizing_model`, and `sizing_cache_dir` fields exposed in `ChunkingConfig` across all bindings. - **Chunk heading context**: New `HeadingContext` type in `ChunkMetadata` providing heading level and text. - **`ModelManifestEntry` type and `manifest()` / `ensure_all_models()` methods**: Public API for querying and eagerly downloading model cache manifests. - **SF1 structural quality metrics in benchmark CI**: SF1 quality scores now computed alongside TF1, with PDF-specific quality rankings for tracking extraction quality regressions. 
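The `cache manifest` entry above mentions SHA256 checksums and sizes intended for scripted cache verification. A minimal sketch of such a check in Python is shown below. The manifest shape (a JSON list of objects with `path`, `sha256`, and `size_bytes` keys) and the `verify_cache` helper are assumptions made for illustration; they are not the documented output format of the command.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large model files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_cache(manifest_json: str, cache_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means every entry matched.

    The keys "path", "sha256", and "size_bytes" are hypothetical field names.
    """
    problems = []
    for entry in json.loads(manifest_json):
        file_path = cache_dir / entry["path"]
        if not file_path.is_file():
            problems.append(f"missing: {entry['path']}")
        elif file_path.stat().st_size != entry["size_bytes"]:
            problems.append(f"size mismatch: {entry['path']}")
        elif sha256_of(file_path) != entry["sha256"].lower():
            problems.append(f"checksum mismatch: {entry['path']}")
    return problems
```

In a container build, a script along these lines could run after `cache warm` and fail the build when the returned list is non-empty. The size check runs before hashing so that truncated downloads are reported without reading the whole file.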
### Changed - **Layout preset default**: Changed from `"fast"` to `"accurate"`. The `Fast` variant has been removed. The `"fast"` string is still accepted for backwards compatibility. - **PaddleOCR default model tier**: Changed from `"server"` to `"mobile"`. Mobile models provide equivalent quality on standard documents while being 3-5x faster. Server tier remains available via `with_model_tier("server")`. - **PaddleOCR v2 models**: All models updated to v2 generation (PP-OCRv5 detection, PP-LCNet classification, unified multilingual recognition). V1 models remain available for older versions. - **Unified multilingual recognition models**: PP-OCRv5 unified server (84MB) and mobile (16.5MB) models replace per-script English and Chinese models. Per-script models retained for 9 other script families. - **Batch API unification**: `_with_configs` batch functions removed; per-file `FileExtractionConfig` is now an optional parameter on the unified batch functions. - **Layout pipeline no longer forces heuristic extraction**: Structure tree extraction proceeds normally when layout detection is enabled, preserving text quality. - **Global ONNX model caching**: Layout detection and SLANet models are cached globally and reused across extractions, avoiding expensive ONNX session recreation in batch scenarios. - **Vendored text embedding pipeline**: Replaced `fastembed` dependency with vendored engine using ONNX Runtime directly for tighter integration. - **Embedding `embed()` now takes `&self` instead of `&mut self`**: Enables parallel embedding generation without mutable reference constraints. - **L2 normalization parallelized**: Embedding batches >= 64 vectors now use multi-threaded normalization. - **`padding` field in PaddleOcrConfig**: Now exposed across Python, TypeScript, Ruby, and Go bindings (previously Rust-only). 
- **Language-agnostic section pattern recognition**: Headings ending with a period are now allowed when they match structural patterns (section symbol, all-caps, numbered sections). Improves heading detection for legal, academic, and multilingual documents. - **Layout classification guards**: Heading overrides from the layout model now have word count limits, punctuation checks, figure label detection, and body-font-size validation to prevent false heading promotions. - **Strong typing across bindings**: Replaced weak `Dictionary`/`Map`/`array` types with strongly typed config classes in C#, Java, and PHP. Added missing config types to Python stubs, Node.js, Ruby, Elixir, and PHP. ### Removed - **`fastembed` dependency**: Replaced by vendored embedding engine using ONNX Runtime directly. - **`EmbeddingModelType::FastEmbed` variant**: Use `Preset` or `Custom` variants instead. ### Fixed - **C# FFI struct layout mismatch** (#538): `CExtractionResult` struct layout between Rust and C# was mismatched, causing deserialization failures and overflow exceptions that made the C# library completely broken in 4.4.6. - **PDF `force_ocr` without explicit OCR config** (#495): `force_ocr=true` was silently ignored when no `ocr` config block was provided. Now unconditionally triggers the OCR pipeline with default settings. - **PDF image extraction** (#511): Extracted images returned raw compressed data instead of properly decoded image bytes. Now automatically decoded and re-encoded as standard formats (PNG/JPEG). - **Node.js `extractFileInWorker` mime_type passthrough** (#523): MIME type was silently injected into PDF password config instead of being forwarded to extraction. Now correctly passed through. - **DOCX parser type inference failure** (#519): The `zip` 8.2.0 dependency introduced type ambiguity in DOCX and XML parsers, causing compilation failures. 
- **Python `py.typed` and `.pyi` missing from sdist**: Type stubs and `py.typed` marker now included in both wheel and sdist formats. - **PDF broken CMap word spacing**: Geometric validation now vetoes false word boundaries in PDFs with broken font CMaps, fixing "co mputer" -> "computer" style errors. - **PDF structure tree heading trust**: Structure tree heading tags (H1-H6) are now trusted as author-intent metadata. Previously, font-size validation rejected valid headings close to body size. - **PDF structure tree extraction performance**: Text and style maps now built in a single pass, eliminating multi-second extraction times on complex pages. - **OCR Picture regions suppressing text**: Layout-detected Picture regions now preserve embedded text as plain paragraphs instead of silently dropping it. - **Non-transitive sort comparators**: Spatial reading-order sorts now use discrete row buckets instead of tolerance-based grouping, ensuring correct and stable ordering. - **Page furniture over-stripping**: Added bulk and per-paragraph guards to prevent aggressive furniture stripping from removing legitimate content. - **`KREUZBERG_CACHE_DIR` not respected by all caches**: Embeddings, OCR result cache, and document extraction cache now honor the environment variable. - **MSG PT_STRING8 encoding**: MSG files now correctly decode ANSI string properties using the declared Windows code page instead of UTF-8 lossy conversion. - **SLANet-Plus ONNX model**: Re-exported with shape fix, resolving inference failures that caused all SLANet table extractions to silently fail on macOS CoreML. - **TATR model panic in batch processing**: Model unavailability in parallel closures caused crashes in FFI callers (Java, C#). Now falls back gracefully to heuristic table extraction. - **Docker musl builds**: Alpine/musl Docker images now link against the system ONNX Runtime library, fixing build failures. All features work in musl CLI images. 
- **FFI batch functions null handling**: C#/Java FFI batch functions now accept NULL for `file_config_jsons` instead of rejecting it. ### Known Issues - **PHP PIE Windows package temporarily unavailable**: The Windows build for the PHP PIE extension is disabled due to a transitive dependency conflict (`ort-sys` → `lzma-rust2` → `crc` version collision on the `x86_64-pc-windows-gnu` target). Linux and macOS PHP packages are unaffected. Will be resolved when upstream `ort` updates its `lzma-rust2` dependency. - **WASM: no layout detection, acceleration, or email config**: ONNX Runtime does not support WebAssembly, so layout detection (RT-DETR), hardware acceleration config, and concurrency config are unavailable in the WASM binding. OCR via Tesseract WASM and embeddings are supported. --- ## [4.4.6] ### Added - **dBASE (.dbf) format support**: Extract table data from dBASE files as markdown tables with field type support. - **Hangul Word Processor (.hwp/.hwpx) support**: Extract text content from HWP 5.0 documents (standard Korean document format). - **Office template/macro format variants**: Added support for `.docm`, `.dotx`, `.dotm`, `.dot` (Word), `.potx`, `.potm`, `.pot` (PowerPoint), `.xltx`, `.xlt` (Excel) formats. ### Fixed - **DOCX image placeholders missing (#484)**: Extracting `.docx` files with `extract_images=True` no longer produced `![](image)` placeholders in the output. The default plain text output path was stripping image references. Image extraction now forces markdown output so placeholders are always included. ### Changed - **Format count updated to 91+**: Documentation across all READMEs, docs, and package manifests updated to reflect expanded format support (previously 75+). ## [4.4.5] ### Fixed - **PDF markdown garbles positioned text (#431)**: PDFs with positioned/tabular text (CVs, addresses, data tables) had their line breaks destroyed during paragraph grouping. 
Added page-level positioned text detection: when fewer than 30% of lines on a page reach the right margin, short lines are split into separate paragraphs to preserve the document's visual structure. - **Node worker pool password bug**: `extractFileInWorker` was passing the `password` argument as `mime_type` to `extract_file_sync`, meaning passwords were never applied and MIME detection could break. Password is now correctly injected into `config.pdf_options.passwords`. - **Unused import in kreuzberg-node**: Removed unused `use serde_json::Value` import in `result.rs` that caused clippy warnings. - **WASM Deno OCR test hang**: OCR tests hung indefinitely on WASM Deno because Tesseract synchronous initialization blocks the single-threaded runtime. OCR fixtures are now skipped for the wasm-deno target. - **WASM camelCase config deserialization**: JS consumers send camelCase config keys (e.g. `includeDocumentStructure`) but `serde` expects snake_case. Added `camel_to_snake` transform in `parse_config()` so config fields are properly deserialized. Fixes document structure extraction returning empty results via WASM. - **PHP 8.5 array coercion on macOS**: On PHP 8.5 + macOS, ext-php-rs coerces `#[php_class]` return values to arrays instead of objects. Added `normalizeExtractionResult()` wrapper that transparently converts arrays via `ExtractionResult::fromArray()`. - **PHP 8.5 support**: Upgraded ext-php-rs to 0.15.6 for PHP 8.5 compatibility. - **Vendoring scripts missing path deps**: Ruby and R vendoring scripts failed when workspace dependencies use `path` instead of `version`. Added path field handling to `format_dependency()` and kreuzberg-ffi fixup block to the Ruby vendoring script. - **pdfium-render clippy lints**: Fixed clippy warnings in kreuzberg-pdfium-render crate. ### Added - **CLI `--pdf-password` flag**: New `--pdf-password` option on `extract` and `batch` commands for encrypted PDF support. Can be specified multiple times. 
- **MCP `pdf_password` parameter**: Added `pdf_password` field to `extract_file`, `extract_bytes`, and `batch_extract_files` MCP tool params for better discoverability. - **API `pdf_password` multipart field**: The HTTP API extract endpoint now accepts a `pdf_password` multipart field for encrypted PDFs. - **`PdfConfig` Default impl**: Added `Default` implementation for `PdfConfig` to support ergonomic config construction. - **Binding crate clippy in CI**: Added clippy steps to `ci-node`, `ci-python`, and `ci-wasm` workflows (gated to Linux). Added `node:clippy`, `python:clippy`, and `wasm:clippy` task commands. - **E2E password-protected PDF fixture**: Added `pdf_password_protected` fixture testing copy-protected PDF extraction across all bindings. ### Changed - **All binding crates linted in pre-commit**: Removed clippy exclusions for kreuzberg-php, kreuzberg-node, and kreuzberg-wasm from pre-commit config. - **golangci-lint v2.11.3**: Upgraded from v2.9.0 across Taskfile, CI workflows, and install scripts. ## [4.4.4] ### Fixed - **CLI test app fixes**: Fixed broken symlinks in CLI test documents, corrected `--format` to `--output-format` flag usage, fixed multipart form field name (`file=` → `files=`) in serve tests, and rewrote MCP test to use JSON-RPC stdin protocol instead of background process detection. - **Publish idempotency check scripts**: Fixed `check_nuget.sh` and `check-nuget-version.sh` using bash 4+ `${var,,}` syntax incompatible with bash 3.x. Fixed `check_pypi.sh` and `check_packagist.sh` writing to `$GITHUB_OUTPUT` internally instead of stdout (conflicting with workflow-level redirect). Fixed `check-rubygems-version.sh` false negatives for native gems by switching from `gem search` to RubyGems JSON API. Fixed `check-rubygems-version-python.sh` Python operator precedence bug. Fixed `check-maven-version.sh` using unreliable Solr search API instead of direct repo HEAD request. 
Fixed stderr redirect missing on diagnostic messages in multiple scripts. - **Node test app version**: Updated Node.js test app to reference v4.4.4 package version. ### Changed - **CLI install with all features**: CLI test install script now uses `--all-features` flag to enable API server and MCP server subcommands. - **Publish workflow republish support**: Added `republish` input to publish workflow that deletes and re-creates the tag on current HEAD before publishing, enabling clean retag + full republish. ## [4.4.3] ### Added - **PDF image placeholder toggle**: New `inject_placeholders` option on `ImageExtractionConfig` (default: `true`). Set to `false` to extract images as data without injecting `![image](...)` references into the markdown content. ### Fixed - **Token reduction not applied** ([#436](https://github.com/kreuzberg-dev/kreuzberg/issues/436)): Token reduction config was accepted but never executed during extraction. The pipeline now applies `reduce_tokens()` when `token_reduction.mode` is configured. - **Nested HTML table extraction**: Nested HTML tables now extract correctly with proper cell data and markdown rendering, using the visitor-based table extraction API from html-to-markdown-rs. - **hOCR plain text output**: hOCR conversion now correctly produces plain text when `OutputFormat::Plain` is requested, instead of silently falling back to Markdown. - **PDF garbled text for positioned/tabular content** ([#431](https://github.com/kreuzberg-dev/kreuzberg/issues/431)): PDF text extraction now detects X-position gaps between consecutive characters and inserts spaces when the gap exceeds `0.8 × avg_font_size`. Previously, characters placed at specific coordinates without explicit space characters were concatenated without spaces. 
- **Chunk page metadata drift with overlap** ([#439](https://github.com/kreuzberg-dev/kreuzberg/issues/439)): Chunk byte offsets are now computed via pointer arithmetic from the source text, fixing cumulative drift that caused chunks to report incorrect page numbers when overlap is enabled. - **Node.js metadata casing**: Standardized all `Metadata` and `EmailMetadata` fields to `camelCase` (e.g., `pageCount`, `creationDate`, `fromEmail`) in the Node.js/TypeScript bindings. Also corrected pluralization for `authors` and `keywords`. - **WASM build failure on Windows CI**: CMake try-compile checks on Windows used the host MSVC compiler (`cl.exe`), which rejected GCC/Clang flags like `-Wno-implicit-function-declaration`. Added `CMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY` to both `build_leptonica_wasm` and `build_tesseract_wasm` to skip linking during cross-compilation checks. - **WASM OCR build panic when `git`/`patch` unavailable**: The tesseract WASM patch (`tesseract.diff`) application panicked when both `git apply` and `patch` commands failed. Added programmatic C++ source fixups as a fallback, applying all necessary changes (CPUID guard, `pixa_debug_` unique_ptr conversion, source list trimming) via string replacement when the diff patch cannot be applied. ## [4.4.2] ### Fixed - **E2E element type assertions**: Fixed element type field name in E2E generator templates for Python, TypeScript, WASM Deno, Elixir, Ruby, PHP, and C#. Each binding uses different casing conventions (Python: dict key `element_type`, TypeScript/Node: `elementType` via NAPI camelCase, Elixir: atom-to-string conversion, C#: JSON serialization for snake_case wire value). - **Ruby PDF annotation extraction**: Fixed `PdfAnnotation` and `PdfAnnotationBoundingBox` classes not being registered in the autoload list, causing `NameError` when extracting PDF annotations. Also fixed bounding box field name mismatch between Rust output (`x0/y0/x1/y1`) and Ruby struct (`left/top/right/bottom`).
- **Ruby cyclomatic complexity**: Refactored `build_annotation_bbox` in result.rb to extract repeated field lookup pattern, reducing cyclomatic complexity below threshold. - **WASM OCR blocking event loop**: The `ocrRecognize()` function in the WASM package was running synchronously on the main thread, blocking the Node.js event loop during image decoding and Tesseract OCR processing. This prevented timeouts and other async operations from firing while OCR was in progress. OCR now runs in a worker thread (Node.js `worker_threads` / browser `Web Worker`), keeping the main thread responsive. - **JPEG 2000 OCR decode failure**: JPEG 2000 images (jp2, jpx, jpm, mj2) and JBIG2 images failed with "The image format could not be determined" during PaddleOCR and WASM OCR because these code paths used the standard `image` crate which doesn't support JPEG 2000. A shared `load_image_for_ocr()` helper now detects JP2/J2K/JBIG2 formats by magic bytes and uses `hayro-jpeg2000`/`hayro-jbig2` decoders across all OCR backends. The `ocr-wasm` feature now includes these decoders (pure Rust, WASM-compatible). - **WASM PDF empty content**: `initWasm()` fired off PDFium initialization asynchronously without awaiting it, causing a race condition where PDF extraction could start before PDFium was ready, returning empty content. PDFium initialization is now properly awaited during `initWasm()`. ### Added - **OMML-to-LaTeX math conversion for DOCX**: Mathematical equations in DOCX files (Office Math Markup Language) are now converted to LaTeX notation instead of being rendered as concatenated Unicode text. Supports superscripts, subscripts, fractions (`\frac`), radicals (`\sqrt`), n-ary operators (`\sum`, `\int`), delimiters, function names, accents, equation arrays, limits, bars, border boxes, matrices, and pre-sub-superscripts. Display math uses `$$...$$` and inline math uses `$...$` in markdown output. Plain text output includes raw LaTeX without delimiters. 
- **Plain text output paths for all extractors**: When `OutputFormat::Plain` or `OutputFormat::Structured` is requested, DOCX, PPTX, ODT, FB2, DocBook, RTF, and Jupyter extractors now produce clean plain text without markdown syntax (`#`, `**`, `|`, `![](image)`, `-`, etc.). Previously these extractors always emitted markdown regardless of the requested output format. - **DOCX**: `Document::to_plain_text()` skips heading prefixes, inline formatting markers, image placeholders, and renders footnotes/endnotes as `id: text` instead of `[^id]: text`. - **PPTX**: `ContentBuilder` respects `plain` mode — skips `#` title prefix, image markers, list markers, and uses `Notes:` instead of `### Notes:`. - **ODT**: Heading prefixes (`#`), list markers (`-`), and pipe-delimited tables conditionally omitted for plain text. - **FB2/FictionBook**: Inline markers (`*`, `**`, `` ` ``, `~~`), heading prefixes, and cite prefixes skipped for plain text. - **DocBook**: Section title prefixes, code fences, list markers, blockquote prefixes, bold figure captions, and pipe tables all conditionally omitted. - **RTF**: Table output in result string uses tab separation instead of pipe-delimited markdown. Image `![image](...)` markers omitted for plain text. - **Jupyter**: Skips `text/markdown` and `text/html` output types in plain mode, preferring `text/plain`. - **`cells_to_text()` shared utility**: Tab-separated plain text table formatter alongside existing `cells_to_markdown()`. Used by DOCX, PPTX, ODT, RTF, and DocBook extractors for plain text table rendering. ### Changed - **CLI includes all features**: `kreuzberg-cli` now depends on `kreuzberg` with the `full` feature set instead of a separate `cli` subset. The `cli` feature group has been removed from `kreuzberg`. This ensures the CLI supports all formats including archives (7z, tar, gz, zip). ### Fixed - **Alpine/musl CLI Docker image**: Fixed "Dynamic loading not supported" error when running `kreuzberg-cli` in Alpine containers. 
The CLI binary is now dynamically linked against musl libc, enabling runtime library loading for PDF processing. - **R package Windows installation**: Improved Python detection in configure script for Windows environments (added `py` launcher and `RETICULATE_PYTHON` support). Symlink extraction errors during source package installation are now handled gracefully. - **PHP 8.5 precompiled extension binaries**: Added PHP 8.5 support alongside existing PHP 8.4 in CI and release workflows. - **OCR DPI normalization**: The `normalize_image_dpi()` preprocessing logic is now integrated into the OCR pipeline. Images are normalized to the configured target DPI before being passed to Tesseract, and the calculated DPI is set via `set_source_resolution()`. This eliminates the "Estimating resolution as ..." warning and improves OCR accuracy for images with non-standard DPI. - **HTML metadata extraction with Plain output**: Fixed HTML metadata (headers, links, images, structured data) not being collected when using `OutputFormat::Plain` (the default). The underlying library's plain text fast path skips metadata extraction; kreuzberg now uses Markdown format internally for metadata collection and converts to plain text separately. - **PPTX text run spacing**: Adjacent text runs within paragraphs are now joined with smart spacing instead of being concatenated directly ("HelloWorld" → "Hello World"). - **CSV Shift-JIS/cp932 encoding detection**: `encoding_rs` is now a non-optional dependency. CSV files with Shift-JIS encoding are correctly decoded instead of producing mojibake. Fallback encoding detection tries common encodings (Shift-JIS, cp932, windows-1252, iso-8859-1, gb18030, big5). - **EML multipart body extraction**: All text/html body parts are now extracted by iterating over all indices instead of only index 0. Nested `message/rfc822` parts in multipart/digest are recursively extracted. - **EPUB media tag leakage**: `