` tags. Closing tags from inner word spans could prematurely terminate a paragraph, causing content after that point to be silently dropped. Now uses tag-name-specific depth tracking.
- **hOCR multi-page content loss** — Per-page hOCR documents from tesseract always report `ppageno=0` (page=1), but the paragraph conversion filtered by the actual page index, silently dropping all content on pages 2+. Removed the per-page filter since each hOCR document is independently extracted per page.
- **OCR batch parallelization** — OCR page processing was hardcoded to 4 concurrent pages regardless of available CPUs. Now uses `resolve_thread_budget()` (auto-detects CPUs, capped at 8) for significantly faster multi-page document processing.
- **Benchmark workflow** — Removed reference to deleted `kreuzberg-extract` binary target.
- **Ruby OCR backend** — Added missing `ocr_internal_document` field to `ExtractionResult` construction.
- **Keyword extraction tests** — Updated test assertions to use new `extracted_keywords` field instead of deprecated `metadata.additional["keywords"]`.
- **PaddleOCR cache dir test** — Fixed test failure when `KREUZBERG_CACHE_DIR` environment variable is set by CI setup actions.
- **API `pdf_password` handler** — Added `#[cfg(feature = "pdf")]` gate to prevent compile error when `api` feature is enabled without `pdf`.
- **Chunking page boundary regression** (#636): Page boundaries were computed against raw extractor text but `result.content` uses rendered text with different byte lengths. Chunks now recompute boundaries from per-page content, fixing `first_page`/`last_page` being null and the "Page boundary byte_end exceeds text length" validation warning.
- **HF Hub environment variables** (#634): Use `ApiBuilder::from_env()` instead of `ApiBuilder::new()` for Hugging Face model downloads, respecting `HF_HOME` and `HF_ENDPOINT` environment variables. Fixes permission errors on Kubernetes when running as non-root.
- **PDF bridge tracing panic on multibyte characters** (#635): Use `.chars().take()` instead of byte indexing for `text_preview` in PDF structure bridge tracing, preventing panics on multibyte UTF-8 characters (e.g., `•`).
- **Go FFI struct layout** — vendored C header was missing `children_json` field, causing 8-byte offset shift. All FFI fields after `chunks_json` read wrong memory (e.g., `ocr_elements_json` read `mime_type` instead).
- **Java FFI struct layout** — `CExtractionResult` layout was missing `code_intelligence_json` field, causing `success` flag to read from wrong offset. All Java extractions returned `success=false`.
- **PHP `__get` magic method bypass** — six JSON fields (`elements`, `djotContent`, `document`, `ocrElements`, `children`, `uris`) returned raw JSON strings instead of deserialized arrays because `#[php(prop)]` intercepted property access before `__get`.
- **Ruby `disable_ocr` config** — `disable_ocr` keyword was not parsed in Ruby config handler, causing OCR to run even when explicitly disabled.
- **Node.js `ExtractionResult` parity** — `document`, `djotContent`, and `ocrElements` fields were `Option` which NAPI-RS omitted from JS objects when `None`. Changed to `Value` defaulting to `null`.
- **Node.js `convertChunk` missing `chunkType`** — TypeScript type converter did not forward the `chunk_type` field from NAPI bindings.
- **ODT caption text extraction** — text inside `draw:frame > draw:text-box > text:p` (e.g., image captions) was not extracted. The ODT extractor now recurses into text-box content.
- **OCR InternalDocument propagation** — `run_ocr_pipeline` discarded the structured InternalDocument built by `extract_with_ocr`, causing OCR results to fall back to naive `\n\n` paragraph splitting. Now propagated through the full pipeline.
- **OCR table cells** — OCR-detected tables (via TATR) had empty `cells` vectors, causing comrak to render them as paragraphs instead of proper tables.
Now populated from the cell grid, matching the native text path fix.
- **OCR non-layout InternalDocument** — When layout detection is not active, the OCR path now builds an InternalDocument from results instead of returning None. Ensures structured output regardless of layout detection availability.
- **Italian/European PDF ligature corruption** — Extended contextual ligature repair to handle `tt`, `ti`, `tti` ligatures common in Italian fonts. Fixes garbled text like `Dire*ore` → `Direttore`, `ges:one` → `gestione`, `progeM` → `progetti`.
- **OCR layout false heading classification** — Tesseract+layout pipeline was worse than pure tesseract (33% vs 41% SF1) because layout confidence threshold was too low (0.5). Raised to 0.7 for OCR path where font-size validation is unavailable.
- **OCR table rendering** — OCR-detected tables were not linked to InternalDocument elements, causing comrak to skip them entirely. Tables now properly registered via `push_table()` with corresponding `ElementKind::Table` elements.
- **Spurious table detection** — Multi-column prose with short cells (like nougat_008) bypassed the prose row check due to a 30-char minimum row length. Lowered to 15 chars so short-cell prose tables are correctly rejected.
- **PHP enum registration** — PHP enums (ContentLayer, ElementType, etc.) were registered with `.class()` instead of `.enumeration()`, causing empty case lists. Virtual properties on ExtractionResult and ArchiveEntry now declared via builder modifiers for reflection visibility.
- **Go macOS FFI linking** — monorepo dev build (`ffi_dev.go`) was missing `-framework Foundation` in CGO LDFLAGS, causing linker failures on macOS with CoreML-enabled ONNX Runtime.
- **Unified WASM e2e tests** — replaced broken separate Deno/Workers e2e generators with a single vitest-based WASM generator. ORT-dependent features (embeddings, layout, paddle-ocr) gracefully skip.
- **WASM Rayon thread pool panic** — Rayon's `par_iter()` / `into_par_iter()` and `ThreadPoolBuilder::build_global()` panicked in WASM (`RuntimeError: unreachable`) because WASM has no threading support. All Rayon usages now fall back to sequential iteration on `wasm32` target.
- **PHP virtual property reflection** — `ClassBuilder::property()` declarations for `__get`-backed fields (metadata, chunks, document, etc.) shadowed the magic method, returning null. Replaced with getter methods that don't interfere with `__get`. Parity test updated to check both `hasProperty()` and getter methods.

---

## [4.7.0] - 2026-03-30

### Added

- **Semantic chunk labeling** (#600): Chunks now include a `chunk_type` field identifying the semantic nature of the content (e.g., `paragraph`, `heading`, `list_item`, `table_cell`, `code_block`). Supported across all 11 language bindings with updated E2E test parity.
- **Unified InternalDocument architecture**: All extractors now return a canonical `InternalDocument` with typed elements, relationships, images, and tables. Replaces format-specific intermediate representations.
- **Unified rendering layer**: New `new_markdown.rs` renderer produces CommonMark from `InternalDocument`, supporting headings, lists, tables, code blocks, formulas, footnotes, images, and inline annotations (bold, italic, links).
- **PDF structure pipeline**: Full rewrite of PDF extraction using `page.text().all()` for clean text, char-indexed font metadata for heading/bold detection, segment-based paragraph gap detection, and pdfium segment bounding boxes for precise paragraph regions.
- **Image extraction across 8 formats**: Embedded images now extracted as `ExtractedImage` with binary data, format, dimensions, and alt text. Supported for DOCX, PPTX, PDF, EPUB, ODT, HTML (data URIs), RTF (hex-decoded), and Markdown/MDX/Jupyter. Markdown output renders as `![alt](image_N.ext)` with binary data in `ExtractionResult.images`.
- **Recursive OCR on embedded images**: When OCR is configured, extracted images from EPUB, ODT, HTML, and RTF are processed through `process_images_with_ocr()`, producing nested `ExtractionResult` in `ExtractedImage.ocr_result`.
- **PDF watermark artifact filtering**: Uses pdfium's `/Artifact` content marks (PDF tagged content spec) to identify and filter watermark text from output.
- **Vertical table header reconstruction**: Detects and fixes rotated column headers in PDF tables where pdfium extracts characters as spaced single characters in reverse order (e.g., "y t i r o h t u A o N" → "NoAuthority").
- **Position-based page furniture detection**: Cross-page repeating text detection now uses actual page margins (top/bottom 10%) and page heights instead of word-count heuristics.
- **html-to-markdown v3 migration**: Switched to html-to-markdown v3 with unified `convert()` API returning `ConversionResult` (content, metadata, tables, images, document structure in a single call). Uses visitor-based table collection. hOCR module vendored as `table_core`.
- **Markdown ground truth for 336 documents**: Pandoc-generated GT across 10 formats (DOCX, HTML, RTF, PPTX, EPUB, ODT, XLSX, XLS, CSV, DOC) for structural quality benchmarking. All 371 markdown GT files cleaned of HTML remnants (415 tables converted to GFM pipe tables, 28 inline tags fixed).
- **Multi-format benchmark support**: Pipeline benchmark now scores all document formats (not just PDF), shows file type per document, replaces NaN with "—", and reports ground truth loading errors.
- **Comprehensive PDF pipeline tracing**: Trace-level logging across heading lifecycle (layout overrides, demotion passes, furniture detection, render layer) for debugging.
- **Pages API for PDF extraction**: Per-page content now properly wired through the extraction pipeline via `prebuilt_pages` on `InternalDocument`, making `result.pages` available for PDF documents.
- **TOON wire format**: Token-Oriented Object Notation support across CLI (`--format toon`), API (`Accept: application/toon`), MCP (`response_format: "toon"`), and all 11 language bindings (Python, Node.js, WASM, C FFI, PHP, Ruby, Elixir, Go, Java, C#, R). TOON is a token-efficient alternative to JSON for LLM prompts — losslessly convertible to/from JSON but uses ~30-50% fewer tokens. Core functions `serialize_to_toon()` and `serialize_to_json()` exposed as public API.
- **Renderer registry**: Trait-based `Renderer` and `RendererRegistry` for custom output format plugins. Built-in renderers (markdown, HTML, djot, plain) registered at startup. External crates can register custom renderers (e.g., DOCX output) via `register_renderer()`.
- **comrak-based rendering**: Markdown and HTML rendering now uses comrak AST bridge instead of hand-rolled string building. Produces GFM-compliant markdown and semantic HTML5. Paragraph consolidation merges consecutive same-format paragraphs at sentence boundaries (fixes DOCX CV fragmentation where each visual line was a separate `*...*` italic block).
- **Benchmark quality scoring improvements**: Content normalization for HTML blocks in markdown scoring, Image↔Paragraph and Table↔ListItem type compatibility, `correct` field in `QualityMetrics`, HTML detection in ground truth validation.
- **Benchmark harness overhaul**: Per-format SF1/TF1 aggregation, noise detection (10 heuristics for HTML remnants, garbled text, broken tables, page artifacts), diagnostic diff mode (`--diagnose`), JSON output (`--json-output`), ground truth validation subcommand (`validate-gt`). Comprehensive tracing across all extractors and the rendering layer.
- **Markdown ground truth for 23 formats**: 350+ benchmark fixtures across CSV, DOCX, HTML, EPUB, LaTeX, RST, RTF, PPTX, ODT, XLSX, XLS, OPML, ORG, JATS, IPYNB, FictionBook, DocBook, Typst, DOC, PPT, and more. GT generated via pandoc and verified against source documents.
- **OpenWebUI integration**: Kreuzberg serves as a document extraction backend for Open WebUI chat interfaces.
- **URI extraction**: New `Uri` type with `UriKind` classification (Hyperlink, Image, Anchor, Citation, Reference, Email) extracted from 20+ document formats. URIs are always-on, deduplicated by (url, kind) pair, and capped at 100k per document. Available in `ExtractionResult.uris`.
- **Recursive email attachment extraction**: EML/MSG/PST attachments are now recursively extracted as `ArchiveEntry` children using the same pattern as archive extractors. Nested `message/rfc822` parts also extracted as children. Respects `max_archive_depth`.
- **PDF embedded file extraction**: PDF file attachments (portfolios) are now recursively extracted as `ArchiveEntry` children via lopdf. Includes filename sanitization, decompression size limits, and name tree depth guards.
- **PDF bookmark/outline extraction**: Document outlines (bookmarks) extracted as URIs — page destinations as `UriKind::Anchor`, external links as `UriKind::Hyperlink`.
- **DOCX/PPTX embedded object extraction**: OLE objects and embedded files from `word/embeddings/` and `ppt/embeddings/` directories are now recursively extracted as children.
- **PPTX hyperlink extraction**: Hyperlinks from slide XML (`` in run properties) now resolved via relationship files and extracted as URIs.
- **Image path resolution for markup formats**: When using `extract_file()`, relative image paths in Markdown, MDX, LaTeX, RST, OrgMode, Typst, Djot, and DocBook are resolved from the filesystem and extracted as `ExtractedImage` data. OS-agnostic with path traversal prevention.
- **Unified image OCR pipeline stage**: Image OCR moved from per-extractor calls to a single pipeline stage after derivation. All extracted images (including path-resolved markup images) are now OCR'd uniformly when OCR is configured. Concurrency limited to 8 concurrent tasks.
- **FictionBook image and link extraction**: Base64-encoded `` images and `` hyperlinks now extracted from FB2 documents.
- **Apple iWork extractor improvements**: Numbers outputs tables instead of paragraphs, Keynote has improved slide structure, Pages has heading detection. All three extract metadata from ZIP plist.
- **`code_intelligence` field on ExtractionResult**: Top-level access to tree-sitter `ProcessResult` with full structure, imports, exports, chunks, symbols, diagnostics, and docstrings. Previously only available inside `FormatMetadata::Code` metadata.
- **`CodeContentMode` config**: Control code extraction content mode -- `chunks` (semantic TSLP chunks, default), `raw` (source as-is), `structure` (headings + docstrings only). Configured via `TreeSitterProcessConfig.content_mode`.
- **TSLP semantic chunking for code**: Code files bypass the text-splitter entirely. TSLP's `CodeChunks` (function/class-aware) map directly to kreuzberg `Chunk`s with semantic types and heading context.
- **Cross-format output parity tests**: 36 tests verifying Markdown, HTML, Djot, and Plain produce equivalent text content. GFM lint validation, bracket escaping checks, structural block comparison.
- **HTML input markdown passthrough**: HTML files extracted as Markdown now use html-to-markdown output directly via `pre_rendered_content`, bypassing the lossy InternalDocument to comrak round-trip.
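
The vertical table header reconstruction listed under Added can be illustrated with a minimal Rust sketch. This is not Kreuzberg's implementation: the function name `reconstruct_vertical_header` and the three-token minimum are assumptions chosen for illustration, and only the input/output pair comes from the changelog entry.

```rust
/// Hypothetical helper: returns the re-joined header when `raw` looks like a
/// rotated header (every whitespace-separated token is a single character),
/// and `None` otherwise.
fn reconstruct_vertical_header(raw: &str) -> Option<String> {
    let tokens: Vec<&str> = raw.split_whitespace().collect();

    // Assumed threshold: require at least three tokens so that ordinary
    // short cells such as "a b" are left untouched.
    if tokens.len() < 3 || tokens.iter().any(|t| t.chars().count() != 1) {
        return None;
    }

    // The characters arrive in reverse reading order, so reverse before joining.
    Some(tokens.iter().rev().copied().collect())
}
```

With this sketch, `reconstruct_vertical_header("y t i r o h t u A o N")` yields `Some("NoAuthority")`, while a normal cell such as `"Total cost"` yields `None`.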
### Code Intelligence

- **Tree-sitter integration** for 248 programming languages via [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack)
- Extract functions, classes, imports, exports, symbols, docstrings, diagnostics
- Syntax-aware code chunking
- Language detection from file extension and shebang
- Dynamic grammar download (native) / 30-language static subset (WASM)
- New `tree-sitter` and `tree-sitter-wasm` feature flags (included in `full` and `wasm-target`)
- `TreeSitterConfig` and `TreeSitterProcessConfig` in `ExtractionConfig`
- Re-exported TSLP types (`ProcessResult`, `StructureItem`, `FileMetrics`, etc.)
- [TSLP documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev)

### Typed Metadata

- New `FormatMetadata` variants: `Code`, `Csv`, `Bibtex`, `Citation`, `FictionBook`, `Dbf`, `Jats`, `Epub`, `Pst`
- Extended `PptxMetadata` with `image_count` and `table_count`
- Migrated deprecated `metadata.additional` writes to typed fields across all extractors
- Strong types for all new metadata variants across all 11 language bindings

### Breaking Changes

- **Layout detection preset removed**: The `preset` field on `LayoutDetectionConfig` has been removed across all bindings. Layout detection now uses the RT-DETR v2 model unconditionally — no "fast" vs "accurate" distinction. The `--layout-preset` CLI flag is removed. Old configs with `"preset": "..."` are silently ignored for backward compatibility.
- **Table model config typed**: `table_model` on `LayoutDetectionConfig` changed from `Option` to a `TableModel` enum (`tatr`, `slanet_wired`, `slanet_wireless`, `slanet_plus`, `slanet_auto`, `disabled`). Defaults to `tatr`. String values still accepted in JSON/TOML configs.

### Fixed

- **PDF table rendering**: Populate `Table.cells` from TATR/SLANeXT grid so comrak renders proper Table nodes instead of wrapping markdown in a Paragraph. Table SF1 improved from 15.5% to 53.7%.
- **Markdown GFM quality**: Enable `prefer_fenced` for code blocks, un-escape brackets/parens (`\[` to `[`), fix code block language spacing in djot.
- **Semantic HTML output**: Enable `github_pre_lang` and `full_info_string` for code blocks with `class="language-X"`.
- **Djot text normalization**: Shared `normalize_inline_text()` for consistent whitespace handling. MD-to-Djot TF1 now 1.0000.
- **PDF structural extraction quality**: Improved heading detection (font-size-ratio H2/H3 differentiation, section numbering patterns, ALL-CAPS detection, paragraph-to-heading rescue pass), table discrimination (reject multi-column prose misclassified as tables via flow-through detection, row-count/column-count ratio, and table quality validation), list detection (multi-token prefix patterns), image scoring (normalize image block matching), and formula detection (math character density heuristic). Layout SF1 improved from 40.7% to 43.7% across 157 verified PDF fixtures.
- **PDF ground truth verified**: All 157 PDF benchmark fixtures verified using vision (rendered page images vs GT markdown). 7 broken Mistral OCR GTs with hallucinated content replaced with vision-verified markdown.
- **LaTeX extraction**: Convert `\href`, `\emph`, `\textbf`, `\textgreater`, `\verb`, `\sout`, blockquotes, lists, special characters, and typographic ligatures to markdown.
- **XLSX/XLS sheet name headings**: Emit `## SheetName` heading before each sheet's table, matching pandoc convention.
- **OPML outline headings**: All outline nodes now emit headings at appropriate depth, not just parent outlines. Inline HTML in text attributes converted to markdown.
- **IPYNB heading detection**: Markdown cells now detect ATX headings and emit proper heading elements. Code cell outputs (stdout, execute_result) included in extraction.
- **JATS abstract and references**: Abstract section with sub-headings now included. References rendered as numbered list with structured citation formatting.
- **ODT formula extraction**: Embedded MathML formula objects extracted as formula content instead of empty image placeholders. Image alt text and captions now extracted from `draw:frame` elements.
- **PPTX slide titles**: Title placeholders detected via OOXML placeholder type and emitted as H2 headings. Bulleted/numbered lists in slides extracted with proper ListStart/ListEnd wrapping.
- **ORG source blocks**: `#+BEGIN_SRC` blocks converted to fenced code blocks with language annotation. `#+BEGIN_EXAMPLE` blocks converted to unfenced code blocks. Inline code `~text~` converted to backtick spans. Paragraph line wrapping joined.
- **RST heading levels**: Overline+underline document titles assigned H1. Code block language hints preserved from `.. highlight::` and `.. code::` directives. `::` literal block shorthand handled.
- **RTF formatting**: Bold/italic/strikethrough formatting now uses exact byte offsets from a unified text+formatting extraction pass, eliminating bold bleeding across paragraphs. Hidden text (`\v`) suppressed. Hyperlink field parsing fixed. Strikethrough support added. Table row rendering fixed for multi-row tables. Ordered list detection from `\listtext` markers.
- **HTML preprocessing**: Navigation elements, forms, and sidebars now stripped by default. Previously disabled, causing page chrome to appear in extraction output.
- **PDF table detection**: Reject false table detections where >70% of cells contain single-word fragments (justified prose incorrectly classified as multi-column table).
- **DocBook root element handling**: XML fragments without a root element now wrapped automatically, fixing extraction of multi-element DocBook files.
- **FictionBook poem support**: Verse lines (``), subtitles, text-author, and date elements within poem blocks now extracted. Heading levels aligned with pandoc conventions.
- **PDF image FlateDecode fallback**: When `decode_flate_to_png()` fails for FlateDecode, CCITT, or JBIG2 streams, images are now re-extracted via pdfium's bitmap rendering pipeline, producing valid PNG output instead of unusable raw bytes (#615).
- **Metadata standardization**: Metadata from PPTX, Excel, ODT, RST, OrgMode, Typst, RTF, JATS, DOC, PPT, HTML, Email, BibTeX, and Citation extractors now mapped to standard `Metadata` struct fields (title, authors, dates, keywords, language) instead of only `additional` map.
- **MDX link parity with Markdown**: Links and annotations in headings and list items now extracted (was silently dropped).
- **RST hyperlink extraction**: Inline hyperlinks (`` `text `_ ``) and reference targets now extracted.
- **LaTeX `\url{}` extraction**: `\url{...}` commands now extracted as URIs alongside `\href`.
- **OrgMode image detection**: Added .webp, .bmp, .tiff, .avif to recognized image extensions.
- **BibTeX URI classification**: URL fields now correctly classified as Hyperlink (was Citation). Entry title used as label instead of BibTeX key.
- **JATS title field**: Article title now stored in `metadata.title` (was only in `subject`).
- **PDF bookmark stack safety**: Sibling traversal converted from recursion to iterative loop preventing stack overflow on wide outlines.
- **PDF embedded file security**: Filename sanitization (strip directory components), decompressed size limit (50MB), name tree depth limit (50 levels).
- **Tesseract C++ exception crash** (#606): Fixed fatal runtime error where C++ exceptions from Tesseract unwound through Rust FFI frames, triggering `std::terminate()`. Now compiles Tesseract with `-fno-exceptions` on macOS, Linux, and MinGW. The Tesseract CLI executable target (which uses `try`/`catch`) is patched out of CMakeLists.txt at build time since only the library is needed.
- **ExtractionConfig rejects unknown fields**: `#[serde(deny_unknown_fields)]` added to `ExtractionConfig`.
Previously, typos or invalid fields (e.g., `layout_analysis` instead of `layout`) were silently ignored.
- **RTF delimiter space consumption**: Fixed space-in-word bug where font encoding directives (`\loch`, `\hich`, `\dbch`) caused spaces mid-word ("H eading" → "Heading"). Root cause: RTF spec requires consuming trailing delimiter space after control words.
- **PPTX markdown mode**: Derive plain/markdown mode from `output_format` config instead of hardcoding `plain=true`. Tables now render as markdown tables, lists get bullet markers, text elements get newline separation.
- **EPUB test compilation**: Added `InternalDocument::content()` method and fixed `epub_spine_semantics_tests` to use it instead of removed `.content` field.
- **HTML extraction rewrite**: Replaced ~400-line manual HTML tag parser with html-to-markdown v3's `DocumentStructure` mapping. Single-pass conversion eliminates CSS/script content leakage and `[image: X]` placeholder artifacts.
- **Chunking heading context with plain output**: Fixed `heading_context` always returning `None` when using plain text output format. The markdown chunker now receives the original markdown for heading map building even when content is rendered as plain text.
- **WASM build compatibility**: Inlined workspace-inherited fields (`version`, `edition`, `authors`) in kreuzberg-wasm Cargo.toml because wasm-pack 0.14.0 cannot resolve `field.workspace = true` references.
- **Pre-commit hooks**: Fixed rumdl hook config (use `rumdl-fmt` from official repo), wasm build (feature-gate layout config access), kreuzberg-node build (missing `formatted_content` field), broken relative links in READMEs and CHANGELOG.
- **Binding compilation**: Added missing `formatted_content` field to kreuzberg-py and kreuzberg-php binding crates.
- **PDF heading body_size_guard**: Narrowed guard range from `≤ body+0.5` to `body±1.5pt` so headings well below body font size (e.g., 8pt in 12pt body) pass through.
- **RTF table extraction**: Fixed critical bug where table cell content was written to both result string and TableState, causing cells to appear as individual lines instead of proper markdown tables.
- **DOCX merged cells**: Repeat content across gridSpan (horizontal) and vMerge (vertical) spans. Added `source_path` field to `ExtractedImage` for DOCX image relationship paths.
- **DOCX formatting**: Merge adjacent runs with identical formatting to prevent spurious `****` sequences. Strip `` underline HTML tags.
- **Python wheel `__isoc23_strtoll` error on older Linux distributions** (#588): Downgraded the Linux build environment `manylinux` target from `manylinux_2_39` to `manylinux_2_28` for pre-compiled Python wheels to ensure compatibility with systems using glibc versions prior to 2.39 (e.g., Ubuntu 20.04/22.04, Debian 11/12).
- **`clear_ocr_backends` now fully clears the registry**: Calls `shutdown_all()` instead of `reset_to_defaults()`, so the backend list is empty after clearing as expected by the API contract.
- **Go macOS link failure**: Added missing `-framework Foundation` to CGO LDFLAGS. ORT's CoreML provider uses Foundation for NSLog/NSFileManager, causing undefined symbol errors on macOS.
- **Tesseract Windows MinGW build (Elixir/Go/C FFI publish)**: CMake resolved bare `g++` to MSVC `cl.exe` on CI runners with both toolchains. Added `resolve_mingw_compiler()` to find absolute paths from MSYS2 subsystem dirs. Bumped Tesseract cache key to invalidate stale MSVC-compiled artifacts.
- **Windows GNU ORT linking**: `bundled` strategy on Windows GNU now uses dynamic linking with pre-downloaded Microsoft ORT (pyke.io has no static binaries for `x86_64-pc-windows-gnu`). Documented ONNX Runtime DLL requirement for Go, Elixir, and C/C++ on Windows.

### Changed

- **PDF text extraction**: Full rewrite from segment-indexed assembly to `page.text().all()` + char-indexed font metadata. Produces cleaner text with correct word spacing.
- **hOCR table reconstruction vendored**: `HocrWord`, `reconstruct_table`, `table_to_markdown` moved from `html-to-markdown-rs::hocr` to `kreuzberg::table_core` module.
- **CLI format flags**: `--format` (`-f`) now supports `text`, `json`, and `toon` wire formats. `--output-format` renamed to `--content-format` (deprecated alias kept with warning). `OutputFormat` enum gains `Custom(String)` variant for extensible format plugins.
- **html-to-markdown-rs v3.0.0**: Switched from git dependency to crates.io release.
- **License policy**: MPL-2.0 and LGPL-2.1 no longer globally allowed — pinned to specific crate exceptions (cbindgen, option-ext, r-efi). Unicode-DFS-2016 allowed for comrak dependency.

### Removed

- **`max_upload_mb` server config field**: Use `max_multipart_field_bytes` (in bytes) instead. The `KREUZBERG_MAX_UPLOAD_SIZE_MB` environment variable is also removed — use `KREUZBERG_MAX_MULTIPART_FIELD_BYTES`.
- **`metadata.additional` legacy insertions**: Pipeline features (chunking, embeddings, language detection, keywords) no longer insert error/status keys into `metadata.additional`. Errors are available via `processing_warnings`. Keywords are in `extracted_keywords`. Embedding status is derivable from chunk embeddings.
- **`derive_content_string` function**: Replaced by `render_plain()` in the rendering module.

---

## [4.6.3] - 2026-03-27

### Added

- **Tower service layer** (`service` module): Composable `ExtractionService` implementing `tower::Service` with configurable middleware layers (tracing, metrics, timeout, concurrency limit). New `tower-service` feature flag, auto-enabled by `api` and `mcp`. `ExtractionServiceBuilder` provides ergonomic layer composition.
- **Semantic OpenTelemetry conventions** (`telemetry` module): Formal `kreuzberg.*` attribute namespace with 30+ span attributes, metric names, and operation/stage constants. Documented conventions for document extraction, pipeline stages, OCR, and model inference telemetry.
- **Extraction metrics**: 11 OTel metric instruments (counters, histograms, gauge) covering extraction totals, durations, cache hits/misses, pipeline stages, OCR, and concurrent extractions. Feature-gated behind `otel`.
- **InstrumentedExtractor wrapper**: Automatic per-extractor tracing spans and metrics without per-extractor annotations. Injected at registry dispatch when `otel` feature is enabled.

### Improved

- **Deeper instrumentation**: Pipeline post-processing stages (Early/Middle/Late), individual processor execution, OCR operations, and RT-DETR layout model inference now have semantic spans and duration metrics.
- **API and MCP servers use ExtractionService**: Both consumers now route extractions through the Tower service stack, getting unified tracing, metrics, and middleware for free.
- **Unified config merge**: JSON config merge logic deduplicated between CLI and MCP into a shared function.
- **API server hardening**: Added response compression (gzip/brotli/zstd), panic recovery, request-ID correlation, and sensitive header redaction via tower-http middleware.

### Changed

- **Removed per-extractor `#[instrument]` annotations**: 29 manual `#[cfg_attr(feature = "otel", tracing::instrument(...))]` annotations replaced by the automatic `InstrumentedExtractor` wrapper.
- **Span attribute names migrated to `kreuzberg.*` namespace**: `extraction.filename` -> `kreuzberg.document.filename`, `extraction.mime_type` -> `kreuzberg.document.mime_type`, etc.

### Fixed

- **EPUB spine semantics refactor** (#594): Richer OPF package model preserves manifest fallback chains, guide references, and non-linear spine items. Navigation chrome stripped from output. Malformed guide references now produce warnings instead of hard failures. Tested for fallback cycles and empty spines.
- **DOCX image extraction for `` with child elements** (#591): Images with high-quality settings (containing `` children) were not extracted because only `Event::Empty` was handled.
Now also handles `Event::Start` for ``.
- **OCR table extraction returned empty results via pipeline path** (#593): Layout detection was gated behind a `needs_structured` check, skipping it for the default `Plain` output format. Tables from `run_ocr_pipeline` were discarded. Both paths now propagate tables correctly.
- **Missing `chunker_type` field in bindings** (#592): Exposed `chunker_type`, `sizing_cache_dir`, and `prepend_heading_context` fields across Python, TypeScript/WASM, Go, C#, PHP bindings.
- **Full API parity across all 10 bindings**: Added `max_archive_depth` to all bindings. Added missing `acceleration`, `email` to Ruby/R. Added `layout` to PHP. Added 7 missing fields to WASM. Fixed parity script regex for Go slice types.
- **`test_pipeline_with_all_features` assertion without `quality` feature**: `quality_score` assertion now gated behind `#[cfg(feature = "quality")]`.
- **Node Windows publish failure**: Prepare script fallback used bash-specific `mkdir -p` and `echo >` which fail on Windows. Replaced with cross-platform `node -e` fallback.
- **CI Validate path triggers too narrow**: Broadened glob patterns to cover `docs/**`, `biome.json`, `.task/**`, and other lintable paths that prek hooks check.
- **Publish pipeline ORT bundling**: Added configurable `strategy` input (`system`/`bundled`) to `setup-onnx-runtime` action. Set `strategy: bundled` for all publish jobs so `ort-bundled` cargo feature takes effect, producing self-contained binaries.

---

## [4.6.2] - 2026-03-26

### Added

- **PDF page rendering API** (#583): New `render_pdf_page` function and `PdfPageIterator` for rendering individual PDF pages as PNG images. Available across all 11 language bindings with idiomatic patterns (Python context manager, Go Close(), Java AutoCloseable, C# IDisposable, Elixir Stream, etc.). Default 150 DPI, configurable per call.
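
As a rough guide to what the 150 DPI default means for output size, the sketch below converts a page size in PDF points (1/72 inch) to pixels. It assumes the renderer scales by `dpi / 72` and rounds to the nearest pixel, which the changelog does not state; `rendered_size_px` is a hypothetical helper, not part of the Kreuzberg API.

```rust
/// PDF default user-space units are points: 72 per inch.
const POINTS_PER_INCH: f64 = 72.0;

/// Hypothetical helper: pixel dimensions of a page rendered at `dpi`.
/// Rounding to the nearest pixel is an assumption made for illustration.
fn rendered_size_px(width_pt: f64, height_pt: f64, dpi: f64) -> (u32, u32) {
    let to_px = |pt: f64| (pt * dpi / POINTS_PER_INCH).round() as u32;
    (to_px(width_pt), to_px(height_pt))
}
```

Under these assumptions, a US Letter page (612 × 792 pt) comes out at 1275 × 1650 pixels at 150 DPI, and doubling the DPI doubles both dimensions.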
### Fixed - **Table recognition coordinate mismatch on scanned PDFs** (#582): Layout detection bboxes (640x640 model space) are now scaled to OCR render resolution before TATR table recognition. Previously, coordinate space mismatch caused zero tables to be found. - **OCR elements report `page_number: 1` for all pages** (#582): Tesseract resets page numbers per single-page render. Page numbers are now correctly stamped after OCR in the batch loop. - **Rust E2E tests missing PDF feature**: Added `pdf` feature to the e2e-generator Rust template, fixing 41 `UnsupportedFormat("application/pdf")` failures. - **HWP styled extraction empty on ARM**: Added `skip_on_platform` support to Python and Java e2e generators, skipping the `hwp_styled` fixture on `aarch64-unknown-linux-gnu`. - **WASM CI build failure**: Made `kreuzberg-node` prepare script resilient to missing native addon, preventing `ENOENT: dist/cli.js` during pnpm workspace install. - **Go C header stale at 4.5.0**: Synced header and `DefaultVersion` constant to match current version. - **Ruby gem missing ONNX Runtime**: Added `ort-bundled` feature to Ruby native Cargo.toml. - **Elixir doctest failures**: Updated `ExtractionConfig.to_map/1` doctests for `force_ocr_pages` field. - **WASM benchmark timeout**: Reduced per-extraction timeout from 600s to 120s and job timeout from 6h to 2h. ### Improved - **`version:sync` now syncs Go C header, DefaultVersion, and Docker compose tags**: Prevents version drift across language bindings. - **Publish pipeline commits Elixir NIF checksums back to main**: Prevents stale checksums after releases. - **WASM test app migrated to Deno**: Replaced Node.js/vitest with Deno test runner, fixing `fetch()` unavailability. - **Docs migrated from MkDocs to Zensical**: 4-5x faster incremental builds. 
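
The coordinate-space fix in #582 comes down to rescaling each layout bounding box from the model's 640x640 space to the OCR render resolution before it is compared with OCR word positions. The following is a minimal sketch of that idea, not the library's implementation: `BBox` and `scale_bbox` are hypothetical names, and it assumes the page was stretched to the model's square input with no letterbox padding.

```rust
/// Axis-aligned bounding box in pixel coordinates (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x1: f64,
    pub y1: f64,
    pub x2: f64,
    pub y2: f64,
}

/// Rescale a box from layout-model space to OCR render space.
/// `model` and `render` are (width, height) in pixels.
/// Assumption: no letterbox padding, so each axis scales independently.
pub fn scale_bbox(b: BBox, model: (f64, f64), render: (f64, f64)) -> BBox {
    // Multiply before dividing so whole-pixel inputs map to whole-pixel outputs
    // whenever the exact result is representable.
    BBox {
        x1: b.x1 * render.0 / model.0,
        y1: b.y1 * render.1 / model.1,
        x2: b.x2 * render.0 / model.0,
        y2: b.y2 * render.1 / model.1,
    }
}
```

With a 2480x3508 render (A4 at 300 DPI), a box covering the full 640x640 model space maps onto the full page. Skipping this step leaves every box confined to the top-left 640x640 pixels of the render, which is why no tables overlapped the OCR words.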
--- ## [4.6.1] - 2026-03-25 ### Added - **Per-file batch extraction timeouts** (#546): New `extraction_timeout_secs` on `ExtractionConfig` (batch-level default) and `timeout_secs` on `FileExtractionConfig` (per-file override). Timeouts apply after semaphore acquisition. New `KreuzbergError::Timeout` variant with `elapsed_ms` and `limit_ms` fields. All binding layers updated. - **Page-level OCR overrides** (#432): New `force_ocr_pages` option (1-indexed) on both `ExtractionConfig` and `FileExtractionConfig`. Enables selective OCR on specific pages of mixed-quality PDFs while preserving native text on others. - **PST extraction support** (#502): Extract emails from Microsoft Outlook PST archives via the `outlook-pst` crate. Iterative depth-first folder traversal with depth cap of 50. Feature-gated under `email`. - **JSONL/NDJSON extraction** (#575): Native `.jsonl`/`.ndjson` extraction via `StructuredExtractor`. Registered as `application/x-ndjson` MIME type. ### Fixed - **OCR elements now propagated to ExtractionResult** (#566): OCR elements with geometry data are collected during extraction and set on `ExtractionResult.ocr_elements`. Hierarchy transformer emits body-level blocks as `NarrativeText` elements with coordinates. OpenAPI schema registers OCR-related types. - **OOM crash on multi-page scanned PDFs** (#570): Replaced pre-rendering all PDF pages into memory with batched rendering. Pages are now rendered and OCR'd in bounded batches, capping peak memory to `batch_size * page` instead of `page_count * page`. - **OCR memory usage reduced 60-78%**: Restructured the OCR batch rendering loop to render-and-encode one page at a time instead of holding all decoded RGB buffers simultaneously. A 98-page scanned PDF dropped from 4.6GB to 1.9GB peak RSS (batch_size=4), and from 3.3GB to 713MB (batch_size=1). Batch size now adapts to available system memory on Linux and macOS. 
- **PDF control character encoding artifacts**: PDFs with broken ToUnicode font mappings that produce U+0002 (STX) and other control characters where hyphens should appear now have these replaced with hyphens when between word characters, or stripped otherwise. Fixes garbled output like `re\x02labelling` → `re-labelling`. - **DocumentStructure missing Heading nodes for PDFs**: `push_heading_group` now inserts a `Heading` child inside each `Group` node (matching DOCX builder behavior). Fallback `add_paragraphs` now detects markdown heading markers and creates heading groups instead of flat paragraphs. - **Layout detection returns empty tables on scanned PDFs** (#574): Three independent bugs caused `result.tables` to always be `[]` for scanned/image-based PDFs: (1) layout detection was gated behind a `needs_structured` output-format check, silently skipping detection for `Plain` (the default); (2) TATR-recognized tables in the OCR path were inlined as markdown text but never converted to `Table` structs; (3) `run_ocr_with_layout` returned only text, discarding table data. All three paths now propagate tables correctly. - **Table recognition coordinate mismatch on scanned PDFs** (#582): Layout detection operates at 640×640 pixels but TATR table recognition and layout-hint classification consumed those coordinates verbatim against OCR-rendered images (e.g. 2480×3508 px at 300 DPI). Bounding boxes never overlapped OCR word positions, producing zero recognized tables and incorrect paragraph-class overrides. Bounding boxes are now scaled from layout-model resolution to the actual OCR render resolution before both `recognize_page_tables` and `detection_to_layout_hints` are called. - **OCR elements report `page_number: 1` for all pages** (#582): The Tesseract backend resets `page_number` to 1 for every single-page render. The page-number is now stamped with the correct 1-indexed page index after collecting each batch page's OCR elements. 
- **PDF layout engine panic on malformed input** (#544): Replaced the panicking `.expect()` inside the thread-local `LayoutEngine` initializer in `layout_runner.rs` with proper `Result`-based error propagation. A failure to initialise the layout engine now returns a descriptive error instead of crashing the host process via FFI (Python, Node, etc.). --- ## [4.6.0] - 2026-03-24 ### Added - **Recursive archive extraction**: Archives (ZIP, TAR, 7Z, GZIP) now recursively extract all processable files, each with its own `ExtractionResult` including `DocumentStructure`, annotations, and metadata. New `ArchiveEntry` type with path, mime type, and nested result. Configurable via `max_archive_depth` (default: 3, set to 0 for legacy single-text behavior). - **YAML/JSON section chunker**: New `ChunkerType::Yaml` variant that splits structured files by keys with full hierarchy paths (e.g., `database > primary > host`). Auto-inferred from extraction metadata — no explicit `chunker_type` needed for YAML/JSON files. - **Unified DocumentStructure DTO**: Extended the `DocumentStructure` model with 7 new node types (`Slide`, `DefinitionList`, `DefinitionItem`, `Citation`, `Admonition`, `RawBlock`, `MetadataBlock`), 4 new annotation kinds (`Highlight`, `Color`, `FontSize`, `Custom`), and format-specific `attributes` bag on every node. - **DocumentStructureBuilder**: Ergonomic builder with heading-driven section nesting, container stack (Quote/Admonition/Slide auto-parenting), and annotation helpers. Replaces hand-constructed `DocumentNode` structs across all extractors. - **Unified rendering module**: `render_to_markdown()` and `render_to_plain()` renderers that walk a `DocumentStructure` tree to produce consistent output with inline annotation rendering, table pipe escaping, and nested list depth support. 
- **DocumentStructure support for all extractors**: Every extractor (35 formats) now natively produces a `DocumentStructure` when `include_document_structure` is enabled: - Office: DOCX (with TextAnnotation from Run formatting, Formula from OMML), PPTX (Slide containers), ODT, DOC, PPT - Markup: HTML (1,100-line tag parser with inline annotations), LaTeX, RST (admonitions, definition lists), OrgMode, Markdown, MDX, Djot, Typst - Books: EPUB (chapter structure from spine), FictionBook (inline formatting annotations) - Scientific: JATS (article structure), DocBook (section hierarchy) - Data: Excel (sheet headings + tables), CSV, DBF, JSON/YAML/TOML, BibTeX (citations), Jupyter (code + markdown cells) - Other: Email (metadata headers), RTF, OPML (outline hierarchy), HWP, iWork (Keynote/Numbers/Pages), XML, Image (OCR text) - **DocBook/JATS inline annotations**: Semantic inline formatting for academic/technical documents — emphasis, bold, code, links, subscript/superscript mapped to `AnnotationKind` variants. - **Document-level OCR**: `OcrBackend` trait supports `process_document()` for whole-file extraction without per-page rasterization. Up to 30% faster on multi-page documents with better context. ### Changed - **CSV extraction for embedding quality**: Produces `Row N: Header: Value` format instead of space-separated when a header row is detected. Programmatic `tables` field unchanged. - **XML extraction for embedding quality**: Indented hierarchical output preserving element tree with attributes inline, blank lines between top-level siblings, and `xmlns:*` filtering. ### Improved - **Zero-copy file I/O**: Automatic memory-mapping for files >1MB via `memmap2` with SIMD-accelerated UTF-8 validation (`simdutf8`). Measurable speed improvement for large PDFs and archives. WASM falls back to heap allocation. - **Unified concurrency management**: Centralized thread budget for Rayon, ONNX, and PaddleOCR with configurable `ConcurrencyConfig`. 
PDF OCR batched in chunks instead of all-at-once, reducing memory footprint on large documents. ### Fixed - **Incorrect page numbers in element-based output** (#557): When `result_format="element_based"` was used without `PageConfig(extract_pages=True)`, all elements received `page_number=1`. Now auto-enables `extract_pages` when element-based output is requested. - **Misleading `PageConfig` docstring** (#558): Updated docstring and type stub to show default constructor first and document interaction with `result_format="element_based"`. - **MSG extraction misses compressed RTF bodies** (#560): Added PR_RTF_COMPRESSED (0x1009) fallback for `.msg` files that store the body only in compressed RTF format. Implements MS-OXRTFCP decompression and RTF-to-plain-text stripping. - **Indexed colour PDF images returned as raw** (#561): Palette-based PDF images now decode correctly. Extracts the colour palette from the PDF dictionary and applies palette lookup to produce valid PNG output instead of unusable raw bytes. - **ODT extraction robustness**: Replaced unwraps with safe fallbacks in ODT parsing. --- ## [4.5.4] - 2026-03-23 ### Added - **Document-level OCR optimization**: The `OcrBackend` trait now supports native `process_document()` for efficient whole-file extraction without rasterizing individual PDF pages to images when the backend supports it (e.g., Python's EasyOCR backend). ### Changed - **OCR protocol clarity**: Renamed `process_file` to `process_image_file` in the OCR backend trait for clearer protocol semantics. - **Python refactoring**: Removed unused loop variable in EasyOCR implementation. - **Dependency optimization**: Dropped redundant tokio multi-thread feature flag. ### Tests - **Backend registry robustness**: Hardened backend registry tests with drop guards and comprehensive mock coverage. ### Added - **PST (Outlook Personal Folders) extraction**: New `PstExtractor` backed by the `outlook-pst` crate.
Traverses the full IPM folder hierarchy iteratively, extracts subject, sender, recipients (TO/CC/BCC), body, and date from every message in the archive. Enabled via the existing `email` feature flag. MIME type: `application/vnd.ms-outlook-pst`. ### Fixed - **PDF image extraction panic on mismatched buffer lengths** (#552): Replaced `assert!` in `pdf/images.rs` with graceful error handling. Malformed PDF images with wrong buffer sizes are now skipped instead of panicking. Regression from v4.5.0. - **`pdf` feature compilation without `layout-detection`** (#550): `config.layout` reference in `extraction.rs` was not behind a `#[cfg(feature = "layout-detection")]` gate, causing compilation errors when `pdf` was enabled without `layout-detection`. - **Unused `table_model` variable warning**: Fixed cfg-gating in `pipeline.rs` so `table_model` parameter is properly handled when `layout-detection` feature is disabled. - **Clippy `too_many_arguments` on `recognize_tables_slanet`**: Added allow attribute for the 8-parameter function in `table_recognition.rs`. - **Ruby binding missing `table_model` field**: Added `table_model` parsing to `LayoutDetectionConfig` initializer in Ruby native extension. - **WASM module resolution in Supabase/Deno edge functions** (#551): Added explicit `package.json` exports for `pkg/kreuzberg_wasm.js` and WASM binary. Extended `wasm-loader.ts` with Deno detection and clear error messaging for restricted edge runtimes. - **`zip` dependency pinned below 7.4**: Avoids let-chain build failures on some stable Rust toolchains (#549). - **Vendored HWP text extraction**: Replaced external `hwpers` crate with vendored subset (~1,650 lines). Eliminates `zip 2.x` transitive dependency that caused WASM and CI Validate build failures. ### Added - **`prepend_heading_context` chunking option**: When `true` and `chunker_type` is `Markdown`, prepends the heading hierarchy path (e.g. `# Title > ## Section`) to each chunk's content string. 
Useful for RAG pipelines where chunks need self-contained structural context. Available across all 10 language bindings, CLI, and WASM. Includes fixture-driven e2e tests and documentation for all languages. --- ## [4.5.3] - 2026-03-22 ### Added - **Apple iWork Format Support**: Native parsing for modern (2013+) `.pages`, `.numbers`, and `.key` files via a new `iwork` feature flag. Uses zero-allocation protobuf text extraction from Snappy-compressed IWA containers. - **SLANeXT table structure recognition models**: Alternative table structure backends alongside TATR. New `table_model` field on `LayoutDetectionConfig` selects the backend. Options: `"tatr"` (default, 30MB), `"slanet_wired"` (365MB, bordered tables), `"slanet_wireless"` (365MB, borderless tables), `"slanet_plus"` (7.78MB, lightweight), `"slanet_auto"` (classifier-routed, ~737MB). Available across all 12 language bindings and CLI (`--layout-table-model`). - **PP-LCNet table classifier**: Automatic wired/wireless table detection for SLANeXT auto mode. Uses center-crop preprocessing with BGR channel order matching PaddleOCR convention. - **CLI `cache warm --all-table-models`**: Opt-in download of SLANeXT model variants (~730MB). Default warm downloads only RT-DETR + TATR. - **ISO 21111-10 benchmark fixture**: Table-heavy ISO standard document with MinerU ground truth for table extraction benchmarking. --- ## [4.5.2] - 2026-03-21 ### Fixed - **PDF word splitting in extracted text**: Pdfium's text extraction inserted spurious spaces mid-word (e.g. `"s hall a b e active"` instead of `"shall be active"`). Added selective page-level respacing: pages with detected broken word spacing are re-extracted using character-level gap analysis (`font_size × 0.33` threshold). Clean pages use the fast single-call path. Reduces garbled lines from 406 to 0 on the ISO 21111-10 test document with no performance impact. - **Markdown underscore escaping**: Underscores in extracted text (e.g. 
`CTC_ARP_01`) were incorrectly escaped as `CTC\_ARP\_01` throughout the markdown output. Underscore escaping has been removed entirely since extracted PDF text contains literal identifiers, not markdown formatting. - **Page header/footer leakage**: Running headers like `ISO 21111-10:2021(E)` and copyright footers leaked into the document body. Added fuzzy alphanumeric matching to detect repeated header/footer text even when spacing or character extraction varies across pages. - **R batch function spurious NULL argument**: R wrapper batch functions passed an extra `NULL` positional argument to native Rust functions, causing "unused argument" errors on all batch operations. - **Elixir Windows ORT DLL staging**: ONNX Runtime DLL was only staged in `target/release/` but not in `priv/native/` where the BEAM VM loads NIFs. OCR/layout/embedding features now work correctly on Windows CI. ### Added - **General extraction result caching**: All file types (PDF, Office, HTML, archives, etc.) are now cached — not just OCR results. Repeated extractions of the same file with the same config return instantly from cache. - **Cache namespace isolation**: New `cache_namespace` field on `ExtractionConfig` enables multi-tenant cache isolation on shared filesystems. Available via `--cache-namespace` CLI flag and across all language bindings. - **Per-request cache TTL**: New `cache_ttl_secs` field on `ExtractionConfig` overrides the global TTL for individual extractions. Set to `0` to skip cache entirely. Available via `--cache-ttl-secs` CLI flag. - **Cache namespace deletion**: `delete_namespace()` removes all cache entries under a namespace. `get_stats_filtered()` returns per-namespace statistics. - **Multi-worker cleanup safety**: Cache cleanup no longer triggers excessively when multiple worker pods share the same cache directory. - **Bundled eng.traineddata**: English OCR works out of the box with zero runtime configuration (~4MB bundled at build time). 
- **Tessdata in `cache warm`**: `kreuzberg-cli cache warm` now downloads all tessdata_fast language files (~120 languages) to `KREUZBERG_CACHE_DIR/tessdata/`, giving full Tesseract language support without system packages. - **Tessdata in `cache manifest`**: `kreuzberg-cli cache manifest` now includes all tessdata files with source URLs, enabling `--sync-cache` to download tessdata alongside models. - **`KREUZBERG_CACHE_DIR/tessdata` resolution**: `resolve_tessdata_path()` now checks `KREUZBERG_CACHE_DIR/tessdata` and the bundled build path before falling back to system paths. Resolution order: `TESSDATA_PREFIX` env → `KREUZBERG_CACHE_DIR/tessdata` → bundled tessdata → system paths. - **CLI `embed` command**: Generate vector embeddings from text via `kreuzberg embed --text "..." --preset balanced`. Supports stdin, multiple texts, JSON/text output. Feature-gated on `embeddings`. - **CLI `chunk` command**: Split text into chunks via `kreuzberg chunk --text "..." --chunk-size 512`. Configurable size, overlap, chunker type, tokenizer model. - **CLI `completions` command**: Generate shell completions for bash, zsh, fish, powershell via `kreuzberg completions `. - **CLI `--log-level` global flag**: Override `RUST_LOG` via `kreuzberg --log-level debug extract doc.pdf`. - **CLI extraction overrides**: 27 flags exposed via `ExtractionOverrides` struct with `#[command(flatten)]`. New flags: `--layout-preset`, `--layout-confidence`, `--acceleration`, `--extract-pages`, `--page-markers`, `--extract-images`, `--target-dpi`, `--pdf-extract-images`, `--pdf-extract-metadata`, `--token-reduction`, `--include-structure`, `--max-concurrent`, `--max-threads`, `--msg-codepage`, `--ocr-auto-rotate`. - **CLI colored output**: Text output uses `anstyle` for colored headers, labels, success values, and dim separators. Respects `NO_COLOR` env var. - **API `POST /detect`**: MIME type detection endpoint via multipart file upload. - **API `GET /version`**: Version info endpoint. 
- **API `GET /cache/manifest`**: Model manifest with checksums and sizes. - **API `POST /cache/warm`**: Eager model download endpoint with embedding preset support. - **MCP `get_version` tool**: Query server version from MCP clients. - **MCP `cache_manifest` tool**: Get model manifest via MCP. - **MCP `cache_warm` tool**: Pre-download models via MCP. - **MCP `embed_text` tool**: Generate embeddings via MCP (feature-gated). - **MCP `chunk_text` tool**: Text chunking via MCP. - **Pipeline table extraction tracing**: Added zero-cost `tracing::trace!` and `tracing::debug!` logging throughout the layout detection and table extraction pipeline for easier debugging. - **TATR model availability check**: Layout detection now returns an error if table regions are detected but the TATR model is unavailable, instead of silently falling back to degraded extraction. - **Publish idempotency checks**: All publish jobs now have re-check steps using `check-registry@v1` before publishing. Added `check-elixir-release` job for GitHub release asset verification. - **ARM benchmark runners**: Benchmark workflows switched to `runner-medium-arm64` for ARM-native performance testing. - **Registry check tool**: `python3 scripts/publish/check_all_registries.py ` checks all 10+ registries and GitHub release assets locally. ### Changed - **CLI batch flags**: Batch command now supports all extraction override flags (chunking, layout, acceleration, etc.) via shared `ExtractionOverrides` struct, matching extract command parity. - **CLI config architecture**: Replaced 13-parameter `apply_extraction_overrides` function with `ExtractionOverrides` struct using `#[command(flatten)]`. Config fields auto-scale as `ExtractionConfig` evolves. - **MCP tool architecture**: Removed dead `tools/` trait-based duplicates; all tools implemented directly in `server.rs`. ### Improved - **CLI validation**: OCR backend values validated (tesseract, paddle-ocr, easyocr). Chunk size/overlap bounds checked. 
DPI range (36-2400) and layout confidence (0.0-1.0) validated. Zero-value `max_concurrent`/`max_threads` rejected. `--chunking-tokenizer` errors when the feature is disabled. - **API validation**: Embedding preset names validated in `/embed`. Chunk `max_characters` bounds checked (1-1M) in `/chunk`. - **MCP validation**: Empty paths rejected in `batch_extract_files`. Chunk `max_characters` bounds checked in `chunk_text`. Embedding preset validated in `embed_text`. - **Chunk overlap auto-clamping**: When `--chunk-size` is smaller than the default overlap, overlap is automatically clamped to `size/4` instead of producing a confusing error. --- ## [4.5.1] - 2026-03-20 ### Fixed - **Java FFI `CBatchResult` struct layout mismatch**: The `count` and `results` fields were swapped in the Java Panama FFM layout, causing all batch extraction operations to fail with memory access errors. - **Go FFI stale C header**: The `CExtractionResult` struct field order in the Go binding's C header did not match the Rust `#[repr(C)]` layout (reordered alphabetically in 4.5.0, added `djot_content_json`). Go read fields at wrong offsets, causing `pages_json` to deserialize `metadata_json` instead. - **FFI `LayoutDetectionConfig` not feature-gated**: The FFI crate unconditionally imported `LayoutDetectionConfig` and exposed `kreuzberg_config_builder_set_layout`, causing compilation failures on targets without the `layout-detection` feature (e.g., `x86_64-pc-windows-gnu`). - **Python wheel builds on Linux aarch64**: OpenSSL library path was hardcoded to `x86_64-linux-gnu` in the manylinux build script, failing on aarch64 runners. Now detects architecture via `uname -m`. - **R batch function signature mismatch**: R wrapper functions were missing the `file_configs` parameter when calling native Rust functions, causing "Expected Scalar, got Language" errors on all batch operations.
- **R package ORT linking**: The R build configuration (`config.R`) did not link against ONNX Runtime when `ORT_LIB_LOCATION` was set, causing `undefined symbol: OrtGetApiBase` at load time. --- ## [4.5.0] - 2026-03-20 ### Added - **ONNX-based document layout detection**: New `layout` config field enables document layout analysis using RT-DETR v2 with 17 element classes. Supports `"fast"` and `"accurate"` presets with auto-downloaded models. Available across all language bindings. - **SLANet table structure recognition**: Detected Table regions are processed by SLANet-plus for neural HTML structure recovery, producing markdown tables with colspan/rowspan support. Now runs on all pages including structure-tree pages (previously skipped). - **Layout-enhanced heading detection**: Layout model SectionHeader and Title regions guide heading detection in both structure tree and heuristic extraction. High-confidence hints (>=0.7) can override font-size-based classification. - **Multi-backend OCR pipeline**: New `OcrPipelineConfig` enables quality-based fallback across OCR backends (e.g., Tesseract then PaddleOCR) with configurable priority, language, and backend-specific settings. - **OCR quality thresholds**: New `OcrQualityThresholds` config with 16 tunable parameters for OCR output quality assessment and fallback decisions. - **OCR auto-rotate**: New `OcrConfig.auto_rotate` flag (default: false) for automatic page rotation detection. Handles 0/90/180/270 degree rotations. - **PaddleOCR v2 model tier system**: New `model_tier` field with `"mobile"` (default, ~21MB, fast) and `"server"` (~172MB, highest accuracy). Both use unified multilingual models (CJK+English in one model). Available across all bindings. - **`AccelerationConfig` for GPU/execution provider control**: Fine-grained control over ONNX execution providers (CPU, CoreML, CUDA, TensorRT) for layout detection and table recognition. Typed across all bindings. 
- **`ConcurrencyConfig` for thread limiting** (#503): New `max_threads` field caps Rayon, ONNX intra-op threads, and batch concurrency to a single limit. Typed across all bindings. - **`EmailConfig` for MSG fallback codepage** (#505): Configurable fallback codepage for MSG files lacking a codepage property (default: windows-1252). Set e.g. `1251` for Cyrillic. Typed across all bindings. - **Per-file extraction configuration (`FileExtractionConfig`)**: Per-file config overrides in batch operations. Each file can specify its own OCR, chunking, output format settings. CLI supports `--file-configs`, MCP supports `file_configs` parameter. - **Opt-in single-column pseudo tables** (#449): New `allow_single_column_tables` on `PdfConfig` (default: false). Allows single-column structured data (glossaries, itemized lists) to be emitted as tables. - **Experimental: `pdf_oxide` text extraction backend** (`pdf-oxide` feature): Pure Rust PDF text extraction as an alternative to pdfium. Opt-in only, not included in `full` feature set. - **CLI `cache warm` command**: Eagerly downloads all PaddleOCR and layout detection models. Supports `--all-embeddings` or `--embedding-model `. Useful for containerized or offline deployments. - **CLI `cache manifest` command**: Outputs a JSON manifest of all expected model files with SHA256 checksums, sizes, and source URLs for scripted cache verification. - **ChunkSizing configuration**: `sizing_type`, `sizing_model`, and `sizing_cache_dir` fields exposed in `ChunkingConfig` across all bindings. - **Chunk heading context**: New `HeadingContext` type in `ChunkMetadata` providing heading level and text. - **`ModelManifestEntry` type and `manifest()` / `ensure_all_models()` methods**: Public API for querying and eagerly downloading model cache manifests. - **SF1 structural quality metrics in benchmark CI**: SF1 quality scores now computed alongside TF1, with PDF-specific quality rankings for tracking extraction quality regressions. 
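
The `ConcurrencyConfig` entry above (#503) describes one `max_threads` value capping several thread pools at once. The following is a rough sketch of how such a budget could be resolved, not the library's code. `ConcurrencyLimits`, `budget_from`, and `resolve_budget` are hypothetical names. The cap of 8 mirrors the CPU auto-detection cap mentioned for OCR batching elsewhere in this changelog. Treating `Some(0)` as "unset" and letting the user limit only lower the budget are assumptions of this sketch.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical stand-in for a config carrying an optional thread limit.
#[derive(Debug, Default, Clone, Copy)]
pub struct ConcurrencyLimits {
    pub max_threads: Option<usize>,
}

/// Pure core: combine a detected CPU count, a hard cap, and an optional user limit.
/// Always returns at least 1.
pub fn budget_from(detected: usize, cap: usize, limits: ConcurrencyLimits) -> usize {
    // Auto budget: detected CPUs, never below 1, never above the hard cap.
    let auto = detected.max(1).min(cap.max(1));
    match limits.max_threads {
        // Assumption: a user limit can only lower the budget, never raise it.
        Some(n) if n > 0 => auto.min(n),
        // Assumption: `None` and `Some(0)` both mean "no explicit limit".
        _ => auto,
    }
}

/// Resolve the budget on the current machine with a hard cap of 8.
pub fn resolve_budget(limits: ConcurrencyLimits) -> usize {
    // Fall back to a single thread if the platform cannot report its parallelism.
    let detected = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    budget_from(detected, 8, limits)
}
```

Keeping the arithmetic in `budget_from` separate from CPU detection makes the policy testable without depending on the host machine. The same resolved number can then be handed to every pool that needs a limit.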
### Changed - **Layout preset default**: Changed from `"fast"` to `"accurate"`. The `Fast` variant has been removed. The `"fast"` string is still accepted for backwards compatibility. - **PaddleOCR default model tier**: Changed from `"server"` to `"mobile"`. Mobile models provide equivalent quality on standard documents while being 3-5x faster. Server tier remains available via `with_model_tier("server")`. - **PaddleOCR v2 models**: All models updated to v2 generation (PP-OCRv5 detection, PP-LCNet classification, unified multilingual recognition). V1 models remain available for older versions. - **Unified multilingual recognition models**: PP-OCRv5 unified server (84MB) and mobile (16.5MB) models replace per-script English and Chinese models. Per-script models retained for 9 other script families. - **Batch API unification**: `_with_configs` batch functions removed; per-file `FileExtractionConfig` is now an optional parameter on the unified batch functions. - **Layout pipeline no longer forces heuristic extraction**: Structure tree extraction proceeds normally when layout detection is enabled, preserving text quality. - **Global ONNX model caching**: Layout detection and SLANet models are cached globally and reused across extractions, avoiding expensive ONNX session recreation in batch scenarios. - **Vendored text embedding pipeline**: Replaced `fastembed` dependency with vendored engine using ONNX Runtime directly for tighter integration. - **Embedding `embed()` now takes `&self` instead of `&mut self`**: Enables parallel embedding generation without mutable reference constraints. - **L2 normalization parallelized**: Embedding batches >= 64 vectors now use multi-threaded normalization. - **`padding` field in PaddleOcrConfig**: Now exposed across Python, TypeScript, Ruby, and Go bindings (previously Rust-only). 
- **Language-agnostic section pattern recognition**: Headings ending with a period are now allowed when they match structural patterns (section symbol, all-caps, numbered sections). Improves heading detection for legal, academic, and multilingual documents. - **Layout classification guards**: Heading overrides from the layout model now have word count limits, punctuation checks, figure label detection, and body-font-size validation to prevent false heading promotions. - **Strong typing across bindings**: Replaced weak `Dictionary`/`Map`/`array` types with strongly typed config classes in C#, Java, and PHP. Added missing config types to Python stubs, Node.js, Ruby, Elixir, and PHP. ### Removed - **`fastembed` dependency**: Replaced by vendored embedding engine using ONNX Runtime directly. - **`EmbeddingModelType::FastEmbed` variant**: Use `Preset` or `Custom` variants instead. ### Fixed - **C# FFI struct layout mismatch** (#538): `CExtractionResult` struct layout between Rust and C# was mismatched, causing deserialization failures and overflow exceptions that made the C# library completely broken in 4.4.6. - **PDF `force_ocr` without explicit OCR config** (#495): `force_ocr=true` was silently ignored when no `ocr` config block was provided. Now unconditionally triggers the OCR pipeline with default settings. - **PDF image extraction** (#511): Extracted images returned raw compressed data instead of properly decoded image bytes. Now automatically decoded and re-encoded as standard formats (PNG/JPEG). - **Node.js `extractFileInWorker` mime_type passthrough** (#523): MIME type was silently injected into PDF password config instead of being forwarded to extraction. Now correctly passed through. - **DOCX parser type inference failure** (#519): The `zip` 8.2.0 dependency introduced type ambiguity in DOCX and XML parsers, causing compilation failures. 
- **Python `py.typed` and `.pyi` missing from sdist**: Type stubs and `py.typed` marker now included in both wheel and sdist formats. - **PDF broken CMap word spacing**: Geometric validation now vetoes false word boundaries in PDFs with broken font CMaps, fixing "co mputer" -> "computer" style errors. - **PDF structure tree heading trust**: Structure tree heading tags (H1-H6) are now trusted as author-intent metadata. Previously, font-size validation rejected valid headings close to body size. - **PDF structure tree extraction performance**: Text and style maps now built in a single pass, eliminating multi-second extraction times on complex pages. - **OCR Picture regions suppressing text**: Layout-detected Picture regions now preserve embedded text as plain paragraphs instead of silently dropping it. - **Non-transitive sort comparators**: Spatial reading-order sorts now use discrete row buckets instead of tolerance-based grouping, ensuring correct and stable ordering. - **Page furniture over-stripping**: Added bulk and per-paragraph guards to prevent aggressive furniture stripping from removing legitimate content. - **`KREUZBERG_CACHE_DIR` not respected by all caches**: Embeddings, OCR result cache, and document extraction cache now honor the environment variable. - **MSG PT_STRING8 encoding**: MSG files now correctly decode ANSI string properties using the declared Windows code page instead of UTF-8 lossy conversion. - **SLANet-Plus ONNX model**: Re-exported with shape fix, resolving inference failures that caused all SLANet table extractions to silently fail on macOS CoreML. - **TATR model panic in batch processing**: Model unavailability in parallel closures caused crashes in FFI callers (Java, C#). Now falls back gracefully to heuristic table extraction. - **Docker musl builds**: Alpine/musl Docker images now link against the system ONNX Runtime library, fixing build failures. All features work in musl CLI images. 
- **FFI batch functions null handling**: C#/Java FFI batch functions now accept NULL for `file_config_jsons` instead of rejecting it.

### Known Issues

- **PHP PIE Windows package temporarily unavailable**: The Windows build for the PHP PIE extension is disabled due to a transitive dependency conflict (`ort-sys` → `lzma-rust2` → `crc` version collision on the `x86_64-pc-windows-gnu` target). Linux and macOS PHP packages are unaffected. Will be resolved when upstream `ort` updates its `lzma-rust2` dependency.
- **WASM: no layout detection, acceleration, or email config**: ONNX Runtime does not support WebAssembly, so layout detection (RT-DETR), hardware acceleration config, and concurrency config are unavailable in the WASM binding. OCR via Tesseract WASM and embeddings are supported.

---

## [4.4.6]

### Added

- **dBASE (.dbf) format support**: Extract table data from dBASE files as markdown tables with field type support.
- **Hangul Word Processor (.hwp/.hwpx) support**: Extract text content from HWP 5.0 documents (standard Korean document format).
- **Office template/macro format variants**: Added support for `.docm`, `.dotx`, `.dotm`, `.dot` (Word), `.potx`, `.potm`, `.pot` (PowerPoint), `.xltx`, `.xlt` (Excel) formats.

### Fixed

- **DOCX image placeholders missing (#484)**: Extracting `.docx` files with `extract_images=True` no longer produced `![](image)` placeholders in the output. The default plain text output path was stripping image references. Image extraction now forces markdown output so placeholders are always included.

### Changed

- **Format count updated to 91+**: Documentation across all READMEs, docs, and package manifests updated to reflect expanded format support (previously 75+).

## [4.4.5]

### Fixed

- **PDF markdown garbles positioned text (#431)**: PDFs with positioned/tabular text (CVs, addresses, data tables) had their line breaks destroyed during paragraph grouping. Added page-level positioned text detection: when fewer than 30% of lines on a page reach the right margin, short lines are split into separate paragraphs to preserve the document's visual structure.
- **Node worker pool password bug**: `extractFileInWorker` was passing the `password` argument as `mime_type` to `extract_file_sync`, meaning passwords were never applied and MIME detection could break. Password is now correctly injected into `config.pdf_options.passwords`.
- **Unused import in kreuzberg-node**: Removed unused `use serde_json::Value` import in `result.rs` that caused clippy warnings.
- **WASM Deno OCR test hang**: OCR tests hung indefinitely on WASM Deno because Tesseract synchronous initialization blocks the single-threaded runtime. OCR fixtures are now skipped for the wasm-deno target.
- **WASM camelCase config deserialization**: JS consumers send camelCase config keys (e.g. `includeDocumentStructure`) but `serde` expects snake_case. Added `camel_to_snake` transform in `parse_config()` so config fields are properly deserialized. Fixes document structure extraction returning empty results via WASM.
- **PHP 8.5 array coercion on macOS**: On PHP 8.5 + macOS, ext-php-rs coerces `#[php_class]` return values to arrays instead of objects. Added `normalizeExtractionResult()` wrapper that transparently converts arrays via `ExtractionResult::fromArray()`.
- **PHP 8.5 support**: Upgraded ext-php-rs to 0.15.6 for PHP 8.5 compatibility.
- **Vendoring scripts missing path deps**: Ruby and R vendoring scripts failed when workspace dependencies use `path` instead of `version`. Added path field handling to `format_dependency()` and kreuzberg-ffi fixup block to the Ruby vendoring script.
- **pdfium-render clippy lints**: Fixed clippy warnings in kreuzberg-pdfium-render crate.

### Added

- **CLI `--pdf-password` flag**: New `--pdf-password` option on `extract` and `batch` commands for encrypted PDF support. Can be specified multiple times.
- **MCP `pdf_password` parameter**: Added `pdf_password` field to `extract_file`, `extract_bytes`, and `batch_extract_files` MCP tool params for better discoverability.
- **API `pdf_password` multipart field**: The HTTP API extract endpoint now accepts a `pdf_password` multipart field for encrypted PDFs.
- **`PdfConfig` Default impl**: Added `Default` implementation for `PdfConfig` to support ergonomic config construction.
- **Binding crate clippy in CI**: Added clippy steps to `ci-node`, `ci-python`, and `ci-wasm` workflows (gated to Linux). Added `node:clippy`, `python:clippy`, and `wasm:clippy` task commands.
- **E2E password-protected PDF fixture**: Added `pdf_password_protected` fixture testing copy-protected PDF extraction across all bindings.

### Changed

- **All binding crates linted in pre-commit**: Removed clippy exclusions for kreuzberg-php, kreuzberg-node, and kreuzberg-wasm from pre-commit config.
- **golangci-lint v2.11.3**: Upgraded from v2.9.0 across Taskfile, CI workflows, and install scripts.

## [4.4.4]

### Fixed

- **CLI test app fixes**: Fixed broken symlinks in CLI test documents, corrected `--format` to `--output-format` flag usage, fixed multipart form field name (`file=` → `files=`) in serve tests, and rewrote MCP test to use JSON-RPC stdin protocol instead of background process detection.
- **Publish idempotency check scripts**: Fixed `check_nuget.sh` and `check-nuget-version.sh` using bash 4+ `${var,,}` syntax incompatible with bash 3.x. Fixed `check_pypi.sh` and `check_packagist.sh` writing to `$GITHUB_OUTPUT` internally instead of stdout (conflicting with workflow-level redirect). Fixed `check-rubygems-version.sh` false negatives for native gems by switching from `gem search` to RubyGems JSON API. Fixed `check-rubygems-version-python.sh` Python operator precedence bug. Fixed `check-maven-version.sh` using unreliable Solr search API instead of direct repo HEAD request. Fixed stderr redirect missing on diagnostic messages in multiple scripts.
- **Node test app version**: Updated Node.js test app to reference v4.4.4 package version.

### Changed

- **CLI install with all features**: CLI test install script now uses `--all-features` flag to enable API server and MCP server subcommands.
- **Publish workflow republish support**: Added `republish` input to publish workflow that deletes and re-creates the tag on current HEAD before publishing, enabling clean retag + full republish.

## [4.4.3]

### Added

- **PDF image placeholder toggle**: New `inject_placeholders` option on `ImageExtractionConfig` (default: `true`). Set to `false` to extract images as data without injecting `![image](...)` references into the markdown content.

### Fixed

- **Token reduction not applied** ([#436](https://github.com/kreuzberg-dev/kreuzberg/issues/436)): Token reduction config was accepted but never executed during extraction. The pipeline now applies `reduce_tokens()` when `token_reduction.mode` is configured.
- **Nested HTML table extraction**: Nested HTML tables now extract correctly with proper cell data and markdown rendering, using the visitor-based table extraction API from html-to-markdown-rs.
- **hOCR plain text output**: hOCR conversion now correctly produces plain text when `OutputFormat::Plain` is requested, instead of silently falling back to Markdown.
- **PDF garbled text for positioned/tabular content** ([#431](https://github.com/kreuzberg-dev/kreuzberg/issues/431)): PDF text extraction now detects X-position gaps between consecutive characters and inserts spaces when the gap exceeds `0.8 × avg_font_size`. Previously, characters placed at specific coordinates without explicit space characters were concatenated without spaces.
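
The X-position gap rule from the #431 fix above can be sketched as follows. This is an illustration only, not Kreuzberg's actual implementation: the `PositionedChar` struct, its fields, and the `join_with_gap_spaces` function are hypothetical names, and the sketch assumes the gap is measured from the right edge of one glyph to the left edge of the next, with the average font size taken over the line.

```rust
/// Hypothetical positioned glyph; the field names are assumptions for illustration.
#[derive(Debug, Clone, Copy)]
struct PositionedChar {
    ch: char,
    x: f32,         // left edge of the glyph
    width: f32,     // advance width of the glyph
    font_size: f32, // font size used to render the glyph
}

/// Joins the characters of one line, inserting a space wherever the horizontal
/// gap between consecutive glyphs exceeds `0.8 * avg_font_size`.
fn join_with_gap_spaces(chars: &[PositionedChar]) -> String {
    if chars.is_empty() {
        return String::new();
    }
    // Average font size over the line (assumed scope of the average).
    let avg_font_size = chars.iter().map(|c| c.font_size).sum::<f32>() / chars.len() as f32;
    let threshold = 0.8 * avg_font_size;

    let mut out = String::with_capacity(chars.len());
    for (i, c) in chars.iter().enumerate() {
        if i > 0 {
            let prev = &chars[i - 1];
            let gap = c.x - (prev.x + prev.width);
            // Only insert a space when neither neighbour is already whitespace.
            if gap > threshold && !prev.ch.is_whitespace() && !c.ch.is_whitespace() {
                out.push(' ');
            }
        }
        out.push(c.ch);
    }
    out
}
```

With 10 pt glyphs of width 5, the threshold is 8 units, so two glyphs drawn 20 units apart are separated by a space while adjacent glyphs are not.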
- **Chunk page metadata drift with overlap** ([#439](https://github.com/kreuzberg-dev/kreuzberg/issues/439)): Chunk byte offsets are now computed via pointer arithmetic from the source text, fixing cumulative drift that caused chunks to report incorrect page numbers when overlap is enabled.
- **Node.js metadata casing**: Standardized all `Metadata` and `EmailMetadata` fields to `camelCase` (e.g., `pageCount`, `creationDate`, `fromEmail`) in the Node.js/TypeScript bindings. Also corrected pluralization for `authors` and `keywords`.
- **WASM build failure on Windows CI**: CMake try-compile checks on Windows used the host MSVC compiler (`cl.exe`), which rejected GCC/Clang flags like `-Wno-implicit-function-declaration`. Added `CMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY` to both `build_leptonica_wasm` and `build_tesseract_wasm` to skip linking during cross-compilation checks.
- **WASM OCR build panic when `git`/`patch` unavailable**: The tesseract WASM patch (`tesseract.diff`) application panicked when both `git apply` and `patch` commands failed. Added programmatic C++ source fixups as a fallback, applying all necessary changes (CPUID guard, `pixa_debug_` unique_ptr conversion, source list trimming) via string replacement when the diff patch cannot be applied.

## [4.4.2]

### Fixed

- **E2E element type assertions**: Fixed element type field name in E2E generator templates for Python, TypeScript, WASM Deno, Elixir, Ruby, PHP, and C#. Each binding uses different casing conventions (Python: dict key `element_type`, TypeScript/Node: `elementType` via NAPI camelCase, Elixir: atom-to-string conversion, C#: JSON serialization for snake_case wire value).
- **Ruby PDF annotation extraction**: Fixed `PdfAnnotation` and `PdfAnnotationBoundingBox` classes not being registered in the autoload list, causing `NameError` when extracting PDF annotations. Also fixed bounding box field name mismatch between Rust output (`x0/y0/x1/y1`) and Ruby struct (`left/top/right/bottom`).
- **Ruby cyclomatic complexity**: Refactored `build_annotation_bbox` in result.rb to extract repeated field lookup pattern, reducing cyclomatic complexity below threshold.
- **WASM OCR blocking event loop**: The `ocrRecognize()` function in the WASM package was running synchronously on the main thread, blocking the Node.js event loop during image decoding and Tesseract OCR processing. This prevented timeouts and other async operations from firing while OCR was in progress. OCR now runs in a worker thread (Node.js `worker_threads` / browser `Web Worker`), keeping the main thread responsive.
- **JPEG 2000 OCR decode failure**: JPEG 2000 images (jp2, jpx, jpm, mj2) and JBIG2 images failed with "The image format could not be determined" during PaddleOCR and WASM OCR because these code paths used the standard `image` crate which doesn't support JPEG 2000. A shared `load_image_for_ocr()` helper now detects JP2/J2K/JBIG2 formats by magic bytes and uses `hayro-jpeg2000`/`hayro-jbig2` decoders across all OCR backends. The `ocr-wasm` feature now includes these decoders (pure Rust, WASM-compatible).
- **WASM PDF empty content**: `initWasm()` fired off PDFium initialization asynchronously without awaiting it, causing a race condition where PDF extraction could start before PDFium was ready, returning empty content. PDFium initialization is now properly awaited during `initWasm()`.

### Added

- **OMML-to-LaTeX math conversion for DOCX**: Mathematical equations in DOCX files (Office Math Markup Language) are now converted to LaTeX notation instead of being rendered as concatenated Unicode text. Supports superscripts, subscripts, fractions (`\frac`), radicals (`\sqrt`), n-ary operators (`\sum`, `\int`), delimiters, function names, accents, equation arrays, limits, bars, border boxes, matrices, and pre-sub-superscripts. Display math uses `$$...$$` and inline math uses `$...$` in markdown output. Plain text output includes raw LaTeX without delimiters.
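
For the JPEG 2000 / JBIG2 decode fix listed under Fixed above, detection by magic bytes can be sketched roughly as below. This is a hedged illustration, not the body of `load_image_for_ocr()`: the `SpecialFormat` enum and `detect_special_format` function are hypothetical names, and only the standard file signatures (JP2 signature box, raw J2K codestream SOC+SIZ markers, JBIG2 file header ID) are checked.

```rust
/// Formats that need a dedicated decoder (hypothetical enum for illustration).
#[derive(Debug, PartialEq, Eq)]
enum SpecialFormat {
    Jp2,   // JPEG 2000 family file starting with the JP2 signature box
    J2k,   // raw JPEG 2000 codestream
    Jbig2, // JBIG2 file
}

// JP2 signature box: length 12, type "jP  ", content 0D 0A 87 0A.
const JP2_SIGNATURE: [u8; 12] = [
    0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, 0x87, 0x0A,
];
// Raw codestream: SOC marker (FF 4F) followed by SIZ marker (FF 51).
const J2K_SOC_SIZ: [u8; 4] = [0xFF, 0x4F, 0xFF, 0x51];
// JBIG2 file header ID string.
const JBIG2_ID: [u8; 8] = [0x97, 0x4A, 0x42, 0x32, 0x0D, 0x0A, 0x1A, 0x0A];

/// Returns the special format indicated by the leading bytes, or `None`
/// when the data should go through the general-purpose image decoder.
fn detect_special_format(bytes: &[u8]) -> Option<SpecialFormat> {
    if bytes.starts_with(&JP2_SIGNATURE) {
        Some(SpecialFormat::Jp2)
    } else if bytes.starts_with(&J2K_SOC_SIZ) {
        Some(SpecialFormat::J2k)
    } else if bytes.starts_with(&JBIG2_ID) {
        Some(SpecialFormat::Jbig2)
    } else {
        None
    }
}
```

A caller would branch on the result before handing the bytes to a general image decoder, which is why sniffing the header is preferable to trusting the file extension.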
- **Plain text output paths for all extractors**: When `OutputFormat::Plain` or `OutputFormat::Structured` is requested, DOCX, PPTX, ODT, FB2, DocBook, RTF, and Jupyter extractors now produce clean plain text without markdown syntax (`#`, `**`, `|`, `![](image)`, `-`, etc.). Previously these extractors always emitted markdown regardless of the requested output format.
  - **DOCX**: `Document::to_plain_text()` skips heading prefixes, inline formatting markers, image placeholders, and renders footnotes/endnotes as `id: text` instead of `[^id]: text`.
  - **PPTX**: `ContentBuilder` respects `plain` mode: skips `#` title prefix, image markers, list markers, and uses `Notes:` instead of `### Notes:`.
  - **ODT**: Heading prefixes (`#`), list markers (`-`), and pipe-delimited tables conditionally omitted for plain text.
  - **FB2/FictionBook**: Inline markers (`*`, `**`, `` ` ``, `~~`), heading prefixes, and cite prefixes skipped for plain text.
  - **DocBook**: Section title prefixes, code fences, list markers, blockquote prefixes, bold figure captions, and pipe tables all conditionally omitted.
  - **RTF**: Table output in result string uses tab separation instead of pipe-delimited markdown. Image `![image](...)` markers omitted for plain text.
  - **Jupyter**: Skips `text/markdown` and `text/html` output types in plain mode, preferring `text/plain`.
- **`cells_to_text()` shared utility**: Tab-separated plain text table formatter alongside existing `cells_to_markdown()`. Used by DOCX, PPTX, ODT, RTF, and DocBook extractors for plain text table rendering.

### Changed

- **CLI includes all features**: `kreuzberg-cli` now depends on `kreuzberg` with the `full` feature set instead of a separate `cli` subset. The `cli` feature group has been removed from `kreuzberg`. This ensures the CLI supports all formats including archives (7z, tar, gz, zip).

### Fixed

- **Alpine/musl CLI Docker image**: Fixed "Dynamic loading not supported" error when running `kreuzberg-cli` in Alpine containers. The CLI binary is now dynamically linked against musl libc, enabling runtime library loading for PDF processing.
- **R package Windows installation**: Improved Python detection in configure script for Windows environments (added `py` launcher and `RETICULATE_PYTHON` support). Symlink extraction errors during source package installation are now handled gracefully.
- **PHP 8.5 precompiled extension binaries**: Added PHP 8.5 support alongside existing PHP 8.4 in CI and release workflows.
- **OCR DPI normalization**: The `normalize_image_dpi()` preprocessing logic is now integrated into the OCR pipeline. Images are normalized to the configured target DPI before being passed to Tesseract, and the calculated DPI is set via `set_source_resolution()`. This eliminates the "Estimating resolution as ..." warning and improves OCR accuracy for images with non-standard DPI.
- **HTML metadata extraction with Plain output**: Fixed HTML metadata (headers, links, images, structured data) not being collected when using `OutputFormat::Plain` (the default). The underlying library's plain text fast path skips metadata extraction; kreuzberg now uses Markdown format internally for metadata collection and converts to plain text separately.
- **PPTX text run spacing**: Adjacent text runs within paragraphs are now joined with smart spacing instead of being concatenated directly ("HelloWorld" → "Hello World").
- **CSV Shift-JIS/cp932 encoding detection**: `encoding_rs` is now a non-optional dependency. CSV files with Shift-JIS encoding are correctly decoded instead of producing mojibake. Fallback encoding detection tries common encodings (Shift-JIS, cp932, windows-1252, iso-8859-1, gb18030, big5).
- **EML multipart body extraction**: All text/html body parts are now extracted by iterating over all indices instead of only index 0. Nested `message/rfc822` parts in multipart/digest are recursively extracted.
- **EPUB media tag leakage**: `