Nomad changes

2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions
--- a/scripts/ci/README.md
+++ b/scripts/ci/README.md
@@ -0,0 +1,242 @@
+# CI Workflow Scripts
+
+This directory contains extracted scripts from GitHub Actions CI workflows, organized by workflow type.
+
+## Overview
+
+- **Total Scripts**: 41 (27 Bash + 14 PowerShell)
+- **Documentation**: See `SCRIPT_MAPPING.md` for detailed workflow-to-script mapping
+- **All Scripts**: Production-ready with proper error handling and documentation
+
+## Directory Structure
+
+```text
+scripts/ci/
+├── README.md               ← This file
+├── SCRIPT_MAPPING.md       ← Detailed workflow-to-script mapping guide
+├── docker/                 ← Docker image build and test scripts
+├── go/                     ← Go bindings scripts
+├── java/                   ← Java bindings scripts
+├── node/                   ← Node/TypeScript NAPI scripts
+├── python/                 ← Python wheel build scripts
+├── ruby/                   ← Ruby gem build scripts
+├── rust/                   ← Rust core and CLI scripts
+├── csharp/                 ← C# bindings scripts
+└── validate/               ← Validation and linting scripts
+```
+
+## Quick Start
+
+### Running a Script
+
+**Bash scripts:**
+
+```bash
+./scripts/ci/docker/build-image.sh core
+./scripts/ci/python/run-tests.sh true
+```
+
+**PowerShell scripts:**
+
+```powershell
+& ./scripts/ci/go/build-ffi.ps1
+& ./scripts/ci/rust/package-cli-windows.ps1 -Target "x86_64-pc-windows-msvc"
+```
+
+### Sourcing Scripts
+
+For library path setup scripts:
+
+```bash
+source ./scripts/lib/library-paths.sh
+setup_all_library_paths
+./scripts/ci/python/run-tests.sh true
+```
+
+## Scripts by Workflow
+
+### Docker (`docker/`)
+
+- `free-disk-space.sh` - Clean up CI disk space
+- `build-image.sh` - Build Docker image variant
+- `check-image-size.sh` - Validate image size constraints
+- `save-image.sh` - Save Docker image as tar.gz artifact
+- `collect-logs.sh` - Collect container logs on failure
+- `cleanup.sh` - Clean up Docker resources
+- `summary.sh` - Print test summary
+
+### Go (`go/`)
+
+- `build-ffi.sh` - Build FFI library (Unix)
+- `build-ffi.ps1` - Build FFI library (Windows)
+- `build-bindings.sh` - Build Go bindings with CGO (Unix)
+- `build-bindings.ps1` - Build Go bindings with CGO (Windows)
+- `reorganize-libraries.ps1` - Reorganize FFI libraries for Windows
+- `run-tests.sh` - Run Go tests with library paths
+
+### Java (`java/`)
+
+- `build-java.sh` - Build Java bindings with Maven
+- `run-tests.sh` - Run Java tests with Maven
+
+### Node/TypeScript (`node/`)
+
+- `build-napi.sh` - Build NAPI bindings with artifact collection
+- `unpack-bindings.sh` - Unpack and install bindings from tarball
+
+### Python (`python/`)
+
+- `clean-artifacts.sh` - Clean previous wheel artifacts
+- `smoke-test-wheel.sh` - Test wheel installation
+- `install-wheel.sh` - Install platform-specific wheel
+- `run-tests.sh` - Run tests with optional coverage
+
+### Ruby (`ruby/`)
+
+- `install-ruby-deps.sh` - Install bundle dependencies (Unix)
+- `install-ruby-deps.ps1` - Install bundle dependencies (Windows)
+- `vendor-kreuzberg-core.py` - Vendor core crate for packaging
+- `configure-bindgen-windows.ps1` - Configure bindgen headers (Windows)
+- `configure-tesseract-windows.ps1` - Configure Tesseract (Windows)
+- `build-gem.sh` - Build Ruby gem
+- `install-gem.sh` - Install built gem
+- `compile-extension.sh` - Compile native extension
+- `run-tests.sh` - Run RSpec tests
+
+### Rust (`rust/`)
+
+- `configure-bindgen-windows.ps1` - Configure bindgen headers (Windows)
+- `run-unit-tests.sh` - Run Rust unit tests
+- `package-cli-unix.sh` - Package CLI as tar.gz (Unix)
+- `package-cli-windows.ps1` - Package CLI as zip (Windows)
+- `test-cli-unix.sh` - Test CLI binary (Unix)
+- `test-cli-windows.ps1` - Test CLI binary (Windows)
+
+### C# (`csharp/`)
+
+- `build-csharp.sh` - Build C# bindings with dotnet
+- `run-tests.sh` - Run C# tests with dotnet
+
+### Validate (`validate/`)
+
+- `run-lint.sh` - Run all linting and validation checks via Task
+
+## Features
+
+### Error Handling
+
+- All Bash scripts use `set -euo pipefail`
+- All PowerShell scripts use `Set-StrictMode` and error action preferences
+- Proper exit codes and error messages
+- Usage information for incorrect arguments
+
+### Documentation
+
+- Every script has a descriptive header
+- Purpose and usage clearly stated
+- Which CI workflow step uses it
+- Argument documentation
+
+### Platform Support
+
+- Windows-specific operations via PowerShell (.ps1)
+- Unix operations via Bash (.sh)
+- Cross-platform scripts detect OS and adjust behavior
+- Library path setup scripts handle Windows/Linux/macOS
+
+### Reusability
+
+- `library-paths.sh` (`scripts/lib/`) - Shared by all workflows for native library configuration
+- `configure-bindgen-windows.ps1` used by Ruby and Rust
+- Common patterns consolidated into single scripts
+
+## Detailed Documentation
+
+For comprehensive workflow-to-script mapping and usage examples, see `SCRIPT_MAPPING.md`.
+
+## Usage in Workflows
+
+### Example: ci-docker.yaml
+
+**Before (inline commands):**
+
+```yaml
+- name: Free up disk space
+  run: |
+    echo "=== Initial disk space ==="
+    df -h /
+    echo "=== Removing unnecessary packages ==="
+    sudo rm -rf /usr/share/dotnet
+    # ... 30+ lines of commands ...
+```
+
+**After (using script):**
+
+```yaml
+- name: Free up disk space
+  run: ./scripts/ci/docker/free-disk-space.sh
+```
+
+### Example: ci-python.yaml
+
+**Before (inline commands):**
+
+```yaml
+- name: Run Python tests
+  run: |
+    cd packages/python
+    if [ "${{ matrix.coverage }}" = "true" ]; then
+      uv run pytest -vv --cov=kreuzberg --cov-report=lcov:coverage.lcov ...
+    else
+      uv run pytest -vv --reruns 1 --reruns-delay 1
+    fi
+```
+
+**After (using script):**
+
+```yaml
+- name: Run Python tests
+  run: ./scripts/ci/python/run-tests.sh ${{ matrix.coverage }}
+```
+
+## Testing Scripts Locally
+
+You can test scripts locally before running in CI:
+
+```bash
+# Test Docker scripts
+./scripts/ci/docker/free-disk-space.sh
+
+# Test Python scripts
+./scripts/ci/python/clean-artifacts.sh
+./scripts/ci/python/run-tests.sh false
+
+# Test Rust scripts
+./scripts/ci/rust/run-unit-tests.sh
+```
+
+## Shell Compatibility
+
+- **Bash scripts**: Compatible with bash 3.2+ (macOS) and bash 4.0+ (Linux)
+- **PowerShell scripts**: Compatible with PowerShell 5.1+ (Windows) and PowerShell Core 7+ (cross-platform)
+
+## Contributing
+
+When adding new CI steps or modifying existing ones:
+
+1. Extract the inline script into a separate file in the appropriate directory
+2. Add proper error handling (`set -euo pipefail` for bash)
+3. Include descriptive header comments
+4. Update `SCRIPT_MAPPING.md` with the new mapping
+5. Test the script locally before committing
+
+## Maintenance
+
+Scripts should be reviewed and updated when:
+
+- Updating CI workflow logic
+- Changing build tools or versions
+- Improving error handling
+- Adding new platform support
+
+See each script's header for detailed documentation on its purpose and usage.
--- a/scripts/ci/actions/setup-onnx-runtime/linux.sh
+++ b/scripts/ci/actions/setup-onnx-runtime/linux.sh
@@ -0,0 +1,90 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+ort_version="${1:?ort-version required}"
+dest_dir="${2:-crates/kreuzberg-node}"
+arch_id="${3:-}"
+strategy="${4:-system}"
+
+extract_dir="$RUNNER_TEMP/onnxruntime"
+
+if [ -z "$arch_id" ]; then
+  case "$(uname -m)" in
+  x86_64 | amd64) arch_id="x64" ;;
+  arm64 | aarch64) arch_id="arm64" ;;
+  *)
+    echo "Unsupported Linux architecture: $(uname -m)" >&2
+    exit 1
+    ;;
+  esac
+fi
+
+case "$arch_id" in
+x64)
+  ort_dir_name="onnxruntime-linux-x64-${ort_version}"
+  archive="onnxruntime-linux-x64-${ort_version}.tgz"
+  ;;
+arm64)
+  ort_dir_name="onnxruntime-linux-aarch64-${ort_version}"
+  archive="onnxruntime-linux-aarch64-${ort_version}.tgz"
+  ;;
+*)
+  echo "Unsupported Linux arch-id: $arch_id" >&2
+  exit 1
+  ;;
+esac
+
+if [ ! -d "$extract_dir/$ort_dir_name" ]; then
+  echo "Cache miss: Downloading ONNX Runtime ${ort_version}"
+  curl -fsSL --retry 5 --retry-delay 5 --retry-all-errors -o "$RUNNER_TEMP/$archive" "https://github.com/microsoft/onnxruntime/releases/download/v${ort_version}/$archive"
+  mkdir -p "$extract_dir"
+  tar -xzf "$RUNNER_TEMP/$archive" -C "$extract_dir"
+else
+  echo "Cache hit: Using cached ONNX Runtime ${ort_version}"
+fi
+
+ort_root="$extract_dir/$ort_dir_name"
+
+if [ ! -d "$ort_root/lib" ]; then
+  echo "ERROR: ONNX Runtime lib directory missing at $ort_root/lib" >&2
+  echo "Available directories:" >&2
+  ls -la "$extract_dir" >&2 || true
+  exit 1
+fi
+
+if ! ls "$ort_root/lib"/*.so* 1>/dev/null 2>&1; then
+  echo "ERROR: No ONNX Runtime libraries found in $ort_root/lib" >&2
+  echo "Directory contents:" >&2
+  ls -la "$ort_root/lib" >&2 || true
+  exit 1
+fi
+
+dest="$GITHUB_WORKSPACE/$dest_dir"
+mkdir -p "$dest"
+cp -f "$ort_root/lib/"*.so* "$dest/"
+
+if [ -n "${RUSTFLAGS:-}" ]; then
+  rustflags="$RUSTFLAGS -L $ort_root/lib"
+else
+  rustflags="-L $ort_root/lib"
+fi
+
+if [ "$strategy" = "bundled" ]; then
+  echo "Using bundled ORT strategy — letting ort-sys download-binaries handle static linking"
+  {
+    echo "LD_LIBRARY_PATH=$ort_root/lib:$dest:${LD_LIBRARY_PATH:-}"
+    echo "LIBRARY_PATH=$ort_root/lib:$dest:${LIBRARY_PATH:-}"
+  } >>"$GITHUB_ENV"
+else
+  {
+    ort_lib=$(find "$ort_root/lib" -name "libonnxruntime*.so*" -print -quit)
+    echo "ORT_LIB_LOCATION=$ort_root/lib"
+    echo "ORT_PREFER_DYNAMIC_LINK=1"
+    echo "ORT_SKIP_DOWNLOAD=1"
+    echo "ORT_STRATEGY=system"
+    echo "ORT_DYLIB_PATH=$ort_root/lib/${ort_lib##*/}"
+    echo "LD_LIBRARY_PATH=$ort_root/lib:$dest:${LD_LIBRARY_PATH:-}"
+    echo "LIBRARY_PATH=$ort_root/lib:$dest:${LIBRARY_PATH:-}"
+    echo "RUSTFLAGS=$rustflags"
+  } >>"$GITHUB_ENV"
+fi
--- a/scripts/ci/actions/setup-onnx-runtime/macos.sh
+++ b/scripts/ci/actions/setup-onnx-runtime/macos.sh
@@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+ort_version="${1:?ort-version required}"
+dest_dir="${2:-crates/kreuzberg-node}"
+arch_id="${3:-}"
+strategy="${4:-system}"
+
+extract_dir="$RUNNER_TEMP/onnxruntime"
+
+if [ -z "$arch_id" ]; then
+  arch="$(uname -m)"
+  if [ "$arch" = "arm64" ]; then
+    arch_id="arm64"
+  else
+    arch_id="x64"
+  fi
+fi
+
+case "$arch_id" in
+arm64) ort_arch="arm64" ;;
+x64) ort_arch="x86_64" ;;
+*)
+  echo "Unsupported macOS arch-id: $arch_id" >&2
+  exit 1
+  ;;
+esac
+echo "Using macOS ONNX Runtime arch: $ort_arch"
+
+if [ ! -d "$extract_dir/onnxruntime-osx-${ort_arch}-${ort_version}" ]; then
+  echo "Cache miss: Downloading ONNX Runtime ${ort_version} for macOS ${ort_arch}"
+  archive="onnxruntime-osx-${ort_arch}-${ort_version}.tgz"
+  curl -fsSL --retry 5 --retry-delay 5 --retry-all-errors -o "$RUNNER_TEMP/$archive" "https://github.com/microsoft/onnxruntime/releases/download/v${ort_version}/$archive"
+  mkdir -p "$extract_dir"
+  tar -xzf "$RUNNER_TEMP/$archive" -C "$extract_dir"
+else
+  echo "Cache hit: Using cached ONNX Runtime ${ort_version}"
+fi
+
+ort_root="$extract_dir/onnxruntime-osx-${ort_arch}-${ort_version}"
+
+if [ ! -d "$ort_root/lib" ]; then
+  echo "ERROR: ONNX Runtime lib directory missing at $ort_root/lib" >&2
+  echo "Available directories:" >&2
+  ls -la "$extract_dir" >&2 || true
+  exit 1
+fi
+
+if ! ls "$ort_root/lib"/libonnxruntime*.dylib 1>/dev/null 2>&1; then
+  echo "ERROR: No ONNX Runtime libraries found in $ort_root/lib" >&2
+  echo "Directory contents:" >&2
+  ls -la "$ort_root/lib" >&2 || true
+  exit 1
+fi
+
+dest="$GITHUB_WORKSPACE/$dest_dir"
+mkdir -p "$dest"
+cp -f "$ort_root/lib/"libonnxruntime*.dylib "$dest/"
+
+if [ -n "${RUSTFLAGS:-}" ]; then
+  rustflags="$RUSTFLAGS -L $ort_root/lib"
+else
+  rustflags="-L $ort_root/lib"
+fi
+
+if [ "$strategy" = "bundled" ]; then
+  echo "Using bundled ORT strategy — letting ort-sys download-binaries handle static linking"
+  {
+    echo "DYLD_LIBRARY_PATH=$ort_root/lib:$dest:${DYLD_LIBRARY_PATH:-}"
+    echo "DYLD_FALLBACK_LIBRARY_PATH=$ort_root/lib:$dest:${DYLD_FALLBACK_LIBRARY_PATH:-}"
+    echo "LIBRARY_PATH=$ort_root/lib:$dest:${LIBRARY_PATH:-}"
+  } >>"$GITHUB_ENV"
+else
+  {
+    ort_lib=$(find "$ort_root/lib" -name "libonnxruntime*.dylib" -print -quit)
+    echo "ORT_LIB_LOCATION=$ort_root/lib"
+    echo "ORT_PREFER_DYNAMIC_LINK=1"
+    echo "ORT_SKIP_DOWNLOAD=1"
+    echo "ORT_STRATEGY=system"
+    echo "ORT_DYLIB_PATH=$ort_root/lib/${ort_lib##*/}"
+    echo "DYLD_LIBRARY_PATH=$ort_root/lib:$dest:${DYLD_LIBRARY_PATH:-}"
+    echo "DYLD_FALLBACK_LIBRARY_PATH=$ort_root/lib:$dest:${DYLD_FALLBACK_LIBRARY_PATH:-}"
+    echo "LIBRARY_PATH=$ort_root/lib:$dest:${LIBRARY_PATH:-}"
+    echo "RUSTFLAGS=$rustflags"
+  } >>"$GITHUB_ENV"
+fi
--- a/scripts/ci/actions/setup-onnx-runtime/windows.ps1
+++ b/scripts/ci/actions/setup-onnx-runtime/windows.ps1
@@ -0,0 +1,100 @@
+$OrtVersion = $args[0]
+if ([string]::IsNullOrWhiteSpace($OrtVersion)) { throw "Usage: windows.ps1 <ortVersion> [destDir] [archId] [strategy]" }
+
+$DestDir = if ($args.Count -ge 2 -and -not [string]::IsNullOrWhiteSpace($args[1])) { $args[1] } else { "crates/kreuzberg-node" }
+$ArchId = if ($args.Count -ge 3) { $args[2] } else { "" }
+$Strategy = if ($args.Count -ge 4 -and -not [string]::IsNullOrWhiteSpace($args[3])) { $args[3] } else { "system" }
+
+$ExtractRoot = Join-Path $env:TEMP "onnxruntime"
+if ([string]::IsNullOrWhiteSpace($ArchId)) {
+  $ArchId = $env:RUNNER_ARCH
+}
+$ArchId = $ArchId.ToLowerInvariant()
+if ($ArchId -eq "arm64") { $ArchId = "arm64" } else { $ArchId = "x64" }
+
+$OrtRoot = Join-Path $ExtractRoot "onnxruntime-win-$ArchId-$OrtVersion"
+$OrtBin = Join-Path $OrtRoot 'bin'
+$OrtLib = Join-Path $OrtRoot 'lib'
+
+if (-Not (Test-Path $OrtRoot)) {
+  Write-Host "Cache miss: Downloading ONNX Runtime $OrtVersion"
+  $Archive = "onnxruntime-win-$ArchId-$OrtVersion.zip"
+  $DownloadPath = Join-Path $env:TEMP $Archive
+  Invoke-WebRequest -Uri "https://github.com/microsoft/onnxruntime/releases/download/v$OrtVersion/$Archive" -OutFile $DownloadPath -UseBasicParsing -MaximumRetryCount 5 -RetryIntervalSec 5
+  New-Item -ItemType Directory -Path $ExtractRoot -Force | Out-Null
+  Expand-Archive -Path $DownloadPath -DestinationPath $ExtractRoot -Force
+} else {
+  Write-Host "Cache hit: Using cached ONNX Runtime $OrtVersion"
+}
+
+if (!(Test-Path $OrtLib)) {
+  Write-Error "ERROR: ONNX Runtime lib directory missing at $OrtLib"
+  Get-ChildItem -Path $ExtractRoot -Recurse | Write-Host
+  exit 1
+}
+
+$LibFiles = @(Get-ChildItem -Path $OrtLib -Filter "*.lib" -ErrorAction SilentlyContinue)
+if ($LibFiles.Count -eq 0) {
+  Write-Error "ERROR: No ONNX Runtime library files found in $OrtLib"
+  Get-ChildItem -Path $OrtLib | Write-Host
+  exit 1
+}
+
+$DllDirs = @()
+foreach ($Candidate in @($OrtLib, $OrtBin)) {
+  if (Test-Path $Candidate) {
+    $CandidateDlls = @(Get-ChildItem -Path $Candidate -Filter "*.dll" -File -ErrorAction SilentlyContinue)
+    if ($CandidateDlls.Count -gt 0) {
+      $DllDirs += $Candidate
+    }
+  }
+}
+if ($DllDirs.Count -eq 0) {
+  $OrtDll = Get-ChildItem -Path $OrtRoot -Recurse -Filter "onnxruntime.dll" -File -ErrorAction SilentlyContinue | Select-Object -First 1
+  if ($OrtDll) { $DllDirs += $OrtDll.DirectoryName }
+}
+if ($DllDirs.Count -eq 0) {
+  $AnyDll = Get-ChildItem -Path $OrtRoot -Recurse -Filter "*.dll" -File -ErrorAction SilentlyContinue | Select-Object -First 1
+  if ($AnyDll) { $DllDirs += $AnyDll.DirectoryName }
+}
+$DllDirs = $DllDirs | Select-Object -Unique
+if ($DllDirs.Count -eq 0) {
+  Write-Error "ERROR: No ONNX Runtime runtime DLLs found under $OrtRoot"
+  Get-ChildItem -Path $OrtRoot -Recurse | Write-Host
+  exit 1
+}
+
+$Dest = Join-Path $env:GITHUB_WORKSPACE $DestDir
+New-Item -ItemType Directory -Path $Dest -Force | Out-Null
+Copy-Item -Path (Join-Path $OrtLib '*') -Destination $Dest -Force
+foreach ($Dir in $DllDirs) {
+  Copy-Item -Path (Join-Path $Dir '*.dll') -Destination $Dest -Force
+}
+
+$RustFlags = if ($env:RUSTFLAGS) { "$env:RUSTFLAGS -L $OrtLib" } else { "-L $OrtLib" }
+
+if ($Strategy -eq "bundled") {
+  # ort-sys has no prebuilt static binaries for x86_64-pc-windows-gnu (MSYS2/MinGW).
+  # Use the pre-downloaded Microsoft ORT with dynamic linking for Windows GNU targets.
+  Write-Host "Using bundled ORT strategy (Windows) - dynamic linking against pre-downloaded ORT (no static binaries for windows-gnu)"
+  @(
+    "ORT_LIB_LOCATION=$OrtLib"
+    "ORT_PREFER_DYNAMIC_LINK=1"
+    "RUSTFLAGS=$RustFlags"
+    "LIB=$OrtLib;$env:LIB"
+    "LIBRARY_PATH=$OrtLib;$env:LIBRARY_PATH"
+    "PATH=$Dest;$env:PATH"
+  ) | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append
+} else {
+  @(
+    "ORT_LIB_LOCATION=$OrtLib"
+    "ORT_PREFER_DYNAMIC_LINK=1"
+    "ORT_SKIP_DOWNLOAD=1"
+    "ORT_STRATEGY=system"
+    "ORT_DYLIB_PATH=$Dest\onnxruntime.dll"
+    "RUSTFLAGS=$RustFlags"
+    "LIB=$OrtLib;$env:LIB"
+    "LIBRARY_PATH=$OrtLib;$env:LIBRARY_PATH"
+    "PATH=$Dest;$env:PATH"
+  ) | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append
+}
--- a/scripts/ci/actions/setup-prebuilt-onnx/prepare.sh
+++ b/scripts/ci/actions/setup-prebuilt-onnx/prepare.sh
@@ -0,0 +1,48 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+target="${1:?target required}"
+
+case "$target" in
+aarch64-apple-darwin)
+  ort_url="https://cdn.pyke.io/0/pyke:ort-rs/ms@1.24.1/aarch64-apple-darwin.tgz"
+  ;;
+x86_64-apple-darwin)
+  ort_url="https://cdn.pyke.io/0/pyke:ort-rs/ms@1.24.1/x86_64-apple-darwin.tgz"
+  ;;
+*)
+  echo "setup-prebuilt-onnx does not support target $target" >&2
+  exit 1
+  ;;
+esac
+
+ort_dir="${GITHUB_WORKSPACE}/target/onnxruntime/${target}"
+ort_root="${ort_dir}/onnxruntime"
+ort_lib="${ort_root}/lib"
+
+write_env() {
+  {
+    echo "ORT_STRATEGY=system"
+    echo "ORT_LIB_LOCATION=${ort_lib}"
+    echo "ORT_SKIP_DOWNLOAD=1"
+    echo "ORT_PREFER_DYNAMIC_LINK=1"
+  } >>"${GITHUB_ENV}"
+}
+
+if [ ! -f "${ort_lib}/libonnxruntime.a" ]; then
+  rm -rf "${ort_dir}"
+  mkdir -p "${ort_lib}"
+
+  echo "Attempting to download prebuilt ONNX Runtime for ${target}..." >&2
+  if curl -fsSL --max-time 30 -o /tmp/ort.tgz "${ort_url}" 2>/dev/null; then
+    tar -xz -C "${ort_lib}" -f /tmp/ort.tgz
+    rm -f /tmp/ort.tgz
+    write_env
+  else
+    echo "Warning: Prebuilt ONNX Runtime not available for ${target}" >&2
+    echo "Will download and build ONNX Runtime during compilation" >&2
+  fi
+else
+  echo "Using existing ONNX Runtime at ${ort_lib}" >&2
+  write_env
+fi
--- a/scripts/ci/actions/setup-rust/build-with-sccache-fallback.sh
+++ b/scripts/ci/actions/setup-rust/build-with-sccache-fallback.sh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Usage: build-with-sccache-fallback.sh <cargo command...>
+log_file=$(mktemp)
+trap 'rm -f "$log_file"' EXIT
+
+echo "Building with sccache (fallback on errors)..."
+
+# Attempt with sccache
+if "$@" 2>&1 | tee "$log_file"; then
+  echo "✓ Build succeeded with sccache"
+  exit 0
+fi
+
+# Check for sccache-related errors
+if grep -Eq "sccache.*(error|failed)|cache storage failed|dns error|connection (refused|timed out)" "$log_file"; then
+  echo "⚠️  sccache failure detected, retrying without cache..."
+  export RUSTC_WRAPPER=""
+  export SCCACHE_GHA_ENABLED=false
+
+  if "$@"; then
+    echo "✓ Build succeeded without sccache (fallback)"
+    exit 0
+  fi
+fi
+
+echo "✗ Build failed"
+exit 1
--- a/scripts/ci/actions/setup-tesseract-cache/clean-dirs.sh
+++ b/scripts/ci/actions/setup-tesseract-cache/clean-dirs.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+label="${1:?label required}"
+
+rm -rf ".tesseract-cache/${label}"
+rm -rf ".xdg-cache/${label}"
--- a/scripts/ci/actions/setup-tesseract-cache/clean-target-cache.sh
+++ b/scripts/ci/actions/setup-tesseract-cache/clean-target-cache.sh
@@ -0,0 +1,5 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+rust_target="${1:?rust target required}"
+rm -rf "target/${rust_target}/kreuzberg-tesseract-cache"
--- a/scripts/ci/actions/setup-tesseract-cache/set-outputs.sh
+++ b/scripts/ci/actions/setup-tesseract-cache/set-outputs.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+label="${1:?label required}"
+enable_cache="${2:?enable-cache required (true/false)}"
+
+if [ "$enable_cache" = "true" ]; then
+  cache_dir="${GITHUB_WORKSPACE}/.tesseract-cache/${label}"
+
+  echo "TESSERACT_RS_CACHE_DIR=${cache_dir}" >>"$GITHUB_ENV"
+  echo "XDG_CACHE_HOME=${GITHUB_WORKSPACE}/.xdg-cache/${label}" >>"$GITHUB_ENV"
+
+  echo "cache-dir=${cache_dir}" >>"$GITHUB_OUTPUT"
+  echo "cache-enabled=true" >>"$GITHUB_OUTPUT"
+
+  docker_opts="--env TESSERACT_RS_CACHE_DIR=/io/.tesseract-cache/${label}"
+  docker_opts="${docker_opts} --env XDG_CACHE_HOME=/io/.xdg-cache/${label}"
+  multiarch=""
+  if command -v dpkg-architecture >/dev/null 2>&1; then
+    multiarch="$(dpkg-architecture -qDEB_HOST_MULTIARCH 2>/dev/null || true)"
+  fi
+  if [ -z "$multiarch" ]; then
+    case "$(uname -m)" in
+    x86_64) multiarch="x86_64-linux-gnu" ;;
+    aarch64 | arm64) multiarch="aarch64-linux-gnu" ;;
+    esac
+  fi
+  openssl_lib_dir="/usr/lib"
+  if [ -n "$multiarch" ]; then
+    openssl_lib_dir="/usr/lib/${multiarch}"
+  fi
+  docker_opts="${docker_opts} --env OPENSSL_LIB_DIR=${openssl_lib_dir}"
+  docker_opts="${docker_opts} --env OPENSSL_INCLUDE_DIR=/usr/include"
+  echo "docker-options=${docker_opts}" >>"$GITHUB_OUTPUT"
+else
+  {
+    echo "TESSERACT_RS_CACHE_DIR="
+  } >>"$GITHUB_ENV"
+  {
+    echo "cache-dir="
+    echo "cache-enabled=false"
+    echo "docker-options="
+  } >>"$GITHUB_OUTPUT"
+fi
--- a/scripts/ci/actions/setup-tesseract-cache/setup-dirs.sh
+++ b/scripts/ci/actions/setup-tesseract-cache/setup-dirs.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+label="${1:?label required}"
+
+mkdir -p ".tesseract-cache/${label}"
+mkdir -p ".xdg-cache/${label}"
--- a/scripts/ci/benchmarks/verify-node-setup.sh
+++ b/scripts/ci/benchmarks/verify-node-setup.sh
@@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+label="${1:-Node setup}"
+
+echo "=== ${label} ==="
+echo "Node version: $(node --version)"
+echo "pnpm version: $(pnpm --version)"
+echo "tsx availability: $(command -v tsx || echo 'NOT FOUND')"
+echo "pnpm workspace structure:"
+pnpm list --depth=0 || true
--- a/scripts/ci/cache/compute-hash.sh
+++ b/scripts/ci/cache/compute-hash.sh
@@ -0,0 +1,158 @@
+#!/usr/bin/env bash
+# Compute deterministic hash for cache key generation
+#
+# Usage:
+#   compute-hash.sh <glob-pattern> [glob-pattern...]
+#   compute-hash.sh --files <file1> <file2> ...
+#   compute-hash.sh --dirs <dir1> <dir2> ...
+#
+# Examples:
+#   compute-hash.sh "crates/kreuzberg/**/*.rs" "crates/kreuzberg-ffi/**/*.rs"
+#   compute-hash.sh --files Cargo.lock uv.lock
+#   compute-hash.sh --dirs crates/kreuzberg/ crates/kreuzberg-ffi/
+
+set -euo pipefail
+
+# Color output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m' # No Color
+
+error() {
+  echo -e "${RED}Error: $*${NC}" >&2
+  exit 1
+}
+
+info() {
+  echo -e "${GREEN}$*${NC}" >&2
+}
+
+warn() {
+  echo -e "${YELLOW}$*${NC}" >&2
+}
+
+# Check if sha256sum or shasum is available
+if command -v sha256sum &>/dev/null; then
+  HASH_CMD="sha256sum"
+elif command -v shasum &>/dev/null; then
+  HASH_CMD="shasum -a 256"
+else
+  error "Neither sha256sum nor shasum found in PATH"
+fi
+
+# Mode detection
+MODE="glob"
+if [[ "${1:-}" == "--files" ]]; then
+  MODE="files"
+  shift
+elif [[ "${1:-}" == "--dirs" ]]; then
+  MODE="dirs"
+  shift
+fi
+
+if [[ $# -eq 0 ]]; then
+  error "No input provided. Usage: $0 <pattern...> or $0 --files <file...> or $0 --dirs <dir...>"
+fi
+
+# Temporary file for collecting hashes
+TEMP_HASHES=$(mktemp)
+trap 'rm -f "$TEMP_HASHES"' EXIT
+
+case "$MODE" in
+files)
+  # Hash specific files directly
+  for file in "$@"; do
+    if [[ -f "$file" ]]; then
+      $HASH_CMD "$file" >>"$TEMP_HASHES" 2>/dev/null || warn "Failed to hash: $file"
+    else
+      warn "File not found: $file"
+    fi
+  done
+  ;;
+
+dirs)
+  # Hash all files in directories recursively
+  for dir in "$@"; do
+    if [[ -d "$dir" ]]; then
+      # Find all files (excluding hidden files and directories)
+      find "$dir" -type f \
+        ! -path "*/.*" \
+        ! -path "*/target/*" \
+        ! -path "*/node_modules/*" \
+        ! -path "*/.venv/*" \
+        ! -path "*/dist/*" \
+        ! -path "*/build/*" \
+        -exec "$HASH_CMD" {} \; >>"$TEMP_HASHES" 2>/dev/null || true
+    else
+      warn "Directory not found: $dir"
+    fi
+  done
+  ;;
+
+glob)
+  # Hash files matching glob patterns
+  for pattern in "$@"; do
+    # Use find with -path for glob matching
+    # Convert glob to find path expression
+
+    if [[ "$pattern" == *"**"* ]]; then
+      # Handle ** recursive glob (e.g., "crates/kreuzberg/**/*.rs")
+      # Extract the base directory and file extension/name pattern
+      base_dir=$(echo "$pattern" | cut -d'*' -f1 | sed 's|/$||')
+
+      # Get the suffix after the ** (e.g., "/*.rs" from "crates/kreuzberg/**/*.rs")
+      # Remove everything up to and including **/
+      suffix="${pattern#*\*\*/}"
+
+      # Extract filename pattern (e.g., "*.rs" from "/*.rs")
+      # Remove leading / if present
+      if [[ "$suffix" == /* ]]; then
+        name_pattern="${suffix#/}"
+      else
+        name_pattern="$suffix"
+      fi
+
+      if [[ -d "$base_dir" ]]; then
+        # Find all files recursively using -name for filename matching
+        # This is more portable and reliable than bash regex
+        find "$base_dir" -type f \
+          ! -path "*/.*" \
+          ! -path "*/target/*" \
+          ! -path "*/node_modules/*" \
+          ! -path "*/.venv/*" \
+          -name "$name_pattern" \
+          -exec "$HASH_CMD" {} \; 2>/dev/null >>"$TEMP_HASHES" || true
+      else
+        warn "Directory not found: $base_dir"
+      fi
+    else
+      # Simple glob (no **)
+      for file in $pattern; do
+        if [[ -f "$file" ]]; then
+          $HASH_CMD "$file" >>"$TEMP_HASHES" 2>/dev/null || warn "Failed to hash: $file"
+        fi
+      done
+    fi
+  done
+  ;;
+esac
+
+# Check if we found any files to hash
+if [[ ! -s "$TEMP_HASHES" ]]; then
+  error "No files found matching the provided patterns"
+fi
+
+# Sort hashes (for determinism across different find orders)
+# Then hash the combined hashes to get final hash
+FINAL_HASH=$(sort "$TEMP_HASHES" | $HASH_CMD | cut -d' ' -f1)
+
+# Truncate to 12 characters for cache key (still 48 bits of entropy)
+SHORT_HASH="${FINAL_HASH:0:12}"
+
+# Output the hash
+echo "$SHORT_HASH"
+
+# Debug info (to stderr)
+FILE_COUNT=$(wc -l <"$TEMP_HASHES")
+info "Hashed $FILE_COUNT files → $SHORT_HASH" >&2
--- a/scripts/ci/docker/run-cli-tests.sh
+++ b/scripts/ci/docker/run-cli-tests.sh
@@ -0,0 +1,5 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+echo "=== Running Docker CLI feature tests ==="
+python3 scripts/ci/docker/test_docker.py --image "kreuzberg:cli" --variant cli --verbose
--- a/scripts/ci/docker/run-config-tests.sh
+++ b/scripts/ci/docker/run-config-tests.sh
@@ -0,0 +1,13 @@
+#!/usr/bin/env bash
+# CI wrapper for Docker configuration testing
+# Tests volume mounts, config formats, and environment variable overrides
+
+set -euo pipefail
+
+variant="${1:?missing variant}"
+
+echo "=== Running Docker configuration tests (${variant}) ==="
+
+# Run the comprehensive config test script
+# The script expects the image to already be built and tagged
+exec ./scripts/test/test-docker-config-local.sh --image "kreuzberg:${variant}" --variant "${variant}"
--- a/scripts/ci/docker/run-feature-tests.sh
+++ b/scripts/ci/docker/run-feature-tests.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+variant="${1:?missing variant}"
+
+echo "=== Running Docker feature tests (${variant}) ==="
+python3 scripts/ci/docker/test_docker.py --image "kreuzberg:${variant}" --variant "${variant}" --verbose
--- a/scripts/ci/docker/test_docker.py
+++ b/scripts/ci/docker/test_docker.py
@@ -0,0 +1,750 @@
+#!/usr/bin/env python3
+"""Unified Docker image test script for all variants (core, full, cli)."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import random
+import subprocess
+import sys
+import tempfile
+import time
+from dataclasses import dataclass, field
+from pathlib import Path
+
+BLUE = "\033[0;34m"
+GREEN = "\033[0;32m"
+RED = "\033[0;31m"
+YELLOW = "\033[1;33m"
+NC = "\033[0m"
+
+REPO_ROOT = Path(__file__).resolve().parents[3]
+TEST_DOCS_DIR = REPO_ROOT / "test_documents"
+RESULTS_FILE = Path("/tmp/kreuzberg-docker-test-results.json")
+
+
+@dataclass
+class TestRunner:
+    image: str
+    variant: str
+    verbose: bool = False
+    total: int = 0
+    passed: int = 0
+    failed: int = 0
+    failed_names: list[str] = field(default_factory=list)
+    containers: list[str] = field(default_factory=list)
+
+    def log(self, level: str, color: str, msg: str) -> None:
+        print(f"{color}[{level}]{NC} {msg}", flush=True)
+
+    def info(self, msg: str) -> None:
+        self.log("INFO", BLUE, msg)
+
+    def ok(self, msg: str = "PASS") -> None:
+        self.log("SUCCESS", GREEN, msg)
+
+    def error(self, msg: str) -> None:
+        self.log("ERROR", RED, msg)
+
+    def warn(self, msg: str) -> None:
+        self.log("WARNING", YELLOW, msg)
+
+    def debug(self, msg: str) -> None:
+        if self.verbose:
+            self.log("VERBOSE", YELLOW, msg)
+
+    def start(self, name: str) -> None:
+        self.total += 1
+        self.info(f"Test {self.total}: {name}")
+
+    def pass_test(self) -> None:
+        self.passed += 1
+        self.ok()
+
+    def fail_test(self, name: str, details: str = "") -> None:
+        self.failed += 1
+        self.failed_names.append(name)
+        msg = f"FAIL: {name}"
+        if details:
+            msg += f"\n  Details: {details}"
+        self.error(msg)
+
+    def container_name(self) -> str:
+        name = f"kreuzberg-test-{int(time.time())}-{random.randint(0, 99999)}"
+        self.containers.append(name)
+        return name
+
+    def docker_run(self, *args: str, capture: bool = True) -> subprocess.CompletedProcess[str]:
+        cmd = ["docker", "run", "--rm", *args]
+        return subprocess.run(cmd, capture_output=capture, text=True, timeout=120)
+
+    def docker_run_detached(self, *args: str) -> str:
+        name = self.container_name()
+        cmd = ["docker", "run", "-d", "--name", name, *args]
+        subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=60)
+        return name
+
+    def docker_rm(self, name: str) -> None:
+        subprocess.run(["docker", "rm", "-f", name], capture_output=True, timeout=30)
+
+    def cleanup(self) -> None:
+        for c in self.containers:
+            self.docker_rm(c)
+
+    def run_cli_output(self, *extra_args: str, volumes: bool = False) -> str:
+        """Run a CLI command against the image and return combined stdout+stderr."""
+        args: list[str] = ["--name", self.container_name()]
+        if volumes:
+            args += ["-v", f"{TEST_DOCS_DIR}:/data:ro"]
+        args.append(self.image)
+        args.extend(extra_args)
+        r = self.docker_run(*args)
+        return (r.stdout + r.stderr).strip()
+
+    def write_results(self) -> None:
+        rate = (self.passed * 100 // self.total) if self.total else 0
+        data = {
+            "image": self.image,
+            "variant": self.variant,
+            "total_tests": self.total,
+            "passed": self.passed,
+            "failed": self.failed,
+            "success_rate": rate,
+            "failed_tests": self.failed_names,
+        }
+        RESULTS_FILE.write_text(json.dumps(data, indent=2))
+        self.info(f"Results written to {RESULTS_FILE}")
+
+
+# ---------------------------------------------------------------------------
+# Shared tests (all variants)
+# ---------------------------------------------------------------------------
+
+def test_image_exists(t: TestRunner) -> None:
+    t.start("Docker image exists")
+    r = subprocess.run(["docker", "inspect", t.image], capture_output=True, timeout=30)
+    if r.returncode == 0:
+        t.pass_test()
+    else:
+        t.fail_test("Image does not exist", t.image)
+
+
+def test_version(t: TestRunner) -> None:
+    t.start("CLI --version command")
+    out = t.run_cli_output("--version")
+    t.debug(f"Version output: {out}")
+    if "kreuzberg" in out.lower():
+        t.pass_test()
+    else:
+        t.fail_test("CLI version", f"Expected 'kreuzberg' in output, got: {out}")
+
+
+def test_help(t: TestRunner) -> None:
+    t.start("CLI --help command")
+    out = t.run_cli_output("--help")
+    if "extract" in out.lower():
+        t.pass_test()
+    else:
+        t.fail_test("CLI help", "Expected 'extract' in help output")
+
+
+def test_mime_detection(t: TestRunner) -> None:
+    t.start("MIME type detection (detect command)")
+    out = t.run_cli_output("detect", "/data/pdf/searchable.pdf", volumes=True)
+    t.debug(f"MIME detection output: {out}")
+    if "application/pdf" in out.lower():
+        t.pass_test()
+    else:
+        t.fail_test("MIME detection", f"Expected 'application/pdf', got: {out}")
+
+
+def test_extract_text(t: TestRunner) -> None:
+    t.start("Extract plain text file")
+    out = t.run_cli_output("extract", "/data/text/contract.txt", volumes=True)
+    t.debug(f"Text extraction output (first 100 chars): {out[:100]}")
+    if len(out) > 15 and "contract" in out.lower():
+        t.pass_test()
+    else:
+        t.fail_test("Text extraction", f"Output too short ({len(out)} chars) or missing expected keywords")
+
+
+def test_extract_pdf(t: TestRunner) -> None:
+    t.start("Extract searchable PDF")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name,
+         "-v", f"{TEST_DOCS_DIR}:/data:ro",
+         t.image, "extract", "/data/pdf/searchable.pdf"],
+        capture_output=True, text=True, timeout=120,
+    )
+    out = (r.stdout + r.stderr).strip()
+    t.debug(f"PDF extraction output (first 200 chars): {out[:200]}")
+    if r.returncode != 0:
+        t.fail_test("Searchable PDF extraction", f"Exit code {r.returncode}: {out[:300]}")
+    elif len(out) > 50:
+        t.pass_test()
+    else:
+        t.fail_test("Searchable PDF extraction", f"Output too short: {len(out)} chars")
+
+
+def test_extract_html(t: TestRunner) -> None:
+    t.start("Extract HTML file")
+    out = t.run_cli_output("extract", "/data/html/simple_table.html", volumes=True)
+    t.debug(f"HTML extraction output (first 100 chars): {out[:100]}")
+    if len(out) > 10:
+        t.pass_test()
+    else:
+        t.fail_test("HTML extraction", f"Output too short: {len(out)} chars")
+
+
+def test_extract_docx(t: TestRunner) -> None:
+    t.start("Extract DOCX file")
+    out = t.run_cli_output("extract", "/data/docx/extraction_test.docx", volumes=True)
+    t.debug(f"DOCX extraction output (first 100 chars): {out[:100]}")
+    if len(out) > 100:
+        t.pass_test()
+    else:
+        t.fail_test("DOCX extraction", f"Output too short ({len(out)} chars)")
+
+
+def test_batch_cli(t: TestRunner) -> None:
+    t.start("CLI batch extraction (multiple files)")
+    out = t.run_cli_output(
+        "batch", "/data/text/contract.txt", "/data/html/simple_table.html",
+        volumes=True,
+    )
+    t.debug(f"Batch output (first 200 chars): {out[:200]}")
+    if len(out) > 20:
+        t.pass_test()
+    else:
+        t.fail_test("Batch extraction", f"Output too short: {len(out)} chars")
+
+
+def test_nonexistent_file(t: TestRunner) -> None:
+    t.start("Non-existent file returns error")
+    r = subprocess.run(
+        ["docker", "run", "--rm", t.image, "extract", "/nonexistent/file.pdf"],
+        capture_output=True, text=True, timeout=60,
+    )
+    if r.returncode != 0:
+        t.pass_test()
+    else:
+        t.fail_test("Error on missing file", "Expected non-zero exit code for missing file")
+
+
+def test_readonly_mount(t: TestRunner) -> None:
+    t.start("Read-only volume mount works")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name,
+         "-v", f"{TEST_DOCS_DIR}:/data:ro",
+         "--read-only", "--tmpfs", "/tmp",
+         t.image, "extract", "/data/text/simple.txt"],
+        capture_output=True, text=True, timeout=60,
+    )
+    out = (r.stdout + r.stderr).strip()
+    if len(out) > 5:
+        t.pass_test()
+    else:
+        t.fail_test("Read-only mount", "Failed to extract with read-only filesystem")
+
+
+# ---------------------------------------------------------------------------
+# Core/Full-only tests (API server tests)
+# ---------------------------------------------------------------------------
+
+def _wait_for_api(port: int, retries: int = 10) -> bool:
+    import urllib.request
+    for _ in range(retries):
+        try:
+            urllib.request.urlopen(f"http://localhost:{port}/health", timeout=3)
+            return True
+        except Exception:
+            time.sleep(2)
+    return False
+
+
+def _api_get(port: int, path: str) -> str | None:
+    import urllib.request
+    try:
+        with urllib.request.urlopen(f"http://localhost:{port}{path}", timeout=10) as resp:
+            return resp.read().decode()
+    except Exception:
+        return None
+
+
+def _api_post_file(port: int, path: str, filepath: str) -> str | None:
+    """POST a file using curl (simplest multipart approach)."""
+    r = subprocess.run(
+        ["curl", "-f", "-s", "-X", "POST", f"http://localhost:{port}{path}",
+         "-F", f"files=@{filepath}"],
+        capture_output=True, text=True, timeout=30,
+    )
+    return r.stdout if r.returncode == 0 else None
+
+
+def test_ocr_extraction(t: TestRunner) -> None:
+    t.start("OCR extraction with Tesseract")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name, "--memory", "1g",
+         "-v", f"{TEST_DOCS_DIR}:/data:ro",
+         t.image, "extract", "/data/images/ocr_image.jpg", "--ocr", "true"],
+        capture_output=True, text=True, timeout=120,
+    )
+    out = (r.stdout + r.stderr).strip()
+    t.debug(f"OCR extraction output (first 100 chars): {out[:100]}")
+    if len(out) > 10:
+        t.pass_test()
+    else:
+        t.fail_test("OCR extraction", "Output too short or OCR failed")
+
+
+def test_paddle_ocr_extraction(t: TestRunner) -> None:
+    t.start("PaddleOCR extraction (pre-loaded models)")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name, "--memory", "2g",
+         "-v", f"{TEST_DOCS_DIR}:/data:ro",
+         t.image, "extract", "/data/images/ocr_image.jpg",
+         "--ocr", "true", "--ocr-backend", "paddle-ocr"],
+        capture_output=True, text=True, timeout=120,
+    )
+    out = (r.stdout + r.stderr).strip()
+    t.debug(f"PaddleOCR extraction output (first 200 chars): {out[:200]}")
+    if r.returncode == 0 and len(out) > 10:
+        t.pass_test()
+    else:
+        t.fail_test("PaddleOCR extraction", f"Exit code: {r.returncode}, output length: {len(out)}")
+
+
+def test_doc_extraction(t: TestRunner) -> None:
+    t.start("Legacy DOC extraction (native OLE/CFB)")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name, "--memory", "1g",
+         "-v", f"{TEST_DOCS_DIR}:/data:ro",
+         t.image, "extract", "/data/doc/unit_test_lists.doc"],
+        capture_output=True, text=True, timeout=120,
+    )
+    out = (r.stdout + r.stderr).strip()
+    t.debug(f"DOC extraction output (first 100 chars): {out[:100]}")
+    if len(out) > 20:
+        t.pass_test()
+    else:
+        t.fail_test("DOC extraction", f"Output too short: {len(out)} chars")
+
+
+def test_api_health(t: TestRunner) -> None:
+    t.start("API server startup and health check")
+    port = 9000 + random.randint(0, 999)
+    name = t.docker_run_detached(
+        "--memory", "2g", "--cpus", "2",
+        "-p", f"{port}:8000", t.image,
+    )
+    if not _wait_for_api(port):
+        t.fail_test("API health check", f"Health endpoint not responding on port {port}")
+        t.docker_rm(name)
+        return
+
+    health = _api_get(port, "/health")
+    t.debug(f"Health response: {health}")
+    if health:
+        t.pass_test()
+    else:
+        t.fail_test("API health check", "No response from /health")
+
+    # Plugin initialization validation
+    t.start("Plugin initialization validation")
+    if health and "plugins" in health:
+        import re
+        ocr_m = re.search(r'"ocr_backends_count":(\d+)', health)
+        ext_m = re.search(r'"extractors_count":(\d+)', health)
+        ocr_count = int(ocr_m.group(1)) if ocr_m else 0
+        ext_count = int(ext_m.group(1)) if ext_m else 0
+        t.debug(f"OCR backends: {ocr_count}, Extractors: {ext_count}")
+
+        if t.variant == "full":
+            if ocr_count > 0:
+                t.info(f"Full variant: {ocr_count} OCR backend(s) registered")
+                t.pass_test()
+            else:
+                t.fail_test("Plugin initialization", "Full variant: No OCR backends registered")
+                t.docker_rm(name)
+                return
+        else:
+            t.pass_test()
+
+        if ext_count == 0:
+            t.fail_test("Plugin initialization", "No document extractors registered")
+            t.docker_rm(name)
+            return
+    else:
+        t.warn("Health response missing 'plugins' field")
+        t.pass_test()
+
+    t.docker_rm(name)
+
+
+def test_api_extract(t: TestRunner) -> None:
+    t.start("API extraction endpoint")
+    port = 9000 + random.randint(0, 999)
+    name = t.docker_run_detached(
+        "--memory", "2g", "--cpus", "2",
+        "-p", f"{port}:8000", t.image,
+    )
+    if not _wait_for_api(port):
+        t.fail_test("API extraction", "Server not ready")
+        t.docker_rm(name)
+        return
+
+    with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
+        f.write("Test content for API extraction")
+        tmp = f.name
+
+    resp = _api_post_file(port, "/extract", tmp)
+    os.unlink(tmp)
+    t.debug(f"API response: {resp}")
+
+    if resp and "Test content for API extraction" in resp:
+        t.pass_test()
+    else:
+        t.fail_test("API extraction", "Response missing expected content")
+    t.docker_rm(name)
+
+
+def test_api_info(t: TestRunner) -> None:
+    t.start("API /info endpoint")
+    port = 9000 + random.randint(0, 999)
+    name = t.docker_run_detached(
+        "--memory", "2g", "--cpus", "2",
+        "-p", f"{port}:8000", t.image,
+    )
+    if not _wait_for_api(port):
+        t.fail_test("API /info", "Server not ready")
+        t.docker_rm(name)
+        return
+
+    resp = _api_get(port, "/info")
+    t.debug(f"/info response: {resp}")
+    if resp and "version" in resp and "rust_backend" in resp:
+        t.pass_test()
+    else:
+        t.fail_test("API /info endpoint", "Response missing expected fields")
+    t.docker_rm(name)
+
+
+def test_api_openapi(t: TestRunner) -> None:
+    t.start("API /openapi.json endpoint")
+    port = 9000 + random.randint(0, 999)
+    name = t.docker_run_detached(
+        "--memory", "2g", "--cpus", "2",
+        "-p", f"{port}:8000", t.image,
+    )
+    if not _wait_for_api(port):
+        t.fail_test("API /openapi.json", "Server not ready")
+        t.docker_rm(name)
+        return
+
+    resp = _api_get(port, "/openapi.json")
+    t.debug(f"/openapi.json response (first 200 chars): {(resp or '')[:200]}")
+    if resp and '"openapi"' in resp and '"paths"' in resp:
+        t.pass_test()
+    else:
+        t.fail_test("API /openapi.json endpoint", "Response missing OpenAPI schema fields")
+    t.docker_rm(name)
+
+
+def test_api_cache(t: TestRunner) -> None:
+    t.start("API /cache/stats endpoint")
+    port = 9000 + random.randint(0, 999)
+    name = t.docker_run_detached(
+        "--memory", "2g", "--cpus", "2",
+        "-p", f"{port}:8000", t.image,
+    )
+    if not _wait_for_api(port):
+        t.fail_test("API /cache/stats", "Server not ready")
+        t.docker_rm(name)
+        return
+
+    resp = _api_get(port, "/cache/stats")
+    t.debug(f"/cache/stats response: {resp}")
+    if resp and "total_files" in resp:
+        t.pass_test()
+    else:
+        t.fail_test("API /cache/stats endpoint", "Response missing expected fields")
+
+    t.start("API /cache/clear endpoint")
+    r = subprocess.run(
+        ["curl", "-f", "-s", "-X", "DELETE", f"http://localhost:{port}/cache/clear"],
+        capture_output=True, text=True, timeout=10,
+    )
+    if r.returncode == 0 and "removed_files" in r.stdout:
+        t.pass_test()
+    else:
+        t.fail_test("API /cache/clear endpoint", "Response missing expected fields")
+    t.docker_rm(name)
+
+
+def test_api_batch(t: TestRunner) -> None:
+    t.start("API batch extraction (multiple files)")
+    port = 9000 + random.randint(0, 999)
+    name = t.docker_run_detached(
+        "--memory", "2g", "--cpus", "2",
+        "-p", f"{port}:8000", t.image,
+    )
+    if not _wait_for_api(port):
+        t.fail_test("API batch extraction", "Server not ready")
+        t.docker_rm(name)
+        return
+
+    tmp1 = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
+    tmp2 = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
+    tmp1.write("File one content"); tmp1.close()
+    tmp2.write("File two content"); tmp2.close()
+
+    r = subprocess.run(
+        ["curl", "-f", "-s", "-X", "POST", f"http://localhost:{port}/extract",
+         "-F", f"files=@{tmp1.name}", "-F", f"files=@{tmp2.name}"],
+        capture_output=True, text=True, timeout=30,
+    )
+    os.unlink(tmp1.name)
+    os.unlink(tmp2.name)
+
+    t.debug(f"Batch extraction response (first 200 chars): {r.stdout[:200]}")
+    if "File one content" in r.stdout and "File two content" in r.stdout:
+        t.pass_test()
+    else:
+        t.fail_test("API batch extraction", "Response missing expected content")
+    t.docker_rm(name)
+
+
+def test_cli_batch_json(t: TestRunner) -> None:
+    t.start("CLI batch extraction with JSON format")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name,
+         "-v", f"{TEST_DOCS_DIR}:/data:ro",
+         t.image, "batch", "/data/text/contract.txt", "/data/pdf/searchable.pdf",
+         "--format", "json"],
+        capture_output=True, text=True, timeout=120,
+    )
+    out = (r.stdout + r.stderr).strip()
+    t.debug(f"Batch command output (first 200 chars): {out[:200]}")
+    if len(out) > 100 and "content" in out:
+        t.pass_test()
+    else:
+        t.fail_test("CLI batch command", "Output too short or malformed")
+
+
+def test_mcp_server(t: TestRunner) -> None:
+    t.start("MCP server startup and persistence")
+    name = t.docker_run_detached(
+        "-i", "--memory", "1g", t.image, "mcp",
+    )
+    time.sleep(3)
+    r = subprocess.run(
+        ["docker", "ps", "--filter", f"name={name}", "--format", "{{.Names}}"],
+        capture_output=True, text=True, timeout=10,
+    )
+    if name in r.stdout:
+        t.debug("MCP server is running")
+        t.pass_test()
+    else:
+        t.fail_test("MCP server persistence", "MCP server exited immediately")
+    t.docker_rm(name)
+
+
+def test_cli_cache(t: TestRunner) -> None:
+    t.start("CLI cache stats command")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name, t.image, "cache", "stats", "--format", "json"],
+        capture_output=True, text=True, timeout=60,
+    )
+    out = (r.stdout + r.stderr).strip()
+    t.debug(f"Cache stats output: {out}")
+    if "total_files" in out:
+        t.pass_test()
+    else:
+        t.fail_test("CLI cache stats", "Output missing expected fields")
+
+    t.start("CLI cache clear command")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name, t.image, "cache", "clear", "--format", "json"],
+        capture_output=True, text=True, timeout=60,
+    )
+    out = (r.stdout + r.stderr).strip()
+    t.debug(f"Cache clear output: {out}")
+    if "removed_files" in out:
+        t.pass_test()
+    else:
+        t.fail_test("CLI cache clear", "Output missing expected fields")
+
+
+def test_security_nonroot(t: TestRunner) -> None:
+    t.start("Security: Container runs as non-root user")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name, "--entrypoint", "/bin/sh",
+         t.image, "-c", "whoami"],
+        capture_output=True, text=True, timeout=30,
+    )
+    user = r.stdout.strip()
+    if user == "kreuzberg":
+        t.pass_test()
+    else:
+        t.fail_test("Non-root user", f"Container running as: {user} (expected: kreuzberg)")
+
+
+def test_security_readonly(t: TestRunner) -> None:
+    t.start("Security: Read-only volume enforcement")
+    with tempfile.TemporaryDirectory() as tmpdir:
+        (Path(tmpdir) / "test.txt").write_text("test")
+        name = t.container_name()
+        r = subprocess.run(
+            ["docker", "run", "--rm", "--name", name,
+             "-v", f"{tmpdir}:/data:ro",
+             "--entrypoint", "/bin/sh", t.image,
+             "-c", "echo 'attempt' > /data/test2.txt 2>&1 || echo 'READ_ONLY'"],
+            capture_output=True, text=True, timeout=30,
+        )
+        out = r.stdout + r.stderr
+        if any(s in out for s in ("READ_ONLY", "read-only", "Read-only")):
+            t.pass_test()
+        else:
+            t.fail_test("Read-only volume", "Was able to write to read-only volume")
+
+
+def test_security_memlimit(t: TestRunner) -> None:
+    t.start("Security: Memory limit enforcement")
+    name = t.container_name()
+    r = subprocess.run(
+        ["docker", "run", "--rm", "--name", name,
+         "--memory", "128m", "--memory-swap", "128m",
+         "--entrypoint", "/bin/sh", t.image,
+         "-c", "echo 'Memory limit test passed'"],
+        capture_output=True, text=True, timeout=30,
+    )
+    if "Memory limit test passed" in r.stdout:
+        t.pass_test()
+    else:
+        t.fail_test("Memory limit", "Container failed with memory limit")
+
+
+# ---------------------------------------------------------------------------
+# CLI-only tests
+# ---------------------------------------------------------------------------
+
+def test_cli_image_size(t: TestRunner) -> None:
+    t.start("Image size is reasonable (< 200MB)")
+    r = subprocess.run(
+        ["docker", "inspect", t.image, "--format", "{{.Size}}"],
+        capture_output=True, text=True, timeout=10,
+    )
+    try:
+        size_mb = int(r.stdout.strip()) // (1024 * 1024)
+    except ValueError:
+        size_mb = 0
+    t.debug(f"Image size: {size_mb}MB")
+    if 0 < size_mb < 200:
+        t.pass_test()
+    else:
+        t.fail_test("Image size", f"Expected < 200MB, got {size_mb}MB")
+
+
+# ---------------------------------------------------------------------------
+# Test suites per variant
+# ---------------------------------------------------------------------------
+
+def run_cli_tests(t: TestRunner) -> None:
+    """Tests for the minimal CLI Docker image."""
+    test_image_exists(t)
+    test_cli_image_size(t)
+    test_version(t)
+    test_help(t)
+    test_mime_detection(t)
+    test_extract_text(t)
+    test_extract_pdf(t)
+    test_extract_html(t)
+    test_extract_docx(t)
+    test_batch_cli(t)
+    test_readonly_mount(t)
+    test_nonexistent_file(t)
+
+
+def run_core_full_tests(t: TestRunner) -> None:
+    """Tests for core and full Docker images."""
+    test_image_exists(t)
+    test_version(t)
+    test_help(t)
+    test_mime_detection(t)
+    test_extract_text(t)
+    test_extract_pdf(t)
+    test_extract_docx(t)
+    test_extract_html(t)
+    test_ocr_extraction(t)
+
+    if t.variant == "full":
+        test_doc_extraction(t)
+        test_paddle_ocr_extraction(t)
+
+    test_api_health(t)
+    test_api_extract(t)
+    test_api_info(t)
+    test_api_openapi(t)
+    test_api_cache(t)
+    test_api_batch(t)
+    test_cli_batch_json(t)
+    test_mcp_server(t)
+    test_cli_cache(t)
+    test_security_nonroot(t)
+    test_security_readonly(t)
+    test_security_memlimit(t)
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Docker image tests")
+    parser.add_argument("--image", required=True, help="Docker image name")
+    parser.add_argument("--variant", required=True, choices=["core", "full", "cli"])
+    parser.add_argument("--verbose", action="store_true")
+    parser.add_argument("--skip-build", action="store_true", help="(ignored, kept for compat)")
+    args = parser.parse_args()
+
+    t = TestRunner(image=args.image, variant=args.variant, verbose=args.verbose)
+
+    print("=" * 72)
+    t.info(f"Starting Docker tests for: {args.image} (variant: {args.variant})")
+    print("=" * 72)
+
+    try:
+        if args.variant == "cli":
+            run_cli_tests(t)
+        else:
+            run_core_full_tests(t)
+    finally:
+        t.cleanup()
+
+    # Summary
+    print()
+    print("=" * 72)
+    t.info(f"Test Results: {t.passed}/{t.total} passed, {t.failed} failed")
+    print("=" * 72)
+
+    if t.failed > 0:
+        t.error("Failed tests:")
+        for name in t.failed_names:
+            print(f"  - {name}")
+
+    t.write_results()
+
+    if t.failed > 0:
+        sys.exit(1)
+    t.ok("All tests passed!")
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/ci/docs/build.sh
+++ b/scripts/ci/docs/build.sh
@@ -0,0 +1,61 @@
+#!/usr/bin/env bash
+# Build the documentation site (Zensical, doc dependency group).
+#
+# Usage:
+#   scripts/ci/docs/build.sh
+#   scripts/ci/docs/build.sh --strict --log-file /tmp/build-log.txt
+#
+# Caching: use astral-sh/setup-uv with enable-cache in CI; this script only runs uv.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
+cd "$REPO_ROOT"
+
+strict=false
+log_file=""
+
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --strict)
+      strict=true
+      shift
+      ;;
+    --log-file)
+      if [[ $# -lt 2 ]]; then
+        echo "error: --log-file requires a path" >&2
+        exit 2
+      fi
+      log_file="$2"
+      shift 2
+      ;;
+    *)
+      echo "usage: $0 [--strict] [--log-file PATH]" >&2
+      exit 2
+      ;;
+  esac
+done
+
+uv_sync() {
+  uv sync --group doc --no-editable --no-install-workspace --no-install-project
+}
+
+zensical_build() {
+  if [[ "$strict" == true ]]; then
+    uv run --no-sync zensical build --clean --strict
+  else
+    uv run --no-sync zensical build --clean
+  fi
+}
+
+if [[ -n "$log_file" ]]; then
+  set -o pipefail
+  mkdir -p "$(dirname "$log_file")"
+  : >"$log_file"
+  uv_sync 2>&1 | tee -a "$log_file"
+  zensical_build 2>&1 | tee -a "$log_file"
+else
+  uv_sync
+  zensical_build
+fi
--- a/scripts/ci/docs/textlint.sh
+++ b/scripts/ci/docs/textlint.sh
@@ -0,0 +1,13 @@
+#!/usr/bin/env bash
+# Run textlint prose linting against docs/**/*.md.
+#
+# Usage:
+#   scripts/ci/docs/textlint.sh
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
+cd "$REPO_ROOT"
+
+npx textlint "docs/**/*.md"
--- a/scripts/ci/install-system-deps/detect-tesseract-linux.sh
+++ b/scripts/ci/install-system-deps/detect-tesseract-linux.sh
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+version="$(
+  apt-cache policy tesseract-ocr 2>/dev/null |
+    grep 'Candidate:' |
+    grep -Eo '[0-9]+\.[0-9]+' |
+    head -1 ||
+    true
+)"
+
+if [[ -z "${version}" ]]; then
+  version="unknown"
+fi
+
+echo "version=${version}" >>"${GITHUB_OUTPUT}"
+echo "::notice title=Tesseract Version::Detected version: ${version}"
--- a/scripts/ci/install-system-deps/detect-tesseract-macos.sh
+++ b/scripts/ci/install-system-deps/detect-tesseract-macos.sh
@@ -0,0 +1,25 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+version=""
+
+json="$(brew info --json=v2 tesseract 2>/dev/null || true)"
+if [[ -n "${json}" ]]; then
+  version="$(
+    python3 -c 'import json, re, sys; data = json.loads(sys.argv[1]); stable = (((data.get("formulae") or [{}])[0].get("versions") or {}).get("stable") or ""); m = re.match(r"^(\d+\.\d+)", stable); print(m.group(1) if m else "")' "${json}" || true
+  )"
+fi
+
+if [[ -z "${version}" ]]; then
+  first_line="$(brew info tesseract 2>/dev/null | head -1 || true)"
+  if [[ "${first_line}" =~ ([0-9]+\.[0-9]+) ]]; then
+    version="${BASH_REMATCH[1]}"
+  fi
+fi
+
+if [[ -z "${version}" ]]; then
+  version="unknown"
+fi
+
+echo "version=${version}" >>"${GITHUB_OUTPUT}"
+echo "::notice title=Tesseract Version::Detected version: ${version}"
--- a/scripts/ci/install-system-deps/install-linux.sh
+++ b/scripts/ci/install-system-deps/install-linux.sh
@@ -0,0 +1,136 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
+
+source "$REPO_ROOT/scripts/lib/retry.sh"
+
+echo "::group::Installing Linux dependencies"
+
+echo "Updating package index..."
+if ! retry_with_backoff sudo apt-get update; then
+  echo "::warning::apt-get update failed after retries, continuing anyway..."
+fi
+
+packages=(
+  tesseract-ocr
+  tesseract-ocr-eng
+  tesseract-ocr-tur
+  tesseract-ocr-deu
+  fonts-liberation
+  fonts-dejavu-core
+  fonts-noto-core
+  libssl-dev
+  pkg-config
+  build-essential
+  cmake
+  libmagic-dev
+  libuv1-dev
+  php-cli
+  php-dev
+)
+
+echo "Installing dependencies..."
+if retry_with_backoff_timeout 900 sudo apt-get install -y "${packages[@]}"; then
+  echo "✓ All packages installed successfully"
+else
+  exit_code=$?
+  if [ $exit_code -eq 124 ]; then
+    echo "::error::Package installation timed out after 15 minutes"
+  else
+    echo "::warning::Some packages failed to install, attempting individual installs..."
+    for pkg in tesseract-ocr libssl-dev pkg-config cmake; do
+      echo "Installing $pkg..."
+      if retry_with_backoff_timeout 300 sudo apt-get install -y "$pkg" 2>&1; then
+        echo "  ✓ $pkg installed"
+      else
+        echo "  ⚠ Failed to install $pkg"
+      fi
+    done
+  fi
+fi
+
+echo "::endgroup::"
+
+echo "::group::Verifying Linux installations"
+
+echo "CMake:"
+if command -v cmake >/dev/null 2>&1; then
+  cmake --version | head -1
+  echo "✓ CMake available"
+  # Export CMAKE environment variable for immediate availability in build scripts
+  CMAKE_FULL_PATH="$(command -v cmake)"
+  if [[ -n "$GITHUB_ENV" ]]; then
+    echo "CMAKE=$CMAKE_FULL_PATH" >>"$GITHUB_ENV"
+    echo "✓ Set CMAKE=$CMAKE_FULL_PATH in GITHUB_ENV"
+  fi
+  # Also add cmake binary directory to GITHUB_PATH for subsequent steps
+  CMAKE_BIN="$(dirname "$CMAKE_FULL_PATH")"
+  if [[ -n "$GITHUB_PATH" && -d "$CMAKE_BIN" ]]; then
+    echo "$CMAKE_BIN" >>"$GITHUB_PATH"
+    echo "✓ Added cmake directory to GITHUB_PATH: $CMAKE_BIN"
+  fi
+else
+  echo "::error::CMake not found after installation"
+  exit 1
+fi
+
+echo ""
+echo "Tesseract:"
+if command -v tesseract >/dev/null 2>&1; then
+  if tesseract --version 2>/dev/null | head -1; then
+    echo "✓ Tesseract CLI available"
+  else
+    echo "::warning::Tesseract CLI present but failed to run"
+  fi
+else
+  echo "::warning::Tesseract CLI not found; continuing (OCR will rely on bundled Tesseract)"
+fi
+
+echo ""
+echo "Available Tesseract languages:"
+if command -v tesseract >/dev/null 2>&1; then
+  tesseract --list-langs | head -10 || true
+else
+  echo "(tesseract CLI not available)"
+fi
+
+echo ""
+echo "PHP:"
+if command -v php >/dev/null 2>&1; then
+  php --version | head -1
+  echo "✓ PHP available"
+else
+  echo "::error::PHP not found after installation"
+  exit 1
+fi
+
+echo ""
+echo "Checking Tesseract data path..."
+
+tessdata_found=0
+for tessdata_path in "/usr/share/tesseract-ocr/5/tessdata" "/usr/share/tesseract-ocr/tessdata"; do
+  if [ -d "$tessdata_path" ]; then
+    echo "Found tessdata at: $tessdata_path"
+
+    echo "Required language files:"
+    for lang in eng tur deu; do
+      if [ -f "$tessdata_path/${lang}.traineddata" ]; then
+        size=$(stat -c%s "$tessdata_path/${lang}.traineddata" 2>/dev/null || echo "unknown")
+        echo "  ✓ ${lang}.traineddata ($size bytes)"
+      else
+        echo "  ⚠ ${lang}.traineddata (missing)"
+      fi
+    done
+    tessdata_found=1
+    break
+  fi
+done
+
+if [ $tessdata_found -eq 0 ]; then
+  echo "::error::Tessdata directory not found in standard locations"
+  exit 1
+fi
+
+echo "::endgroup::"
--- a/scripts/ci/install-system-deps/install-macos.sh
+++ b/scripts/ci/install-system-deps/install-macos.sh
@@ -0,0 +1,136 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
+
+source "$REPO_ROOT/scripts/lib/retry.sh"
+
+echo "::group::Installing macOS dependencies"
+
+if [[ -d "/opt/homebrew/bin" ]]; then
+  export PATH="/opt/homebrew/bin:/opt/homebrew/sbin:${PATH}"
+  echo "/opt/homebrew/bin" >>"$GITHUB_PATH"
+  echo "/opt/homebrew/sbin" >>"$GITHUB_PATH"
+fi
+if [[ -d "/usr/local/bin" ]]; then
+  export PATH="/usr/local/bin:/usr/local/sbin:${PATH}"
+  echo "/usr/local/bin" >>"$GITHUB_PATH"
+  echo "/usr/local/sbin" >>"$GITHUB_PATH"
+fi
+
+if ! brew list cmake &>/dev/null; then
+  echo "Installing CMake..."
+  retry_with_backoff brew install cmake || {
+    echo "::error::Failed to install CMake after retries"
+    exit 1
+  }
+else
+  echo "✓ CMake already installed"
+fi
+
+if ! command -v cmake >/dev/null 2>&1; then
+  echo "CMake not on PATH after install; attempting brew link..."
+  brew link --overwrite cmake >/dev/null 2>&1 || true
+fi
+
+if ! brew list tesseract &>/dev/null; then
+  echo "Installing Tesseract..."
+  retry_with_backoff brew install tesseract || {
+    echo "::error::Failed to install Tesseract after retries"
+    exit 1
+  }
+else
+  echo "✓ Tesseract already installed"
+fi
+
+if ! command -v tesseract >/dev/null 2>&1; then
+  echo "Tesseract not on PATH after install; attempting brew link..."
+  brew link --overwrite tesseract >/dev/null 2>&1 || true
+fi
+
+if ! brew list tesseract-lang &>/dev/null; then
+  echo "Installing Tesseract language packs..."
+  retry_with_backoff brew install tesseract-lang || {
+    echo "::warning::Failed to install tesseract-lang, some languages may be unavailable"
+  }
+else
+  echo "✓ Tesseract language packs already installed"
+fi
+
+if ! brew list libmagic &>/dev/null; then
+  echo "Installing libmagic..."
+  retry_with_backoff brew install libmagic || {
+    echo "::warning::Failed to install libmagic after retries"
+  }
+else
+  echo "✓ libmagic already installed"
+fi
+
+if ! brew list php &>/dev/null; then
+  echo "Installing PHP..."
+  retry_with_backoff brew install php || {
+    echo "::error::Failed to install PHP after retries"
+    exit 1
+  }
+else
+  echo "✓ PHP already installed"
+fi
+
+if ! command -v php >/dev/null 2>&1; then
+  echo "PHP not on PATH after install; attempting brew link..."
+  brew link --overwrite php >/dev/null 2>&1 || true
+fi
+
+echo "::endgroup::"
+
+echo "::group::Verifying macOS installations"
+
+echo "CMake:"
+if command -v cmake >/dev/null 2>&1; then
+  cmake --version | head -1
+  # Export CMAKE environment variable for immediate availability in build scripts
+  CMAKE_FULL_PATH="$(command -v cmake)"
+  if [[ -n "$GITHUB_ENV" ]]; then
+    echo "CMAKE=$CMAKE_FULL_PATH" >>"$GITHUB_ENV"
+    echo "✓ Set CMAKE=$CMAKE_FULL_PATH in GITHUB_ENV"
+  fi
+  # Also add cmake binary directory to GITHUB_PATH for subsequent steps
+  CMAKE_BIN="$(dirname "$CMAKE_FULL_PATH")"
+  if [[ -n "$GITHUB_PATH" && -d "$CMAKE_BIN" ]]; then
+    echo "$CMAKE_BIN" >>"$GITHUB_PATH"
+    echo "✓ Added cmake directory to GITHUB_PATH: $CMAKE_BIN"
+  fi
+else
+  echo "::error::CMake not found on PATH after installation"
+  echo "PATH=$PATH"
+  brew --prefix cmake 2>/dev/null || true
+  exit 1
+fi
+
+echo ""
+echo "Tesseract:"
+if command -v tesseract >/dev/null 2>&1; then
+  tesseract --version | head -1
+else
+  echo "::error::Tesseract not found on PATH after installation"
+  echo "PATH=$PATH"
+  brew --prefix tesseract 2>/dev/null || true
+  exit 1
+fi
+
+echo ""
+echo "Available languages:"
+tesseract --list-langs | head -5
+
+echo ""
+echo "PHP:"
+if command -v php >/dev/null 2>&1; then
+  php --version | head -1
+else
+  echo "::error::PHP not found on PATH after installation"
+  echo "PATH=$PATH"
+  exit 1
+fi
+
+echo "::endgroup::"
--- a/scripts/ci/install-system-deps/install-windows.ps1
+++ b/scripts/ci/install-system-deps/install-windows.ps1
@@ -0,0 +1,301 @@
+#!/usr/bin/env pwsh
+
+Set-StrictMode -Version Latest
+$ErrorActionPreference = 'Stop'
+
+Write-Host "::group::Installing Windows dependencies"
+
+function Retry-Command {
+  param(
+    [scriptblock]$Command,
+    [int]$MaxAttempts = 3,
+    [int]$DelaySeconds = 5
+  )
+
+  $attempt = 1
+  while ($attempt -le $MaxAttempts) {
+    try {
+      Write-Host "Attempt $attempt of $MaxAttempts..."
+      & $Command
+      return $true
+    }
+    catch {
+      $attempt++
+      if ($attempt -le $MaxAttempts) {
+        $backoffDelay = $DelaySeconds * [Math]::Pow(2, $attempt - 1)
+        Write-Host "⚠ Attempt failed, retrying in ${backoffDelay}s..." -ForegroundColor Yellow
+        Start-Sleep -Seconds $backoffDelay
+      }
+      else {
+        return $false
+      }
+    }
+  }
+}
+
+$tesseractCacheHit = $env:TESSERACT_CACHE_HIT -eq "true"
+$llvmCacheHit = $env:LLVM_CACHE_HIT -eq "true"
+$cmakeCacheHit = $env:CMAKE_CACHE_HIT -eq "true"
+$cmakeInstalled = $false
+
+Write-Host "Cache status:"
+Write-Host "  TESSERACT_CACHE_HIT: $env:TESSERACT_CACHE_HIT (evaluated: $tesseractCacheHit)"
+Write-Host "  LLVM_CACHE_HIT: $env:LLVM_CACHE_HIT (evaluated: $llvmCacheHit)"
+Write-Host "  CMAKE_CACHE_HIT: $env:CMAKE_CACHE_HIT (evaluated: $cmakeCacheHit)"
+Write-Host ""
+try {
+  & cmake --version 2>$null
+  Write-Host "✓ CMake already installed"
+  $cmakeInstalled = $true
+}
+catch {
+  Write-Host "CMake not found, will attempt to install"
+}
+
+if (-not $tesseractCacheHit) {
+  Write-Host "Tesseract cache miss, installing (optional for build - needed for tests only)..."
+  if (-not (Retry-Command { choco install -y tesseract --no-progress } -MaxAttempts 3)) {
+    Write-Host "::warning::Failed to install Tesseract (optional dependency - gem build does not require it)"
+  }
+  else {
+    Write-Host "✓ Tesseract installed"
+    # Ensure tessdata directory exists and is accessible
+    $tesseractPath = "C:\Program Files\Tesseract-OCR"
+    if (Test-Path $tesseractPath) {
+      Write-Host "  Configuring Tesseract data paths..."
+
+      # Create tessdata directory if it doesn't exist
+      $tessdataPath = "$tesseractPath\tessdata"
+      if (-not (Test-Path $tessdataPath)) {
+        Write-Host "  Creating tessdata directory at: $tessdataPath"
+        New-Item -ItemType Directory -Path $tessdataPath -Force | Out-Null
+      }
+
+      # Download English language data if not present
+      if (-not (Test-Path "$tessdataPath\eng.traineddata")) {
+        Write-Host "  Downloading English language data..."
+        try {
+          $engUrl = "https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata"
+          Invoke-WebRequest -Uri $engUrl -OutFile "$tessdataPath\eng.traineddata" -ErrorAction Stop
+          Write-Host "  ✓ Downloaded eng.traineddata"
+        }
+        catch {
+          Write-Host "  ::warning::Failed to download eng.traineddata: $($_.Exception.Message)"
+        }
+      }
+
+      # Download OSD data if not present (needed for orientation detection)
+      if (-not (Test-Path "$tessdataPath\osd.traineddata")) {
+        Write-Host "  Downloading OSD data..."
+        try {
+          $osdUrl = "https://github.com/tesseract-ocr/tessdata_fast/raw/main/osd.traineddata"
+          Invoke-WebRequest -Uri $osdUrl -OutFile "$tessdataPath\osd.traineddata" -ErrorAction Stop
+          Write-Host "  ✓ Downloaded osd.traineddata"
+        }
+        catch {
+          Write-Host "  ::warning::Failed to download osd.traineddata: $($_.Exception.Message)"
+        }
+      }
+    }
+  }
+}
+else {
+  Write-Host "✓ Tesseract found in cache"
+}
+
+if (-not $llvmCacheHit) {
+  Write-Host "LLVM cache miss, installing LLVM/Clang (required for bindgen)..."
+  if (-not (Retry-Command { choco install -y llvm --no-progress } -MaxAttempts 3)) {
+    Write-Host "::warning::Failed to install LLVM/Clang via Chocolatey"
+  }
+  else {
+    Write-Host "✓ LLVM/Clang installed"
+  }
+}
+else {
+  Write-Host "✓ LLVM/Clang found in cache"
+}
+
+Write-Host "Installing PHP..."
+$phpInstalled = $false
+try {
+  & php --version 2>$null
+  Write-Host "✓ PHP already installed"
+  $phpInstalled = $true
+}
+catch {
+  Write-Host "PHP not found, installing via Chocolatey..."
+  if (-not (Retry-Command { choco install -y php --no-progress } -MaxAttempts 3)) {
+    Write-Host "::warning::Failed to install PHP via Chocolatey, will rely on shivammathur/setup-php action"
+  }
+  else {
+    Write-Host "✓ PHP installed via Chocolatey"
+    $phpInstalled = $true
+  }
+}
+
+Write-Host "Installing CMake..."
+if (-not $cmakeCacheHit) {
+  Write-Host "CMake cache miss, installing..."
+  if (-not (Retry-Command { choco install -y cmake --no-progress } -MaxAttempts 3)) {
+    throw "Failed to install CMake after 3 attempts"
+  }
+  Write-Host "✓ CMake installed"
+}
+else {
+  Write-Host "✓ CMake found in cache"
+}
+
+Write-Host "Configuring PATH and environment variables..."
+$paths = @(
+  "C:\Program Files\CMake\bin",
+  "C:\Program Files\Tesseract-OCR",
+  "C:\Program Files\LLVM\bin",
+  "C:\tools\php",
+  "C:\Program Files\PHP"
+)
+
+foreach ($path in $paths) {
+  if (Test-Path $path) {
+    Write-Host "  Adding to PATH: $path"
+    Write-Output $path | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
+    $env:PATH = "$path;$env:PATH"
+  }
+  else {
+    Write-Host "  Path not found (skipping): $path"
+  }
+}
+
+# Ensure TESSDATA_PREFIX is set for Windows OCR tests
+$tesseractPath = "C:\Program Files\Tesseract-OCR"
+if (Test-Path $tesseractPath) {
+  $tessdataPath = "$tesseractPath\tessdata"
+  if (Test-Path $tessdataPath) {
+    Write-Host "  Setting TESSDATA_PREFIX for tests: $tessdataPath"
+    Add-Content -Path $env:GITHUB_ENV -Value "TESSDATA_PREFIX=$tessdataPath"
+    $env:TESSDATA_PREFIX = $tessdataPath
+  }
+}
+
+Write-Host "::endgroup::"
+
+Write-Host "::group::Verifying Windows installations"
+
+Write-Host "Tesseract (optional for build):"
+try {
+  $tesseractCmd = Get-Command tesseract -ErrorAction Stop
+  $tesseractPath = $tesseractCmd.Path
+  Write-Host "  Found at: $tesseractPath"
+  Write-Host "  Command type: $($tesseractCmd.CommandType)"
+
+  # Get installation directory
+  $tesseractDir = Split-Path -Parent $tesseractPath
+  Write-Host "  Installation directory: $tesseractDir"
+
+  # Check for tessdata
+  $tessdataPath = Join-Path $tesseractDir "tessdata"
+  if (Test-Path $tessdataPath) {
+    Write-Host "  tessdata directory: $tessdataPath"
+    Write-Host "  Available language files:"
+    Get-ChildItem "$tessdataPath\*.traineddata" -ErrorAction SilentlyContinue | ForEach-Object {
+      Write-Host "    - $($_.Name)"
+    }
+  }
+  else {
+    Write-Host "  tessdata directory not found at: $tessdataPath"
+  }
+
+  try {
+    $version = & tesseract --version 2>&1
+    Write-Host "  Version output: $version"
+    Write-Host "✓ Tesseract available and working"
+
+    Write-Host ""
+    Write-Host "Available Tesseract languages:"
+    & tesseract --list-langs 2>&1 | ForEach-Object { Write-Host "  $_" }
+  }
+  catch {
+    Write-Host "⚠ Warning: Tesseract found but failed to run: $($_.Exception.Message)"
+  }
+
+  # Set TESSDATA_PREFIX environment variable for tests
+  if (Test-Path $tessdataPath) {
+    Write-Host ""
+    Write-Host "Setting TESSDATA_PREFIX environment variable..."
+    Add-Content -Path $env:GITHUB_ENV -Value "TESSDATA_PREFIX=$tessdataPath"
+    Write-Host "✓ Set TESSDATA_PREFIX=$tessdataPath in GITHUB_ENV"
+    $env:TESSDATA_PREFIX = $tessdataPath
+  }
+}
+catch {
+  Write-Host "⚠ Tesseract not found on PATH (not required for build)"
+  Write-Host "  Error details: $($_.Exception.Message)"
+  Write-Host "  Searching common installation locations..."
+
+  $commonPaths = @(
+    "C:\Program Files\Tesseract-OCR\tesseract.exe",
+    "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
+    "${env:ProgramFiles}\Tesseract-OCR\tesseract.exe",
+    "${env:ProgramFiles(x86)}\Tesseract-OCR\tesseract.exe"
+  )
+
+  $found = $false
+  foreach ($path in $commonPaths) {
+    if (Test-Path $path) {
+      Write-Host "  Found Tesseract at: $path (not on PATH)"
+      $tesseractDir = Split-Path -Parent $path
+      $tessdataPath = Join-Path $tesseractDir "tessdata"
+      if (Test-Path $tessdataPath) {
+        Write-Host "  Found tessdata at: $tessdataPath"
+        Add-Content -Path $env:GITHUB_ENV -Value "TESSDATA_PREFIX=$tessdataPath"
+        Write-Host "✓ Set TESSDATA_PREFIX=$tessdataPath in GITHUB_ENV"
+        $env:TESSDATA_PREFIX = $tessdataPath
+      }
+      $found = $true
+      break
+    }
+  }
+
+  if (-not $found) {
+    Write-Host "  Tesseract not found in common locations"
+  }
+}
+
+Write-Host ""
+Write-Host "CMake:"
+try {
+  & cmake --version
+  Write-Host "✓ CMake available"
+  # Export CMAKE environment variable for immediate availability in build scripts
+  $cmakePath = (Get-Command cmake -ErrorAction Stop).Source
+  if ($cmakePath) {
+    Add-Content -Path $env:GITHUB_ENV -Value "CMAKE=$cmakePath"
+    Write-Host "✓ Set CMAKE=$cmakePath in GITHUB_ENV"
+  }
+}
+catch {
+  Write-Host "::error::CMake not found after installation"
+  throw "CMake verification failed"
+}
+
+Write-Host ""
+Write-Host "Clang:"
+try {
+  & clang --version
+  Write-Host "✓ Clang available"
+}
+catch {
+  Write-Host "⚠ Warning: Clang not currently available on PATH"
+}
+
+Write-Host ""
+Write-Host "PHP:"
+try {
+  & php --version
+  Write-Host "✓ PHP available"
+}
+catch {
+  Write-Host "⚠ Warning: PHP not currently available on PATH (will be set up by shivammathur/setup-php action)"
+}
+
+Write-Host "::endgroup::"
--- a/scripts/ci/r/vendor-kreuzberg-core.py
+++ b/scripts/ci/r/vendor-kreuzberg-core.py
@@ -0,0 +1,433 @@
+#!/usr/bin/env python3
+"""
+Vendor kreuzberg core crate into R package
+Used by: ci-r.yaml - Vendor kreuzberg core crate step
+
+This script:
+1. Reads workspace.dependencies from root Cargo.toml
+2. Copies core crates to packages/r/vendor/
+3. Replaces workspace = true with explicit versions
+4. Generates vendor/Cargo.toml with proper workspace setup
+"""
+
+import os
+import sys
+import shutil
+import re
+from pathlib import Path
+
+try:
+    import tomllib
+except ImportError:
+    import tomli as tomllib  # type: ignore
+
+
+def get_repo_root() -> Path:
+    """Get repository root directory."""
+    repo_root_env = os.environ.get("REPO_ROOT")
+    if repo_root_env:
+        return Path(repo_root_env)
+
+    script_dir = Path(__file__).parent.absolute()
+    return (script_dir / ".." / ".." / "..").resolve()
+
+
+def read_toml(path: Path) -> dict[str, object]:
+    """Read TOML file."""
+    with open(path, "rb") as f:
+        return tomllib.load(f)
+
+
+def get_workspace_deps(repo_root: Path) -> dict[str, object]:
+    """Extract workspace.dependencies from root Cargo.toml."""
+    cargo_toml_path = repo_root / "Cargo.toml"
+    data = read_toml(cargo_toml_path)
+    return data.get("workspace", {}).get("dependencies", {})
+
+
+def get_workspace_version(repo_root: Path) -> str:
+    """Extract version from workspace.package."""
+    cargo_toml_path = repo_root / "Cargo.toml"
+    data = read_toml(cargo_toml_path)
+    return data.get("workspace", {}).get("package", {}).get("version", "4.0.0")
+
+
+def format_dependency(name: str, dep_spec: object) -> str:
+    """Format a dependency spec for Cargo.toml."""
+    if isinstance(dep_spec, str):
+        return f'{name} = "{dep_spec}"'
+    elif isinstance(dep_spec, dict):
+        version: str = dep_spec.get("version", "")
+        package: str | None = dep_spec.get("package")
+        features: list[str] = dep_spec.get("features", [])
+        default_features: bool | None = dep_spec.get("default-features")
+        optional: bool | None = dep_spec.get("optional")
+
+        path: str | None = dep_spec.get("path")
+        git: str | None = dep_spec.get("git")
+        branch: str | None = dep_spec.get("branch")
+        tag: str | None = dep_spec.get("tag")
+        rev: str | None = dep_spec.get("rev")
+
+        parts: list[str] = []
+
+        if package:
+            parts.append(f'package = "{package}"')
+
+        if git:
+            parts.append(f'git = "{git}"')
+
+        if branch:
+            parts.append(f'branch = "{branch}"')
+
+        if tag:
+            parts.append(f'tag = "{tag}"')
+
+        if rev:
+            parts.append(f'rev = "{rev}"')
+
+        if path:
+            parts.append(f'path = "{path}"')
+
+        if version:
+            parts.append(f'version = "{version}"')
+
+        if features:
+            features_str = ', '.join(f'"{f}"' for f in features)
+            parts.append(f'features = [{features_str}]')
+
+        if default_features is False:
+            parts.append('default-features = false')
+        elif default_features is True:
+            parts.append('default-features = true')
+
+        if optional is True:
+            parts.append('optional = true')
+        elif optional is False:
+            parts.append('optional = false')
+
+        spec_str = ", ".join(parts)
+        return f"{name} = {{ {spec_str} }}"
+
+    return f'{name} = "{dep_spec}"'
+
+
+def replace_workspace_deps_in_toml(toml_path: Path, workspace_deps: dict[str, object]) -> None:
+    """Replace workspace = true with explicit versions in a Cargo.toml file."""
+    with open(toml_path, "r") as f:
+        content = f.read()
+
+    for name, dep_spec in workspace_deps.items():
+        pattern1 = rf'^{re.escape(name)} = \{{ workspace = true \}}$'
+        content = re.sub(pattern1, format_dependency(name, dep_spec), content, flags=re.MULTILINE)
+
+        def replace_with_fields(match: re.Match[str]) -> str:
+            other_fields_str = match.group(1).strip()
+            base_spec = format_dependency(name, dep_spec)
+            if " = { " not in base_spec:
+                # Simple string dep like `ctor = "0.6"` - wrap it
+                version_val = base_spec.split(" = ", 1)[1].strip('"')
+                spec_part = f'version = "{version_val}"'
+            else:
+                spec_part = base_spec.split(" = { ", 1)[1].rstrip("} ").rstrip("}")
+
+            # Extract existing keys and values from workspace spec, handling nested brackets
+            workspace_fields: dict[str, str] = {}
+            bracket_depth = 0
+            current_field = ""
+            for char in spec_part:
+                if char == '[':
+                    bracket_depth += 1
+                    current_field += char
+                elif char == ']':
+                    bracket_depth -= 1
+                    current_field += char
+                elif char == ',' and bracket_depth == 0:
+                    # End of field
+                    field = current_field.strip()
+                    if field and "=" in field:
+                        key, val = field.split("=", 1)
+                        workspace_fields[key.strip()] = val.strip()
+                    current_field = ""
+                else:
+                    current_field += char
+
+            # Don't forget the last field
+            if current_field.strip():
+                field = current_field.strip()
+                if field and "=" in field:
+                    key, val = field.split("=", 1)
+                    workspace_fields[key.strip()] = val.strip()
+
+            # Extract crate-specific keys using bracket-aware parsing
+            crate_fields: dict[str, str] = {}
+            bracket_depth = 0
+            current_field = ""
+            for char in other_fields_str:
+                if char == '[':
+                    bracket_depth += 1
+                    current_field += char
+                elif char == ']':
+                    bracket_depth -= 1
+                    current_field += char
+                elif char == ',' and bracket_depth == 0:
+                    # End of field
+                    field = current_field.strip()
+                    if field and "=" in field:
+                        key, val = field.split("=", 1)
+                        crate_fields[key.strip()] = val.strip()
+                    current_field = ""
+                else:
+                    current_field += char
+
+            # Don't forget the last field
+            if current_field.strip():
+                field = current_field.strip()
+                if field and "=" in field:
+                    key, val = field.split("=", 1)
+                    crate_fields[key.strip()] = val.strip()
+
+            # Merge: crate-specific fields override workspace fields
+            merged_fields = {**workspace_fields, **crate_fields}
+
+            # Build result from merged fields
+            merged_parts = [f"{k} = {v}" for k, v in merged_fields.items()]
+            merged_spec = ", ".join(merged_parts)
+
+            return f"{name} = {{ {merged_spec} }}"
+
+        pattern2 = rf'^{re.escape(name)} = \{{ workspace = true, (.+?) \}}$'
+        content = re.sub(pattern2, replace_with_fields, content, flags=re.MULTILINE | re.DOTALL)
+
+    with open(toml_path, "w") as f:
+        f.write(content)
+
+
+def generate_vendor_cargo_toml(repo_root: Path, workspace_deps: dict[str, object], core_version: str, copied_crates: list[str]) -> None:
+    """Generate vendor/Cargo.toml with workspace setup.
+
+    Args:
+        repo_root: Repository root directory
+        workspace_deps: Workspace dependencies from Cargo.toml
+        core_version: Core version string
+        copied_crates: List of crates that were successfully copied
+    """
+
+    deps_lines: list[str] = []
+    for name, dep_spec in sorted(workspace_deps.items()):
+        deps_lines.append(format_dependency(name, dep_spec))
+
+    deps_str = "\n".join(deps_lines)
+
+    # Build members list based on actually copied crates
+    members = [name for name in ["kreuzberg", "kreuzberg-ffi", "kreuzberg-tesseract", "kreuzberg-paddle-ocr"]
+               if name in copied_crates]
+    members_str = ', '.join(f'"{m}"' for m in members)
+
+    vendor_toml = f'''[workspace]
+members = [{members_str}]
+
+[workspace.package]
+version = "{core_version}"
+edition = "2024"
+rust-version = "1.91"
+authors = ["Na'aman Hirschfeld <naaman@kreuzberg.dev>"]
+license = "MIT"
+repository = "https://github.com/kreuzberg-dev/kreuzberg"
+homepage = "https://kreuzberg.dev"
+
+[workspace.dependencies]
+{deps_str}
+'''
+
+    vendor_dir = repo_root / "packages" / "r" / "vendor"
+    vendor_dir.mkdir(parents=True, exist_ok=True)
+
+    toml_path = vendor_dir / "Cargo.toml"
+    with open(toml_path, "w") as f:
+        f.write(vendor_toml)
+
+
+def main() -> None:
+    """Main vendoring function."""
+    repo_root: Path = get_repo_root()
+
+    print("=== Vendoring kreuzberg core crate ===")
+
+    workspace_deps: dict[str, object] = get_workspace_deps(repo_root)
+    core_version: str = get_workspace_version(repo_root)
+
+    print(f"Core version: {core_version}")
+    print(f"Workspace dependencies: {len(workspace_deps)}")
+
+    vendor_base: Path = repo_root / "packages" / "r" / "vendor"
+
+    # Clean only crate directories
+    crate_names = ["kreuzberg", "kreuzberg-ffi", "kreuzberg-tesseract",
+                   "kreuzberg-paddle-ocr"]
+    for name in crate_names:
+        crate_path = vendor_base / name
+        if crate_path.exists():
+            shutil.rmtree(crate_path)
+    # Also clean the vendor Cargo.toml (will be regenerated)
+    vendor_cargo = vendor_base / "Cargo.toml"
+    if vendor_cargo.exists():
+        vendor_cargo.unlink()
+    print("Cleaned vendor crate directories")
+
+    vendor_base.mkdir(parents=True, exist_ok=True)
+
+    crates_to_copy: list[tuple[str, str]] = [
+        ("crates/kreuzberg", "kreuzberg"),
+        ("crates/kreuzberg-ffi", "kreuzberg-ffi"),
+        ("crates/kreuzberg-tesseract", "kreuzberg-tesseract"),
+        ("crates/kreuzberg-paddle-ocr", "kreuzberg-paddle-ocr"),
+    ]
+
+    copied_crates: list[str] = []
+    for src_rel, dest_name in crates_to_copy:
+        src: Path = repo_root / src_rel
+        dest: Path = vendor_base / dest_name
+        if src.exists():
+            try:
+                shutil.copytree(src, dest)
+                copied_crates.append(dest_name)
+                print(f"Copied {dest_name}")
+            except Exception as e:
+                print(f"Warning: Failed to copy {dest_name}: {e}", file=sys.stderr)
+        else:
+            print(f"Warning: Source directory not found: {src_rel}")
+
+    artifact_dirs: list[str] = [".fastembed_cache", "target"]
+    temp_patterns: list[str] = ["*.swp", "*.bak", "*.tmp", "*~"]
+
+    for crate_dir in copied_crates:
+        crate_path: Path = vendor_base / crate_dir
+        if crate_path.exists():
+            for artifact_dir in artifact_dirs:
+                artifact: Path = crate_path / artifact_dir
+                if artifact.exists():
+                    shutil.rmtree(artifact)
+
+            for pattern in temp_patterns:
+                for f in crate_path.rglob(pattern):
+                    f.unlink()
+
+    print("Cleaned build artifacts")
+
+    # Update workspace inheritance in Cargo.toml files
+    for crate_dir in copied_crates:
+        crate_toml = vendor_base / crate_dir / "Cargo.toml"
+        if crate_toml.exists():
+            with open(crate_toml, "r") as f:
+                content = f.read()
+
+            content = re.sub(r'^version\.workspace = true$', f'version = "{core_version}"', content, flags=re.MULTILINE)
+            content = re.sub(r'^edition\.workspace = true$', 'edition = "2024"', content, flags=re.MULTILINE)
+            content = re.sub(r'^rust-version\.workspace = true$', 'rust-version = "1.91"', content, flags=re.MULTILINE)
+            content = re.sub(r'^authors\.workspace = true$', 'authors = ["Na\'aman Hirschfeld <naaman@kreuzberg.dev>"]', content, flags=re.MULTILINE)
+            content = re.sub(r'^license\.workspace = true$', 'license = "MIT"', content, flags=re.MULTILINE)
+
+            with open(crate_toml, "w") as f:
+                f.write(content)
+
+            replace_workspace_deps_in_toml(crate_toml, workspace_deps)
+            print(f"Updated {crate_dir}/Cargo.toml")
+
+    # Update path dependencies in all crates that depend on other vendored crates
+    # First handle kreuzberg-ffi's dependency on kreuzberg
+    if "kreuzberg-ffi" in copied_crates:
+        ffi_toml = vendor_base / "kreuzberg-ffi" / "Cargo.toml"
+        if ffi_toml.exists():
+            with open(ffi_toml, "r") as f:
+                content = f.read()
+
+            if "kreuzberg" in copied_crates:
+                # Replace kreuzberg workspace references with path dependency
+                # Handle cases with path, version, or neither
+                content = re.sub(
+                    r'(kreuzberg = \{) (?:(?:path|version) = "[^"]*", )?',
+                    r'\1 path = "../kreuzberg", ',
+                    content
+                )
+
+            with open(ffi_toml, "w") as f:
+                f.write(content)
+
+    # Update path dependencies in kreuzberg crate if tesseract was copied
+    if "kreuzberg" in copied_crates:
+        kreuzberg_toml = vendor_base / "kreuzberg" / "Cargo.toml"
+        if kreuzberg_toml.exists():
+            with open(kreuzberg_toml, "r") as f:
+                content = f.read()
+
+            # Only update tesseract path if it was actually copied
+            if "kreuzberg-tesseract" in copied_crates:
+                content = re.sub(
+                    r'kreuzberg-tesseract = \{ version = "[^"]*", optional = true \}',
+                    'kreuzberg-tesseract = { path = "../kreuzberg-tesseract", optional = true }',
+                    content
+                )
+            # Only update paddle-ocr path if it was actually copied
+            if "kreuzberg-paddle-ocr" in copied_crates:
+                content = re.sub(
+                    r'kreuzberg-paddle-ocr = \{ version = "[^"]*", optional = true \}',
+                    'kreuzberg-paddle-ocr = { path = "../kreuzberg-paddle-ocr", optional = true }',
+                    content
+                )
+
+            with open(kreuzberg_toml, "w") as f:
+                f.write(content)
+
+    generate_vendor_cargo_toml(repo_root, workspace_deps, core_version, copied_crates)
+    print("Generated vendor/Cargo.toml")
+
+    # Copy root Cargo.lock so vendor workspace uses identical dependency versions
+    root_lock = repo_root / "Cargo.lock"
+    vendor_lock = vendor_base / "Cargo.lock"
+    if root_lock.exists():
+        shutil.copy2(root_lock, vendor_lock)
+        print("Copied Cargo.lock to vendor directory")
+
+    # Update R package Cargo.toml to use vendored crates
+    r_toml = repo_root / "packages" / "r" / "src" / "rust" / "Cargo.toml"
+    if r_toml.exists():
+        with open(r_toml, "r") as f:
+            content = f.read()
+
+        # Replace path dependencies to point to vendored crates
+        # From: path = "../../../../crates/kreuzberg"
+        # To: path = "../../vendor/kreuzberg"
+        content = re.sub(
+            r'path = "\.\./\.\./\.\./\.\./crates/kreuzberg"',
+            'path = "../../vendor/kreuzberg"',
+            content
+        )
+        content = re.sub(
+            r'path = "\.\./\.\./\.\./\.\./crates/kreuzberg-ffi"',
+            'path = "../../vendor/kreuzberg-ffi"',
+            content
+        )
+
+        with open(r_toml, "w") as f:
+            f.write(content)
+
+        print("Updated R package Cargo.toml to use vendored crates")
+
+    print(f"\nVendoring complete (core version: {core_version})")
+    print(f"Copied crates: {', '.join(sorted(copied_crates))}")
+
+    if "kreuzberg" in copied_crates and "kreuzberg-ffi" in copied_crates:
+        print("R package Cargo.toml uses:")
+        print("  - path '../../vendor/kreuzberg' for kreuzberg crate")
+        print("  - path '../../vendor/kreuzberg-ffi' for kreuzberg-ffi crate")
+    else:
+        print("Warning: Some required crates were not copied. Check for missing source directories.")
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception as e:
+        print(f"Error: {e}", file=sys.stderr)
+        sys.exit(1)
--- a/scripts/ci/ruby/compile-extension.sh
+++ b/scripts/ci/ruby/compile-extension.sh
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
+
+source "$REPO_ROOT/scripts/lib/common.sh"
+source "$REPO_ROOT/scripts/lib/library-paths.sh"
+
+validate_repo_root "$REPO_ROOT" || exit 1
+setup_rust_ffi_paths "$REPO_ROOT"
+
+echo "=== Compiling Ruby native extension (Verbose Debug) ==="
+cd "$REPO_ROOT/packages/ruby"
+
+export CARGO_BUILD_JOBS=1
+export RUST_BACKTRACE=1
+export RB_SYS_VERBOSE=1
+
+echo ""
+echo "=== Pre-compilation environment ==="
+echo "Ruby version: $(ruby --version)"
+echo "Ruby platform: $(ruby -e 'puts RUBY_PLATFORM')"
+echo "Rustc version: $(rustc --version)"
+echo "Cargo version: $(cargo --version)"
+echo "Working directory: $(pwd)"
+echo ""
+
+echo "=== Build configuration variables ==="
+echo "CARGO_BUILD_JOBS: ${CARGO_BUILD_JOBS}"
+echo "RUST_BACKTRACE: ${RUST_BACKTRACE}"
+echo "RB_SYS_VERBOSE: ${RB_SYS_VERBOSE}"
+echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-<not set>}"
+echo "DYLD_LIBRARY_PATH: ${DYLD_LIBRARY_PATH:-<not set>}"
+echo ""
+
+echo "=== Pre-vendor directory state ==="
+echo "packages/ruby directory contents:"
+find . -maxdepth 1 -type f -o -maxdepth 1 -type d | head -20
+echo ""
+
+echo "=== Vendoring kreuzberg core ==="
+python3 "$REPO_ROOT/scripts/ci/ruby/vendor-kreuzberg-core.py"
+
+echo ""
+echo "=== Post-vendor directory state ==="
+if [ -d "ext/kreuzberg_rb/vendor" ]; then
+  echo "Vendor directory contents:"
+  find ext/kreuzberg_rb/vendor -maxdepth 2 -type f | head -10
+else
+  echo "WARNING: No vendor directory found in ext/kreuzberg_rb"
+fi
+echo ""
+
+echo "=== Running rake compile with verbose output ==="
+bundle exec rake compile --verbose --trace 2>&1 || {
+  echo ""
+  echo "ERROR: rake compile failed"
+  echo "=== Attempting to capture compilation error details ==="
+
+  if [ -f "mkmf.log" ]; then
+    echo "=== mkmf.log (last 150 lines) ==="
+    tail -150 mkmf.log
+  fi
+
+  echo ""
+  echo "=== Looking for compiled artifacts ==="
+  find . -name "*.so" -o -name "*.dll" -o -name "*.dylib" 2>/dev/null | head -20
+
+  echo ""
+  echo "=== Checking gem installation ==="
+  gem list kreuzberg || echo "Gem not found"
+
+  exit 1
+}
+
+echo ""
+echo "=== Post-compilation directory state ==="
+echo "lib/ contents:"
+if [ -d "lib" ]; then
+  find lib -type f -name "*.so" -o -name "*.dll" -o -name "*.dylib" 2>/dev/null || echo "No compiled extension found"
+else
+  echo "ERROR: lib directory not found"
+fi
+echo ""
+
+echo "=== Verifying extension can be loaded ==="
+ruby -e "require_relative 'lib/kreuzberg'; puts 'Extension loaded successfully'" 2>&1 || {
+  echo "WARNING: Could not load extension directly"
+  echo "This might be expected if gem installation is required"
+}
+
+echo ""
+echo "=== Compilation complete ==="
--- a/scripts/ci/ruby/install-bundler.sh
+++ b/scripts/ci/ruby/install-bundler.sh
@@ -0,0 +1,5 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+gem install bundler -v 4.0.3 --no-document || gem install bundler --no-document
+bundler --version
--- a/scripts/ci/ruby/install-ruby-deps.sh
+++ b/scripts/ci/ruby/install-ruby-deps.sh
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
+
+source "$REPO_ROOT/scripts/lib/common.sh"
+
+validate_repo_root "$REPO_ROOT" || exit 1
+
+echo "=== Installing Ruby dependencies ==="
+cd "$REPO_ROOT/packages/ruby"
+
+bundle_path="${BUNDLE_PATH:-$REPO_ROOT/packages/ruby/.bundle/bundle}"
+
+if [[ -n "${GITHUB_ENV:-}" ]]; then
+  if [[ -z "${BUNDLE_GEMFILE:-}" ]]; then
+    echo "BUNDLE_GEMFILE=$REPO_ROOT/packages/ruby/Gemfile" >>"$GITHUB_ENV"
+  fi
+  if [[ -z "${BUNDLE_PATH:-}" ]]; then
+    echo "BUNDLE_PATH=$bundle_path" >>"$GITHUB_ENV"
+  fi
+fi
+
+bundle config set deployment false
+bundle config set path "$bundle_path"
+bundle install --jobs 4
+
+echo "Ruby dependencies installed"
--- a/scripts/ci/ruby/vendor-kreuzberg-core.py
+++ b/scripts/ci/ruby/vendor-kreuzberg-core.py
@@ -0,0 +1,430 @@
+#!/usr/bin/env python3
+"""
+Vendor kreuzberg core crate into Ruby package
+Used by: ci-ruby.yaml - Vendor kreuzberg core crate step
+
+This script:
+1. Reads workspace.dependencies from root Cargo.toml
+2. Copies core crates to packages/ruby/vendor/
+3. Replaces workspace = true with explicit versions
+4. Generates vendor/Cargo.toml with proper workspace setup
+"""
+
+import os
+import sys
+import shutil
+import re
+from pathlib import Path
+
+try:
+    import tomllib
+except ImportError:
+    import tomli as tomllib  # type: ignore[import-not-found]
+
+
+def get_repo_root() -> Path:
+    """Get repository root directory."""
+    repo_root_env = os.environ.get("REPO_ROOT")
+    if repo_root_env:
+        return Path(repo_root_env)
+
+    script_dir = Path(__file__).parent.absolute()
+    return (script_dir / ".." / ".." / "..").resolve()
+
+
+def read_toml(path: Path) -> dict[str, object]:
+    """Read TOML file."""
+    with open(path, "rb") as f:
+        return tomllib.load(f)
+
+
+def get_workspace_deps(repo_root: Path) -> dict[str, object]:
+    """Extract workspace.dependencies from root Cargo.toml."""
+    cargo_toml_path = repo_root / "Cargo.toml"
+    data = read_toml(cargo_toml_path)
+    return data.get("workspace", {}).get("dependencies", {})
+
+
+def get_workspace_version(repo_root: Path) -> str:
+    """Extract version from workspace.package."""
+    cargo_toml_path = repo_root / "Cargo.toml"
+    data = read_toml(cargo_toml_path)
+    return data.get("workspace", {}).get("package", {}).get("version", "4.0.0")
+
+
+def format_dependency(name: str, dep_spec: object) -> str:
+    """Format a dependency spec for Cargo.toml."""
+    if isinstance(dep_spec, str):
+        return f'{name} = "{dep_spec}"'
+    elif isinstance(dep_spec, dict):
+        version: str = dep_spec.get("version", "")
+        package: str | None = dep_spec.get("package")
+        features: list[str] = dep_spec.get("features", [])
+        default_features: bool | None = dep_spec.get("default-features")
+
+        optional: bool | None = dep_spec.get("optional")
+
+        path: str | None = dep_spec.get("path")
+        git: str | None = dep_spec.get("git")
+        branch: str | None = dep_spec.get("branch")
+        tag: str | None = dep_spec.get("tag")
+        rev: str | None = dep_spec.get("rev")
+
+        parts: list[str] = []
+
+        if package:
+            parts.append(f'package = "{package}"')
+
+        if git:
+            parts.append(f'git = "{git}"')
+
+        if branch:
+            parts.append(f'branch = "{branch}"')
+
+        if tag:
+            parts.append(f'tag = "{tag}"')
+
+        if rev:
+            parts.append(f'rev = "{rev}"')
+
+        if path:
+            parts.append(f'path = "{path}"')
+
+        if version:
+            parts.append(f'version = "{version}"')
+
+        if features:
+            features_str = ', '.join(f'"{f}"' for f in features)
+            parts.append(f'features = [{features_str}]')
+
+        if default_features is False:
+            parts.append('default-features = false')
+        elif default_features is True:
+            parts.append('default-features = true')
+
+        if optional is True:
+            parts.append('optional = true')
+        elif optional is False:
+            parts.append('optional = false')
+
+        spec_str = ", ".join(parts)
+        return f"{name} = {{ {spec_str} }}"
+
+    return f'{name} = "{dep_spec}"'
+
+
+def replace_workspace_deps_in_toml(toml_path: Path, workspace_deps: dict[str, object]) -> None:
+    """Replace workspace = true with explicit versions in a Cargo.toml file."""
+    with open(toml_path, "r") as f:
+        content = f.read()
+
+    for name, dep_spec in workspace_deps.items():
+        pattern1 = rf'^{re.escape(name)} = \{{ workspace = true \}}$'
+        content = re.sub(pattern1, format_dependency(name, dep_spec), content, flags=re.MULTILINE)
+
+        def replace_with_fields(match: re.Match[str]) -> str:
+            other_fields_str = match.group(1).strip()
+            base_spec = format_dependency(name, dep_spec)
+            if " = { " not in base_spec:
+                # Simple string dep like `ctor = "0.6"` - wrap it
+                version_val = base_spec.split(" = ", 1)[1].strip('"')
+                spec_part = f'version = "{version_val}"'
+            else:
+                spec_part = base_spec.split(" = { ", 1)[1].rstrip("} ").rstrip("}")
+
+            # Extract existing keys and values from workspace spec, handling nested brackets
+            workspace_fields: dict[str, str] = {}
+            bracket_depth = 0
+            current_field = ""
+            for char in spec_part:
+                if char == '[':
+                    bracket_depth += 1
+                    current_field += char
+                elif char == ']':
+                    bracket_depth -= 1
+                    current_field += char
+                elif char == ',' and bracket_depth == 0:
+                    # End of field
+                    field = current_field.strip()
+                    if field and "=" in field:
+                        key, val = field.split("=", 1)
+                        workspace_fields[key.strip()] = val.strip()
+                    current_field = ""
+                else:
+                    current_field += char
+
+            # Don't forget the last field
+            if current_field.strip():
+                field = current_field.strip()
+                if field and "=" in field:
+                    key, val = field.split("=", 1)
+                    workspace_fields[key.strip()] = val.strip()
+
+            # Extract crate-specific keys using bracket-aware parsing
+            crate_fields: dict[str, str] = {}
+            bracket_depth = 0
+            current_field = ""
+            for char in other_fields_str:
+                if char == '[':
+                    bracket_depth += 1
+                    current_field += char
+                elif char == ']':
+                    bracket_depth -= 1
+                    current_field += char
+                elif char == ',' and bracket_depth == 0:
+                    # End of field
+                    field = current_field.strip()
+                    if field and "=" in field:
+                        key, val = field.split("=", 1)
+                        crate_fields[key.strip()] = val.strip()
+                    current_field = ""
+                else:
+                    current_field += char
+
+            # Don't forget the last field
+            if current_field.strip():
+                field = current_field.strip()
+                if field and "=" in field:
+                    key, val = field.split("=", 1)
+                    crate_fields[key.strip()] = val.strip()
+
+            # Merge: crate-specific fields override workspace fields
+            merged_fields = {**workspace_fields, **crate_fields}
+
+            # Build result from merged fields
+            merged_parts = [f"{k} = {v}" for k, v in merged_fields.items()]
+            merged_spec = ", ".join(merged_parts)
+
+            return f"{name} = {{ {merged_spec} }}"
+
+        pattern2 = rf'^{re.escape(name)} = \{{ workspace = true, (.+?) \}}$'
+        content = re.sub(pattern2, replace_with_fields, content, flags=re.MULTILINE | re.DOTALL)
+
+    with open(toml_path, "w") as f:
+        f.write(content)
+
+
+def generate_vendor_cargo_toml(repo_root: Path, workspace_deps: dict[str, object], core_version: str, copied_crates: list[str]) -> None:
+    """Generate vendor/Cargo.toml with workspace setup.
+
+    Args:
+        repo_root: Repository root directory
+        workspace_deps: Workspace dependencies from Cargo.toml
+        core_version: Core version string
+        copied_crates: List of crates that were successfully copied
+    """
+
+    deps_lines: list[str] = []
+    for name, dep_spec in sorted(workspace_deps.items()):
+        deps_lines.append(format_dependency(name, dep_spec))
+
+    deps_str = "\n".join(deps_lines)
+
+    # Build members list based on actually copied crates
+    members = [name for name in ["kreuzberg", "kreuzberg-ffi", "kreuzberg-tesseract", "kreuzberg-paddle-ocr", "rb-sys"]
+               if name in copied_crates]
+    members_str = ', '.join(f'"{m}"' for m in members)
+
+    vendor_toml = f'''[workspace]
+members = [{members_str}]
+
+[workspace.package]
+version = "{core_version}"
+edition = "2024"
+rust-version = "1.91"
+authors = ["Na'aman Hirschfeld <naaman@kreuzberg.dev>"]
+license = "MIT"
+repository = "https://github.com/kreuzberg-dev/kreuzberg"
+homepage = "https://kreuzberg.dev"
+
+[workspace.dependencies]
+{deps_str}
+'''
+
+    vendor_dir = repo_root / "packages" / "ruby" / "vendor"
+    vendor_dir.mkdir(parents=True, exist_ok=True)
+
+    toml_path = vendor_dir / "Cargo.toml"
+    with open(toml_path, "w") as f:
+        f.write(vendor_toml)
+
+
+def main() -> None:
+    """Main vendoring function."""
+    repo_root: Path = get_repo_root()
+
+    print("=== Vendoring kreuzberg core crate ===")
+
+    workspace_deps: dict[str, object] = get_workspace_deps(repo_root)
+    core_version: str = get_workspace_version(repo_root)
+
+    print(f"Core version: {core_version}")
+    print(f"Workspace dependencies: {len(workspace_deps)}")
+
+    vendor_base: Path = repo_root / "packages" / "ruby" / "vendor"
+
+    # Clean only crate directories, preserving vendor/bundle/ (Bundler gems)
+    crate_names = ["kreuzberg", "kreuzberg-ffi", "kreuzberg-tesseract",
+                   "kreuzberg-paddle-ocr", "rb-sys"]
+    for name in crate_names:
+        crate_path = vendor_base / name
+        if crate_path.exists():
+            shutil.rmtree(crate_path)
+    # Also clean the vendor Cargo.toml (will be regenerated)
+    vendor_cargo = vendor_base / "Cargo.toml"
+    if vendor_cargo.exists():
+        vendor_cargo.unlink()
+    print("Cleaned vendor crate directories")
+
+    vendor_base.mkdir(parents=True, exist_ok=True)
+
+    crates_to_copy: list[tuple[str, str]] = [
+        ("crates/kreuzberg", "kreuzberg"),
+        ("crates/kreuzberg-ffi", "kreuzberg-ffi"),
+        ("crates/kreuzberg-tesseract", "kreuzberg-tesseract"),
+        ("crates/kreuzberg-paddle-ocr", "kreuzberg-paddle-ocr"),
+        ("vendor/rb-sys", "rb-sys"),
+    ]
+
+    copied_crates: list[str] = []
+    for src_rel, dest_name in crates_to_copy:
+        src: Path = repo_root / src_rel
+        dest: Path = vendor_base / dest_name
+        if src.exists():
+            try:
+                shutil.copytree(src, dest)
+                copied_crates.append(dest_name)
+                print(f"Copied {dest_name}")
+            except Exception as e:
+                print(f"Warning: Failed to copy {dest_name}: {e}", file=sys.stderr)
+        else:
+            print(f"Warning: Source directory not found: {src_rel}")
+
+    artifact_dirs: list[str] = [".fastembed_cache", "target"]
+    temp_patterns: list[str] = ["*.swp", "*.bak", "*.tmp", "*~"]
+
+    for crate_dir in copied_crates:
+        crate_path: Path = vendor_base / crate_dir
+        if crate_path.exists():
+            for artifact_dir in artifact_dirs:
+                artifact: Path = crate_path / artifact_dir
+                if artifact.exists():
+                    shutil.rmtree(artifact)
+
+            for pattern in temp_patterns:
+                for f in crate_path.rglob(pattern):
+                    f.unlink()
+
+    print("Cleaned build artifacts")
+
+    # Update workspace inheritance in Cargo.toml files
+    for crate_dir in copied_crates:
+        crate_toml = vendor_base / crate_dir / "Cargo.toml"
+        if crate_toml.exists():
+            with open(crate_toml, "r") as f:
+                content = f.read()
+
+            content = re.sub(r'^version\.workspace = true$', f'version = "{core_version}"', content, flags=re.MULTILINE)
+            content = re.sub(r'^edition\.workspace = true$', 'edition = "2024"', content, flags=re.MULTILINE)
+            content = re.sub(r'^rust-version\.workspace = true$', 'rust-version = "1.91"', content, flags=re.MULTILINE)
+            content = re.sub(r'^authors\.workspace = true$', 'authors = ["Na\'aman Hirschfeld <naaman@kreuzberg.dev>"]', content, flags=re.MULTILINE)
+            content = re.sub(r'^license\.workspace = true$', 'license = "MIT"', content, flags=re.MULTILINE)
+
+            with open(crate_toml, "w") as f:
+                f.write(content)
+
+            replace_workspace_deps_in_toml(crate_toml, workspace_deps)
+            print(f"Updated {crate_dir}/Cargo.toml")
+
+    # Update path dependencies in kreuzberg-ffi crate
+    if "kreuzberg-ffi" in copied_crates and "kreuzberg" in copied_crates:
+        ffi_toml = vendor_base / "kreuzberg-ffi" / "Cargo.toml"
+        if ffi_toml.exists():
+            with open(ffi_toml, "r") as f:
+                content = f.read()
+
+            # Replace kreuzberg workspace references with path dependency
+            # Handle cases with path, version, or neither
+            content = re.sub(
+                r'(kreuzberg = \{) (?:(?:path|version) = "[^"]*", )?',
+                r'\1 path = "../kreuzberg", ',
+                content
+            )
+
+            with open(ffi_toml, "w") as f:
+                f.write(content)
+
+    # Update path dependencies in kreuzberg crate if tesseract was copied
+    if "kreuzberg" in copied_crates:
+        kreuzberg_toml = vendor_base / "kreuzberg" / "Cargo.toml"
+        if kreuzberg_toml.exists():
+            with open(kreuzberg_toml, "r") as f:
+                content = f.read()
+
+            # Only update tesseract path if it was actually copied
+            if "kreuzberg-tesseract" in copied_crates:
+                content = re.sub(
+                    r'kreuzberg-tesseract = \{ (?:path = "[^"]*", )?version = "[^"]*", optional = true \}',
+                    'kreuzberg-tesseract = { path = "../kreuzberg-tesseract", optional = true }',
+                    content
+                )
+            # Only update paddle-ocr path if it was actually copied
+            if "kreuzberg-paddle-ocr" in copied_crates:
+                content = re.sub(
+                    r'kreuzberg-paddle-ocr = \{ (?:path = "[^"]*", )?version = "[^"]*", optional = true \}',
+                    'kreuzberg-paddle-ocr = { path = "../kreuzberg-paddle-ocr", optional = true }',
+                    content
+                )
+
+            with open(kreuzberg_toml, "w") as f:
+                f.write(content)
+
+    generate_vendor_cargo_toml(repo_root, workspace_deps, core_version, copied_crates)
+    print("Generated vendor/Cargo.toml")
+
+    # Update native extension Cargo.toml to use vendored crates
+    native_toml = repo_root / "packages" / "ruby" / "ext" / "kreuzberg_rb" / "native" / "Cargo.toml"
+    if native_toml.exists():
+        with open(native_toml, "r") as f:
+            content = f.read()
+
+        # Replace path dependencies to point to vendored crates
+        # From: path = "../../../../../crates/kreuzberg"
+        # To: path = "../../../vendor/kreuzberg"
+        content = re.sub(
+            r'path = "\.\./\.\./\.\./\.\./\.\./crates/kreuzberg"',
+            'path = "../../../vendor/kreuzberg"',
+            content
+        )
+        content = re.sub(
+            r'path = "\.\./\.\./\.\./\.\./\.\./crates/kreuzberg-ffi"',
+            'path = "../../../vendor/kreuzberg-ffi"',
+            content
+        )
+
+        with open(native_toml, "w") as f:
+            f.write(content)
+
+        print("Updated native extension Cargo.toml to use vendored crates")
+
+    print(f"\nVendoring complete (core version: {core_version})")
+    print(f"Copied crates: {', '.join(sorted(copied_crates))}")
+
+    if "kreuzberg" in copied_crates and "kreuzberg-ffi" in copied_crates:
+        print("Native extension Cargo.toml uses:")
+        print("  - path '../../../vendor/kreuzberg' for kreuzberg crate")
+        print("  - path '../../../vendor/kreuzberg-ffi' for kreuzberg-ffi crate")
+        if "rb-sys" in copied_crates:
+            print("  - path '../../../vendor/rb-sys' for rb-sys crate")
+        else:
+            print("  - rb-sys from crates.io")
+    else:
+        print("Warning: Some required crates were not copied. Check for missing source directories.")
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception as e:
+        print(f"Error: {e}", file=sys.stderr)
+        sys.exit(1)
--- a/scripts/ci/rust/package-cli-windows.ps1
+++ b/scripts/ci/rust/package-cli-windows.ps1
@@ -0,0 +1,19 @@
+#!/usr/bin/env pwsh
+# Package CLI binary as zip archive (Windows)
+# Used by: ci-rust.yaml - Package CLI (Windows) step
+# Arguments: TARGET (e.g., x86_64-pc-windows-msvc)
+
+param(
+    [Parameter(Mandatory=$true)]
+    [string]$Target
+)
+
+Set-StrictMode -Version Latest
+$ErrorActionPreference = 'Stop'
+
+Write-Host "=== Packaging CLI binary for $Target ==="
+
+cd target/$Target/release
+Compress-Archive -Path kreuzberg.exe -DestinationPath ../../../kreuzberg-cli-$Target.zip
+
+Write-Host "Packaging complete: kreuzberg-cli-$Target.zip"
--- a/scripts/ci/rust/run-unit-tests.sh
+++ b/scripts/ci/rust/run-unit-tests.sh
@@ -0,0 +1,103 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
+
+source "$REPO_ROOT/scripts/lib/common.sh"
+source "$REPO_ROOT/scripts/lib/tessdata.sh"
+
+validate_repo_root "$REPO_ROOT" || exit 1
+
+cd "$REPO_ROOT"
+
+echo "=== Running Rust unit tests ==="
+
+setup_tessdata
+
+echo "Test environment configuration:"
+echo "  TESSDATA_PREFIX: ${TESSDATA_PREFIX:-not set}"
+echo "  RUST_BACKTRACE: ${RUST_BACKTRACE:-not set}"
+echo "  CARGO_TERM_COLOR: ${CARGO_TERM_COLOR:-not set}"
+
+echo "Workspace information:"
+echo "  Repository: $REPO_ROOT"
+echo "  Excluded packages: kreuzberg-e2e-generator, kreuzberg-py, kreuzberg-node (+ benchmark-harness on Windows)"
+
+if [ ! -d "$TESSDATA_PREFIX" ]; then
+  echo "WARNING: TESSDATA_PREFIX directory not found: $TESSDATA_PREFIX"
+  echo "Attempting to create it..."
+  mkdir -p "$TESSDATA_PREFIX"
+  ensure_tessdata "$TESSDATA_PREFIX"
+fi
+
+echo "Verifying Tesseract data files..."
+for lang in eng osd; do
+  langfile="$TESSDATA_PREFIX/${lang}.traineddata"
+  if [ -f "$langfile" ]; then
+    size=$(stat -f%z "$langfile" 2>/dev/null || stat -c%s "$langfile" 2>/dev/null || echo "unknown")
+    echo "  ✓ ${lang}.traineddata (${size} bytes)"
+  else
+    echo "  WARNING: Missing ${lang}.traineddata"
+  fi
+done
+
+if [ -n "${KREUZBERG_PDFIUM_PREBUILT:-}" ]; then
+  export LD_LIBRARY_PATH="${KREUZBERG_PDFIUM_PREBUILT}/lib:${LD_LIBRARY_PATH:-}"
+  export DYLD_LIBRARY_PATH="${KREUZBERG_PDFIUM_PREBUILT}/lib:${DYLD_LIBRARY_PATH:-}"
+  export DYLD_FALLBACK_LIBRARY_PATH="${KREUZBERG_PDFIUM_PREBUILT}/lib:${DYLD_FALLBACK_LIBRARY_PATH:-}"
+  echo "Library path configuration:"
+  echo "  LD_LIBRARY_PATH: $LD_LIBRARY_PATH"
+  echo "  DYLD_LIBRARY_PATH: $DYLD_LIBRARY_PATH"
+  echo "  DYLD_FALLBACK_LIBRARY_PATH: $DYLD_FALLBACK_LIBRARY_PATH"
+fi
+
+echo "=== Starting cargo test ==="
+
+# NOTE: We intentionally avoid `--all-features` for the `kreuzberg` crate because
+TEST_LOG="/tmp/cargo-test-$$.log"
+
+if ! {
+  # `--all-targets` runs --lib --bins --tests --examples --benches but excludes
+  # `--doc`. 22 rustdoc examples in the kreuzberg crate currently reference
+  # private items (extraction::capacity::estimate_content_capacity et al.) and
+  # fail to compile. Tracking the cleanup separately; doc-test coverage is not
+  # on the v5.0.0 publish path. TODO: re-enable doc tests once the failing
+  # examples are rewritten against the public API.
+  echo "=== cargo test -p kreuzberg --features full ==="
+  RUST_BACKTRACE=full cargo test -p kreuzberg --features full --all-targets --verbose
+
+  echo "=== cargo test --workspace (all features, excluding kreuzberg) ==="
+  extra_excludes=()
+  if [[ "$OSTYPE" == "msys" || "$OSTYPE" == "cygwin" || "$OSTYPE" == "win32" ]]; then
+    extra_excludes+=(--exclude benchmark-harness)
+  fi
+  RUST_BACKTRACE=full cargo test \
+    --workspace \
+    --exclude kreuzberg \
+    --exclude kreuzberg-e2e-generator \
+    --exclude kreuzberg-py \
+    --exclude kreuzberg-node \
+    ${extra_excludes[@]+"${extra_excludes[@]}"} \
+    --all-features \
+    --all-targets \
+    --verbose
+} 2>&1 | tee "$TEST_LOG"; then
+  echo "=== Test execution failed ==="
+  echo "Last 50 lines of test output:"
+  tail -n 50 "$TEST_LOG"
+  echo ""
+  echo "Collecting diagnostic information..."
+  echo "Disk space:"
+  df -h . || du -h . 2>/dev/null | head -1
+  echo "Cargo environment:"
+  cargo --version
+  rustc --version
+  rm -f "$TEST_LOG"
+  exit 1
+fi
+
+rm -f "$TEST_LOG"
+
+echo "=== Tests complete ==="
--- a/scripts/ci/validate/show-disk-space.sh
+++ b/scripts/ci/validate/show-disk-space.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+label="${1:-Disk space}"
+echo "=== ${label} ===" >&2
+df -h / >&2
+
+echo "Disk info:" >&2
+df -B1 / | tail -1 >&2 || true