Nomad changes

Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

scripts/ci/README.md Normal file

@@ -0,0 +1,242 @@
# CI Workflow Scripts
This directory contains extracted scripts from GitHub Actions CI workflows, organized by workflow type.
## Overview
- **Total Scripts**: 41 (27 Bash + 14 PowerShell)
- **Documentation**: See `SCRIPT_MAPPING.md` for detailed workflow-to-script mapping
- **All Scripts**: Production-ready with proper error handling and documentation
## Directory Structure
```text
scripts/ci/
├── README.md ← This file
├── SCRIPT_MAPPING.md ← Detailed workflow-to-script mapping guide
├── docker/ ← Docker image build and test scripts
├── go/ ← Go bindings scripts
├── java/ ← Java bindings scripts
├── node/ ← Node/TypeScript NAPI scripts
├── python/ ← Python wheel build scripts
├── ruby/ ← Ruby gem build scripts
├── rust/ ← Rust core and CLI scripts
├── csharp/ ← C# bindings scripts
└── validate/ ← Validation and linting scripts
```
## Quick Start
### Running a Script
**Bash scripts:**
```bash
./scripts/ci/docker/build-image.sh core
./scripts/ci/python/run-tests.sh true
```
**PowerShell scripts:**
```powershell
& ./scripts/ci/go/build-ffi.ps1
& ./scripts/ci/rust/package-cli-windows.ps1 -Target "x86_64-pc-windows-msvc"
```
### Sourcing Scripts
For library path setup scripts:
```bash
source ./scripts/lib/library-paths.sh
setup_all_library_paths
./scripts/ci/python/run-tests.sh true
```
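The reason these setup scripts are sourced rather than executed is that exported variables only persist in the shell that sets them. The sketch below illustrates this with a hypothetical stand-in function; `setup_demo_paths` and `DEMO_LIB_DIR` are made up for the example and are not part of `library-paths.sh`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a setup function such as setup_all_library_paths;
# DEMO_LIB_DIR and its value are illustrative only.
setup_demo_paths() {
  export DEMO_LIB_DIR="/opt/demo/lib"
}

unset DEMO_LIB_DIR

# Called in a subshell (similar to executing the file): the export is lost.
(setup_demo_paths)
after_subshell="${DEMO_LIB_DIR:-unset}"

# Called in the current shell (similar to sourcing the file): the export persists.
setup_demo_paths
after_current_shell="${DEMO_LIB_DIR:-unset}"

echo "after subshell:      $after_subshell"
echo "after current shell: $after_current_shell"
```

The first line reports `unset` and the second reports the exported path, which is why a test script run after `source` can see the configured library paths.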
## Scripts by Workflow
### Docker (`docker/`)
- `free-disk-space.sh` - Clean up CI disk space
- `build-image.sh` - Build Docker image variant
- `check-image-size.sh` - Validate image size constraints
- `save-image.sh` - Save Docker image as tar.gz artifact
- `collect-logs.sh` - Collect container logs on failure
- `cleanup.sh` - Clean up Docker resources
- `summary.sh` - Print test summary
### Go (`go/`)
- `build-ffi.sh` - Build FFI library (Unix)
- `build-ffi.ps1` - Build FFI library (Windows)
- `build-bindings.sh` - Build Go bindings with CGO (Unix)
- `build-bindings.ps1` - Build Go bindings with CGO (Windows)
- `reorganize-libraries.ps1` - Reorganize FFI libraries for Windows
- `run-tests.sh` - Run Go tests with library paths
### Java (`java/`)
- `build-java.sh` - Build Java bindings with Maven
- `run-tests.sh` - Run Java tests with Maven
### Node/TypeScript (`node/`)
- `build-napi.sh` - Build NAPI bindings with artifact collection
- `unpack-bindings.sh` - Unpack and install bindings from tarball
### Python (`python/`)
- `clean-artifacts.sh` - Clean previous wheel artifacts
- `smoke-test-wheel.sh` - Test wheel installation
- `install-wheel.sh` - Install platform-specific wheel
- `run-tests.sh` - Run tests with optional coverage
### Ruby (`ruby/`)
- `install-ruby-deps.sh` - Install bundle dependencies (Unix)
- `install-ruby-deps.ps1` - Install bundle dependencies (Windows)
- `vendor-kreuzberg-core.py` - Vendor core crate for packaging
- `configure-bindgen-windows.ps1` - Configure bindgen headers (Windows)
- `configure-tesseract-windows.ps1` - Configure Tesseract (Windows)
- `build-gem.sh` - Build Ruby gem
- `install-gem.sh` - Install built gem
- `compile-extension.sh` - Compile native extension
- `run-tests.sh` - Run RSpec tests
### Rust (`rust/`)
- `configure-bindgen-windows.ps1` - Configure bindgen headers (Windows)
- `run-unit-tests.sh` - Run Rust unit tests
- `package-cli-unix.sh` - Package CLI as tar.gz (Unix)
- `package-cli-windows.ps1` - Package CLI as zip (Windows)
- `test-cli-unix.sh` - Test CLI binary (Unix)
- `test-cli-windows.ps1` - Test CLI binary (Windows)
### C# (`csharp/`)
- `build-csharp.sh` - Build C# bindings with dotnet
- `run-tests.sh` - Run C# tests with dotnet
### Validate (`validate/`)
- `run-lint.sh` - Run all linting and validation checks via Task
## Features
### Error Handling
- All Bash scripts use `set -euo pipefail`
- All PowerShell scripts use `Set-StrictMode` and error action preferences
- Proper exit codes and error messages
- Usage information for incorrect arguments
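As a rough illustration of the points above, a script skeleton might look like the following. This is a minimal sketch, not a copy of any script in this directory: `usage` and `parse_variant` are hypothetical names, and the variant list (`core`, `full`, `cli`) is taken from the Docker test script.

```bash
#!/usr/bin/env bash
# Hypothetical skeleton; function names are illustrative, not from the repository.
set -euo pipefail

usage() {
  echo "Usage: example.sh <variant>" >&2
}

# Validate exactly one positional argument and echo it back.
# Returns 2 (after printing usage) when the argument is missing or unknown.
parse_variant() {
  if [ "$#" -ne 1 ] || [ -z "$1" ]; then
    usage
    return 2
  fi
  case "$1" in
    core | full | cli) echo "$1" ;;
    *)
      echo "Unknown variant: $1" >&2
      usage
      return 2
      ;;
  esac
}
```

A caller could then write `variant="$(parse_variant "$@")"`; with `set -e` active, a failed validation stops the script with the function's exit code.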
### Documentation
- Every script has a descriptive header comment
- The header states the script's purpose and usage
- The header names the CI workflow step that calls the script
- Arguments are documented
### Platform Support
- Windows-specific operations via PowerShell (.ps1)
- Unix operations via Bash (.sh)
- Cross-platform scripts detect OS and adjust behavior
- Library path setup scripts handle Windows/Linux/macOS
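One way a cross-platform script can detect the OS is to branch on `uname -s`. The sketch below assumes that approach and maps the kernel name to the loader variable the ONNX Runtime setup scripts in this commit write (`LD_LIBRARY_PATH` on Linux, `DYLD_LIBRARY_PATH` on macOS, `PATH` on Windows). The function name `library_path_var` is hypothetical and does not come from `library-paths.sh`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Map a kernel name (as printed by `uname -s`) to the environment variable
# the dynamic loader consults on that platform. Defaults to the current host.
library_path_var() {
  local kernel="${1:-$(uname -s)}"
  case "$kernel" in
    Linux*) echo "LD_LIBRARY_PATH" ;;
    Darwin*) echo "DYLD_LIBRARY_PATH" ;;
    MINGW* | MSYS* | CYGWIN*) echo "PATH" ;;
    *)
      echo "Unsupported platform: $kernel" >&2
      return 1
      ;;
  esac
}
```

Passing the kernel name as an optional argument keeps the function testable on a single host, for example `library_path_var Darwin`.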
### Reusability
- `library-paths.sh` (`scripts/lib/`) - Shared by all workflows for native library configuration
- `configure-bindgen-windows.ps1` used by Ruby and Rust
- Common patterns consolidated into single scripts
## Detailed Documentation
For comprehensive workflow-to-script mapping and usage examples, see `SCRIPT_MAPPING.md`.
## Usage in Workflows
### Example: ci-docker.yaml
**Before (inline commands):**
```yaml
- name: Free up disk space
  run: |
    echo "=== Initial disk space ==="
    df -h /
    echo "=== Removing unnecessary packages ==="
    sudo rm -rf /usr/share/dotnet
    # ... 30+ lines of commands ...
```
**After (using script):**
```yaml
- name: Free up disk space
  run: ./scripts/ci/docker/free-disk-space.sh
```
### Example: ci-python.yaml
**Before (inline commands):**
```yaml
- name: Run Python tests
  run: |
    cd packages/python
    if [ "${{ matrix.coverage }}" = "true" ]; then
      uv run pytest -vv --cov=kreuzberg --cov-report=lcov:coverage.lcov ...
    else
      uv run pytest -vv --reruns 1 --reruns-delay 1
    fi
```
**After (using script):**
```yaml
- name: Run Python tests
  run: ./scripts/ci/python/run-tests.sh ${{ matrix.coverage }}
```
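To make the extraction concrete, the branching from the inline step could be moved into a function along these lines. This is a hedged sketch, not the contents of `run-tests.sh`: `pytest_command` is a hypothetical name, it only prints the command instead of running it, and the coverage command is truncated in the workflow above, so only the flags visible there are included.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the pytest invocation for the given coverage flag ("true" or "false").
# Flags are copied from the inline workflow step; the coverage variant is
# truncated there, so any further flags are intentionally left out.
pytest_command() {
  local coverage="${1:-false}"
  if [ "$coverage" = "true" ]; then
    echo "uv run pytest -vv --cov=kreuzberg --cov-report=lcov:coverage.lcov"
  else
    echo "uv run pytest -vv --reruns 1 --reruns-delay 1"
  fi
}
```

Calling `pytest_command "${1:-false}"` from a script makes the positional argument play the role of `${{ matrix.coverage }}` in the workflow.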
## Testing Scripts Locally
You can test scripts locally before running in CI:
```bash
# Test Docker scripts
./scripts/ci/docker/free-disk-space.sh
# Test Python scripts
./scripts/ci/python/clean-artifacts.sh
./scripts/ci/python/run-tests.sh false
# Test Rust scripts
./scripts/ci/rust/run-unit-tests.sh
```
## Shell Compatibility
- **Bash scripts**: Compatible with bash 3.2+ (macOS) and bash 4.0+ (Linux)
- **PowerShell scripts**: Compatible with PowerShell 5.1+ (Windows) and PowerShell Core 7+ (cross-platform)
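Staying compatible with bash 3.2 mainly means avoiding features introduced in bash 4.0, such as `mapfile`/`readarray` and associative arrays. The sketch below shows one portable way to read lines into an array; `read_lines` and `LINES_READ` are hypothetical names used only for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Collect lines from stdin into the global array LINES_READ without `mapfile`,
# which bash 3.2 does not provide. The `|| [ -n "$line" ]` clause keeps a
# final line that lacks a trailing newline.
read_lines() {
  LINES_READ=()
  local line
  while IFS= read -r line || [ -n "$line" ]; do
    LINES_READ+=("$line")
  done
}

read_lines < <(printf 'first\nsecond with spaces\n')
echo "Read ${#LINES_READ[@]} lines; last: ${LINES_READ[1]}"
```

The same loop works unchanged on bash 4 and later, so a script written this way does not need version checks.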
## Contributing
When adding new CI steps or modifying existing ones:
1. Extract the inline script into a separate file in the appropriate directory
2. Add proper error handling (`set -euo pipefail` for bash)
3. Include descriptive header comments
4. Update `SCRIPT_MAPPING.md` with the new mapping
5. Test the script locally before committing
## Maintenance
Scripts should be reviewed and updated when:
- Updating CI workflow logic
- Changing build tools or versions
- Improving error handling
- Adding new platform support
See each script's header for detailed documentation on its purpose and usage.


@@ -0,0 +1,90 @@
#!/usr/bin/env bash
set -euo pipefail
ort_version="${1:?ort-version required}"
dest_dir="${2:-crates/kreuzberg-node}"
arch_id="${3:-}"
strategy="${4:-system}"
extract_dir="$RUNNER_TEMP/onnxruntime"
if [ -z "$arch_id" ]; then
case "$(uname -m)" in
x86_64 | amd64) arch_id="x64" ;;
arm64 | aarch64) arch_id="arm64" ;;
*)
echo "Unsupported Linux architecture: $(uname -m)" >&2
exit 1
;;
esac
fi
case "$arch_id" in
x64)
ort_dir_name="onnxruntime-linux-x64-${ort_version}"
archive="onnxruntime-linux-x64-${ort_version}.tgz"
;;
arm64)
ort_dir_name="onnxruntime-linux-aarch64-${ort_version}"
archive="onnxruntime-linux-aarch64-${ort_version}.tgz"
;;
*)
echo "Unsupported Linux arch-id: $arch_id" >&2
exit 1
;;
esac
if [ ! -d "$extract_dir/$ort_dir_name" ]; then
echo "Cache miss: Downloading ONNX Runtime ${ort_version}"
curl -fsSL --retry 5 --retry-delay 5 --retry-all-errors -o "$RUNNER_TEMP/$archive" "https://github.com/microsoft/onnxruntime/releases/download/v${ort_version}/$archive"
mkdir -p "$extract_dir"
tar -xzf "$RUNNER_TEMP/$archive" -C "$extract_dir"
else
echo "Cache hit: Using cached ONNX Runtime ${ort_version}"
fi
ort_root="$extract_dir/$ort_dir_name"
if [ ! -d "$ort_root/lib" ]; then
echo "ERROR: ONNX Runtime lib directory missing at $ort_root/lib" >&2
echo "Available directories:" >&2
ls -la "$extract_dir" >&2 || true
exit 1
fi
if ! ls "$ort_root/lib"/*.so* 1>/dev/null 2>&1; then
echo "ERROR: No ONNX Runtime libraries found in $ort_root/lib" >&2
echo "Directory contents:" >&2
ls -la "$ort_root/lib" >&2 || true
exit 1
fi
dest="$GITHUB_WORKSPACE/$dest_dir"
mkdir -p "$dest"
cp -f "$ort_root/lib/"*.so* "$dest/"
if [ -n "${RUSTFLAGS:-}" ]; then
rustflags="$RUSTFLAGS -L $ort_root/lib"
else
rustflags="-L $ort_root/lib"
fi
if [ "$strategy" = "bundled" ]; then
echo "Using bundled ORT strategy — letting ort-sys download-binaries handle static linking"
{
echo "LD_LIBRARY_PATH=$ort_root/lib:$dest:${LD_LIBRARY_PATH:-}"
echo "LIBRARY_PATH=$ort_root/lib:$dest:${LIBRARY_PATH:-}"
} >>"$GITHUB_ENV"
else
{
ort_lib=$(find "$ort_root/lib" -name "libonnxruntime*.so*" -print -quit)
echo "ORT_LIB_LOCATION=$ort_root/lib"
echo "ORT_PREFER_DYNAMIC_LINK=1"
echo "ORT_SKIP_DOWNLOAD=1"
echo "ORT_STRATEGY=system"
echo "ORT_DYLIB_PATH=$ort_root/lib/${ort_lib##*/}"
echo "LD_LIBRARY_PATH=$ort_root/lib:$dest:${LD_LIBRARY_PATH:-}"
echo "LIBRARY_PATH=$ort_root/lib:$dest:${LIBRARY_PATH:-}"
echo "RUSTFLAGS=$rustflags"
} >>"$GITHUB_ENV"
fi


@@ -0,0 +1,86 @@
#!/usr/bin/env bash
set -euo pipefail
ort_version="${1:?ort-version required}"
dest_dir="${2:-crates/kreuzberg-node}"
arch_id="${3:-}"
strategy="${4:-system}"
extract_dir="$RUNNER_TEMP/onnxruntime"
if [ -z "$arch_id" ]; then
arch="$(uname -m)"
if [ "$arch" = "arm64" ]; then
arch_id="arm64"
else
arch_id="x64"
fi
fi
case "$arch_id" in
arm64) ort_arch="arm64" ;;
x64) ort_arch="x86_64" ;;
*)
echo "Unsupported macOS arch-id: $arch_id" >&2
exit 1
;;
esac
echo "Using macOS ONNX Runtime arch: $ort_arch"
if [ ! -d "$extract_dir/onnxruntime-osx-${ort_arch}-${ort_version}" ]; then
echo "Cache miss: Downloading ONNX Runtime ${ort_version} for macOS ${ort_arch}"
archive="onnxruntime-osx-${ort_arch}-${ort_version}.tgz"
curl -fsSL --retry 5 --retry-delay 5 --retry-all-errors -o "$RUNNER_TEMP/$archive" "https://github.com/microsoft/onnxruntime/releases/download/v${ort_version}/$archive"
mkdir -p "$extract_dir"
tar -xzf "$RUNNER_TEMP/$archive" -C "$extract_dir"
else
echo "Cache hit: Using cached ONNX Runtime ${ort_version}"
fi
ort_root="$extract_dir/onnxruntime-osx-${ort_arch}-${ort_version}"
if [ ! -d "$ort_root/lib" ]; then
echo "ERROR: ONNX Runtime lib directory missing at $ort_root/lib" >&2
echo "Available directories:" >&2
ls -la "$extract_dir" >&2 || true
exit 1
fi
if ! ls "$ort_root/lib"/libonnxruntime*.dylib 1>/dev/null 2>&1; then
echo "ERROR: No ONNX Runtime libraries found in $ort_root/lib" >&2
echo "Directory contents:" >&2
ls -la "$ort_root/lib" >&2 || true
exit 1
fi
dest="$GITHUB_WORKSPACE/$dest_dir"
mkdir -p "$dest"
cp -f "$ort_root/lib/"libonnxruntime*.dylib "$dest/"
if [ -n "${RUSTFLAGS:-}" ]; then
rustflags="$RUSTFLAGS -L $ort_root/lib"
else
rustflags="-L $ort_root/lib"
fi
if [ "$strategy" = "bundled" ]; then
echo "Using bundled ORT strategy — letting ort-sys download-binaries handle static linking"
{
echo "DYLD_LIBRARY_PATH=$ort_root/lib:$dest:${DYLD_LIBRARY_PATH:-}"
echo "DYLD_FALLBACK_LIBRARY_PATH=$ort_root/lib:$dest:${DYLD_FALLBACK_LIBRARY_PATH:-}"
echo "LIBRARY_PATH=$ort_root/lib:$dest:${LIBRARY_PATH:-}"
} >>"$GITHUB_ENV"
else
{
ort_lib=$(find "$ort_root/lib" -name "libonnxruntime*.dylib" -print -quit)
echo "ORT_LIB_LOCATION=$ort_root/lib"
echo "ORT_PREFER_DYNAMIC_LINK=1"
echo "ORT_SKIP_DOWNLOAD=1"
echo "ORT_STRATEGY=system"
echo "ORT_DYLIB_PATH=$ort_root/lib/${ort_lib##*/}"
echo "DYLD_LIBRARY_PATH=$ort_root/lib:$dest:${DYLD_LIBRARY_PATH:-}"
echo "DYLD_FALLBACK_LIBRARY_PATH=$ort_root/lib:$dest:${DYLD_FALLBACK_LIBRARY_PATH:-}"
echo "LIBRARY_PATH=$ort_root/lib:$dest:${LIBRARY_PATH:-}"
echo "RUSTFLAGS=$rustflags"
} >>"$GITHUB_ENV"
fi


@@ -0,0 +1,100 @@
$OrtVersion = $args[0]
if ([string]::IsNullOrWhiteSpace($OrtVersion)) { throw "Usage: windows.ps1 <ortVersion> [destDir] [archId] [strategy]" }
$DestDir = if ($args.Count -ge 2 -and -not [string]::IsNullOrWhiteSpace($args[1])) { $args[1] } else { "crates/kreuzberg-node" }
$ArchId = if ($args.Count -ge 3) { $args[2] } else { "" }
$Strategy = if ($args.Count -ge 4 -and -not [string]::IsNullOrWhiteSpace($args[3])) { $args[3] } else { "system" }
$ExtractRoot = Join-Path $env:TEMP "onnxruntime"
if ([string]::IsNullOrWhiteSpace($ArchId)) {
$ArchId = $env:RUNNER_ARCH
}
$ArchId = $ArchId.ToLowerInvariant()
if ($ArchId -eq "arm64") { $ArchId = "arm64" } else { $ArchId = "x64" }
$OrtRoot = Join-Path $ExtractRoot "onnxruntime-win-$ArchId-$OrtVersion"
$OrtBin = Join-Path $OrtRoot 'bin'
$OrtLib = Join-Path $OrtRoot 'lib'
if (-Not (Test-Path $OrtRoot)) {
Write-Host "Cache miss: Downloading ONNX Runtime $OrtVersion"
$Archive = "onnxruntime-win-$ArchId-$OrtVersion.zip"
$DownloadPath = Join-Path $env:TEMP $Archive
Invoke-WebRequest -Uri "https://github.com/microsoft/onnxruntime/releases/download/v$OrtVersion/$Archive" -OutFile $DownloadPath -UseBasicParsing -MaximumRetryCount 5 -RetryIntervalSec 5
New-Item -ItemType Directory -Path $ExtractRoot -Force | Out-Null
Expand-Archive -Path $DownloadPath -DestinationPath $ExtractRoot -Force
} else {
Write-Host "Cache hit: Using cached ONNX Runtime $OrtVersion"
}
if (!(Test-Path $OrtLib)) {
Write-Error "ERROR: ONNX Runtime lib directory missing at $OrtLib"
Get-ChildItem -Path $ExtractRoot -Recurse | Write-Host
exit 1
}
$LibFiles = @(Get-ChildItem -Path $OrtLib -Filter "*.lib" -ErrorAction SilentlyContinue)
if ($LibFiles.Count -eq 0) {
Write-Error "ERROR: No ONNX Runtime library files found in $OrtLib"
Get-ChildItem -Path $OrtLib | Write-Host
exit 1
}
$DllDirs = @()
foreach ($Candidate in @($OrtLib, $OrtBin)) {
if (Test-Path $Candidate) {
$CandidateDlls = @(Get-ChildItem -Path $Candidate -Filter "*.dll" -File -ErrorAction SilentlyContinue)
if ($CandidateDlls.Count -gt 0) {
$DllDirs += $Candidate
}
}
}
if ($DllDirs.Count -eq 0) {
$OrtDll = Get-ChildItem -Path $OrtRoot -Recurse -Filter "onnxruntime.dll" -File -ErrorAction SilentlyContinue | Select-Object -First 1
if ($OrtDll) { $DllDirs += $OrtDll.DirectoryName }
}
if ($DllDirs.Count -eq 0) {
$AnyDll = Get-ChildItem -Path $OrtRoot -Recurse -Filter "*.dll" -File -ErrorAction SilentlyContinue | Select-Object -First 1
if ($AnyDll) { $DllDirs += $AnyDll.DirectoryName }
}
$DllDirs = $DllDirs | Select-Object -Unique
if ($DllDirs.Count -eq 0) {
Write-Error "ERROR: No ONNX Runtime runtime DLLs found under $OrtRoot"
Get-ChildItem -Path $OrtRoot -Recurse | Write-Host
exit 1
}
$Dest = Join-Path $env:GITHUB_WORKSPACE $DestDir
New-Item -ItemType Directory -Path $Dest -Force | Out-Null
Copy-Item -Path (Join-Path $OrtLib '*') -Destination $Dest -Force
foreach ($Dir in $DllDirs) {
Copy-Item -Path (Join-Path $Dir '*.dll') -Destination $Dest -Force
}
$RustFlags = if ($env:RUSTFLAGS) { "$env:RUSTFLAGS -L $OrtLib" } else { "-L $OrtLib" }
if ($Strategy -eq "bundled") {
# ort-sys has no prebuilt static binaries for x86_64-pc-windows-gnu (MSYS2/MinGW).
# Use the pre-downloaded Microsoft ORT with dynamic linking for Windows GNU targets.
Write-Host "Using bundled ORT strategy (Windows) - dynamic linking against pre-downloaded ORT (no static binaries for windows-gnu)"
@(
"ORT_LIB_LOCATION=$OrtLib"
"ORT_PREFER_DYNAMIC_LINK=1"
"RUSTFLAGS=$RustFlags"
"LIB=$OrtLib;$env:LIB"
"LIBRARY_PATH=$OrtLib;$env:LIBRARY_PATH"
"PATH=$Dest;$env:PATH"
) | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append
} else {
@(
"ORT_LIB_LOCATION=$OrtLib"
"ORT_PREFER_DYNAMIC_LINK=1"
"ORT_SKIP_DOWNLOAD=1"
"ORT_STRATEGY=system"
"ORT_DYLIB_PATH=$Dest\onnxruntime.dll"
"RUSTFLAGS=$RustFlags"
"LIB=$OrtLib;$env:LIB"
"LIBRARY_PATH=$OrtLib;$env:LIBRARY_PATH"
"PATH=$Dest;$env:PATH"
) | Out-File -FilePath $env:GITHUB_ENV -Encoding utf8 -Append
}


@@ -0,0 +1,48 @@
#!/usr/bin/env bash
set -euo pipefail
target="${1:?target required}"
case "$target" in
aarch64-apple-darwin)
ort_url="https://cdn.pyke.io/0/pyke:ort-rs/ms@1.24.1/aarch64-apple-darwin.tgz"
;;
x86_64-apple-darwin)
ort_url="https://cdn.pyke.io/0/pyke:ort-rs/ms@1.24.1/x86_64-apple-darwin.tgz"
;;
*)
echo "setup-prebuilt-onnx does not support target $target" >&2
exit 1
;;
esac
ort_dir="${GITHUB_WORKSPACE}/target/onnxruntime/${target}"
ort_root="${ort_dir}/onnxruntime"
ort_lib="${ort_root}/lib"
write_env() {
{
echo "ORT_STRATEGY=system"
echo "ORT_LIB_LOCATION=${ort_lib}"
echo "ORT_SKIP_DOWNLOAD=1"
echo "ORT_PREFER_DYNAMIC_LINK=1"
} >>"${GITHUB_ENV}"
}
if [ ! -f "${ort_lib}/libonnxruntime.a" ]; then
rm -rf "${ort_dir}"
mkdir -p "${ort_lib}"
echo "Attempting to download prebuilt ONNX Runtime for ${target}..." >&2
if curl -fsSL --max-time 30 -o /tmp/ort.tgz "${ort_url}" 2>/dev/null; then
tar -xz -C "${ort_lib}" -f /tmp/ort.tgz
rm -f /tmp/ort.tgz
write_env
else
echo "Warning: Prebuilt ONNX Runtime not available for ${target}" >&2
echo "Will download and build ONNX Runtime during compilation" >&2
fi
else
echo "Using existing ONNX Runtime at ${ort_lib}" >&2
write_env
fi


@@ -0,0 +1,29 @@
#!/usr/bin/env bash
set -euo pipefail
# Usage: build-with-sccache-fallback.sh <cargo command...>
log_file=$(mktemp)
trap 'rm -f "$log_file"' EXIT
echo "Building with sccache (fallback on errors)..."
# Attempt with sccache
if "$@" 2>&1 | tee "$log_file"; then
echo "✓ Build succeeded with sccache"
exit 0
fi
# Check for sccache-related errors
if grep -Eq "sccache.*(error|failed)|cache storage failed|dns error|connection (refused|timed out)" "$log_file"; then
echo "⚠️ sccache failure detected, retrying without cache..."
export RUSTC_WRAPPER=""
export SCCACHE_GHA_ENABLED=false
if "$@"; then
echo "✓ Build succeeded without sccache (fallback)"
exit 0
fi
fi
echo "✗ Build failed"
exit 1


@@ -0,0 +1,7 @@
#!/usr/bin/env bash
set -euo pipefail
label="${1:?label required}"
rm -rf ".tesseract-cache/${label}"
rm -rf ".xdg-cache/${label}"


@@ -0,0 +1,5 @@
#!/usr/bin/env bash
set -euo pipefail
rust_target="${1:?rust target required}"
rm -rf "target/${rust_target}/kreuzberg-tesseract-cache"


@@ -0,0 +1,44 @@
#!/usr/bin/env bash
set -euo pipefail
label="${1:?label required}"
enable_cache="${2:?enable-cache required (true/false)}"
if [ "$enable_cache" = "true" ]; then
cache_dir="${GITHUB_WORKSPACE}/.tesseract-cache/${label}"
echo "TESSERACT_RS_CACHE_DIR=${cache_dir}" >>"$GITHUB_ENV"
echo "XDG_CACHE_HOME=${GITHUB_WORKSPACE}/.xdg-cache/${label}" >>"$GITHUB_ENV"
echo "cache-dir=${cache_dir}" >>"$GITHUB_OUTPUT"
echo "cache-enabled=true" >>"$GITHUB_OUTPUT"
docker_opts="--env TESSERACT_RS_CACHE_DIR=/io/.tesseract-cache/${label}"
docker_opts="${docker_opts} --env XDG_CACHE_HOME=/io/.xdg-cache/${label}"
multiarch=""
if command -v dpkg-architecture >/dev/null 2>&1; then
multiarch="$(dpkg-architecture -qDEB_HOST_MULTIARCH 2>/dev/null || true)"
fi
if [ -z "$multiarch" ]; then
case "$(uname -m)" in
x86_64) multiarch="x86_64-linux-gnu" ;;
aarch64 | arm64) multiarch="aarch64-linux-gnu" ;;
esac
fi
openssl_lib_dir="/usr/lib"
if [ -n "$multiarch" ]; then
openssl_lib_dir="/usr/lib/${multiarch}"
fi
docker_opts="${docker_opts} --env OPENSSL_LIB_DIR=${openssl_lib_dir}"
docker_opts="${docker_opts} --env OPENSSL_INCLUDE_DIR=/usr/include"
echo "docker-options=${docker_opts}" >>"$GITHUB_OUTPUT"
else
{
echo "TESSERACT_RS_CACHE_DIR="
} >>"$GITHUB_ENV"
{
echo "cache-dir="
echo "cache-enabled=false"
echo "docker-options="
} >>"$GITHUB_OUTPUT"
fi


@@ -0,0 +1,7 @@
#!/usr/bin/env bash
set -euo pipefail
label="${1:?label required}"
mkdir -p ".tesseract-cache/${label}"
mkdir -p ".xdg-cache/${label}"


@@ -0,0 +1,11 @@
#!/usr/bin/env bash
set -euo pipefail
label="${1:-Node setup}"
echo "=== ${label} ==="
echo "Node version: $(node --version)"
echo "pnpm version: $(pnpm --version)"
echo "tsx availability: $(command -v tsx || echo 'NOT FOUND')"
echo "pnpm workspace structure:"
pnpm list --depth=0 || true

scripts/ci/cache/compute-hash.sh vendored Executable file

@@ -0,0 +1,158 @@
#!/usr/bin/env bash
# Compute deterministic hash for cache key generation
#
# Usage:
# compute-hash.sh <glob-pattern> [glob-pattern...]
# compute-hash.sh --files <file1> <file2> ...
# compute-hash.sh --dirs <dir1> <dir2> ...
#
# Examples:
# compute-hash.sh "crates/kreuzberg/**/*.rs" "crates/kreuzberg-ffi/**/*.rs"
# compute-hash.sh --files Cargo.lock uv.lock
# compute-hash.sh --dirs crates/kreuzberg/ crates/kreuzberg-ffi/
set -euo pipefail
# Color output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
error() {
echo -e "${RED}Error: $*${NC}" >&2
exit 1
}
info() {
echo -e "${GREEN}$*${NC}" >&2
}
warn() {
echo -e "${YELLOW}$*${NC}" >&2
}
# Check if sha256sum or shasum is available
if command -v sha256sum &>/dev/null; then
HASH_CMD="sha256sum"
elif command -v shasum &>/dev/null; then
HASH_CMD="shasum -a 256"
else
error "Neither sha256sum nor shasum found in PATH"
fi
# Mode detection
MODE="glob"
if [[ "${1:-}" == "--files" ]]; then
MODE="files"
shift
elif [[ "${1:-}" == "--dirs" ]]; then
MODE="dirs"
shift
fi
if [[ $# -eq 0 ]]; then
error "No input provided. Usage: $0 <pattern...> or $0 --files <file...> or $0 --dirs <dir...>"
fi
# Temporary file for collecting hashes
TEMP_HASHES=$(mktemp)
trap 'rm -f "$TEMP_HASHES"' EXIT
case "$MODE" in
files)
# Hash specific files directly
for file in "$@"; do
if [[ -f "$file" ]]; then
$HASH_CMD "$file" >>"$TEMP_HASHES" 2>/dev/null || warn "Failed to hash: $file"
else
warn "File not found: $file"
fi
done
;;
dirs)
# Hash all files in directories recursively
for dir in "$@"; do
if [[ -d "$dir" ]]; then
# Find all files (excluding hidden files and directories)
find "$dir" -type f \
! -path "*/.*" \
! -path "*/target/*" \
! -path "*/node_modules/*" \
! -path "*/.venv/*" \
! -path "*/dist/*" \
! -path "*/build/*" \
-exec $HASH_CMD {} \; >>"$TEMP_HASHES" 2>/dev/null || true
else
warn "Directory not found: $dir"
fi
done
;;
glob)
# Hash files matching glob patterns
for pattern in "$@"; do
# Use find with -path for glob matching
# Convert glob to find path expression
if [[ "$pattern" == *"**"* ]]; then
# Handle ** recursive glob (e.g., "crates/kreuzberg/**/*.rs")
# Extract the base directory and file extension/name pattern
base_dir=$(echo "$pattern" | cut -d'*' -f1 | sed 's|/$||')
# Get the suffix after "**/" (e.g., "*.rs" from "crates/kreuzberg/**/*.rs")
# by removing everything up to and including the first "**/"
suffix="${pattern#*\*\*/}"
# Use the suffix as the filename pattern for find -name,
# dropping a leading / if one is present
if [[ "$suffix" == /* ]]; then
name_pattern="${suffix#/}"
else
name_pattern="$suffix"
fi
if [[ -d "$base_dir" ]]; then
# Find all files recursively using -name for filename matching
# This is more portable and reliable than bash regex
find "$base_dir" -type f \
! -path "*/.*" \
! -path "*/target/*" \
! -path "*/node_modules/*" \
! -path "*/.venv/*" \
-name "$name_pattern" \
-exec $HASH_CMD {} \; 2>/dev/null >>"$TEMP_HASHES" || true
else
warn "Directory not found: $base_dir"
fi
else
# Simple glob (no **)
for file in $pattern; do
if [[ -f "$file" ]]; then
$HASH_CMD "$file" >>"$TEMP_HASHES" 2>/dev/null || warn "Failed to hash: $file"
fi
done
fi
done
;;
esac
# Check if we found any files to hash
if [[ ! -s "$TEMP_HASHES" ]]; then
error "No files found matching the provided patterns"
fi
# Sort hashes (for determinism across different find orders)
# Then hash the combined hashes to get final hash
FINAL_HASH=$(sort "$TEMP_HASHES" | $HASH_CMD | cut -d' ' -f1)
# Truncate to 12 characters for cache key (still 48 bits of entropy)
SHORT_HASH="${FINAL_HASH:0:12}"
# Output the hash
echo "$SHORT_HASH"
# Debug info (to stderr)
# BSD/macOS wc pads its output with spaces, so strip them
FILE_COUNT=$(wc -l <"$TEMP_HASHES" | tr -d ' ')
info "Hashed $FILE_COUNT files → $SHORT_HASH" >&2


@@ -0,0 +1,5 @@
#!/usr/bin/env bash
set -euo pipefail
echo "=== Running Docker CLI feature tests ==="
python3 scripts/ci/docker/test_docker.py --image "kreuzberg:cli" --variant cli --verbose


@@ -0,0 +1,13 @@
#!/usr/bin/env bash
# CI wrapper for Docker configuration testing
# Tests volume mounts, config formats, and environment variable overrides
set -euo pipefail
variant="${1:?missing variant}"
echo "=== Running Docker configuration tests (${variant}) ==="
# Run the comprehensive config test script
# The script expects the image to already be built and tagged
exec ./scripts/test/test-docker-config-local.sh --image "kreuzberg:${variant}" --variant "${variant}"


@@ -0,0 +1,7 @@
#!/usr/bin/env bash
set -euo pipefail
variant="${1:?missing variant}"
echo "=== Running Docker feature tests (${variant}) ==="
python3 scripts/ci/docker/test_docker.py --image "kreuzberg:${variant}" --variant "${variant}" --verbose

scripts/ci/docker/test_docker.py Executable file

@@ -0,0 +1,750 @@
#!/usr/bin/env python3
"""Unified Docker image test script for all variants (core, full, cli)."""
from __future__ import annotations
import argparse
import json
import os
import random
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass, field
from pathlib import Path
BLUE = "\033[0;34m"
GREEN = "\033[0;32m"
RED = "\033[0;31m"
YELLOW = "\033[1;33m"
NC = "\033[0m"
REPO_ROOT = Path(__file__).resolve().parents[3]
TEST_DOCS_DIR = REPO_ROOT / "test_documents"
RESULTS_FILE = Path("/tmp/kreuzberg-docker-test-results.json")
@dataclass
class TestRunner:
    image: str
    variant: str
    verbose: bool = False
    total: int = 0
    passed: int = 0
    failed: int = 0
    failed_names: list[str] = field(default_factory=list)
    containers: list[str] = field(default_factory=list)

    def log(self, level: str, color: str, msg: str) -> None:
        print(f"{color}[{level}]{NC} {msg}", flush=True)

    def info(self, msg: str) -> None:
        self.log("INFO", BLUE, msg)

    def ok(self, msg: str = "PASS") -> None:
        self.log("SUCCESS", GREEN, msg)

    def error(self, msg: str) -> None:
        self.log("ERROR", RED, msg)

    def warn(self, msg: str) -> None:
        self.log("WARNING", YELLOW, msg)

    def debug(self, msg: str) -> None:
        if self.verbose:
            self.log("VERBOSE", YELLOW, msg)

    def start(self, name: str) -> None:
        self.total += 1
        self.info(f"Test {self.total}: {name}")

    def pass_test(self) -> None:
        self.passed += 1
        self.ok()

    def fail_test(self, name: str, details: str = "") -> None:
        self.failed += 1
        self.failed_names.append(name)
        msg = f"FAIL: {name}"
        if details:
            msg += f"\n Details: {details}"
        self.error(msg)

    def container_name(self) -> str:
        name = f"kreuzberg-test-{int(time.time())}-{random.randint(0, 99999)}"
        self.containers.append(name)
        return name

    def docker_run(self, *args: str, capture: bool = True) -> subprocess.CompletedProcess[str]:
        cmd = ["docker", "run", "--rm", *args]
        return subprocess.run(cmd, capture_output=capture, text=True, timeout=120)

    def docker_run_detached(self, *args: str) -> str:
        name = self.container_name()
        cmd = ["docker", "run", "-d", "--name", name, *args]
        subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=60)
        return name

    def docker_rm(self, name: str) -> None:
        subprocess.run(["docker", "rm", "-f", name], capture_output=True, timeout=30)

    def cleanup(self) -> None:
        for c in self.containers:
            self.docker_rm(c)

    def run_cli_output(self, *extra_args: str, volumes: bool = False) -> str:
        """Run a CLI command against the image and return combined stdout+stderr."""
        args: list[str] = ["--name", self.container_name()]
        if volumes:
            args += ["-v", f"{TEST_DOCS_DIR}:/data:ro"]
        args.append(self.image)
        args.extend(extra_args)
        r = self.docker_run(*args)
        return (r.stdout + r.stderr).strip()

    def write_results(self) -> None:
        rate = (self.passed * 100 // self.total) if self.total else 0
        data = {
            "image": self.image,
            "variant": self.variant,
            "total_tests": self.total,
            "passed": self.passed,
            "failed": self.failed,
            "success_rate": rate,
            "failed_tests": self.failed_names,
        }
        RESULTS_FILE.write_text(json.dumps(data, indent=2))
        self.info(f"Results written to {RESULTS_FILE}")
# ---------------------------------------------------------------------------
# Shared tests (all variants)
# ---------------------------------------------------------------------------
def test_image_exists(t: TestRunner) -> None:
    t.start("Docker image exists")
    r = subprocess.run(["docker", "inspect", t.image], capture_output=True, timeout=30)
    if r.returncode == 0:
        t.pass_test()
    else:
        t.fail_test("Image does not exist", t.image)


def test_version(t: TestRunner) -> None:
    t.start("CLI --version command")
    out = t.run_cli_output("--version")
    t.debug(f"Version output: {out}")
    if "kreuzberg" in out.lower():
        t.pass_test()
    else:
        t.fail_test("CLI version", f"Expected 'kreuzberg' in output, got: {out}")


def test_help(t: TestRunner) -> None:
    t.start("CLI --help command")
    out = t.run_cli_output("--help")
    if "extract" in out.lower():
        t.pass_test()
    else:
        t.fail_test("CLI help", "Expected 'extract' in help output")


def test_mime_detection(t: TestRunner) -> None:
    t.start("MIME type detection (detect command)")
    out = t.run_cli_output("detect", "/data/pdf/searchable.pdf", volumes=True)
    t.debug(f"MIME detection output: {out}")
    if "application/pdf" in out.lower():
        t.pass_test()
    else:
        t.fail_test("MIME detection", f"Expected 'application/pdf', got: {out}")


def test_extract_text(t: TestRunner) -> None:
    t.start("Extract plain text file")
    out = t.run_cli_output("extract", "/data/text/contract.txt", volumes=True)
    t.debug(f"Text extraction output (first 100 chars): {out[:100]}")
    if len(out) > 15 and "contract" in out.lower():
        t.pass_test()
    else:
        t.fail_test("Text extraction", f"Output too short ({len(out)} chars) or missing expected keywords")


def test_extract_pdf(t: TestRunner) -> None:
    t.start("Extract searchable PDF")
    name = t.container_name()
    r = subprocess.run(
        ["docker", "run", "--rm", "--name", name,
         "-v", f"{TEST_DOCS_DIR}:/data:ro",
         t.image, "extract", "/data/pdf/searchable.pdf"],
        capture_output=True, text=True, timeout=120,
    )
    out = (r.stdout + r.stderr).strip()
    t.debug(f"PDF extraction output (first 200 chars): {out[:200]}")
    if r.returncode != 0:
        t.fail_test("Searchable PDF extraction", f"Exit code {r.returncode}: {out[:300]}")
    elif len(out) > 50:
        t.pass_test()
    else:
        t.fail_test("Searchable PDF extraction", f"Output too short: {len(out)} chars")


def test_extract_html(t: TestRunner) -> None:
    t.start("Extract HTML file")
    out = t.run_cli_output("extract", "/data/html/simple_table.html", volumes=True)
    t.debug(f"HTML extraction output (first 100 chars): {out[:100]}")
    if len(out) > 10:
        t.pass_test()
    else:
        t.fail_test("HTML extraction", f"Output too short: {len(out)} chars")


def test_extract_docx(t: TestRunner) -> None:
    t.start("Extract DOCX file")
    out = t.run_cli_output("extract", "/data/docx/extraction_test.docx", volumes=True)
    t.debug(f"DOCX extraction output (first 100 chars): {out[:100]}")
    if len(out) > 100:
        t.pass_test()
    else:
        t.fail_test("DOCX extraction", f"Output too short ({len(out)} chars)")


def test_batch_cli(t: TestRunner) -> None:
    t.start("CLI batch extraction (multiple files)")
    out = t.run_cli_output(
        "batch", "/data/text/contract.txt", "/data/html/simple_table.html",
        volumes=True,
    )
    t.debug(f"Batch output (first 200 chars): {out[:200]}")
    if len(out) > 20:
        t.pass_test()
    else:
        t.fail_test("Batch extraction", f"Output too short: {len(out)} chars")


def test_nonexistent_file(t: TestRunner) -> None:
    t.start("Non-existent file returns error")
    r = subprocess.run(
        ["docker", "run", "--rm", t.image, "extract", "/nonexistent/file.pdf"],
        capture_output=True, text=True, timeout=60,
    )
    if r.returncode != 0:
        t.pass_test()
    else:
        t.fail_test("Error on missing file", "Expected non-zero exit code for missing file")


def test_readonly_mount(t: TestRunner) -> None:
    t.start("Read-only volume mount works")
    name = t.container_name()
    r = subprocess.run(
        ["docker", "run", "--rm", "--name", name,
         "-v", f"{TEST_DOCS_DIR}:/data:ro",
         "--read-only", "--tmpfs", "/tmp",
         t.image, "extract", "/data/text/simple.txt"],
        capture_output=True, text=True, timeout=60,
    )
    out = (r.stdout + r.stderr).strip()
    if len(out) > 5:
        t.pass_test()
    else:
        t.fail_test("Read-only mount", "Failed to extract with read-only filesystem")
# ---------------------------------------------------------------------------
# Core/Full-only tests (API server tests)
# ---------------------------------------------------------------------------
def _wait_for_api(port: int, retries: int = 10) -> bool:
    import urllib.request

    for _ in range(retries):
        try:
            urllib.request.urlopen(f"http://localhost:{port}/health", timeout=3)
            return True
        except Exception:
            time.sleep(2)
    return False

def _api_get(port: int, path: str) -> str | None:
    import urllib.request

    try:
        with urllib.request.urlopen(f"http://localhost:{port}{path}", timeout=10) as resp:
            return resp.read().decode()
    except Exception:
        return None

def _api_post_file(port: int, path: str, filepath: str) -> str | None:
    """POST a file using curl (simplest multipart approach)."""
    r = subprocess.run(
        ["curl", "-f", "-s", "-X", "POST", f"http://localhost:{port}{path}",
         "-F", f"files=@{filepath}"],
        capture_output=True, text=True, timeout=30,
    )
    return r.stdout if r.returncode == 0 else None
def test_ocr_extraction(t: TestRunner) -> None:
t.start("OCR extraction with Tesseract")
name = t.container_name()
r = subprocess.run(
["docker", "run", "--rm", "--name", name, "--memory", "1g",
"-v", f"{TEST_DOCS_DIR}:/data:ro",
t.image, "extract", "/data/images/ocr_image.jpg", "--ocr", "true"],
capture_output=True, text=True, timeout=120,
)
out = (r.stdout + r.stderr).strip()
t.debug(f"OCR extraction output (first 100 chars): {out[:100]}")
if len(out) > 10:
t.pass_test()
else:
t.fail_test("OCR extraction", "Output too short or OCR failed")
def test_paddle_ocr_extraction(t: TestRunner) -> None:
t.start("PaddleOCR extraction (pre-loaded models)")
name = t.container_name()
r = subprocess.run(
["docker", "run", "--rm", "--name", name, "--memory", "2g",
"-v", f"{TEST_DOCS_DIR}:/data:ro",
t.image, "extract", "/data/images/ocr_image.jpg",
"--ocr", "true", "--ocr-backend", "paddle-ocr"],
capture_output=True, text=True, timeout=120,
)
out = (r.stdout + r.stderr).strip()
t.debug(f"PaddleOCR extraction output (first 200 chars): {out[:200]}")
if r.returncode == 0 and len(out) > 10:
t.pass_test()
else:
t.fail_test("PaddleOCR extraction", f"Exit code: {r.returncode}, output length: {len(out)}")
def test_doc_extraction(t: TestRunner) -> None:
t.start("Legacy DOC extraction (native OLE/CFB)")
name = t.container_name()
r = subprocess.run(
["docker", "run", "--rm", "--name", name, "--memory", "1g",
"-v", f"{TEST_DOCS_DIR}:/data:ro",
t.image, "extract", "/data/doc/unit_test_lists.doc"],
capture_output=True, text=True, timeout=120,
)
out = (r.stdout + r.stderr).strip()
t.debug(f"DOC extraction output (first 100 chars): {out[:100]}")
if len(out) > 20:
t.pass_test()
else:
t.fail_test("DOC extraction", f"Output too short: {len(out)} chars")
def test_api_health(t: TestRunner) -> None:
    t.start("API server startup and health check")
    port = 9000 + random.randint(0, 999)
    name = t.docker_run_detached(
        "--memory", "2g", "--cpus", "2",
        "-p", f"{port}:8000", t.image,
    )
    if not _wait_for_api(port):
        t.fail_test("API health check", f"Health endpoint not responding on port {port}")
        t.docker_rm(name)
        return
    health = _api_get(port, "/health")
    t.debug(f"Health response: {health}")
    if health:
        t.pass_test()
    else:
        t.fail_test("API health check", "No response from /health")
    # Plugin initialization validation
    t.start("Plugin initialization validation")
    if health and "plugins" in health:
        import re

        # Allow optional whitespace after the colon so pretty-printed JSON also matches
        ocr_m = re.search(r'"ocr_backends_count":\s*(\d+)', health)
        ext_m = re.search(r'"extractors_count":\s*(\d+)', health)
        ocr_count = int(ocr_m.group(1)) if ocr_m else 0
        ext_count = int(ext_m.group(1)) if ext_m else 0
        t.debug(f"OCR backends: {ocr_count}, Extractors: {ext_count}")
        # Check every failure condition first so the test records exactly one
        # outcome (previously a pass could be followed by a fail for the same test)
        if t.variant == "full" and ocr_count == 0:
            t.fail_test("Plugin initialization", "Full variant: No OCR backends registered")
        elif ext_count == 0:
            t.fail_test("Plugin initialization", "No document extractors registered")
        else:
            if t.variant == "full":
                t.info(f"Full variant: {ocr_count} OCR backend(s) registered")
            t.pass_test()
    else:
        t.warn("Health response missing 'plugins' field")
        t.pass_test()
    t.docker_rm(name)
def test_api_extract(t: TestRunner) -> None:
t.start("API extraction endpoint")
port = 9000 + random.randint(0, 999)
name = t.docker_run_detached(
"--memory", "2g", "--cpus", "2",
"-p", f"{port}:8000", t.image,
)
if not _wait_for_api(port):
t.fail_test("API extraction", "Server not ready")
t.docker_rm(name)
return
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
f.write("Test content for API extraction")
tmp = f.name
resp = _api_post_file(port, "/extract", tmp)
os.unlink(tmp)
t.debug(f"API response: {resp}")
if resp and "Test content for API extraction" in resp:
t.pass_test()
else:
t.fail_test("API extraction", "Response missing expected content")
t.docker_rm(name)
def test_api_info(t: TestRunner) -> None:
t.start("API /info endpoint")
port = 9000 + random.randint(0, 999)
name = t.docker_run_detached(
"--memory", "2g", "--cpus", "2",
"-p", f"{port}:8000", t.image,
)
if not _wait_for_api(port):
t.fail_test("API /info", "Server not ready")
t.docker_rm(name)
return
resp = _api_get(port, "/info")
t.debug(f"/info response: {resp}")
if resp and "version" in resp and "rust_backend" in resp:
t.pass_test()
else:
t.fail_test("API /info endpoint", "Response missing expected fields")
t.docker_rm(name)
def test_api_openapi(t: TestRunner) -> None:
t.start("API /openapi.json endpoint")
port = 9000 + random.randint(0, 999)
name = t.docker_run_detached(
"--memory", "2g", "--cpus", "2",
"-p", f"{port}:8000", t.image,
)
if not _wait_for_api(port):
t.fail_test("API /openapi.json", "Server not ready")
t.docker_rm(name)
return
resp = _api_get(port, "/openapi.json")
t.debug(f"/openapi.json response (first 200 chars): {(resp or '')[:200]}")
if resp and '"openapi"' in resp and '"paths"' in resp:
t.pass_test()
else:
t.fail_test("API /openapi.json endpoint", "Response missing OpenAPI schema fields")
t.docker_rm(name)
def test_api_cache(t: TestRunner) -> None:
t.start("API /cache/stats endpoint")
port = 9000 + random.randint(0, 999)
name = t.docker_run_detached(
"--memory", "2g", "--cpus", "2",
"-p", f"{port}:8000", t.image,
)
if not _wait_for_api(port):
t.fail_test("API /cache/stats", "Server not ready")
t.docker_rm(name)
return
resp = _api_get(port, "/cache/stats")
t.debug(f"/cache/stats response: {resp}")
if resp and "total_files" in resp:
t.pass_test()
else:
t.fail_test("API /cache/stats endpoint", "Response missing expected fields")
t.start("API /cache/clear endpoint")
r = subprocess.run(
["curl", "-f", "-s", "-X", "DELETE", f"http://localhost:{port}/cache/clear"],
capture_output=True, text=True, timeout=10,
)
if r.returncode == 0 and "removed_files" in r.stdout:
t.pass_test()
else:
t.fail_test("API /cache/clear endpoint", "Response missing expected fields")
t.docker_rm(name)
def test_api_batch(t: TestRunner) -> None:
t.start("API batch extraction (multiple files)")
port = 9000 + random.randint(0, 999)
name = t.docker_run_detached(
"--memory", "2g", "--cpus", "2",
"-p", f"{port}:8000", t.image,
)
if not _wait_for_api(port):
t.fail_test("API batch extraction", "Server not ready")
t.docker_rm(name)
return
tmp1 = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp2 = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp1.write("File one content"); tmp1.close()
tmp2.write("File two content"); tmp2.close()
r = subprocess.run(
["curl", "-f", "-s", "-X", "POST", f"http://localhost:{port}/extract",
"-F", f"files=@{tmp1.name}", "-F", f"files=@{tmp2.name}"],
capture_output=True, text=True, timeout=30,
)
os.unlink(tmp1.name)
os.unlink(tmp2.name)
t.debug(f"Batch extraction response (first 200 chars): {r.stdout[:200]}")
if "File one content" in r.stdout and "File two content" in r.stdout:
t.pass_test()
else:
t.fail_test("API batch extraction", "Response missing expected content")
t.docker_rm(name)
def test_cli_batch_json(t: TestRunner) -> None:
t.start("CLI batch extraction with JSON format")
name = t.container_name()
r = subprocess.run(
["docker", "run", "--rm", "--name", name,
"-v", f"{TEST_DOCS_DIR}:/data:ro",
t.image, "batch", "/data/text/contract.txt", "/data/pdf/searchable.pdf",
"--format", "json"],
capture_output=True, text=True, timeout=120,
)
out = (r.stdout + r.stderr).strip()
t.debug(f"Batch command output (first 200 chars): {out[:200]}")
if len(out) > 100 and "content" in out:
t.pass_test()
else:
t.fail_test("CLI batch command", "Output too short or malformed")
def test_mcp_server(t: TestRunner) -> None:
t.start("MCP server startup and persistence")
name = t.docker_run_detached(
"-i", "--memory", "1g", t.image, "mcp",
)
time.sleep(3)
r = subprocess.run(
["docker", "ps", "--filter", f"name={name}", "--format", "{{.Names}}"],
capture_output=True, text=True, timeout=10,
)
if name in r.stdout:
t.debug("MCP server is running")
t.pass_test()
else:
t.fail_test("MCP server persistence", "MCP server exited immediately")
t.docker_rm(name)
def test_cli_cache(t: TestRunner) -> None:
t.start("CLI cache stats command")
name = t.container_name()
r = subprocess.run(
["docker", "run", "--rm", "--name", name, t.image, "cache", "stats", "--format", "json"],
capture_output=True, text=True, timeout=60,
)
out = (r.stdout + r.stderr).strip()
t.debug(f"Cache stats output: {out}")
if "total_files" in out:
t.pass_test()
else:
t.fail_test("CLI cache stats", "Output missing expected fields")
t.start("CLI cache clear command")
name = t.container_name()
r = subprocess.run(
["docker", "run", "--rm", "--name", name, t.image, "cache", "clear", "--format", "json"],
capture_output=True, text=True, timeout=60,
)
out = (r.stdout + r.stderr).strip()
t.debug(f"Cache clear output: {out}")
if "removed_files" in out:
t.pass_test()
else:
t.fail_test("CLI cache clear", "Output missing expected fields")
def test_security_nonroot(t: TestRunner) -> None:
t.start("Security: Container runs as non-root user")
name = t.container_name()
r = subprocess.run(
["docker", "run", "--rm", "--name", name, "--entrypoint", "/bin/sh",
t.image, "-c", "whoami"],
capture_output=True, text=True, timeout=30,
)
user = r.stdout.strip()
if user == "kreuzberg":
t.pass_test()
else:
t.fail_test("Non-root user", f"Container running as: {user} (expected: kreuzberg)")
def test_security_readonly(t: TestRunner) -> None:
t.start("Security: Read-only volume enforcement")
with tempfile.TemporaryDirectory() as tmpdir:
(Path(tmpdir) / "test.txt").write_text("test")
name = t.container_name()
r = subprocess.run(
["docker", "run", "--rm", "--name", name,
"-v", f"{tmpdir}:/data:ro",
"--entrypoint", "/bin/sh", t.image,
"-c", "echo 'attempt' > /data/test2.txt 2>&1 || echo 'READ_ONLY'"],
capture_output=True, text=True, timeout=30,
)
out = r.stdout + r.stderr
if any(s in out for s in ("READ_ONLY", "read-only", "Read-only")):
t.pass_test()
else:
t.fail_test("Read-only volume", "Was able to write to read-only volume")
def test_security_memlimit(t: TestRunner) -> None:
t.start("Security: Memory limit enforcement")
name = t.container_name()
r = subprocess.run(
["docker", "run", "--rm", "--name", name,
"--memory", "128m", "--memory-swap", "128m",
"--entrypoint", "/bin/sh", t.image,
"-c", "echo 'Memory limit test passed'"],
capture_output=True, text=True, timeout=30,
)
if "Memory limit test passed" in r.stdout:
t.pass_test()
else:
t.fail_test("Memory limit", "Container failed with memory limit")
# ---------------------------------------------------------------------------
# CLI-only tests
# ---------------------------------------------------------------------------
def test_cli_image_size(t: TestRunner) -> None:
t.start("Image size is reasonable (< 200MB)")
r = subprocess.run(
["docker", "inspect", t.image, "--format", "{{.Size}}"],
capture_output=True, text=True, timeout=10,
)
try:
size_mb = int(r.stdout.strip()) // (1024 * 1024)
except ValueError:
size_mb = 0
t.debug(f"Image size: {size_mb}MB")
if 0 < size_mb < 200:
t.pass_test()
else:
t.fail_test("Image size", f"Expected < 200MB, got {size_mb}MB")
# ---------------------------------------------------------------------------
# Test suites per variant
# ---------------------------------------------------------------------------
def run_cli_tests(t: TestRunner) -> None:
"""Tests for the minimal CLI Docker image."""
test_image_exists(t)
test_cli_image_size(t)
test_version(t)
test_help(t)
test_mime_detection(t)
test_extract_text(t)
test_extract_pdf(t)
test_extract_html(t)
test_extract_docx(t)
test_batch_cli(t)
test_readonly_mount(t)
test_nonexistent_file(t)
def run_core_full_tests(t: TestRunner) -> None:
"""Tests for core and full Docker images."""
test_image_exists(t)
test_version(t)
test_help(t)
test_mime_detection(t)
test_extract_text(t)
test_extract_pdf(t)
test_extract_docx(t)
test_extract_html(t)
test_ocr_extraction(t)
if t.variant == "full":
test_doc_extraction(t)
test_paddle_ocr_extraction(t)
test_api_health(t)
test_api_extract(t)
test_api_info(t)
test_api_openapi(t)
test_api_cache(t)
test_api_batch(t)
test_cli_batch_json(t)
test_mcp_server(t)
test_cli_cache(t)
test_security_nonroot(t)
test_security_readonly(t)
test_security_memlimit(t)
def main() -> None:
parser = argparse.ArgumentParser(description="Docker image tests")
parser.add_argument("--image", required=True, help="Docker image name")
parser.add_argument("--variant", required=True, choices=["core", "full", "cli"])
parser.add_argument("--verbose", action="store_true")
parser.add_argument("--skip-build", action="store_true", help="(ignored, kept for compat)")
args = parser.parse_args()
t = TestRunner(image=args.image, variant=args.variant, verbose=args.verbose)
print("=" * 72)
t.info(f"Starting Docker tests for: {args.image} (variant: {args.variant})")
print("=" * 72)
try:
if args.variant == "cli":
run_cli_tests(t)
else:
run_core_full_tests(t)
finally:
t.cleanup()
# Summary
print()
print("=" * 72)
t.info(f"Test Results: {t.passed}/{t.total} passed, {t.failed} failed")
print("=" * 72)
if t.failed > 0:
t.error("Failed tests:")
for name in t.failed_names:
print(f" - {name}")
t.write_results()
if t.failed > 0:
sys.exit(1)
t.ok("All tests passed!")
if __name__ == "__main__":
main()
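Two patterns in the test script above may be fragile. Each API test picks its host port with `9000 + random.randint(0, 999)`, which can collide with a port that is already bound, and the plugin counts are read from the `/health` body with regular expressions. The following is a minimal sketch of standard-library alternatives, not the project's implementation. `find_free_port` and `find_int_field` are hypothetical helper names, and the sample payload layout is an assumption; only the key names `ocr_backends_count` and `extractors_count` are taken from the tests.

```python
import json
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the kernel choose a free port
        return s.getsockname()[1]


def find_int_field(payload: object, key: str) -> int:
    """Return the first non-zero integer stored under `key` anywhere in a
    decoded JSON value, or 0 if no such field exists (hypothetical helper)."""
    if isinstance(payload, dict):
        value = payload.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, int) and not isinstance(value, bool) and value:
            return value
        children = list(payload.values())
    elif isinstance(payload, list):
        children = payload
    else:
        return 0
    for child in children:
        found = find_int_field(child, key)
        if found:
            return found
    return 0


if __name__ == "__main__":
    # Assumed response shape, for illustration only.
    sample = '{"status": "ok", "plugins": {"ocr_backends_count": 2, "extractors_count": 30}}'
    data = json.loads(sample)
    print(find_free_port())
    print(find_int_field(data, "ocr_backends_count"))
```

The port is released before Docker binds it, so a small race remains; the sketch only narrows the collision window compared with a random pick. Searching the decoded JSON recursively avoids hard-coding where the counters sit inside the response.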

scripts/ci/docs/build.sh Executable file

@@ -0,0 +1,61 @@
#!/usr/bin/env bash
# Build the documentation site (Zensical, doc dependency group).
#
# Usage:
# scripts/ci/docs/build.sh
# scripts/ci/docs/build.sh --strict --log-file /tmp/build-log.txt
#
# Caching: use astral-sh/setup-uv with enable-cache in CI; this script only runs uv.
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
cd "$REPO_ROOT"
strict=false
log_file=""
while [[ $# -gt 0 ]]; do
case "$1" in
--strict)
strict=true
shift
;;
--log-file)
if [[ $# -lt 2 ]]; then
echo "error: --log-file requires a path" >&2
exit 2
fi
log_file="$2"
shift 2
;;
*)
echo "usage: $0 [--strict] [--log-file PATH]" >&2
exit 2
;;
esac
done
uv_sync() {
uv sync --group doc --no-editable --no-install-workspace --no-install-project
}
zensical_build() {
if [[ "$strict" == true ]]; then
uv run --no-sync zensical build --clean --strict
else
uv run --no-sync zensical build --clean
fi
}
if [[ -n "$log_file" ]]; then
set -o pipefail
mkdir -p "$(dirname "$log_file")"
: >"$log_file"
uv_sync 2>&1 | tee -a "$log_file"
zensical_build 2>&1 | tee -a "$log_file"
else
uv_sync
zensical_build
fi

scripts/ci/docs/textlint.sh Executable file

@@ -0,0 +1,13 @@
#!/usr/bin/env bash
# Run textlint prose linting against docs/**/*.md.
#
# Usage:
# scripts/ci/docs/textlint.sh
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"
cd "$REPO_ROOT"
npx textlint "docs/**/*.md"


@@ -0,0 +1,17 @@
#!/usr/bin/env bash
set -euo pipefail
version="$(
apt-cache policy tesseract-ocr 2>/dev/null |
grep 'Candidate:' |
grep -Eo '[0-9]+\.[0-9]+' |
head -1 ||
true
)"
if [[ -z "${version}" ]]; then
version="unknown"
fi
echo "version=${version}" >>"${GITHUB_OUTPUT}"
echo "::notice title=Tesseract Version::Detected version: ${version}"


@@ -0,0 +1,25 @@
#!/usr/bin/env bash
set -euo pipefail
version=""
json="$(brew info --json=v2 tesseract 2>/dev/null || true)"
if [[ -n "${json}" ]]; then
version="$(
python3 -c 'import json, re, sys; data = json.loads(sys.argv[1]); stable = (((data.get("formulae") or [{}])[0].get("versions") or {}).get("stable") or ""); m = re.match(r"^(\d+\.\d+)", stable); print(m.group(1) if m else "")' "${json}" || true
)"
fi
if [[ -z "${version}" ]]; then
first_line="$(brew info tesseract 2>/dev/null | head -1 || true)"
if [[ "${first_line}" =~ ([0-9]+\.[0-9]+) ]]; then
version="${BASH_REMATCH[1]}"
fi
fi
if [[ -z "${version}" ]]; then
version="unknown"
fi
echo "version=${version}" >>"${GITHUB_OUTPUT}"
echo "::notice title=Tesseract Version::Detected version: ${version}"
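The inline `python3 -c` one-liner in the Homebrew script above is dense. As a reading aid, here is a sketch of the same parsing logic expanded into a function. The name `stable_major_minor` and the sample JSON are illustrative assumptions; the field path (`formulae[0].versions.stable`) and the regular expression are the ones the one-liner uses.

```python
import json
import re


def stable_major_minor(brew_info_json: str) -> str:
    """Return "MAJOR.MINOR" of the first formula's stable version, or "" if absent."""
    data = json.loads(brew_info_json)
    # Fall back to empty containers at each level, as the one-liner does
    formulae = data.get("formulae") or [{}]
    versions = formulae[0].get("versions") or {}
    stable = versions.get("stable") or ""
    match = re.match(r"^(\d+\.\d+)", stable)
    return match.group(1) if match else ""


if __name__ == "__main__":
    # Illustrative payload; the version number is made up for the example.
    sample = '{"formulae": [{"name": "tesseract", "versions": {"stable": "5.5.0"}}]}'
    print(stable_major_minor(sample))  # prints 5.5
```

An empty string signals "not found", which lets the shell script fall through to its text-based fallback and finally to `unknown`.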


@@ -0,0 +1,136 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
source "$REPO_ROOT/scripts/lib/retry.sh"
echo "::group::Installing Linux dependencies"
echo "Updating package index..."
if ! retry_with_backoff sudo apt-get update; then
echo "::warning::apt-get update failed after retries, continuing anyway..."
fi
packages=(
tesseract-ocr
tesseract-ocr-eng
tesseract-ocr-tur
tesseract-ocr-deu
fonts-liberation
fonts-dejavu-core
fonts-noto-core
libssl-dev
pkg-config
build-essential
cmake
libmagic-dev
libuv1-dev
php-cli
php-dev
)
echo "Installing dependencies..."
if retry_with_backoff_timeout 900 sudo apt-get install -y "${packages[@]}"; then
echo "✓ All packages installed successfully"
else
exit_code=$?
if [ $exit_code -eq 124 ]; then
echo "::error::Package installation timed out after 15 minutes"
else
echo "::warning::Some packages failed to install, attempting individual installs..."
for pkg in tesseract-ocr libssl-dev pkg-config cmake; do
echo "Installing $pkg..."
if retry_with_backoff_timeout 300 sudo apt-get install -y "$pkg" 2>&1; then
echo "$pkg installed"
else
echo " ⚠ Failed to install $pkg"
fi
done
fi
fi
echo "::endgroup::"
echo "::group::Verifying Linux installations"
echo "CMake:"
if command -v cmake >/dev/null 2>&1; then
cmake --version | head -1
echo "✓ CMake available"
# Export CMAKE environment variable for immediate availability in build scripts
CMAKE_FULL_PATH="$(command -v cmake)"
# Use ${VAR:-} so that `set -u` does not abort when run outside GitHub Actions
if [[ -n "${GITHUB_ENV:-}" ]]; then
echo "CMAKE=$CMAKE_FULL_PATH" >>"$GITHUB_ENV"
echo "✓ Set CMAKE=$CMAKE_FULL_PATH in GITHUB_ENV"
fi
# Also add cmake binary directory to GITHUB_PATH for subsequent steps
CMAKE_BIN="$(dirname "$CMAKE_FULL_PATH")"
if [[ -n "${GITHUB_PATH:-}" && -d "$CMAKE_BIN" ]]; then
echo "$CMAKE_BIN" >>"$GITHUB_PATH"
echo "✓ Added cmake directory to GITHUB_PATH: $CMAKE_BIN"
fi
else
echo "::error::CMake not found after installation"
exit 1
fi
echo ""
echo "Tesseract:"
if command -v tesseract >/dev/null 2>&1; then
if tesseract --version 2>/dev/null | head -1; then
echo "✓ Tesseract CLI available"
else
echo "::warning::Tesseract CLI present but failed to run"
fi
else
echo "::warning::Tesseract CLI not found; continuing (OCR will rely on bundled Tesseract)"
fi
echo ""
echo "Available Tesseract languages:"
if command -v tesseract >/dev/null 2>&1; then
tesseract --list-langs | head -10 || true
else
echo "(tesseract CLI not available)"
fi
echo ""
echo "PHP:"
if command -v php >/dev/null 2>&1; then
php --version | head -1
echo "✓ PHP available"
else
echo "::error::PHP not found after installation"
exit 1
fi
echo ""
echo "Checking Tesseract data path..."
tessdata_found=0
for tessdata_path in "/usr/share/tesseract-ocr/5/tessdata" "/usr/share/tesseract-ocr/tessdata"; do
if [ -d "$tessdata_path" ]; then
echo "Found tessdata at: $tessdata_path"
echo "Required language files:"
for lang in eng tur deu; do
if [ -f "$tessdata_path/${lang}.traineddata" ]; then
size=$(stat -c%s "$tessdata_path/${lang}.traineddata" 2>/dev/null || echo "unknown")
echo "${lang}.traineddata ($size bytes)"
else
echo "${lang}.traineddata (missing)"
fi
done
tessdata_found=1
break
fi
done
if [ $tessdata_found -eq 0 ]; then
echo "::error::Tessdata directory not found in standard locations"
exit 1
fi
echo "::endgroup::"


@@ -0,0 +1,136 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
source "$REPO_ROOT/scripts/lib/retry.sh"
echo "::group::Installing macOS dependencies"
if [[ -d "/opt/homebrew/bin" ]]; then
export PATH="/opt/homebrew/bin:/opt/homebrew/sbin:${PATH}"
# GITHUB_PATH is unset outside GitHub Actions; guard it so `set -u` does not abort
if [[ -n "${GITHUB_PATH:-}" ]]; then
echo "/opt/homebrew/bin" >>"$GITHUB_PATH"
echo "/opt/homebrew/sbin" >>"$GITHUB_PATH"
fi
fi
if [[ -d "/usr/local/bin" ]]; then
export PATH="/usr/local/bin:/usr/local/sbin:${PATH}"
if [[ -n "${GITHUB_PATH:-}" ]]; then
echo "/usr/local/bin" >>"$GITHUB_PATH"
echo "/usr/local/sbin" >>"$GITHUB_PATH"
fi
fi
if ! brew list cmake &>/dev/null; then
echo "Installing CMake..."
retry_with_backoff brew install cmake || {
echo "::error::Failed to install CMake after retries"
exit 1
}
else
echo "✓ CMake already installed"
fi
if ! command -v cmake >/dev/null 2>&1; then
echo "CMake not on PATH after install; attempting brew link..."
brew link --overwrite cmake >/dev/null 2>&1 || true
fi
if ! brew list tesseract &>/dev/null; then
echo "Installing Tesseract..."
retry_with_backoff brew install tesseract || {
echo "::error::Failed to install Tesseract after retries"
exit 1
}
else
echo "✓ Tesseract already installed"
fi
if ! command -v tesseract >/dev/null 2>&1; then
echo "Tesseract not on PATH after install; attempting brew link..."
brew link --overwrite tesseract >/dev/null 2>&1 || true
fi
if ! brew list tesseract-lang &>/dev/null; then
echo "Installing Tesseract language packs..."
retry_with_backoff brew install tesseract-lang || {
echo "::warning::Failed to install tesseract-lang, some languages may be unavailable"
}
else
echo "✓ Tesseract language packs already installed"
fi
if ! brew list libmagic &>/dev/null; then
echo "Installing libmagic..."
retry_with_backoff brew install libmagic || {
echo "::warning::Failed to install libmagic after retries"
}
else
echo "✓ libmagic already installed"
fi
if ! brew list php &>/dev/null; then
echo "Installing PHP..."
retry_with_backoff brew install php || {
echo "::error::Failed to install PHP after retries"
exit 1
}
else
echo "✓ PHP already installed"
fi
if ! command -v php >/dev/null 2>&1; then
echo "PHP not on PATH after install; attempting brew link..."
brew link --overwrite php >/dev/null 2>&1 || true
fi
echo "::endgroup::"
echo "::group::Verifying macOS installations"
echo "CMake:"
if command -v cmake >/dev/null 2>&1; then
cmake --version | head -1
# Export CMAKE environment variable for immediate availability in build scripts
CMAKE_FULL_PATH="$(command -v cmake)"
# Use ${VAR:-} so that `set -u` does not abort when the variable is unset
if [[ -n "${GITHUB_ENV:-}" ]]; then
echo "CMAKE=$CMAKE_FULL_PATH" >>"$GITHUB_ENV"
echo "✓ Set CMAKE=$CMAKE_FULL_PATH in GITHUB_ENV"
fi
# Also add cmake binary directory to GITHUB_PATH for subsequent steps
CMAKE_BIN="$(dirname "$CMAKE_FULL_PATH")"
if [[ -n "${GITHUB_PATH:-}" && -d "$CMAKE_BIN" ]]; then
echo "$CMAKE_BIN" >>"$GITHUB_PATH"
echo "✓ Added cmake directory to GITHUB_PATH: $CMAKE_BIN"
fi
else
echo "::error::CMake not found on PATH after installation"
echo "PATH=$PATH"
brew --prefix cmake 2>/dev/null || true
exit 1
fi
echo ""
echo "Tesseract:"
if command -v tesseract >/dev/null 2>&1; then
tesseract --version | head -1
else
echo "::error::Tesseract not found on PATH after installation"
echo "PATH=$PATH"
brew --prefix tesseract 2>/dev/null || true
exit 1
fi
echo ""
echo "Available languages:"
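# With pipefail, `head` closing the pipe early can fail the pipeline; tolerate that as the Linux script does
tesseract --list-langs | head -5 || true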
echo ""
echo "PHP:"
if command -v php >/dev/null 2>&1; then
php --version | head -1
else
echo "::error::PHP not found on PATH after installation"
echo "PATH=$PATH"
exit 1
fi
echo "::endgroup::"


@@ -0,0 +1,301 @@
#!/usr/bin/env pwsh
Set-StrictMode -Version Latest
$ErrorActionPreference = 'Stop'
Write-Host "::group::Installing Windows dependencies"
function Retry-Command {
param(
[scriptblock]$Command,
[int]$MaxAttempts = 3,
[int]$DelaySeconds = 5
)
$attempt = 1
while ($attempt -le $MaxAttempts) {
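try {
Write-Host "Attempt $attempt of $MaxAttempts..."
# Reset the exit code, and send output to the host so it is not returned alongside $true/$false
$global:LASTEXITCODE = 0
& $Command | Out-Host
# Native commands such as choco do not throw on failure; check the exit code explicitly
if ($LASTEXITCODE -ne 0) { throw "Command exited with code $LASTEXITCODE" }
return $true
}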
catch {
$attempt++
if ($attempt -le $MaxAttempts) {
$backoffDelay = $DelaySeconds * [Math]::Pow(2, $attempt - 1)
Write-Host "⚠ Attempt failed, retrying in ${backoffDelay}s..." -ForegroundColor Yellow
Start-Sleep -Seconds $backoffDelay
}
else {
return $false
}
}
}
}
$tesseractCacheHit = $env:TESSERACT_CACHE_HIT -eq "true"
$llvmCacheHit = $env:LLVM_CACHE_HIT -eq "true"
$cmakeCacheHit = $env:CMAKE_CACHE_HIT -eq "true"
$cmakeInstalled = $false
Write-Host "Cache status:"
Write-Host " TESSERACT_CACHE_HIT: $env:TESSERACT_CACHE_HIT (evaluated: $tesseractCacheHit)"
Write-Host " LLVM_CACHE_HIT: $env:LLVM_CACHE_HIT (evaluated: $llvmCacheHit)"
Write-Host " CMAKE_CACHE_HIT: $env:CMAKE_CACHE_HIT (evaluated: $cmakeCacheHit)"
Write-Host ""
try {
& cmake --version 2>$null
Write-Host "✓ CMake already installed"
$cmakeInstalled = $true
}
catch {
Write-Host "CMake not found, will attempt to install"
}
if (-not $tesseractCacheHit) {
Write-Host "Tesseract cache miss, installing (optional for build - needed for tests only)..."
if (-not (Retry-Command { choco install -y tesseract --no-progress } -MaxAttempts 3)) {
Write-Host "::warning::Failed to install Tesseract (optional dependency - gem build does not require it)"
}
else {
Write-Host "✓ Tesseract installed"
# Ensure tessdata directory exists and is accessible
$tesseractPath = "C:\Program Files\Tesseract-OCR"
if (Test-Path $tesseractPath) {
Write-Host " Configuring Tesseract data paths..."
# Create tessdata directory if it doesn't exist
$tessdataPath = "$tesseractPath\tessdata"
if (-not (Test-Path $tessdataPath)) {
Write-Host " Creating tessdata directory at: $tessdataPath"
New-Item -ItemType Directory -Path $tessdataPath -Force | Out-Null
}
# Download English language data if not present
if (-not (Test-Path "$tessdataPath\eng.traineddata")) {
Write-Host " Downloading English language data..."
try {
$engUrl = "https://github.com/tesseract-ocr/tessdata_fast/raw/main/eng.traineddata"
Invoke-WebRequest -Uri $engUrl -OutFile "$tessdataPath\eng.traineddata" -ErrorAction Stop
Write-Host " ✓ Downloaded eng.traineddata"
}
catch {
Write-Host " ::warning::Failed to download eng.traineddata: $($_.Exception.Message)"
}
}
# Download OSD data if not present (needed for orientation detection)
if (-not (Test-Path "$tessdataPath\osd.traineddata")) {
Write-Host " Downloading OSD data..."
try {
$osdUrl = "https://github.com/tesseract-ocr/tessdata_fast/raw/main/osd.traineddata"
Invoke-WebRequest -Uri $osdUrl -OutFile "$tessdataPath\osd.traineddata" -ErrorAction Stop
Write-Host " ✓ Downloaded osd.traineddata"
}
catch {
Write-Host " ::warning::Failed to download osd.traineddata: $($_.Exception.Message)"
}
}
}
}
}
else {
Write-Host "✓ Tesseract found in cache"
}
if (-not $llvmCacheHit) {
Write-Host "LLVM cache miss, installing LLVM/Clang (required for bindgen)..."
if (-not (Retry-Command { choco install -y llvm --no-progress } -MaxAttempts 3)) {
Write-Host "::warning::Failed to install LLVM/Clang via Chocolatey"
}
else {
Write-Host "✓ LLVM/Clang installed"
}
}
else {
Write-Host "✓ LLVM/Clang found in cache"
}
Write-Host "Installing PHP..."
$phpInstalled = $false
try {
& php --version 2>$null
Write-Host "✓ PHP already installed"
$phpInstalled = $true
}
catch {
Write-Host "PHP not found, installing via Chocolatey..."
if (-not (Retry-Command { choco install -y php --no-progress } -MaxAttempts 3)) {
Write-Host "::warning::Failed to install PHP via Chocolatey, will rely on shivammathur/setup-php action"
}
else {
Write-Host "✓ PHP installed via Chocolatey"
$phpInstalled = $true
}
}
Write-Host "Installing CMake..."
if (-not $cmakeCacheHit) {
Write-Host "CMake cache miss, installing..."
if (-not (Retry-Command { choco install -y cmake --no-progress } -MaxAttempts 3)) {
throw "Failed to install CMake after 3 attempts"
}
Write-Host "✓ CMake installed"
}
else {
Write-Host "✓ CMake found in cache"
}
Write-Host "Configuring PATH and environment variables..."
$paths = @(
"C:\Program Files\CMake\bin",
"C:\Program Files\Tesseract-OCR",
"C:\Program Files\LLVM\bin",
"C:\tools\php",
"C:\Program Files\PHP"
)
foreach ($path in $paths) {
if (Test-Path $path) {
Write-Host " Adding to PATH: $path"
Write-Output $path | Out-File -FilePath $env:GITHUB_PATH -Encoding utf8 -Append
$env:PATH = "$path;$env:PATH"
}
else {
Write-Host " Path not found (skipping): $path"
}
}
# Ensure TESSDATA_PREFIX is set for Windows OCR tests
$tesseractPath = "C:\Program Files\Tesseract-OCR"
if (Test-Path $tesseractPath) {
$tessdataPath = "$tesseractPath\tessdata"
if (Test-Path $tessdataPath) {
Write-Host " Setting TESSDATA_PREFIX for tests: $tessdataPath"
Add-Content -Path $env:GITHUB_ENV -Value "TESSDATA_PREFIX=$tessdataPath"
$env:TESSDATA_PREFIX = $tessdataPath
}
}
Write-Host "::endgroup::"
Write-Host "::group::Verifying Windows installations"
Write-Host "Tesseract (optional for build):"
try {
$tesseractCmd = Get-Command tesseract -ErrorAction Stop
$tesseractPath = $tesseractCmd.Path
Write-Host " Found at: $tesseractPath"
Write-Host " Command type: $($tesseractCmd.CommandType)"
# Get installation directory
$tesseractDir = Split-Path -Parent $tesseractPath
Write-Host " Installation directory: $tesseractDir"
# Check for tessdata
$tessdataPath = Join-Path $tesseractDir "tessdata"
if (Test-Path $tessdataPath) {
Write-Host " tessdata directory: $tessdataPath"
Write-Host " Available language files:"
Get-ChildItem "$tessdataPath\*.traineddata" -ErrorAction SilentlyContinue | ForEach-Object {
Write-Host " - $($_.Name)"
}
}
else {
Write-Host " tessdata directory not found at: $tessdataPath"
}
try {
$version = & tesseract --version 2>&1
Write-Host " Version output: $version"
Write-Host "✓ Tesseract available and working"
Write-Host ""
Write-Host "Available Tesseract languages:"
& tesseract --list-langs 2>&1 | ForEach-Object { Write-Host " $_" }
}
catch {
Write-Host "⚠ Warning: Tesseract found but failed to run: $($_.Exception.Message)"
}
# Set TESSDATA_PREFIX environment variable for tests
if (Test-Path $tessdataPath) {
Write-Host ""
Write-Host "Setting TESSDATA_PREFIX environment variable..."
Add-Content -Path $env:GITHUB_ENV -Value "TESSDATA_PREFIX=$tessdataPath"
Write-Host "✓ Set TESSDATA_PREFIX=$tessdataPath in GITHUB_ENV"
$env:TESSDATA_PREFIX = $tessdataPath
}
}
catch {
Write-Host "⚠ Tesseract not found on PATH (not required for build)"
Write-Host " Error details: $($_.Exception.Message)"
Write-Host " Searching common installation locations..."
$commonPaths = @(
"C:\Program Files\Tesseract-OCR\tesseract.exe",
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
"${env:ProgramFiles}\Tesseract-OCR\tesseract.exe",
"${env:ProgramFiles(x86)}\Tesseract-OCR\tesseract.exe"
)
$found = $false
foreach ($path in $commonPaths) {
if (Test-Path $path) {
Write-Host " Found Tesseract at: $path (not on PATH)"
$tesseractDir = Split-Path -Parent $path
$tessdataPath = Join-Path $tesseractDir "tessdata"
if (Test-Path $tessdataPath) {
Write-Host " Found tessdata at: $tessdataPath"
Add-Content -Path $env:GITHUB_ENV -Value "TESSDATA_PREFIX=$tessdataPath"
Write-Host "✓ Set TESSDATA_PREFIX=$tessdataPath in GITHUB_ENV"
$env:TESSDATA_PREFIX = $tessdataPath
}
$found = $true
break
}
}
if (-not $found) {
Write-Host " Tesseract not found in common locations"
}
}
Write-Host ""
Write-Host "CMake:"
try {
& cmake --version
Write-Host "✓ CMake available"
# Export CMAKE environment variable for immediate availability in build scripts
$cmakePath = (Get-Command cmake -ErrorAction Stop).Source
if ($cmakePath) {
Add-Content -Path $env:GITHUB_ENV -Value "CMAKE=$cmakePath"
Write-Host "✓ Set CMAKE=$cmakePath in GITHUB_ENV"
}
}
catch {
Write-Host "::error::CMake not found after installation"
throw "CMake verification failed"
}
Write-Host ""
Write-Host "Clang:"
try {
& clang --version
Write-Host "✓ Clang available"
}
catch {
Write-Host "⚠ Warning: Clang not currently available on PATH"
}
Write-Host ""
Write-Host "PHP:"
try {
& php --version
Write-Host "✓ PHP available"
}
catch {
Write-Host "⚠ Warning: PHP not currently available on PATH (will be set up by shivammathur/setup-php action)"
}
Write-Host "::endgroup::"
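The Tesseract fallback above walks a list of candidate executables and derives `tessdata` from the first one that exists. A minimal Python sketch of that lookup follows. `find_tessdata` is a hypothetical name used only for illustration, and the sketch assumes, like the PowerShell code, that the search stops at the first executable found even when its `tessdata` directory is missing.

```python
from __future__ import annotations

from pathlib import Path


def find_tessdata(candidates: list[Path]) -> Path | None:
    """Return the tessdata directory next to the first existing executable.

    Hypothetical helper mirroring the PowerShell loop: the search stops at the
    first executable that exists, whether or not tessdata sits beside it.
    """
    for exe in candidates:
        if exe.is_file():
            tessdata = exe.parent / "tessdata"
            return tessdata if tessdata.is_dir() else None
    return None
```

A caller would then export the returned path as `TESSDATA_PREFIX`, as the script does through `GITHUB_ENV`.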

@@ -0,0 +1,433 @@
#!/usr/bin/env python3
"""
Vendor kreuzberg core crate into R package
Used by: ci-r.yaml - Vendor kreuzberg core crate step
This script:
1. Reads workspace.dependencies from root Cargo.toml
2. Copies core crates to packages/r/vendor/
3. Replaces workspace = true with explicit versions
4. Generates vendor/Cargo.toml with proper workspace setup
"""
import os
import sys
import shutil
import re
from pathlib import Path
try:
import tomllib
except ImportError:
import tomli as tomllib # type: ignore
def get_repo_root() -> Path:
"""Get repository root directory."""
repo_root_env = os.environ.get("REPO_ROOT")
if repo_root_env:
return Path(repo_root_env)
script_dir = Path(__file__).parent.absolute()
return (script_dir / ".." / ".." / "..").resolve()
def read_toml(path: Path) -> dict[str, object]:
"""Read TOML file."""
with open(path, "rb") as f:
return tomllib.load(f)
def get_workspace_deps(repo_root: Path) -> dict[str, object]:
"""Extract workspace.dependencies from root Cargo.toml."""
cargo_toml_path = repo_root / "Cargo.toml"
data = read_toml(cargo_toml_path)
return data.get("workspace", {}).get("dependencies", {})
def get_workspace_version(repo_root: Path) -> str:
"""Extract version from workspace.package."""
cargo_toml_path = repo_root / "Cargo.toml"
data = read_toml(cargo_toml_path)
return data.get("workspace", {}).get("package", {}).get("version", "4.0.0")
def format_dependency(name: str, dep_spec: object) -> str:
"""Format a dependency spec for Cargo.toml."""
if isinstance(dep_spec, str):
return f'{name} = "{dep_spec}"'
elif isinstance(dep_spec, dict):
version: str = dep_spec.get("version", "")
package: str | None = dep_spec.get("package")
features: list[str] = dep_spec.get("features", [])
default_features: bool | None = dep_spec.get("default-features")
optional: bool | None = dep_spec.get("optional")
path: str | None = dep_spec.get("path")
git: str | None = dep_spec.get("git")
branch: str | None = dep_spec.get("branch")
tag: str | None = dep_spec.get("tag")
rev: str | None = dep_spec.get("rev")
parts: list[str] = []
if package:
parts.append(f'package = "{package}"')
if git:
parts.append(f'git = "{git}"')
if branch:
parts.append(f'branch = "{branch}"')
if tag:
parts.append(f'tag = "{tag}"')
if rev:
parts.append(f'rev = "{rev}"')
if path:
parts.append(f'path = "{path}"')
if version:
parts.append(f'version = "{version}"')
if features:
features_str = ', '.join(f'"{f}"' for f in features)
parts.append(f'features = [{features_str}]')
if default_features is False:
parts.append('default-features = false')
elif default_features is True:
parts.append('default-features = true')
if optional is True:
parts.append('optional = true')
elif optional is False:
parts.append('optional = false')
spec_str = ", ".join(parts)
return f"{name} = {{ {spec_str} }}"
return f'{name} = "{dep_spec}"'
def replace_workspace_deps_in_toml(toml_path: Path, workspace_deps: dict[str, object]) -> None:
"""Replace workspace = true with explicit versions in a Cargo.toml file."""
with open(toml_path, "r") as f:
content = f.read()
for name, dep_spec in workspace_deps.items():
pattern1 = rf'^{re.escape(name)} = \{{ workspace = true \}}$'
# Use a callable so backslashes in the dependency spec are not treated as regex escapes
content = re.sub(pattern1, lambda _m: format_dependency(name, dep_spec), content, flags=re.MULTILINE)
def replace_with_fields(match: re.Match[str]) -> str:
other_fields_str = match.group(1).strip()
base_spec = format_dependency(name, dep_spec)
if " = { " not in base_spec:
# Simple string dep like `ctor = "0.6"` - wrap it
version_val = base_spec.split(" = ", 1)[1].strip('"')
spec_part = f'version = "{version_val}"'
else:
spec_part = base_spec.split(" = { ", 1)[1].rstrip("} ").rstrip("}")
# Extract existing keys and values from workspace spec, handling nested brackets
workspace_fields: dict[str, str] = {}
bracket_depth = 0
current_field = ""
for char in spec_part:
if char == '[':
bracket_depth += 1
current_field += char
elif char == ']':
bracket_depth -= 1
current_field += char
elif char == ',' and bracket_depth == 0:
# End of field
field = current_field.strip()
if field and "=" in field:
key, val = field.split("=", 1)
workspace_fields[key.strip()] = val.strip()
current_field = ""
else:
current_field += char
# Don't forget the last field
if current_field.strip():
field = current_field.strip()
if field and "=" in field:
key, val = field.split("=", 1)
workspace_fields[key.strip()] = val.strip()
# Extract crate-specific keys using bracket-aware parsing
crate_fields: dict[str, str] = {}
bracket_depth = 0
current_field = ""
for char in other_fields_str:
if char == '[':
bracket_depth += 1
current_field += char
elif char == ']':
bracket_depth -= 1
current_field += char
elif char == ',' and bracket_depth == 0:
# End of field
field = current_field.strip()
if field and "=" in field:
key, val = field.split("=", 1)
crate_fields[key.strip()] = val.strip()
current_field = ""
else:
current_field += char
# Don't forget the last field
if current_field.strip():
field = current_field.strip()
if field and "=" in field:
key, val = field.split("=", 1)
crate_fields[key.strip()] = val.strip()
# Merge: crate-specific fields override workspace fields
merged_fields = {**workspace_fields, **crate_fields}
# Build result from merged fields
merged_parts = [f"{k} = {v}" for k, v in merged_fields.items()]
merged_spec = ", ".join(merged_parts)
return f"{name} = {{ {merged_spec} }}"
pattern2 = rf'^{re.escape(name)} = \{{ workspace = true, (.+?) \}}$'
content = re.sub(pattern2, replace_with_fields, content, flags=re.MULTILINE | re.DOTALL)
with open(toml_path, "w") as f:
f.write(content)
def generate_vendor_cargo_toml(repo_root: Path, workspace_deps: dict[str, object], core_version: str, copied_crates: list[str]) -> None:
"""Generate vendor/Cargo.toml with workspace setup.
Args:
repo_root: Repository root directory
workspace_deps: Workspace dependencies from Cargo.toml
core_version: Core version string
copied_crates: List of crates that were successfully copied
"""
deps_lines: list[str] = []
for name, dep_spec in sorted(workspace_deps.items()):
deps_lines.append(format_dependency(name, dep_spec))
deps_str = "\n".join(deps_lines)
# Build members list based on actually copied crates
members = [name for name in ["kreuzberg", "kreuzberg-ffi", "kreuzberg-tesseract", "kreuzberg-paddle-ocr"]
if name in copied_crates]
members_str = ', '.join(f'"{m}"' for m in members)
vendor_toml = f'''[workspace]
members = [{members_str}]
[workspace.package]
version = "{core_version}"
edition = "2024"
rust-version = "1.91"
authors = ["Na'aman Hirschfeld <naaman@kreuzberg.dev>"]
license = "MIT"
repository = "https://github.com/kreuzberg-dev/kreuzberg"
homepage = "https://kreuzberg.dev"
[workspace.dependencies]
{deps_str}
'''
vendor_dir = repo_root / "packages" / "r" / "vendor"
vendor_dir.mkdir(parents=True, exist_ok=True)
toml_path = vendor_dir / "Cargo.toml"
with open(toml_path, "w") as f:
f.write(vendor_toml)
def main() -> None:
"""Main vendoring function."""
repo_root: Path = get_repo_root()
print("=== Vendoring kreuzberg core crate ===")
workspace_deps: dict[str, object] = get_workspace_deps(repo_root)
core_version: str = get_workspace_version(repo_root)
print(f"Core version: {core_version}")
print(f"Workspace dependencies: {len(workspace_deps)}")
vendor_base: Path = repo_root / "packages" / "r" / "vendor"
# Clean only crate directories
crate_names = ["kreuzberg", "kreuzberg-ffi", "kreuzberg-tesseract",
"kreuzberg-paddle-ocr"]
for name in crate_names:
crate_path = vendor_base / name
if crate_path.exists():
shutil.rmtree(crate_path)
# Also clean the vendor Cargo.toml (will be regenerated)
vendor_cargo = vendor_base / "Cargo.toml"
if vendor_cargo.exists():
vendor_cargo.unlink()
print("Cleaned vendor crate directories")
vendor_base.mkdir(parents=True, exist_ok=True)
crates_to_copy: list[tuple[str, str]] = [
("crates/kreuzberg", "kreuzberg"),
("crates/kreuzberg-ffi", "kreuzberg-ffi"),
("crates/kreuzberg-tesseract", "kreuzberg-tesseract"),
("crates/kreuzberg-paddle-ocr", "kreuzberg-paddle-ocr"),
]
copied_crates: list[str] = []
for src_rel, dest_name in crates_to_copy:
src: Path = repo_root / src_rel
dest: Path = vendor_base / dest_name
if src.exists():
try:
shutil.copytree(src, dest)
copied_crates.append(dest_name)
print(f"Copied {dest_name}")
except Exception as e:
print(f"Warning: Failed to copy {dest_name}: {e}", file=sys.stderr)
else:
print(f"Warning: Source directory not found: {src_rel}")
artifact_dirs: list[str] = [".fastembed_cache", "target"]
temp_patterns: list[str] = ["*.swp", "*.bak", "*.tmp", "*~"]
for crate_dir in copied_crates:
crate_path: Path = vendor_base / crate_dir
if crate_path.exists():
for artifact_dir in artifact_dirs:
artifact: Path = crate_path / artifact_dir
if artifact.exists():
shutil.rmtree(artifact)
for pattern in temp_patterns:
for f in crate_path.rglob(pattern):
f.unlink()
print("Cleaned build artifacts")
# Update workspace inheritance in Cargo.toml files
for crate_dir in copied_crates:
crate_toml = vendor_base / crate_dir / "Cargo.toml"
if crate_toml.exists():
with open(crate_toml, "r") as f:
content = f.read()
content = re.sub(r'^version\.workspace = true$', f'version = "{core_version}"', content, flags=re.MULTILINE)
content = re.sub(r'^edition\.workspace = true$', 'edition = "2024"', content, flags=re.MULTILINE)
content = re.sub(r'^rust-version\.workspace = true$', 'rust-version = "1.91"', content, flags=re.MULTILINE)
content = re.sub(r'^authors\.workspace = true$', 'authors = ["Na\'aman Hirschfeld <naaman@kreuzberg.dev>"]', content, flags=re.MULTILINE)
content = re.sub(r'^license\.workspace = true$', 'license = "MIT"', content, flags=re.MULTILINE)
with open(crate_toml, "w") as f:
f.write(content)
replace_workspace_deps_in_toml(crate_toml, workspace_deps)
print(f"Updated {crate_dir}/Cargo.toml")
# Update path dependencies in all crates that depend on other vendored crates
# First handle kreuzberg-ffi's dependency on kreuzberg
if "kreuzberg-ffi" in copied_crates:
ffi_toml = vendor_base / "kreuzberg-ffi" / "Cargo.toml"
if ffi_toml.exists():
with open(ffi_toml, "r") as f:
content = f.read()
if "kreuzberg" in copied_crates:
# Replace kreuzberg workspace references with path dependency
# Handle cases with path, version, or neither
content = re.sub(
r'(kreuzberg = \{) (?:(?:path|version) = "[^"]*", )?',
r'\1 path = "../kreuzberg", ',
content
)
with open(ffi_toml, "w") as f:
f.write(content)
# Update path dependencies in kreuzberg crate if tesseract was copied
if "kreuzberg" in copied_crates:
kreuzberg_toml = vendor_base / "kreuzberg" / "Cargo.toml"
if kreuzberg_toml.exists():
with open(kreuzberg_toml, "r") as f:
content = f.read()
# Only update tesseract path if it was actually copied
if "kreuzberg-tesseract" in copied_crates:
content = re.sub(
r'kreuzberg-tesseract = \{ (?:path = "[^"]*", )?version = "[^"]*", optional = true \}',
'kreuzberg-tesseract = { path = "../kreuzberg-tesseract", optional = true }',
content
)
# Only update paddle-ocr path if it was actually copied
if "kreuzberg-paddle-ocr" in copied_crates:
content = re.sub(
r'kreuzberg-paddle-ocr = \{ (?:path = "[^"]*", )?version = "[^"]*", optional = true \}',
'kreuzberg-paddle-ocr = { path = "../kreuzberg-paddle-ocr", optional = true }',
content
)
with open(kreuzberg_toml, "w") as f:
f.write(content)
generate_vendor_cargo_toml(repo_root, workspace_deps, core_version, copied_crates)
print("Generated vendor/Cargo.toml")
# Copy root Cargo.lock so vendor workspace uses identical dependency versions
root_lock = repo_root / "Cargo.lock"
vendor_lock = vendor_base / "Cargo.lock"
if root_lock.exists():
shutil.copy2(root_lock, vendor_lock)
print("Copied Cargo.lock to vendor directory")
# Update R package Cargo.toml to use vendored crates
r_toml = repo_root / "packages" / "r" / "src" / "rust" / "Cargo.toml"
if r_toml.exists():
with open(r_toml, "r") as f:
content = f.read()
# Replace path dependencies to point to vendored crates
# From: path = "../../../../crates/kreuzberg"
# To: path = "../../vendor/kreuzberg"
content = re.sub(
r'path = "\.\./\.\./\.\./\.\./crates/kreuzberg"',
'path = "../../vendor/kreuzberg"',
content
)
content = re.sub(
r'path = "\.\./\.\./\.\./\.\./crates/kreuzberg-ffi"',
'path = "../../vendor/kreuzberg-ffi"',
content
)
with open(r_toml, "w") as f:
f.write(content)
print("Updated R package Cargo.toml to use vendored crates")
print(f"\nVendoring complete (core version: {core_version})")
print(f"Copied crates: {', '.join(sorted(copied_crates))}")
if "kreuzberg" in copied_crates and "kreuzberg-ffi" in copied_crates:
print("R package Cargo.toml uses:")
print(" - path '../../vendor/kreuzberg' for kreuzberg crate")
print(" - path '../../vendor/kreuzberg-ffi' for kreuzberg-ffi crate")
else:
print("Warning: Some required crates were not copied. Check for missing source directories.")
if __name__ == "__main__":
try:
main()
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
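The bracket-aware parsing inside `replace_with_fields` appears twice in the script above, once for the workspace spec and once for the crate-specific fields. The following is a minimal sketch of how that logic could be factored into one helper. `split_inline_fields` and `merge_specs` are hypothetical names, not part of the repository, and the sketch keeps the script's simplification that commas inside quoted strings are not handled.

```python
from __future__ import annotations


def split_inline_fields(spec: str) -> dict[str, str]:
    """Split 'k = v, k2 = [a, b]' into a dict, ignoring commas inside [...]."""
    fields: dict[str, str] = {}
    depth = 0
    current = ""
    for char in spec + ",":  # the extra comma flushes the last field
        if char == "," and depth == 0:
            key, sep, value = current.partition("=")
            if sep:
                fields[key.strip()] = value.strip()
            current = ""
            continue
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
        current += char
    return fields


def merge_specs(workspace_spec: str, crate_spec: str) -> str:
    """Merge two inline-table bodies; crate-specific fields override workspace fields."""
    merged = {**split_inline_fields(workspace_spec), **split_inline_fields(crate_spec)}
    return ", ".join(f"{key} = {value}" for key, value in merged.items())
```

With a helper like this, the merge step reduces to a single call per dependency, and the override rule (crate fields win) lives in one place.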

@@ -0,0 +1,95 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
source "$REPO_ROOT/scripts/lib/common.sh"
source "$REPO_ROOT/scripts/lib/library-paths.sh"
validate_repo_root "$REPO_ROOT" || exit 1
setup_rust_ffi_paths "$REPO_ROOT"
echo "=== Compiling Ruby native extension (Verbose Debug) ==="
cd "$REPO_ROOT/packages/ruby"
export CARGO_BUILD_JOBS=1
export RUST_BACKTRACE=1
export RB_SYS_VERBOSE=1
echo ""
echo "=== Pre-compilation environment ==="
echo "Ruby version: $(ruby --version)"
echo "Ruby platform: $(ruby -e 'puts RUBY_PLATFORM')"
echo "Rustc version: $(rustc --version)"
echo "Cargo version: $(cargo --version)"
echo "Working directory: $(pwd)"
echo ""
echo "=== Build configuration variables ==="
echo "CARGO_BUILD_JOBS: ${CARGO_BUILD_JOBS}"
echo "RUST_BACKTRACE: ${RUST_BACKTRACE}"
echo "RB_SYS_VERBOSE: ${RB_SYS_VERBOSE}"
echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-<not set>}"
echo "DYLD_LIBRARY_PATH: ${DYLD_LIBRARY_PATH:-<not set>}"
echo ""
echo "=== Pre-vendor directory state ==="
echo "packages/ruby directory contents:"
find . -maxdepth 1 \( -type f -o -type d \) | head -20
echo ""
echo "=== Vendoring kreuzberg core ==="
python3 "$REPO_ROOT/scripts/ci/ruby/vendor-kreuzberg-core.py"
echo ""
echo "=== Post-vendor directory state ==="
# The vendoring script writes to packages/ruby/vendor (the current directory is packages/ruby)
if [ -d "vendor" ]; then
echo "Vendor directory contents:"
find vendor -maxdepth 2 -type f | head -10
else
echo "WARNING: No vendor directory found in packages/ruby"
fi
echo ""
echo "=== Running rake compile with verbose output ==="
bundle exec rake compile --verbose --trace 2>&1 || {
echo ""
echo "ERROR: rake compile failed"
echo "=== Attempting to capture compilation error details ==="
if [ -f "mkmf.log" ]; then
echo "=== mkmf.log (last 150 lines) ==="
tail -150 mkmf.log
fi
echo ""
echo "=== Looking for compiled artifacts ==="
find . -name "*.so" -o -name "*.dll" -o -name "*.dylib" 2>/dev/null | head -20
echo ""
echo "=== Checking gem installation ==="
gem list kreuzberg || echo "Gem not found"
exit 1
}
echo ""
echo "=== Post-compilation directory state ==="
echo "lib/ contents:"
if [ -d "lib" ]; then
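# Group the -name tests so -type f applies to all of them; find exits 0 even with no matches
compiled="$(find lib -type f \( -name "*.so" -o -name "*.dll" -o -name "*.dylib" \) 2>/dev/null || true)"
if [ -n "$compiled" ]; then echo "$compiled"; else echo "No compiled extension found"; fi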
else
echo "ERROR: lib directory not found"
fi
echo ""
echo "=== Verifying extension can be loaded ==="
ruby -e "require_relative 'lib/kreuzberg'; puts 'Extension loaded successfully'" 2>&1 || {
echo "WARNING: Could not load extension directly"
echo "This might be expected if gem installation is required"
}
echo ""
echo "=== Compilation complete ==="

@@ -0,0 +1,5 @@
#!/usr/bin/env bash
set -euo pipefail
gem install bundler -v 4.0.3 --no-document || gem install bundler --no-document
bundler --version

@@ -0,0 +1,30 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
source "$REPO_ROOT/scripts/lib/common.sh"
validate_repo_root "$REPO_ROOT" || exit 1
echo "=== Installing Ruby dependencies ==="
cd "$REPO_ROOT/packages/ruby"
bundle_path="${BUNDLE_PATH:-$REPO_ROOT/packages/ruby/.bundle/bundle}"
if [[ -n "${GITHUB_ENV:-}" ]]; then
if [[ -z "${BUNDLE_GEMFILE:-}" ]]; then
echo "BUNDLE_GEMFILE=$REPO_ROOT/packages/ruby/Gemfile" >>"$GITHUB_ENV"
fi
if [[ -z "${BUNDLE_PATH:-}" ]]; then
echo "BUNDLE_PATH=$bundle_path" >>"$GITHUB_ENV"
fi
fi
bundle config set deployment false
bundle config set path "$bundle_path"
bundle install --jobs 4
echo "Ruby dependencies installed"

@@ -0,0 +1,430 @@
#!/usr/bin/env python3
"""
Vendor kreuzberg core crate into Ruby package
Used by: ci-ruby.yaml - Vendor kreuzberg core crate step
This script:
1. Reads workspace.dependencies from root Cargo.toml
2. Copies core crates to packages/ruby/vendor/
3. Replaces workspace = true with explicit versions
4. Generates vendor/Cargo.toml with proper workspace setup
"""
import os
import sys
import shutil
import re
from pathlib import Path
try:
import tomllib
except ImportError:
import tomli as tomllib # type: ignore[import-not-found]
def get_repo_root() -> Path:
"""Get repository root directory."""
repo_root_env = os.environ.get("REPO_ROOT")
if repo_root_env:
return Path(repo_root_env)
script_dir = Path(__file__).parent.absolute()
return (script_dir / ".." / ".." / "..").resolve()
def read_toml(path: Path) -> dict[str, object]:
"""Read TOML file."""
with open(path, "rb") as f:
return tomllib.load(f)
def get_workspace_deps(repo_root: Path) -> dict[str, object]:
"""Extract workspace.dependencies from root Cargo.toml."""
cargo_toml_path = repo_root / "Cargo.toml"
data = read_toml(cargo_toml_path)
return data.get("workspace", {}).get("dependencies", {})
def get_workspace_version(repo_root: Path) -> str:
"""Extract version from workspace.package."""
cargo_toml_path = repo_root / "Cargo.toml"
data = read_toml(cargo_toml_path)
return data.get("workspace", {}).get("package", {}).get("version", "4.0.0")
def format_dependency(name: str, dep_spec: object) -> str:
"""Format a dependency spec for Cargo.toml."""
if isinstance(dep_spec, str):
return f'{name} = "{dep_spec}"'
elif isinstance(dep_spec, dict):
version: str = dep_spec.get("version", "")
package: str | None = dep_spec.get("package")
features: list[str] = dep_spec.get("features", [])
default_features: bool | None = dep_spec.get("default-features")
optional: bool | None = dep_spec.get("optional")
path: str | None = dep_spec.get("path")
git: str | None = dep_spec.get("git")
branch: str | None = dep_spec.get("branch")
tag: str | None = dep_spec.get("tag")
rev: str | None = dep_spec.get("rev")
parts: list[str] = []
if package:
parts.append(f'package = "{package}"')
if git:
parts.append(f'git = "{git}"')
if branch:
parts.append(f'branch = "{branch}"')
if tag:
parts.append(f'tag = "{tag}"')
if rev:
parts.append(f'rev = "{rev}"')
if path:
parts.append(f'path = "{path}"')
if version:
parts.append(f'version = "{version}"')
if features:
features_str = ', '.join(f'"{f}"' for f in features)
parts.append(f'features = [{features_str}]')
if default_features is False:
parts.append('default-features = false')
elif default_features is True:
parts.append('default-features = true')
if optional is True:
parts.append('optional = true')
elif optional is False:
parts.append('optional = false')
spec_str = ", ".join(parts)
return f"{name} = {{ {spec_str} }}"
return f'{name} = "{dep_spec}"'
def replace_workspace_deps_in_toml(toml_path: Path, workspace_deps: dict[str, object]) -> None:
"""Replace workspace = true with explicit versions in a Cargo.toml file."""
with open(toml_path, "r") as f:
content = f.read()
for name, dep_spec in workspace_deps.items():
pattern1 = rf'^{re.escape(name)} = \{{ workspace = true \}}$'
# Use a callable so backslashes in the dependency spec are not treated as regex escapes
content = re.sub(pattern1, lambda _m: format_dependency(name, dep_spec), content, flags=re.MULTILINE)
def replace_with_fields(match: re.Match[str]) -> str:
other_fields_str = match.group(1).strip()
base_spec = format_dependency(name, dep_spec)
if " = { " not in base_spec:
# Simple string dep like `ctor = "0.6"` - wrap it
version_val = base_spec.split(" = ", 1)[1].strip('"')
spec_part = f'version = "{version_val}"'
else:
spec_part = base_spec.split(" = { ", 1)[1].rstrip("} ").rstrip("}")
# Extract existing keys and values from workspace spec, handling nested brackets
workspace_fields: dict[str, str] = {}
bracket_depth = 0
current_field = ""
for char in spec_part:
if char == '[':
bracket_depth += 1
current_field += char
elif char == ']':
bracket_depth -= 1
current_field += char
elif char == ',' and bracket_depth == 0:
# End of field
field = current_field.strip()
if field and "=" in field:
key, val = field.split("=", 1)
workspace_fields[key.strip()] = val.strip()
current_field = ""
else:
current_field += char
# Don't forget the last field
if current_field.strip():
field = current_field.strip()
if field and "=" in field:
key, val = field.split("=", 1)
workspace_fields[key.strip()] = val.strip()
# Extract crate-specific keys using bracket-aware parsing
crate_fields: dict[str, str] = {}
bracket_depth = 0
current_field = ""
for char in other_fields_str:
if char == '[':
bracket_depth += 1
current_field += char
elif char == ']':
bracket_depth -= 1
current_field += char
elif char == ',' and bracket_depth == 0:
# End of field
field = current_field.strip()
if field and "=" in field:
key, val = field.split("=", 1)
crate_fields[key.strip()] = val.strip()
current_field = ""
else:
current_field += char
# Don't forget the last field
if current_field.strip():
field = current_field.strip()
if field and "=" in field:
key, val = field.split("=", 1)
crate_fields[key.strip()] = val.strip()
# Merge: crate-specific fields override workspace fields
merged_fields = {**workspace_fields, **crate_fields}
# Build result from merged fields
merged_parts = [f"{k} = {v}" for k, v in merged_fields.items()]
merged_spec = ", ".join(merged_parts)
return f"{name} = {{ {merged_spec} }}"
pattern2 = rf'^{re.escape(name)} = \{{ workspace = true, (.+?) \}}$'
content = re.sub(pattern2, replace_with_fields, content, flags=re.MULTILINE | re.DOTALL)
with open(toml_path, "w") as f:
f.write(content)
def generate_vendor_cargo_toml(repo_root: Path, workspace_deps: dict[str, object], core_version: str, copied_crates: list[str]) -> None:
"""Generate vendor/Cargo.toml with workspace setup.
Args:
repo_root: Repository root directory
workspace_deps: Workspace dependencies from Cargo.toml
core_version: Core version string
copied_crates: List of crates that were successfully copied
"""
deps_lines: list[str] = []
for name, dep_spec in sorted(workspace_deps.items()):
deps_lines.append(format_dependency(name, dep_spec))
deps_str = "\n".join(deps_lines)
# Build members list based on actually copied crates
members = [name for name in ["kreuzberg", "kreuzberg-ffi", "kreuzberg-tesseract", "kreuzberg-paddle-ocr", "rb-sys"]
if name in copied_crates]
members_str = ', '.join(f'"{m}"' for m in members)
vendor_toml = f'''[workspace]
members = [{members_str}]
[workspace.package]
version = "{core_version}"
edition = "2024"
rust-version = "1.91"
authors = ["Na'aman Hirschfeld <naaman@kreuzberg.dev>"]
license = "MIT"
repository = "https://github.com/kreuzberg-dev/kreuzberg"
homepage = "https://kreuzberg.dev"
[workspace.dependencies]
{deps_str}
'''
vendor_dir = repo_root / "packages" / "ruby" / "vendor"
vendor_dir.mkdir(parents=True, exist_ok=True)
toml_path = vendor_dir / "Cargo.toml"
with open(toml_path, "w") as f:
f.write(vendor_toml)
def main() -> None:
"""Main vendoring function."""
repo_root: Path = get_repo_root()
print("=== Vendoring kreuzberg core crate ===")
workspace_deps: dict[str, object] = get_workspace_deps(repo_root)
core_version: str = get_workspace_version(repo_root)
print(f"Core version: {core_version}")
print(f"Workspace dependencies: {len(workspace_deps)}")
vendor_base: Path = repo_root / "packages" / "ruby" / "vendor"
# Clean only crate directories, preserving vendor/bundle/ (Bundler gems)
crate_names = ["kreuzberg", "kreuzberg-ffi", "kreuzberg-tesseract",
"kreuzberg-paddle-ocr", "rb-sys"]
for name in crate_names:
crate_path = vendor_base / name
if crate_path.exists():
shutil.rmtree(crate_path)
# Also clean the vendor Cargo.toml (will be regenerated)
vendor_cargo = vendor_base / "Cargo.toml"
if vendor_cargo.exists():
vendor_cargo.unlink()
print("Cleaned vendor crate directories")
vendor_base.mkdir(parents=True, exist_ok=True)
crates_to_copy: list[tuple[str, str]] = [
("crates/kreuzberg", "kreuzberg"),
("crates/kreuzberg-ffi", "kreuzberg-ffi"),
("crates/kreuzberg-tesseract", "kreuzberg-tesseract"),
("crates/kreuzberg-paddle-ocr", "kreuzberg-paddle-ocr"),
("vendor/rb-sys", "rb-sys"),
]
copied_crates: list[str] = []
for src_rel, dest_name in crates_to_copy:
src: Path = repo_root / src_rel
dest: Path = vendor_base / dest_name
if src.exists():
try:
shutil.copytree(src, dest)
copied_crates.append(dest_name)
print(f"Copied {dest_name}")
except Exception as e:
print(f"Warning: Failed to copy {dest_name}: {e}", file=sys.stderr)
else:
print(f"Warning: Source directory not found: {src_rel}")
artifact_dirs: list[str] = [".fastembed_cache", "target"]
temp_patterns: list[str] = ["*.swp", "*.bak", "*.tmp", "*~"]
for crate_dir in copied_crates:
crate_path: Path = vendor_base / crate_dir
if crate_path.exists():
for artifact_dir in artifact_dirs:
artifact: Path = crate_path / artifact_dir
if artifact.exists():
shutil.rmtree(artifact)
for pattern in temp_patterns:
for f in crate_path.rglob(pattern):
f.unlink()
print("Cleaned build artifacts")
# Update workspace inheritance in Cargo.toml files
for crate_dir in copied_crates:
crate_toml = vendor_base / crate_dir / "Cargo.toml"
if crate_toml.exists():
with open(crate_toml, "r") as f:
content = f.read()
content = re.sub(r'^version\.workspace = true$', f'version = "{core_version}"', content, flags=re.MULTILINE)
content = re.sub(r'^edition\.workspace = true$', 'edition = "2024"', content, flags=re.MULTILINE)
content = re.sub(r'^rust-version\.workspace = true$', 'rust-version = "1.91"', content, flags=re.MULTILINE)
content = re.sub(r'^authors\.workspace = true$', 'authors = ["Na\'aman Hirschfeld <naaman@kreuzberg.dev>"]', content, flags=re.MULTILINE)
content = re.sub(r'^license\.workspace = true$', 'license = "MIT"', content, flags=re.MULTILINE)
with open(crate_toml, "w") as f:
f.write(content)
replace_workspace_deps_in_toml(crate_toml, workspace_deps)
print(f"Updated {crate_dir}/Cargo.toml")
# Update path dependencies in kreuzberg-ffi crate
if "kreuzberg-ffi" in copied_crates and "kreuzberg" in copied_crates:
ffi_toml = vendor_base / "kreuzberg-ffi" / "Cargo.toml"
if ffi_toml.exists():
with open(ffi_toml, "r") as f:
content = f.read()
# Replace kreuzberg workspace references with path dependency
# Handle cases with path, version, or neither
content = re.sub(
r'(kreuzberg = \{) (?:(?:path|version) = "[^"]*", )?',
r'\1 path = "../kreuzberg", ',
content
)
with open(ffi_toml, "w") as f:
f.write(content)
# Update path dependencies in kreuzberg crate if tesseract was copied
if "kreuzberg" in copied_crates:
kreuzberg_toml = vendor_base / "kreuzberg" / "Cargo.toml"
if kreuzberg_toml.exists():
with open(kreuzberg_toml, "r") as f:
content = f.read()
# Only update tesseract path if it was actually copied
if "kreuzberg-tesseract" in copied_crates:
content = re.sub(
r'kreuzberg-tesseract = \{ (?:path = "[^"]*", )?version = "[^"]*", optional = true \}',
'kreuzberg-tesseract = { path = "../kreuzberg-tesseract", optional = true }',
content
)
# Only update paddle-ocr path if it was actually copied
if "kreuzberg-paddle-ocr" in copied_crates:
content = re.sub(
r'kreuzberg-paddle-ocr = \{ (?:path = "[^"]*", )?version = "[^"]*", optional = true \}',
'kreuzberg-paddle-ocr = { path = "../kreuzberg-paddle-ocr", optional = true }',
content
)
with open(kreuzberg_toml, "w") as f:
f.write(content)
generate_vendor_cargo_toml(repo_root, workspace_deps, core_version, copied_crates)
print("Generated vendor/Cargo.toml")
# Update native extension Cargo.toml to use vendored crates
native_toml = repo_root / "packages" / "ruby" / "ext" / "kreuzberg_rb" / "native" / "Cargo.toml"
if native_toml.exists():
with open(native_toml, "r") as f:
content = f.read()
# Replace path dependencies to point to vendored crates
# From: path = "../../../../../crates/kreuzberg"
# To: path = "../../../vendor/kreuzberg"
content = re.sub(
r'path = "\.\./\.\./\.\./\.\./\.\./crates/kreuzberg"',
'path = "../../../vendor/kreuzberg"',
content
)
content = re.sub(
r'path = "\.\./\.\./\.\./\.\./\.\./crates/kreuzberg-ffi"',
'path = "../../../vendor/kreuzberg-ffi"',
content
)
with open(native_toml, "w") as f:
f.write(content)
print("Updated native extension Cargo.toml to use vendored crates")
print(f"\nVendoring complete (core version: {core_version})")
print(f"Copied crates: {', '.join(sorted(copied_crates))}")
if "kreuzberg" in copied_crates and "kreuzberg-ffi" in copied_crates:
print("Native extension Cargo.toml uses:")
print(" - path '../../../vendor/kreuzberg' for kreuzberg crate")
print(" - path '../../../vendor/kreuzberg-ffi' for kreuzberg-ffi crate")
if "rb-sys" in copied_crates:
print(" - path '../../../vendor/rb-sys' for rb-sys crate")
else:
print(" - rb-sys from crates.io")
else:
print("Warning: Some required crates were not copied. Check for missing source directories.")
if __name__ == "__main__":
try:
main()
except Exception as e:
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
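The script above hard-codes `edition`, `rust-version`, `authors`, and `license` when it expands `*.workspace = true` lines, so those values can drift from the root `Cargo.toml`. The following is a hedged sketch of deriving the replacements from the parsed `workspace.package` table instead. `render_toml_value` and `inherit_package_fields` are hypothetical names, and the sketch assumes that JSON string escaping is close enough to TOML basic strings for these simple fields.

```python
from __future__ import annotations

import json
import re


def render_toml_value(value: object) -> str:
    """Render a scalar or a list of scalars as a TOML literal.

    Assumption: JSON escaping is acceptable for the simple strings found in
    workspace.package (version, edition, authors, license).
    """
    if isinstance(value, list):
        return "[" + ", ".join(json.dumps(v, ensure_ascii=False) for v in value) + "]"
    return json.dumps(value, ensure_ascii=False)


def inherit_package_fields(content: str, package: dict[str, object]) -> str:
    """Replace `key.workspace = true` lines with explicit values from `package`."""
    for key, value in package.items():
        pattern = rf"^{re.escape(key)}\.workspace = true$"
        rendered = f"{key} = {render_toml_value(value)}"
        # A callable replacement avoids backslash-escape processing by re.sub.
        content = re.sub(pattern, lambda _m, r=rendered: r, content, flags=re.MULTILINE)
    return content
```

Here `package` would be the `workspace.package` table that `read_toml` returns for the root manifest. Keys missing from that table are left untouched in the crate manifest.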

@@ -0,0 +1,19 @@
#!/usr/bin/env pwsh
# Package CLI binary as zip archive (Windows)
# Used by: ci-rust.yaml - Package CLI (Windows) step
# Arguments: TARGET (e.g., x86_64-pc-windows-msvc)
param(
[Parameter(Mandatory=$true)]
[string]$Target
)
Set-StrictMode -Version Latest
$ErrorActionPreference = 'Stop'
Write-Host "=== Packaging CLI binary for $Target ==="
cd target/$Target/release
Compress-Archive -Path kreuzberg.exe -DestinationPath ../../../kreuzberg-cli-$Target.zip
Write-Host "Packaging complete: kreuzberg-cli-$Target.zip"

scripts/ci/rust/run-unit-tests.sh Executable file

@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="${REPO_ROOT:-$(cd "$SCRIPT_DIR/../../.." && pwd)}"
source "$REPO_ROOT/scripts/lib/common.sh"
source "$REPO_ROOT/scripts/lib/tessdata.sh"
validate_repo_root "$REPO_ROOT" || exit 1
cd "$REPO_ROOT"
echo "=== Running Rust unit tests ==="
setup_tessdata
echo "Test environment configuration:"
echo " TESSDATA_PREFIX: ${TESSDATA_PREFIX:-not set}"
echo " RUST_BACKTRACE: ${RUST_BACKTRACE:-not set}"
echo " CARGO_TERM_COLOR: ${CARGO_TERM_COLOR:-not set}"
echo "Workspace information:"
echo " Repository: $REPO_ROOT"
echo " Excluded packages: kreuzberg-e2e-generator, kreuzberg-py, kreuzberg-node (+ benchmark-harness on Windows)"
if [ ! -d "$TESSDATA_PREFIX" ]; then
echo "WARNING: TESSDATA_PREFIX directory not found: $TESSDATA_PREFIX"
echo "Attempting to create it..."
mkdir -p "$TESSDATA_PREFIX"
ensure_tessdata "$TESSDATA_PREFIX"
fi
echo "Verifying Tesseract data files..."
for lang in eng osd; do
langfile="$TESSDATA_PREFIX/${lang}.traineddata"
if [ -f "$langfile" ]; then
size=$(stat -f%z "$langfile" 2>/dev/null || stat -c%s "$langfile" 2>/dev/null || echo "unknown")
echo "${lang}.traineddata (${size} bytes)"
else
echo " WARNING: Missing ${lang}.traineddata"
fi
done
if [ -n "${KREUZBERG_PDFIUM_PREBUILT:-}" ]; then
export LD_LIBRARY_PATH="${KREUZBERG_PDFIUM_PREBUILT}/lib:${LD_LIBRARY_PATH:-}"
export DYLD_LIBRARY_PATH="${KREUZBERG_PDFIUM_PREBUILT}/lib:${DYLD_LIBRARY_PATH:-}"
export DYLD_FALLBACK_LIBRARY_PATH="${KREUZBERG_PDFIUM_PREBUILT}/lib:${DYLD_FALLBACK_LIBRARY_PATH:-}"
echo "Library path configuration:"
echo " LD_LIBRARY_PATH: $LD_LIBRARY_PATH"
echo " DYLD_LIBRARY_PATH: $DYLD_LIBRARY_PATH"
echo " DYLD_FALLBACK_LIBRARY_PATH: $DYLD_FALLBACK_LIBRARY_PATH"
fi
echo "=== Starting cargo test ==="
# NOTE: The `kreuzberg` crate is tested with `--features full` rather than `--all-features`.
TEST_LOG="/tmp/cargo-test-$$.log"
if ! {
# `--all-targets` runs --lib --bins --tests --examples --benches but excludes
# `--doc`. 22 rustdoc examples in the kreuzberg crate currently reference
# private items (extraction::capacity::estimate_content_capacity et al.) and
# fail to compile. Tracking the cleanup separately; doc-test coverage is not
# on the v5.0.0 publish path. TODO: re-enable doc tests once the failing
# examples are rewritten against the public API.
echo "=== cargo test -p kreuzberg --features full ==="
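# errexit (set -e) is suppressed inside an `if` condition, so fail this group explicitly.
RUST_BACKTRACE=full cargo test -p kreuzberg --features full --all-targets --verbose || exit 1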
echo "=== cargo test --workspace (all features, excluding kreuzberg) ==="
extra_excludes=()
if [[ "$OSTYPE" == "msys" || "$OSTYPE" == "cygwin" || "$OSTYPE" == "win32" ]]; then
extra_excludes+=(--exclude benchmark-harness)
fi
RUST_BACKTRACE=full cargo test \
--workspace \
--exclude kreuzberg \
--exclude kreuzberg-e2e-generator \
--exclude kreuzberg-py \
--exclude kreuzberg-node \
${extra_excludes[@]+"${extra_excludes[@]}"} \
--all-features \
--all-targets \
--verbose
} 2>&1 | tee "$TEST_LOG"; then
echo "=== Test execution failed ==="
echo "Last 50 lines of test output:"
tail -n 50 "$TEST_LOG"
echo ""
echo "Collecting diagnostic information..."
echo "Disk space:"
df -h . || du -h . 2>/dev/null | head -1
echo "Cargo environment:"
cargo --version
rustc --version
rm -f "$TEST_LOG"
exit 1
fi
rm -f "$TEST_LOG"
echo "=== Tests complete ==="
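The test script above runs two `cargo test` invocations as one unit and reports failure for the group. A minimal Python sketch of the intended fail-fast behaviour follows: run the steps in order and stop at the first non-zero exit code. `run_steps` is a hypothetical helper for illustration only and is not part of the repository.

```python
from __future__ import annotations

import subprocess
import sys


def run_steps(steps: list[list[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for argv in steps:
        result = subprocess.run(argv)
        if result.returncode != 0:
            # Stop here so a later passing step cannot mask this failure.
            return result.returncode
    return 0


if __name__ == "__main__":
    # Placeholder commands; the real script would invoke cargo here.
    sys.exit(run_steps([[sys.executable, "-c", "print('step 1')"],
                        [sys.executable, "-c", "print('step 2')"]]))
```

The key property is that the exit status of the first failing step decides the result, regardless of how later steps behave.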

@@ -0,0 +1,9 @@
#!/usr/bin/env bash
set -euo pipefail
label="${1:-Disk space}"
echo "=== ${label} ===" >&2
df -h / >&2
echo "Disk info:" >&2
df -B1 / | tail -1 >&2 || true