.github/actions/setup-paddle-ocr-models/README.md
@@ -0,0 +1,202 @@
# Setup PaddleOCR Models Cache

GitHub Action to download and cache PaddleOCR ONNX models for CI testing and development.

## Overview

This action sets up the PaddleOCR PP-OCRv5 ONNX models used by the `kreuzberg-paddle-ocr` crate for optical character recognition testing. It:

- Downloads three model types (detection, classification, recognition) from Hugging Face
- Caches models per OS and CPU architecture (Linux x86_64, Linux ARM64, macOS, Windows)
- Exports environment variables for downstream steps
- Outputs the cache hit status and the list of available models
- Retries each download up to three times with backoff and fails the step if a download still fails

## Models

The action downloads pre-converted ONNX format models from the `Kreuzberg/paddleocr-onnx-models` Hugging Face repository:

| Model Type           | File                                  | Size    | Purpose                                   |
| -------------------- | ------------------------------------- | ------- | ----------------------------------------- |
| Detection (det)      | `PP-OCRv5_server_det_infer.onnx`      | ~84 MB  | Text location detection (PP-OCRv5 server) |
| Classification (cls) | `ch_ppocr_mobile_v2.0_cls_infer.onnx` | ~0.6 MB | Text orientation classification           |
| Recognition (rec)    | `rec/english/model.onnx`              | ~8 MB   | Text character recognition (PP-OCRv5)     |

**Total cache size: ~93 MB per OS/architecture combination**

## Usage

### Basic Usage

```yaml
- uses: ./.github/actions/setup-paddle-ocr-models
```

### With Custom Cache Suffix

```yaml
- uses: ./.github/actions/setup-paddle-ocr-models
  with:
    cache-key-suffix: my-paddle-ocr-v5
```

### Disable Caching

For cross-architecture builds where caching doesn't help:

```yaml
- uses: ./.github/actions/setup-paddle-ocr-models
  with:
    cache-enabled: false
```

### Download Specific Models Only

```yaml
- uses: ./.github/actions/setup-paddle-ocr-models
  with:
    models: "det,rec" # Skip classification model
```

## Inputs

| Name               | Description                                                     | Required | Default              |
| ------------------ | --------------------------------------------------------------- | -------- | -------------------- |
| `cache-enabled`    | Enable model caching (set false for cross-arch builds)          | No       | `true`               |
| `models`           | Comma-separated list of models to setup (det,cls,rec or subset) | No       | `det,cls,rec`        |
| `cache-key-suffix` | Suffix for cache key to differentiate model sets                | No       | `paddle-ocr-v5-onnx` |
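
The step conditions in `action.yml` test the `models` input with expressions such as `contains(inputs.models, 'det')`, which is a substring test rather than a parse of the comma-separated list. The sketch below approximates that behaviour in Bash. `wants_model` is a hypothetical helper name, and the lowercasing reflects the assumption that `contains` ignores case.

```bash
#!/usr/bin/env bash
# Approximation of the substring test used by the action's step conditions.
# wants_model is a hypothetical helper, not part of the action itself.
wants_model() {
  local requested model
  # Lowercase both sides (assumption: the workflow expression is case-insensitive).
  requested=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  model=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$requested" in
    *"$model"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Show which download steps would run for models: "det,rec".
for model in det cls rec; do
  if wants_model "det,rec" "$model"; then
    echo "$model: download"
  else
    echo "$model: skip"
  fi
done
```

Because the test is a substring match, a value such as `detection` would also select `det`, so using the exact tokens `det`, `cls`, and `rec` avoids surprises.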

## Outputs

| Name               | Description                                          |
| ------------------ | ---------------------------------------------------- |
| `cache-hit`        | Whether models were restored from cache (true/false) |
| `cache-dir`        | Path to the PaddleOCR model cache directory          |
| `models-available` | Comma-separated list of available models after setup |
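
A downstream step can use `models-available` to decide which tests to run. The following is a minimal sketch, assuming the output has been copied into a shell variable. `has_model` and `MODELS_AVAILABLE` are hypothetical names chosen for illustration, not something the action defines.

```bash
#!/usr/bin/env bash
# has_model is a hypothetical helper: it succeeds when $2 is one of the
# comma-separated tokens in $1 (exact token match, not a substring test).
has_model() {
  local available="$1" wanted="$2" token
  local IFS=','
  for token in $available; do
    if [ "$token" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

# MODELS_AVAILABLE is an assumed variable name; in a workflow it would be
# filled from steps.<id>.outputs.models-available.
MODELS_AVAILABLE="${MODELS_AVAILABLE:-det,rec}"

if has_model "$MODELS_AVAILABLE" cls; then
  echo "cls available: run orientation tests"
else
  echo "cls missing: skip orientation tests"
fi
```

Splitting on commas keeps the check exact, so a token such as `rec` is never confused with a longer name that merely contains it.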
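
## Outputs as Environment Variables

The action automatically exports:

- `PADDLE_OCR_MODEL_CACHE`: Absolute path to the model cache directory (`$HOME/.cache/kreuzberg/paddle-ocr`)
- `KREUZBERG_CACHE_DIR`: Absolute path to the parent Kreuzberg cache directory (`$HOME/.cache/kreuzberg`)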

## Cache Strategy

Models are cached using the GitHub Actions cache. The key is built from the `cache-key-suffix` input, the runner OS, the runner architecture, and a fixed version tag. With the default suffix it has the following structure:

```text
paddle-ocr-v5-onnx-{OS}-{ARCHITECTURE}-v4
```

Cache restoration order (the exact key first, then the `restore-keys` prefixes):

1. Exact match: `paddle-ocr-v5-onnx-{OS}-{ARCHITECTURE}-v4`
2. OS-Architecture: `paddle-ocr-v5-onnx-{OS}-{ARCHITECTURE}-`
3. OS only: `paddle-ocr-v5-onnx-{OS}-`
4. Any: `paddle-ocr-v5-onnx-`
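
The key and its fallback prefixes can be sketched in Bash as follows. `build_cache_keys` is a hypothetical helper written for illustration, and the example values `Linux` and `X64` are assumed runner values. The action itself builds these strings with workflow expressions.

```bash
#!/usr/bin/env bash
# Print the primary cache key followed by the restore-key prefixes,
# most specific first. build_cache_keys is a hypothetical helper.
build_cache_keys() {
  local suffix="$1" os="$2" arch="$3" version="${4:-v4}"
  printf '%s\n' \
    "${suffix}-${os}-${arch}-${version}" \
    "${suffix}-${os}-${arch}-" \
    "${suffix}-${os}-" \
    "${suffix}-"
}

# Example with the default suffix on an assumed Linux x86_64 runner.
build_cache_keys paddle-ocr-v5-onnx Linux X64
```

Each later line is a prefix of the one before it, which is what lets a broader entry be restored when no exact key exists.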

## Example: CI Rust Workflow Integration

```yaml
jobs:
  paddle-ocr-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: ./.github/actions/setup-paddle-ocr-models
        id: paddle-models

      - name: Run PaddleOCR tests
        run: cargo test --package kreuzberg-paddle-ocr
        env:
          PADDLE_OCR_MODEL_CACHE: ${{ steps.paddle-models.outputs.cache-dir }}

      - name: Report cache status
        if: always()
        run: |
          echo "Cache hit: ${{ steps.paddle-models.outputs.cache-hit }}"
          echo "Available models: ${{ steps.paddle-models.outputs.models-available }}"
```

## Error Handling

The action downloads the requested models sequentially. Each download is attempted up to three times, with a 5-second and then a 15-second backoff, and the step fails if all attempts fail. After downloading:

- The verify step reports which models are actually present through the `models-available` output
- Downstream steps can check `models-available` to decide which tests to run
- If no model file is found at all, the verify step fails the action
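
The backoff schedule used by the download steps follows the arithmetic `5 * 3 ** (attempt - 2)`. A minimal sketch of that calculation is shown below. `backoff_seconds` is a hypothetical helper name, and returning 0 for the first attempt is an illustration of "no wait", since the action simply skips the sleep there.

```bash
#!/usr/bin/env bash
# Seconds to wait before a given attempt: none before the first attempt,
# then 5 * 3^(attempt - 2) as in the action's retry loop.
# backoff_seconds is a hypothetical helper, not part of the action.
backoff_seconds() {
  local attempt="$1"
  if [ "$attempt" -le 1 ]; then
    echo 0
  else
    echo $((5 * 3 ** (attempt - 2)))
  fi
}

# Print the schedule for the three attempts the action makes.
for attempt in 1 2 3; do
  echo "attempt $attempt: wait $(backoff_seconds "$attempt")s"
done
```

With three attempts the total added delay is 20 seconds, on top of whatever time each `curl` call spends before failing.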

## Download Sources

Models are downloaded from:

```text
https://huggingface.co/Kreuzberg/paddleocr-onnx-models/resolve/main/
```

If this repository becomes unavailable, the download steps fail after three attempts. Alternative sources can be configured by changing the `MODEL_URL` and `DICT_URL` shell variables in the action's download steps.
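
The remote file names differ from the names the files are stored under locally. The sketch below lists that mapping as it appears in the download steps of `action.yml`. `remote_path` and `local_path` are hypothetical helper names, and `dict` is an illustrative label for the recognition dictionary rather than a value accepted by the `models` input.

```bash
#!/usr/bin/env bash
BASE_URL="https://huggingface.co/Kreuzberg/paddleocr-onnx-models/resolve/main"

# Path of each file relative to BASE_URL (hypothetical helper).
remote_path() {
  case "$1" in
    det)  echo "PP-OCRv5_server_det_infer.onnx" ;;
    cls)  echo "ch_ppocr_mobile_v2.0_cls_infer.onnx" ;;
    rec)  echo "rec/english/model.onnx" ;;
    dict) echo "rec/english/dict.txt" ;;
    *)    return 1 ;;
  esac
}

# Path of each file relative to the cache directory (hypothetical helper).
local_path() {
  case "$1" in
    det)  echo "det/model.onnx" ;;
    cls)  echo "cls/model.onnx" ;;
    rec)  echo "rec/english/model.onnx" ;;
    dict) echo "rec/english/dict.txt" ;;
    *)    return 1 ;;
  esac
}

# Print source URL and destination for every file the action fetches.
for name in det cls rec dict; do
  echo "$BASE_URL/$(remote_path "$name") -> $(local_path "$name")"
done
```

Pointing the action at a mirror would mean changing the URL side of this mapping while keeping the local paths unchanged.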

## Troubleshooting

### Models not being cached

1. Check that `cache-enabled` is not set to `false`
2. Verify GitHub Actions cache is not full (max 10 GB per repository)
3. Check that the runner OS and architecture match the cache key
4. View the stored caches in the repository's Actions tab (Actions → Caches)

### Download timeouts

If downloads time out:

- Increase the 600-second `--max-time` limit passed to `curl` in the action steps
- Check Hugging Face availability
- Try reducing the number of models (`models: "det,rec"`)

### Verifying models are present

Check that all expected models exist in the correct directory structure:

```bash
ls -lh ~/.cache/kreuzberg/paddle-ocr/
```

Expected output:

```text
drwxr-xr-x  det/
drwxr-xr-x  cls/
drwxr-xr-x  rec/

ls -lh ~/.cache/kreuzberg/paddle-ocr/det/
-rw-r--r--  model.onnx (~84 MB)

ls -lh ~/.cache/kreuzberg/paddle-ocr/cls/
-rw-r--r--  model.onnx (~0.6 MB)

ls -lh ~/.cache/kreuzberg/paddle-ocr/rec/english/
-rw-r--r--  model.onnx (~8 MB)
-rw-r--r--  dict.txt
```

The directory structure must match what `ModelManager` expects in `model_manager.rs`.
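
The same layout check can be scripted. The following is a minimal sketch that mirrors the file tests in the action's verify step. `list_available_models` is a hypothetical helper name, and the fallback path is the cache directory used throughout this README.

```bash
#!/usr/bin/env bash
# Print a comma-separated list of the models found under the given cache
# directory, based on the expected file layout.
# list_available_models is a hypothetical helper, not part of the action.
list_available_models() {
  local cache_dir="$1"
  local available=()
  if [ -f "$cache_dir/det/model.onnx" ]; then
    available+=("det")
  fi
  if [ -f "$cache_dir/cls/model.onnx" ]; then
    available+=("cls")
  fi
  if [ -f "$cache_dir/rec/english/model.onnx" ]; then
    available+=("rec")
  fi
  # Join the array with commas in a subshell so IFS is not changed globally.
  (IFS=,; echo "${available[*]}")
}

# Check the directory exported by the action, or the default location.
list_available_models "${PADDLE_OCR_MODEL_CACHE:-$HOME/.cache/kreuzberg/paddle-ocr}"
```

An empty line in the output means none of the three model files was found at the expected path.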

## Performance Impact

- **First run (no cache)**: ~30-60 seconds (download time depends on network)
- **Cached run**: <1 second (cache restore)
- **Cache size**: ~93 MB per OS/architecture
- **Network bandwidth**: ~93 MB download on cache miss

## Related Actions

- `.github/actions/setup-tesseract-cache` - Similar caching for Tesseract models
- `.github/actions/cache-hf-fastembed` - Hugging Face model caching for fastembed
- `.github/actions/setup-onnx-runtime` - ONNX Runtime setup for inference

## See Also

- [PaddleOCR Documentation](https://github.com/PaddlePaddle/PaddleOCR)
- [kreuzberg-paddle-ocr crate](../../../crates/kreuzberg-paddle-ocr)
- [ModelManager source](../../../crates/kreuzberg/src/paddle_ocr/model_manager.rs)

.github/actions/setup-paddle-ocr-models/action.yml
@@ -0,0 +1,231 @@
name: Setup PaddleOCR Models Cache
description: Download and cache PaddleOCR ONNX models for CI testing

inputs:
  cache-enabled:
    description: Enable model caching (set to false for cross-arch builds)
    required: false
    default: "true"
  models:
    description: Comma-separated list of models to setup (det,cls,rec or specific subset)
    required: false
    default: "det,cls,rec"
  cache-key-suffix:
    description: Suffix for cache key to differentiate model sets
    required: false
    default: "paddle-ocr-v5-onnx"

outputs:
  cache-hit:
    description: Whether models were restored from cache (true/false)
    value: ${{ steps.cache-models.outputs.cache-hit }}
  cache-dir:
    description: Path to the PaddleOCR model cache directory
    value: ${{ steps.set-outputs.outputs.cache-dir }}
  models-available:
    description: Comma-separated list of available models
    value: ${{ steps.verify-models.outputs.available-models }}

runs:
  using: composite
  steps:
    - name: Setup cache directory
      shell: bash
      run: |
        mkdir -p ~/.cache/kreuzberg/paddle-ocr
        echo "Cache directory: $HOME/.cache/kreuzberg/paddle-ocr"

    - name: Restore PaddleOCR models from cache
      if: inputs.cache-enabled == 'true'
      uses: actions/cache@v5
      id: cache-models
      with:
        path: ~/.cache/kreuzberg/paddle-ocr
        key: ${{ inputs.cache-key-suffix }}-${{ runner.os }}-${{ runner.arch }}-v4
        restore-keys: |
          ${{ inputs.cache-key-suffix }}-${{ runner.os }}-${{ runner.arch }}-
          ${{ inputs.cache-key-suffix }}-${{ runner.os }}-
          ${{ inputs.cache-key-suffix }}-

    - name: Download detection model (det)
      if: contains(inputs.models, 'det') && steps.cache-models.outputs.cache-hit != 'true'
      shell: bash
      run: |
        MODEL_URL="https://huggingface.co/Kreuzberg/paddleocr-onnx-models/resolve/main/PP-OCRv5_server_det_infer.onnx"
        CACHE_DIR="$HOME/.cache/kreuzberg/paddle-ocr"
        MODEL_DIR="$CACHE_DIR/det"
        MODEL_FILE="$MODEL_DIR/model.onnx"

        echo "Downloading detection model from $MODEL_URL"
        mkdir -p "$MODEL_DIR"

        for attempt in 1 2 3; do
          if [ $attempt -gt 1 ]; then
            backoff=$((5 * 3 ** (attempt - 2)))
            echo "Retry attempt $attempt/3 after ${backoff}s backoff..."
            sleep $backoff
          fi

          if curl -f -L --progress-bar --connect-timeout 30 --max-time 600 \
            -o "$MODEL_FILE" "$MODEL_URL"; then
            echo "Detection model downloaded successfully ($(du -h "$MODEL_FILE" | cut -f1))"
            exit 0
          fi
        done

        echo "ERROR: Failed to download detection model after 3 attempts"
        rm -f "$MODEL_FILE"
        exit 1

    - name: Download classification model (cls)
      if: contains(inputs.models, 'cls') && steps.cache-models.outputs.cache-hit != 'true'
      shell: bash
      run: |
        MODEL_URL="https://huggingface.co/Kreuzberg/paddleocr-onnx-models/resolve/main/ch_ppocr_mobile_v2.0_cls_infer.onnx"
        CACHE_DIR="$HOME/.cache/kreuzberg/paddle-ocr"
        MODEL_DIR="$CACHE_DIR/cls"
        MODEL_FILE="$MODEL_DIR/model.onnx"

        echo "Downloading classification model from $MODEL_URL"
        mkdir -p "$MODEL_DIR"

        for attempt in 1 2 3; do
          if [ $attempt -gt 1 ]; then
            backoff=$((5 * 3 ** (attempt - 2)))
            echo "Retry attempt $attempt/3 after ${backoff}s backoff..."
            sleep $backoff
          fi

          if curl -f -L --progress-bar --connect-timeout 30 --max-time 600 \
            -o "$MODEL_FILE" "$MODEL_URL"; then
            echo "Classification model downloaded successfully ($(du -h "$MODEL_FILE" | cut -f1))"
            exit 0
          fi
        done

        echo "ERROR: Failed to download classification model after 3 attempts"
        rm -f "$MODEL_FILE"
        exit 1

    - name: Download recognition model (rec/english)
      if: contains(inputs.models, 'rec') && steps.cache-models.outputs.cache-hit != 'true'
      shell: bash
      run: |
        MODEL_URL="https://huggingface.co/Kreuzberg/paddleocr-onnx-models/resolve/main/rec/english/model.onnx"
        CACHE_DIR="$HOME/.cache/kreuzberg/paddle-ocr"
        MODEL_DIR="$CACHE_DIR/rec/english"
        MODEL_FILE="$MODEL_DIR/model.onnx"

        echo "Downloading English recognition model from $MODEL_URL"
        mkdir -p "$MODEL_DIR"

        for attempt in 1 2 3; do
          if [ $attempt -gt 1 ]; then
            backoff=$((5 * 3 ** (attempt - 2)))
            echo "Retry attempt $attempt/3 after ${backoff}s backoff..."
            sleep $backoff
          fi

          if curl -f -L --progress-bar --connect-timeout 30 --max-time 600 \
            -o "$MODEL_FILE" "$MODEL_URL"; then
            echo "Recognition model downloaded successfully ($(du -h "$MODEL_FILE" | cut -f1))"
            exit 0
          fi
        done

        echo "ERROR: Failed to download recognition model after 3 attempts"
        rm -f "$MODEL_FILE"
        exit 1

    - name: Download recognition dictionary (rec/english/dict.txt)
      if: contains(inputs.models, 'rec') && steps.cache-models.outputs.cache-hit != 'true'
      shell: bash
      run: |
        DICT_URL="https://huggingface.co/Kreuzberg/paddleocr-onnx-models/resolve/main/rec/english/dict.txt"
        CACHE_DIR="$HOME/.cache/kreuzberg/paddle-ocr"
        MODEL_DIR="$CACHE_DIR/rec/english"
        DICT_FILE="$MODEL_DIR/dict.txt"

        echo "Downloading English recognition dictionary from $DICT_URL"
        mkdir -p "$MODEL_DIR"

        for attempt in 1 2 3; do
          if [ $attempt -gt 1 ]; then
            backoff=$((5 * 3 ** (attempt - 2)))
            echo "Retry attempt $attempt/3 after ${backoff}s backoff..."
            sleep $backoff
          fi

          if curl -f -L --progress-bar --connect-timeout 30 --max-time 600 \
            -o "$DICT_FILE" "$DICT_URL"; then
            echo "Dictionary downloaded successfully ($(du -h "$DICT_FILE" | cut -f1))"
            exit 0
          fi
        done

        echo "ERROR: Failed to download dictionary after 3 attempts"
        rm -f "$DICT_FILE"
        exit 1

    - name: Verify downloaded models
      id: verify-models
      shell: bash
      run: |
        CACHE_DIR="$HOME/.cache/kreuzberg/paddle-ocr"
        AVAILABLE_MODELS=()
        TOTAL_SIZE=0

        echo "Checking for PaddleOCR models in $CACHE_DIR"

        if [ -f "$CACHE_DIR/det/model.onnx" ]; then
          SIZE=$(wc -c < "$CACHE_DIR/det/model.onnx" | tr -d ' ')
          AVAILABLE_MODELS+=("det")
          TOTAL_SIZE=$((TOTAL_SIZE + SIZE))
          echo " ✓ Detection model: $(numfmt --to=iec-i --suffix=B $SIZE 2>/dev/null || echo $SIZE bytes)"
        fi

        if [ -f "$CACHE_DIR/cls/model.onnx" ]; then
          SIZE=$(wc -c < "$CACHE_DIR/cls/model.onnx" | tr -d ' ')
          AVAILABLE_MODELS+=("cls")
          TOTAL_SIZE=$((TOTAL_SIZE + SIZE))
          echo " ✓ Classification model: $(numfmt --to=iec-i --suffix=B $SIZE 2>/dev/null || echo $SIZE bytes)"
        fi

        if [ -f "$CACHE_DIR/rec/english/model.onnx" ]; then
          SIZE=$(wc -c < "$CACHE_DIR/rec/english/model.onnx" | tr -d ' ')
          AVAILABLE_MODELS+=("rec")
          TOTAL_SIZE=$((TOTAL_SIZE + SIZE))
          echo " ✓ Recognition model (English): $(numfmt --to=iec-i --suffix=B $SIZE 2>/dev/null || echo $SIZE bytes)"
        fi

        if [ -f "$CACHE_DIR/rec/english/dict.txt" ]; then
          SIZE=$(wc -c < "$CACHE_DIR/rec/english/dict.txt" | tr -d ' ')
          TOTAL_SIZE=$((TOTAL_SIZE + SIZE))
          echo " ✓ Recognition dictionary (English): $(numfmt --to=iec-i --suffix=B $SIZE 2>/dev/null || echo $SIZE bytes)"
        fi

        if [ ${#AVAILABLE_MODELS[@]} -eq 0 ]; then
          echo "ERROR: No models found in cache directory after download"
          echo "available-models=" >> "$GITHUB_OUTPUT"
          exit 1
        fi

        AVAILABLE_MODELS_STR=$(IFS=, ; echo "${AVAILABLE_MODELS[*]}")
        echo "✓ Total cached models: ${#AVAILABLE_MODELS[@]} ($(numfmt --to=iec-i --suffix=B $TOTAL_SIZE 2>/dev/null || echo $TOTAL_SIZE bytes))"
        echo "available-models=$AVAILABLE_MODELS_STR" >> "$GITHUB_OUTPUT"

    - name: Set cache directory output
      id: set-outputs
      shell: bash
      run: |
        CACHE_DIR="$HOME/.cache/kreuzberg/paddle-ocr"
        echo "cache-dir=$CACHE_DIR" >> "$GITHUB_OUTPUT"
        echo "PADDLE_OCR_MODEL_CACHE=$CACHE_DIR" >> "$GITHUB_ENV"
        echo "KREUZBERG_CACHE_DIR=$HOME/.cache/kreuzberg" >> "$GITHUB_ENV"

    - name: Export cache environment
      shell: bash
      run: |
        echo "PADDLE_OCR_MODEL_CACHE=$HOME/.cache/kreuzberg/paddle-ocr" >> "$GITHUB_ENV"
        echo "KREUZBERG_CACHE_DIR=$HOME/.cache/kreuzberg" >> "$GITHUB_ENV"
        echo "PaddleOCR model cache configured at: $HOME/.cache/kreuzberg/paddle-ocr"