Files
fil/.github/actions/setup-paddle-ocr-models/README.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

203 lines
6.5 KiB
Markdown

# Setup PaddleOCR Models Cache
GitHub Action to download and cache PaddleOCR ONNX models for CI testing and development.
## Overview
This action manages the setup of PaddleOCR PP-OCRv5 ONNX models used by the `kreuzberg-paddle-ocr` crate for optical character recognition testing. It:
- Downloads three model types (detection, classification, recognition) from Hugging Face
- Caches models per OS and CPU architecture (Linux x86_64, Linux ARM64, macOS, Windows)
- Provides environment variables for downstream use
- Outputs cache hit status and available model information
- Gracefully handles download failures (continues with available models)
## Models
The action downloads pre-converted ONNX format models from the `Kreuzberg/paddleocr-onnx-models` Hugging Face repository:
| Model Type | File | Size | Purpose |
| -------------------- | ------------------------------------- | ------- | ----------------------------------------- |
| Detection (det) | `PP-OCRv5_server_det_infer.onnx` | ~84 MB | Text location detection (PP-OCRv5 server) |
| Classification (cls) | `ch_ppocr_mobile_v2.0_cls_infer.onnx` | ~0.6 MB | Text orientation classification |
| Recognition (rec) | `rec/english/model.onnx` | ~8 MB | Text character recognition (PP-OCRv5) |
**Total cache size: ~93 MB per OS/architecture combination**
## Usage
### Basic Usage
```yaml
- uses: ./.github/actions/setup-paddle-ocr-models
```
### With Custom Cache Suffix
```yaml
- uses: ./.github/actions/setup-paddle-ocr-models
with:
cache-key-suffix: my-paddle-ocr-v5
```
### Disable Caching
For cross-architecture builds where caching doesn't help:
```yaml
- uses: ./.github/actions/setup-paddle-ocr-models
with:
cache-enabled: false
```
### Download Specific Models Only
```yaml
- uses: ./.github/actions/setup-paddle-ocr-models
with:
models: "det,rec" # Skip classification model
```
## Inputs
| Name | Description | Required | Default |
| ------------------ | --------------------------------------------------------------- | -------- | -------------------- |
| `cache-enabled` | Enable model caching (set false for cross-arch builds) | No | `true` |
| `models` | Comma-separated list of models to setup (det,cls,rec or subset) | No | `det,cls,rec` |
| `cache-key-suffix` | Suffix for cache key to differentiate model sets | No | `paddle-ocr-v5-onnx` |
## Outputs
| Name | Description |
| ------------------ | ---------------------------------------------------- |
| `cache-hit` | Whether models were restored from cache (true/false) |
| `cache-dir` | Path to the PaddleOCR model cache directory |
| `models-available` | Comma-separated list of available models after setup |
## Outputs as Environment Variables
The action automatically exports:
- `PADDLE_OCR_MODEL_CACHE`: Absolute path to model cache directory
## Cache Strategy
Models are cached using GitHub Actions cache with the following key structure:
```text
paddle-ocr-v5-onnx-{OS}-{ARCHITECTURE}-v4
```
Cache restoration order (restore-keys):
1. Exact match: `paddle-ocr-v5-onnx-{OS}-{ARCHITECTURE}-v4`
2. OS-Architecture: `paddle-ocr-v5-onnx-{OS}-{ARCHITECTURE}-`
3. OS only: `paddle-ocr-v5-onnx-{OS}-`
4. Any: `paddle-ocr-v5-onnx-`
## Example: CI Rust Workflow Integration
```yaml
jobs:
paddle-ocr-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: ./.github/actions/setup-paddle-ocr-models
id: paddle-models
- name: Run PaddleOCR tests
run: cargo test --package kreuzberg-paddle-ocr
env:
PADDLE_OCR_MODEL_CACHE: ${{ steps.paddle-models.outputs.cache-dir }}
- name: Report cache status
if: always()
run: |
echo "Cache hit: ${{ steps.paddle-models.outputs.cache-hit }}"
echo "Available models: ${{ steps.paddle-models.outputs.models-available }}"
```
## Error Handling
The action downloads models sequentially and will fail if a required model download fails. After downloading:
- The verify step reports which models are actually available in the output
- Downstream tests can check `models-available` to know what's available
- If all models fail, tests can fall back to alternative behavior
## Download Sources
Models are downloaded from:
```text
https://huggingface.co/Kreuzberg/paddleocr-onnx-models/resolve/main/
```
If this repository becomes unavailable, the action will fail gracefully. Alternative sources can be configured by modifying the `MODEL_URL` environment variables in the action.
## Troubleshooting
### Models not being cached
1. Check that `cache-enabled` is not set to `false`
2. Verify GitHub Actions cache is not full (max 10 GB per repository)
3. Check runner OS and architecture match cache keys
4. View cache in repository settings (Settings → Actions → Caches)
### Download timeouts
If downloads timeout:
- Increase the 300-second timeout in the action steps
- Check Hugging Face API availability
- Try reducing the number of models (`models: "det,rec"`)
### Verifying models are present
Check that all expected models exist in the correct directory structure:
```bash
ls -lh ~/.cache/kreuzberg/paddle-ocr/
```
Expected output:
```text
drwxr-xr-x det/
drwxr-xr-x cls/
drwxr-xr-x rec/
ls -lh ~/.cache/kreuzberg/paddle-ocr/det/
-rw-r--r-- model.onnx (~84 MB)
ls -lh ~/.cache/kreuzberg/paddle-ocr/cls/
-rw-r--r-- model.onnx (~0.6 MB)
ls -lh ~/.cache/kreuzberg/paddle-ocr/rec/english/
-rw-r--r-- model.onnx (~8 MB)
-rw-r--r-- dict.txt
```
The directory structure must match what `ModelManager` expects in `model_manager.rs`.
## Performance Impact
- **First run (no cache)**: ~30-60 seconds (download time depends on network)
- **Cached run**: <1 second (cache restore)
- **Cache size**: ~93 MB per OS/architecture
- **Network bandwidth**: ~93 MB download on cache miss
## Related Actions
- `.github/actions/setup-tesseract-cache` - Similar caching for Tesseract models
- `.github/actions/cache-hf-fastembed` - Hugging Face model caching for fastembed
- `.github/actions/setup-onnx-runtime` - ONNX Runtime setup for inference
## See Also
- [PaddleOCR Documentation](https://github.com/PaddlePaddle/PaddleOCR)
- [kreuzberg-paddle-ocr crate](../../../crates/kreuzberg-paddle-ocr)
- [ModelManager source](../../../crates/kreuzberg/src/paddle_ocr/model_manager.rs)