Henrik Jess Nielsen b4c07d3693

2026-06-01 23:40:55 +02:00

Setup PaddleOCR Models Cache

GitHub Action to download and cache PaddleOCR ONNX models for CI testing and development.

Overview

This action manages the setup of PaddleOCR PP-OCRv5 ONNX models used by the kreuzberg-paddle-ocr crate for optical character recognition testing. It:

Downloads three model types (detection, classification, recognition) from Hugging Face
Caches models per OS and CPU architecture (Linux x86_64, Linux ARM64, macOS, Windows)
Provides environment variables for downstream use
Outputs cache hit status and available model information
Gracefully handles download failures (continues with available models)

Models

The action downloads pre-converted ONNX format models from the Kreuzberg/paddleocr-onnx-models Hugging Face repository:

Model Type	File	Size	Purpose
Detection (det)	`PP-OCRv5_server_det_infer.onnx`	~84 MB	Text location detection (PP-OCRv5 server)
Classification (cls)	`ch_ppocr_mobile_v2.0_cls_infer.onnx`	~0.6 MB	Text orientation classification
Recognition (rec)	`rec/english/model.onnx`	~8 MB	Text character recognition (PP-OCRv5)

Total cache size: ~93 MB per OS/architecture combination

Usage

Basic Usage

- uses: ./.github/actions/setup-paddle-ocr-models

With Custom Cache Suffix

- uses: ./.github/actions/setup-paddle-ocr-models
  with:
    cache-key-suffix: my-paddle-ocr-v5

Disable Caching

For cross-architecture builds where caching doesn't help:

- uses: ./.github/actions/setup-paddle-ocr-models
  with:
    cache-enabled: false

Download Specific Models Only

- uses: ./.github/actions/setup-paddle-ocr-models
  with:
    models: "det,rec" # Skip classification model

Inputs

Name	Description	Required	Default
`cache-enabled`	Enable model caching (set false for cross-arch builds)	No	`true`
`models`	Comma-separated list of models to setup (det,cls,rec or subset)	No	`det,cls,rec`
`cache-key-suffix`	Suffix for cache key to differentiate model sets	No	`paddle-ocr-v5-onnx`
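As a rough illustration of how a comma-separated `models` value could be validated before any download starts, here is a minimal sketch. The function name `parse_models` is hypothetical and the action's real parsing logic may differ; only the three model names and the default come from the table above.

```shell
# Validate a comma-separated models value and print one model per line.
# An empty or missing argument falls back to the documented default.
# parse_models is a hypothetical helper, not part of the action.
parse_models() {
  for m in $(printf '%s' "${1:-det,cls,rec}" | tr ',' ' '); do
    case "$m" in
      det|cls|rec) printf '%s\n' "$m" ;;
      *) echo "unknown model: $m" >&2; return 1 ;;
    esac
  done
}

parse_models "det,rec"
```

Rejecting unknown names early keeps a typo such as `dec` from silently producing an incomplete model set.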

Outputs

Name	Description
`cache-hit`	Whether models were restored from cache (true/false)
`cache-dir`	Path to the PaddleOCR model cache directory
`models-available`	Comma-separated list of available models after setup

Outputs as Environment Variables

The action automatically exports:

PADDLE_OCR_MODEL_CACHE: Absolute path to model cache directory

Cache Strategy

Models are cached using GitHub Actions cache with the following key structure:

paddle-ocr-v5-onnx-{OS}-{ARCHITECTURE}-v4

Cache restoration order (restore-keys):

Exact match: paddle-ocr-v5-onnx-{OS}-{ARCHITECTURE}-v4
OS-Architecture: paddle-ocr-v5-onnx-{OS}-{ARCHITECTURE}-
OS only: paddle-ocr-v5-onnx-{OS}-
Any: paddle-ocr-v5-onnx-
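The key and its fallbacks listed above can be generated mechanically. The sketch below assumes the OS and architecture values look like GitHub's `runner.os` and `runner.arch` (for example `Linux` and `X64`) and that the key prefix is the `cache-key-suffix` input; the helper name `paddle_cache_keys` is made up for illustration. GitHub Actions matches restore-keys by prefix, which is why each fallback is a truncation of the line before it.

```shell
# Print the exact cache key followed by the three restore-key fallbacks.
# Arguments: OS, architecture, optional prefix (defaults to the documented
# cache-key-suffix default). "v4" is the version from the key structure above.
# paddle_cache_keys is a hypothetical helper, not part of the action.
paddle_cache_keys() {
  os="$1"
  arch="$2"
  prefix="${3:-paddle-ocr-v5-onnx}"
  printf '%s\n' \
    "${prefix}-${os}-${arch}-v4" \
    "${prefix}-${os}-${arch}-" \
    "${prefix}-${os}-" \
    "${prefix}-"
}

paddle_cache_keys Linux X64
```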

Example: CI Rust Workflow Integration

jobs:
  paddle-ocr-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: ./.github/actions/setup-paddle-ocr-models
        id: paddle-models

      - name: Run PaddleOCR tests
        run: cargo test --package kreuzberg-paddle-ocr
        env:
          PADDLE_OCR_MODEL_CACHE: ${{ steps.paddle-models.outputs.cache-dir }}

      - name: Report cache status
        if: always()
        run: |
          echo "Cache hit: ${{ steps.paddle-models.outputs.cache-hit }}"
          echo "Available models: ${{ steps.paddle-models.outputs.models-available }}"

Error Handling

The action downloads models sequentially and will fail if a required model download fails. After downloading:

The verify step reports which models are actually available in the output
Downstream tests can read models-available to see which models were set up
If all models fail, tests can fall back to alternative behavior
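One way a downstream step might act on `models-available` is sketched below. It assumes the output has been copied into a variable named `MODELS_AVAILABLE` (a hypothetical name) and that the list contains no spaces; `models_include` is an illustrative helper, not something the action provides.

```shell
# Return 0 when every required model appears in a comma-separated list
# such as the models-available output, non-zero otherwise.
# models_include is a hypothetical helper.
models_include() {
  available="$1"
  required="$2"
  for model in $(printf '%s' "$required" | tr ',' ' '); do
    case ",${available}," in
      *",${model},"*) ;;
      *) return 1 ;;
    esac
  done
  return 0
}

# MODELS_AVAILABLE is an assumed variable holding the action output.
MODELS_AVAILABLE="${MODELS_AVAILABLE:-det,rec}"
if models_include "$MODELS_AVAILABLE" "det,rec"; then
  echo "required models present"
else
  echo "required models missing"
fi
```

Wrapping the list in commas before matching avoids false positives from partial names.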

Download Sources

Models are downloaded from:

https://huggingface.co/Kreuzberg/paddleocr-onnx-models/resolve/main/

If this repository becomes unavailable, the action will fail gracefully. Alternative sources can be configured by modifying the MODEL_URL environment variables in the action.
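To make the download step concrete, the sketch below builds (but does not run) a `curl` command for each model type. The remote file names are taken from the Models table and are assumed to sit directly under the base URL; the local paths follow the cache layout shown under Troubleshooting; the 300-second limit mirrors the timeout mentioned there. `model_paths` and `download_command` are hypothetical helpers, and the recognition dictionary (`dict.txt`) is not covered.

```shell
BASE_URL="https://huggingface.co/Kreuzberg/paddleocr-onnx-models/resolve/main"

# Map a model type to "<remote file> <path relative to the cache dir>".
# The remote locations are assumptions based on the Models table.
model_paths() {
  case "$1" in
    det) echo "PP-OCRv5_server_det_infer.onnx det/model.onnx" ;;
    cls) echo "ch_ppocr_mobile_v2.0_cls_infer.onnx cls/model.onnx" ;;
    rec) echo "rec/english/model.onnx rec/english/model.onnx" ;;
    *) return 1 ;;
  esac
}

# Print the curl command for one model; arguments: model type, cache dir.
download_command() {
  paths="$(model_paths "$1")" || return 1
  remote="${paths%% *}"
  local_path="${paths#* }"
  printf 'curl --fail --location --max-time 300 --output %s/%s %s/%s\n' \
    "$2" "$local_path" "$BASE_URL" "$remote"
}

download_command det "$HOME/.cache/kreuzberg/paddle-ocr"
```

Printing the command first makes it easy to inspect the URL and target path before pointing the action at an alternative source.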

Troubleshooting

Models not being cached

Check that cache-enabled is not set to false
Verify GitHub Actions cache is not full (max 10 GB per repository)
Check runner OS and architecture match cache keys
View caches from the repository's Actions tab (Actions → Caches)

Download timeouts

If downloads time out:

Increase the 300-second timeout in the action steps
Check Hugging Face API availability
Try reducing the number of models (models: "det,rec")

Verifying models are present

Check that all expected models exist in the correct directory structure:

ls -lh ~/.cache/kreuzberg/paddle-ocr/

Expected output:

drwxr-xr-x det/
drwxr-xr-x cls/
drwxr-xr-x rec/

ls -lh ~/.cache/kreuzberg/paddle-ocr/det/
-rw-r--r-- model.onnx (~84 MB)

ls -lh ~/.cache/kreuzberg/paddle-ocr/cls/
-rw-r--r-- model.onnx (~0.6 MB)

ls -lh ~/.cache/kreuzberg/paddle-ocr/rec/english/
-rw-r--r-- model.onnx (~8 MB)
-rw-r--r-- dict.txt

The directory structure must match what ModelManager expects in model_manager.rs.
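The manual check above can also be scripted. This sketch walks the layout listed above and prints a comma-separated list in the same style as `models-available`. Treating `dict.txt` as required for the recognition model is an assumption, and `list_available_models` is a hypothetical helper rather than the action's own verify step.

```shell
# Print a comma-separated list of models whose files exist and are
# non-empty under the given cache directory.
# list_available_models is a hypothetical helper.
list_available_models() {
  dir="$1"
  found=""
  if [ -s "$dir/det/model.onnx" ]; then
    found="${found}det,"
  fi
  if [ -s "$dir/cls/model.onnx" ]; then
    found="${found}cls,"
  fi
  # Assumption: the recognition model is only usable with its dictionary.
  if [ -s "$dir/rec/english/model.onnx" ] && [ -f "$dir/rec/english/dict.txt" ]; then
    found="${found}rec,"
  fi
  # Strip the trailing comma, if any.
  printf '%s\n' "${found%,}"
}

list_available_models "${PADDLE_OCR_MODEL_CACHE:-$HOME/.cache/kreuzberg/paddle-ocr}"
```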

Performance Impact

First run (no cache): ~30-60 seconds (download time depends on network)
Cached run: <1 second (cache restore)
Cache size: ~93 MB per OS/architecture
Network bandwidth: ~93 MB download on cache miss

Related Actions

.github/actions/setup-tesseract-cache - Similar caching for Tesseract models
.github/actions/cache-hf-fastembed - Hugging Face model caching for fastembed
.github/actions/setup-onnx-runtime - ONNX Runtime setup for inference
