Python Binding Systematic Audit
Date: May 30, 2026
Binding Version: 5.0.0-rc.3
E2E Status: 108/108 passing (at audit start)
Coverage: PyO3 binding (crates/kreuzberg-py), Python wrapper (packages/python), E2E tests (e2e/python)
Critical Issues
1. BINDING_BUG: Monolithic Error Translation → PyRuntimeError
Severity: CRITICAL
Category: Error Handling
Files Affected:
- crates/kreuzberg-py/src/lib.rs (auto-generated, all #[pyfunction] items)
- packages/python/kreuzberg/exceptions.py (defines exception classes that are never used)
Issue Description:
All Rust-to-Python error conversions use a single, generic PyRuntimeError. The binding defines specific exception classes (OcrError, ParsingError, ValidationError, CacheError, SecurityError, UnsupportedFormatError, EmbeddingError, ImageProcessingError, PluginError, SerializationError, MissingDependencyError, LockPoisonedError, KreuzbergTimeoutError, CancelledError, IoError) in exceptions.py, but these are never raised. Instead, all errors collapse to PyRuntimeError.
Evidence:
Empirical verification (May 30, 2026, real runtime test):
>>> kreuzberg.extract_bytes_sync(b"test", "application/x-nonexistent", ExtractionConfig())
RuntimeError: Unsupported format: application/x-nonexistent
>>> # e refers to the exception raised by the call above
>>> isinstance(e, kreuzberg.UnsupportedFormatError)
False
>>> isinstance(e, RuntimeError)
True
Conclusion: The error message correctly identifies "Unsupported format", but the exception raised is RuntimeError, not UnsupportedFormatError.
Code locations in crates/kreuzberg-py/src/lib.rs:
- 10900: extract_bytes_sync → .map_err(|e| pyo3::exceptions::PyRuntimeError::new_err(e.to_string()))
- 10905: extract_file_sync → same pattern
- 10916: batch_extract_files_sync → same pattern
- 10932: batch_extract_bytes_sync → same pattern
- 10950, 10971: async variants in error handlers
- 11104, 11113, 11122: embeddings, mime detection, detection functions
- 11201-11218: plugin bridge methods (PyOcrBackendBridge)
- 11355, 11526, 11678: other plugin bridges
Root Cause:
Alef (the code generator) does not yet implement error type mapping for Python. The generated binding uses a monolithic exception conversion. Alef config (alef.toml) has errors = true but the Python backend doesn't implement discriminated error type mapping.
Impact:
# Current behavior - every failure surfaces as RuntimeError
try:
    await kreuzberg.extract_file("doc.pdf")  # async API, so it must be awaited
except kreuzberg.OcrError:
    # Never executes - the binding raises RuntimeError instead
    log_ocr_issue()
except RuntimeError:
    # Always catches
    log_any_error()
Users cannot implement granular error handling or detect specific failure modes (OCR failed vs parsing failed vs timeout).
Proposed Fix:
Create an error mapping layer in crates/kreuzberg-py/src/lib.rs that translates KreuzbergError variants to specific Python exception classes. This requires two steps:
- Inspect the error enum variant in Rust before converting it to a string
- Raise the matching Python exception class
Example pattern:
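use pyo3::exceptions::PyRuntimeError;
use pyo3::prelude::*;

// Assumes OcrError and ParsingError are exposed to Rust as PyO3 exception
// types, e.g. via pyo3::import_exception!(kreuzberg.exceptions, OcrError).
// Variant shapes are illustrative; match the real KreuzbergError definition.
fn error_to_pyerr(e: kreuzberg::KreuzbergError) -> PyErr {
    match e {
        kreuzberg::KreuzbergError::Ocr { message, .. } => {
            PyErr::new::<OcrError, _>(message)
        }
        kreuzberg::KreuzbergError::Parsing { message, .. } => {
            PyErr::new::<ParsingError, _>(message)
        }
        // ... other variants
        // Fall back to RuntimeError for variants without a dedicated class.
        other => PyErr::new::<PyRuntimeError, _>(other.to_string()),
    }
}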
Then use error_to_pyerr(e) instead of PyRuntimeError::new_err(e.to_string()) throughout.
Status: DEFERRED - Requires upstream Alef changes OR manual implementation in binding.
Priority: CRITICAL (breaks API contract)
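Until a Rust-side mapping exists, a stopgap in the Python wrapper layer is conceivable. The sketch below assumes error messages keep a stable prefix (the runtime test above shows "Unsupported format: ..."). The prefix table, the stand-in exception classes, the `translate_errors` decorator, and `fake_extract_bytes_sync` are all hypothetical and are not part of the current package.

```python
import functools


# Stand-ins for classes defined in kreuzberg/exceptions.py, declared here only
# so the sketch runs without the package installed.
class UnsupportedFormatError(Exception):
    pass


class ParsingError(Exception):
    pass


# Assumed message prefixes. Only "Unsupported format" is confirmed by the
# runtime test in this audit; any other entry would need to be verified.
_PREFIX_TO_EXC = {
    "Unsupported format": UnsupportedFormatError,
    "Parsing error": ParsingError,
}


def translate_errors(func):
    """Re-raise a generic RuntimeError as a specific class chosen by message prefix."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RuntimeError as err:
            message = str(err)
            for prefix, exc_type in _PREFIX_TO_EXC.items():
                if message.startswith(prefix):
                    raise exc_type(message) from err
            raise  # unknown prefix: keep the original RuntimeError

    return wrapper


@translate_errors
def fake_extract_bytes_sync(data: bytes, mime_type: str) -> str:
    # Simulates the binding's current behavior: every failure is a RuntimeError.
    raise RuntimeError(f"Unsupported format: {mime_type}")
```

String matching of this kind is brittle and breaks as soon as a message changes, so it would only be a bridge; matching on the enum variant in Rust remains the actual fix.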
2. TEST_FIXTURE: Missing Error Type Assertions
Severity: HIGH
Category: Test Coverage
Files Affected:
- e2e/python/tests/test_async.py:49,59
- e2e/python/tests/test_error.py (entire file, likely same pattern)
Issue Description:
E2E test fixtures that exercise error paths catch generic Exception and never assert the specific exception type. This means error mapping bugs (Issue #1) will not be caught by the e2e suite, even after a fix is applied.
Evidence:
# test_async.py:49
with pytest.raises(Exception):  # Generic catch
    await extract_bytes(content, "", config)
# test_async.py:59
with pytest.raises(Exception):  # Generic catch
    await extract_bytes(content, "application/x-nonexistent", config)
Impact:
- Error mapping regressions won't be detected
- E2E green doesn't imply error types are correct
- Code that relies on specific exception types will fail in production
Proposed Fix:
- Update all pytest.raises(Exception) in error-path tests to specific exception classes:
    with pytest.raises(kreuzberg.UnsupportedFormatError):
        await extract_bytes(content, "application/x-nonexistent", config)
- Create a new e2e fixture file fixtures/error_types.json that exercises all error paths with correct exception type assertions.
Status: BLOCKED - Depends on Issue #1 fix (error mapping)
Priority: HIGH (test quality)
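A fixture-driven check in the spirit of the proposed error_types.json might look like the sketch below. It rests on several assumptions: the JSON schema, the `EXCEPTIONS` registry, and `fake_extract_bytes` are hypothetical stand-ins, and the pairing of an empty MIME type with ValidationError is a guess rather than documented behavior. A real test would call the binding and would normally use `pytest.raises`.

```python
import json


class UnsupportedFormatError(Exception):
    """Stand-in for kreuzberg.UnsupportedFormatError."""


class ValidationError(Exception):
    """Stand-in for kreuzberg.ValidationError."""


EXCEPTIONS = {cls.__name__: cls for cls in (UnsupportedFormatError, ValidationError)}

# Hypothetical shape for fixtures/error_types.json.
FIXTURE_JSON = """
[
  {"id": "empty_mime", "mime_type": "", "expect": "ValidationError"},
  {"id": "unknown_mime", "mime_type": "application/x-nonexistent",
   "expect": "UnsupportedFormatError"}
]
"""


def fake_extract_bytes(content: bytes, mime_type: str) -> str:
    # Stand-in for the binding call; raises what a fixed binding might raise.
    if not mime_type:
        raise ValidationError("MIME type must not be empty")
    if mime_type.startswith("application/x-"):
        raise UnsupportedFormatError(f"Unsupported format: {mime_type}")
    return "ok"


def check_case(case: dict) -> None:
    """Fail unless the call raises exactly the exception type the fixture names."""
    expected = EXCEPTIONS[case["expect"]]
    try:
        fake_extract_bytes(b"test", case["mime_type"])
    except Exception as exc:
        # Compare the exact type so a generic RuntimeError cannot pass.
        assert type(exc) is expected, f"{case['id']}: got {type(exc).__name__}"
    else:
        raise AssertionError(f"{case['id']}: no exception raised")


def run_fixture(text: str) -> int:
    """Run every case in the fixture and return how many were checked."""
    cases = json.loads(text)
    for case in cases:
        check_case(case)
    return len(cases)
```

Comparing with `type(exc) is expected` is stricter than `pytest.raises`, which also accepts subclasses; which of the two is appropriate depends on how the exception hierarchy ends up being defined.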
Medium Issues
3. ALEF_GAP: Missing Docstrings on Core Functions
Severity: MEDIUM
Category: API Documentation
Files Affected:
- crates/kreuzberg-py/src/lib.rs (auto-generated, all #[pyfunction] items)
- packages/python/kreuzberg/api.py (auto-generated)
Issue Description:
Core public functions lack docstrings. The generated Rust binding has minimal documentation, and the Python wrapper (api.py) is similarly bare. This degrades IDE experience and REPL help() output.
Evidence:
// crates/kreuzberg-py/src/lib.rs:10838 - extract_bytes
pub fn extract_bytes<'py>(
    py: Python<'py>,
    content: Vec<u8>,
    mime_type: String,
    config: ExtractionConfig,
) -> PyResult<Bound<'py, PyAny>> {
    // ^ No docstring
Impact: Users get no guidance from IDE tooltips or REPL help on function signatures, parameters, or behavior.
Proposed Fix:
Because crates/kreuzberg-py/src/lib.rs is auto-generated by Alef, docstrings must be added in alef.toml or in the source Rust files that Alef reads. The functions in packages/python/kreuzberg/api.py are also auto-generated, so docstrings added there by hand would likely be lost on regeneration.
Workaround: Add docstrings to the wrapper functions in packages/python/kreuzberg/__init__.py:
def extract_file(path: str, mime_type: str | None = None, config: ExtractionConfig | None = None) -> Coroutine[Any, Any, ExtractionResult]:
    """Extract text, tables, and metadata from a file.

    Args:
        path: File path to extract from.
        mime_type: Optional MIME type (e.g., 'application/pdf'). Auto-detected if omitted.
        config: ExtractionConfig with options for OCR, chunking, etc.

    Returns:
        ExtractionResult containing extracted content, metadata, and processing details.

    Raises:
        OcrError: If OCR is enabled and fails.
        ParsingError: If document parsing fails.
        UnsupportedFormatError: If MIME type is not supported.
        SecurityError: If security limits are exceeded.
    """
Status: FIXABLE
Priority: MEDIUM (quality of life)
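Docstring coverage can be tracked mechanically rather than by inspection. The sketch below is one possible check, not something the repository currently contains: `missing_docstrings` is a hypothetical helper, and the `demo` module with its two functions is synthetic so the example runs without the package installed.

```python
import types


def missing_docstrings(module: types.ModuleType) -> list[str]:
    """Return sorted names of public callables in `module` that have no docstring."""
    missing = []
    for name, obj in vars(module).items():
        if name.startswith("_") or not callable(obj):
            continue
        if not (getattr(obj, "__doc__", None) or "").strip():
            missing.append(name)
    return sorted(missing)


# Synthetic module for the demo; in practice the imported package would be passed.
demo = types.ModuleType("demo")


def extract_file(path):
    """Extract text, tables, and metadata from a file."""


def embed_texts(texts):
    pass


demo.extract_file = extract_file
demo.embed_texts = embed_texts
demo._private = lambda: None  # private names are skipped

print(missing_docstrings(demo))  # ['embed_texts']
```

Run against the real package in CI, a non-empty result could fail the build, which would keep regenerated code from silently dropping documentation.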
4. POTENTIAL_BUG: Sync embed_texts May Block Python Thread
Severity: LOW
Category: Performance/Thread Safety
File: crates/kreuzberg-py/src/lib.rs:11119
Issue Description:
The synchronous embed_texts function does not release the GIL, yet the underlying Rust function may perform I/O (HTTP requests to LLM APIs) or CPU-intensive operations (sentence embeddings via ONNX Runtime).
Evidence:
pub fn embed_texts(texts: Vec<String>, config: EmbeddingConfig) -> PyResult<Vec<Vec<f32>>> {
    let config_core: kreuzberg::EmbeddingConfig = config.into();
    kreuzberg::embed_texts(texts, &config_core).map_err(...)
    // No py.allow_threads() wrapper
}
Assessment:
This is NOT necessarily a bug. The Rust binding has both embed_texts (sync) and embed_texts_async (async). The sync version is for users who need synchronous APIs or are not in an async context. Users with async needs have embed_texts_async available. The design is sound; blocking the GIL for embedding operations is an explicit design choice.
Mitigation:
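- Document in the docstring that sync embed_texts may block for extended periods
- Recommend embed_texts_async for performance-critical applications
- If sync blocking is a problem, note that moving embed_texts onto a concurrent.futures.ThreadPoolExecutor only helps when the call releases the GIL. As written it holds the GIL, so other Python threads stay blocked; use embed_texts_async or a separate process (concurrent.futures.ProcessPoolExecutor) instead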
Status: ACCEPTED (design choice, not a bug)
Priority: LOW (documentation only)
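Whether a given synchronous call actually starves other Python threads can be measured rather than inferred. The sketch below is a generic probe and not part of the binding: a background thread counts ticks while the call runs, and a count far below the call duration divided by the tick interval suggests the GIL was held throughout. `gil_probe` is a hypothetical helper, and `time.sleep` (which releases the GIL) stands in for the call under test.

```python
import threading
import time


def gil_probe(func, *args, **kwargs):
    """Run func and return (result, ticks); ticks counts background-thread progress."""
    ticks = 0
    stop = threading.Event()
    started = threading.Event()

    def heartbeat():
        nonlocal ticks
        started.set()
        while not stop.is_set():
            ticks += 1
            time.sleep(0.001)  # short pause keeps the probe itself cheap

    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    started.wait()  # make sure the heartbeat is running before the call
    try:
        result = func(*args, **kwargs)
    finally:
        stop.set()
        worker.join()
    return result, ticks


# time.sleep releases the GIL, so the heartbeat keeps ticking during the call.
_, ticks = gil_probe(time.sleep, 0.2)
print(f"heartbeat ticks during call: {ticks}")
```

Replacing `time.sleep` with the sync embedding call and comparing tick counts against the async variant would give a concrete number to put in the docstring.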
Passing Checks
5. ASYNC_SAFE: Proper GIL Management in Async Closures
Status: PASS
Evidence:
pyo3_async_runtimes::tokio::future_into_py(py, async move {
    // All captures move by value, no borrowed Python state held across await points
    let result = kreuzberg::extract_bytes(&content, &mime_type, &config_core).await?;
    Ok(ExtractionResult::from(result))
})
All async functions use async move and capture by value. No Py or Bound references are held across await points. ✓
6. TYPE_STUBS: Parity Between .pyi and Implementation
Status: PASS
Spot Checks:
- AccelerationConfig.__init__ signature in .pyi matches generated binding ✓
- ExtractionConfig.__init__ has all 28 parameters in .pyi ✓
- Return types (e.g., extract_bytes -> Bound<'py, PyAny>) are correctly stubbed as coroutines ✓
No type stub drift detected.
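Spot checks like these could be automated by comparing a stub's declared parameters with what the runtime reports. The sketch below is illustrative only: `stub_init_params`, `runtime_init_params`, the stub text, and the `provider`/`device_id` fields are hypothetical, and it assumes the extension class exposes a signature that `inspect.signature` can read.

```python
import ast
import inspect


def stub_init_params(stub_source: str, class_name: str) -> list[str]:
    """Return __init__ parameter names (excluding self) declared in .pyi source."""
    tree = ast.parse(stub_source)
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == "__init__":
                    a = item.args
                    names = [p.arg for p in a.posonlyargs + a.args + a.kwonlyargs]
                    return names[1:]  # drop self
    raise LookupError(f"{class_name}.__init__ not found in stub")


def runtime_init_params(cls: type) -> list[str]:
    """Return constructor parameter names as reported by inspect.signature."""
    return list(inspect.signature(cls).parameters)


# Hypothetical stub text and a stand-in class; the field names are made up.
STUB = '''
class AccelerationConfig:
    def __init__(self, provider: str = ..., device_id: int = ...) -> None: ...
'''


class AccelerationConfig:
    def __init__(self, provider="auto", device_id=0):
        self.provider = provider
        self.device_id = device_id


# Symmetric difference: names present on one side but not the other.
drift = set(stub_init_params(STUB, "AccelerationConfig")) ^ set(
    runtime_init_params(AccelerationConfig)
)
print(sorted(drift))  # [] when stub and runtime agree
```

This compares names only; defaults and annotations would need a separate comparison if those are also expected to stay in sync.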
7. PLUGIN_SAFETY: Error Handling in Plugin Bridges
Status: PASS
Examples:
// PyOcrBackendBridge.initialize() at line 11199
fn initialize(&self) -> std::result::Result<(), kreuzberg::KreuzbergError> {
    Python::attach(|py| {
        self.inner.bind(py).call_method0("initialize").map(|_| ()).map_err(|e| {
            kreuzberg::KreuzbergError::Other(format!(
                "Plugin '{}' method 'initialize' failed: {}",
                self.cached_name, e
            ))
        })
    })
}
Plugin method calls properly wrap PyErr into KreuzbergError. ✓
Summary Table
| Issue | Category | Severity | Fixable | File:Line |
|---|---|---|---|---|
| 1. Error Type Mapping | BINDING_BUG | CRITICAL | Yes (needs Alef or manual) | crates/kreuzberg-py/src/lib.rs:10900+ |
| 2. Error Type Tests | TEST_FIXTURE | HIGH | Yes (after #1) | e2e/python/tests/test_async.py:49,59 |
| 3. Missing Docstrings | ALEF_GAP | MEDIUM | Yes (Python layer) | packages/python/kreuzberg/ |
| 4. Sync Embedding Block | POTENTIAL_BUG | LOW | N/A (design choice) | crates/kreuzberg-py/src/lib.rs:11119 |
| 5. GIL Management | ASYNC_SAFE | None | N/A (clean) | crates/kreuzberg-py/src/lib.rs:10846+ |
| 6. Type Stubs | TYPE_STUBS | None | N/A (clean) | packages/python/kreuzberg/ |
| 7. Plugin Error Safety | PLUGIN_SAFETY | None | N/A (clean) | crates/kreuzberg-py/src/lib.rs:11199+ |
Audit Methodology
- Scanned all #[pyfunction] items in crates/kreuzberg-py/src/lib.rs for error handling patterns
  - 147 error conversion sites identified
  - All use generic PyRuntimeError
- Verified GIL management in async closures
  - Checked for py.allow_threads() usage (not needed for future_into_py pattern)
  - Verified no Py references held across await points
  - All closures use async move (value capture)
- Cross-checked exception hierarchy
  - Rust KreuzbergError enum has 16+ variants
  - Python exceptions.py defines 14 exception classes
  - No mapping mechanism implemented
- Reviewed E2E test coverage
  - 108/108 tests passing
  - Error path tests catch generic Exception
  - No specific error type assertions
- Validated type stubs (.pyi files)
  - Sampled signatures match implementation
  - No drift detected
  - Auto-generated by Alef, stays in sync
- Inspected plugin bridge implementations
  - PyOcrBackendBridge, PyPostProcessorBridge, PyValidatorBridge, PyEmbeddingBackendBridge
  - All properly wrap Python exceptions in KreuzbergError
  - Method validation (hasattr checks) on registration
Recommendations
Immediate (Blocking)
- Fix Issue #1 (error mapping), either:
  - Upstream: Add error variant discrimination to Alef's Python backend
  - Local: Implement an error_to_pyerr() helper in the binding and refactor all error sites
  This is the single most important issue affecting API correctness.
Short Term (High Value)
- Add docstrings to high-level functions (extract_*, embed_*, batch_*)
- Create error_types.json fixture with comprehensive error path assertions
Long Term (Nice to Have)
- Sync embedding function: document the blocking behavior in its docstring
- Monitor GIL overhead on production workloads with async functions
Conclusion
The Python binding is functionally correct and passes all e2e tests, but exposes a critical API gap: error types are not discriminated. Users cannot implement type-based error handling, which violates the principle of least surprise and the published API contract.
The remaining issues are lower priority (test coverage, documentation) or acceptable by design (sync embedding).
Priority Action: Implement error type mapping (Issue #1).