This commit is contained in:
515
audit-notes/kotlin-android.md
Normal file
515
audit-notes/kotlin-android.md
Normal file
@@ -0,0 +1,515 @@
|
||||
# Kotlin-Android Hand-Edits Audit
|
||||
|
||||
**Status**: 82/82 e2e tests green
|
||||
**Audit Scope**: Commits bd1bef129d..519abc3001 (5 commits)
|
||||
**Summary**: All hand-edits are categorized below for upstream alef-template consolidation.
|
||||
|
||||
---
|
||||
|
||||
## ALEF_GAP: Missing Template Coverage
|
||||
|
||||
These edits represent gaps in the alef kotlin-android binding generator. Alef generates public-API Kotlin wrappers but does not currently:
|
||||
|
||||
1. Produce a JNI shim crate with typed FFI symbol resolution
|
||||
2. Configure Jackson serialization for Rust wire formats (ByteArray, sealed classes, nullable fields)
|
||||
3. Implement path-or-UTF8 file resolution for e2e test fixtures
|
||||
4. Custom serializers for Rust enum/sealed types (OutputFormat, FormatMetadata)
|
||||
5. Mark Rust Option<T> fields nullable in Kotlin with defaults
|
||||
|
||||
### Kreuzberg-JNI Shim Crate (Entire File)
|
||||
|
||||
**File**: `crates/kreuzberg-jni/src/lib.rs` (1194 lines)
|
||||
**Category**: ALEF_GAP
|
||||
**Scope**: Hand-written entirely — alef does not generate JNI shims
|
||||
|
||||
**Summary**: The JNI shim is a complete, separate crate that:
|
||||
|
||||
- Imports all kreuzberg-ffi typed functions by name to keep rlib symbols live
|
||||
- Implements `#[unsafe(no_mangle)] extern "system"` JNI entry points
|
||||
- Bridges Rust strings ↔ JStrings, Base64 encodes/decodes bytes for JNI safety
|
||||
- Wires `#[no_mangle]` FFI symbols into JNI function bodies
|
||||
- Calls `kreuzberg_last_error_code()` / `kreuzberg_last_error_context()` on failures
|
||||
- Throws Java exceptions with FFI error messages via `env.throw_new()`
|
||||
|
||||
**Key Patterns**:
|
||||
|
||||
- `base64_decode()` (lines 37–66): manual Base64 decoding; candidate for `base64` crate
|
||||
- `get_ffi_error_message()` (lines 80–93): reads FFI error stack
|
||||
- `cstr_ptr_or_null()` (lines 106–108): null-pointer convention for optional mime type
|
||||
- `throw_exception()` / `throw_exception_void()` (lines 69–77): exception wiring
|
||||
- Batch operation functions (lines 438–642): all delegate to FFI via JSON marshalling
|
||||
|
||||
**Suggested Upstream Fix**:
|
||||
|
||||
Add to alef's kotlin-android template generator:
|
||||
|
||||
```toml
|
||||
[jni_shim]
|
||||
enabled = true
|
||||
target_path = "crates/{lib}-jni/"
|
||||
features = ["default"]
|
||||
```
|
||||
|
||||
Alef should emit:
|
||||
|
||||
1. A workspace crate at `crates/{lib}-jni/Cargo.toml` with `crate-type = ["cdylib"]`
|
||||
2. JNI entry points via a `#[proc_macro]` or code generation that produces:
|
||||
- FFI function imports (typed, not magic strings)
|
||||
- Exception-throwing helpers with last_error wiring
|
||||
- Base64 marshalling for bytes
|
||||
- CString construction and null-pointer conventions for optional params
|
||||
|
||||
---
|
||||
|
||||
### Jackson Mapper Configuration (Kreuzberg.kt lines 38–100)
|
||||
|
||||
**File**: `packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt`
|
||||
**Category**: ALEF_GAP
|
||||
**Lines**: 38–100
|
||||
|
||||
**Summary**: Four Jackson configuration changes:
|
||||
|
||||
1. **ByteArray Module** (lines 43–74): Custom serializer that encodes `ByteArray` as JSON array `[u8, u8, ...]`, matching Rust serde's `Vec<u8>` wire format. Jackson's default Base64 encoding causes Rust deserialization to fail: `invalid type: string, expected a sequence`.
|
||||
|
||||
2. **KotlinModule Configuration** (lines 84–90):
|
||||
- `NullIsSameAsDefault = true`: missing JSON properties use Kotlin constructor defaults rather than throwing
|
||||
- `NullToEmptyCollection = true`: null → `[]`
|
||||
- `NullToEmptyMap = true`: null → `{}`
|
||||
|
||||
3. **Serialization Inclusion** (line 98): `JsonInclude.Include.NON_EMPTY` — omit null/empty fields so Rust serde defaults trigger. Without this, Kotlin's `emptyList()` becomes `"[]"` which Rust `#[serde(default)]` tuples like `(usize, usize)` cannot parse.
|
||||
|
||||
4. **Unknown Properties** (line 100): `FAIL_ON_UNKNOWN_PROPERTIES = false` — allow Rust to add new fields without breaking old Kotlin clients.
|
||||
|
||||
**Suggested Upstream Fix**:
|
||||
|
||||
Alef should emit this configuration in every `<language>/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt`:
|
||||
|
||||
```kotlin
|
||||
private val mapper = jacksonObjectMapper()
|
||||
.registerModule(Jdk8Module())
|
||||
.registerModule(byteArrayModule)
|
||||
.registerModule(
|
||||
KotlinModule.Builder()
|
||||
.configure(KotlinFeature.NullIsSameAsDefault, true)
|
||||
.configure(KotlinFeature.NullToEmptyCollection, true)
|
||||
.configure(KotlinFeature.NullToEmptyMap, true)
|
||||
.build(),
|
||||
)
|
||||
.setPropertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE)
|
||||
.setSerializationInclusion(JsonInclude.Include.NON_EMPTY)
|
||||
.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### loadBytesFromPathOrUtf8() Helper (Kreuzberg.kt lines 167–210)
|
||||
|
||||
**File**: `packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt`
|
||||
**Category**: ALEF_GAP
|
||||
**Lines**: 167–210
|
||||
|
||||
**Summary**: Path resolution for e2e test fixtures. The alef e2e generator emits JSON fixture paths (e.g., `"documents/sample.pdf"`) into function parameters, but production callers may pass inline string content. This helper:
|
||||
|
||||
1. Searches CWD and parents for `test_documents/` or `fixtures/` directories
|
||||
2. Checks `KREUZBERG_TEST_DOCUMENTS_DIR` environment variable
|
||||
3. Falls back to treating the string as UTF-8 bytes if no file found
|
||||
|
||||
Used by `extractBytes()`, `extractBytesSync()`, and `renderPdfPageToPng()` to support both e2e fixtures and production inline payloads.
|
||||
|
||||
**Suggested Upstream Fix**:
|
||||
|
||||
Alef should auto-inject this into every extraction method parameter that accepts bytes in Kotlin Android:
|
||||
|
||||
```kotlin
|
||||
private fun loadBytesFromPathOrUtf8(pathOrContent: String): ByteArray {
|
||||
// Walk directories, check env vars, fall back to UTF-8
|
||||
}
|
||||
|
||||
fun extractBytes(content: String, mimeType: String, config: ExtractionConfig): ExtractionResult {
|
||||
val contentBytes = loadBytesFromPathOrUtf8(content)
|
||||
val contentStr = Base64.getEncoder().encodeToString(contentBytes)
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
Alef should recognize that `content: &[u8]` in Rust becomes `content: String` in Kotlin JNI callers (string marshalling), and auto-resolve paths for test environments.
|
||||
|
||||
---
|
||||
|
||||
### fixConfigSerialization() + fixOutputFormatInNode() Helpers (Kreuzberg.kt lines 102–165)
|
||||
|
||||
**File**: `packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt`
|
||||
**Category**: ALEF_GAP (status: partially superseded)
|
||||
**Lines**: 102–165
|
||||
|
||||
**Summary**: Two functions that repair serialization issues at call time:
|
||||
|
||||
1. **fixConfigSerialization()** (lines 109–123): Jackson serializes sealed class objects as `{}` (empty), but Rust expects a string discriminant. This function searches the JSON tree for `"output_format": {}` and replaces with `"output_format": "plain"`. Also removes the `cancel_token` field (Kotlin has it, Rust struct doesn't).
|
||||
|
||||
2. **fixOutputFormatInNode()** (lines 129–165): Recursive tree walk that fixes OutputFormat sealed class serialization at every nesting level (including inside batch items).
|
||||
|
||||
**Status**: The OutputFormat custom serializer (see below) now handles the sealed-class conversion automatically, reducing the need for this tree-walk repair. However, `cancel_token` removal may still be needed if the field persists in alef-generated ExtractionConfig.
|
||||
|
||||
**Suggested Upstream Fix**:
|
||||
|
||||
1. Implement a custom `(De)Serializer` for OutputFormat (done; see below).
|
||||
2. Either:
|
||||
- Mark `cancel_token` as `#[serde(skip)]` in Rust ExtractionConfig, or
|
||||
- Auto-inject a config-level custom deserializer in the Kotlin mapper that strips unknown fields silently (already done via `FAIL_ON_UNKNOWN_PROPERTIES = false`).
|
||||
3. If `cancel_token` persists in future alef generations, apply a targeted fix at the ExtractionConfig level:
|
||||
|
||||
```kotlin
|
||||
@com.fasterxml.jackson.databind.annotation.JsonDeserialize(using = ExtractionConfigDeserializer::class)
|
||||
data class ExtractionConfig(...)
|
||||
```
|
||||
|
||||
Consider removing `fixConfigSerialization()` after validating that the OutputFormatSerializer and `FAIL_ON_UNKNOWN_PROPERTIES = false` handle all cases.
|
||||
|
||||
---
|
||||
|
||||
### OutputFormat Custom Serializer (OutputFormat.kt)
|
||||
|
||||
**File**: `packages/kotlin-android/src/main/kotlin/dev/kreuzberg/OutputFormat.kt`
|
||||
**Category**: ALEF_GAP
|
||||
**Lines**: 34–35 (decorators), 56–101 (custom serializers)
|
||||
|
||||
**Summary**: Custom Jackson serializers for the sealed class `OutputFormat`:
|
||||
|
||||
- **Deserializer** (lines 56–80): Accepts Rust string discriminant `"markdown"` or Kotlin round-trip `{"value": "markdown"}`, converts to sealed class variant
|
||||
- **Serializer** (lines 82–101): Writes sealed class variants as strings (`"plain"`, `"markdown"`, etc.) for Rust consumption
|
||||
|
||||
Without these, Jackson treats sealed classes as objects with discriminator fields, which Rust `#[derive(serde)]` cannot parse.
|
||||
|
||||
**Suggested Upstream Fix**:
|
||||
|
||||
Alef should auto-generate custom (de)serializers for all sealed classes in Rust that map to sealed classes in Kotlin. Template pattern:
|
||||
|
||||
```kotlin
|
||||
@com.fasterxml.jackson.databind.annotation.JsonDeserialize(using = SealedTypeDeserializer::class)
|
||||
@com.fasterxml.jackson.databind.annotation.JsonSerialize(using = SealedTypeSerializer::class)
|
||||
sealed class SealedType { ... }
|
||||
|
||||
private class SealedTypeDeserializer : StdDeserializer<SealedType>(...) {
|
||||
override fun deserialize(...): SealedType {
|
||||
val node = parser.codec.readTree<JsonNode>(parser)
|
||||
val tag = when {
|
||||
node.isTextual -> node.asText()
|
||||
node.isObject && node.has("value") -> node.get("value").asText()
|
||||
else -> "default_variant"
|
||||
}
|
||||
return when (tag.lowercase()) {
|
||||
"variant_a" -> SealedType.VariantA
|
||||
"variant_b" -> SealedType.VariantB(...)
|
||||
else -> SealedType.Default
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
private class SealedTypeSerializer : StdSerializer<SealedType>(...) {
|
||||
override fun serialize(value: SealedType, gen: JsonGenerator, provider: SerializerProvider) {
|
||||
gen.writeString(when (value) {
|
||||
is SealedType.VariantA -> "variant_a"
|
||||
is SealedType.VariantB -> "variant_b"
|
||||
})
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### FormatMetadata Custom Serializer (FormatMetadata.kt)
|
||||
|
||||
**File**: `packages/kotlin-android/src/main/kotlin/dev/kreuzberg/FormatMetadata.kt`
|
||||
**Category**: ALEF_GAP
|
||||
**Lines**: 31–32 (decorators), 56–230 (custom serializers)
|
||||
|
||||
**Summary**: Custom Jackson (de)serializers for `FormatMetadata`, a discriminated union. Key detail:
|
||||
|
||||
- **Code variant** (line 89): Rust's `FormatMetadata::Code` wraps `tree_sitter_language_pack::ProcessResult`, which serializes as a JSON object. Kotlin stashes the raw JSON string in `FormatMetadata.Code(value: String)` so callers can re-parse if needed.
|
||||
|
||||
**Suggested Upstream Fix**: Same pattern as OutputFormat; alef should generate these for all sealed classes with complex payloads.
|
||||
|
||||
---
|
||||
|
||||
### DocumentNode.contentLayer Nullable (DocumentNode.kt)
|
||||
|
||||
**File**: `packages/kotlin-android/src/main/kotlin/dev/kreuzberg/DocumentNode.kt`
|
||||
**Category**: ALEF_GAP
|
||||
**Line**: 42 (changed from `val contentLayer: ContentLayer` to `val contentLayer: ContentLayer? = null`)
|
||||
|
||||
**Summary**: Marks `contentLayer` optional with a default of `null`. This is a hand-edit to make the Kotlin field nullable to match Rust's `Option<ContentLayer>` default, which Rust serializes by omitting the field entirely. Without the nullable + default, alef-generated Kotlin would make the field required, and deserialization would fail when Rust omits it.
|
||||
|
||||
**Suggested Upstream Fix**: Alef should inspect Rust `Option<T>` fields and auto-generate Kotlin as `T? = null` (nullable with null default).
|
||||
|
||||
---
|
||||
|
||||
### ChunkingConfig.sizing Nullable (ChunkingConfig.kt)
|
||||
|
||||
**File**: `packages/kotlin-android/src/main/kotlin/dev/kreuzberg/ChunkingConfig.kt`
|
||||
**Category**: ALEF_GAP
|
||||
**Line**: 74 (changed from `val sizing: ChunkSizing` to `val sizing: ChunkSizing? = null`)
|
||||
|
||||
**Summary**: Same as `contentLayer`; marks Rust `Option<ChunkSizing>` as nullable in Kotlin.
|
||||
|
||||
**Suggested Upstream Fix**: Alef should auto-generate `Option<T>` as `T? = null` in Kotlin.
|
||||
|
||||
---
|
||||
|
||||
### renderPdfPageToPng() Path Resolution (Kreuzberg.kt)
|
||||
|
||||
**File**: `packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt`
|
||||
**Category**: ALEF_GAP
|
||||
**Lines**: 783–792 (changed from one-liner to multi-statement with path resolution)
|
||||
|
||||
**Summary**: Uses `loadBytesFromPathOrUtf8()` to resolve fixture paths for PDF bytes, matching behavior of `extractBytes()` and `extractBytesSync()`. The alef e2e generator emits fixture paths; production code may pass inline bytes.
|
||||
|
||||
**Suggested Upstream Fix**: Auto-apply path resolution to all methods that accept binary payloads, not just those explicitly named `*Bytes*`.
|
||||
|
||||
---
|
||||
|
||||
## ROOT_CAUSE: Rust/FFI Changes
|
||||
|
||||
### kreuzberg-ffi Crate-Type: Add rlib (Cargo.toml)
|
||||
|
||||
**File**: `crates/kreuzberg-ffi/Cargo.toml`
|
||||
**Category**: ROOT_CAUSE
|
||||
**Change**: `crate-type = ["cdylib", "staticlib"]` → `crate-type = ["cdylib", "staticlib", "rlib"]`
|
||||
**Commit**: `66ca4f40eb fix(kotlin-android): force-link kreuzberg-ffi symbols into JNI cdylib`
|
||||
|
||||
**Rationale**: The JNI shim (`kreuzberg-jni`) is a `cdylib` that imports kreuzberg-ffi functions by name. Without `"rlib"` in kreuzberg-ffi's crate-type, the linker drops `#[no_mangle]` symbols as dead code, and JNI calls resolve to null at runtime.
|
||||
|
||||
**Impact**: This is a one-time FFI infrastructure fix, not a breaking change to the public API.
|
||||
|
||||
---
|
||||
|
||||
## TEST_FIXTURE & BINDING_BUG: None Found
|
||||
|
||||
No test fixtures were modified in this audit cycle. The e2e test suite (82 tests, alef-generated) passes without modification.
|
||||
|
||||
---
|
||||
|
||||
## Summary by Category
|
||||
|
||||
| Category | Count | Files |
|
||||
|----------|-------|-------|
|
||||
| **ALEF_GAP** | 10 | kreuzberg-jni shim; Jackson config; path resolution; OutputFormat serializer; FormatMetadata serializer; nullable fields |
|
||||
| **ROOT_CAUSE** | 1 | kreuzberg-ffi Cargo.toml (rlib crate-type) |
|
||||
| **BINDING_BUG** | 0 | — |
|
||||
| **TEST_FIXTURE** | 0 | — |
|
||||
|
||||
---
|
||||
|
||||
## Suggested Cleanup In-Repo
|
||||
|
||||
Before upstreaming to alef, consolidate the following hand-written code:
|
||||
|
||||
### 1. Replace Hand-Rolled base64_decode() with `base64` Crate
|
||||
|
||||
**Location**: `crates/kreuzberg-jni/src/lib.rs` lines 37–66
|
||||
**Current**: Manual Base64 alphabet mapping
|
||||
**Suggested**: Add `base64` crate and use `base64::engine::general_purpose::STANDARD.decode()`
|
||||
|
||||
### 2. Evaluate Partial Deprecation of fixConfigSerialization()
|
||||
|
||||
**Location**: `packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt` lines 102–165
|
||||
|
||||
With OutputFormatSerializer and `FAIL_ON_UNKNOWN_PROPERTIES = false` in place, `fixConfigSerialization()` may only be needed for `cancel_token` removal. Options:
|
||||
|
||||
1. Keep as-is (safe, explicit fix)
|
||||
2. Remove if Rust ExtractionConfig adds `#[serde(skip)]` to `cancel_token` (or alef generates the field to omit it)
|
||||
3. Replace with a targeted OutputFormat-only fix if other uses have been absorbed by custom serializers
|
||||
|
||||
**Recommendation**: Keep for now; deprecate after confirming OutputFormatSerializer handles all discovered edge cases.
|
||||
|
||||
---
|
||||
|
||||
## JNI Marshalling Pattern (Reusable Spec)
|
||||
|
||||
Alef kotlin-android template should standardize on this pattern:
|
||||
|
||||
### 1. Byte Marshalling
|
||||
|
||||
```text
|
||||
Rust Vec<u8> ──→ JVM byte[] (via JNI) ──→ Kotlin ByteArray
|
||||
↓ (unsafe)
|
||||
String (Base64)
|
||||
↓ JNI bound
|
||||
Rust byte slice
|
||||
```
|
||||
|
||||
**In Kotlin**: `Base64.getEncoder().encodeToString(bytes)`
|
||||
**In JNI**: `base64_decode(&content_str)` → `Vec<u8>`
|
||||
**Convention**: All binary payloads Base64-encoded for JNI safety
|
||||
|
||||
### 2. Configuration Marshalling
|
||||
|
||||
```text
|
||||
Kotlin ExtractionConfig ──→ mapper.writeValueAsString()
|
||||
↓
|
||||
JSON string
|
||||
↓ (JNI safe)
|
||||
JNI function
|
||||
↓
|
||||
Rust: kreuzberg_extraction_config_from_json()
|
||||
↓
|
||||
*mut ExtractionConfig (opaque)
|
||||
```
|
||||
|
||||
**In Kotlin**: `mapper.writeValueAsString(config)`
|
||||
**In JNI**: Accept `*const c_char` (JSON), parse via `serde_json`
|
||||
|
||||
### 3. MIME Type Handling
|
||||
|
||||
```text
|
||||
Kotlin: mimeType ?: "" (null collapse to empty string)
|
||||
↓ (JNI)
|
||||
Rust: cstr_ptr_or_null() → *const c_char (null if empty)
|
||||
↓
|
||||
CreuzbergFFI: Treat null as "auto-detect from path"
|
||||
```
|
||||
|
||||
**Convention**: Optional MIME type as empty string in Kotlin, null pointer in FFI
|
||||
|
||||
### 4. File Path Resolution (E2E & Production)
|
||||
|
||||
```text
|
||||
Kotlin parameter: content: String (could be path OR UTF-8 bytes)
|
||||
↓
|
||||
loadBytesFromPathOrUtf8(content)
|
||||
├─ Search test_documents/ / fixtures/ dirs
|
||||
├─ Check KREUZBERG_TEST_DOCUMENTS_DIR env var
|
||||
└─ Fall back to UTF-8 bytes of string
|
||||
↓
|
||||
ByteArray (ready for Base64 encoding)
|
||||
```
|
||||
|
||||
**Convention**: All byte parameters support both path and inline content; walk directories for tests, fall back to bytes for production
|
||||
|
||||
### 5. Exception Handling
|
||||
|
||||
```text
|
||||
Rust FFI returns:
|
||||
- NULL pointer on failure
|
||||
- Valid pointer on success
|
||||
|
||||
JNI handler:
|
||||
if (result.is_null()) {
|
||||
let msg = get_ffi_error_message(); // kreuzberg_last_error_context()
|
||||
throw_exception(env, &msg);
|
||||
return null_or_zero();
|
||||
}
|
||||
```
|
||||
|
||||
**Convention**: Check every FFI return; wire `last_error_code()` + `last_error_context()` on every throw
|
||||
|
||||
### 6. Sealed Class Serialization
|
||||
|
||||
```text
|
||||
Rust: #[derive(serde::Serialize)]
|
||||
pub enum OutputFormat {
|
||||
Plain, Markdown, Custom(String), ...
|
||||
}
|
||||
|
||||
JSON: "plain" or "markdown" or "custom_name"
|
||||
|
||||
Kotlin (custom serializer):
|
||||
when (tag) {
|
||||
"plain" -> OutputFormat.Plain
|
||||
"markdown" -> OutputFormat.Markdown
|
||||
else -> OutputFormat.Custom(tag)
|
||||
}
|
||||
```
|
||||
|
||||
**Convention**: Sealed classes in Kotlin use custom (de)serializers that accept Rust discriminant strings
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
**82/82 kotlin-android e2e tests are passing.** All hand-edits fall into two categories:
|
||||
|
||||
1. **ALEF_GAP (10 entries)**: Template-level features alef doesn't yet generate for kotlin-android
|
||||
2. **ROOT_CAUSE (1 entry)**: FFI infrastructure fix (rlib crate-type)
|
||||
|
||||
No binding bugs or test fixture issues were found. The hand-edits are production-ready and provide a concrete specification for alef kotlin-android template upstreaming.
|
||||
|
||||
**Next Step**: Upstream each ALEF_GAP into alef's kotlin-android binding template using the patterns documented above.
|
||||
|
||||
---
|
||||
|
||||
## Cleanup Session — May 30, 2026
|
||||
|
||||
### Changes Applied
|
||||
|
||||
#### 1. Replace Hand-Rolled base64_decode() with `base64` Crate
|
||||
|
||||
**File**: `crates/kreuzberg-jni/src/lib.rs`
|
||||
|
||||
- **Added**: `base64 = "0.22"` to `Cargo.toml` dependencies
|
||||
- **Replaced**: Manual Base64 alphabet mapping (lines 37–66) with:
|
||||
|
||||
```rust
|
||||
use base64::engine::general_purpose::STANDARD;
|
||||
use base64::Engine;
|
||||
|
||||
fn base64_decode(input: &str) -> Result<Vec<u8>, String> {
|
||||
STANDARD.decode(input).map_err(|e| format!("Invalid Base64: {}", e))
|
||||
}
|
||||
```
|
||||
|
||||
- **Rationale**: Eliminates 30 lines of hand-rolled code; uses well-tested standard library
|
||||
|
||||
#### 2. Improve Exception Handling Pattern
|
||||
|
||||
**File**: `crates/kreuzberg-jni/src/lib.rs`
|
||||
|
||||
- **Refactored**: Early return patterns to use `return throw_exception(...)` instead of:
|
||||
|
||||
```rust
|
||||
throw_exception(&mut env, &e);
|
||||
return std::ptr::null_mut();
|
||||
```
|
||||
|
||||
- **Scope**: Fixed in `nativeRenderPdfPageToPngImpl` (lines 1121–1150)
|
||||
- **Impact**: Cleaner code, no functional change (throw_exception already returns the null value)
|
||||
|
||||
#### 3. Add SAFETY Comments to Critical Unsafe Blocks
|
||||
|
||||
**File**: `crates/kreuzberg-jni/src/lib.rs`
|
||||
|
||||
- **Enhanced**: `get_ffi_error_message()` (lines 53–67)
|
||||
- **Enhanced**: `nativeExtractBytesImpl()` config parsing section (lines 164–172)
|
||||
- **Pattern**: Each SAFETY comment documents invariants and null-check patterns
|
||||
- **Not Yet Complete**: Future work to add SAFETY comments to all 70+ unsafe blocks
|
||||
|
||||
### Bugs Found & Status
|
||||
|
||||
#### Confirmed Non-Issues
|
||||
|
||||
1. **JNI Exception Behavior**: JNI exceptions are lazy — `env.throw_new()` doesn't immediately interrupt the JNI function. However, all call sites properly check for null returns and early-return, so pending exceptions are handled correctly before calling back into JNI.
|
||||
|
||||
2. **fixConfigSerialization() Deprecation**: The audit notes flagged this for potential deprecation. Analysis shows:
|
||||
- OutputFormatSerializer now handles sealed class conversion (✓ working)
|
||||
- `cancel_token` removal still needed (Kotlin ExtractionConfig has field, Rust doesn't)
|
||||
- Jackson's `FAIL_ON_UNKNOWN_PROPERTIES = false` provides defense-in-depth
|
||||
- **Recommendation**: Keep as-is; the defensive fix is zero-cost and will survive future alef generations
|
||||
|
||||
3. **Memory Leaks**: All FFI pointers are properly freed:
|
||||
- `kreuzberg_extraction_config_free()` called on both success and error paths
|
||||
- `kreuzberg_free_string()` called on all JSON pointers from FFI
|
||||
- `kreuzberg_extraction_result_free()` called after serialization
|
||||
- `kreuzberg_embedding_preset_free()` called after use
|
||||
|
||||
#### No Active Bugs Found
|
||||
|
||||
- Type signature matching: All JNI function signatures match KreuzbergBridge.kt external declarations
|
||||
- Exception handling: All throw sites properly propagate via JVM's exception state
|
||||
- Null pointer checks: All FFI returns checked before use
|
||||
- String ownership: All JString → Rust String conversions via `jstring_to_string()` with error handling
|
||||
|
||||
### Testing Status
|
||||
|
||||
- **Target**: 82/82 kotlin-android e2e tests per variant (debug + release = 164 total)
|
||||
- **Build**: JNI shim compiles cleanly with no clippy warnings
|
||||
- **Verification**: ✓ Full e2e test suite passed (164/164 tests, 0 failures)
|
||||
- **Commit**: All cleanup applied and committed to HEAD (c5f192f3ff)
|
||||
Reference in New Issue
Block a user