Kotlin-Android Hand-Edits Audit
Status: 82/82 e2e tests green
Audit Scope: Commits bd1bef129d..519abc3001 (5 commits)
Summary: All hand-edits are categorized below for upstream alef-template consolidation.
ALEF_GAP: Missing Template Coverage
These edits represent gaps in the alef kotlin-android binding generator. Alef generates public-API Kotlin wrappers but does not currently:
- Produce a JNI shim crate with typed FFI symbol resolution
- Configure Jackson serialization for Rust wire formats (ByteArray, sealed classes, nullable fields)
- Implement path-or-UTF8 file resolution for e2e test fixtures
- Generate custom serializers for Rust enum/sealed types (OutputFormat, FormatMetadata)
- Mark Rust Option fields nullable in Kotlin with defaults
Kreuzberg-JNI Shim Crate (Entire File)
File: crates/kreuzberg-jni/src/lib.rs (1194 lines)
Category: ALEF_GAP
Scope: Hand-written entirely — alef does not generate JNI shims
Summary: The JNI shim is a complete, separate crate that:
- Imports all kreuzberg-ffi typed functions by name to keep rlib symbols live
- Implements #[unsafe(no_mangle)] extern "system" JNI entry points
- Bridges Rust strings ↔ JStrings, Base64 encodes/decodes bytes for JNI safety
- Wires #[no_mangle] FFI symbols into JNI function bodies
- Calls kreuzberg_last_error_code() / kreuzberg_last_error_context() on failures
- Throws Java exceptions with FFI error messages via env.throw_new()
Key Patterns:
- base64_decode() (lines 37–66): manual Base64 decoding; candidate for base64 crate
- get_ffi_error_message() (lines 80–93): reads FFI error stack
- cstr_ptr_or_null() (lines 106–108): null-pointer convention for optional mime type
- throw_exception() / throw_exception_void() (lines 69–77): exception wiring
- Batch operation functions (lines 438–642): all delegate to FFI via JSON marshalling
Suggested Upstream Fix:
Add to alef's kotlin-android template generator:
[jni_shim]
enabled = true
target_path = "crates/{lib}-jni/"
features = ["default"]
Alef should emit:
- A workspace crate at crates/{lib}-jni/Cargo.toml with crate-type = ["cdylib"]
- JNI entry points via a #[proc_macro] or code generation that produces:
  - FFI function imports (typed, not magic strings)
  - Exception-throwing helpers with last_error wiring
  - Base64 marshalling for bytes
  - CString construction and null-pointer conventions for optional params
Jackson Mapper Configuration (Kreuzberg.kt lines 38–100)
File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt
Category: ALEF_GAP
Lines: 38–100
Summary: Four Jackson configuration changes:
- ByteArray Module (lines 43–74): Custom serializer that encodes ByteArray as a JSON array [u8, u8, ...], matching Rust serde's Vec<u8> wire format. Jackson's default Base64 encoding causes Rust deserialization to fail: invalid type: string, expected a sequence.
- KotlinModule Configuration (lines 84–90):
  - NullIsSameAsDefault = true: null or missing JSON properties use Kotlin constructor defaults rather than throwing
  - NullToEmptyCollection = true: null → []
  - NullToEmptyMap = true: null → {}
- Serialization Inclusion (line 98): JsonInclude.Include.NON_EMPTY omits null/empty fields so Rust serde defaults trigger. Without this, Kotlin's emptyList() becomes "[]", which Rust #[serde(default)] tuples like (usize, usize) cannot parse.
- Unknown Properties (line 100): FAIL_ON_UNKNOWN_PROPERTIES = false allows Rust to add new fields without breaking old Kotlin clients.
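To make the ByteArray point concrete: serde_json's default encoding of a Vec<u8> is a JSON array of integers, not a Base64 string. The dependency-free Rust sketch below only reproduces that shape by hand. The function name is hypothetical and no serde call is involved, so treat it as an illustration of the target wire format rather than the serializer used in Kreuzberg.kt.

```rust
/// Renders bytes the way serde_json encodes a `Vec<u8>` by default:
/// a compact JSON array of unsigned integers.
/// Hypothetical helper, for illustration only.
fn bytes_to_json_array(bytes: &[u8]) -> String {
    let items: Vec<String> = bytes.iter().map(|b| b.to_string()).collect();
    format!("[{}]", items.join(","))
}
```

For the two bytes of "hi" this produces [104,105], whereas Jackson's default ByteArray encoding would send the string "aGk=".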
Suggested Upstream Fix:
Alef should emit this configuration in every <language>/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt:
private val mapper = jacksonObjectMapper()
    .registerModule(Jdk8Module())
    .registerModule(byteArrayModule)
    .registerModule(
        KotlinModule.Builder()
            .configure(KotlinFeature.NullIsSameAsDefault, true)
            .configure(KotlinFeature.NullToEmptyCollection, true)
            .configure(KotlinFeature.NullToEmptyMap, true)
            .build(),
    )
    .setPropertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE)
    .setSerializationInclusion(JsonInclude.Include.NON_EMPTY)
    .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
loadBytesFromPathOrUtf8() Helper (Kreuzberg.kt lines 167–210)
File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt
Category: ALEF_GAP
Lines: 167–210
Summary: Path resolution for e2e test fixtures. The alef e2e generator emits JSON fixture paths (e.g., "documents/sample.pdf") into function parameters, but production callers may pass inline string content. This helper:
- Searches CWD and parents for test_documents/ or fixtures/ directories
- Checks the KREUZBERG_TEST_DOCUMENTS_DIR environment variable
- Falls back to treating the string as UTF-8 bytes if no file is found
Used by extractBytes(), extractBytesSync(), and renderPdfPageToPng() to support both e2e fixtures and production inline payloads.
Suggested Upstream Fix:
Alef should auto-inject this into every extraction method parameter that accepts bytes in Kotlin Android:
private fun loadBytesFromPathOrUtf8(pathOrContent: String): ByteArray {
    // Walk directories, check env vars, fall back to UTF-8
}

fun extractBytes(content: String, mimeType: String, config: ExtractionConfig): ExtractionResult {
    val contentBytes = loadBytesFromPathOrUtf8(content)
    val contentStr = Base64.getEncoder().encodeToString(contentBytes)
    // ...
}
Alef should recognize that content: &[u8] in Rust becomes content: String in Kotlin JNI callers (string marshalling), and auto-resolve paths for test environments.
fixConfigSerialization() + fixOutputFormatInNode() Helpers (Kreuzberg.kt lines 102–165)
File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt
Category: ALEF_GAP (status: partially superseded)
Lines: 102–165
Summary: Two functions that repair serialization issues at call time:
- fixConfigSerialization() (lines 109–123): Jackson serializes sealed class objects as {} (empty), but Rust expects a string discriminant. This function searches the JSON tree for "output_format": {} and replaces it with "output_format": "plain". It also removes the cancel_token field (Kotlin has it, the Rust struct doesn't).
- fixOutputFormatInNode() (lines 129–165): Recursive tree walk that fixes OutputFormat sealed class serialization at every nesting level (including inside batch items).
Status: The OutputFormat custom serializer (see below) now handles the sealed-class conversion automatically, reducing the need for this tree-walk repair. However, cancel_token removal may still be needed if the field persists in alef-generated ExtractionConfig.
Suggested Upstream Fix:
- Implement a custom (De)Serializer for OutputFormat (done; see below).
- Either:
  - Mark cancel_token as #[serde(skip)] in Rust ExtractionConfig, or
  - Auto-inject a config-level custom deserializer in the Kotlin mapper that strips unknown fields silently (already done via FAIL_ON_UNKNOWN_PROPERTIES = false).
- If cancel_token persists in future alef generations, apply a targeted fix at the ExtractionConfig level:
@com.fasterxml.jackson.databind.annotation.JsonDeserialize(using = ExtractionConfigDeserializer::class)
data class ExtractionConfig(...)
Consider removing fixConfigSerialization() after validating that the OutputFormatSerializer and FAIL_ON_UNKNOWN_PROPERTIES = false handle all cases.
OutputFormat Custom Serializer (OutputFormat.kt)
File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/OutputFormat.kt
Category: ALEF_GAP
Lines: 34–35 (annotations), 56–101 (custom serializers)
Summary: Custom Jackson serializers for the sealed class OutputFormat:
- Deserializer (lines 56–80): Accepts the Rust string discriminant "markdown" or the Kotlin round-trip form {"value": "markdown"}, and converts it to the sealed class variant
- Serializer (lines 82–101): Writes sealed class variants as strings ("plain", "markdown", etc.) for Rust consumption
Without these, Jackson treats sealed classes as objects with discriminator fields, which the serde-derived Deserialize impl on the Rust side cannot parse.
Suggested Upstream Fix:
Alef should auto-generate custom (de)serializers for all Rust enums that map to sealed classes in Kotlin. Template pattern:
@com.fasterxml.jackson.databind.annotation.JsonDeserialize(using = SealedTypeDeserializer::class)
@com.fasterxml.jackson.databind.annotation.JsonSerialize(using = SealedTypeSerializer::class)
sealed class SealedType { ... }

private class SealedTypeDeserializer : StdDeserializer<SealedType>(...) {
    override fun deserialize(...): SealedType {
        val node = parser.codec.readTree<JsonNode>(parser)
        // Accept either the bare Rust tag or the Kotlin round-trip object form.
        val tag = when {
            node.isTextual -> node.asText()
            node.isObject && node.has("value") -> node.get("value").asText()
            else -> "default_variant"
        }
        return when (tag.lowercase()) {
            "variant_a" -> SealedType.VariantA
            "variant_b" -> SealedType.VariantB(...)
            else -> SealedType.Default
        }
    }
}

private class SealedTypeSerializer : StdSerializer<SealedType>(...) {
    override fun serialize(value: SealedType, gen: JsonGenerator, provider: SerializerProvider) {
        // The when must cover every variant, including the fallback one.
        gen.writeString(when (value) {
            is SealedType.VariantA -> "variant_a"
            is SealedType.VariantB -> "variant_b"
            is SealedType.Default -> "default_variant"
        })
    }
}
FormatMetadata Custom Serializer (FormatMetadata.kt)
File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/FormatMetadata.kt
Category: ALEF_GAP
Lines: 31–32 (annotations), 56–230 (custom serializers)
Summary: Custom Jackson (de)serializers for FormatMetadata, a discriminated union. Key detail:
- Code variant (line 89): Rust's FormatMetadata::Code wraps tree_sitter_language_pack::ProcessResult, which serializes as a JSON object. Kotlin stashes the raw JSON string in FormatMetadata.Code(value: String) so callers can re-parse if needed.
Suggested Upstream Fix: Same pattern as OutputFormat; alef should generate these for all sealed classes with complex payloads.
DocumentNode.contentLayer Nullable (DocumentNode.kt)
File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/DocumentNode.kt
Category: ALEF_GAP
Line: 42 (changed from val contentLayer: ContentLayer to val contentLayer: ContentLayer? = null)
Summary: Makes contentLayer nullable with a default of null, matching Rust's Option<ContentLayer>, which Rust serializes by omitting the field entirely. Without the nullable type and default, the alef-generated Kotlin field is required, and deserialization fails whenever Rust omits it.
Suggested Upstream Fix: Alef should inspect Rust Option<T> fields and auto-generate Kotlin as T? = null (nullable with null default).
ChunkingConfig.sizing Nullable (ChunkingConfig.kt)
File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/ChunkingConfig.kt
Category: ALEF_GAP
Line: 74 (changed from val sizing: ChunkSizing to val sizing: ChunkSizing? = null)
Summary: Same as contentLayer; marks Rust Option<ChunkSizing> as nullable in Kotlin.
Suggested Upstream Fix: Alef should auto-generate Option<T> as T? = null in Kotlin.
renderPdfPageToPng() Path Resolution (Kreuzberg.kt)
File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt
Category: ALEF_GAP
Lines: 783–792 (changed from one-liner to multi-statement with path resolution)
Summary: Uses loadBytesFromPathOrUtf8() to resolve fixture paths for PDF bytes, matching behavior of extractBytes() and extractBytesSync(). The alef e2e generator emits fixture paths; production code may pass inline bytes.
Suggested Upstream Fix: Auto-apply path resolution to all methods that accept binary payloads, not just those explicitly named *Bytes*.
ROOT_CAUSE: Rust/FFI Changes
kreuzberg-ffi Crate-Type: Add rlib (Cargo.toml)
File: crates/kreuzberg-ffi/Cargo.toml
Category: ROOT_CAUSE
Change: crate-type = ["cdylib", "staticlib"] → crate-type = ["cdylib", "staticlib", "rlib"]
Commit: 66ca4f40eb fix(kotlin-android): force-link kreuzberg-ffi symbols into JNI cdylib
Rationale: The JNI shim (kreuzberg-jni) is a cdylib that imports kreuzberg-ffi functions by name. Without "rlib" in kreuzberg-ffi's crate-type, the linker drops #[no_mangle] symbols as dead code, and JNI calls resolve to null at runtime.
Impact: This is a one-time FFI infrastructure fix, not a breaking change to the public API.
TEST_FIXTURE & BINDING_BUG: None Found
No test fixtures were modified in this audit cycle. The e2e test suite (82 tests, alef-generated) passes without modification.
Summary by Category
| Category | Count | Files |
|---|---|---|
| ALEF_GAP | 10 | kreuzberg-jni shim; Jackson config; path resolution; OutputFormat serializer; FormatMetadata serializer; nullable fields |
| ROOT_CAUSE | 1 | kreuzberg-ffi Cargo.toml (rlib crate-type) |
| BINDING_BUG | 0 | — |
| TEST_FIXTURE | 0 | — |
Suggested Cleanup In-Repo
Before upstreaming to alef, consolidate the following hand-written code:
1. Replace Hand-Rolled base64_decode() with base64 Crate
Location: crates/kreuzberg-jni/src/lib.rs lines 37–66
Current: Manual Base64 alphabet mapping
Suggested: Add base64 crate and use base64::engine::general_purpose::STANDARD.decode()
2. Evaluate Partial Deprecation of fixConfigSerialization()
Location: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt lines 102–165
With OutputFormatSerializer and FAIL_ON_UNKNOWN_PROPERTIES = false in place, fixConfigSerialization() may only be needed for cancel_token removal. Options:
- Keep as-is (safe, explicit fix)
- Remove if Rust ExtractionConfig adds #[serde(skip)] to cancel_token (or alef generates the field to omit it)
- Replace with a targeted OutputFormat-only fix if other uses have been absorbed by custom serializers
Recommendation: Keep for now; deprecate after confirming OutputFormatSerializer handles all discovered edge cases.
JNI Marshalling Pattern (Reusable Spec)
Alef kotlin-android template should standardize on this pattern:
1. Byte Marshalling
Rust Vec<u8> ──→ JVM byte[] (via JNI) ──→ Kotlin ByteArray
↓ (unsafe)
String (Base64)
↓ JNI bound
Rust byte slice
In Kotlin: Base64.getEncoder().encodeToString(bytes)
In JNI: base64_decode(&content_str) → Vec<u8>
Convention: All binary payloads Base64-encoded for JNI safety
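A dependency-free sketch of the decoding step on the Rust side of this boundary. The shim is expected to call the base64 crate for this; the hand-written decoder below exists only so the example compiles with the standard library alone, and it assumes the standard alphabet with = padding. The helper name sextet is hypothetical.

```rust
/// Maps one character of the standard Base64 alphabet to its 6-bit value.
fn sextet(c: u8) -> Option<u8> {
    match c {
        b'A'..=b'Z' => Some(c - b'A'),
        b'a'..=b'z' => Some(c - b'a' + 26),
        b'0'..=b'9' => Some(c - b'0' + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Decodes padded, standard-alphabet Base64 into raw bytes.
fn base64_decode(input: &str) -> Result<Vec<u8>, String> {
    let bytes = input.as_bytes();
    if bytes.len() % 4 != 0 {
        return Err("Invalid Base64: length is not a multiple of 4".to_string());
    }
    let chunk_count = bytes.len() / 4;
    let mut out = Vec::with_capacity(chunk_count * 3);
    for (i, chunk) in bytes.chunks(4).enumerate() {
        // Padding is only legal at the end of the final 4-character group.
        let pad = if i + 1 == chunk_count {
            chunk.iter().rev().take_while(|&&b| b == b'=').count()
        } else {
            0
        };
        if pad > 2 {
            return Err("Invalid Base64: too much padding".to_string());
        }
        // Pack the data characters into a 24-bit accumulator.
        let mut acc: u32 = 0;
        for &b in &chunk[..4 - pad] {
            let v = sextet(b)
                .ok_or_else(|| format!("Invalid Base64: unexpected byte 0x{:02x}", b))?;
            acc = (acc << 6) | u32::from(v);
        }
        acc <<= 6 * pad as u32;
        let triple = [(acc >> 16) as u8, (acc >> 8) as u8, acc as u8];
        out.extend_from_slice(&triple[..3 - pad]);
    }
    Ok(out)
}
```

Unlike a strict decoder, this sketch does not reject non-zero trailing bits in a padded group, which is one more reason to prefer the crate in the shim itself.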
2. Configuration Marshalling
Kotlin ExtractionConfig ──→ mapper.writeValueAsString()
↓
JSON string
↓ (JNI safe)
JNI function
↓
Rust: kreuzberg_extraction_config_from_json()
↓
*mut ExtractionConfig (opaque)
In Kotlin: mapper.writeValueAsString(config)
In JNI: Accept *const c_char (JSON), parse via serde_json
3. MIME Type Handling
Kotlin: mimeType ?: "" (null collapse to empty string)
↓ (JNI)
Rust: cstr_ptr_or_null() → *const c_char (null if empty)
↓
Kreuzberg FFI: Treat null as "auto-detect from path"
Convention: Optional MIME type as empty string in Kotlin, null pointer in FFI
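The empty-string-to-null convention can be sketched in Rust as follows. The signature of cstr_ptr_or_null here is an assumption (the real helper in the shim may differ), and optional_cstring is a hypothetical companion that applies the "empty means not provided" rule.

```rust
use std::ffi::CString;
use std::os::raw::c_char;
use std::ptr;

/// Applies the JNI-side convention: an empty string means "not provided".
/// Hypothetical helper; fails if the string contains an interior NUL byte.
fn optional_cstring(s: &str) -> Result<Option<CString>, String> {
    if s.is_empty() {
        return Ok(None);
    }
    CString::new(s)
        .map(Some)
        .map_err(|e| format!("invalid C string: {}", e))
}

/// Returns a pointer to the C string, or null when no value was supplied.
/// The pointer borrows from `value`, so the CString must outlive the FFI call.
fn cstr_ptr_or_null(value: Option<&CString>) -> *const c_char {
    value.map_or(ptr::null(), |s| s.as_ptr())
}
```

Keeping the Option<CString> in a local variable for the whole FFI call is the important part: taking the pointer from a temporary would leave it dangling.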
4. File Path Resolution (E2E & Production)
Kotlin parameter: content: String (could be path OR UTF-8 bytes)
↓
loadBytesFromPathOrUtf8(content)
├─ Search test_documents/ / fixtures/ dirs
├─ Check KREUZBERG_TEST_DOCUMENTS_DIR env var
└─ Fall back to UTF-8 bytes of string
↓
ByteArray (ready for Base64 encoding)
Convention: All byte parameters support both path and inline content; walk directories for tests, fall back to bytes for production
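The resolution order above can be sketched as below. It is written in Rust to stay consistent with the shim code; the actual helper lives in Kreuzberg.kt, and the lookup order (environment directory first, then each ancestor's test_documents/ and fixtures/) is an assumption, as are the function names.

```rust
use std::env;
use std::fs;
use std::path::{Path, PathBuf};

/// Lists directories to probe: the optional override directory first,
/// then test_documents/ and fixtures/ under `start` and each of its parents.
fn candidate_roots(start: &Path, env_dir: Option<PathBuf>) -> Vec<PathBuf> {
    let mut roots = Vec::new();
    if let Some(dir) = env_dir {
        roots.push(dir);
    }
    for ancestor in start.ancestors() {
        roots.push(ancestor.join("test_documents"));
        roots.push(ancestor.join("fixtures"));
    }
    roots
}

/// Returns the file's bytes if `path_or_content` names a fixture under one of
/// the candidate roots; otherwise returns the UTF-8 bytes of the string itself.
fn load_bytes_from_path_or_utf8(
    path_or_content: &str,
    start: &Path,
    env_dir: Option<PathBuf>,
) -> Vec<u8> {
    for root in candidate_roots(start, env_dir) {
        let candidate = root.join(path_or_content);
        if candidate.is_file() {
            if let Ok(bytes) = fs::read(&candidate) {
                return bytes;
            }
        }
    }
    path_or_content.as_bytes().to_vec()
}

/// Convenience wrapper using the process CWD and KREUZBERG_TEST_DOCUMENTS_DIR.
fn load_bytes_default(path_or_content: &str) -> Vec<u8> {
    let start = env::current_dir().unwrap_or_else(|_| PathBuf::from("."));
    let env_dir = env::var_os("KREUZBERG_TEST_DOCUMENTS_DIR").map(PathBuf::from);
    load_bytes_from_path_or_utf8(path_or_content, &start, env_dir)
}
```

One consequence of this design is worth keeping in mind: an inline payload that happens to equal a fixture's relative path is read as a file, so the fallback is ambiguous by construction.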
5. Exception Handling
Rust FFI returns:
- NULL pointer on failure
- Valid pointer on success
JNI handler:
  if result.is_null() {
      let msg = get_ffi_error_message(); // kreuzberg_last_error_context()
      throw_exception(env, &msg);
      return null_or_zero();
  }
Convention: Check every FFI return; wire last_error_code() + last_error_context() on every throw
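A self-contained sketch of the check-then-report flow, without the jni crate. Everything here is a stand-in: the thread-local slot imitates the FFI layer's last-error state, ffi_parse_number plays the role of a null-on-failure FFI call, and returning Err(String) stands in for throwing a Java exception.

```rust
use std::cell::RefCell;

thread_local! {
    // Stand-in for the FFI layer's last-error slot (hypothetical).
    static LAST_ERROR: RefCell<Option<(i32, String)>> = RefCell::new(None);
}

fn set_last_error(code: i32, context: &str) {
    LAST_ERROR.with(|slot| *slot.borrow_mut() = Some((code, context.to_string())));
}

/// Takes (and clears) the last error, combining code and context in one message.
fn get_ffi_error_message() -> String {
    LAST_ERROR.with(|slot| match slot.borrow_mut().take() {
        Some((code, context)) => format!("FFI error {}: {}", code, context),
        None => "unknown FFI error".to_string(),
    })
}

/// Stand-in FFI call: null on failure, an owned heap pointer on success.
fn ffi_parse_number(input: &str) -> *mut i64 {
    match input.trim().parse::<i64>() {
        Ok(n) => Box::into_raw(Box::new(n)),
        Err(e) => {
            set_last_error(2, &e.to_string());
            std::ptr::null_mut()
        }
    }
}

/// Shim-side wrapper: check the pointer before use, report the FFI error on
/// failure, and release the allocation exactly once on success.
fn call_checked(input: &str) -> Result<i64, String> {
    let ptr = ffi_parse_number(input);
    if ptr.is_null() {
        return Err(get_ffi_error_message());
    }
    // SAFETY: `ptr` is non-null and came from Box::into_raw in ffi_parse_number;
    // it is converted back into a Box exactly once, which frees it.
    let value = unsafe { *Box::from_raw(ptr) };
    Ok(value)
}
```

In the real shim the Err branch would call throw_exception and return the null or zero value for the JNI return type, since a pending Java exception does not unwind the native function.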
6. Sealed Class Serialization
Rust: #[derive(serde::Serialize)]
pub enum OutputFormat {
Plain, Markdown, Custom(String), ...
}
JSON: "plain" or "markdown" or "custom_name"
Kotlin (custom serializer):
when (tag) {
"plain" -> OutputFormat.Plain
"markdown" -> OutputFormat.Markdown
else -> OutputFormat.Custom(tag)
}
Convention: Sealed classes in Kotlin use custom (de)serializers that accept Rust discriminant strings
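The tag mapping on both sides of the wire amounts to a pair of functions like the following Rust sketch. The variant set is taken from the diagram above; as_tag and from_tag are hypothetical names, and the case-insensitive match mirrors the tag.lowercase() call in the Kotlin template rather than anything serde does by default.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum OutputFormat {
    Plain,
    Markdown,
    Custom(String),
}

impl OutputFormat {
    /// Tag written to the wire for this variant.
    fn as_tag(&self) -> &str {
        match self {
            OutputFormat::Plain => "plain",
            OutputFormat::Markdown => "markdown",
            OutputFormat::Custom(name) => name.as_str(),
        }
    }

    /// Parses a wire tag. Known tags match case-insensitively;
    /// anything else is kept verbatim as a custom format.
    fn from_tag(tag: &str) -> OutputFormat {
        match tag.to_ascii_lowercase().as_str() {
            "plain" => OutputFormat::Plain,
            "markdown" => OutputFormat::Markdown,
            _ => OutputFormat::Custom(tag.to_string()),
        }
    }
}
```

A bare-string encoding like this cannot distinguish Custom("plain") from Plain, so custom names that collide with built-in tags do not round-trip; a generated serializer would need to reject or escape them.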
Conclusion
82/82 kotlin-android e2e tests are passing. All hand-edits fall into two categories:
- ALEF_GAP (10 entries): Template-level features alef doesn't yet generate for kotlin-android
- ROOT_CAUSE (1 entry): FFI infrastructure fix (rlib crate-type)
No binding bugs or test fixture issues were found. The hand-edits are production-ready and provide a concrete specification for alef kotlin-android template upstreaming.
Next Step: Upstream each ALEF_GAP into alef's kotlin-android binding template using the patterns documented above.
Cleanup Session — May 30, 2026
Changes Applied
1. Replace Hand-Rolled base64_decode() with base64 Crate
File: crates/kreuzberg-jni/src/lib.rs
- Added: base64 = "0.22" to Cargo.toml dependencies
- Replaced: Manual Base64 alphabet mapping (lines 37–66) with:
  use base64::engine::general_purpose::STANDARD;
  use base64::Engine;

  fn base64_decode(input: &str) -> Result<Vec<u8>, String> {
      STANDARD.decode(input).map_err(|e| format!("Invalid Base64: {}", e))
  }
- Rationale: Eliminates 30 lines of hand-rolled code in favor of a well-tested, widely used crate
2. Improve Exception Handling Pattern
File: crates/kreuzberg-jni/src/lib.rs
- Refactored: Early return patterns to use return throw_exception(...) instead of:
  throw_exception(&mut env, &e);
  return std::ptr::null_mut();
- Scope: Fixed in nativeRenderPdfPageToPngImpl (lines 1121–1150)
- Impact: Cleaner code, no functional change (throw_exception already returns the null value)
3. Add SAFETY Comments to Critical Unsafe Blocks
File: crates/kreuzberg-jni/src/lib.rs
- Enhanced: get_ffi_error_message() (lines 53–67)
- Enhanced: nativeExtractBytesImpl() config parsing section (lines 164–172)
- Pattern: Each SAFETY comment documents invariants and null-check patterns
- Not Yet Complete: Future work to add SAFETY comments to all 70+ unsafe blocks
Bugs Found & Status
Confirmed Non-Issues
- JNI Exception Behavior: JNI exceptions are lazy: env.throw_new() doesn't immediately interrupt the JNI function. However, all call sites properly check for null returns and early-return, so pending exceptions are handled correctly before calling back into JNI.
- fixConfigSerialization() Deprecation: The audit notes flagged this for potential deprecation. Analysis shows:
  - OutputFormatSerializer now handles sealed class conversion (✓ working)
  - cancel_token removal still needed (Kotlin ExtractionConfig has the field, Rust doesn't)
  - Jackson's FAIL_ON_UNKNOWN_PROPERTIES = false provides defense-in-depth
  - Recommendation: Keep as-is; the defensive fix is zero-cost and will survive future alef generations
- Memory Leaks: All FFI pointers are properly freed:
  - kreuzberg_extraction_config_free() called on both success and error paths
  - kreuzberg_free_string() called on all JSON pointers from FFI
  - kreuzberg_extraction_result_free() called after serialization
  - kreuzberg_embedding_preset_free() called after use
No Active Bugs Found
- Type signature matching: All JNI function signatures match KreuzbergBridge.kt external declarations
- Exception handling: All throw sites properly propagate via JVM's exception state
- Null pointer checks: All FFI returns checked before use
- String ownership: All JString → Rust String conversions go through jstring_to_string() with error handling
Testing Status
- Target: 82/82 kotlin-android e2e tests per variant (debug + release = 164 total)
- Build: JNI shim compiles cleanly with no clippy warnings
- Verification: ✓ Full e2e test suite passed (164/164 tests, 0 failures)
- Commit: All cleanup applied and committed to HEAD (c5f192f3ff)