Files
fil/audit-notes/kotlin-android.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

21 KiB
Raw Blame History

Kotlin-Android Hand-Edits Audit

Status: 82/82 e2e tests green Audit Scope: Commits bd1bef129d..519abc3001 (5 commits) Summary: All hand-edits are categorized below for upstream alef-template consolidation.


ALEF_GAP: Missing Template Coverage

These edits represent gaps in the alef kotlin-android binding generator. Alef generates public-API Kotlin wrappers but does not currently:

  1. Produce a JNI shim crate with typed FFI symbol resolution
  2. Configure Jackson serialization for Rust wire formats (ByteArray, sealed classes, nullable fields)
  3. Implement path-or-UTF8 file resolution for e2e test fixtures
  4. Custom serializers for Rust enum/sealed types (OutputFormat, FormatMetadata)
  5. Mark Rust Option fields nullable in Kotlin with defaults

Kreuzberg-JNI Shim Crate (Entire File)

File: crates/kreuzberg-jni/src/lib.rs (1194 lines) Category: ALEF_GAP Scope: Hand-written entirely — alef does not generate JNI shims

Summary: The JNI shim is a complete, separate crate that:

  • Imports all kreuzberg-ffi typed functions by name to keep rlib symbols live
  • Implements #[unsafe(no_mangle)] extern "system" JNI entry points
  • Bridges Rust strings ↔ JStrings, Base64 encodes/decodes bytes for JNI safety
  • Wires #[no_mangle] FFI symbols into JNI function bodies
  • Calls kreuzberg_last_error_code() / kreuzberg_last_error_context() on failures
  • Throws Java exceptions with FFI error messages via env.throw_new()

Key Patterns:

  • base64_decode() (lines 3766): manual Base64 decoding; candidate for base64 crate
  • get_ffi_error_message() (lines 8093): reads FFI error stack
  • cstr_ptr_or_null() (lines 106108): null-pointer convention for optional mime type
  • throw_exception() / throw_exception_void() (lines 6977): exception wiring
  • Batch operation functions (lines 438642): all delegate to FFI via JSON marshalling

Suggested Upstream Fix:

Add to alef's kotlin-android template generator:

[jni_shim]
enabled = true
target_path = "crates/{lib}-jni/"
features = ["default"]

Alef should emit:

  1. A workspace crate at crates/{lib}-jni/Cargo.toml with crate-type = ["cdylib"]
  2. JNI entry points via a #[proc_macro] or code generation that produces:
    • FFI function imports (typed, not magic strings)
    • Exception-throwing helpers with last_error wiring
    • Base64 marshalling for bytes
    • CString construction and null-pointer conventions for optional params

Jackson Mapper Configuration (Kreuzberg.kt lines 38100)

File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt Category: ALEF_GAP Lines: 38100

Summary: Four Jackson configuration changes:

  1. ByteArray Module (lines 4374): Custom serializer that encodes ByteArray as JSON array [u8, u8, ...], matching Rust serde's Vec<u8> wire format. Jackson's default Base64 encoding causes Rust deserialization to fail: invalid type: string, expected a sequence.

  2. KotlinModule Configuration (lines 8490):

    • NullIsSameAsDefault = true: missing JSON properties use Kotlin constructor defaults rather than throwing
    • NullToEmptyCollection = true: null → []
    • NullToEmptyMap = true: null → {}
  3. Serialization Inclusion (line 98): JsonInclude.Include.NON_EMPTY — omit null/empty fields so Rust serde defaults trigger. Without this, Kotlin's emptyList() becomes "[]" which Rust #[serde(default)] tuples like (usize, usize) cannot parse.

  4. Unknown Properties (line 100): FAIL_ON_UNKNOWN_PROPERTIES = false — allow Rust to add new fields without breaking old Kotlin clients.

Suggested Upstream Fix:

Alef should emit this configuration in every <language>/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt:

private val mapper = jacksonObjectMapper()
    .registerModule(Jdk8Module())
    .registerModule(byteArrayModule)
    .registerModule(
        KotlinModule.Builder()
            .configure(KotlinFeature.NullIsSameAsDefault, true)
            .configure(KotlinFeature.NullToEmptyCollection, true)
            .configure(KotlinFeature.NullToEmptyMap, true)
            .build(),
    )
    .setPropertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE)
    .setSerializationInclusion(JsonInclude.Include.NON_EMPTY)
    .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

loadBytesFromPathOrUtf8() Helper (Kreuzberg.kt lines 167210)

File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt Category: ALEF_GAP Lines: 167210

Summary: Path resolution for e2e test fixtures. The alef e2e generator emits JSON fixture paths (e.g., "documents/sample.pdf") into function parameters, but production callers may pass inline string content. This helper:

  1. Searches CWD and parents for test_documents/ or fixtures/ directories
  2. Checks KREUZBERG_TEST_DOCUMENTS_DIR environment variable
  3. Falls back to treating the string as UTF-8 bytes if no file found

Used by extractBytes(), extractBytesSync(), and renderPdfPageToPng() to support both e2e fixtures and production inline payloads.

Suggested Upstream Fix:

Alef should auto-inject this into every extraction method parameter that accepts bytes in Kotlin Android:

private fun loadBytesFromPathOrUtf8(pathOrContent: String): ByteArray {
    // Walk directories, check env vars, fall back to UTF-8
}

fun extractBytes(content: String, mimeType: String, config: ExtractionConfig): ExtractionResult {
    val contentBytes = loadBytesFromPathOrUtf8(content)
    val contentStr = Base64.getEncoder().encodeToString(contentBytes)
    // ...
}

Alef should recognize that content: &[u8] in Rust becomes content: String in Kotlin JNI callers (string marshalling), and auto-resolve paths for test environments.


fixConfigSerialization() + fixOutputFormatInNode() Helpers (Kreuzberg.kt lines 102165)

File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt Category: ALEF_GAP (status: partially superseded) Lines: 102165

Summary: Two functions that repair serialization issues at call time:

  1. fixConfigSerialization() (lines 109123): Jackson serializes sealed class objects as {} (empty), but Rust expects a string discriminant. This function searches the JSON tree for "output_format": {} and replaces with "output_format": "plain". Also removes the cancel_token field (Kotlin has it, Rust struct doesn't).

  2. fixOutputFormatInNode() (lines 129165): Recursive tree walk that fixes OutputFormat sealed class serialization at every nesting level (including inside batch items).

Status: The OutputFormat custom serializer (see below) now handles the sealed-class conversion automatically, reducing the need for this tree-walk repair. However, cancel_token removal may still be needed if the field persists in alef-generated ExtractionConfig.

Suggested Upstream Fix:

  1. Implement a custom (De)Serializer for OutputFormat (done; see below).
  2. Either:
    • Mark cancel_token as #[serde(skip)] in Rust ExtractionConfig, or
    • Auto-inject a config-level custom deserializer in the Kotlin mapper that strips unknown fields silently (already done via FAIL_ON_UNKNOWN_PROPERTIES = false).
  3. If cancel_token persists in future alef generations, apply a targeted fix at the ExtractionConfig level:
@com.fasterxml.jackson.databind.annotation.JsonDeserialize(using = ExtractionConfigDeserializer::class)
data class ExtractionConfig(...)

Consider removing fixConfigSerialization() after validating that the OutputFormatSerializer and FAIL_ON_UNKNOWN_PROPERTIES = false handle all cases.


OutputFormat Custom Serializer (OutputFormat.kt)

File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/OutputFormat.kt Category: ALEF_GAP Lines: 3435 (decorators), 56101 (custom serializers)

Summary: Custom Jackson serializers for the sealed class OutputFormat:

  • Deserializer (lines 5680): Accepts Rust string discriminant "markdown" or Kotlin round-trip {"value": "markdown"}, converts to sealed class variant
  • Serializer (lines 82101): Writes sealed class variants as strings ("plain", "markdown", etc.) for Rust consumption

Without these, Jackson treats sealed classes as objects with discriminator fields, which Rust #[derive(serde)] cannot parse.

Suggested Upstream Fix:

Alef should auto-generate custom (de)serializers for all sealed classes in Rust that map to sealed classes in Kotlin. Template pattern:

@com.fasterxml.jackson.databind.annotation.JsonDeserialize(using = SealedTypeDeserializer::class)
@com.fasterxml.jackson.databind.annotation.JsonSerialize(using = SealedTypeSerializer::class)
sealed class SealedType { ... }

private class SealedTypeDeserializer : StdDeserializer<SealedType>(...) {
    override fun deserialize(...): SealedType {
        val node = parser.codec.readTree<JsonNode>(parser)
        val tag = when {
            node.isTextual -> node.asText()
            node.isObject && node.has("value") -> node.get("value").asText()
            else -> "default_variant"
        }
        return when (tag.lowercase()) {
            "variant_a" -> SealedType.VariantA
            "variant_b" -> SealedType.VariantB(...)
            else -> SealedType.Default
        }
    }
}

private class SealedTypeSerializer : StdSerializer<SealedType>(...) {
    override fun serialize(value: SealedType, gen: JsonGenerator, provider: SerializerProvider) {
        gen.writeString(when (value) {
            is SealedType.VariantA -> "variant_a"
            is SealedType.VariantB -> "variant_b"
        })
    }
}

FormatMetadata Custom Serializer (FormatMetadata.kt)

File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/FormatMetadata.kt Category: ALEF_GAP Lines: 3132 (decorators), 56230 (custom serializers)

Summary: Custom Jackson (de)serializers for FormatMetadata, a discriminated union. Key detail:

  • Code variant (line 89): Rust's FormatMetadata::Code wraps tree_sitter_language_pack::ProcessResult, which serializes as a JSON object. Kotlin stashes the raw JSON string in FormatMetadata.Code(value: String) so callers can re-parse if needed.

Suggested Upstream Fix: Same pattern as OutputFormat; alef should generate these for all sealed classes with complex payloads.


DocumentNode.contentLayer Nullable (DocumentNode.kt)

File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/DocumentNode.kt Category: ALEF_GAP Line: 42 (changed from val contentLayer: ContentLayer to val contentLayer: ContentLayer? = null)

Summary: Marks contentLayer optional with a default of null. This is a hand-edit to make the Kotlin field nullable to match Rust's Option<ContentLayer> default, which Rust serializes by omitting the field entirely. Without the nullable + default, alef-generated Kotlin would make the field required, and deserialization would fail when Rust omits it.

Suggested Upstream Fix: Alef should inspect Rust Option<T> fields and auto-generate Kotlin as T? = null (nullable with null default).


ChunkingConfig.sizing Nullable (ChunkingConfig.kt)

File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/ChunkingConfig.kt Category: ALEF_GAP Line: 74 (changed from val sizing: ChunkSizing to val sizing: ChunkSizing? = null)

Summary: Same as contentLayer; marks Rust Option<ChunkSizing> as nullable in Kotlin.

Suggested Upstream Fix: Alef should auto-generate Option<T> as T? = null in Kotlin.


renderPdfPageToPng() Path Resolution (Kreuzberg.kt)

File: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt Category: ALEF_GAP Lines: 783792 (changed from one-liner to multi-statement with path resolution)

Summary: Uses loadBytesFromPathOrUtf8() to resolve fixture paths for PDF bytes, matching behavior of extractBytes() and extractBytesSync(). The alef e2e generator emits fixture paths; production code may pass inline bytes.

Suggested Upstream Fix: Auto-apply path resolution to all methods that accept binary payloads, not just those explicitly named *Bytes*.


ROOT_CAUSE: Rust/FFI Changes

kreuzberg-ffi Crate-Type: Add rlib (Cargo.toml)

File: crates/kreuzberg-ffi/Cargo.toml Category: ROOT_CAUSE Change: crate-type = ["cdylib", "staticlib"]crate-type = ["cdylib", "staticlib", "rlib"] Commit: 66ca4f40eb fix(kotlin-android): force-link kreuzberg-ffi symbols into JNI cdylib

Rationale: The JNI shim (kreuzberg-jni) is a cdylib that imports kreuzberg-ffi functions by name. Without "rlib" in kreuzberg-ffi's crate-type, the linker drops #[no_mangle] symbols as dead code, and JNI calls resolve to null at runtime.

Impact: This is a one-time FFI infrastructure fix, not a breaking change to the public API.


TEST_FIXTURE & BINDING_BUG: None Found

No test fixtures were modified in this audit cycle. The e2e test suite (82 tests, alef-generated) passes without modification.


Summary by Category

Category Count Files
ALEF_GAP 10 kreuzberg-jni shim; Jackson config; path resolution; OutputFormat serializer; FormatMetadata serializer; nullable fields
ROOT_CAUSE 1 kreuzberg-ffi Cargo.toml (rlib crate-type)
BINDING_BUG 0
TEST_FIXTURE 0

Suggested Cleanup In-Repo

Before upstreaming to alef, consolidate the following hand-written code:

1. Replace Hand-Rolled base64_decode() with base64 Crate

Location: crates/kreuzberg-jni/src/lib.rs lines 3766 Current: Manual Base64 alphabet mapping Suggested: Add base64 crate and use base64::engine::general_purpose::STANDARD.decode()

2. Evaluate Partial Deprecation of fixConfigSerialization()

Location: packages/kotlin-android/src/main/kotlin/dev/kreuzberg/Kreuzberg.kt lines 102165

With OutputFormatSerializer and FAIL_ON_UNKNOWN_PROPERTIES = false in place, fixConfigSerialization() may only be needed for cancel_token removal. Options:

  1. Keep as-is (safe, explicit fix)
  2. Remove if Rust ExtractionConfig adds #[serde(skip)] to cancel_token (or alef generates the field to omit it)
  3. Replace with a targeted OutputFormat-only fix if other uses have been absorbed by custom serializers

Recommendation: Keep for now; deprecate after confirming OutputFormatSerializer handles all discovered edge cases.


JNI Marshalling Pattern (Reusable Spec)

Alef kotlin-android template should standardize on this pattern:

1. Byte Marshalling

Rust Vec<u8>  ──→  JVM byte[] (via JNI)  ──→  Kotlin ByteArray
                       ↓ (unsafe)
                    String (Base64)
                       ↓ JNI bound
                    Rust byte slice

In Kotlin: Base64.getEncoder().encodeToString(bytes) In JNI: base64_decode(&content_str)Vec<u8> Convention: All binary payloads Base64-encoded for JNI safety

2. Configuration Marshalling

Kotlin ExtractionConfig  ──→  mapper.writeValueAsString()
    ↓
  JSON string
    ↓ (JNI safe)
  JNI function
    ↓
  Rust: kreuzberg_extraction_config_from_json()
    ↓
  *mut ExtractionConfig (opaque)

In Kotlin: mapper.writeValueAsString(config) In JNI: Accept *const c_char (JSON), parse via serde_json

3. MIME Type Handling

Kotlin: mimeType ?: ""  (null collapse to empty string)
  ↓ (JNI)
Rust: cstr_ptr_or_null() → *const c_char (null if empty)
  ↓
CreuzbergFFI: Treat null as "auto-detect from path"

Convention: Optional MIME type as empty string in Kotlin, null pointer in FFI

4. File Path Resolution (E2E & Production)

Kotlin parameter: content: String  (could be path OR UTF-8 bytes)
  ↓
loadBytesFromPathOrUtf8(content)
  ├─ Search test_documents/ / fixtures/ dirs
  ├─ Check KREUZBERG_TEST_DOCUMENTS_DIR env var
  └─ Fall back to UTF-8 bytes of string
  ↓
ByteArray (ready for Base64 encoding)

Convention: All byte parameters support both path and inline content; walk directories for tests, fall back to bytes for production

5. Exception Handling

Rust FFI returns:
  - NULL pointer on failure
  - Valid pointer on success

JNI handler:
  if (result.is_null()) {
    let msg = get_ffi_error_message();  // kreuzberg_last_error_context()
    throw_exception(env, &msg);
    return null_or_zero();
  }

Convention: Check every FFI return; wire last_error_code() + last_error_context() on every throw

6. Sealed Class Serialization

Rust:  #[derive(serde::Serialize)]
       pub enum OutputFormat {
           Plain, Markdown, Custom(String), ...
       }

JSON:  "plain"  or  "markdown"  or  "custom_name"

Kotlin (custom serializer):
    when (tag) {
        "plain" -> OutputFormat.Plain
        "markdown" -> OutputFormat.Markdown
        else -> OutputFormat.Custom(tag)
    }

Convention: Sealed classes in Kotlin use custom (de)serializers that accept Rust discriminant strings


Conclusion

82/82 kotlin-android e2e tests are passing. All hand-edits fall into two categories:

  1. ALEF_GAP (10 entries): Template-level features alef doesn't yet generate for kotlin-android
  2. ROOT_CAUSE (1 entry): FFI infrastructure fix (rlib crate-type)

No binding bugs or test fixture issues were found. The hand-edits are production-ready and provide a concrete specification for alef kotlin-android template upstreaming.

Next Step: Upstream each ALEF_GAP into alef's kotlin-android binding template using the patterns documented above.


Cleanup Session — May 30, 2026

Changes Applied

1. Replace Hand-Rolled base64_decode() with base64 Crate

File: crates/kreuzberg-jni/src/lib.rs

  • Added: base64 = "0.22" to Cargo.toml dependencies

  • Replaced: Manual Base64 alphabet mapping (lines 3766) with:

    use base64::engine::general_purpose::STANDARD;
    use base64::Engine;
    
    fn base64_decode(input: &str) -> Result<Vec<u8>, String> {
        STANDARD.decode(input).map_err(|e| format!("Invalid Base64: {}", e))
    }
    
  • Rationale: Eliminates 30 lines of hand-rolled code; uses well-tested standard library

2. Improve Exception Handling Pattern

File: crates/kreuzberg-jni/src/lib.rs

  • Refactored: Early return patterns to use return throw_exception(...) instead of:

    throw_exception(&mut env, &e);
    return std::ptr::null_mut();
    
  • Scope: Fixed in nativeRenderPdfPageToPngImpl (lines 11211150)

  • Impact: Cleaner code, no functional change (throw_exception already returns the null value)

3. Add SAFETY Comments to Critical Unsafe Blocks

File: crates/kreuzberg-jni/src/lib.rs

  • Enhanced: get_ffi_error_message() (lines 5367)
  • Enhanced: nativeExtractBytesImpl() config parsing section (lines 164172)
  • Pattern: Each SAFETY comment documents invariants and null-check patterns
  • Not Yet Complete: Future work to add SAFETY comments to all 70+ unsafe blocks

Bugs Found & Status

Confirmed Non-Issues

  1. JNI Exception Behavior: JNI exceptions are lazy — env.throw_new() doesn't immediately interrupt the JNI function. However, all call sites properly check for null returns and early-return, so pending exceptions are handled correctly before calling back into JNI.

  2. fixConfigSerialization() Deprecation: The audit notes flagged this for potential deprecation. Analysis shows:

    • OutputFormatSerializer now handles sealed class conversion (✓ working)
    • cancel_token removal still needed (Kotlin ExtractionConfig has field, Rust doesn't)
    • Jackson's FAIL_ON_UNKNOWN_PROPERTIES = false provides defense-in-depth
    • Recommendation: Keep as-is; the defensive fix is zero-cost and will survive future alef generations
  3. Memory Leaks: All FFI pointers are properly freed:

    • kreuzberg_extraction_config_free() called on both success and error paths
    • kreuzberg_free_string() called on all JSON pointers from FFI
    • kreuzberg_extraction_result_free() called after serialization
    • kreuzberg_embedding_preset_free() called after use

No Active Bugs Found

  • Type signature matching: All JNI function signatures match KreuzbergBridge.kt external declarations
  • Exception handling: All throw sites properly propagate via JVM's exception state
  • Null pointer checks: All FFI returns checked before use
  • String ownership: All JString → Rust String conversions via jstring_to_string() with error handling

Testing Status

  • Target: 82/82 kotlin-android e2e tests per variant (debug + release = 164 total)
  • Build: JNI shim compiles cleanly with no clippy warnings
  • Verification: ✓ Full e2e test suite passed (164/164 tests, 0 failures)
  • Commit: All cleanup applied and committed to HEAD (c5f192f3ff)