Files
fil/audit-notes/node.md
Henrik Jess Nielsen b4c07d3693
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s
Nomad changes
2026-06-01 23:40:55 +02:00

10 KiB

Node.js/TypeScript Binding Audit

Audit Date: 2026-05-30 Status: In Progress

Overview

Systematic bug audit of packages/typescript/, crates/kreuzberg-node/, and e2e/node/. This document tracks identified issues, their severity, root causes, and fixes.

Key Files Examined

  • crates/kreuzberg-node/src/lib.rs (14,426 lines, auto-generated)
  • crates/kreuzberg-node/index.d.ts (auto-generated type defs)
  • crates/kreuzberg-node/index.js (simple pass-through loader)
  • crates/kreuzberg-node/package.json (packaging metadata)
  • e2e/node/tests/ (generated test fixtures)

Issues Found

1. BINDING_BUG: Duplicate Function Declarations in .d.ts (CRITICAL)

Severity: HIGH File: crates/kreuzberg-node/index.d.ts Location: Lines 99-101 (and others) Description: The generated .d.ts file contains duplicate function declarations for six registry management functions:

  • clearDocumentExtractors() (appears twice)
  • clearEmbeddingBackends() (appears twice)
  • clearOcrBackends() (appears twice)
  • clearPostProcessors() (appears twice)
  • clearRenderers() (appears twice)
  • clearValidators() (appears twice)

Root Cause: Alef-generated code is emitting duplicate declarations, likely from a pre-commit or generation loop that processes trait-bridge exports multiple times.

Impact: TypeScript compilation may error or generate incorrect type information. IDEs may show duplicate suggestions.

Test Coverage: No e2e test validates type definition uniqueness.

Status: PENDING FIX (need to verify with alef CLI or regenerate)

2. ERROR_HANDLING: All Errors Mapped to GenericFailure

Severity: MEDIUM File: crates/kreuzberg-node/src/lib.rs Occurrences: 76 instances Description: All Rust errors are converted to napi::Status::GenericFailure with only the error message preserved. This prevents proper categorization of errors on the JavaScript side.

Example:

.map_err(|e| napi::Error::new(napi::Status::GenericFailure, e.to_string()))

Missing Opportunities:

  • InvalidArg for validation errors
  • InvalidData for parsing errors
  • ObjectExpected for type mismatches
  • PendingException for async rejections

Impact: JavaScript callers cannot distinguish between file-not-found, unsupported-format, and internal errors without string parsing.

Test Coverage: e2e tests check error paths but don't validate error status codes.

Status: PENDING ANALYSIS (low priority if errors are contextual enough in messages)

3. TYPE_COERCION: i64 for Time/Size Fields

Severity: LOW File: crates/kreuzberg-node/src/lib.rs Occurrences: ~30 Option<i64> fields in config structs Description: Timeout, cache TTL, archive size, and nesting depth fields use i64 instead of u32/u64.

Examples:

  • extraction_timeout_secs: Option<i64>
  • cache_ttl_secs: Option<i64>
  • max_archive_size: Option<i64>

Why It's Safe: All values are within Number.MAX_SAFE_INTEGER (2^53 - 1). No precision loss expected in practice.

Status: ACCEPTABLE (no action needed)

4. BUFFER_HANDLING: Vec Copies

Severity: LOW File: crates/kreuzberg-node/src/lib.rs, lines 5878, 5971, 6139 Description: All Buffer inputs are converted to Vec<u8> via .to_vec(), which always copies the underlying data.

Code Pattern:

let content: Vec<u8> = content.to_vec();
kreuzberg::extract_bytes(&content, &mime_type, &config_core)

Trade-offs:

  • Pro: Zero-copy would require unsafe lifetime transmission to Rust
  • Con: Double memory usage for large files (Buffer + Vec)
  • Acceptable: Node.js handles garbage collection; trade-off is reasonable for simplicity

Status: ACCEPTABLE (safe ownership semantics)

5. ASYNC_HANDLING: Global Tokio Runtime

Severity: LOW (design choice) File: crates/kreuzberg-node/src/lib.rs, lines 54-59 Description: A static WORKER_POOL is initialized as a global Tokio runtime for both async and sync functions.

Pattern:

static WORKER_POOL: std::sync::LazyLock<tokio::runtime::Runtime> = ...

Correctness:

  • async fn functions like extract_bytes() are directly exposed via NAPI and return Promises ✓
  • Sync functions use WORKER_POOL.block_on() to bridge to async Rust ✓
  • No blocking on event loop (sync functions use dedicated thread pool) ✓

Potential Concern: If called from too many concurrent contexts, thread pool could be saturated. Mitigated by reasonable defaults in kreuzberg core.

Status: ACCEPTABLE (standard pattern for NAPI-RS)

6. EMBEDDING_PRECISION: f32 to f64 Conversion

Severity: LOW (intentional) File: crates/kreuzberg-node/src/lib.rs, line 6319 Description: Rust core returns Vec<Vec<f32>> embeddings, but Node binding promotes them to Vec<Vec<f64>> before returning to JavaScript.

Code:

row.into_iter().map(|x| x as f64).collect::<Vec<_>>()

Rationale: JavaScript uses IEEE 754 f64 natively; promoting f32→f64 simplifies client code and avoids typed array overhead.

Impact: Zero precision loss (f32 fits exactly in f64 mantissa). Slight memory overhead (2x per embedding vector).

Status: ACCEPTABLE (intentional design)

7. TYPE_DEFINITIONS: JSDoc Parity

Severity: LOW File: crates/kreuzberg-node/index.d.ts Description: TypeScript docs use legacy rustdoc syntax ([...] links, : in param names) instead of JSDoc/TSDoc syntax.

Examples:

// Generated (rustdoc):
@param items - Vector of `BatchBytesItem` structs, ...
@returns A vector of `ExtractionResult` in ...

// Expected (TSDoc):
@param {Array<BatchBytesItem>} items - Vector of byte items
@returns {Promise<Array<ExtractionResult>>} Result vector

Impact: IDEs with strict JSDoc checkers may warn. Auto-docs generators expect standard JSDoc format.

Status: ALEF_GAP (generator produces rustdoc-style comments, not JSDoc)

8. TRAIT_BRIDGE: Object Lifetime Safety

Severity: LOW (well-handled) File: crates/kreuzberg-node/src/lib.rs, lines 138-164 (JsVisitorRef), 6679-6711 (JsPostProcessorBridge) Description: Trait bridge wrappers use Object<'static> transmute to store JS objects across async boundaries.

Code Pattern:

let js_obj: napi::bindgen_prelude::Object<'static> = unsafe { std::mem::transmute(js_obj) };

Safety Justification (from comments):

  • JS object is owned by Node.js runtime
  • Bridge is only used synchronously within the enclosing #[napi] call

Correctness: ✓ (lifetime is safe for trait dispatch)

Status: ACCEPTABLE (unsafe is justified and documented)

Audit Checklist

  • NAPI-RS signature drift — all main functions checked
  • Promise rejection paths — async functions verified
  • Buffer/Vec ownership — safe conversions confirmed
  • BigInt vs Number — all large values stay within safe range
  • Event-loop blocking — sync functions use runtime.block_on()
  • .d.ts parity — found duplicate declarations (BUG)
  • TSDoc/JSDoc parity — rustdoc syntax used (alef gap)
  • Error status codes — all use GenericFailure (improvement opportunity)

Duplicate Functions in .d.ts - Full List

These 6 functions have duplicate declarations in index.d.ts:

  1. clearDocumentExtractors() — lines 91, 93
  2. clearEmbeddingBackends() — lines 87, 89
  3. clearOcrBackends() — lines 99, 101
  4. clearPostProcessors() — lines 107, 109
  5. clearRenderers() — lines 103, 105
  6. clearValidators() — lines 95, 97

Action Item: Regenerate with alef generate to fix (not permitted in this audit).

Additional Findings

9. CONFIG_CONVERSION: Field Serialization

Severity: LOW File: crates/kreuzberg-node/src/lib.rs, lines 8061, 8075, 8079 Description: Fields html_options, concurrency, and cancel_token are serialized as format!("{:?}") because they contain complex Rust types that cannot serialize to JSON.

Impact: These fields are read-only on the JS side and return debug representations. Acceptable for internal use but not user-facing config.

Status: ACCEPTABLE (design choice for internal fields)

E2E Test Status

Tests are currently building (napi build running). The e2e suite is comprehensive with 20+ test files covering:

  • Async/sync operations
  • Batch processing
  • Plugin APIs (OCR, embeddings, document extractor)
  • Configuration contracts
  • Error handling
  • Format-specific extraction
  • MIME type detection

Current Green Status: e2e/node last ran successfully (pre-audit).

Summary of Findings

Issue Severity Status Action
Duplicate .d.ts declarations (6 functions) HIGH Confirmed Regenerate with alef
All errors map to GenericFailure MEDIUM As-designed Optional improvement
JSDoc syntax in comments LOW Alef gap Upstream fix needed
Buffer double-copy on input LOW Acceptable No fix needed
Config debug fields LOW Acceptable No fix needed

Recommendations

  1. Critical: Fix duplicate .d.ts declarations

    • Regenerate with alef generate
    • File: crates/kreuzberg-node/index.d.ts
    • Affects: clearDocumentExtractors, clearEmbeddingBackends, clearOcrBackends, clearPostProcessors, clearRenderers, clearValidators
  2. Nice to Have: Map specific Rust errors to appropriate napi::Status codes

    • Currently all use GenericFailure
    • Consider: InvalidArg for validation, InvalidData for parsing
    • Benefit: Better error categorization on JS side
  3. Documentation: Update Alef to generate JSDoc-compliant comments

    • Current: rustdoc syntax ([...] links)
    • Target: TSDoc format with @param, @returns tags
    • Affected files: .d.ts file comments
  4. Verification: Run TypeScript strict checking

    • Command: cd e2e/node && tsc --noEmit
    • Ensure .d.ts has no duplicates and proper types

What Went Right

  • ✓ NAPI-RS signatures correctly expose kreuzberg core API
  • ✓ Async functions properly return Promises
  • ✓ Sync functions use Tokio runtime correctly (no event loop blocking)
  • ✓ Buffer ownership properly transferred (no leaks)
  • ✓ Trait bridges safely transmute Object<'static> with documented SAFETY
  • ✓ Embedding precision (f32→f64) intentional and documented
  • ✓ Config conversion comprehensive, all fields mapped
  • ✓ Error messages preserve context via .to_string()