# Node.js/TypeScript API Reference ## Overview **Package**: `@kreuzberg/node` — A high-performance TypeScript SDK built on a Rust core for document intelligence and content extraction. Supports both **ESM** (`import`) and **CommonJS** (`require`): ```typescript // ESM import { extractFile, batchExtractFiles } from "@kreuzberg/node"; // CommonJS const { extractFile, batchExtractFiles } = require("@kreuzberg/node"); ``` **Current Version**: 4.2.14 --- ## Core Extraction Functions All extraction functions return `ExtractionResult` containing extracted content, metadata, tables, and optional chunks/images. ### Single File Extraction #### `extractFile(filePath, mimeType?, config?): Promise` Extract content from a single file asynchronously. ```typescript import { extractFile } from "@kreuzberg/node"; // Auto-detect MIME type from file extension const result = await extractFile("document.pdf"); console.log(result.content); // Explicit MIME type const result2 = await extractFile("document.pdf", "application/pdf"); // With configuration const result3 = await extractFile("document.pdf", null, { chunking: { maxChars: 1000, maxOverlap: 200, }, }); ``` **Parameters**: - `filePath: string` — Path to the file to extract - `mimeType?: string | null` — Optional MIME type hint (auto-detect if null) - `config?: ExtractionConfig` — Optional extraction configuration **Returns**: `Promise` **Throws**: `ParsingError`, `OcrError`, `ValidationError`, `KreuzbergError` #### `extractFileSync(filePath, mimeType?, config?): ExtractionResult` Extract content from a single file synchronously. ```typescript import { extractFileSync } from "@kreuzberg/node"; const result = extractFileSync("document.pdf"); console.log(result.content); ``` **Parameters**: Same as `extractFile()` **Returns**: `ExtractionResult` --- ### Raw Bytes Extraction #### `extractBytes(data, mimeType, config?): Promise` Extract content from raw bytes (Buffer or Uint8Array) asynchronously. ```typescript import { extractBytes } from "@kreuzberg/node"; import { readFile } from "fs/promises"; const data = await readFile("document.pdf"); const result = await extractBytes(data, "application/pdf"); console.log(result.content); ``` **Parameters**: - `data: Buffer | Uint8Array` — Raw file content - `mimeType: string` — MIME type (required) - `config?: ExtractionConfig` — Optional configuration **Returns**: `Promise` #### `extractBytesSync(data, mimeType, config?): ExtractionResult` Extract content from raw bytes synchronously. ```typescript import { extractBytesSync } from "@kreuzberg/node"; import { readFileSync } from "fs"; const data = readFileSync("document.pdf"); const result = extractBytesSync(data, "application/pdf"); ``` **Parameters**: Same as `extractBytes()` **Returns**: `ExtractionResult` --- ### Batch Extraction (Recommended) For processing multiple documents, batch APIs provide superior performance and memory management. #### `batchExtractFiles(paths, config?): Promise` Extract content from multiple files in parallel (asynchronous). ```typescript import { batchExtractFiles } from "@kreuzberg/node"; const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]; const results = await batchExtractFiles(files); results.forEach((result, i) => { console.log(`${files[i]}: ${result.content.substring(0, 100)}...`); }); ``` **Parameters**: - `paths: string[]` — Array of file paths - `config?: ExtractionConfig` — Configuration (applied to all files) **Returns**: `Promise` — Results in same order as input #### `batchExtractFilesSync(paths, config?): ExtractionResult[]` Extract content from multiple files synchronously. ```typescript import { batchExtractFilesSync } from "@kreuzberg/node"; const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]; const results = batchExtractFilesSync(files); ``` **Parameters**: Same as `batchExtractFiles()` **Returns**: `ExtractionResult[]` #### `batchExtractBytes(dataList, mimeTypes, config?): Promise` Extract content from multiple byte arrays in parallel (asynchronous). ```typescript import { batchExtractBytes } from "@kreuzberg/node"; import { readFile } from "fs/promises"; const files = ["doc1.pdf", "doc2.docx"]; const dataList = await Promise.all(files.map((f) => readFile(f))); const mimeTypes = [ "application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ]; const results = await batchExtractBytes(dataList, mimeTypes); ``` **Parameters**: - `dataList: Uint8Array[]` — Array of file contents - `mimeTypes: string[]` — MIME types (one per item, must match length) - `config?: ExtractionConfig` — Configuration (applied to all items) **Returns**: `Promise` #### `batchExtractBytesSync(dataList, mimeTypes, config?): ExtractionResult[]` Extract content from multiple byte arrays synchronously. ```typescript import { batchExtractBytesSync } from "@kreuzberg/node"; import { readFileSync } from "fs"; const dataList = ["doc1.pdf", "doc2.docx"].map((f) => readFileSync(f)); const mimeTypes = [ "application/pdf", "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ]; const results = batchExtractBytesSync(dataList, mimeTypes); ``` **Parameters**: Same as `batchExtractBytes()` **Returns**: `ExtractionResult[]` #### `batchExtractFilesWithConfigs(paths, fileConfigs, config?): Promise` Extract multiple files with per-file configuration overrides (asynchronous). ```typescript const results = await batchExtractFilesWithConfigs( ["report.pdf", "scanned.pdf"], [null, { forceOcr: true, ocr: { backend: "tesseract", language: "deu" } }], ); ``` **Parameters**: - `paths: string[]` — File paths - `fileConfigs: (FileExtractionConfig | null)[]` — Per-file configs (null = use batch defaults) - `config?: ExtractionConfig` — Batch-level configuration #### `batchExtractFilesWithConfigsSync(paths, fileConfigs, config?): ExtractionResult[]` Synchronous variant. #### `batchExtractBytesWithConfigs(dataList, mimeTypes, fileConfigs, config?): Promise` Extract multiple byte arrays with per-file overrides (asynchronous). #### `batchExtractBytesWithConfigsSync(dataList, mimeTypes, fileConfigs, config?): ExtractionResult[]` Synchronous variant. --- ## Worker Pool APIs Worker pools enable concurrent extraction using Node.js worker threads for CPU-bound processing. ### `createWorkerPool(size?): WorkerPool` Create a worker pool for concurrent extraction. ```typescript import { createWorkerPool } from "@kreuzberg/node"; // Create pool with default size (number of CPU cores) const pool = createWorkerPool(); // Create pool with specific size const pool4 = createWorkerPool(4); ``` **Parameters**: - `size?: number` — Number of workers (defaults to CPU core count) **Returns**: `WorkerPool` — Opaque handle for use with worker extraction functions ### `extractFileInWorker(pool, filePath, mimeType?, config?): Promise` Extract a single file using a worker from the pool. ```typescript import { createWorkerPool, extractFileInWorker, closeWorkerPool } from "@kreuzberg/node"; const pool = createWorkerPool(4); try { const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"]; const results = await Promise.all(files.map((f) => extractFileInWorker(pool, f))); results.forEach((r, i) => { console.log(`${files[i]}: ${r.content.substring(0, 100)}...`); }); } finally { await closeWorkerPool(pool); } ``` **Parameters**: - `pool: WorkerPool` — Worker pool instance - `filePath: string` — File path - `mimeType?: string | null` — Optional MIME type - `config?: ExtractionConfig` — Optional configuration **Returns**: `Promise` ### `batchExtractFilesInWorker(pool, paths, config?): Promise` Extract multiple files using the worker pool for concurrent processing. ```typescript import { createWorkerPool, batchExtractFilesInWorker, closeWorkerPool } from "@kreuzberg/node"; const pool = createWorkerPool(4); try { const files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"]; const results = await batchExtractFilesInWorker(pool, files, { ocr: { backend: "tesseract", language: "eng" }, }); const total = results.reduce((sum, r) => sum + extractAmount(r.content), 0); console.log(`Total: $${total}`); } finally { await closeWorkerPool(pool); } ``` **Parameters**: - `pool: WorkerPool` — Worker pool instance - `paths: string[]` — File paths - `config?: ExtractionConfig` — Configuration (applied to all files) **Returns**: `Promise` ### `getWorkerPoolStats(pool): WorkerPoolStats` Get statistics about a worker pool. ```typescript import { createWorkerPool, getWorkerPoolStats } from "@kreuzberg/node"; const pool = createWorkerPool(4); const stats = getWorkerPoolStats(pool); console.log(`Pool size: ${stats.size}`); console.log(`Active workers: ${stats.activeWorkers}`); console.log(`Queued tasks: ${stats.queuedTasks}`); ``` **Parameters**: - `pool: WorkerPool` — Worker pool instance **Returns**: `WorkerPoolStats` ### `closeWorkerPool(pool): Promise` Close a worker pool and shut down all worker threads. ```typescript import { createWorkerPool, closeWorkerPool } from "@kreuzberg/node"; const pool = createWorkerPool(4); try { // Use pool } finally { await closeWorkerPool(pool); } ``` **Parameters**: - `pool: WorkerPool` — Worker pool instance to close **Returns**: `Promise` --- ## Configuration Interface ### `ExtractionConfig` Main configuration object controlling extraction behavior. ```typescript interface ExtractionConfig { // Caching and processing useCache?: boolean; // Default: true enableQualityProcessing?: boolean; // Default: false // OCR configuration ocr?: OcrConfig; // OCR settings forceOcr?: boolean; // Default: false // Document processing chunking?: ChunkingConfig; // Break into chunks images?: ImageExtractionConfig; // Image extraction pdfOptions?: PdfConfig; // PDF-specific options tokenReduction?: TokenReductionConfig; // Token optimization languageDetection?: LanguageDetectionConfig; // Language detection postprocessor?: PostProcessorConfig; // Post-processing htmlOptions?: HtmlConversionOptions; // HTML conversion keywords?: KeywordConfig; // Keyword extraction pages?: PageExtractionConfig; // Page extraction // Output control maxConcurrentExtractions?: number; // Default: 4 outputFormat?: "plain" | "markdown" | "djot" | "html"; // Default: 'plain' resultFormat?: "unified" | "element_based"; // Default: 'unified' } ``` ### `FileExtractionConfig` Per-file overrides for batch operations. All fields optional (omitted = use batch default). ```typescript interface FileExtractionConfig { enableQualityProcessing?: boolean; ocr?: OcrConfig; forceOcr?: boolean; chunking?: ChunkingConfig; images?: ImageExtractionConfig; pdfOptions?: PdfConfig; tokenReduction?: TokenReductionConfig; languageDetection?: LanguageDetectionConfig; pages?: PageExtractionConfig; keywords?: KeywordConfig; postprocessor?: PostProcessorConfig; outputFormat?: "plain" | "markdown" | "djot" | "html"; resultFormat?: "unified" | "element_based"; includeDocumentStructure?: boolean; } ``` Excluded (batch-level only): `maxConcurrentExtractions`, `useCache`, `securityLimits`. ### `ChunkingConfig` Configuration for breaking documents into chunks (useful for RAG and vector databases). ```typescript interface ChunkingConfig { maxChars?: number; // Max characters per chunk (default: 4096) maxOverlap?: number; // Overlap between chunks (default: 512) chunkSize?: number; // Alternative unit (mutually exclusive with maxChars) chunkOverlap?: number; // Alternative unit (mutually exclusive with maxOverlap) preset?: string; // Named preset ('default', 'aggressive', 'minimal') embedding?: Record; // Embedding config enabled?: boolean; // Enable chunking (default: true when config provided) } ``` **Key Point**: Use `maxChars` and `maxOverlap`, NOT `maxCharacters` or `overlap`. ### `OcrConfig` Configuration for optical character recognition. ```typescript interface OcrConfig { backend: string; // OCR backend name (e.g., 'tesseract') language?: string; // Language code (e.g., 'eng', 'deu') tesseractConfig?: TesseractConfig; } interface TesseractConfig { psm?: number; // Page Segmentation Mode (0-13) enableTableDetection?: boolean; tesseditCharWhitelist?: string; // Character whitelist } ``` ### `ImageExtractionConfig` Configuration for extracting and optimizing images. ```typescript interface ImageExtractionConfig { extractImages?: boolean; // Default: true targetDpi?: number; // Target DPI (default: 150) maxImageDimension?: number; // Max width/height in pixels (default: 2000) autoAdjustDpi?: boolean; // Auto-adjust DPI (default: true) minDpi?: number; // Minimum DPI (default: 72) maxDpi?: number; // Maximum DPI (default: 300) } ``` ### `PdfConfig` PDF-specific extraction options. ```typescript interface PdfConfig { extractImages?: boolean; // Default: true passwords?: string[]; // Passwords for encrypted PDFs extractMetadata?: boolean; // Default: true hierarchy?: HierarchyConfig; // Hierarchy extraction } ``` ### `LanguageDetectionConfig` Configuration for automatic language detection. ```typescript interface LanguageDetectionConfig { enabled?: boolean; // Default: true minConfidence?: number; // Threshold 0.0-1.0 (default: 0.5) detectMultiple?: boolean; // Detect multiple languages (default: false) } ``` ### `TokenReductionConfig` Configuration for optimizing token usage. ```typescript interface TokenReductionConfig { mode?: string; // 'aggressive' or 'conservative' (default: 'conservative') preserveImportantWords?: boolean; // Default: true } ``` ### `KeywordConfig` Configuration for keyword extraction. ```typescript interface KeywordConfig { algorithm?: "yake" | "rake"; // Default: 'yake' maxKeywords?: number; // Maximum keywords (default: 10) minScore?: number; // Minimum relevance score (default: 0.1) ngramRange?: [number, number]; // N-gram range (default: [1, 3]) language?: string; // Language code (default: 'en') yakeParams?: YakeParams; rakeParams?: RakeParams; } ``` ### `PageExtractionConfig` Configuration for page-level content tracking. ```typescript interface PageExtractionConfig { extractPages?: boolean; // Extract as separate pages array insertPageMarkers?: boolean; // Insert page markers in content markerFormat?: string; // Marker format with {page_num} placeholder } ``` ### `HtmlConversionOptions` Configuration for HTML to Markdown conversion. ```typescript interface HtmlConversionOptions { headingStyle?: "atx" | "underlined" | "atx_closed"; listIndentType?: "spaces" | "tabs"; listIndentWidth?: number; bullets?: string; strongEmSymbol?: string; escapeAsterisks?: boolean; escapeUnderscores?: boolean; escapeMisc?: boolean; escapeAscii?: boolean; codeLanguage?: string; autolinks?: boolean; defaultTitle?: boolean; brInTables?: boolean; hocrSpatialTables?: boolean; highlightStyle?: "double_equal" | "html" | "bold" | "none"; extractMetadata?: boolean; whitespaceMode?: "normalized" | "strict"; stripNewlines?: boolean; wrap?: boolean; wrapWidth?: number; convertAsInline?: boolean; subSymbol?: string; supSymbol?: string; newlineStyle?: "spaces" | "backslash"; codeBlockStyle?: "indented" | "backticks" | "tildes"; keepInlineImagesIn?: string[]; encoding?: string; debug?: boolean; stripTags?: string[]; preserveTags?: string[]; preprocessing?: HtmlPreprocessingOptions; } ``` --- ## Result Types ### `ExtractionResult` Complete extraction result from document processing. ```typescript interface ExtractionResult { // Main content content: string; // Document type mimeType: string; // Metadata (format-specific) metadata: Metadata; // Extracted structures tables: Table[]; // Optional processed data detectedLanguages: string[] | null; chunks: Chunk[] | null; // From chunking config images: ExtractedImage[] | null; // From image extraction elements?: Element[] | null; // From element_based result format pages?: PageContent[] | null; // From page extraction extractedKeywords?: ExtractedKeyword[] | null; // Extracted keywords with scores qualityScore?: number | null; // Overall extraction quality (0.0-1.0) processingWarnings?: ProcessingWarning[]; // Non-fatal warnings from pipeline } ``` ### `Table` Extracted table data with cell structure. ```typescript interface Table { cells: string[][]; // 2D array of cell contents (rows × columns) markdown: string; // Markdown representation pageNumber: number; // 1-indexed page number } ``` ### `Chunk` Text chunk for RAG or vector database indexing. ```typescript interface Chunk { content: string; embedding?: number[] | null; // Vector embedding if computed metadata: ChunkMetadata; } interface ChunkMetadata { byteStart: number; // UTF-8 byte offset in original text byteEnd: number; // UTF-8 byte offset tokenCount?: number | null; chunkIndex: number; // Zero-based index totalChunks: number; // Total number of chunks firstPage?: number | null; // 1-indexed, if page tracking enabled lastPage?: number | null; } ``` ### `ExtractedImage` Image extracted from document. ```typescript interface ExtractedImage { data: Uint8Array; // Raw image bytes format: string; // Format (e.g., 'png', 'jpeg', 'tiff') imageIndex: number; // Sequential index (0-indexed) pageNumber?: number | null; width?: number | null; height?: number | null; colorspace?: string | null; bitsPerComponent?: number | null; isMask: boolean; description?: string | null; ocrResult?: ExtractionResult | null; // OCR result if processed } ``` ### `PageContent` Per-page content when page extraction is enabled. ```typescript interface PageContent { pageNumber: number; // 1-indexed content: string; // Page text content tables: Table[]; // Tables on this page images: ExtractedImage[]; // Images on this page } ``` ### `ExtractedKeyword` Extracted keyword with relevance score and position information. ```typescript interface ExtractedKeyword { text: string; // Keyword text score: number; // Relevance score (0.0-1.0) algorithm: string; // Algorithm used ("tfidf", "textrank", "yake", etc.) positions?: number[] | null; // Character positions in content (if available) } ``` ### `ProcessingWarning` Non-fatal warning encountered during document processing. ```typescript interface ProcessingWarning { source: string; // Component that generated the warning message: string; // Warning message describing the issue } ``` ### `Metadata` Extraction result metadata (format-specific). ```typescript interface Metadata { // Common fields language?: string | null; date?: string | null; subject?: string | null; format_type?: | "pdf" | "excel" | "email" | "pptx" | "archive" | "image" | "xml" | "text" | "html" | "ocr"; // PDF metadata title?: string | null; author?: string | null; creator?: string | null; producer?: string | null; creation_date?: string | null; modification_date?: string | null; page_count?: number; // Excel metadata sheet_count?: number; sheet_names?: string[]; // Email metadata from_email?: string | null; from_name?: string | null; to_emails?: string[]; cc_emails?: string[]; bcc_emails?: string[]; message_id?: string | null; attachments?: string[]; // Image metadata width?: number; height?: number; exif?: Record; // OCR metadata psm?: number; output_format?: string; table_count?: number; // HTML metadata canonical_url?: string | null; html_language?: string | null; text_direction?: "ltr" | "rtl" | "auto" | null; open_graph?: Record; twitter_card?: Record; meta_tags?: Record; html_headers?: HeaderMetadata[]; html_links?: LinkMetadata[]; html_images?: HtmlImageMetadata[]; structured_data?: StructuredData[]; // Text metadata line_count?: number; word_count?: number; character_count?: number; headers?: string[] | null; links?: [string, string][] | null; code_blocks?: [string, string][] | null; // Page structure page_structure?: PageStructure | null; // Additional typed fields category?: string | null; tags?: string[]; document_version?: string | null; abstract_text?: string | null; // Custom fields from postprocessors [key: string]: unknown; } ``` --- ## Error Handling ### Error Classes ```typescript import { KreuzbergError, ParsingError, OcrError, // Note: camelCase, not "OCRError" ValidationError, MissingDependencyError, CacheError, ImageProcessingError, PluginError, ErrorCode, } from "@kreuzberg/node"; ``` **Error Hierarchy**: - `KreuzbergError` — Base class for all Kreuzberg errors - `ParsingError` — Document format invalid or corrupted - `OcrError` — OCR processing failed - `ValidationError` — Extraction validation failed - `MissingDependencyError` — Required dependency unavailable - `CacheError` — Cache operation failed - `ImageProcessingError` — Image extraction or processing failed - `PluginError` — Plugin registration or execution failed ### Error Diagnostics ```typescript import { classifyError, getErrorCodeDescription, getErrorCodeName, getLastErrorCode, getLastPanicContext, } from "@kreuzberg/node"; try { const result = await extractFile("document.pdf"); } catch (error) { const classification = classifyError(error.message); console.log(`Error code: ${getErrorCodeName(classification.code)}`); console.log(`Description: ${getErrorCodeDescription(classification.code)}`); console.log(`Confidence: ${classification.confidence}`); } ``` ### `ErrorCode` Enum ```typescript enum ErrorCode { Success = 0, GenericError = 1, Panic = 2, InvalidArgument = 3, IoError = 4, ParsingError = 5, OcrError = 6, MissingDependency = 7, } ``` --- ## Plugin System ### Post-Processors Custom post-processors can enrich extraction results without failing the extraction if they encounter errors. #### `registerPostProcessor(processor): void` Register a custom post-processor. ```typescript import { registerPostProcessor, extractFile } from "@kreuzberg/node"; const processor = { name() { return "my_processor"; }, async process(result) { // Enrich result with custom metadata result.metadata["custom_field"] = "value"; return result; }, processingStage() { return "late"; // 'early', 'middle', or 'late' }, async initialize() { // Called once when registered }, async shutdown() { // Called when unregistered }, }; registerPostProcessor(processor); const result = await extractFile("document.pdf"); ``` #### `unregisterPostProcessor(name): void` Remove a registered post-processor. ```typescript import { unregisterPostProcessor } from "@kreuzberg/node"; unregisterPostProcessor("my_processor"); ``` #### `listPostProcessors(): string[]` List all registered post-processor names. ```typescript import { listPostProcessors } from "@kreuzberg/node"; const processors = listPostProcessors(); console.log("Registered processors:", processors); ``` #### `clearPostProcessors(): void` Unregister all post-processors. ```typescript import { clearPostProcessors } from "@kreuzberg/node"; clearPostProcessors(); ``` ### Validators Custom validators check extraction results and fail the extraction if validation fails (unlike post-processors). #### `registerValidator(validator): void` Register a custom validator. ```typescript import { registerValidator, extractFile } from "@kreuzberg/node"; const validator = { name() { return "content_length_validator"; }, validate(result) { if (result.content.length < 10) { throw new Error("Content too short"); } }, priority() { return 100; // Higher = runs first }, shouldValidate(result) { return result.mimeType === "application/pdf"; // Conditional validation }, async initialize() { // Called once when registered }, async shutdown() { // Called when unregistered }, }; registerValidator(validator); const result = await extractFile("document.pdf"); ``` #### `unregisterValidator(name): void` Remove a registered validator. ```typescript import { unregisterValidator } from "@kreuzberg/node"; unregisterValidator("content_length_validator"); ``` #### `listValidators(): string[]` List all registered validator names. ```typescript import { listValidators } from "@kreuzberg/node"; const validators = listValidators(); ``` #### `clearValidators(): void` Unregister all validators. ```typescript import { clearValidators } from "@kreuzberg/node"; clearValidators(); ``` ### OCR Backends Custom OCR backends can be registered to handle image text extraction. #### `registerOcrBackend(backend): void` Register a custom OCR backend. ```typescript import { registerOcrBackend, extractFile } from "@kreuzberg/node"; const backend = { name() { return "my-ocr"; }, supportedLanguages() { return ["eng", "deu", "fra"]; }, async processImage(imageBytes, language) { // imageBytes: Uint8Array or Base64 string const buffer = typeof imageBytes === "string" ? Buffer.from(imageBytes, "base64") : Buffer.from(imageBytes); // Process and extract text return { content: "extracted text", mime_type: "text/plain", metadata: { confidence: 0.95, language }, tables: [], }; }, async initialize() { // Load models, setup resources }, async shutdown() { // Cleanup resources }, }; registerOcrBackend(backend); ``` #### `GutenOcrBackend` Built-in OCR backend implementation using Guten-OCR. ```typescript import { GutenOcrBackend, registerOcrBackend, extractFile } from "@kreuzberg/node"; const backend = new GutenOcrBackend(); await backend.initialize(); registerOcrBackend(backend); const result = await extractFile("scanned.pdf", null, { ocr: { backend: "guten-ocr", language: "eng" }, }); ``` #### `unregisterOcrBackend(name): void` Remove a registered OCR backend. ```typescript import { unregisterOcrBackend } from "@kreuzberg/node"; unregisterOcrBackend("my-ocr"); ``` #### `listOcrBackends(): string[]` List all registered OCR backend names. ```typescript import { listOcrBackends } from "@kreuzberg/node"; const backends = listOcrBackends(); ``` #### `clearOcrBackends(): void` Unregister all OCR backends. ```typescript import { clearOcrBackends } from "@kreuzberg/node"; clearOcrBackends(); ``` --- ## MIME Type Utilities ### `detectMimeType(data): string | null` Detect MIME type from file content (magic bytes). ```typescript import { detectMimeType } from "@kreuzberg/node"; import { readFileSync } from "fs"; const data = readFileSync("document"); const mimeType = detectMimeType(data); console.log(`Detected MIME type: ${mimeType}`); ``` ### `detectMimeTypeFromPath(filePath): string | null` Detect MIME type from file extension. ```typescript import { detectMimeTypeFromPath } from "@kreuzberg/node"; const mimeType = detectMimeTypeFromPath("document.pdf"); console.log(`MIME type: ${mimeType}`); // 'application/pdf' ``` ### `getExtensionsForMime(mimeType): string[]` Get file extensions for a MIME type. ```typescript import { getExtensionsForMime } from "@kreuzberg/node"; const extensions = getExtensionsForMime("application/pdf"); console.log(`Extensions: ${extensions}`); // ['.pdf'] ``` ### `validateMimeType(mimeType): boolean` Check if a MIME type is valid. ```typescript import { validateMimeType } from "@kreuzberg/node"; if (validateMimeType("application/pdf")) { console.log("Valid MIME type"); } ``` --- ## Configuration Loading ### `ExtractionConfig.fromFile(filePath): ExtractionConfig` Load extraction configuration from a file (JSON, YAML, or TOML). ```typescript import { ExtractionConfig, extractFile } from "@kreuzberg/node"; const config = ExtractionConfig.fromFile("./kreuzberg.toml"); const result = await extractFile("document.pdf", null, config); ``` ### `ExtractionConfig.discover(): ExtractionConfig | null` Auto-discover extraction configuration file in current and parent directories. ```typescript import { ExtractionConfig, extractFile } from "@kreuzberg/node"; // Searches for kreuzberg.{toml,yaml,json} in current directory and parents const config = ExtractionConfig.discover(); if (config) { const result = await extractFile("document.pdf", null, config); } ``` --- ## Embeddings ### `getEmbeddingPreset(name): EmbeddingPreset | null` Get a named embedding model preset. ```typescript import { getEmbeddingPreset } from "@kreuzberg/node"; const preset = getEmbeddingPreset("default"); if (preset) { console.log(`Model: ${preset.modelName}`); console.log(`Dimensions: ${preset.dimensions}`); } ``` ### `listEmbeddingPresets(): string[]` List all available embedding presets. ```typescript import { listEmbeddingPresets } from "@kreuzberg/node"; const presets = listEmbeddingPresets(); console.log("Available presets:", presets); ``` ### `EmbeddingPreset` Type definition for embedding model presets. ```typescript interface EmbeddingPreset { name: string; // Preset name (e.g., "fast", "balanced", "quality", "multilingual") chunkSize: number; // Recommended chunk size in characters overlap: number; // Recommended overlap in characters modelName: string; // Model identifier (e.g., "AllMiniLML6V2Q", "BGEBaseENV15") dimensions: number; // Embedding vector dimensions description: string; // Human-readable description } ``` --- ## Plugin Protocols ### `PostProcessorProtocol` Interface for custom post-processors. ```typescript interface PostProcessorProtocol { name(): string; process(result: ExtractionResult): ExtractionResult | Promise; processingStage?(): ProcessingStage; // 'early' | 'middle' | 'late' initialize?(): void | Promise; shutdown?(): void | Promise; } ``` ### `ValidatorProtocol` Interface for custom validators. ```typescript interface ValidatorProtocol { name(): string; validate(result: ExtractionResult): void | Promise; priority?(): number; // Higher = runs first shouldValidate?(result: ExtractionResult): boolean; initialize?(): void | Promise; shutdown?(): void | Promise; } ``` ### `OcrBackendProtocol` Interface for custom OCR backends. ```typescript interface OcrBackendProtocol { name(): string; supportedLanguages(): string[]; processImage( imageBytes: Uint8Array | string, language: string, ): Promise<{ content: string; mime_type: string; metadata: Record; tables: unknown[]; }>; initialize?(): void | Promise; shutdown?(): void | Promise; } ``` --- ## Supported Document Formats - **Documents**: PDF, DOCX, PPTX, XLSX, DOC, PPT - **Text**: Markdown, Plain Text, XML, JSON, YAML, TOML - **Web**: HTML (converted to Markdown) - **Email**: EML, MSG - **Images**: PNG, JPEG, TIFF (with OCR support) - **Archives**: ZIP, TAR, GZIP (file listing) --- ## Registry Functions ### Document Extractors ```typescript import { listDocumentExtractors, unregisterDocumentExtractor, clearDocumentExtractors, } from "@kreuzberg/node"; // List registered extractors const extractors = listDocumentExtractors(); // Unregister a specific extractor unregisterDocumentExtractor("pdf"); // Clear all extractors clearDocumentExtractors(); ``` --- ## Type Exports All types are exported from `@kreuzberg/node`: ```typescript export type { Chunk, ChunkingConfig, ExtractionConfig, ExtractionResult, ExtractedImage, KeywordConfig, LanguageDetectionConfig, OcrBackendProtocol, OcrConfig, PageContent, PageExtractionConfig, PdfConfig, PostProcessorProtocol, Table, TokenReductionConfig, ValidatorProtocol, WorkerPool, WorkerPoolStats, EmbeddingPreset, // ... and many more }; ``` --- ## Best Practices 1. **Use batch APIs for multiple documents**: `batchExtractFiles()` provides superior performance vs. calling `extractFile()` in a loop. 2. **Enable chunking for RAG/vector DB**: Set `chunking` config to automatically break documents into overlapping chunks. 3. **Use worker pools for high-concurrency scenarios**: Distribute CPU-bound work across multiple threads for 4+ concurrent extractions. 4. **Configure language detection**: Enable automatic language detection for multilingual documents. 5. **Register validators early**: Set up validators before calling extraction functions to catch quality issues immediately. 6. **Use specific MIME types**: Provide explicit MIME types when available to avoid detection overhead. 7. **Clean up resources**: Always call `closeWorkerPool()` when done to prevent resource leaks. 8. **Handle extraction errors gracefully**: Catch specific error types (`ParsingError`, `OcrError`, etc.) for appropriate error handling. --- ## Version **Package Version**: 4.2.14