Files
fil/skills/kreuzberg/references/nodejs-api.md

1381 lines
33 KiB
Markdown
Raw Normal View History

2026-06-01 23:40:55 +02:00
# Node.js/TypeScript API Reference
## Overview
**Package**: `@kreuzberg/node` — A high-performance TypeScript SDK built on a Rust core for document intelligence and content extraction.
Supports both **ESM** (`import`) and **CommonJS** (`require`):
```typescript
// ESM
import { extractFile, batchExtractFiles } from "@kreuzberg/node";
// CommonJS
const { extractFile, batchExtractFiles } = require("@kreuzberg/node");
```
**Current Version**: 4.2.14
---
## Core Extraction Functions
All extraction functions return `ExtractionResult` containing extracted content, metadata, tables, and optional chunks/images.
### Single File Extraction
#### `extractFile(filePath, mimeType?, config?): Promise<ExtractionResult>`
Extract content from a single file asynchronously.
```typescript
import { extractFile } from "@kreuzberg/node";
// Auto-detect MIME type from file extension
const result = await extractFile("document.pdf");
console.log(result.content);
// Explicit MIME type
const result2 = await extractFile("document.pdf", "application/pdf");
// With configuration
const result3 = await extractFile("document.pdf", null, {
chunking: {
maxChars: 1000,
maxOverlap: 200,
},
});
```
**Parameters**:
- `filePath: string` — Path to the file to extract
- `mimeType?: string | null` — Optional MIME type hint (auto-detect if null)
- `config?: ExtractionConfig` — Optional extraction configuration
**Returns**: `Promise<ExtractionResult>`
**Throws**: `ParsingError`, `OcrError`, `ValidationError`, `KreuzbergError`
#### `extractFileSync(filePath, mimeType?, config?): ExtractionResult`
Extract content from a single file synchronously.
```typescript
import { extractFileSync } from "@kreuzberg/node";
const result = extractFileSync("document.pdf");
console.log(result.content);
```
**Parameters**: Same as `extractFile()`
**Returns**: `ExtractionResult`
---
### Raw Bytes Extraction
#### `extractBytes(data, mimeType, config?): Promise<ExtractionResult>`
Extract content from raw bytes (Buffer or Uint8Array) asynchronously.
```typescript
import { extractBytes } from "@kreuzberg/node";
import { readFile } from "fs/promises";
const data = await readFile("document.pdf");
const result = await extractBytes(data, "application/pdf");
console.log(result.content);
```
**Parameters**:
- `data: Buffer | Uint8Array` — Raw file content
- `mimeType: string` — MIME type (required)
- `config?: ExtractionConfig` — Optional configuration
**Returns**: `Promise<ExtractionResult>`
#### `extractBytesSync(data, mimeType, config?): ExtractionResult`
Extract content from raw bytes synchronously.
```typescript
import { extractBytesSync } from "@kreuzberg/node";
import { readFileSync } from "fs";
const data = readFileSync("document.pdf");
const result = extractBytesSync(data, "application/pdf");
```
**Parameters**: Same as `extractBytes()`
**Returns**: `ExtractionResult`
---
### Batch Extraction (Recommended)
For processing multiple documents, batch APIs provide superior performance and memory management.
#### `batchExtractFiles(paths, config?): Promise<ExtractionResult[]>`
Extract content from multiple files in parallel (asynchronous).
```typescript
import { batchExtractFiles } from "@kreuzberg/node";
const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
const results = await batchExtractFiles(files);
results.forEach((result, i) => {
console.log(`${files[i]}: ${result.content.substring(0, 100)}...`);
});
```
**Parameters**:
- `paths: string[]` — Array of file paths
- `config?: ExtractionConfig` — Configuration (applied to all files)
**Returns**: `Promise<ExtractionResult[]>` — Results in same order as input
#### `batchExtractFilesSync(paths, config?): ExtractionResult[]`
Extract content from multiple files synchronously.
```typescript
import { batchExtractFilesSync } from "@kreuzberg/node";
const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
const results = batchExtractFilesSync(files);
```
**Parameters**: Same as `batchExtractFiles()`
**Returns**: `ExtractionResult[]`
#### `batchExtractBytes(dataList, mimeTypes, config?): Promise<ExtractionResult[]>`
Extract content from multiple byte arrays in parallel (asynchronous).
```typescript
import { batchExtractBytes } from "@kreuzberg/node";
import { readFile } from "fs/promises";
const files = ["doc1.pdf", "doc2.docx"];
const dataList = await Promise.all(files.map((f) => readFile(f)));
const mimeTypes = [
"application/pdf",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];
const results = await batchExtractBytes(dataList, mimeTypes);
```
**Parameters**:
- `dataList: Uint8Array[]` — Array of file contents
- `mimeTypes: string[]` — MIME types (one per item, must match length)
- `config?: ExtractionConfig` — Configuration (applied to all items)
**Returns**: `Promise<ExtractionResult[]>`
#### `batchExtractBytesSync(dataList, mimeTypes, config?): ExtractionResult[]`
Extract content from multiple byte arrays synchronously.
```typescript
import { batchExtractBytesSync } from "@kreuzberg/node";
import { readFileSync } from "fs";
const dataList = ["doc1.pdf", "doc2.docx"].map((f) => readFileSync(f));
const mimeTypes = [
"application/pdf",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];
const results = batchExtractBytesSync(dataList, mimeTypes);
```
**Parameters**: Same as `batchExtractBytes()`
**Returns**: `ExtractionResult[]`
#### `batchExtractFilesWithConfigs(paths, fileConfigs, config?): Promise<ExtractionResult[]>`
Extract multiple files with per-file configuration overrides (asynchronous).
```typescript
const results = await batchExtractFilesWithConfigs(
["report.pdf", "scanned.pdf"],
[null, { forceOcr: true, ocr: { backend: "tesseract", language: "deu" } }],
);
```
**Parameters**:
- `paths: string[]` — File paths
- `fileConfigs: (FileExtractionConfig | null)[]` — Per-file configs (null = use batch defaults)
- `config?: ExtractionConfig` — Batch-level configuration
#### `batchExtractFilesWithConfigsSync(paths, fileConfigs, config?): ExtractionResult[]`
Synchronous variant.
#### `batchExtractBytesWithConfigs(dataList, mimeTypes, fileConfigs, config?): Promise<ExtractionResult[]>`
Extract multiple byte arrays with per-file overrides (asynchronous).
#### `batchExtractBytesWithConfigsSync(dataList, mimeTypes, fileConfigs, config?): ExtractionResult[]`
Synchronous variant.
---
## Worker Pool APIs
Worker pools enable concurrent extraction using Node.js worker threads for CPU-bound processing.
### `createWorkerPool(size?): WorkerPool`
Create a worker pool for concurrent extraction.
```typescript
import { createWorkerPool } from "@kreuzberg/node";
// Create pool with default size (number of CPU cores)
const pool = createWorkerPool();
// Create pool with specific size
const pool4 = createWorkerPool(4);
```
**Parameters**:
- `size?: number` — Number of workers (defaults to CPU core count)
**Returns**: `WorkerPool` — Opaque handle for use with worker extraction functions
### `extractFileInWorker(pool, filePath, mimeType?, config?): Promise<ExtractionResult>`
Extract a single file using a worker from the pool.
```typescript
import { createWorkerPool, extractFileInWorker, closeWorkerPool } from "@kreuzberg/node";
const pool = createWorkerPool(4);
try {
const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
const results = await Promise.all(files.map((f) => extractFileInWorker(pool, f)));
results.forEach((r, i) => {
console.log(`${files[i]}: ${r.content.substring(0, 100)}...`);
});
} finally {
await closeWorkerPool(pool);
}
```
**Parameters**:
- `pool: WorkerPool` — Worker pool instance
- `filePath: string` — File path
- `mimeType?: string | null` — Optional MIME type
- `config?: ExtractionConfig` — Optional configuration
**Returns**: `Promise<ExtractionResult>`
### `batchExtractFilesInWorker(pool, paths, config?): Promise<ExtractionResult[]>`
Extract multiple files using the worker pool for concurrent processing.
```typescript
import { createWorkerPool, batchExtractFilesInWorker, closeWorkerPool } from "@kreuzberg/node";
const pool = createWorkerPool(4);
try {
const files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"];
const results = await batchExtractFilesInWorker(pool, files, {
ocr: { backend: "tesseract", language: "eng" },
});
const total = results.reduce((sum, r) => sum + extractAmount(r.content), 0);
console.log(`Total: $${total}`);
} finally {
await closeWorkerPool(pool);
}
```
**Parameters**:
- `pool: WorkerPool` — Worker pool instance
- `paths: string[]` — File paths
- `config?: ExtractionConfig` — Configuration (applied to all files)
**Returns**: `Promise<ExtractionResult[]>`
### `getWorkerPoolStats(pool): WorkerPoolStats`
Get statistics about a worker pool.
```typescript
import { createWorkerPool, getWorkerPoolStats } from "@kreuzberg/node";
const pool = createWorkerPool(4);
const stats = getWorkerPoolStats(pool);
console.log(`Pool size: ${stats.size}`);
console.log(`Active workers: ${stats.activeWorkers}`);
console.log(`Queued tasks: ${stats.queuedTasks}`);
```
**Parameters**:
- `pool: WorkerPool` — Worker pool instance
**Returns**: `WorkerPoolStats`
### `closeWorkerPool(pool): Promise<void>`
Close a worker pool and shut down all worker threads.
```typescript
import { createWorkerPool, closeWorkerPool } from "@kreuzberg/node";
const pool = createWorkerPool(4);
try {
// Use pool
} finally {
await closeWorkerPool(pool);
}
```
**Parameters**:
- `pool: WorkerPool` — Worker pool instance to close
**Returns**: `Promise<void>`
---
## Configuration Interface
### `ExtractionConfig`
Main configuration object controlling extraction behavior.
```typescript
interface ExtractionConfig {
// Caching and processing
useCache?: boolean; // Default: true
enableQualityProcessing?: boolean; // Default: false
// OCR configuration
ocr?: OcrConfig; // OCR settings
forceOcr?: boolean; // Default: false
// Document processing
chunking?: ChunkingConfig; // Break into chunks
images?: ImageExtractionConfig; // Image extraction
pdfOptions?: PdfConfig; // PDF-specific options
tokenReduction?: TokenReductionConfig; // Token optimization
languageDetection?: LanguageDetectionConfig; // Language detection
postprocessor?: PostProcessorConfig; // Post-processing
htmlOptions?: HtmlConversionOptions; // HTML conversion
keywords?: KeywordConfig; // Keyword extraction
pages?: PageExtractionConfig; // Page extraction
// Output control
maxConcurrentExtractions?: number; // Default: 4
outputFormat?: "plain" | "markdown" | "djot" | "html"; // Default: 'plain'
resultFormat?: "unified" | "element_based"; // Default: 'unified'
}
```
### `FileExtractionConfig`
Per-file overrides for batch operations. All fields optional (omitted = use batch default).
```typescript
interface FileExtractionConfig {
enableQualityProcessing?: boolean;
ocr?: OcrConfig;
forceOcr?: boolean;
chunking?: ChunkingConfig;
images?: ImageExtractionConfig;
pdfOptions?: PdfConfig;
tokenReduction?: TokenReductionConfig;
languageDetection?: LanguageDetectionConfig;
pages?: PageExtractionConfig;
keywords?: KeywordConfig;
postprocessor?: PostProcessorConfig;
outputFormat?: "plain" | "markdown" | "djot" | "html";
resultFormat?: "unified" | "element_based";
includeDocumentStructure?: boolean;
}
```
Excluded (batch-level only): `maxConcurrentExtractions`, `useCache`, `securityLimits`.
### `ChunkingConfig`
Configuration for breaking documents into chunks (useful for RAG and vector databases).
```typescript
interface ChunkingConfig {
maxChars?: number; // Max characters per chunk (default: 4096)
maxOverlap?: number; // Overlap between chunks (default: 512)
chunkSize?: number; // Alternative unit (mutually exclusive with maxChars)
chunkOverlap?: number; // Alternative unit (mutually exclusive with maxOverlap)
preset?: string; // Named preset ('default', 'aggressive', 'minimal')
embedding?: Record<string, unknown>; // Embedding config
enabled?: boolean; // Enable chunking (default: true when config provided)
}
```
**Key Point**: Use `maxChars` and `maxOverlap`, NOT `maxCharacters` or `overlap`.
### `OcrConfig`
Configuration for optical character recognition.
```typescript
interface OcrConfig {
backend: string; // OCR backend name (e.g., 'tesseract')
language?: string; // Language code (e.g., 'eng', 'deu')
tesseractConfig?: TesseractConfig;
}
interface TesseractConfig {
psm?: number; // Page Segmentation Mode (0-13)
enableTableDetection?: boolean;
tesseditCharWhitelist?: string; // Character whitelist
}
```
### `ImageExtractionConfig`
Configuration for extracting and optimizing images.
```typescript
interface ImageExtractionConfig {
extractImages?: boolean; // Default: true
targetDpi?: number; // Target DPI (default: 150)
maxImageDimension?: number; // Max width/height in pixels (default: 2000)
autoAdjustDpi?: boolean; // Auto-adjust DPI (default: true)
minDpi?: number; // Minimum DPI (default: 72)
maxDpi?: number; // Maximum DPI (default: 300)
}
```
### `PdfConfig`
PDF-specific extraction options.
```typescript
interface PdfConfig {
extractImages?: boolean; // Default: true
passwords?: string[]; // Passwords for encrypted PDFs
extractMetadata?: boolean; // Default: true
hierarchy?: HierarchyConfig; // Hierarchy extraction
}
```
### `LanguageDetectionConfig`
Configuration for automatic language detection.
```typescript
interface LanguageDetectionConfig {
enabled?: boolean; // Default: true
minConfidence?: number; // Threshold 0.0-1.0 (default: 0.5)
detectMultiple?: boolean; // Detect multiple languages (default: false)
}
```
### `TokenReductionConfig`
Configuration for optimizing token usage.
```typescript
interface TokenReductionConfig {
mode?: string; // 'aggressive' or 'conservative' (default: 'conservative')
preserveImportantWords?: boolean; // Default: true
}
```
### `KeywordConfig`
Configuration for keyword extraction.
```typescript
interface KeywordConfig {
algorithm?: "yake" | "rake"; // Default: 'yake'
maxKeywords?: number; // Maximum keywords (default: 10)
minScore?: number; // Minimum relevance score (default: 0.1)
ngramRange?: [number, number]; // N-gram range (default: [1, 3])
language?: string; // Language code (default: 'en')
yakeParams?: YakeParams;
rakeParams?: RakeParams;
}
```
### `PageExtractionConfig`
Configuration for page-level content tracking.
```typescript
interface PageExtractionConfig {
extractPages?: boolean; // Extract as separate pages array
insertPageMarkers?: boolean; // Insert page markers in content
markerFormat?: string; // Marker format with {page_num} placeholder
}
```
### `HtmlConversionOptions`
Configuration for HTML to Markdown conversion.
```typescript
interface HtmlConversionOptions {
headingStyle?: "atx" | "underlined" | "atx_closed";
listIndentType?: "spaces" | "tabs";
listIndentWidth?: number;
bullets?: string;
strongEmSymbol?: string;
escapeAsterisks?: boolean;
escapeUnderscores?: boolean;
escapeMisc?: boolean;
escapeAscii?: boolean;
codeLanguage?: string;
autolinks?: boolean;
defaultTitle?: boolean;
brInTables?: boolean;
hocrSpatialTables?: boolean;
highlightStyle?: "double_equal" | "html" | "bold" | "none";
extractMetadata?: boolean;
whitespaceMode?: "normalized" | "strict";
stripNewlines?: boolean;
wrap?: boolean;
wrapWidth?: number;
convertAsInline?: boolean;
subSymbol?: string;
supSymbol?: string;
newlineStyle?: "spaces" | "backslash";
codeBlockStyle?: "indented" | "backticks" | "tildes";
keepInlineImagesIn?: string[];
encoding?: string;
debug?: boolean;
stripTags?: string[];
preserveTags?: string[];
preprocessing?: HtmlPreprocessingOptions;
}
```
---
## Result Types
### `ExtractionResult`
Complete extraction result from document processing.
```typescript
interface ExtractionResult {
// Main content
content: string;
// Document type
mimeType: string;
// Metadata (format-specific)
metadata: Metadata;
// Extracted structures
tables: Table[];
// Optional processed data
detectedLanguages: string[] | null;
chunks: Chunk[] | null; // From chunking config
images: ExtractedImage[] | null; // From image extraction
elements?: Element[] | null; // From element_based result format
pages?: PageContent[] | null; // From page extraction
extractedKeywords?: ExtractedKeyword[] | null; // Extracted keywords with scores
qualityScore?: number | null; // Overall extraction quality (0.0-1.0)
processingWarnings?: ProcessingWarning[]; // Non-fatal warnings from pipeline
}
```
### `Table`
Extracted table data with cell structure.
```typescript
interface Table {
cells: string[][]; // 2D array of cell contents (rows × columns)
markdown: string; // Markdown representation
pageNumber: number; // 1-indexed page number
}
```
### `Chunk`
Text chunk for RAG or vector database indexing.
```typescript
interface Chunk {
content: string;
embedding?: number[] | null; // Vector embedding if computed
metadata: ChunkMetadata;
}
interface ChunkMetadata {
byteStart: number; // UTF-8 byte offset in original text
byteEnd: number; // UTF-8 byte offset
tokenCount?: number | null;
chunkIndex: number; // Zero-based index
totalChunks: number; // Total number of chunks
firstPage?: number | null; // 1-indexed, if page tracking enabled
lastPage?: number | null;
}
```
### `ExtractedImage`
Image extracted from document.
```typescript
interface ExtractedImage {
data: Uint8Array; // Raw image bytes
format: string; // Format (e.g., 'png', 'jpeg', 'tiff')
imageIndex: number; // Sequential index (0-indexed)
pageNumber?: number | null;
width?: number | null;
height?: number | null;
colorspace?: string | null;
bitsPerComponent?: number | null;
isMask: boolean;
description?: string | null;
ocrResult?: ExtractionResult | null; // OCR result if processed
}
```
### `PageContent`
Per-page content when page extraction is enabled.
```typescript
interface PageContent {
pageNumber: number; // 1-indexed
content: string; // Page text content
tables: Table[]; // Tables on this page
images: ExtractedImage[]; // Images on this page
}
```
### `ExtractedKeyword`
Extracted keyword with relevance score and position information.
```typescript
interface ExtractedKeyword {
text: string; // Keyword text
score: number; // Relevance score (0.0-1.0)
algorithm: string; // Algorithm used ("tfidf", "textrank", "yake", etc.)
positions?: number[] | null; // Character positions in content (if available)
}
```
### `ProcessingWarning`
Non-fatal warning encountered during document processing.
```typescript
interface ProcessingWarning {
source: string; // Component that generated the warning
message: string; // Warning message describing the issue
}
```
### `Metadata`
Extraction result metadata (format-specific).
```typescript
interface Metadata {
// Common fields
language?: string | null;
date?: string | null;
subject?: string | null;
format_type?:
| "pdf"
| "excel"
| "email"
| "pptx"
| "archive"
| "image"
| "xml"
| "text"
| "html"
| "ocr";
// PDF metadata
title?: string | null;
author?: string | null;
creator?: string | null;
producer?: string | null;
creation_date?: string | null;
modification_date?: string | null;
page_count?: number;
// Excel metadata
sheet_count?: number;
sheet_names?: string[];
// Email metadata
from_email?: string | null;
from_name?: string | null;
to_emails?: string[];
cc_emails?: string[];
bcc_emails?: string[];
message_id?: string | null;
attachments?: string[];
// Image metadata
width?: number;
height?: number;
exif?: Record<string, string>;
// OCR metadata
psm?: number;
output_format?: string;
table_count?: number;
// HTML metadata
canonical_url?: string | null;
html_language?: string | null;
text_direction?: "ltr" | "rtl" | "auto" | null;
open_graph?: Record<string, string>;
twitter_card?: Record<string, string>;
meta_tags?: Record<string, string>;
html_headers?: HeaderMetadata[];
html_links?: LinkMetadata[];
html_images?: HtmlImageMetadata[];
structured_data?: StructuredData[];
// Text metadata
line_count?: number;
word_count?: number;
character_count?: number;
headers?: string[] | null;
links?: [string, string][] | null;
code_blocks?: [string, string][] | null;
// Page structure
page_structure?: PageStructure | null;
// Additional typed fields
category?: string | null;
tags?: string[];
document_version?: string | null;
abstract_text?: string | null;
// Custom fields from postprocessors
[key: string]: unknown;
}
```
---
## Error Handling
### Error Classes
```typescript
import {
KreuzbergError,
ParsingError,
OcrError, // Note: camelCase, not "OCRError"
ValidationError,
MissingDependencyError,
CacheError,
ImageProcessingError,
PluginError,
ErrorCode,
} from "@kreuzberg/node";
```
**Error Hierarchy**:
- `KreuzbergError` — Base class for all Kreuzberg errors
- `ParsingError` — Document format invalid or corrupted
- `OcrError` — OCR processing failed
- `ValidationError` — Extraction validation failed
- `MissingDependencyError` — Required dependency unavailable
- `CacheError` — Cache operation failed
- `ImageProcessingError` — Image extraction or processing failed
- `PluginError` — Plugin registration or execution failed
### Error Diagnostics
```typescript
import {
classifyError,
getErrorCodeDescription,
getErrorCodeName,
getLastErrorCode,
getLastPanicContext,
} from "@kreuzberg/node";
try {
const result = await extractFile("document.pdf");
} catch (error) {
const classification = classifyError(error.message);
console.log(`Error code: ${getErrorCodeName(classification.code)}`);
console.log(`Description: ${getErrorCodeDescription(classification.code)}`);
console.log(`Confidence: ${classification.confidence}`);
}
```
### `ErrorCode` Enum
```typescript
enum ErrorCode {
Success = 0,
GenericError = 1,
Panic = 2,
InvalidArgument = 3,
IoError = 4,
ParsingError = 5,
OcrError = 6,
MissingDependency = 7,
}
```
---
## Plugin System
### Post-Processors
Custom post-processors can enrich extraction results without failing the extraction if they encounter errors.
#### `registerPostProcessor(processor): void`
Register a custom post-processor.
```typescript
import { registerPostProcessor, extractFile } from "@kreuzberg/node";
const processor = {
name() {
return "my_processor";
},
async process(result) {
// Enrich result with custom metadata
result.metadata["custom_field"] = "value";
return result;
},
processingStage() {
return "late"; // 'early', 'middle', or 'late'
},
async initialize() {
// Called once when registered
},
async shutdown() {
// Called when unregistered
},
};
registerPostProcessor(processor);
const result = await extractFile("document.pdf");
```
#### `unregisterPostProcessor(name): void`
Remove a registered post-processor.
```typescript
import { unregisterPostProcessor } from "@kreuzberg/node";
unregisterPostProcessor("my_processor");
```
#### `listPostProcessors(): string[]`
List all registered post-processor names.
```typescript
import { listPostProcessors } from "@kreuzberg/node";
const processors = listPostProcessors();
console.log("Registered processors:", processors);
```
#### `clearPostProcessors(): void`
Unregister all post-processors.
```typescript
import { clearPostProcessors } from "@kreuzberg/node";
clearPostProcessors();
```
### Validators
Custom validators check extraction results and fail the extraction if validation fails (unlike post-processors).
#### `registerValidator(validator): void`
Register a custom validator.
```typescript
import { registerValidator, extractFile } from "@kreuzberg/node";
const validator = {
name() {
return "content_length_validator";
},
validate(result) {
if (result.content.length < 10) {
throw new Error("Content too short");
}
},
priority() {
return 100; // Higher = runs first
},
shouldValidate(result) {
return result.mimeType === "application/pdf"; // Conditional validation
},
async initialize() {
// Called once when registered
},
async shutdown() {
// Called when unregistered
},
};
registerValidator(validator);
const result = await extractFile("document.pdf");
```
#### `unregisterValidator(name): void`
Remove a registered validator.
```typescript
import { unregisterValidator } from "@kreuzberg/node";
unregisterValidator("content_length_validator");
```
#### `listValidators(): string[]`
List all registered validator names.
```typescript
import { listValidators } from "@kreuzberg/node";
const validators = listValidators();
```
#### `clearValidators(): void`
Unregister all validators.
```typescript
import { clearValidators } from "@kreuzberg/node";
clearValidators();
```
### OCR Backends
Custom OCR backends can be registered to handle image text extraction.
#### `registerOcrBackend(backend): void`
Register a custom OCR backend.
```typescript
import { registerOcrBackend, extractFile } from "@kreuzberg/node";
const backend = {
name() {
return "my-ocr";
},
supportedLanguages() {
return ["eng", "deu", "fra"];
},
async processImage(imageBytes, language) {
// imageBytes: Uint8Array or Base64 string
const buffer =
typeof imageBytes === "string" ? Buffer.from(imageBytes, "base64") : Buffer.from(imageBytes);
// Process and extract text
return {
content: "extracted text",
mime_type: "text/plain",
metadata: { confidence: 0.95, language },
tables: [],
};
},
async initialize() {
// Load models, setup resources
},
async shutdown() {
// Cleanup resources
},
};
registerOcrBackend(backend);
```
#### `GutenOcrBackend`
Built-in OCR backend implementation using Guten-OCR.
```typescript
import { GutenOcrBackend, registerOcrBackend, extractFile } from "@kreuzberg/node";
const backend = new GutenOcrBackend();
await backend.initialize();
registerOcrBackend(backend);
const result = await extractFile("scanned.pdf", null, {
ocr: { backend: "guten-ocr", language: "eng" },
});
```
#### `unregisterOcrBackend(name): void`
Remove a registered OCR backend.
```typescript
import { unregisterOcrBackend } from "@kreuzberg/node";
unregisterOcrBackend("my-ocr");
```
#### `listOcrBackends(): string[]`
List all registered OCR backend names.
```typescript
import { listOcrBackends } from "@kreuzberg/node";
const backends = listOcrBackends();
```
#### `clearOcrBackends(): void`
Unregister all OCR backends.
```typescript
import { clearOcrBackends } from "@kreuzberg/node";
clearOcrBackends();
```
---
## MIME Type Utilities
### `detectMimeType(data): string | null`
Detect MIME type from file content (magic bytes).
```typescript
import { detectMimeType } from "@kreuzberg/node";
import { readFileSync } from "fs";
const data = readFileSync("document");
const mimeType = detectMimeType(data);
console.log(`Detected MIME type: ${mimeType}`);
```
### `detectMimeTypeFromPath(filePath): string | null`
Detect MIME type from file extension.
```typescript
import { detectMimeTypeFromPath } from "@kreuzberg/node";
const mimeType = detectMimeTypeFromPath("document.pdf");
console.log(`MIME type: ${mimeType}`); // 'application/pdf'
```
### `getExtensionsForMime(mimeType): string[]`
Get file extensions for a MIME type.
```typescript
import { getExtensionsForMime } from "@kreuzberg/node";
const extensions = getExtensionsForMime("application/pdf");
console.log(`Extensions: ${extensions}`); // ['.pdf']
```
### `validateMimeType(mimeType): boolean`
Check if a MIME type is valid.
```typescript
import { validateMimeType } from "@kreuzberg/node";
if (validateMimeType("application/pdf")) {
console.log("Valid MIME type");
}
```
---
## Configuration Loading
### `ExtractionConfig.fromFile(filePath): ExtractionConfig`
Load extraction configuration from a file (JSON, YAML, or TOML).
```typescript
import { ExtractionConfig, extractFile } from "@kreuzberg/node";
const config = ExtractionConfig.fromFile("./kreuzberg.toml");
const result = await extractFile("document.pdf", null, config);
```
### `ExtractionConfig.discover(): ExtractionConfig | null`
Auto-discover extraction configuration file in current and parent directories.
```typescript
import { ExtractionConfig, extractFile } from "@kreuzberg/node";
// Searches for kreuzberg.{toml,yaml,json} in current directory and parents
const config = ExtractionConfig.discover();
if (config) {
const result = await extractFile("document.pdf", null, config);
}
```
---
## Embeddings
### `getEmbeddingPreset(name): EmbeddingPreset | null`
Get a named embedding model preset.
```typescript
import { getEmbeddingPreset } from "@kreuzberg/node";
const preset = getEmbeddingPreset("default");
if (preset) {
console.log(`Model: ${preset.modelName}`);
console.log(`Dimensions: ${preset.dimensions}`);
}
```
### `listEmbeddingPresets(): string[]`
List all available embedding presets.
```typescript
import { listEmbeddingPresets } from "@kreuzberg/node";
const presets = listEmbeddingPresets();
console.log("Available presets:", presets);
```
### `EmbeddingPreset`
Type definition for embedding model presets.
```typescript
interface EmbeddingPreset {
name: string; // Preset name (e.g., "fast", "balanced", "quality", "multilingual")
chunkSize: number; // Recommended chunk size in characters
overlap: number; // Recommended overlap in characters
modelName: string; // Model identifier (e.g., "AllMiniLML6V2Q", "BGEBaseENV15")
dimensions: number; // Embedding vector dimensions
description: string; // Human-readable description
}
```
---
## Plugin Protocols
### `PostProcessorProtocol`
Interface for custom post-processors.
```typescript
interface PostProcessorProtocol {
name(): string;
process(result: ExtractionResult): ExtractionResult | Promise<ExtractionResult>;
processingStage?(): ProcessingStage; // 'early' | 'middle' | 'late'
initialize?(): void | Promise<void>;
shutdown?(): void | Promise<void>;
}
```
### `ValidatorProtocol`
Interface for custom validators.
```typescript
interface ValidatorProtocol {
name(): string;
validate(result: ExtractionResult): void | Promise<void>;
priority?(): number; // Higher = runs first
shouldValidate?(result: ExtractionResult): boolean;
initialize?(): void | Promise<void>;
shutdown?(): void | Promise<void>;
}
```
### `OcrBackendProtocol`
Interface for custom OCR backends.
```typescript
interface OcrBackendProtocol {
name(): string;
supportedLanguages(): string[];
processImage(
imageBytes: Uint8Array | string,
language: string,
): Promise<{
content: string;
mime_type: string;
metadata: Record<string, unknown>;
tables: unknown[];
}>;
initialize?(): void | Promise<void>;
shutdown?(): void | Promise<void>;
}
```
---
## Supported Document Formats
- **Documents**: PDF, DOCX, PPTX, XLSX, DOC, PPT
- **Text**: Markdown, Plain Text, XML, JSON, YAML, TOML
- **Web**: HTML (converted to Markdown)
- **Email**: EML, MSG
- **Images**: PNG, JPEG, TIFF (with OCR support)
- **Archives**: ZIP, TAR, GZIP (file listing)
---
## Registry Functions
### Document Extractors
```typescript
import {
listDocumentExtractors,
unregisterDocumentExtractor,
clearDocumentExtractors,
} from "@kreuzberg/node";
// List registered extractors
const extractors = listDocumentExtractors();
// Unregister a specific extractor
unregisterDocumentExtractor("pdf");
// Clear all extractors
clearDocumentExtractors();
```
---
## Type Exports
All types are exported from `@kreuzberg/node`:
```typescript
export type {
Chunk,
ChunkingConfig,
ExtractionConfig,
ExtractionResult,
ExtractedImage,
KeywordConfig,
LanguageDetectionConfig,
OcrBackendProtocol,
OcrConfig,
PageContent,
PageExtractionConfig,
PdfConfig,
PostProcessorProtocol,
Table,
TokenReductionConfig,
ValidatorProtocol,
WorkerPool,
WorkerPoolStats,
EmbeddingPreset,
// ... and many more
};
```
---
## Best Practices
1. **Use batch APIs for multiple documents**: `batchExtractFiles()` provides superior performance vs. calling `extractFile()` in a loop.
2. **Enable chunking for RAG/vector DB**: Set `chunking` config to automatically break documents into overlapping chunks.
3. **Use worker pools for high-concurrency scenarios**: Distribute CPU-bound work across multiple threads for 4+ concurrent extractions.
4. **Configure language detection**: Enable automatic language detection for multilingual documents.
5. **Register validators early**: Set up validators before calling extraction functions to catch quality issues immediately.
6. **Use specific MIME types**: Provide explicit MIME types when available to avoid detection overhead.
7. **Clean up resources**: Always call `closeWorkerPool()` when done to prevent resource leaks.
8. **Handle extraction errors gracefully**: Catch specific error types (`ParsingError`, `OcrError`, etc.) for appropriate error handling.
---
## Version
**Package Version**: 4.2.14