Node.js/TypeScript API Reference
Overview
Package: @kreuzberg/node — A high-performance TypeScript SDK built on a Rust core for document intelligence and content extraction.
Supports both ESM (import) and CommonJS (require):
// ESM
import { extractFile, batchExtractFiles } from "@kreuzberg/node";
// CommonJS
const { extractFile, batchExtractFiles } = require("@kreuzberg/node");
Current Version: 4.2.14
Core Extraction Functions
All extraction functions return an ExtractionResult (async variants wrap it in a Promise, batch variants return an array) containing the extracted content, metadata, tables, and optional chunks and images.
Single File Extraction
extractFile(filePath, mimeType?, config?): Promise<ExtractionResult>
Extract content from a single file asynchronously.
import { extractFile } from "@kreuzberg/node";
// Auto-detect MIME type from file extension
const result = await extractFile("document.pdf");
console.log(result.content);
// Explicit MIME type
const result2 = await extractFile("document.pdf", "application/pdf");
// With configuration
const result3 = await extractFile("document.pdf", null, {
  chunking: {
    maxChars: 1000,
    maxOverlap: 200,
  },
});
Parameters:
- filePath: string - Path to the file to extract
- mimeType?: string | null - Optional MIME type hint (auto-detected if null)
- config?: ExtractionConfig - Optional extraction configuration
Returns: Promise<ExtractionResult>
Throws: ParsingError, OcrError, ValidationError, KreuzbergError
extractFileSync(filePath, mimeType?, config?): ExtractionResult
Extract content from a single file synchronously.
import { extractFileSync } from "@kreuzberg/node";
const result = extractFileSync("document.pdf");
console.log(result.content);
Parameters: Same as extractFile()
Returns: ExtractionResult
Raw Bytes Extraction
extractBytes(data, mimeType, config?): Promise<ExtractionResult>
Extract content from raw bytes (Buffer or Uint8Array) asynchronously.
import { extractBytes } from "@kreuzberg/node";
import { readFile } from "fs/promises";
const data = await readFile("document.pdf");
const result = await extractBytes(data, "application/pdf");
console.log(result.content);
Parameters:
- data: Buffer | Uint8Array - Raw file content
- mimeType: string - MIME type (required)
- config?: ExtractionConfig - Optional configuration
Returns: Promise<ExtractionResult>
extractBytesSync(data, mimeType, config?): ExtractionResult
Extract content from raw bytes synchronously.
import { extractBytesSync } from "@kreuzberg/node";
import { readFileSync } from "fs";
const data = readFileSync("document.pdf");
const result = extractBytesSync(data, "application/pdf");
Parameters: Same as extractBytes()
Returns: ExtractionResult
Batch Extraction (Recommended)
For processing multiple documents, batch APIs provide superior performance and memory management.
batchExtractFiles(paths, config?): Promise<ExtractionResult[]>
Extract content from multiple files in parallel (asynchronous).
import { batchExtractFiles } from "@kreuzberg/node";
const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
const results = await batchExtractFiles(files);
results.forEach((result, i) => {
  console.log(`${files[i]}: ${result.content.substring(0, 100)}...`);
});
Parameters:
- paths: string[] - Array of file paths
- config?: ExtractionConfig - Configuration (applied to all files)
Returns: Promise<ExtractionResult[]> — Results in same order as input
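Because results come back in input order, each result can be matched to its source path by index. The following is a minimal sketch of that pairing. `pairByIndex` is a hypothetical helper, not part of @kreuzberg/node, and the result objects are stand-ins so the snippet runs without the package.

```typescript
// Hypothetical helper: pairs each input path with the result at the same index.
// Relies on the documented guarantee that results keep the input order.
function pairByIndex<T>(paths: string[], results: T[]): Array<{ path: string; result: T }> {
  if (paths.length !== results.length) {
    throw new Error(`Expected ${paths.length} results, got ${results.length}`);
  }
  return paths.map((path, i) => ({ path, result: results[i] }));
}

// Stand-in results; in real code these would come from batchExtractFiles(inputPaths).
const inputPaths = ["doc1.pdf", "doc2.docx"];
const standInResults = [{ content: "first" }, { content: "second" }];
const paired = pairByIndex(inputPaths, standInResults);
console.log(paired[1].path, paired[1].result.content);
```

The length check is a defensive choice: it surfaces a mismatch immediately instead of silently pairing a path with `undefined`.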
batchExtractFilesSync(paths, config?): ExtractionResult[]
Extract content from multiple files synchronously.
import { batchExtractFilesSync } from "@kreuzberg/node";
const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
const results = batchExtractFilesSync(files);
Parameters: Same as batchExtractFiles()
Returns: ExtractionResult[]
batchExtractBytes(dataList, mimeTypes, config?): Promise<ExtractionResult[]>
Extract content from multiple byte arrays in parallel (asynchronous).
import { batchExtractBytes } from "@kreuzberg/node";
import { readFile } from "fs/promises";
const files = ["doc1.pdf", "doc2.docx"];
const dataList = await Promise.all(files.map((f) => readFile(f)));
const mimeTypes = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];
const results = await batchExtractBytes(dataList, mimeTypes);
Parameters:
- dataList: Uint8Array[] - Array of file contents
- mimeTypes: string[] - MIME types (one per item, must match the length of dataList)
- config?: ExtractionConfig - Configuration (applied to all items)
Returns: Promise<ExtractionResult[]>
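Since the two arrays must have the same length, one way to keep them aligned is to hold bytes and MIME type together and split them only at the call site. This is a sketch under that assumption; `BytesInput` and `toParallelArrays` are hypothetical names that do not exist in the package.

```typescript
// Hypothetical input shape: keeping bytes and MIME type in one object
// prevents the two arrays from drifting out of sync.
interface BytesInput {
  data: Uint8Array;
  mimeType: string;
}

// Splits paired inputs into the two parallel arrays that batchExtractBytes expects.
function toParallelArrays(inputs: BytesInput[]): { dataList: Uint8Array[]; mimeTypes: string[] } {
  return {
    dataList: inputs.map((input) => input.data),
    mimeTypes: inputs.map((input) => input.mimeType),
  };
}

// Tiny stand-in payloads (only the leading magic bytes), not real documents.
const pairedInputs: BytesInput[] = [
  { data: new Uint8Array([0x25, 0x50, 0x44, 0x46]), mimeType: "application/pdf" },
  {
    data: new Uint8Array([0x50, 0x4b, 0x03, 0x04]),
    mimeType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  },
];
const { dataList, mimeTypes } = toParallelArrays(pairedInputs);
console.log(dataList.length === mimeTypes.length);
```

The resulting `dataList` and `mimeTypes` can then be passed straight to `batchExtractBytes(dataList, mimeTypes)`.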
batchExtractBytesSync(dataList, mimeTypes, config?): ExtractionResult[]
Extract content from multiple byte arrays synchronously.
import { batchExtractBytesSync } from "@kreuzberg/node";
import { readFileSync } from "fs";
const dataList = ["doc1.pdf", "doc2.docx"].map((f) => readFileSync(f));
const mimeTypes = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];
const results = batchExtractBytesSync(dataList, mimeTypes);
Parameters: Same as batchExtractBytes()
Returns: ExtractionResult[]
batchExtractFilesWithConfigs(paths, fileConfigs, config?): Promise<ExtractionResult[]>
Extract multiple files with per-file configuration overrides (asynchronous).
const results = await batchExtractFilesWithConfigs(
  ["report.pdf", "scanned.pdf"],
  [null, { forceOcr: true, ocr: { backend: "tesseract", language: "deu" } }],
);
Parameters:
- paths: string[] - File paths
- fileConfigs: (FileExtractionConfig | null)[] - Per-file configs (null = use batch defaults)
- config?: ExtractionConfig - Batch-level configuration
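The per-file array has to line up with `paths` position by position, which is easy to get wrong for long batches. The sketch below builds it from a path-keyed lookup. `FileOverride` is a local stand-in for the fields of FileExtractionConfig used here, and `buildFileConfigs` is a hypothetical helper, not a package export.

```typescript
// Local stand-in for the parts of FileExtractionConfig used in this sketch.
interface FileOverride {
  forceOcr?: boolean;
  ocr?: { backend: string; language?: string };
}

// Hypothetical helper: builds the per-file array in the same order as `paths`,
// using null for files that should fall back to the batch-level configuration.
function buildFileConfigs(
  paths: string[],
  overrides: Record<string, FileOverride>,
): Array<FileOverride | null> {
  return paths.map((path) =>
    Object.prototype.hasOwnProperty.call(overrides, path) ? overrides[path] : null,
  );
}

const batchPaths = ["report.pdf", "scanned.pdf"];
const perFileConfigs = buildFileConfigs(batchPaths, {
  "scanned.pdf": { forceOcr: true, ocr: { backend: "tesseract", language: "deu" } },
});
console.log(perFileConfigs);
```

Here `perFileConfigs` has one entry per path, with `null` for `report.pdf`, matching the shape shown in the example above.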
batchExtractFilesWithConfigsSync(paths, fileConfigs, config?): ExtractionResult[]
Synchronous variant.
batchExtractBytesWithConfigs(dataList, mimeTypes, fileConfigs, config?): Promise<ExtractionResult[]>
Extract multiple byte arrays with per-file overrides (asynchronous).
batchExtractBytesWithConfigsSync(dataList, mimeTypes, fileConfigs, config?): ExtractionResult[]
Synchronous variant.
Worker Pool APIs
Worker pools enable concurrent extraction using Node.js worker threads for CPU-bound processing.
createWorkerPool(size?): WorkerPool
Create a worker pool for concurrent extraction.
import { createWorkerPool } from "@kreuzberg/node";
// Create pool with default size (number of CPU cores)
const pool = createWorkerPool();
// Create pool with specific size
const pool4 = createWorkerPool(4);
Parameters:
- size?: number - Number of workers (defaults to CPU core count)
Returns: WorkerPool — Opaque handle for use with worker extraction functions
extractFileInWorker(pool, filePath, mimeType?, config?): Promise<ExtractionResult>
Extract a single file using a worker from the pool.
import { createWorkerPool, extractFileInWorker, closeWorkerPool } from "@kreuzberg/node";
const pool = createWorkerPool(4);
try {
  const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
  const results = await Promise.all(files.map((f) => extractFileInWorker(pool, f)));
  results.forEach((r, i) => {
    console.log(`${files[i]}: ${r.content.substring(0, 100)}...`);
  });
} finally {
  await closeWorkerPool(pool);
}
Parameters:
- pool: WorkerPool - Worker pool instance
- filePath: string - File path
- mimeType?: string | null - Optional MIME type
- config?: ExtractionConfig - Optional configuration
Returns: Promise<ExtractionResult>
batchExtractFilesInWorker(pool, paths, config?): Promise<ExtractionResult[]>
Extract multiple files using the worker pool for concurrent processing.
import { createWorkerPool, batchExtractFilesInWorker, closeWorkerPool } from "@kreuzberg/node";
const pool = createWorkerPool(4);
try {
  const files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"];
  const results = await batchExtractFilesInWorker(pool, files, {
    ocr: { backend: "tesseract", language: "eng" },
  });
  // extractAmount is a placeholder for your own parsing logic; it is not exported by @kreuzberg/node
  const total = results.reduce((sum, r) => sum + extractAmount(r.content), 0);
  console.log(`Total: $${total}`);
} finally {
  await closeWorkerPool(pool);
}
Parameters:
- pool: WorkerPool - Worker pool instance
- paths: string[] - File paths
- config?: ExtractionConfig - Configuration (applied to all files)
Returns: Promise<ExtractionResult[]>
getWorkerPoolStats(pool): WorkerPoolStats
Get statistics about a worker pool.
import { createWorkerPool, getWorkerPoolStats } from "@kreuzberg/node";
const pool = createWorkerPool(4);
const stats = getWorkerPoolStats(pool);
console.log(`Pool size: ${stats.size}`);
console.log(`Active workers: ${stats.activeWorkers}`);
console.log(`Queued tasks: ${stats.queuedTasks}`);
Parameters:
- pool: WorkerPool - Worker pool instance
Returns: WorkerPoolStats
closeWorkerPool(pool): Promise<void>
Close a worker pool and shut down all worker threads.
import { createWorkerPool, closeWorkerPool } from "@kreuzberg/node";
const pool = createWorkerPool(4);
try {
  // Use pool
} finally {
  await closeWorkerPool(pool);
}
Parameters:
- pool: WorkerPool - Worker pool instance to close
Returns: Promise<void>
Configuration Interface
ExtractionConfig
Main configuration object controlling extraction behavior.
interface ExtractionConfig {
// Caching and processing
useCache?: boolean; // Default: true
enableQualityProcessing?: boolean; // Default: false
// OCR configuration
ocr?: OcrConfig; // OCR settings
forceOcr?: boolean; // Default: false
// Document processing
chunking?: ChunkingConfig; // Break into chunks
images?: ImageExtractionConfig; // Image extraction
pdfOptions?: PdfConfig; // PDF-specific options
tokenReduction?: TokenReductionConfig; // Token optimization
languageDetection?: LanguageDetectionConfig; // Language detection
postprocessor?: PostProcessorConfig; // Post-processing
htmlOptions?: HtmlConversionOptions; // HTML conversion
keywords?: KeywordConfig; // Keyword extraction
pages?: PageExtractionConfig; // Page extraction
// Output control
maxConcurrentExtractions?: number; // Default: 4
outputFormat?: "plain" | "markdown" | "djot" | "html"; // Default: 'plain'
resultFormat?: "unified" | "element_based"; // Default: 'unified'
}
FileExtractionConfig
Per-file overrides for batch operations. All fields optional (omitted = use batch default).
interface FileExtractionConfig {
enableQualityProcessing?: boolean;
ocr?: OcrConfig;
forceOcr?: boolean;
chunking?: ChunkingConfig;
images?: ImageExtractionConfig;
pdfOptions?: PdfConfig;
tokenReduction?: TokenReductionConfig;
languageDetection?: LanguageDetectionConfig;
pages?: PageExtractionConfig;
keywords?: KeywordConfig;
postprocessor?: PostProcessorConfig;
outputFormat?: "plain" | "markdown" | "djot" | "html";
resultFormat?: "unified" | "element_based";
includeDocumentStructure?: boolean;
}
Excluded (batch-level only): maxConcurrentExtractions, useCache, securityLimits.
ChunkingConfig
Configuration for breaking documents into chunks (useful for RAG and vector databases).
interface ChunkingConfig {
maxChars?: number; // Max characters per chunk (default: 4096)
maxOverlap?: number; // Overlap between chunks (default: 512)
chunkSize?: number; // Alternative unit (mutually exclusive with maxChars)
chunkOverlap?: number; // Alternative unit (mutually exclusive with maxOverlap)
preset?: string; // Named preset ('default', 'aggressive', 'minimal')
embedding?: Record<string, unknown>; // Embedding config
enabled?: boolean; // Enable chunking (default: true when config provided)
}
Key Point: Use maxChars and maxOverlap, NOT maxCharacters or overlap.
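The mutual-exclusivity rules in the interface comments can be checked before a config is passed to an extraction call. The following is a sketch of such a pre-flight check. `checkChunkingConfig` is a hypothetical helper, and the rule that the overlap must be smaller than the chunk size is an assumption, not something this reference states.

```typescript
// Local copy of the ChunkingConfig fields relevant to this check.
interface ChunkingConfig {
  maxChars?: number;
  maxOverlap?: number;
  chunkSize?: number;
  chunkOverlap?: number;
}

// Hypothetical pre-flight check; returns a list of problems (empty means OK).
function checkChunkingConfig(config: ChunkingConfig): string[] {
  const problems: string[] = [];
  if (config.maxChars !== undefined && config.chunkSize !== undefined) {
    problems.push("maxChars and chunkSize are mutually exclusive");
  }
  if (config.maxOverlap !== undefined && config.chunkOverlap !== undefined) {
    problems.push("maxOverlap and chunkOverlap are mutually exclusive");
  }
  // Assumption: an overlap at least as large as the chunk itself is a mistake.
  const size = config.maxChars !== undefined ? config.maxChars : config.chunkSize;
  const overlap = config.maxOverlap !== undefined ? config.maxOverlap : config.chunkOverlap;
  if (size !== undefined && overlap !== undefined && overlap >= size) {
    problems.push("overlap must be smaller than the chunk size");
  }
  return problems;
}

console.log(checkChunkingConfig({ maxChars: 1000, maxOverlap: 200 }));
console.log(checkChunkingConfig({ maxChars: 1000, chunkSize: 512 }));
```

Returning a list instead of throwing lets the caller report every problem at once.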
OcrConfig
Configuration for optical character recognition.
interface OcrConfig {
backend: string; // OCR backend name (e.g., 'tesseract')
language?: string; // Language code (e.g., 'eng', 'deu')
tesseractConfig?: TesseractConfig;
}
interface TesseractConfig {
psm?: number; // Page Segmentation Mode (0-13)
enableTableDetection?: boolean;
tesseditCharWhitelist?: string; // Character whitelist
}
ImageExtractionConfig
Configuration for extracting and optimizing images.
interface ImageExtractionConfig {
extractImages?: boolean; // Default: true
targetDpi?: number; // Target DPI (default: 150)
maxImageDimension?: number; // Max width/height in pixels (default: 2000)
autoAdjustDpi?: boolean; // Auto-adjust DPI (default: true)
minDpi?: number; // Minimum DPI (default: 72)
maxDpi?: number; // Maximum DPI (default: 300)
}
PdfConfig
PDF-specific extraction options.
interface PdfConfig {
extractImages?: boolean; // Default: true
passwords?: string[]; // Passwords for encrypted PDFs
extractMetadata?: boolean; // Default: true
hierarchy?: HierarchyConfig; // Hierarchy extraction
}
LanguageDetectionConfig
Configuration for automatic language detection.
interface LanguageDetectionConfig {
enabled?: boolean; // Default: true
minConfidence?: number; // Threshold 0.0-1.0 (default: 0.5)
detectMultiple?: boolean; // Detect multiple languages (default: false)
}
TokenReductionConfig
Configuration for optimizing token usage.
interface TokenReductionConfig {
mode?: string; // 'aggressive' or 'conservative' (default: 'conservative')
preserveImportantWords?: boolean; // Default: true
}
KeywordConfig
Configuration for keyword extraction.
interface KeywordConfig {
algorithm?: "yake" | "rake"; // Default: 'yake'
maxKeywords?: number; // Maximum keywords (default: 10)
minScore?: number; // Minimum relevance score (default: 0.1)
ngramRange?: [number, number]; // N-gram range (default: [1, 3])
language?: string; // Language code (default: 'en')
yakeParams?: YakeParams;
rakeParams?: RakeParams;
}
PageExtractionConfig
Configuration for page-level content tracking.
interface PageExtractionConfig {
extractPages?: boolean; // Extract as separate pages array
insertPageMarkers?: boolean; // Insert page markers in content
markerFormat?: string; // Marker format with {page_num} placeholder
}
HtmlConversionOptions
Configuration for HTML to Markdown conversion.
interface HtmlConversionOptions {
headingStyle?: "atx" | "underlined" | "atx_closed";
listIndentType?: "spaces" | "tabs";
listIndentWidth?: number;
bullets?: string;
strongEmSymbol?: string;
escapeAsterisks?: boolean;
escapeUnderscores?: boolean;
escapeMisc?: boolean;
escapeAscii?: boolean;
codeLanguage?: string;
autolinks?: boolean;
defaultTitle?: boolean;
brInTables?: boolean;
hocrSpatialTables?: boolean;
highlightStyle?: "double_equal" | "html" | "bold" | "none";
extractMetadata?: boolean;
whitespaceMode?: "normalized" | "strict";
stripNewlines?: boolean;
wrap?: boolean;
wrapWidth?: number;
convertAsInline?: boolean;
subSymbol?: string;
supSymbol?: string;
newlineStyle?: "spaces" | "backslash";
codeBlockStyle?: "indented" | "backticks" | "tildes";
keepInlineImagesIn?: string[];
encoding?: string;
debug?: boolean;
stripTags?: string[];
preserveTags?: string[];
preprocessing?: HtmlPreprocessingOptions;
}
Result Types
ExtractionResult
Complete extraction result from document processing.
interface ExtractionResult {
// Main content
content: string;
// Document type
mimeType: string;
// Metadata (format-specific)
metadata: Metadata;
// Extracted structures
tables: Table[];
// Optional processed data
detectedLanguages: string[] | null;
chunks: Chunk[] | null; // From chunking config
images: ExtractedImage[] | null; // From image extraction
elements?: Element[] | null; // From element_based result format
pages?: PageContent[] | null; // From page extraction
extractedKeywords?: ExtractedKeyword[] | null; // Extracted keywords with scores
qualityScore?: number | null; // Overall extraction quality (0.0-1.0)
processingWarnings?: ProcessingWarning[]; // Non-fatal warnings from pipeline
}
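Several fields of ExtractionResult are nullable or optional, so consumers need to guard before reading them. The sketch below collapses those fields into plain counts. `ResultLike` is a local subset of the interface above and `summarizeResult` is a hypothetical helper, neither of which is exported by the package.

```typescript
// Local subset of ExtractionResult, just enough for the summary.
interface ResultLike {
  content: string;
  mimeType: string;
  tables: unknown[];
  detectedLanguages: string[] | null;
  chunks: unknown[] | null;
  images: unknown[] | null;
  processingWarnings?: Array<{ source: string; message: string }>;
}

// Hypothetical helper: turns nullable and optional fields into safe defaults.
function summarizeResult(result: ResultLike) {
  return {
    mimeType: result.mimeType,
    characters: result.content.length,
    tableCount: result.tables.length,
    chunkCount: result.chunks ? result.chunks.length : 0,
    imageCount: result.images ? result.images.length : 0,
    languages: result.detectedLanguages ? result.detectedLanguages : [],
    warningCount: result.processingWarnings ? result.processingWarnings.length : 0,
  };
}

// Stand-in result with every optional feature disabled.
const sampleResult: ResultLike = {
  content: "Hello world",
  mimeType: "text/plain",
  tables: [],
  detectedLanguages: null,
  chunks: null,
  images: null,
};
const summary = summarizeResult(sampleResult);
console.log(summary);
```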
Table
Extracted table data with cell structure.
interface Table {
cells: string[][]; // 2D array of cell contents (rows × columns)
markdown: string; // Markdown representation
pageNumber: number; // 1-indexed page number
}
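The `cells` grid is convenient for display but often needs reshaping for downstream code. As an illustration, the sketch below turns a table into an array of keyed records. `tableToRecords` is a hypothetical helper, and it assumes the first row holds the column headers, which the Table type itself does not guarantee.

```typescript
// Local copy of the Table interface above.
interface Table {
  cells: string[][];
  markdown: string;
  pageNumber: number;
}

// Hypothetical helper. Assumption: the first row contains the column headers.
function tableToRecords(table: Table): Array<Record<string, string>> {
  if (table.cells.length === 0) return [];
  const header = table.cells[0];
  return table.cells.slice(1).map((row) => {
    const record: Record<string, string> = {};
    header.forEach((column, i) => {
      // Pad rows that are shorter than the header with empty strings.
      record[column] = i < row.length ? row[i] : "";
    });
    return record;
  });
}

// Stand-in table; real tables come from result.tables.
const sampleTable: Table = {
  cells: [["Item", "Price"], ["Widget", "9.99"], ["Gadget"]],
  markdown: "",
  pageNumber: 1,
};
const records = tableToRecords(sampleTable);
console.log(records);
```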
Chunk
Text chunk for RAG or vector database indexing.
interface Chunk {
content: string;
embedding?: number[] | null; // Vector embedding if computed
metadata: ChunkMetadata;
}
interface ChunkMetadata {
byteStart: number; // UTF-8 byte offset in original text
byteEnd: number; // UTF-8 byte offset
tokenCount?: number | null;
chunkIndex: number; // Zero-based index
totalChunks: number; // Total number of chunks
firstPage?: number | null; // 1-indexed, if page tracking enabled
lastPage?: number | null;
}
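Because `byteStart` and `byteEnd` are UTF-8 byte offsets, they cannot be passed directly to `String.prototype.slice`, which counts UTF-16 code units. The sketch below shows the difference on text with non-ASCII characters. `sliceByByteRange` is a hypothetical helper, and treating `byteEnd` as exclusive is an assumption that this reference does not state.

```typescript
// Hypothetical helper: returns the text covered by a UTF-8 byte range.
// Assumption: byteEnd is exclusive.
function sliceByByteRange(text: string, byteStart: number, byteEnd: number): string {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder("utf-8").decode(bytes.subarray(byteStart, byteEnd));
}

// "Grüße aus Berlin", written with escapes so the byte layout is unambiguous:
// u-umlaut and sharp s each take two bytes in UTF-8.
const originalText = "Gr\u00fc\u00dfe aus Berlin";

// Stand-in chunk offsets; real values come from chunk.metadata.
const firstWord = sliceByByteRange(originalText, 0, 7);
const lastWord = sliceByByteRange(originalText, 12, 18);
console.log(firstWord, "|", lastWord);

// A code-unit slice with the same numbers reads past the first word.
console.log(originalText.slice(0, 7));
```

Encoding the whole text on every call is fine for a sketch; for many chunks it would be cheaper to encode once and reuse the byte array.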
ExtractedImage
Image extracted from document.
interface ExtractedImage {
data: Uint8Array; // Raw image bytes
format: string; // Format (e.g., 'png', 'jpeg', 'tiff')
imageIndex: number; // Sequential index (0-indexed)
pageNumber?: number | null;
width?: number | null;
height?: number | null;
colorspace?: string | null;
bitsPerComponent?: number | null;
isMask: boolean;
description?: string | null;
ocrResult?: ExtractionResult | null; // OCR result if processed
}
PageContent
Per-page content when page extraction is enabled.
interface PageContent {
pageNumber: number; // 1-indexed
content: string; // Page text content
tables: Table[]; // Tables on this page
images: ExtractedImage[]; // Images on this page
}
ExtractedKeyword
Extracted keyword with relevance score and position information.
interface ExtractedKeyword {
text: string; // Keyword text
score: number; // Relevance score (0.0-1.0)
algorithm: string; // Algorithm used ("tfidf", "textrank", "yake", etc.)
positions?: number[] | null; // Character positions in content (if available)
}
ProcessingWarning
Non-fatal warning encountered during document processing.
interface ProcessingWarning {
source: string; // Component that generated the warning
message: string; // Warning message describing the issue
}
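Since `processingWarnings` is optional and each warning names its source component, a small grouping step makes the warnings easier to log. This is a minimal sketch: `groupWarnings` is a hypothetical helper, and the source names in the sample data are made up for illustration.

```typescript
// Local copy of the ProcessingWarning interface above.
interface ProcessingWarning {
  source: string;
  message: string;
}

// Hypothetical helper: groups warning messages by the component that raised them.
// The default parameter covers results where processingWarnings is undefined.
function groupWarnings(warnings: ProcessingWarning[] = []): Record<string, string[]> {
  const grouped: Record<string, string[]> = {};
  for (const warning of warnings) {
    const bucket = grouped[warning.source] || [];
    bucket.push(warning.message);
    grouped[warning.source] = bucket;
  }
  return grouped;
}

// Made-up warnings; real ones come from result.processingWarnings.
const groupedWarnings = groupWarnings([
  { source: "ocr", message: "low confidence on page 2" },
  { source: "chunking", message: "last chunk shorter than overlap" },
  { source: "ocr", message: "page 5 skipped" },
]);
console.log(groupedWarnings);
```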
Metadata
Extraction result metadata (format-specific).
interface Metadata {
// Common fields
language?: string | null;
date?: string | null;
subject?: string | null;
format_type?:
| "pdf"
| "excel"
| "email"
| "pptx"
| "archive"
| "image"
| "xml"
| "text"
| "html"
| "ocr";
// PDF metadata
title?: string | null;
author?: string | null;
creator?: string | null;
producer?: string | null;
creation_date?: string | null;
modification_date?: string | null;
page_count?: number;
// Excel metadata
sheet_count?: number;
sheet_names?: string[];
// Email metadata
from_email?: string | null;
from_name?: string | null;
to_emails?: string[];
cc_emails?: string[];
bcc_emails?: string[];
message_id?: string | null;
attachments?: string[];
// Image metadata
width?: number;
height?: number;
exif?: Record<string, string>;
// OCR metadata
psm?: number;
output_format?: string;
table_count?: number;
// HTML metadata
canonical_url?: string | null;
html_language?: string | null;
text_direction?: "ltr" | "rtl" | "auto" | null;
open_graph?: Record<string, string>;
twitter_card?: Record<string, string>;
meta_tags?: Record<string, string>;
html_headers?: HeaderMetadata[];
html_links?: LinkMetadata[];
html_images?: HtmlImageMetadata[];
structured_data?: StructuredData[];
// Text metadata
line_count?: number;
word_count?: number;
character_count?: number;
headers?: string[] | null;
links?: [string, string][] | null;
code_blocks?: [string, string][] | null;
// Page structure
page_structure?: PageStructure | null;
// Additional typed fields
category?: string | null;
tags?: string[];
document_version?: string | null;
abstract_text?: string | null;
// Custom fields from postprocessors
[key: string]: unknown;
}
Error Handling
Error Classes
import {
  KreuzbergError,
  ParsingError,
  OcrError, // Note: spelled "OcrError", not "OCRError"
  ValidationError,
  MissingDependencyError,
  CacheError,
  ImageProcessingError,
  PluginError,
  ErrorCode,
} from "@kreuzberg/node";
Error Hierarchy:
- KreuzbergError - Base class for all Kreuzberg errors
- ParsingError - Document format invalid or corrupted
- OcrError - OCR processing failed
- ValidationError - Extraction validation failed
- MissingDependencyError - Required dependency unavailable
- CacheError - Cache operation failed
- ImageProcessingError - Image extraction or processing failed
- PluginError - Plugin registration or execution failed
Error Diagnostics
import {
  classifyError,
  extractFile,
  getErrorCodeDescription,
  getErrorCodeName,
  getLastErrorCode,
  getLastPanicContext,
} from "@kreuzberg/node";
try {
  const result = await extractFile("document.pdf");
} catch (error) {
  // Under strict TypeScript settings `error` is `unknown`, so narrow it before reading `.message`
  const message = error instanceof Error ? error.message : String(error);
  const classification = classifyError(message);
  console.log(`Error code: ${getErrorCodeName(classification.code)}`);
  console.log(`Description: ${getErrorCodeDescription(classification.code)}`);
  console.log(`Confidence: ${classification.confidence}`);
}
ErrorCode Enum
enum ErrorCode {
Success = 0,
GenericError = 1,
Panic = 2,
InvalidArgument = 3,
IoError = 4,
ParsingError = 5,
OcrError = 6,
MissingDependency = 7,
}
Plugin System
Post-Processors
Custom post-processors enrich extraction results. If a post-processor encounters an error, the extraction itself does not fail.
registerPostProcessor(processor): void
Register a custom post-processor.
import { registerPostProcessor, extractFile } from "@kreuzberg/node";
const processor = {
  name() {
    return "my_processor";
  },
  async process(result) {
    // Enrich result with custom metadata
    result.metadata["custom_field"] = "value";
    return result;
  },
  processingStage() {
    return "late"; // 'early', 'middle', or 'late'
  },
  async initialize() {
    // Called once when registered
  },
  async shutdown() {
    // Called when unregistered
  },
};
registerPostProcessor(processor);
const result = await extractFile("document.pdf");
unregisterPostProcessor(name): void
Remove a registered post-processor.
import { unregisterPostProcessor } from "@kreuzberg/node";
unregisterPostProcessor("my_processor");
listPostProcessors(): string[]
List all registered post-processor names.
import { listPostProcessors } from "@kreuzberg/node";
const processors = listPostProcessors();
console.log("Registered processors:", processors);
clearPostProcessors(): void
Unregister all post-processors.
import { clearPostProcessors } from "@kreuzberg/node";
clearPostProcessors();
Validators
Custom validators check extraction results and fail the extraction if validation fails (unlike post-processors).
registerValidator(validator): void
Register a custom validator.
import { registerValidator, extractFile } from "@kreuzberg/node";
const validator = {
  name() {
    return "content_length_validator";
  },
  validate(result) {
    if (result.content.length < 10) {
      throw new Error("Content too short");
    }
  },
  priority() {
    return 100; // Higher = runs first
  },
  shouldValidate(result) {
    return result.mimeType === "application/pdf"; // Conditional validation
  },
  async initialize() {
    // Called once when registered
  },
  async shutdown() {
    // Called when unregistered
  },
};
registerValidator(validator);
const result = await extractFile("document.pdf");
unregisterValidator(name): void
Remove a registered validator.
import { unregisterValidator } from "@kreuzberg/node";
unregisterValidator("content_length_validator");
listValidators(): string[]
List all registered validator names.
import { listValidators } from "@kreuzberg/node";
const validators = listValidators();
clearValidators(): void
Unregister all validators.
import { clearValidators } from "@kreuzberg/node";
clearValidators();
OCR Backends
Custom OCR backends can be registered to handle image text extraction.
registerOcrBackend(backend): void
Register a custom OCR backend.
import { registerOcrBackend, extractFile } from "@kreuzberg/node";
const backend = {
  name() {
    return "my-ocr";
  },
  supportedLanguages() {
    return ["eng", "deu", "fra"];
  },
  async processImage(imageBytes, language) {
    // imageBytes: Uint8Array or Base64 string
    const buffer =
      typeof imageBytes === "string" ? Buffer.from(imageBytes, "base64") : Buffer.from(imageBytes);
    // Process and extract text
    return {
      content: "extracted text",
      mime_type: "text/plain",
      metadata: { confidence: 0.95, language },
      tables: [],
    };
  },
  async initialize() {
    // Load models, setup resources
  },
  async shutdown() {
    // Cleanup resources
  },
};
registerOcrBackend(backend);
GutenOcrBackend
Built-in OCR backend implementation using Guten-OCR.
import { GutenOcrBackend, registerOcrBackend, extractFile } from "@kreuzberg/node";
const backend = new GutenOcrBackend();
await backend.initialize();
registerOcrBackend(backend);
const result = await extractFile("scanned.pdf", null, {
  ocr: { backend: "guten-ocr", language: "eng" },
});
unregisterOcrBackend(name): void
Remove a registered OCR backend.
import { unregisterOcrBackend } from "@kreuzberg/node";
unregisterOcrBackend("my-ocr");
listOcrBackends(): string[]
List all registered OCR backend names.
import { listOcrBackends } from "@kreuzberg/node";
const backends = listOcrBackends();
clearOcrBackends(): void
Unregister all OCR backends.
import { clearOcrBackends } from "@kreuzberg/node";
clearOcrBackends();
MIME Type Utilities
detectMimeType(data): string | null
Detect MIME type from file content (magic bytes).
import { detectMimeType } from "@kreuzberg/node";
import { readFileSync } from "fs";
const data = readFileSync("document");
const mimeType = detectMimeType(data);
console.log(`Detected MIME type: ${mimeType}`);
detectMimeTypeFromPath(filePath): string | null
Detect MIME type from file extension.
import { detectMimeTypeFromPath } from "@kreuzberg/node";
const mimeType = detectMimeTypeFromPath("document.pdf");
console.log(`MIME type: ${mimeType}`); // 'application/pdf'
getExtensionsForMime(mimeType): string[]
Get file extensions for a MIME type.
import { getExtensionsForMime } from "@kreuzberg/node";
const extensions = getExtensionsForMime("application/pdf");
console.log(`Extensions: ${extensions}`); // ['.pdf']
validateMimeType(mimeType): boolean
Check if a MIME type is valid.
import { validateMimeType } from "@kreuzberg/node";
if (validateMimeType("application/pdf")) {
console.log("Valid MIME type");
}
Configuration Loading
ExtractionConfig.fromFile(filePath): ExtractionConfig
Load extraction configuration from a file (JSON, YAML, or TOML).
import { ExtractionConfig, extractFile } from "@kreuzberg/node";
const config = ExtractionConfig.fromFile("./kreuzberg.toml");
const result = await extractFile("document.pdf", null, config);
ExtractionConfig.discover(): ExtractionConfig | null
Auto-discover extraction configuration file in current and parent directories.
import { ExtractionConfig, extractFile } from "@kreuzberg/node";
// Searches for kreuzberg.{toml,yaml,json} in current directory and parents
const config = ExtractionConfig.discover();
if (config) {
  const result = await extractFile("document.pdf", null, config);
}
Embeddings
getEmbeddingPreset(name): EmbeddingPreset | null
Get a named embedding model preset.
import { getEmbeddingPreset } from "@kreuzberg/node";
const preset = getEmbeddingPreset("default");
if (preset) {
  console.log(`Model: ${preset.modelName}`);
  console.log(`Dimensions: ${preset.dimensions}`);
}
listEmbeddingPresets(): string[]
List all available embedding presets.
import { listEmbeddingPresets } from "@kreuzberg/node";
const presets = listEmbeddingPresets();
console.log("Available presets:", presets);
EmbeddingPreset
Type definition for embedding model presets.
interface EmbeddingPreset {
name: string; // Preset name (e.g., "fast", "balanced", "quality", "multilingual")
chunkSize: number; // Recommended chunk size in characters
overlap: number; // Recommended overlap in characters
modelName: string; // Model identifier (e.g., "AllMiniLML6V2Q", "BGEBaseENV15")
dimensions: number; // Embedding vector dimensions
description: string; // Human-readable description
}
Plugin Protocols
PostProcessorProtocol
Interface for custom post-processors.
interface PostProcessorProtocol {
name(): string;
process(result: ExtractionResult): ExtractionResult | Promise<ExtractionResult>;
processingStage?(): ProcessingStage; // 'early' | 'middle' | 'late'
initialize?(): void | Promise<void>;
shutdown?(): void | Promise<void>;
}
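To make the protocol concrete, here is a small object that satisfies it. This is a sketch only: `ResultLike` and `PostProcessorLike` are local stand-ins so the snippet runs without the package, and `addWordCount` and the `custom_word_count` metadata key are hypothetical names chosen for illustration.

```typescript
// Local stand-ins mirroring the shapes used by PostProcessorProtocol.
interface ResultLike {
  content: string;
  metadata: Record<string, unknown>;
}
interface PostProcessorLike {
  name(): string;
  process(result: ResultLike): ResultLike | Promise<ResultLike>;
  processingStage?(): "early" | "middle" | "late";
}

// Hypothetical enrichment step: counts whitespace-delimited words and
// stores the count under a custom metadata key.
function addWordCount(result: ResultLike): ResultLike {
  const words = result.content.split(/\s+/).filter((word) => word.length > 0);
  return { ...result, metadata: { ...result.metadata, custom_word_count: words.length } };
}

// The optional lifecycle hooks (initialize, shutdown) are omitted here.
const wordCountProcessor: PostProcessorLike = {
  name: () => "word_count",
  process: addWordCount,
  processingStage: () => "late",
};

const enriched = addWordCount({ content: "three short words", metadata: {} });
console.log(wordCountProcessor.name(), enriched.metadata["custom_word_count"]);
```

Keeping the enrichment logic in a plain function makes it testable without registering anything; in real code the object would be passed to `registerPostProcessor`.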
ValidatorProtocol
Interface for custom validators.
interface ValidatorProtocol {
name(): string;
validate(result: ExtractionResult): void | Promise<void>;
priority?(): number; // Higher = runs first
shouldValidate?(result: ExtractionResult): boolean;
initialize?(): void | Promise<void>;
shutdown?(): void | Promise<void>;
}
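The interplay of `priority()` and `shouldValidate()` can be exercised locally with a small runner. The sketch below is for testing validators in isolation and is not how the library schedules them internally. `runValidators` is a hypothetical helper, and it assumes a missing `priority()` counts as 0 and a missing `shouldValidate()` counts as true.

```typescript
// Local stand-ins mirroring the shapes used by ValidatorProtocol (sync only).
interface ResultLike {
  content: string;
  mimeType: string;
}
interface ValidatorLike {
  name(): string;
  validate(result: ResultLike): void;
  priority?(): number;
  shouldValidate?(result: ResultLike): boolean;
}

// Assumption: validators without priority() are treated as priority 0.
function priorityOf(validator: ValidatorLike): number {
  return validator.priority ? validator.priority() : 0;
}

// Hypothetical runner: highest priority first, skipping validators that opt out.
// Returns the names of the validators that actually ran.
function runValidators(validators: ValidatorLike[], result: ResultLike): string[] {
  const ordered = validators.slice().sort((a, b) => priorityOf(b) - priorityOf(a));
  const ran: string[] = [];
  for (const validator of ordered) {
    if (validator.shouldValidate && !validator.shouldValidate(result)) continue;
    validator.validate(result); // throws on failure, which aborts the run
    ran.push(validator.name());
  }
  return ran;
}

const sampleValidators: ValidatorLike[] = [
  { name: () => "default_priority", validate: () => {} },
  {
    name: () => "pdf_only",
    validate: () => {},
    priority: () => 10,
    shouldValidate: (r) => r.mimeType === "application/pdf",
  },
  {
    name: () => "non_empty",
    validate: (r) => {
      if (r.content.length === 0) throw new Error("Content is empty");
    },
    priority: () => 100,
  },
];
const ranOrder = runValidators(sampleValidators, { content: "hello", mimeType: "text/plain" });
console.log(ranOrder);
```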
OcrBackendProtocol
Interface for custom OCR backends.
interface OcrBackendProtocol {
name(): string;
supportedLanguages(): string[];
processImage(
imageBytes: Uint8Array | string,
language: string,
): Promise<{
content: string;
mime_type: string;
metadata: Record<string, unknown>;
tables: unknown[];
}>;
initialize?(): void | Promise<void>;
shutdown?(): void | Promise<void>;
}
Supported Document Formats
- Documents: PDF, DOCX, PPTX, XLSX, DOC, PPT
- Text: Markdown, Plain Text, XML, JSON, YAML, TOML
- Web: HTML (converted to Markdown)
- Email: EML, MSG
- Images: PNG, JPEG, TIFF (with OCR support)
- Archives: ZIP, TAR, GZIP (file listing)
Registry Functions
Document Extractors
import {
  listDocumentExtractors,
  unregisterDocumentExtractor,
  clearDocumentExtractors,
} from "@kreuzberg/node";
// List registered extractors
const extractors = listDocumentExtractors();
// Unregister a specific extractor
unregisterDocumentExtractor("pdf");
// Clear all extractors
clearDocumentExtractors();
Type Exports
All types are exported from @kreuzberg/node:
export type {
Chunk,
ChunkingConfig,
ExtractionConfig,
ExtractionResult,
ExtractedImage,
KeywordConfig,
LanguageDetectionConfig,
OcrBackendProtocol,
OcrConfig,
PageContent,
PageExtractionConfig,
PdfConfig,
PostProcessorProtocol,
Table,
TokenReductionConfig,
ValidatorProtocol,
WorkerPool,
WorkerPoolStats,
EmbeddingPreset,
// ... and many more
};
Best Practices
- Use batch APIs for multiple documents: batchExtractFiles() provides superior performance vs. calling extractFile() in a loop.
- Enable chunking for RAG/vector DB: Set the chunking config to automatically break documents into overlapping chunks.
- Use worker pools for high-concurrency scenarios: Distribute CPU-bound work across multiple threads for 4+ concurrent extractions.
- Configure language detection: Enable automatic language detection for multilingual documents.
- Register validators early: Set up validators before calling extraction functions to catch quality issues immediately.
- Use specific MIME types: Provide explicit MIME types when available to avoid detection overhead.
- Clean up resources: Always call closeWorkerPool() when done to prevent resource leaks.
- Handle extraction errors gracefully: Catch specific error types (ParsingError, OcrError, etc.) for appropriate error handling.
Version
Package Version: 4.2.14