fil/skills/kreuzberg/references/nodejs-api.md
Henrik Jess Nielsen b4c07d3693
Nomad changes
2026-06-01 23:40:55 +02:00


Node.js/TypeScript API Reference

Overview

Package: @kreuzberg/node — A high-performance TypeScript SDK built on a Rust core for document intelligence and content extraction.

Supports both ESM (import) and CommonJS (require):

// ESM
import { extractFile, batchExtractFiles } from "@kreuzberg/node";

// CommonJS
const { extractFile, batchExtractFiles } = require("@kreuzberg/node");

Current Version: 4.2.14


Core Extraction Functions

All extraction functions return ExtractionResult containing extracted content, metadata, tables, and optional chunks/images.

Single File Extraction

extractFile(filePath, mimeType?, config?): Promise<ExtractionResult>

Extract content from a single file asynchronously.

import { extractFile } from "@kreuzberg/node";

// Auto-detect MIME type from file extension
const result = await extractFile("document.pdf");
console.log(result.content);

// Explicit MIME type
const result2 = await extractFile("document.pdf", "application/pdf");

// With configuration
const result3 = await extractFile("document.pdf", null, {
  chunking: {
    maxChars: 1000,
    maxOverlap: 200,
  },
});

Parameters:

  • filePath: string — Path to the file to extract
  • mimeType?: string | null — Optional MIME type hint (auto-detect if null)
  • config?: ExtractionConfig — Optional extraction configuration

Returns: Promise<ExtractionResult>

Throws: ParsingError, OcrError, ValidationError, KreuzbergError

extractFileSync(filePath, mimeType?, config?): ExtractionResult

Extract content from a single file synchronously.

import { extractFileSync } from "@kreuzberg/node";

const result = extractFileSync("document.pdf");
console.log(result.content);

Parameters: Same as extractFile()

Returns: ExtractionResult


Raw Bytes Extraction

extractBytes(data, mimeType, config?): Promise<ExtractionResult>

Extract content from raw bytes (Buffer or Uint8Array) asynchronously.

import { extractBytes } from "@kreuzberg/node";
import { readFile } from "fs/promises";

const data = await readFile("document.pdf");
const result = await extractBytes(data, "application/pdf");
console.log(result.content);

Parameters:

  • data: Buffer | Uint8Array — Raw file content
  • mimeType: string — MIME type (required)
  • config?: ExtractionConfig — Optional configuration

Returns: Promise<ExtractionResult>

extractBytesSync(data, mimeType, config?): ExtractionResult

Extract content from raw bytes synchronously.

import { extractBytesSync } from "@kreuzberg/node";
import { readFileSync } from "fs";

const data = readFileSync("document.pdf");
const result = extractBytesSync(data, "application/pdf");

Parameters: Same as extractBytes()

Returns: ExtractionResult


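Batch Extraction

When processing multiple documents, prefer the batch APIs: they extract in parallel and offer better throughput and memory management than calling the single-file functions in a loop.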

batchExtractFiles(paths, config?): Promise<ExtractionResult[]>

Extract content from multiple files in parallel (asynchronous).

import { batchExtractFiles } from "@kreuzberg/node";

const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
const results = await batchExtractFiles(files);

results.forEach((result, i) => {
  console.log(`${files[i]}: ${result.content.substring(0, 100)}...`);
});

Parameters:

  • paths: string[] — Array of file paths
  • config?: ExtractionConfig — Configuration (applied to all files)

Returns: Promise<ExtractionResult[]> — Results in same order as input
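Because results come back in the same order as the input paths, the two arrays can be paired by index. The sketch below shows one way to do that. pairWithPaths is a hypothetical helper (not part of @kreuzberg/node), and plain strings stand in for ExtractionResult objects so the snippet runs on its own.

```typescript
// Hypothetical helper: pair each input path with the result at the same index.
// Relies on the documented guarantee that results keep the input order.
function pairWithPaths<T>(paths: string[], results: T[]): { path: string; result: T }[] {
  if (paths.length !== results.length) {
    throw new RangeError(`expected ${paths.length} results, got ${results.length}`);
  }
  return paths.map((path, i) => ({ path, result: results[i] as T }));
}

// Strings stand in for ExtractionResult objects in this sketch.
const pairedResults = pairWithPaths(["doc1.pdf", "doc2.docx"], ["first text", "second text"]);
```

Keeping the path next to its result avoids index bookkeeping later, for example when logging failures or writing output files.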

batchExtractFilesSync(paths, config?): ExtractionResult[]

Extract content from multiple files synchronously.

import { batchExtractFilesSync } from "@kreuzberg/node";

const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
const results = batchExtractFilesSync(files);

Parameters: Same as batchExtractFiles()

Returns: ExtractionResult[]

batchExtractBytes(dataList, mimeTypes, config?): Promise<ExtractionResult[]>

Extract content from multiple byte arrays in parallel (asynchronous).

import { batchExtractBytes } from "@kreuzberg/node";
import { readFile } from "fs/promises";

const files = ["doc1.pdf", "doc2.docx"];
const dataList = await Promise.all(files.map((f) => readFile(f)));
const mimeTypes = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];

const results = await batchExtractBytes(dataList, mimeTypes);

Parameters:

  • dataList: Uint8Array[] — Array of file contents
  • mimeTypes: string[] — MIME types (one per item, must match length)
  • config?: ExtractionConfig — Configuration (applied to all items)

Returns: Promise<ExtractionResult[]>
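Since dataList and mimeTypes must have the same length, it can help to keep each buffer together with its MIME type until the call site. The following is a minimal sketch of that idea. BytesItem and toParallelArrays are hypothetical names introduced here for illustration and are not part of @kreuzberg/node.

```typescript
// Hypothetical shape: one buffer together with its MIME type.
interface BytesItem {
  data: Uint8Array;
  mimeType: string;
}

// Split paired items into the two parallel arrays that batchExtractBytes() expects.
// Building both arrays from one list guarantees they have the same length.
function toParallelArrays(items: BytesItem[]): { dataList: Uint8Array[]; mimeTypes: string[] } {
  for (const item of items) {
    if (item.mimeType.trim() === "") {
      throw new TypeError("every item needs a MIME type");
    }
  }
  return {
    dataList: items.map((item) => item.data),
    mimeTypes: items.map((item) => item.mimeType),
  };
}

const batchInput = toParallelArrays([
  { data: new Uint8Array([0x25, 0x50, 0x44, 0x46]), mimeType: "application/pdf" },
  { data: new TextEncoder().encode("hello"), mimeType: "text/plain" },
]);
// batchInput.dataList and batchInput.mimeTypes can then be passed to batchExtractBytes().
```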

batchExtractBytesSync(dataList, mimeTypes, config?): ExtractionResult[]

Extract content from multiple byte arrays synchronously.

import { batchExtractBytesSync } from "@kreuzberg/node";
import { readFileSync } from "fs";

const dataList = ["doc1.pdf", "doc2.docx"].map((f) => readFileSync(f));
const mimeTypes = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];

const results = batchExtractBytesSync(dataList, mimeTypes);

Parameters: Same as batchExtractBytes()

Returns: ExtractionResult[]

batchExtractFilesWithConfigs(paths, fileConfigs, config?): Promise<ExtractionResult[]>

Extract multiple files with per-file configuration overrides (asynchronous).

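import { batchExtractFilesWithConfigs } from "@kreuzberg/node";

// The first file uses the batch defaults (null); the second forces German OCR
const results = await batchExtractFilesWithConfigs(
  ["report.pdf", "scanned.pdf"],
  [null, { forceOcr: true, ocr: { backend: "tesseract", language: "deu" } }],
);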

Parameters:

  • paths: string[] — File paths
  • fileConfigs: (FileExtractionConfig | null)[] — Per-file configs (null = use batch defaults)
  • config?: ExtractionConfig — Batch-level configuration

batchExtractFilesWithConfigsSync(paths, fileConfigs, config?): ExtractionResult[]

Synchronous variant.

batchExtractBytesWithConfigs(dataList, mimeTypes, fileConfigs, config?): Promise<ExtractionResult[]>

Extract multiple byte arrays with per-file overrides (asynchronous).

batchExtractBytesWithConfigsSync(dataList, mimeTypes, fileConfigs, config?): ExtractionResult[]

Synchronous variant.


Worker Pool APIs

Worker pools enable concurrent extraction using Node.js worker threads for CPU-bound processing.

createWorkerPool(size?): WorkerPool

Create a worker pool for concurrent extraction.

import { createWorkerPool } from "@kreuzberg/node";

// Create pool with default size (number of CPU cores)
const pool = createWorkerPool();

// Create pool with specific size
const pool4 = createWorkerPool(4);

Parameters:

  • size?: number — Number of workers (defaults to CPU core count)

Returns: WorkerPool — Opaque handle for use with worker extraction functions

extractFileInWorker(pool, filePath, mimeType?, config?): Promise<ExtractionResult>

Extract a single file using a worker from the pool.

import { createWorkerPool, extractFileInWorker, closeWorkerPool } from "@kreuzberg/node";

const pool = createWorkerPool(4);

try {
  const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
  const results = await Promise.all(files.map((f) => extractFileInWorker(pool, f)));

  results.forEach((r, i) => {
    console.log(`${files[i]}: ${r.content.substring(0, 100)}...`);
  });
} finally {
  await closeWorkerPool(pool);
}

Parameters:

  • pool: WorkerPool — Worker pool instance
  • filePath: string — File path
  • mimeType?: string | null — Optional MIME type
  • config?: ExtractionConfig — Optional configuration

Returns: Promise<ExtractionResult>

batchExtractFilesInWorker(pool, paths, config?): Promise<ExtractionResult[]>

Extract multiple files using the worker pool for concurrent processing.

import { createWorkerPool, batchExtractFilesInWorker, closeWorkerPool } from "@kreuzberg/node";

const pool = createWorkerPool(4);

try {
  const files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"];
  const results = await batchExtractFilesInWorker(pool, files, {
    ocr: { backend: "tesseract", language: "eng" },
  });

  // extractAmount() is a placeholder for your own function that parses a total from the text
  const total = results.reduce((sum, r) => sum + extractAmount(r.content), 0);
  console.log(`Total: $${total}`);
} finally {
  await closeWorkerPool(pool);
}

Parameters:

  • pool: WorkerPool — Worker pool instance
  • paths: string[] — File paths
  • config?: ExtractionConfig — Configuration (applied to all files)

Returns: Promise<ExtractionResult[]>

getWorkerPoolStats(pool): WorkerPoolStats

Get statistics about a worker pool.

import { createWorkerPool, getWorkerPoolStats, closeWorkerPool } from "@kreuzberg/node";

const pool = createWorkerPool(4);
const stats = getWorkerPoolStats(pool);

console.log(`Pool size: ${stats.size}`);
console.log(`Active workers: ${stats.activeWorkers}`);
console.log(`Queued tasks: ${stats.queuedTasks}`);

await closeWorkerPool(pool); // release the worker threads when done

Parameters:

  • pool: WorkerPool — Worker pool instance

Returns: WorkerPoolStats
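The three fields used in the example above (size, activeWorkers, queuedTasks) are enough to build a simple backpressure check. The sketch below is one possible policy, not library behaviour. PoolStatsLike is a local stand-in for WorkerPoolStats, and hasCapacity, poolUtilization and the maxQueuedPerWorker threshold are assumptions made for illustration.

```typescript
// Local stand-in for the WorkerPoolStats fields used in the example above.
interface PoolStatsLike {
  size: number;
  activeWorkers: number;
  queuedTasks: number;
}

// Hypothetical policy: accept more work only while the queue holds fewer than
// maxQueuedPerWorker tasks per worker.
function hasCapacity(stats: PoolStatsLike, maxQueuedPerWorker = 2): boolean {
  if (stats.size <= 0) return false;
  return stats.queuedTasks < stats.size * maxQueuedPerWorker;
}

// Fraction of workers that are currently busy, in the range 0..1.
function poolUtilization(stats: PoolStatsLike): number {
  return stats.size > 0 ? stats.activeWorkers / stats.size : 0;
}

const sampleStats: PoolStatsLike = { size: 4, activeWorkers: 4, queuedTasks: 7 };
const canSubmit = hasCapacity(sampleStats); // 7 < 4 * 2, so still true
```

A caller could poll getWorkerPoolStats() before submitting new files and wait while the check returns false.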

closeWorkerPool(pool): Promise<void>

Close a worker pool and shut down all worker threads.

import { createWorkerPool, closeWorkerPool } from "@kreuzberg/node";

const pool = createWorkerPool(4);

try {
  // Use pool
} finally {
  await closeWorkerPool(pool);
}

Parameters:

  • pool: WorkerPool — Worker pool instance to close

Returns: Promise<void>


Configuration Interface

ExtractionConfig

Main configuration object controlling extraction behavior.

interface ExtractionConfig {
  // Caching and processing
  useCache?: boolean; // Default: true
  enableQualityProcessing?: boolean; // Default: false

  // OCR configuration
  ocr?: OcrConfig; // OCR settings
  forceOcr?: boolean; // Default: false

  // Document processing
  chunking?: ChunkingConfig; // Break into chunks
  images?: ImageExtractionConfig; // Image extraction
  pdfOptions?: PdfConfig; // PDF-specific options
  tokenReduction?: TokenReductionConfig; // Token optimization
  languageDetection?: LanguageDetectionConfig; // Language detection
  postprocessor?: PostProcessorConfig; // Post-processing
  htmlOptions?: HtmlConversionOptions; // HTML conversion
  keywords?: KeywordConfig; // Keyword extraction
  pages?: PageExtractionConfig; // Page extraction

  // Output control
  maxConcurrentExtractions?: number; // Default: 4
  outputFormat?: "plain" | "markdown" | "djot" | "html"; // Default: 'plain'
  resultFormat?: "unified" | "element_based"; // Default: 'unified'
}

FileExtractionConfig

Per-file overrides for batch operations. All fields optional (omitted = use batch default).

interface FileExtractionConfig {
  enableQualityProcessing?: boolean;
  ocr?: OcrConfig;
  forceOcr?: boolean;
  chunking?: ChunkingConfig;
  images?: ImageExtractionConfig;
  pdfOptions?: PdfConfig;
  tokenReduction?: TokenReductionConfig;
  languageDetection?: LanguageDetectionConfig;
  pages?: PageExtractionConfig;
  keywords?: KeywordConfig;
  postprocessor?: PostProcessorConfig;
  outputFormat?: "plain" | "markdown" | "djot" | "html";
  resultFormat?: "unified" | "element_based";
  includeDocumentStructure?: boolean;
}

Excluded (batch-level only): maxConcurrentExtractions, useCache, securityLimits.
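As a mental model for "omitted = use batch default", the sketch below resolves an effective configuration for one file. It assumes a shallow, field-by-field override, which may not match how the library merges nested objects. BatchConfigLike, FileConfigLike and resolveEffectiveConfig are hypothetical names that mirror only a few of the documented fields.

```typescript
// Reduced stand-ins for ExtractionConfig and FileExtractionConfig.
interface BatchConfigLike {
  useCache?: boolean; // batch-level only
  forceOcr?: boolean;
  outputFormat?: "plain" | "markdown" | "djot" | "html";
}
type FileConfigLike = Omit<BatchConfigLike, "useCache">;

// Assumption: a field set on the per-file config replaces the batch value,
// and anything omitted (or a null config) falls back to the batch default.
function resolveEffectiveConfig(batch: BatchConfigLike, file: FileConfigLike | null): BatchConfigLike {
  return {
    useCache: batch.useCache, // never overridden per file
    forceOcr: file?.forceOcr ?? batch.forceOcr,
    outputFormat: file?.outputFormat ?? batch.outputFormat,
  };
}

const batchDefaults: BatchConfigLike = { useCache: true, forceOcr: false, outputFormat: "markdown" };
const scannedConfig = resolveEffectiveConfig(batchDefaults, { forceOcr: true });
const reportConfig = resolveEffectiveConfig(batchDefaults, null);
```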

ChunkingConfig

Configuration for breaking documents into chunks (useful for RAG and vector databases).

interface ChunkingConfig {
  maxChars?: number; // Max characters per chunk (default: 4096)
  maxOverlap?: number; // Overlap between chunks (default: 512)
  chunkSize?: number; // Alternative unit (mutually exclusive with maxChars)
  chunkOverlap?: number; // Alternative unit (mutually exclusive with maxOverlap)
  preset?: string; // Named preset ('default', 'aggressive', 'minimal')
  embedding?: Record<string, unknown>; // Embedding config
  enabled?: boolean; // Enable chunking (default: true when config provided)
}

Key Point: Use maxChars and maxOverlap, NOT maxCharacters or overlap.
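To get a feel for how maxChars and maxOverlap interact, the sketch below estimates how many chunks a text of a given length would produce. It assumes a plain fixed-size sliding window, which is a simplification: the actual chunker may split on different boundaries, so treat the number as a rough estimate. estimateChunkCount is a hypothetical helper, not part of @kreuzberg/node.

```typescript
// Hypothetical helper: rough chunk-count estimate for a fixed-size sliding window.
// Assumption: each chunk holds up to maxChars characters and consecutive chunks
// share maxOverlap characters. The real chunker may behave differently.
function estimateChunkCount(textLength: number, maxChars: number, maxOverlap: number): number {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  if (maxOverlap < 0 || maxOverlap >= maxChars) {
    throw new RangeError("maxOverlap must be in [0, maxChars)");
  }
  if (textLength <= 0) return 0;
  if (textLength <= maxChars) return 1;
  const stride = maxChars - maxOverlap; // new characters contributed by each extra chunk
  return 1 + Math.ceil((textLength - maxChars) / stride);
}

// 10,000 characters with maxChars 1000 and maxOverlap 200: stride is 800
const estimatedChunks = estimateChunkCount(10000, 1000, 200);
```

Under this model a larger overlap shrinks the stride, so the chunk count (and any embedding cost) grows even though maxChars stays the same.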

OcrConfig

Configuration for optical character recognition.

interface OcrConfig {
  backend: string; // OCR backend name (e.g., 'tesseract')
  language?: string; // Language code (e.g., 'eng', 'deu')
  tesseractConfig?: TesseractConfig;
}

interface TesseractConfig {
  psm?: number; // Page Segmentation Mode (0-13)
  enableTableDetection?: boolean;
  tesseditCharWhitelist?: string; // Character whitelist
}

ImageExtractionConfig

Configuration for extracting and optimizing images.

interface ImageExtractionConfig {
  extractImages?: boolean; // Default: true
  targetDpi?: number; // Target DPI (default: 150)
  maxImageDimension?: number; // Max width/height in pixels (default: 2000)
  autoAdjustDpi?: boolean; // Auto-adjust DPI (default: true)
  minDpi?: number; // Minimum DPI (default: 72)
  maxDpi?: number; // Maximum DPI (default: 300)
}

PdfConfig

PDF-specific extraction options.

interface PdfConfig {
  extractImages?: boolean; // Default: true
  passwords?: string[]; // Passwords for encrypted PDFs
  extractMetadata?: boolean; // Default: true
  hierarchy?: HierarchyConfig; // Hierarchy extraction
}

LanguageDetectionConfig

Configuration for automatic language detection.

interface LanguageDetectionConfig {
  enabled?: boolean; // Default: true
  minConfidence?: number; // Threshold 0.0-1.0 (default: 0.5)
  detectMultiple?: boolean; // Detect multiple languages (default: false)
}

TokenReductionConfig

Configuration for optimizing token usage.

interface TokenReductionConfig {
  mode?: string; // 'aggressive' or 'conservative' (default: 'conservative')
  preserveImportantWords?: boolean; // Default: true
}

KeywordConfig

Configuration for keyword extraction.

interface KeywordConfig {
  algorithm?: "yake" | "rake"; // Default: 'yake'
  maxKeywords?: number; // Maximum keywords (default: 10)
  minScore?: number; // Minimum relevance score (default: 0.1)
  ngramRange?: [number, number]; // N-gram range (default: [1, 3])
  language?: string; // Language code (default: 'en')
  yakeParams?: YakeParams;
  rakeParams?: RakeParams;
}

PageExtractionConfig

Configuration for page-level content tracking.

interface PageExtractionConfig {
  extractPages?: boolean; // Extract as separate pages array
  insertPageMarkers?: boolean; // Insert page markers in content
  markerFormat?: string; // Marker format with {page_num} placeholder
}
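The {page_num} placeholder in markerFormat can be pictured as a simple substitution. The sketch below is an illustration under that assumption. renderPageMarker and the marker string are hypothetical, and the library's default marker format is not shown in this reference.

```typescript
// Hypothetical helper: expand every {page_num} placeholder in a marker format.
function renderPageMarker(markerFormat: string, pageNumber: number): string {
  return markerFormat.split("{page_num}").join(String(pageNumber));
}

// An example format chosen for this sketch; any string containing {page_num} works.
const exampleMarker = renderPageMarker("--- page {page_num} ---", 3);
```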

HtmlConversionOptions

Configuration for HTML to Markdown conversion.

interface HtmlConversionOptions {
  headingStyle?: "atx" | "underlined" | "atx_closed";
  listIndentType?: "spaces" | "tabs";
  listIndentWidth?: number;
  bullets?: string;
  strongEmSymbol?: string;
  escapeAsterisks?: boolean;
  escapeUnderscores?: boolean;
  escapeMisc?: boolean;
  escapeAscii?: boolean;
  codeLanguage?: string;
  autolinks?: boolean;
  defaultTitle?: boolean;
  brInTables?: boolean;
  hocrSpatialTables?: boolean;
  highlightStyle?: "double_equal" | "html" | "bold" | "none";
  extractMetadata?: boolean;
  whitespaceMode?: "normalized" | "strict";
  stripNewlines?: boolean;
  wrap?: boolean;
  wrapWidth?: number;
  convertAsInline?: boolean;
  subSymbol?: string;
  supSymbol?: string;
  newlineStyle?: "spaces" | "backslash";
  codeBlockStyle?: "indented" | "backticks" | "tildes";
  keepInlineImagesIn?: string[];
  encoding?: string;
  debug?: boolean;
  stripTags?: string[];
  preserveTags?: string[];
  preprocessing?: HtmlPreprocessingOptions;
}

Result Types

ExtractionResult

Complete extraction result from document processing.

interface ExtractionResult {
  // Main content
  content: string;

  // Document type
  mimeType: string;

  // Metadata (format-specific)
  metadata: Metadata;

  // Extracted structures
  tables: Table[];

  // Optional processed data
  detectedLanguages: string[] | null;
  chunks: Chunk[] | null; // From chunking config
  images: ExtractedImage[] | null; // From image extraction
  elements?: Element[] | null; // From element_based result format
  pages?: PageContent[] | null; // From page extraction
  extractedKeywords?: ExtractedKeyword[] | null; // Extracted keywords with scores
  qualityScore?: number | null; // Overall extraction quality (0.0-1.0)
  processingWarnings?: ProcessingWarning[]; // Non-fatal warnings from pipeline
}
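Several ExtractionResult fields are null unless the matching feature is enabled, so consumers need to guard before reading them. The following sketch shows null-safe access on a reduced stand-in type. ResultSummaryInput and summarizeResult are hypothetical names introduced for illustration.

```typescript
// Reduced stand-in for ExtractionResult with only the fields read below.
interface ResultSummaryInput {
  content: string;
  tables: unknown[];
  chunks: unknown[] | null;
  images: unknown[] | null;
  detectedLanguages: string[] | null;
}

// Build a one-line summary, treating null collections as empty.
function summarizeResult(r: ResultSummaryInput): string {
  const parts = [
    `${r.content.length} chars`,
    `${r.tables.length} tables`,
    `${r.chunks?.length ?? 0} chunks`,
    `${r.images?.length ?? 0} images`,
  ];
  if (r.detectedLanguages && r.detectedLanguages.length > 0) {
    parts.push(`languages: ${r.detectedLanguages.join(", ")}`);
  }
  return parts.join(", ");
}

const resultSummary = summarizeResult({
  content: "hello",
  tables: [],
  chunks: null,
  images: null,
  detectedLanguages: ["eng"],
});
```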

Table

Extracted table data with cell structure.

interface Table {
  cells: string[][]; // 2D array of cell contents (rows × columns)
  markdown: string; // Markdown representation
  pageNumber: number; // 1-indexed page number
}
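The cells array is convenient to turn into row objects when a table has a header row. The sketch below does this for a reduced stand-in type. It assumes the first row holds the column names, which the library does not guarantee. TableLike and tableToRecords are hypothetical names.

```typescript
// Local stand-in for the documented Table shape (only the field used here).
interface TableLike {
  cells: string[][];
}

// Treat the first row as the header and map each remaining row to an object.
// Assumption: the first row holds column names.
function tableToRecords(table: TableLike): Record<string, string>[] {
  const [header, ...rows] = table.cells;
  if (!header) return [];
  return rows.map((row) => {
    const record: Record<string, string> = {};
    header.forEach((column, i) => {
      record[column] = row[i] ?? ""; // pad short rows with empty strings
    });
    return record;
  });
}

const tableRecords = tableToRecords({
  cells: [
    ["item", "qty"],
    ["bolt", "12"],
    ["nut"],
  ],
});
```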

Chunk

Text chunk for RAG or vector database indexing.

interface Chunk {
  content: string;
  embedding?: number[] | null; // Vector embedding if computed
  metadata: ChunkMetadata;
}

interface ChunkMetadata {
  byteStart: number; // UTF-8 byte offset in original text
  byteEnd: number; // UTF-8 byte offset
  tokenCount?: number | null;
  chunkIndex: number; // Zero-based index
  totalChunks: number; // Total number of chunks
  firstPage?: number | null; // 1-indexed, if page tracking enabled
  lastPage?: number | null;
}
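byteStart and byteEnd are UTF-8 byte offsets, while JavaScript string indices count UTF-16 code units, so String.prototype.slice gives wrong results as soon as the text contains non-ASCII characters. The sketch below slices by bytes instead. It assumes byteEnd is exclusive. ByteRange and sliceByByteRange are hypothetical names.

```typescript
// Local stand-in for the two ChunkMetadata fields used here.
interface ByteRange {
  byteStart: number;
  byteEnd: number;
}

// Slice the original text by UTF-8 byte offsets rather than by string index.
// Assumption: byteEnd is exclusive.
function sliceByByteRange(sourceText: string, range: ByteRange): string {
  const bytes = new TextEncoder().encode(sourceText);
  return new TextDecoder().decode(bytes.subarray(range.byteStart, range.byteEnd));
}

const greeting = "Grüße aus Kreuzberg";
// "Grüße" occupies 7 bytes in UTF-8 because "ü" and "ß" take 2 bytes each.
const firstWord = sliceByByteRange(greeting, { byteStart: 0, byteEnd: 7 });
```

For comparison, greeting.slice(0, 7) would return seven characters rather than seven bytes and overshoot the intended range.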

ExtractedImage

Image extracted from document.

interface ExtractedImage {
  data: Uint8Array; // Raw image bytes
  format: string; // Format (e.g., 'png', 'jpeg', 'tiff')
  imageIndex: number; // Sequential index (0-indexed)
  pageNumber?: number | null;
  width?: number | null;
  height?: number | null;
  colorspace?: string | null;
  bitsPerComponent?: number | null;
  isMask: boolean;
  description?: string | null;
  ocrResult?: ExtractionResult | null; // OCR result if processed
}

PageContent

Per-page content when page extraction is enabled.

interface PageContent {
  pageNumber: number; // 1-indexed
  content: string; // Page text content
  tables: Table[]; // Tables on this page
  images: ExtractedImage[]; // Images on this page
}

ExtractedKeyword

Extracted keyword with relevance score and position information.

interface ExtractedKeyword {
  text: string; // Keyword text
  score: number; // Relevance score (0.0-1.0)
  algorithm: string; // Algorithm used ("tfidf", "textrank", "yake", etc.)
  positions?: number[] | null; // Character positions in content (if available)
}

ProcessingWarning

Non-fatal warning encountered during document processing.

interface ProcessingWarning {
  source: string; // Component that generated the warning
  message: string; // Warning message describing the issue
}

Metadata

Extraction result metadata (format-specific).

interface Metadata {
  // Common fields
  language?: string | null;
  date?: string | null;
  subject?: string | null;
  format_type?:
    | "pdf"
    | "excel"
    | "email"
    | "pptx"
    | "archive"
    | "image"
    | "xml"
    | "text"
    | "html"
    | "ocr";

  // PDF metadata
  title?: string | null;
  author?: string | null;
  creator?: string | null;
  producer?: string | null;
  creation_date?: string | null;
  modification_date?: string | null;
  page_count?: number;

  // Excel metadata
  sheet_count?: number;
  sheet_names?: string[];

  // Email metadata
  from_email?: string | null;
  from_name?: string | null;
  to_emails?: string[];
  cc_emails?: string[];
  bcc_emails?: string[];
  message_id?: string | null;
  attachments?: string[];

  // Image metadata
  width?: number;
  height?: number;
  exif?: Record<string, string>;

  // OCR metadata
  psm?: number;
  output_format?: string;
  table_count?: number;

  // HTML metadata
  canonical_url?: string | null;
  html_language?: string | null;
  text_direction?: "ltr" | "rtl" | "auto" | null;
  open_graph?: Record<string, string>;
  twitter_card?: Record<string, string>;
  meta_tags?: Record<string, string>;
  html_headers?: HeaderMetadata[];
  html_links?: LinkMetadata[];
  html_images?: HtmlImageMetadata[];
  structured_data?: StructuredData[];

  // Text metadata
  line_count?: number;
  word_count?: number;
  character_count?: number;
  headers?: string[] | null;
  links?: [string, string][] | null;
  code_blocks?: [string, string][] | null;

  // Page structure
  page_structure?: PageStructure | null;

  // Additional typed fields
  category?: string | null;
  tags?: string[];
  document_version?: string | null;
  abstract_text?: string | null;

  // Custom fields from postprocessors
  [key: string]: unknown;
}

Error Handling

Error Classes

import {
  KreuzbergError,
  ParsingError,
  OcrError, // Note: spelled "OcrError", not "OCRError"
  ValidationError,
  MissingDependencyError,
  CacheError,
  ImageProcessingError,
  PluginError,
  ErrorCode,
} from "@kreuzberg/node";

Error Hierarchy:

  • KreuzbergError — Base class for all Kreuzberg errors
    • ParsingError — Document format invalid or corrupted
    • OcrError — OCR processing failed
    • ValidationError — Extraction validation failed
    • MissingDependencyError — Required dependency unavailable
    • CacheError — Cache operation failed
    • ImageProcessingError — Image extraction or processing failed
    • PluginError — Plugin registration or execution failed

Error Diagnostics

import {
  extractFile,
  classifyError,
  getErrorCodeDescription,
  getErrorCodeName,
  getLastErrorCode,
  getLastPanicContext,
} from "@kreuzberg/node";

try {
  const result = await extractFile("document.pdf");
} catch (error) {
  // A caught value is not guaranteed to be an Error, so normalize it first
  const message = error instanceof Error ? error.message : String(error);
  const classification = classifyError(message);
  console.log(`Error code: ${getErrorCodeName(classification.code)}`);
  console.log(`Description: ${getErrorCodeDescription(classification.code)}`);
  console.log(`Confidence: ${classification.confidence}`);
}

ErrorCode Enum

enum ErrorCode {
  Success = 0,
  GenericError = 1,
  Panic = 2,
  InvalidArgument = 3,
  IoError = 4,
  ParsingError = 5,
  OcrError = 6,
  MissingDependency = 7,
}
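Error codes are useful for deciding what to do after a failure. The sketch below maps each documented code to a recovery action. The policy itself is an assumption made for illustration, not library behaviour. ErrorCodeValues is a local copy of the enum values so the snippet runs on its own, and recoveryFor is a hypothetical helper.

```typescript
// Local copy of the documented ErrorCode values so the sketch runs on its own;
// real code would import ErrorCode from "@kreuzberg/node" instead.
const ErrorCodeValues = {
  Success: 0,
  GenericError: 1,
  Panic: 2,
  InvalidArgument: 3,
  IoError: 4,
  ParsingError: 5,
  OcrError: 6,
  MissingDependency: 7,
} as const;

type RecoveryAction = "none" | "retry" | "fix-input" | "install-dependency" | "report";

// Hypothetical policy: only I/O failures are treated as transient.
function recoveryFor(code: number): RecoveryAction {
  switch (code) {
    case ErrorCodeValues.Success:
      return "none";
    case ErrorCodeValues.IoError:
      return "retry";
    case ErrorCodeValues.InvalidArgument:
    case ErrorCodeValues.ParsingError:
      return "fix-input";
    case ErrorCodeValues.MissingDependency:
      return "install-dependency";
    default:
      return "report"; // GenericError, Panic, OcrError and unknown codes
  }
}

const ioRecovery = recoveryFor(ErrorCodeValues.IoError);
```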

Plugin System

Post-Processors

Custom post-processors enrich extraction results. Unlike validators, an error raised inside a post-processor does not fail the extraction.

registerPostProcessor(processor): void

Register a custom post-processor.

import { registerPostProcessor, extractFile } from "@kreuzberg/node";

const processor = {
  name() {
    return "my_processor";
  },

  async process(result) {
    // Enrich result with custom metadata
    result.metadata["custom_field"] = "value";
    return result;
  },

  processingStage() {
    return "late"; // 'early', 'middle', or 'late'
  },

  async initialize() {
    // Called once when registered
  },

  async shutdown() {
    // Called when unregistered
  },
};

registerPostProcessor(processor);
const result = await extractFile("document.pdf");

unregisterPostProcessor(name): void

Remove a registered post-processor.

import { unregisterPostProcessor } from "@kreuzberg/node";

unregisterPostProcessor("my_processor");

listPostProcessors(): string[]

List all registered post-processor names.

import { listPostProcessors } from "@kreuzberg/node";

const processors = listPostProcessors();
console.log("Registered processors:", processors);

clearPostProcessors(): void

Unregister all post-processors.

import { clearPostProcessors } from "@kreuzberg/node";

clearPostProcessors();

Validators

Custom validators check extraction results and fail the extraction if validation fails (unlike post-processors).

registerValidator(validator): void

Register a custom validator.

import { registerValidator, extractFile } from "@kreuzberg/node";

const validator = {
  name() {
    return "content_length_validator";
  },

  validate(result) {
    if (result.content.length < 10) {
      throw new Error("Content too short");
    }
  },

  priority() {
    return 100; // Higher = runs first
  },

  shouldValidate(result) {
    return result.mimeType === "application/pdf"; // Conditional validation
  },

  async initialize() {
    // Called once when registered
  },

  async shutdown() {
    // Called when unregistered
  },
};

registerValidator(validator);
const result = await extractFile("document.pdf");

unregisterValidator(name): void

Remove a registered validator.

import { unregisterValidator } from "@kreuzberg/node";

unregisterValidator("content_length_validator");

listValidators(): string[]

List all registered validator names.

import { listValidators } from "@kreuzberg/node";

const validators = listValidators();

clearValidators(): void

Unregister all validators.

import { clearValidators } from "@kreuzberg/node";

clearValidators();

OCR Backends

Custom OCR backends can be registered to handle image text extraction.

registerOcrBackend(backend): void

Register a custom OCR backend.

import { registerOcrBackend, extractFile } from "@kreuzberg/node";

const backend = {
  name() {
    return "my-ocr";
  },

  supportedLanguages() {
    return ["eng", "deu", "fra"];
  },

  async processImage(imageBytes, language) {
    // imageBytes: Uint8Array or Base64 string
    const buffer =
      typeof imageBytes === "string" ? Buffer.from(imageBytes, "base64") : Buffer.from(imageBytes);

    // Process and extract text
    return {
      content: "extracted text",
      mime_type: "text/plain",
      metadata: { confidence: 0.95, language },
      tables: [],
    };
  },

  async initialize() {
    // Load models, setup resources
  },

  async shutdown() {
    // Cleanup resources
  },
};

registerOcrBackend(backend);

GutenOcrBackend

Built-in OCR backend implementation using Guten-OCR.

import { GutenOcrBackend, registerOcrBackend, extractFile } from "@kreuzberg/node";

const backend = new GutenOcrBackend();
await backend.initialize();
registerOcrBackend(backend);

const result = await extractFile("scanned.pdf", null, {
  ocr: { backend: "guten-ocr", language: "eng" },
});

unregisterOcrBackend(name): void

Remove a registered OCR backend.

import { unregisterOcrBackend } from "@kreuzberg/node";

unregisterOcrBackend("my-ocr");

listOcrBackends(): string[]

List all registered OCR backend names.

import { listOcrBackends } from "@kreuzberg/node";

const backends = listOcrBackends();

clearOcrBackends(): void

Unregister all OCR backends.

import { clearOcrBackends } from "@kreuzberg/node";

clearOcrBackends();

MIME Type Utilities

detectMimeType(data): string | null

Detect MIME type from file content (magic bytes).

import { detectMimeType } from "@kreuzberg/node";
import { readFileSync } from "fs";

const data = readFileSync("document");
const mimeType = detectMimeType(data);
console.log(`Detected MIME type: ${mimeType}`);

detectMimeTypeFromPath(filePath): string | null

Detect MIME type from file extension.

import { detectMimeTypeFromPath } from "@kreuzberg/node";

const mimeType = detectMimeTypeFromPath("document.pdf");
console.log(`MIME type: ${mimeType}`); // 'application/pdf'

getExtensionsForMime(mimeType): string[]

Get file extensions for a MIME type.

import { getExtensionsForMime } from "@kreuzberg/node";

const extensions = getExtensionsForMime("application/pdf");
console.log(`Extensions: ${extensions}`); // ['.pdf']

validateMimeType(mimeType): boolean

Check if a MIME type is valid.

import { validateMimeType } from "@kreuzberg/node";

if (validateMimeType("application/pdf")) {
  console.log("Valid MIME type");
}

Configuration Loading

ExtractionConfig.fromFile(filePath): ExtractionConfig

Load extraction configuration from a file (JSON, YAML, or TOML).

import { ExtractionConfig, extractFile } from "@kreuzberg/node";

const config = ExtractionConfig.fromFile("./kreuzberg.toml");
const result = await extractFile("document.pdf", null, config);

ExtractionConfig.discover(): ExtractionConfig | null

Auto-discover extraction configuration file in current and parent directories.

import { ExtractionConfig, extractFile } from "@kreuzberg/node";

// Searches for kreuzberg.{toml,yaml,json} in current directory and parents
const config = ExtractionConfig.discover();
if (config) {
  const result = await extractFile("document.pdf", null, config);
}

Embeddings

getEmbeddingPreset(name): EmbeddingPreset | null

Get a named embedding model preset.

import { getEmbeddingPreset } from "@kreuzberg/node";

const preset = getEmbeddingPreset("balanced");
if (preset) {
  console.log(`Model: ${preset.modelName}`);
  console.log(`Dimensions: ${preset.dimensions}`);
}

listEmbeddingPresets(): string[]

List all available embedding presets.

import { listEmbeddingPresets } from "@kreuzberg/node";

const presets = listEmbeddingPresets();
console.log("Available presets:", presets);

EmbeddingPreset

Type definition for embedding model presets.

interface EmbeddingPreset {
  name: string; // Preset name (e.g., "fast", "balanced", "quality", "multilingual")
  chunkSize: number; // Recommended chunk size in characters
  overlap: number; // Recommended overlap in characters
  modelName: string; // Model identifier (e.g., "AllMiniLML6V2Q", "BGEBaseENV15")
  dimensions: number; // Embedding vector dimensions
  description: string; // Human-readable description
}
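Since a preset carries a recommended chunk size and overlap in characters, it can be translated into the maxChars and maxOverlap fields of ChunkingConfig. The sketch below shows one way to do so. PresetLike, ChunkingLike and chunkingFromPreset are hypothetical names, and the numbers in the usage line are made up for illustration rather than taken from a real preset.

```typescript
// Reduced stand-ins for EmbeddingPreset and ChunkingConfig.
interface PresetLike {
  chunkSize: number;
  overlap: number;
}
interface ChunkingLike {
  maxChars: number;
  maxOverlap: number;
}

// Map a preset's recommendations onto chunking fields.
// The overlap is clamped so it always stays below the chunk size.
function chunkingFromPreset(preset: PresetLike): ChunkingLike {
  const maxOverlap = Math.min(preset.overlap, Math.max(preset.chunkSize - 1, 0));
  return { maxChars: preset.chunkSize, maxOverlap };
}

// Illustrative values only; a real preset would come from getEmbeddingPreset().
const presetChunking = chunkingFromPreset({ chunkSize: 512, overlap: 64 });
```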

Plugin Protocols

PostProcessorProtocol

Interface for custom post-processors.

interface PostProcessorProtocol {
  name(): string;

  process(result: ExtractionResult): ExtractionResult | Promise<ExtractionResult>;

  processingStage?(): ProcessingStage; // 'early' | 'middle' | 'late'

  initialize?(): void | Promise<void>;

  shutdown?(): void | Promise<void>;
}
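A post-processor is a plain object, so its process() method can be exercised directly before it is registered. The sketch below implements the required members on a reduced result type and calls process() by hand. ProcessableResult, wordCountProcessor and the custom_word_count key are hypothetical names chosen for this example.

```typescript
// Reduced stand-in for ExtractionResult with only the fields touched here.
interface ProcessableResult {
  content: string;
  metadata: Record<string, unknown>;
}

// Adds a whitespace-based word count under a custom metadata key.
const wordCountProcessor = {
  name(): string {
    return "word_count";
  },
  process(result: ProcessableResult): ProcessableResult {
    const words = result.content.trim().split(/\s+/).filter((w) => w.length > 0);
    // Return a copy instead of mutating the input
    return { ...result, metadata: { ...result.metadata, custom_word_count: words.length } };
  },
  processingStage(): "early" | "middle" | "late" {
    return "late";
  },
};

const enrichedResult = wordCountProcessor.process({ content: "one two  three", metadata: {} });
```

Testing the object in isolation like this keeps extraction out of the loop while the processing logic is being developed.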

ValidatorProtocol

Interface for custom validators.

interface ValidatorProtocol {
  name(): string;

  validate(result: ExtractionResult): void | Promise<void>;

  priority?(): number; // Higher = runs first

  shouldValidate?(result: ExtractionResult): boolean;

  initialize?(): void | Promise<void>;

  shutdown?(): void | Promise<void>;
}
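The interplay of priority() and shouldValidate() can be illustrated with a small ordering function. The sketch below is only a model of the documented rules ("higher = runs first", conditional validation) and is not the library's scheduler. It assumes a missing priority() counts as 0 and a missing shouldValidate() means "always". ValidatorLike, ValidatedResult and applicableValidators are hypothetical names.

```typescript
// Reduced stand-ins for ExtractionResult and ValidatorProtocol.
interface ValidatedResult {
  content: string;
  mimeType: string;
}
interface ValidatorLike {
  name(): string;
  validate(result: ValidatedResult): void | Promise<void>;
  priority?(): number;
  shouldValidate?(result: ValidatedResult): boolean;
}

// Keep validators that apply to this result, ordered from highest to lowest priority.
function applicableValidators(validators: ValidatorLike[], result: ValidatedResult): ValidatorLike[] {
  return validators
    .filter((v) => v.shouldValidate?.(result) ?? true)
    .sort((a, b) => (b.priority?.() ?? 0) - (a.priority?.() ?? 0));
}

const pdfOnly: ValidatorLike = {
  name: () => "pdf_only",
  validate: () => {},
  priority: () => 100,
  shouldValidate: (r) => r.mimeType === "application/pdf",
};
const lengthCheck: ValidatorLike = {
  name: () => "length",
  validate: (r) => {
    if (r.content.length < 10) throw new Error("Content too short");
  },
  priority: () => 50,
};
const fallbackCheck: ValidatorLike = { name: () => "fallback", validate: () => {} };

const plainTextOrder = applicableValidators([fallbackCheck, lengthCheck, pdfOnly], {
  content: "some plain text",
  mimeType: "text/plain",
}).map((v) => v.name());
```

For the plain-text input the PDF-only validator is skipped, so only the length and fallback validators remain, in that order.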

OcrBackendProtocol

Interface for custom OCR backends.

interface OcrBackendProtocol {
  name(): string;

  supportedLanguages(): string[];

  processImage(
    imageBytes: Uint8Array | string,
    language: string,
  ): Promise<{
    content: string;
    mime_type: string;
    metadata: Record<string, unknown>;
    tables: unknown[];
  }>;

  initialize?(): void | Promise<void>;

  shutdown?(): void | Promise<void>;
}

Supported Document Formats

  • Documents: PDF, DOCX, PPTX, XLSX, DOC, PPT
  • Text: Markdown, Plain Text, XML, JSON, YAML, TOML
  • Web: HTML (converted to Markdown)
  • Email: EML, MSG
  • Images: PNG, JPEG, TIFF (with OCR support)
  • Archives: ZIP, TAR, GZIP (file listing)

Registry Functions

Document Extractors

import {
  listDocumentExtractors,
  unregisterDocumentExtractor,
  clearDocumentExtractors,
} from "@kreuzberg/node";

// List registered extractors
const extractors = listDocumentExtractors();

// Unregister a specific extractor
unregisterDocumentExtractor("pdf");

// Clear all extractors
clearDocumentExtractors();

Type Exports

All types are exported from @kreuzberg/node:

export type {
  Chunk,
  ChunkingConfig,
  ExtractionConfig,
  ExtractionResult,
  ExtractedImage,
  KeywordConfig,
  LanguageDetectionConfig,
  OcrBackendProtocol,
  OcrConfig,
  PageContent,
  PageExtractionConfig,
  PdfConfig,
  PostProcessorProtocol,
  Table,
  TokenReductionConfig,
  ValidatorProtocol,
  WorkerPool,
  WorkerPoolStats,
  EmbeddingPreset,
  // ... and many more
};

Best Practices

  1. Use batch APIs for multiple documents: batchExtractFiles() extracts files in parallel, so it outperforms calling extractFile() in a loop.

  2. Enable chunking for RAG/vector DB: Set chunking config to automatically break documents into overlapping chunks.

  3. Use worker pools for high-concurrency scenarios: Distribute CPU-bound work across multiple threads for 4+ concurrent extractions.

  4. Configure language detection: Enable automatic language detection for multilingual documents.

  5. Register validators early: Set up validators before calling extraction functions to catch quality issues immediately.

  6. Use specific MIME types: Provide explicit MIME types when available to avoid detection overhead.

  7. Clean up resources: Always call closeWorkerPool() when done to prevent resource leaks.

  8. Handle extraction errors gracefully: Catch specific error types (ParsingError, OcrError, etc.) for appropriate error handling.


Version

Package Version: 4.2.14