# Node.js/TypeScript API Reference

## Overview

**Package**: `@kreuzberg/node`, a high-performance TypeScript SDK built on a Rust core for document intelligence and content extraction.

Supports both **ESM** (`import`) and **CommonJS** (`require`):

```typescript
// ESM
import { extractFile, batchExtractFiles } from "@kreuzberg/node";

// CommonJS
const { extractFile, batchExtractFiles } = require("@kreuzberg/node");
```

**Current Version**: 4.2.14

---

## Core Extraction Functions

Every extraction function produces an `ExtractionResult` (batch variants return an array of them) containing the extracted content, metadata, tables, and optional chunks and images.

### Single File Extraction

#### `extractFile(filePath, mimeType?, config?): Promise<ExtractionResult>`

Extract content from a single file asynchronously.

```typescript
import { extractFile } from "@kreuzberg/node";

// Auto-detect MIME type from file extension
const result = await extractFile("document.pdf");
console.log(result.content);

// Explicit MIME type
const result2 = await extractFile("document.pdf", "application/pdf");

// With configuration
const result3 = await extractFile("document.pdf", null, {
  chunking: {
    maxChars: 1000,
    maxOverlap: 200,
  },
});
```

**Parameters**:

- `filePath: string` - Path to the file to extract
- `mimeType?: string | null` - Optional MIME type hint (auto-detected when `null`)
- `config?: ExtractionConfig` - Optional extraction configuration

**Returns**: `Promise<ExtractionResult>`

**Throws**: `ParsingError`, `OcrError`, `ValidationError`, `KreuzbergError`

#### `extractFileSync(filePath, mimeType?, config?): ExtractionResult`

Extract content from a single file synchronously.

```typescript
import { extractFileSync } from "@kreuzberg/node";

const result = extractFileSync("document.pdf");
console.log(result.content);
```

**Parameters**: Same as `extractFile()`

**Returns**: `ExtractionResult`

---

### Raw Bytes Extraction

#### `extractBytes(data, mimeType, config?): Promise<ExtractionResult>`

Extract content from raw bytes (`Buffer` or `Uint8Array`) asynchronously.

```typescript
import { extractBytes } from "@kreuzberg/node";
import { readFile } from "fs/promises";

const data = await readFile("document.pdf");
const result = await extractBytes(data, "application/pdf");
console.log(result.content);
```

**Parameters**:

- `data: Buffer | Uint8Array` - Raw file content
- `mimeType: string` - MIME type (required)
- `config?: ExtractionConfig` - Optional configuration

**Returns**: `Promise<ExtractionResult>`

#### `extractBytesSync(data, mimeType, config?): ExtractionResult`

Extract content from raw bytes synchronously.

```typescript
import { extractBytesSync } from "@kreuzberg/node";
import { readFileSync } from "fs";

const data = readFileSync("document.pdf");
const result = extractBytesSync(data, "application/pdf");
```

**Parameters**: Same as `extractBytes()`

**Returns**: `ExtractionResult`

---

### Batch Extraction (Recommended)

When processing multiple documents, prefer the batch APIs: they extract inputs in parallel and manage memory better than calling the single-file functions in a loop.

#### `batchExtractFiles(paths, config?): Promise<ExtractionResult[]>`

Extract content from multiple files in parallel (asynchronous).

```typescript
import { batchExtractFiles } from "@kreuzberg/node";

const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
const results = await batchExtractFiles(files);

results.forEach((result, i) => {
  console.log(`${files[i]}: ${result.content.substring(0, 100)}...`);
});
```

**Parameters**:

- `paths: string[]` - Array of file paths
- `config?: ExtractionConfig` - Configuration (applied to all files)

**Returns**: `Promise<ExtractionResult[]>` - Results in the same order as the input paths

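
Because batch results come back in input order, paths and results can be paired by index. The sketch below is one way to do that; `ResultLike` and `pairWithPaths` are hypothetical names introduced for illustration (only the `content` field is taken from this reference), and the placeholder array stands in for what `await batchExtractFiles(files)` would return.

```typescript
// Minimal stand-in for the part of ExtractionResult used here (assumption).
interface ResultLike {
  content: string;
}

// Hypothetical helper: pair each input path with the result at the same index.
function pairWithPaths<T extends ResultLike>(paths: string[], results: T[]): Map<string, T> {
  if (paths.length !== results.length) {
    throw new Error(`Expected ${paths.length} results, got ${results.length}`);
  }
  const byPath = new Map<string, T>();
  paths.forEach((path, i) => byPath.set(path, results[i]));
  return byPath;
}

// Placeholder data standing in for `await batchExtractFiles(files)`.
const files = ["doc1.pdf", "doc2.docx"];
const fakeResults: ResultLike[] = [{ content: "first" }, { content: "second" }];
const lookup = pairWithPaths(files, fakeResults);
```

A `Map` keyed by path keeps later lookups independent of array position, at the cost of collapsing duplicate paths into one entry.
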
#### `batchExtractFilesSync(paths, config?): ExtractionResult[]`

Extract content from multiple files synchronously.

```typescript
import { batchExtractFilesSync } from "@kreuzberg/node";

const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
const results = batchExtractFilesSync(files);
```

**Parameters**: Same as `batchExtractFiles()`

**Returns**: `ExtractionResult[]`

#### `batchExtractBytes(dataList, mimeTypes, config?): Promise<ExtractionResult[]>`

Extract content from multiple byte arrays in parallel (asynchronous).

```typescript
import { batchExtractBytes } from "@kreuzberg/node";
import { readFile } from "fs/promises";

const files = ["doc1.pdf", "doc2.docx"];
const dataList = await Promise.all(files.map((f) => readFile(f)));
const mimeTypes = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];

const results = await batchExtractBytes(dataList, mimeTypes);
```

**Parameters**:

- `dataList: Uint8Array[]` - Array of file contents
- `mimeTypes: string[]` - MIME types (one per item; must match the length of `dataList`)
- `config?: ExtractionConfig` - Configuration (applied to all items)

**Returns**: `Promise<ExtractionResult[]>`

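
Since `dataList` and `mimeTypes` must have the same length, a pre-flight check can surface a mismatch with a clear message before the call. This is a sketch only: `assertParallelArrays` is a hypothetical helper, not part of `@kreuzberg/node`, and how the library itself reports a mismatch is not stated in this reference.

```typescript
// Hypothetical pre-flight check; not part of @kreuzberg/node.
function assertParallelArrays(dataList: Uint8Array[], mimeTypes: string[]): void {
  if (dataList.length !== mimeTypes.length) {
    throw new RangeError(
      `dataList has ${dataList.length} items but mimeTypes has ${mimeTypes.length}`,
    );
  }
  mimeTypes.forEach((mime, i) => {
    if (mime.trim() === "") {
      throw new TypeError(`mimeTypes[${i}] is empty`);
    }
  });
}

// Two payloads, two MIME types: passes silently.
assertParallelArrays(
  [new Uint8Array([0x25, 0x50, 0x44, 0x46]), new Uint8Array([0x50, 0x4b])],
  ["application/pdf", "application/zip"],
);
```
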
#### `batchExtractBytesSync(dataList, mimeTypes, config?): ExtractionResult[]`

Extract content from multiple byte arrays synchronously.

```typescript
import { batchExtractBytesSync } from "@kreuzberg/node";
import { readFileSync } from "fs";

const dataList = ["doc1.pdf", "doc2.docx"].map((f) => readFileSync(f));
const mimeTypes = [
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
];

const results = batchExtractBytesSync(dataList, mimeTypes);
```

**Parameters**: Same as `batchExtractBytes()`

**Returns**: `ExtractionResult[]`

#### `batchExtractFilesWithConfigs(paths, fileConfigs, config?): Promise<ExtractionResult[]>`

Extract multiple files with per-file configuration overrides (asynchronous).

```typescript
import { batchExtractFilesWithConfigs } from "@kreuzberg/node";

const results = await batchExtractFilesWithConfigs(
  ["report.pdf", "scanned.pdf"],
  // One entry per path: null keeps the batch defaults for that file
  [null, { forceOcr: true, ocr: { backend: "tesseract", language: "deu" } }],
);
```

**Parameters**:

- `paths: string[]` - File paths
- `fileConfigs: (FileExtractionConfig | null)[]` - Per-file configs, one per path (`null` = use batch defaults)
- `config?: ExtractionConfig` - Batch-level configuration

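
The override rule (a `null` entry or an omitted field falls back to the batch-level value) can be illustrated with a small model. This is a sketch under assumptions, not the library's code: `BatchConfigLike`, `FileConfigLike`, and `effectiveConfig` are hypothetical, only a few fields are modelled, and whether nested objects such as `ocr` are merged deeply or replaced wholesale is not stated in this reference.

```typescript
// Trimmed-down stand-ins for the documented config shapes (assumption).
interface BatchConfigLike {
  forceOcr?: boolean;
  outputFormat?: "plain" | "markdown" | "djot" | "html";
  useCache?: boolean; // batch-level only, so absent from the per-file type
}
type FileConfigLike = Omit<BatchConfigLike, "useCache">;

// Hypothetical helper: compute the settings that would apply to one file.
function effectiveConfig(batch: BatchConfigLike, perFile: FileConfigLike | null): BatchConfigLike {
  const merged: BatchConfigLike = { ...batch };
  if (perFile === null) return merged; // null = batch defaults unchanged
  // Only fields that are actually set override the batch value.
  if (perFile.forceOcr !== undefined) merged.forceOcr = perFile.forceOcr;
  if (perFile.outputFormat !== undefined) merged.outputFormat = perFile.outputFormat;
  return merged;
}

const batchDefaults: BatchConfigLike = { forceOcr: false, outputFormat: "markdown", useCache: true };
const forReport = effectiveConfig(batchDefaults, null);
const forScan = effectiveConfig(batchDefaults, { forceOcr: true });
```

Here `forScan` turns OCR on while still inheriting `outputFormat` and `useCache` from the batch configuration.
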
#### `batchExtractFilesWithConfigsSync(paths, fileConfigs, config?): ExtractionResult[]`

Synchronous variant of `batchExtractFilesWithConfigs()`.

#### `batchExtractBytesWithConfigs(dataList, mimeTypes, fileConfigs, config?): Promise<ExtractionResult[]>`

Extract multiple byte arrays with per-file overrides (asynchronous).

#### `batchExtractBytesWithConfigsSync(dataList, mimeTypes, fileConfigs, config?): ExtractionResult[]`

Synchronous variant of `batchExtractBytesWithConfigs()`.

---

## Worker Pool APIs

Worker pools enable concurrent extraction using Node.js worker threads for CPU-bound processing.

### `createWorkerPool(size?): WorkerPool`

Create a worker pool for concurrent extraction.

```typescript
import { createWorkerPool } from "@kreuzberg/node";

// Create pool with default size (number of CPU cores)
const pool = createWorkerPool();

// Create pool with specific size
const pool4 = createWorkerPool(4);
```

**Parameters**:

- `size?: number` - Number of workers (defaults to CPU core count)

**Returns**: `WorkerPool` - Opaque handle for use with worker extraction functions

### `extractFileInWorker(pool, filePath, mimeType?, config?): Promise<ExtractionResult>`

Extract a single file using a worker from the pool.

```typescript
import { createWorkerPool, extractFileInWorker, closeWorkerPool } from "@kreuzberg/node";

const pool = createWorkerPool(4);

try {
  const files = ["doc1.pdf", "doc2.docx", "doc3.xlsx"];
  const results = await Promise.all(files.map((f) => extractFileInWorker(pool, f)));

  results.forEach((r, i) => {
    console.log(`${files[i]}: ${r.content.substring(0, 100)}...`);
  });
} finally {
  await closeWorkerPool(pool);
}
```

**Parameters**:

- `pool: WorkerPool` - Worker pool instance
- `filePath: string` - File path
- `mimeType?: string | null` - Optional MIME type
- `config?: ExtractionConfig` - Optional configuration

**Returns**: `Promise<ExtractionResult>`

### `batchExtractFilesInWorker(pool, paths, config?): Promise<ExtractionResult[]>`

Extract multiple files using the worker pool for concurrent processing.

```typescript
import { createWorkerPool, batchExtractFilesInWorker, closeWorkerPool } from "@kreuzberg/node";

// Hypothetical helper (not part of the SDK): read the first dollar amount in the text.
function extractAmount(text: string): number {
  const match = text.match(/\$\s*(\d+(?:\.\d+)?)/);
  return match ? Number(match[1]) : 0;
}

const pool = createWorkerPool(4);

try {
  const files = ["invoice1.pdf", "invoice2.pdf", "invoice3.pdf"];
  const results = await batchExtractFilesInWorker(pool, files, {
    ocr: { backend: "tesseract", language: "eng" },
  });

  const total = results.reduce((sum, r) => sum + extractAmount(r.content), 0);
  console.log(`Total: $${total}`);
} finally {
  await closeWorkerPool(pool);
}
```

**Parameters**:

- `pool: WorkerPool` - Worker pool instance
- `paths: string[]` - File paths
- `config?: ExtractionConfig` - Configuration (applied to all files)

**Returns**: `Promise<ExtractionResult[]>`

### `getWorkerPoolStats(pool): WorkerPoolStats`

Get statistics about a worker pool.

```typescript
import { createWorkerPool, getWorkerPoolStats } from "@kreuzberg/node";

const pool = createWorkerPool(4);
const stats = getWorkerPoolStats(pool);

console.log(`Pool size: ${stats.size}`);
console.log(`Active workers: ${stats.activeWorkers}`);
console.log(`Queued tasks: ${stats.queuedTasks}`);
```

**Parameters**:

- `pool: WorkerPool` - Worker pool instance

**Returns**: `WorkerPoolStats`

### `closeWorkerPool(pool): Promise<void>`

Close a worker pool and shut down all worker threads.

```typescript
import { createWorkerPool, closeWorkerPool } from "@kreuzberg/node";

const pool = createWorkerPool(4);

try {
  // Use pool
} finally {
  await closeWorkerPool(pool);
}
```

**Parameters**:

- `pool: WorkerPool` - Worker pool instance to close

**Returns**: `Promise<void>`

---

## Configuration Interface

### `ExtractionConfig`

Main configuration object controlling extraction behavior.

```typescript
interface ExtractionConfig {
  // Caching and processing
  useCache?: boolean; // Default: true
  enableQualityProcessing?: boolean; // Default: false

  // OCR configuration
  ocr?: OcrConfig; // OCR settings
  forceOcr?: boolean; // Default: false

  // Document processing
  chunking?: ChunkingConfig; // Break into chunks
  images?: ImageExtractionConfig; // Image extraction
  pdfOptions?: PdfConfig; // PDF-specific options
  tokenReduction?: TokenReductionConfig; // Token optimization
  languageDetection?: LanguageDetectionConfig; // Language detection
  postprocessor?: PostProcessorConfig; // Post-processing
  htmlOptions?: HtmlConversionOptions; // HTML conversion
  keywords?: KeywordConfig; // Keyword extraction
  pages?: PageExtractionConfig; // Page extraction

  // Output control
  maxConcurrentExtractions?: number; // Default: 4
  outputFormat?: "plain" | "markdown" | "djot" | "html"; // Default: 'plain'
  resultFormat?: "unified" | "element_based"; // Default: 'unified'
}
```

### `FileExtractionConfig`

Per-file overrides for batch operations. All fields are optional (omitted = use batch default).

```typescript
interface FileExtractionConfig {
  enableQualityProcessing?: boolean;
  ocr?: OcrConfig;
  forceOcr?: boolean;
  chunking?: ChunkingConfig;
  images?: ImageExtractionConfig;
  pdfOptions?: PdfConfig;
  tokenReduction?: TokenReductionConfig;
  languageDetection?: LanguageDetectionConfig;
  pages?: PageExtractionConfig;
  keywords?: KeywordConfig;
  postprocessor?: PostProcessorConfig;
  outputFormat?: "plain" | "markdown" | "djot" | "html";
  resultFormat?: "unified" | "element_based";
  includeDocumentStructure?: boolean;
}
```

Excluded (batch-level only): `maxConcurrentExtractions`, `useCache`, `securityLimits`.

### `ChunkingConfig`

Configuration for breaking documents into chunks (useful for RAG and vector databases).

```typescript
interface ChunkingConfig {
  maxChars?: number; // Max characters per chunk (default: 4096)
  maxOverlap?: number; // Overlap between chunks (default: 512)
  chunkSize?: number; // Alternative unit (mutually exclusive with maxChars)
  chunkOverlap?: number; // Alternative unit (mutually exclusive with maxOverlap)
  preset?: string; // Named preset ('default', 'aggressive', 'minimal')
  embedding?: Record<string, unknown>; // Embedding config
  enabled?: boolean; // Enable chunking (default: true when config provided)
}
```

**Key Point**: Use `maxChars` and `maxOverlap`, NOT `maxCharacters` or `overlap`.

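
To make the relationship between `maxChars` and `maxOverlap` concrete, here is a deliberately naive fixed-window chunker. It is an illustration only: `naiveChunks` is a hypothetical function, it counts JavaScript string units rather than whatever unit the Rust core uses, and the library's real chunker may well split on sentence or paragraph boundaries instead.

```typescript
// Naive fixed-window chunker (illustration only; requires maxOverlap < maxChars).
function naiveChunks(text: string, maxChars: number, maxOverlap: number): string[] {
  if (maxChars <= 0 || maxOverlap < 0 || maxOverlap >= maxChars) {
    throw new RangeError("require maxChars > 0 and 0 <= maxOverlap < maxChars");
  }
  const step = maxChars - maxOverlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + maxChars));
    if (start + maxChars >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sampleText = "abcdefghij"; // 10 characters
const pieces = naiveChunks(sampleText, 4, 1);
// pieces is ["abcd", "defg", "ghij"]: each chunk repeats the last character of the previous one
```

With the documented defaults (4096 and 512) each window would advance by 3584 characters under this model.
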
### `OcrConfig`

Configuration for optical character recognition.

```typescript
interface OcrConfig {
  backend: string; // OCR backend name (e.g., 'tesseract')
  language?: string; // Language code (e.g., 'eng', 'deu')
  tesseractConfig?: TesseractConfig;
}

interface TesseractConfig {
  psm?: number; // Page Segmentation Mode (0-13)
  enableTableDetection?: boolean;
  tesseditCharWhitelist?: string; // Character whitelist
}
```

### `ImageExtractionConfig`

Configuration for extracting and optimizing images.

```typescript
interface ImageExtractionConfig {
  extractImages?: boolean; // Default: true
  targetDpi?: number; // Target DPI (default: 150)
  maxImageDimension?: number; // Max width/height in pixels (default: 2000)
  autoAdjustDpi?: boolean; // Auto-adjust DPI (default: true)
  minDpi?: number; // Minimum DPI (default: 72)
  maxDpi?: number; // Maximum DPI (default: 300)
}
```

### `PdfConfig`

PDF-specific extraction options.

```typescript
interface PdfConfig {
  extractImages?: boolean; // Default: true
  passwords?: string[]; // Passwords for encrypted PDFs
  extractMetadata?: boolean; // Default: true
  hierarchy?: HierarchyConfig; // Hierarchy extraction
}
```

### `LanguageDetectionConfig`

Configuration for automatic language detection.

```typescript
interface LanguageDetectionConfig {
  enabled?: boolean; // Default: true
  minConfidence?: number; // Threshold 0.0-1.0 (default: 0.5)
  detectMultiple?: boolean; // Detect multiple languages (default: false)
}
```

### `TokenReductionConfig`

Configuration for optimizing token usage.

```typescript
interface TokenReductionConfig {
  mode?: string; // 'aggressive' or 'conservative' (default: 'conservative')
  preserveImportantWords?: boolean; // Default: true
}
```

### `KeywordConfig`

Configuration for keyword extraction.

```typescript
interface KeywordConfig {
  algorithm?: "yake" | "rake"; // Default: 'yake'
  maxKeywords?: number; // Maximum keywords (default: 10)
  minScore?: number; // Minimum relevance score (default: 0.1)
  ngramRange?: [number, number]; // N-gram range (default: [1, 3])
  language?: string; // Language code (default: 'en')
  yakeParams?: YakeParams;
  rakeParams?: RakeParams;
}
```

### `PageExtractionConfig`

Configuration for page-level content tracking.

```typescript
interface PageExtractionConfig {
  extractPages?: boolean; // Extract as separate pages array
  insertPageMarkers?: boolean; // Insert page markers in content
  markerFormat?: string; // Marker format with {page_num} placeholder
}
```

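
As a rough picture of what a `markerFormat` string describes, the sketch below substitutes a page number for the `{page_num}` placeholder. `renderPageMarker` is a hypothetical helper and the sample format string is made up; only the placeholder name comes from this reference, and the library performs the real substitution internally.

```typescript
// Hypothetical renderer: fill the {page_num} placeholder of a markerFormat string.
function renderPageMarker(markerFormat: string, pageNumber: number): string {
  // split/join replaces every occurrence without relying on String.prototype.replaceAll
  return markerFormat.split("{page_num}").join(String(pageNumber));
}

// The format below is an invented example, not a documented default.
const pageMarker = renderPageMarker("\n<!-- page {page_num} -->\n", 3);
```
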
### `HtmlConversionOptions`

Configuration for HTML to Markdown conversion.

```typescript
interface HtmlConversionOptions {
  headingStyle?: "atx" | "underlined" | "atx_closed";
  listIndentType?: "spaces" | "tabs";
  listIndentWidth?: number;
  bullets?: string;
  strongEmSymbol?: string;
  escapeAsterisks?: boolean;
  escapeUnderscores?: boolean;
  escapeMisc?: boolean;
  escapeAscii?: boolean;
  codeLanguage?: string;
  autolinks?: boolean;
  defaultTitle?: boolean;
  brInTables?: boolean;
  hocrSpatialTables?: boolean;
  highlightStyle?: "double_equal" | "html" | "bold" | "none";
  extractMetadata?: boolean;
  whitespaceMode?: "normalized" | "strict";
  stripNewlines?: boolean;
  wrap?: boolean;
  wrapWidth?: number;
  convertAsInline?: boolean;
  subSymbol?: string;
  supSymbol?: string;
  newlineStyle?: "spaces" | "backslash";
  codeBlockStyle?: "indented" | "backticks" | "tildes";
  keepInlineImagesIn?: string[];
  encoding?: string;
  debug?: boolean;
  stripTags?: string[];
  preserveTags?: string[];
  preprocessing?: HtmlPreprocessingOptions;
}
```

---

## Result Types

### `ExtractionResult`

Complete extraction result from document processing.

```typescript
interface ExtractionResult {
  // Main content
  content: string;

  // Document type
  mimeType: string;

  // Metadata (format-specific)
  metadata: Metadata;

  // Extracted structures
  tables: Table[];

  // Optional processed data
  detectedLanguages: string[] | null;
  chunks: Chunk[] | null; // From chunking config
  images: ExtractedImage[] | null; // From image extraction
  elements?: Element[] | null; // From element_based result format
  pages?: PageContent[] | null; // From page extraction
  extractedKeywords?: ExtractedKeyword[] | null; // Extracted keywords with scores
  qualityScore?: number | null; // Overall extraction quality (0.0-1.0)
  processingWarnings?: ProcessingWarning[]; // Non-fatal warnings from pipeline
}
```

### `Table`

Extracted table data with cell structure.

```typescript
interface Table {
  cells: string[][]; // 2D array of cell contents (rows × columns)
  markdown: string; // Markdown representation
  pageNumber: number; // 1-indexed page number
}
```

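
Because `cells` is a plain rows-by-columns array, exporting a table to another format takes only a few lines. The sketch below writes CSV with RFC 4180 style quoting; `TableLike`, `escapeCsvField`, and `tableToCsv` are hypothetical names for illustration, and the sample table is invented.

```typescript
// Stand-in for the `cells` field of Table (assumption: rows of string cells).
interface TableLike {
  cells: string[][];
}

// Quote a field only when it contains a comma, quote, or line break.
function escapeCsvField(field: string): string {
  return /[",\r\n]/.test(field) ? `"${field.replace(/"/g, '""')}"` : field;
}

// Hypothetical helper: serialise the 2D cell array as CSV text.
function tableToCsv(table: TableLike): string {
  return table.cells.map((row) => row.map(escapeCsvField).join(",")).join("\r\n");
}

const demoTable: TableLike = {
  cells: [
    ["Item", "Price"],
    ["Widget, large", "9.50"],
  ],
};
const csvText = tableToCsv(demoTable);
```
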
### `Chunk`

Text chunk for RAG or vector database indexing.

```typescript
interface Chunk {
  content: string;
  embedding?: number[] | null; // Vector embedding if computed
  metadata: ChunkMetadata;
}

interface ChunkMetadata {
  byteStart: number; // UTF-8 byte offset in original text
  byteEnd: number; // UTF-8 byte offset
  tokenCount?: number | null;
  chunkIndex: number; // Zero-based index
  totalChunks: number; // Total number of chunks
  firstPage?: number | null; // 1-indexed, if page tracking enabled
  lastPage?: number | null;
}
```

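
`byteStart` and `byteEnd` are UTF-8 byte offsets, so they cannot be passed directly to `String.prototype.slice`, which counts UTF-16 code units. The sketch below shows the difference on text with accented characters. `ChunkSpan` and `sliceByBytes` are hypothetical, the offsets are made up, and treating `byteEnd` as exclusive is an assumption this reference does not confirm.

```typescript
// Stand-in for the two offset fields of ChunkMetadata.
interface ChunkSpan {
  byteStart: number;
  byteEnd: number; // assumed exclusive
}

// Slice by UTF-8 byte offsets instead of string indices.
function sliceByBytes(text: string, span: ChunkSpan): string {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder().decode(bytes.subarray(span.byteStart, span.byteEnd));
}

const originalText = "na\u00efve caf\u00e9"; // "naïve café": 10 code units, 12 UTF-8 bytes
const viaBytes = sliceByBytes(originalText, { byteStart: 7, byteEnd: 12 });
const viaIndex = originalText.slice(7, 12);
// viaBytes is "café"; viaIndex is "afé" because the two-byte "ï" shifts every later offset
```
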
### `ExtractedImage`

Image extracted from a document.

```typescript
interface ExtractedImage {
  data: Uint8Array; // Raw image bytes
  format: string; // Format (e.g., 'png', 'jpeg', 'tiff')
  imageIndex: number; // Sequential index (0-indexed)
  pageNumber?: number | null;
  width?: number | null;
  height?: number | null;
  colorspace?: string | null;
  bitsPerComponent?: number | null;
  isMask: boolean;
  description?: string | null;
  ocrResult?: ExtractionResult | null; // OCR result if processed
}
```

### `PageContent`

Per-page content when page extraction is enabled.

```typescript
interface PageContent {
  pageNumber: number; // 1-indexed
  content: string; // Page text content
  tables: Table[]; // Tables on this page
  images: ExtractedImage[]; // Images on this page
}
```

### `ExtractedKeyword`

Extracted keyword with relevance score and position information.

```typescript
interface ExtractedKeyword {
  text: string; // Keyword text
  score: number; // Relevance score (0.0-1.0)
  algorithm: string; // Algorithm used ("tfidf", "textrank", "yake", etc.)
  positions?: number[] | null; // Character positions in content (if available)
}
```

### `ProcessingWarning`

Non-fatal warning encountered during document processing.

```typescript
interface ProcessingWarning {
  source: string; // Component that generated the warning
  message: string; // Warning message describing the issue
}
```

### `Metadata`

Extraction result metadata (format-specific).

```typescript
interface Metadata {
  // Common fields
  language?: string | null;
  date?: string | null;
  subject?: string | null;
  format_type?:
    | "pdf"
    | "excel"
    | "email"
    | "pptx"
    | "archive"
    | "image"
    | "xml"
    | "text"
    | "html"
    | "ocr";

  // PDF metadata
  title?: string | null;
  author?: string | null;
  creator?: string | null;
  producer?: string | null;
  creation_date?: string | null;
  modification_date?: string | null;
  page_count?: number;

  // Excel metadata
  sheet_count?: number;
  sheet_names?: string[];

  // Email metadata
  from_email?: string | null;
  from_name?: string | null;
  to_emails?: string[];
  cc_emails?: string[];
  bcc_emails?: string[];
  message_id?: string | null;
  attachments?: string[];

  // Image metadata
  width?: number;
  height?: number;
  exif?: Record<string, string>;

  // OCR metadata
  psm?: number;
  output_format?: string;
  table_count?: number;

  // HTML metadata
  canonical_url?: string | null;
  html_language?: string | null;
  text_direction?: "ltr" | "rtl" | "auto" | null;
  open_graph?: Record<string, string>;
  twitter_card?: Record<string, string>;
  meta_tags?: Record<string, string>;
  html_headers?: HeaderMetadata[];
  html_links?: LinkMetadata[];
  html_images?: HtmlImageMetadata[];
  structured_data?: StructuredData[];

  // Text metadata
  line_count?: number;
  word_count?: number;
  character_count?: number;
  headers?: string[] | null;
  links?: [string, string][] | null;
  code_blocks?: [string, string][] | null;

  // Page structure
  page_structure?: PageStructure | null;

  // Additional typed fields
  category?: string | null;
  tags?: string[];
  document_version?: string | null;
  abstract_text?: string | null;

  // Custom fields from postprocessors
  [key: string]: unknown;
}
```

---

## Error Handling

### Error Classes

```typescript
import {
  KreuzbergError,
  ParsingError,
  OcrError, // Note: spelled "OcrError" (PascalCase), not "OCRError"
  ValidationError,
  MissingDependencyError,
  CacheError,
  ImageProcessingError,
  PluginError,
  ErrorCode,
} from "@kreuzberg/node";
```

**Error Hierarchy**:

- `KreuzbergError` - Base class for all Kreuzberg errors
  - `ParsingError` - Document format invalid or corrupted
  - `OcrError` - OCR processing failed
  - `ValidationError` - Extraction validation failed
  - `MissingDependencyError` - Required dependency unavailable
  - `CacheError` - Cache operation failed
  - `ImageProcessingError` - Image extraction or processing failed
  - `PluginError` - Plugin registration or execution failed

### Error Diagnostics

```typescript
import {
  classifyError,
  extractFile,
  getErrorCodeDescription,
  getErrorCodeName,
  getLastErrorCode,
  getLastPanicContext,
} from "@kreuzberg/node";

try {
  const result = await extractFile("document.pdf");
} catch (error) {
  // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`
  const message = error instanceof Error ? error.message : String(error);
  const classification = classifyError(message);
  console.log(`Error code: ${getErrorCodeName(classification.code)}`);
  console.log(`Description: ${getErrorCodeDescription(classification.code)}`);
  console.log(`Confidence: ${classification.confidence}`);
}
```

### `ErrorCode` Enum

```typescript
enum ErrorCode {
  Success = 0,
  GenericError = 1,
  Panic = 2,
  InvalidArgument = 3,
  IoError = 4,
  ParsingError = 5,
  OcrError = 6,
  MissingDependency = 7,
}
```

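
A numeric TypeScript enum compiles to a two-way map, so a raw code can be turned back into its name by indexing the enum. The sketch below redeclares the enum locally so that it runs on its own; in application code you would import `ErrorCode` from `@kreuzberg/node` instead, and it is an assumption that the exported value is a regular (non-`const`) enum at runtime. `localErrorCodeName` is a hypothetical helper; the SDK's own `getErrorCodeName` serves the same purpose.

```typescript
// Local copy of the enum listed above, declared here only to keep the sketch runnable.
enum ErrorCode {
  Success = 0,
  GenericError = 1,
  Panic = 2,
  InvalidArgument = 3,
  IoError = 4,
  ParsingError = 5,
  OcrError = 6,
  MissingDependency = 7,
}

// Hypothetical helper: reverse-map a numeric code to its enum member name.
function localErrorCodeName(code: number): string {
  // Indexing a numeric enum by value yields the member name, or undefined if unknown.
  return ErrorCode[code] ?? `Unknown(${code})`;
}

const parsingName = localErrorCodeName(5); // "ParsingError"
const unknownName = localErrorCodeName(99); // "Unknown(99)"
```
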
---

## Plugin System

### Post-Processors

Custom post-processors enrich extraction results. Unlike validators, an error raised inside a post-processor does not fail the extraction.

#### `registerPostProcessor(processor): void`

Register a custom post-processor.

```typescript
import { registerPostProcessor, extractFile } from "@kreuzberg/node";

const processor = {
  name() {
    return "my_processor";
  },

  async process(result) {
    // Enrich result with custom metadata
    result.metadata["custom_field"] = "value";
    return result;
  },

  processingStage() {
    return "late"; // 'early', 'middle', or 'late'
  },

  async initialize() {
    // Called once when registered
  },

  async shutdown() {
    // Called when unregistered
  },
};

registerPostProcessor(processor);
const result = await extractFile("document.pdf");
```

#### `unregisterPostProcessor(name): void`

Remove a registered post-processor.

```typescript
import { unregisterPostProcessor } from "@kreuzberg/node";

unregisterPostProcessor("my_processor");
```

#### `listPostProcessors(): string[]`

List all registered post-processor names.

```typescript
import { listPostProcessors } from "@kreuzberg/node";

const processors = listPostProcessors();
console.log("Registered processors:", processors);
```

#### `clearPostProcessors(): void`

Unregister all post-processors.

```typescript
import { clearPostProcessors } from "@kreuzberg/node";

clearPostProcessors();
```

### Validators

Custom validators check extraction results. Unlike post-processors, a validator that throws causes the extraction to fail.

#### `registerValidator(validator): void`

Register a custom validator.

```typescript
import { registerValidator, extractFile } from "@kreuzberg/node";

const validator = {
  name() {
    return "content_length_validator";
  },

  validate(result) {
    if (result.content.length < 10) {
      throw new Error("Content too short");
    }
  },

  priority() {
    return 100; // Higher = runs first
  },

  shouldValidate(result) {
    return result.mimeType === "application/pdf"; // Conditional validation
  },

  async initialize() {
    // Called once when registered
  },

  async shutdown() {
    // Called when unregistered
  },
};

registerValidator(validator);
const result = await extractFile("document.pdf");
```

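
The interplay of `priority()` ("higher runs first") and `shouldValidate()` can be modelled in a few lines. This is only a model of the documented semantics, not the library's scheduler: `ResultLike`, `ValidatorLike`, `makeValidator`, and `applicableValidators` are hypothetical, and how ties between equal priorities are broken is not specified in this reference.

```typescript
// Stand-ins for the parts of the result and validator shapes used here (assumption).
interface ResultLike {
  content: string;
  mimeType: string;
}
interface ValidatorLike {
  name(): string;
  priority(): number;
  shouldValidate(result: ResultLike): boolean;
}

// Hypothetical model: which validators would run for a result, and in what order.
function applicableValidators(validators: ValidatorLike[], result: ResultLike): string[] {
  return validators
    .filter((v) => v.shouldValidate(result)) // skip validators that opt out
    .sort((a, b) => b.priority() - a.priority()) // higher priority first
    .map((v) => v.name());
}

// Small factory to keep the sample validators short.
function makeValidator(label: string, prio: number, mime?: string): ValidatorLike {
  return {
    name: () => label,
    priority: () => prio,
    shouldValidate: (r) => mime === undefined || r.mimeType === mime,
  };
}

const registered = [
  makeValidator("always", 10),
  makeValidator("pdf_only", 100, "application/pdf"),
  makeValidator("strict", 500),
];
const forText = applicableValidators(registered, { content: "hello", mimeType: "text/plain" });
const forPdf = applicableValidators(registered, { content: "hello", mimeType: "application/pdf" });
// forText is ["strict", "always"]; forPdf is ["strict", "pdf_only", "always"]
```
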
#### `unregisterValidator(name): void`

Remove a registered validator.

```typescript
import { unregisterValidator } from "@kreuzberg/node";

unregisterValidator("content_length_validator");
```

#### `listValidators(): string[]`

List all registered validator names.

```typescript
import { listValidators } from "@kreuzberg/node";

const validators = listValidators();
```

#### `clearValidators(): void`

Unregister all validators.

```typescript
import { clearValidators } from "@kreuzberg/node";

clearValidators();
```

### OCR Backends

Custom OCR backends can be registered to handle image text extraction.

#### `registerOcrBackend(backend): void`

Register a custom OCR backend.

```typescript
import { registerOcrBackend, extractFile } from "@kreuzberg/node";

const backend = {
  name() {
    return "my-ocr";
  },

  supportedLanguages() {
    return ["eng", "deu", "fra"];
  },

  async processImage(imageBytes, language) {
    // imageBytes: Uint8Array or Base64 string
    const buffer =
      typeof imageBytes === "string" ? Buffer.from(imageBytes, "base64") : Buffer.from(imageBytes);

    // Process and extract text
    return {
      content: "extracted text",
      mime_type: "text/plain",
      metadata: { confidence: 0.95, language },
      tables: [],
    };
  },

  async initialize() {
    // Load models, setup resources
  },

  async shutdown() {
    // Cleanup resources
  },
};

registerOcrBackend(backend);
```

#### `GutenOcrBackend`

Built-in OCR backend implementation using Guten-OCR.

```typescript
import { GutenOcrBackend, registerOcrBackend, extractFile } from "@kreuzberg/node";

const backend = new GutenOcrBackend();
await backend.initialize();
registerOcrBackend(backend);

const result = await extractFile("scanned.pdf", null, {
  ocr: { backend: "guten-ocr", language: "eng" },
});
```

#### `unregisterOcrBackend(name): void`

Remove a registered OCR backend.

```typescript
import { unregisterOcrBackend } from "@kreuzberg/node";

unregisterOcrBackend("my-ocr");
```

#### `listOcrBackends(): string[]`

List all registered OCR backend names.

```typescript
import { listOcrBackends } from "@kreuzberg/node";

const backends = listOcrBackends();
```

#### `clearOcrBackends(): void`

Unregister all OCR backends.

```typescript
import { clearOcrBackends } from "@kreuzberg/node";

clearOcrBackends();
```

---

## MIME Type Utilities
|
||
|
||
### `detectMimeType(data): string | null`
|
||
|
||
Detect MIME type from file content (magic bytes).
|
||
|
||
```typescript
|
||
import { detectMimeType } from "@kreuzberg/node";
|
||
import { readFileSync } from "fs";
|
||
|
||
const data = readFileSync("document");
|
||
const mimeType = detectMimeType(data);
|
||
console.log(`Detected MIME type: ${mimeType}`);
|
||
```

### `detectMimeTypeFromPath(filePath): string | null`

Detect MIME type from file extension.

```typescript
import { detectMimeTypeFromPath } from "@kreuzberg/node";

const mimeType = detectMimeTypeFromPath("document.pdf");
console.log(`MIME type: ${mimeType}`); // 'application/pdf'
```

### `getExtensionsForMime(mimeType): string[]`

Get file extensions for a MIME type.

```typescript
import { getExtensionsForMime } from "@kreuzberg/node";

const extensions = getExtensionsForMime("application/pdf");
console.log(`Extensions: ${extensions}`); // ['.pdf']
```

### `validateMimeType(mimeType): boolean`

Check if a MIME type is valid.

```typescript
import { validateMimeType } from "@kreuzberg/node";

if (validateMimeType("application/pdf")) {
  console.log("Valid MIME type");
}
```

---

## Configuration Loading

### `ExtractionConfig.fromFile(filePath): ExtractionConfig`

Load extraction configuration from a file (JSON, YAML, or TOML).

```typescript
import { ExtractionConfig, extractFile } from "@kreuzberg/node";

const config = ExtractionConfig.fromFile("./kreuzberg.toml");
const result = await extractFile("document.pdf", null, config);
```

### `ExtractionConfig.discover(): ExtractionConfig | null`

Auto-discover an extraction configuration file in the current directory or any of its parent directories. Returns `null` if no file is found.

```typescript
import { ExtractionConfig, extractFile } from "@kreuzberg/node";

// Searches for kreuzberg.{toml,yaml,json} in the current directory and its parents
const config = ExtractionConfig.discover();
if (config) {
  const result = await extractFile("document.pdf", null, config);
}
```
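
The upward search can be pictured with the sketch below. `findConfigPath` is a hypothetical helper, not part of the package. Trying TOML before YAML before JSON within a directory is an assumption, as is stopping at the filesystem root; the library may do either differently.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper; sketches the documented search, not the library's code.
// The order within one directory (toml, yaml, json) is an assumption.
const CANDIDATES = ["kreuzberg.toml", "kreuzberg.yaml", "kreuzberg.json"];

function findConfigPath(startDir: string): string | null {
  let dir = path.resolve(startDir);
  for (;;) {
    for (const name of CANDIDATES) {
      const candidate = path.join(dir, name);
      if (fs.existsSync(candidate)) return candidate;
    }
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the filesystem root
    dir = parent;
  }
}
```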

---

## Embeddings

### `getEmbeddingPreset(name): EmbeddingPreset | null`

Get a named embedding model preset. Returns `null` if no preset with that name exists.

```typescript
import { getEmbeddingPreset } from "@kreuzberg/node";

const preset = getEmbeddingPreset("balanced");
if (preset) {
  console.log(`Model: ${preset.modelName}`);
  console.log(`Dimensions: ${preset.dimensions}`);
}
```

### `listEmbeddingPresets(): string[]`

List all available embedding presets.

```typescript
import { listEmbeddingPresets } from "@kreuzberg/node";

const presets = listEmbeddingPresets();
console.log("Available presets:", presets);
```

### `EmbeddingPreset`

Type definition for embedding model presets.

```typescript
interface EmbeddingPreset {
  name: string; // Preset name (e.g., "fast", "balanced", "quality", "multilingual")
  chunkSize: number; // Recommended chunk size in characters
  overlap: number; // Recommended overlap in characters
  modelName: string; // Model identifier (e.g., "AllMiniLML6V2Q", "BGEBaseENV15")
  dimensions: number; // Embedding vector dimensions
  description: string; // Human-readable description
}
```
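
As a rough sketch of how `chunkSize` and `overlap` interact, the hypothetical helper below splits text into fixed-size character windows, with each window starting `chunkSize - overlap` characters after the previous one. `chunkByChars` is not part of the package, the numbers are example values rather than real preset values, and the library's own chunker may behave differently (for example at word boundaries).

```typescript
// Hypothetical helper: fixed-size character windows with overlap.
function chunkByChars(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

// Example values only; real values would come from a preset's fields.
const exampleChunks = chunkByChars("abcdefghij", 4, 1);
console.log(exampleChunks); // [ 'abcd', 'defg', 'ghij' ]
```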

---

## Plugin Protocols

### `PostProcessorProtocol`

Interface for custom post-processors.

```typescript
interface PostProcessorProtocol {
  name(): string;

  process(result: ExtractionResult): ExtractionResult | Promise<ExtractionResult>;

  processingStage?(): ProcessingStage; // 'early' | 'middle' | 'late'

  initialize?(): void | Promise<void>;

  shutdown?(): void | Promise<void>;
}
```
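
A minimal sketch of an object with this shape follows. To keep the block self-contained, `MinimalResult` is a local stand-in for `ExtractionResult` containing only the fields used here; real code would import the types from `@kreuzberg/node`. The processor name and the whitespace rules are illustrative assumptions.

```typescript
// Local stand-in for ExtractionResult; only the fields this sketch touches.
interface MinimalResult {
  content: string;
  metadata: Record<string, unknown>;
}

const whitespaceNormalizer = {
  name(): string {
    return "whitespace-normalizer"; // hypothetical processor name
  },

  processingStage(): "early" | "middle" | "late" {
    return "late"; // run after other processors have changed the content
  },

  process(result: MinimalResult): MinimalResult {
    // Collapse runs of spaces/tabs and runs of 3+ newlines, then trim.
    const content = result.content
      .replace(/[ \t]+/g, " ")
      .replace(/\n{3,}/g, "\n\n")
      .trim();
    return { ...result, content }; // return a new object, leave the input untouched
  },
};
```

`process` is synchronous here, which the interface permits; an implementation that needs I/O can return a `Promise` instead.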

### `ValidatorProtocol`

Interface for custom validators.

```typescript
interface ValidatorProtocol {
  name(): string;

  validate(result: ExtractionResult): void | Promise<void>;

  priority?(): number; // Higher = runs first

  shouldValidate?(result: ExtractionResult): boolean;

  initialize?(): void | Promise<void>;

  shutdown?(): void | Promise<void>;
}
```
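
The sketch below shows two validators and the ordering implied by "Higher = runs first". `MinimalResult` and `MinimalValidator` are local stand-ins so the block runs on its own, both validators are hypothetical, and treating a missing `priority` as 0 is an assumption rather than documented behavior.

```typescript
// Local stand-ins; real code would use ExtractionResult and ValidatorProtocol.
interface MinimalResult {
  content: string;
}

interface MinimalValidator {
  name(): string;
  validate(result: MinimalResult): void;
  priority?(): number;
}

const minLength: MinimalValidator = {
  name: () => "min-length", // hypothetical validator
  priority: () => 100,
  validate(result) {
    // A validator signals failure by throwing.
    if (result.content.trim().length < 10) {
      throw new Error("content shorter than 10 characters");
    }
  },
};

const noReplacementChars: MinimalValidator = {
  name: () => "no-replacement-chars", // hypothetical validator
  validate(result) {
    if (/\uFFFD/.test(result.content)) {
      throw new Error("content contains U+FFFD replacement characters");
    }
  },
};

// Sort descending by priority; a missing priority counts as 0 (assumption).
function runOrder(validators: MinimalValidator[]): string[] {
  return [...validators]
    .sort((a, b) => (b.priority?.() ?? 0) - (a.priority?.() ?? 0))
    .map((v) => v.name());
}
```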

### `OcrBackendProtocol`

Interface for custom OCR backends.

```typescript
interface OcrBackendProtocol {
  name(): string;

  supportedLanguages(): string[];

  processImage(
    imageBytes: Uint8Array | string,
    language: string,
  ): Promise<{
    content: string;
    mime_type: string;
    metadata: Record<string, unknown>;
    tables: unknown[];
  }>;

  initialize?(): void | Promise<void>;

  shutdown?(): void | Promise<void>;
}
```

---

## Supported Document Formats

- **Documents**: PDF, DOCX, PPTX, XLSX, DOC, PPT
- **Text**: Markdown, Plain Text, XML, JSON, YAML, TOML
- **Web**: HTML (converted to Markdown)
- **Email**: EML, MSG
- **Images**: PNG, JPEG, TIFF (with OCR support)
- **Archives**: ZIP, TAR, GZIP (file listing)

---

## Registry Functions

### Document Extractors

```typescript
import {
  listDocumentExtractors,
  unregisterDocumentExtractor,
  clearDocumentExtractors,
} from "@kreuzberg/node";

// List registered extractors
const extractors = listDocumentExtractors();

// Unregister a specific extractor
unregisterDocumentExtractor("pdf");

// Clear all extractors
clearDocumentExtractors();
```

---

## Type Exports

All types are exported from `@kreuzberg/node`:

```typescript
export type {
  Chunk,
  ChunkingConfig,
  ExtractionConfig,
  ExtractionResult,
  ExtractedImage,
  KeywordConfig,
  LanguageDetectionConfig,
  OcrBackendProtocol,
  OcrConfig,
  PageContent,
  PageExtractionConfig,
  PdfConfig,
  PostProcessorProtocol,
  Table,
  TokenReductionConfig,
  ValidatorProtocol,
  WorkerPool,
  WorkerPoolStats,
  EmbeddingPreset,
  // ... and many more
};
```

---

## Best Practices

1. **Use batch APIs for multiple documents**: `batchExtractFiles()` performs better than calling `extractFile()` in a loop.

2. **Enable chunking for RAG/vector DB**: Set the `chunking` config to split documents into overlapping chunks automatically.

3. **Use worker pools for high-concurrency scenarios**: Distribute CPU-bound work across multiple threads when running 4 or more concurrent extractions.

4. **Configure language detection**: Enable automatic language detection for multilingual documents.

5. **Register validators early**: Set up validators before calling extraction functions so that quality issues are caught immediately.

6. **Use specific MIME types**: Provide an explicit MIME type when it is known to avoid detection overhead.

7. **Clean up resources**: Always call `closeWorkerPool()` when done to prevent resource leaks.

8. **Handle extraction errors gracefully**: Catch specific error types (`ParsingError`, `OcrError`, etc.) and handle each one appropriately.

---

## Version

**Package Version**: 4.2.14