skills/kreuzberg/SKILL.md

---
name: kreuzberg
description: >-
  Extract text, tables, metadata, and images from 91+ document formats
  (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg.
  Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript,
  Rust, or CLI. Covers installation, extraction (sync/async), configuration
  (OCR, chunking, output format), batch processing, error handling, and plugins.
license: Elastic-2.0
metadata:
  author: kreuzberg-dev
  version: "1.0"
  repository: https://github.com/kreuzberg-dev/kreuzberg
---

# Kreuzberg Document Extraction

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:

- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)

## Installation

### Python

```bash
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr]    # EasyOCR
```

### Node.js

```bash
npm install @kreuzberg/node
```

### Rust

```toml
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
#           embeddings, language-detection, keywords-yake, keywords-rake
```

### CLI

```bash
# Download from GitHub releases, or:
cargo install kreuzberg-cli
```

## Quick Start

### Python (Async)

```python
from kreuzberg import extract_file

result = await extract_file("document.pdf")
print(result.content)       # extracted text
print(result.metadata)      # document metadata
print(result.tables)        # extracted tables
```

### Python (Sync)

```python
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
```

### Node.js

```typescript
import { extractFile } from "@kreuzberg/node";

const result = await extractFile("document.pdf");
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
```

### Node.js (Sync)

```typescript
import { extractFileSync } from "@kreuzberg/node";

const result = extractFileSync("document.pdf");
```

### Rust (Async)

```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
```

### Rust (Sync) — requires `tokio-runtime` feature

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

### CLI

```bash
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
```

## Configuration

All languages use the same configuration structure with language-appropriate naming conventions.

### Python (snake_case)

```python
from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

result = await extract_file("document.pdf", config=config)
```

### Node.js (camelCase)

```typescript
import { extractFile, type ExtractionConfig } from "@kreuzberg/node";

const config: ExtractionConfig = {
  ocr: { backend: "tesseract", language: "eng" },
  pdfOptions: { passwords: ["secret123"] },
  chunking: { maxChars: 1000, maxOverlap: 200 },
  outputFormat: "markdown",
};

const result = await extractFile("document.pdf", null, config);
```

### Rust (snake_case)

```rust
use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;
```

### Config File (TOML)

```toml
output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]
```

```bash
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
```

## Batch Processing

### Python

```python
from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")
```

### Node.js

```typescript
import { batchExtractFiles } from "@kreuzberg/node";

const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]);
```

### Rust — requires `tokio-runtime` feature

```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
```

### CLI

```bash
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
```

## OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

### Backends

- **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
- **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
- **PaddleOCR** (Python only): Bundled since 4.8.5, no extra install needed. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
- **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.

### Language Codes

```python
config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed
```

### Force OCR

```python
config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable
```

## ExtractionResult Fields

| Field        | Python                      | Node.js                    | Rust                        | Description                                   |
| ------------ | --------------------------- | -------------------------- | --------------------------- | --------------------------------------------- |
| Text content | `result.content`            | `result.content`           | `result.content`            | Extracted text (str/String)                   |
| MIME type    | `result.mime_type`          | `result.mimeType`          | `result.mime_type`          | Input document MIME type                      |
| Metadata     | `result.metadata`           | `result.metadata`          | `result.metadata`           | Document metadata (dict/object/HashMap)       |
| Tables       | `result.tables`             | `result.tables`            | `result.tables`             | Extracted tables with cells + markdown        |
| Languages    | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled)               |
| Chunks       | `result.chunks`             | `result.chunks`            | `result.chunks`             | Text chunks (if chunking enabled)             |
| Images       | `result.images`             | `result.images`            | `result.images`             | Extracted images (if enabled)                 |
| Elements     | `result.elements`           | `result.elements`          | `result.elements`           | Semantic elements (if element_based format)   |
| Pages        | `result.pages`              | `result.pages`             | `result.pages`              | Per-page content (if page extraction enabled) |
| Keywords     | `result.keywords`           | `result.keywords`          | `result.keywords`           | Extracted keywords (if enabled)               |

## Error Handling

### Python

```python
from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
```

### Node.js

```typescript
import {
  extractFile,
  KreuzbergError,
  ParsingError,
  OcrError,
  ValidationError,
  MissingDependencyError,
} from "@kreuzberg/node";

try {
  const result = await extractFile("file.pdf");
} catch (e) {
  if (e instanceof ParsingError) {
    /* ... */
  } else if (e instanceof OcrError) {
    /* ... */
  } else if (e instanceof ValidationError) {
    /* ... */
  } else if (e instanceof KreuzbergError) {
    /* ... */
  }
}
```

### Rust

```rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}
```

## Common Pitfalls

1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip).
7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).

## Supported Formats (Summary)

| Category          | Extensions                                                                                                                                                  |
| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **PDF**           | `.pdf`                                                                                                                                                      |
| **Word**          | `.docx`, `.odt`                                                                                                                                             |
| **Spreadsheets**  | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods`                                                                                         |
| **Presentations** | `.pptx`, `.ppt`, `.ppsx`                                                                                                                                    |
| **eBooks**        | `.epub`, `.fb2`                                                                                                                                             |
| **Images**        | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
| **Markup**        | `.html`, `.htm`, `.xhtml`, `.xml`                                                                                                                           |
| **Data**          | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv`                                                                                                           |
| **Text**          | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf`                                                                                                 |
| **Email**         | `.eml`, `.msg`                                                                                                                                              |
| **Archives**      | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z`                                                                                                                        |
| **Academic**      | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff`           |

See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.

## Additional Resources

Detailed reference files for specific topics:

- **[Python API Reference](references/python-api.md)** — All functions, config classes, plugin protocols, exact signatures
- **[Node.js API Reference](references/nodejs-api.md)** — All functions, TypeScript interfaces, worker pool APIs
- **[Rust API Reference](references/rust-api.md)** — All functions with feature gates, structs, Cargo.toml examples
- **[CLI Reference](references/cli-reference.md)** — All commands, flags, config precedence, exit codes
- **[Configuration Reference](references/configuration.md)** — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- **[Supported Formats](references/supported-formats.md)** — All 85+ formats with file extensions and MIME types
- **[Advanced Features](references/advanced-features.md)** — Plugins, embeddings, MCP server, API server, security limits
- **[Other Language Bindings](references/other-bindings.md)** — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker

Full documentation: <https://docs.kreuzberg.dev>
GitHub: <https://github.com/kreuzberg-dev/kreuzberg>
Nomad changes 2026-06-01 23:40:55 +02:00			`---`
			`name: kreuzberg`
			`description: >-`
			`Extract text, tables, metadata, and images from 91+ document formats`
			`(PDF, Office, images, HTML, email, archives, academic) using Kreuzberg.`
			`Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript,`
			`Rust, or CLI. Covers installation, extraction (sync/async), configuration`
			`(OCR, chunking, output format), batch processing, error handling, and plugins.`
			`license: Elastic-2.0`
			`metadata:`
			`author: kreuzberg-dev`
			`version: "1.0"`
			`repository: https://github.com/kreuzberg-dev/kreuzberg`
			`---`

			`# Kreuzberg Document Extraction`

			`Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.`

			`Use this skill when writing code that:`

			`- Extracts text or metadata from documents`
			`- Performs OCR on scanned documents or images`
			`- Batch-processes multiple files`
			`- Configures extraction options (output format, chunking, OCR, language detection)`
			`- Implements custom plugins (post-processors, validators, OCR backends)`

			`## Installation`

			`### Python`

			```bash
			`pip install kreuzberg`
			`# Optional OCR backends:`
			`pip install kreuzberg[easyocr] # EasyOCR`
			```

			`### Node.js`

			```bash
			`npm install @kreuzberg/node`
			```

			`### Rust`

			```toml
			`# Cargo.toml`
			`[dependencies]`
			`kreuzberg = { version = "4", features = ["tokio-runtime"] }`
			`# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,`
			`# embeddings, language-detection, keywords-yake, keywords-rake`
			```

			`### CLI`

			```bash
			`# Download from GitHub releases, or:`
			`cargo install kreuzberg-cli`
			```

			`## Quick Start`

			`### Python (Async)`

			```python
			`from kreuzberg import extract_file`

			`result = await extract_file("document.pdf")`
			`print(result.content) # extracted text`
			`print(result.metadata) # document metadata`
			`print(result.tables) # extracted tables`
			```

			`### Python (Sync)`

			```python
			`from kreuzberg import extract_file_sync`

			`result = extract_file_sync("document.pdf")`
			`print(result.content)`
			```

			`### Node.js`

			```typescript
			`import { extractFile } from "@kreuzberg/node";`

			`const result = await extractFile("document.pdf");`
			`console.log(result.content);`
			`console.log(result.metadata);`
			`console.log(result.tables);`
			```

			`### Node.js (Sync)`

			```typescript
			`import { extractFileSync } from "@kreuzberg/node";`

			`const result = extractFileSync("document.pdf");`
			```

			`### Rust (Async)`

			```rust
			`use kreuzberg::{extract_file, ExtractionConfig};`

			`#[tokio::main]`
			`async fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let result = extract_file("document.pdf", None, &config).await?;`
			`println!("{}", result.content);`
			`Ok(())`
			`}`
			```

			### Rust (Sync) — requires `tokio-runtime` feature

			```rust
			`use kreuzberg::{extract_file_sync, ExtractionConfig};`

			`fn main() -> kreuzberg::Result<()> {`
			`let config = ExtractionConfig::default();`
			`let result = extract_file_sync("document.pdf", None, &config)?;`
			`println!("{}", result.content);`
			`Ok(())`
			`}`
			```

			`### CLI`

			```bash
			`kreuzberg extract document.pdf`
			`kreuzberg extract document.pdf --format json`
			`kreuzberg extract document.pdf --output-format markdown`
			```

			`## Configuration`

			`All languages use the same configuration structure with language-appropriate naming conventions.`

			`### Python (snake_case)`

			```python
			`from kreuzberg import (`
			`ExtractionConfig, OcrConfig, TesseractConfig,`
			`PdfConfig, ChunkingConfig,`
			`)`

			`config = ExtractionConfig(`
			`ocr=OcrConfig(`
			`backend="tesseract",`
			`language="eng",`
			`tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),`
			`),`
			`pdf_options=PdfConfig(passwords=["secret123"]),`
			`chunking=ChunkingConfig(max_chars=1000, max_overlap=200),`
			`output_format="markdown",`
			`)`

			`result = await extract_file("document.pdf", config=config)`
			```

			`### Node.js (camelCase)`

			```typescript
			`import { extractFile, type ExtractionConfig } from "@kreuzberg/node";`

			`const config: ExtractionConfig = {`
			`ocr: { backend: "tesseract", language: "eng" },`
			`pdfOptions: { passwords: ["secret123"] },`
			`chunking: { maxChars: 1000, maxOverlap: 200 },`
			`outputFormat: "markdown",`
			`};`

			`const result = await extractFile("document.pdf", null, config);`
			```

			`### Rust (snake_case)`

			```rust
			`use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};`

			`let config = ExtractionConfig {`
			`ocr: Some(OcrConfig {`
			`backend: "tesseract".into(),`
			`language: "eng".into(),`
			`..Default::default()`
			`}),`
			`chunking: Some(ChunkingConfig {`
			`max_characters: 1000,`
			`overlap: 200,`
			`..Default::default()`
			`}),`
			`output_format: OutputFormat::Markdown,`
			`..Default::default()`
			`};`

			`let result = extract_file("document.pdf", None, &config).await?;`
			```

			`### Config File (TOML)`

			```toml
			`output_format = "markdown"`

			`[ocr]`
			`backend = "tesseract"`
			`language = "eng"`

			`[chunking]`
			`max_chars = 1000`
			`max_overlap = 200`

			`[pdf_options]`
			`passwords = ["secret123"]`
			```

			```bash
			`# CLI: auto-discovers kreuzberg.toml in current/parent directories`
			`kreuzberg extract doc.pdf`
			`# or explicit:`
			`kreuzberg extract doc.pdf --config kreuzberg.toml`
			`kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'`
			```

			`## Batch Processing`

			`### Python`

			```python
			`from kreuzberg import batch_extract_files, batch_extract_files_sync`

			`# Async`
			`results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])`

			`# Sync`
			`results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])`

			`for result in results:`
			`print(f"{len(result.content)} chars extracted")`
			```

			`### Node.js`

			```typescript
			`import { batchExtractFiles } from "@kreuzberg/node";`

			`const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]);`
			```

			### Rust — requires `tokio-runtime` feature

			```rust
			`use kreuzberg::{batch_extract_file, ExtractionConfig};`

			`let config = ExtractionConfig::default();`
			`let paths = vec!["doc1.pdf", "doc2.docx"];`
			`let results = batch_extract_file(paths, &config).await?;`
			```

			`### CLI`

			```bash
			`kreuzberg batch *.pdf --format json`
			`kreuzberg batch docs/*.docx --output-format markdown`
			```

			`## OCR`

			`OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).`

			`### Backends`

			`- Tesseract (default): Built-in native binding. All Tesseract languages supported.`
			- EasyOCR (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
			- PaddleOCR (Python only): Bundled since 4.8.5, no extra install needed. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
			- Guten (Node.js only): Built-in OCR backend via `GutenOcrBackend`.

			`### Language Codes`

			```python
			`config = ExtractionConfig(ocr=OcrConfig(language="eng")) # English`
			`config = ExtractionConfig(ocr=OcrConfig(language="eng+deu")) # Multiple`
			`config = ExtractionConfig(ocr=OcrConfig(language="all")) # All installed`
			```

			`### Force OCR`

			```python
			`config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable`
			```

			`## ExtractionResult Fields`

			`\| Field \| Python \| Node.js \| Rust \| Description \|`
			`\| ------------ \| --------------------------- \| -------------------------- \| --------------------------- \| --------------------------------------------- \|`
			\| Text content \| `result.content` \| `result.content` \| `result.content` \| Extracted text (str/String) \|
			\| MIME type \| `result.mime_type` \| `result.mimeType` \| `result.mime_type` \| Input document MIME type \|
			\| Metadata \| `result.metadata` \| `result.metadata` \| `result.metadata` \| Document metadata (dict/object/HashMap) \|
			\| Tables \| `result.tables` \| `result.tables` \| `result.tables` \| Extracted tables with cells + markdown \|
			\| Languages \| `result.detected_languages` \| `result.detectedLanguages` \| `result.detected_languages` \| Detected languages (if enabled) \|
			\| Chunks \| `result.chunks` \| `result.chunks` \| `result.chunks` \| Text chunks (if chunking enabled) \|
			\| Images \| `result.images` \| `result.images` \| `result.images` \| Extracted images (if enabled) \|
			\| Elements \| `result.elements` \| `result.elements` \| `result.elements` \| Semantic elements (if element_based format) \|
			\| Pages \| `result.pages` \| `result.pages` \| `result.pages` \| Per-page content (if page extraction enabled) \|
			\| Keywords \| `result.keywords` \| `result.keywords` \| `result.keywords` \| Extracted keywords (if enabled) \|

			`## Error Handling`

			`### Python`

			```python
			`from kreuzberg import (`
			`extract_file_sync, KreuzbergError, ParsingError,`
			`OCRError, ValidationError, MissingDependencyError,`
			`)`

			`try:`
			`result = extract_file_sync("file.pdf")`
			`except ParsingError as e:`
			`print(f"Failed to parse: {e}")`
			`except OCRError as e:`
			`print(f"OCR failed: {e}")`
			`except ValidationError as e:`
			`print(f"Invalid input: {e}")`
			`except MissingDependencyError as e:`
			`print(f"Missing dependency: {e}")`
			`except KreuzbergError as e:`
			`print(f"Extraction failed: {e}")`
			```

			`### Node.js`

			```typescript
			`import {`
			`extractFile,`
			`KreuzbergError,`
			`ParsingError,`
			`OcrError,`
			`ValidationError,`
			`MissingDependencyError,`
			`} from "@kreuzberg/node";`

			`try {`
			`const result = await extractFile("file.pdf");`
			`} catch (e) {`
			`if (e instanceof ParsingError) {`
			`/* ... */`
			`} else if (e instanceof OcrError) {`
			`/* ... */`
			`} else if (e instanceof ValidationError) {`
			`/* ... */`
			`} else if (e instanceof KreuzbergError) {`
			`/* ... */`
			`}`
			`}`
			```

			`### Rust`

			```rust
			`use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};`

			`let config = ExtractionConfig::default();`
			`match extract_file("file.pdf", None, &config).await {`
			`Ok(result) => println!("{}", result.content),`
			`Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),`
			`Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),`
			`Err(e) => eprintln!("Error: {e}"),`
			`}`
			```

			`## Common Pitfalls`

			1. Python ChunkingConfig fields: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
			2. Rust extract_file signature: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
			3. Rust feature gates: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
			4. Rust async context: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
			5. CLI --format vs --output-format: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
			6. Node.js extractFile signature: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip).
			7. Python detect_mime_type: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
			8. Config file field names: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).

			`## Supported Formats (Summary)`

			`\| Category \| Extensions \|`
			`\| ----------------- \| ----------------------------------------------------------------------------------------------------------------------------------------------------------- \|`
			\| PDF \| `.pdf` \|
			\| Word \| `.docx`, `.odt` \|
			\| Spreadsheets \| `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` \|
			\| Presentations \| `.pptx`, `.ppt`, `.ppsx` \|
			\| eBooks \| `.epub`, `.fb2` \|
			\| Images \| `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` \|
			\| Markup \| `.html`, `.htm`, `.xhtml`, `.xml` \|
			\| Data \| `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` \|
			\| Text \| `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` \|
			\| Email \| `.eml`, `.msg` \|
			\| Archives \| `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` \|
			\| Academic \| `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff` \|

			`See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.`

			`## Additional Resources`

			`Detailed reference files for specific topics:`

			`- [Python API Reference](references/python-api.md) — All functions, config classes, plugin protocols, exact signatures`
			`- [Node.js API Reference](references/nodejs-api.md) — All functions, TypeScript interfaces, worker pool APIs`
			`- [Rust API Reference](references/rust-api.md) — All functions with feature gates, structs, Cargo.toml examples`
			`- [CLI Reference](references/cli-reference.md) — All commands, flags, config precedence, exit codes`
			`- [Configuration Reference](references/configuration.md) — TOML/YAML/JSON formats, auto-discovery, env vars, full schema`
			`- [Supported Formats](references/supported-formats.md) — All 85+ formats with file extensions and MIME types`
			`- [Advanced Features](references/advanced-features.md) — Plugins, embeddings, MCP server, API server, security limits`
			`- [Other Language Bindings](references/other-bindings.md) — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker`

			`Full documentation: <https://docs.kreuzberg.dev>`
			`GitHub: <https://github.com/kreuzberg-dev/kreuzberg>`