This commit is contained in:
417
skills/kreuzberg/SKILL.md
Normal file
417
skills/kreuzberg/SKILL.md
Normal file
@@ -0,0 +1,417 @@
|
||||
---
|
||||
name: kreuzberg
|
||||
description: >-
|
||||
Extract text, tables, metadata, and images from 91+ document formats
|
||||
(PDF, Office, images, HTML, email, archives, academic) using Kreuzberg.
|
||||
Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript,
|
||||
Rust, or CLI. Covers installation, extraction (sync/async), configuration
|
||||
(OCR, chunking, output format), batch processing, error handling, and plugins.
|
||||
license: Elastic-2.0
|
||||
metadata:
|
||||
author: kreuzberg-dev
|
||||
version: "1.0"
|
||||
repository: https://github.com/kreuzberg-dev/kreuzberg
|
||||
---
|
||||
|
||||
# Kreuzberg Document Extraction
|
||||
|
||||
Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
|
||||
|
||||
Use this skill when writing code that:
|
||||
|
||||
- Extracts text or metadata from documents
|
||||
- Performs OCR on scanned documents or images
|
||||
- Batch-processes multiple files
|
||||
- Configures extraction options (output format, chunking, OCR, language detection)
|
||||
- Implements custom plugins (post-processors, validators, OCR backends)
|
||||
|
||||
## Installation
|
||||
|
||||
### Python
|
||||
|
||||
```bash
|
||||
pip install kreuzberg
|
||||
# Optional OCR backends:
|
||||
pip install kreuzberg[easyocr] # EasyOCR
|
||||
```
|
||||
|
||||
### Node.js
|
||||
|
||||
```bash
|
||||
npm install @kreuzberg/node
|
||||
```
|
||||
|
||||
### Rust
|
||||
|
||||
```toml
|
||||
# Cargo.toml
|
||||
[dependencies]
|
||||
kreuzberg = { version = "4", features = ["tokio-runtime"] }
|
||||
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
|
||||
# embeddings, language-detection, keywords-yake, keywords-rake
|
||||
```
|
||||
|
||||
### CLI
|
||||
|
||||
```bash
|
||||
# Download from GitHub releases, or:
|
||||
cargo install kreuzberg-cli
|
||||
```
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Python (Async)
|
||||
|
||||
```python
|
||||
from kreuzberg import extract_file
|
||||
|
||||
result = await extract_file("document.pdf")
|
||||
print(result.content) # extracted text
|
||||
print(result.metadata) # document metadata
|
||||
print(result.tables) # extracted tables
|
||||
```
|
||||
|
||||
### Python (Sync)
|
||||
|
||||
```python
|
||||
from kreuzberg import extract_file_sync
|
||||
|
||||
result = extract_file_sync("document.pdf")
|
||||
print(result.content)
|
||||
```
|
||||
|
||||
### Node.js
|
||||
|
||||
```typescript
|
||||
import { extractFile } from "@kreuzberg/node";
|
||||
|
||||
const result = await extractFile("document.pdf");
|
||||
console.log(result.content);
|
||||
console.log(result.metadata);
|
||||
console.log(result.tables);
|
||||
```
|
||||
|
||||
### Node.js (Sync)
|
||||
|
||||
```typescript
|
||||
import { extractFileSync } from "@kreuzberg/node";
|
||||
|
||||
const result = extractFileSync("document.pdf");
|
||||
```
|
||||
|
||||
### Rust (Async)
|
||||
|
||||
```rust
|
||||
use kreuzberg::{extract_file, ExtractionConfig};
|
||||
|
||||
#[tokio::main]
|
||||
async fn main() -> kreuzberg::Result<()> {
|
||||
let config = ExtractionConfig::default();
|
||||
let result = extract_file("document.pdf", None, &config).await?;
|
||||
println!("{}", result.content);
|
||||
Ok(())
|
||||
}
|
||||
```
|
||||
|
||||
### Rust (Sync) — requires `tokio-runtime` feature
|
||||
|
||||
```rust
|
||||
use kreuzberg::{extract_file_sync, ExtractionConfig};
|
||||
|
||||
fn main() -> kreuzberg::Result<()> {
|
||||
let config = ExtractionConfig::default();
|
||||
let result = extract_file_sync("document.pdf", None, &config)?;
|
||||
println!("{}", result.content);
|
||||
Ok(())
|
||||
}
|
||||
```
|
||||
|
||||
### CLI
|
||||
|
||||
```bash
|
||||
kreuzberg extract document.pdf
|
||||
kreuzberg extract document.pdf --format json
|
||||
kreuzberg extract document.pdf --output-format markdown
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
All languages use the same configuration structure with language-appropriate naming conventions.
|
||||
|
||||
### Python (snake_case)
|
||||
|
||||
```python
|
||||
from kreuzberg import (
|
||||
ExtractionConfig, OcrConfig, TesseractConfig,
|
||||
PdfConfig, ChunkingConfig,
|
||||
)
|
||||
|
||||
config = ExtractionConfig(
|
||||
ocr=OcrConfig(
|
||||
backend="tesseract",
|
||||
language="eng",
|
||||
tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
|
||||
),
|
||||
pdf_options=PdfConfig(passwords=["secret123"]),
|
||||
chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
|
||||
output_format="markdown",
|
||||
)
|
||||
|
||||
result = await extract_file("document.pdf", config=config)
|
||||
```
|
||||
|
||||
### Node.js (camelCase)
|
||||
|
||||
```typescript
|
||||
import { extractFile, type ExtractionConfig } from "@kreuzberg/node";
|
||||
|
||||
const config: ExtractionConfig = {
|
||||
ocr: { backend: "tesseract", language: "eng" },
|
||||
pdfOptions: { passwords: ["secret123"] },
|
||||
chunking: { maxChars: 1000, maxOverlap: 200 },
|
||||
outputFormat: "markdown",
|
||||
};
|
||||
|
||||
const result = await extractFile("document.pdf", null, config);
|
||||
```
|
||||
|
||||
### Rust (snake_case)
|
||||
|
||||
```rust
|
||||
use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
|
||||
|
||||
let config = ExtractionConfig {
|
||||
ocr: Some(OcrConfig {
|
||||
backend: "tesseract".into(),
|
||||
language: "eng".into(),
|
||||
..Default::default()
|
||||
}),
|
||||
chunking: Some(ChunkingConfig {
|
||||
max_characters: 1000,
|
||||
overlap: 200,
|
||||
..Default::default()
|
||||
}),
|
||||
output_format: OutputFormat::Markdown,
|
||||
..Default::default()
|
||||
};
|
||||
|
||||
let result = extract_file("document.pdf", None, &config).await?;
|
||||
```
|
||||
|
||||
### Config File (TOML)
|
||||
|
||||
```toml
|
||||
output_format = "markdown"
|
||||
|
||||
[ocr]
|
||||
backend = "tesseract"
|
||||
language = "eng"
|
||||
|
||||
[chunking]
|
||||
max_chars = 1000
|
||||
max_overlap = 200
|
||||
|
||||
[pdf_options]
|
||||
passwords = ["secret123"]
|
||||
```
|
||||
|
||||
```bash
|
||||
# CLI: auto-discovers kreuzberg.toml in current/parent directories
|
||||
kreuzberg extract doc.pdf
|
||||
# or explicit:
|
||||
kreuzberg extract doc.pdf --config kreuzberg.toml
|
||||
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
|
||||
```
|
||||
|
||||
## Batch Processing
|
||||
|
||||
### Python
|
||||
|
||||
```python
|
||||
from kreuzberg import batch_extract_files, batch_extract_files_sync
|
||||
|
||||
# Async
|
||||
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
|
||||
|
||||
# Sync
|
||||
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
|
||||
|
||||
for result in results:
|
||||
print(f"{len(result.content)} chars extracted")
|
||||
```
|
||||
|
||||
### Node.js
|
||||
|
||||
```typescript
|
||||
import { batchExtractFiles } from "@kreuzberg/node";
|
||||
|
||||
const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]);
|
||||
```
|
||||
|
||||
### Rust — requires `tokio-runtime` feature
|
||||
|
||||
```rust
|
||||
use kreuzberg::{batch_extract_file, ExtractionConfig};
|
||||
|
||||
let config = ExtractionConfig::default();
|
||||
let paths = vec!["doc1.pdf", "doc2.docx"];
|
||||
let results = batch_extract_file(paths, &config).await?;
|
||||
```
|
||||
|
||||
### CLI
|
||||
|
||||
```bash
|
||||
kreuzberg batch *.pdf --format json
|
||||
kreuzberg batch docs/*.docx --output-format markdown
|
||||
```
|
||||
|
||||
## OCR
|
||||
|
||||
OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
|
||||
|
||||
### Backends
|
||||
|
||||
- **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
|
||||
- **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
|
||||
- **PaddleOCR** (Python only): Bundled since 4.8.5, no extra install needed. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
|
||||
- **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.
|
||||
|
||||
### Language Codes
|
||||
|
||||
```python
|
||||
config = ExtractionConfig(ocr=OcrConfig(language="eng")) # English
|
||||
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu")) # Multiple
|
||||
config = ExtractionConfig(ocr=OcrConfig(language="all")) # All installed
|
||||
```
|
||||
|
||||
### Force OCR
|
||||
|
||||
```python
|
||||
config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable
|
||||
```
|
||||
|
||||
## ExtractionResult Fields
|
||||
|
||||
| Field | Python | Node.js | Rust | Description |
|
||||
| ------------ | --------------------------- | -------------------------- | --------------------------- | --------------------------------------------- |
|
||||
| Text content | `result.content` | `result.content` | `result.content` | Extracted text (str/String) |
|
||||
| MIME type | `result.mime_type` | `result.mimeType` | `result.mime_type` | Input document MIME type |
|
||||
| Metadata | `result.metadata` | `result.metadata` | `result.metadata` | Document metadata (dict/object/HashMap) |
|
||||
| Tables | `result.tables` | `result.tables` | `result.tables` | Extracted tables with cells + markdown |
|
||||
| Languages | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled) |
|
||||
| Chunks | `result.chunks` | `result.chunks` | `result.chunks` | Text chunks (if chunking enabled) |
|
||||
| Images | `result.images` | `result.images` | `result.images` | Extracted images (if enabled) |
|
||||
| Elements | `result.elements` | `result.elements` | `result.elements` | Semantic elements (if element_based format) |
|
||||
| Pages | `result.pages` | `result.pages` | `result.pages` | Per-page content (if page extraction enabled) |
|
||||
| Keywords | `result.keywords` | `result.keywords` | `result.keywords` | Extracted keywords (if enabled) |
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Python
|
||||
|
||||
```python
|
||||
from kreuzberg import (
|
||||
extract_file_sync, KreuzbergError, ParsingError,
|
||||
OCRError, ValidationError, MissingDependencyError,
|
||||
)
|
||||
|
||||
try:
|
||||
result = extract_file_sync("file.pdf")
|
||||
except ParsingError as e:
|
||||
print(f"Failed to parse: {e}")
|
||||
except OCRError as e:
|
||||
print(f"OCR failed: {e}")
|
||||
except ValidationError as e:
|
||||
print(f"Invalid input: {e}")
|
||||
except MissingDependencyError as e:
|
||||
print(f"Missing dependency: {e}")
|
||||
except KreuzbergError as e:
|
||||
print(f"Extraction failed: {e}")
|
||||
```
|
||||
|
||||
### Node.js
|
||||
|
||||
```typescript
|
||||
import {
|
||||
extractFile,
|
||||
KreuzbergError,
|
||||
ParsingError,
|
||||
OcrError,
|
||||
ValidationError,
|
||||
MissingDependencyError,
|
||||
} from "@kreuzberg/node";
|
||||
|
||||
try {
|
||||
const result = await extractFile("file.pdf");
|
||||
} catch (e) {
|
||||
if (e instanceof ParsingError) {
|
||||
/* ... */
|
||||
} else if (e instanceof OcrError) {
|
||||
/* ... */
|
||||
} else if (e instanceof ValidationError) {
|
||||
/* ... */
|
||||
} else if (e instanceof KreuzbergError) {
|
||||
/* ... */
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Rust
|
||||
|
||||
```rust
|
||||
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};
|
||||
|
||||
let config = ExtractionConfig::default();
|
||||
match extract_file("file.pdf", None, &config).await {
|
||||
Ok(result) => println!("{}", result.content),
|
||||
Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
|
||||
Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
|
||||
Err(e) => eprintln!("Error: {e}"),
|
||||
}
|
||||
```
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
|
||||
2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
|
||||
3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
|
||||
4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
|
||||
5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
|
||||
6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip).
|
||||
7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
|
||||
8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).
|
||||
|
||||
## Supported Formats (Summary)
|
||||
|
||||
| Category | Extensions |
|
||||
| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| **PDF** | `.pdf` |
|
||||
| **Word** | `.docx`, `.odt` |
|
||||
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` |
|
||||
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` |
|
||||
| **eBooks** | `.epub`, `.fb2` |
|
||||
| **Images** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
|
||||
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml` |
|
||||
| **Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` |
|
||||
| **Text** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` |
|
||||
| **Email** | `.eml`, `.msg` |
|
||||
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` |
|
||||
| **Academic** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff` |
|
||||
|
||||
See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.
|
||||
|
||||
## Additional Resources
|
||||
|
||||
Detailed reference files for specific topics:
|
||||
|
||||
- **[Python API Reference](references/python-api.md)** — All functions, config classes, plugin protocols, exact signatures
|
||||
- **[Node.js API Reference](references/nodejs-api.md)** — All functions, TypeScript interfaces, worker pool APIs
|
||||
- **[Rust API Reference](references/rust-api.md)** — All functions with feature gates, structs, Cargo.toml examples
|
||||
- **[CLI Reference](references/cli-reference.md)** — All commands, flags, config precedence, exit codes
|
||||
- **[Configuration Reference](references/configuration.md)** — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
|
||||
- **[Supported Formats](references/supported-formats.md)** — All 85+ formats with file extensions and MIME types
|
||||
- **[Advanced Features](references/advanced-features.md)** — Plugins, embeddings, MCP server, API server, security limits
|
||||
- **[Other Language Bindings](references/other-bindings.md)** — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker
|
||||
|
||||
Full documentation: <https://docs.kreuzberg.dev>
|
||||
GitHub: <https://github.com/kreuzberg-dev/kreuzberg>
|
||||
Reference in New Issue
Block a user