Quick Start
This guide walks you through Kreuzberg's core API: extracting text, handling errors, running OCR, and working with metadata. If you haven't installed a binding yet, start with Installation.
TypeScript users: use @kreuzberg/node for Node.js and @kreuzberg/wasm for browsers and edge runtimes. See Language Support for details.
Your First Extraction
Pass a file path to get its text content. Kreuzberg detects the format automatically:
=== "C"
--8<-- "snippets/c/api/extract_file_sync.md"
=== "C#"
--8<-- "snippets/csharp/extract_file_sync.md"
=== "Dart"
--8<-- "snippets/dart/api/extract_file_sync.md"
=== "Go"
--8<-- "snippets/go/api/extract_file_sync.md"
=== "Java"
--8<-- "snippets/java/api/extract_file_sync.md"
=== "Kotlin"
--8<-- "snippets/kotlin/api/extract_file_sync.md"
=== "Python"
--8<-- "snippets/python/api/extract_file_sync.md"
=== "Ruby"
--8<-- "snippets/ruby/api/extract_file_sync.md"
=== "R"
--8<-- "snippets/r/api/extract_file_sync.md"
=== "Rust"
--8<-- "snippets/rust/api/extract_file_sync.md"
=== "Swift"
--8<-- "snippets/swift/api/extract_file_sync.md"
=== "Elixir"
--8<-- "snippets/elixir/core/extract_file_sync.exs"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/extract_file_sync.md"
=== "Wasm"
--8<-- "snippets/wasm/getting-started/extract_file_sync.md"
=== "Zig"
--8<-- "snippets/zig/api/extract_file_sync.md"
=== "CLI"
--8<-- "snippets/cli/extract_basic.md"
Handle Errors
Wrap extractions in error handling before going further. Kreuzberg raises specific exceptions for missing files, parse failures, and OCR problems:
=== "C"
--8<-- "snippets/c/api/error_handling.md"
=== "C#"
--8<-- "snippets/csharp/error_handling.md"
=== "Dart"
--8<-- "snippets/dart/api/error_handling.md"
=== "Go"
--8<-- "snippets/go/api/error_handling.md"
=== "Java"
--8<-- "snippets/java/api/error_handling.md"
=== "Kotlin"
--8<-- "snippets/kotlin/api/error_handling.md"
=== "Python"
--8<-- "snippets/python/utils/error_handling.md"
=== "Ruby"
--8<-- "snippets/ruby/api/error_handling.md"
=== "R"
--8<-- "snippets/r/api/error_handling.md"
=== "Rust"
--8<-- "snippets/rust/api/error_handling.md"
=== "Swift"
--8<-- "snippets/swift/api/error_handling.md"
=== "Elixir"
--8<-- "snippets/elixir/core/error_handling.exs"
=== "TypeScript"
--8<-- "snippets/typescript/api/error_handling.md"
=== "Wasm"
--8<-- "snippets/wasm/api/error_handling.md"
=== "Zig"
--8<-- "snippets/zig/api/error_handling.md"
OCR for Scanned Documents
Kreuzberg runs OCR automatically when it detects an image or scanned PDF. You can also force OCR on any document:
=== "C"
--8<-- "snippets/c/ocr/ocr_extraction.md"
=== "C#"
--8<-- "snippets/csharp/ocr_extraction.md"
=== "Dart"
--8<-- "snippets/dart/ocr/ocr_extraction.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_extraction.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_extraction.md"
=== "Kotlin"
--8<-- "snippets/kotlin/ocr/ocr_extraction.md"
=== "Python"
--8<-- "snippets/python/ocr/ocr_extraction.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_extraction.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_extraction.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_extraction.md"
=== "Swift"
--8<-- "snippets/swift/ocr/ocr_extraction.md"
=== "Elixir"
--8<-- "snippets/elixir/ocr/tesseract_basic.exs"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_extraction.md"
=== "Wasm"
--8<-- "snippets/wasm/ocr/ocr_extraction.md"
=== "Zig"
--8<-- "snippets/zig/ocr/ocr_extraction.md"
=== "CLI"
--8<-- "snippets/cli/ocr_basic.md"
Process Multiple Files
Pass a list of paths to extract them in parallel:
=== "C"
--8<-- "snippets/c/api/batch_extract_files_sync.md"
=== "C#"
--8<-- "snippets/csharp/batch_extract_files_sync.md"
=== "Dart"
--8<-- "snippets/dart/api/batch_extract_files_sync.md"
=== "Go"
--8<-- "snippets/go/api/batch_extract_files_sync.md"
=== "Java"
--8<-- "snippets/java/api/batch_extract_files_sync.md"
=== "Kotlin"
--8<-- "snippets/kotlin/api/batch_extract_files_sync.md"
=== "Python"
--8<-- "snippets/python/api/batch_extract_files_sync.md"
=== "Ruby"
--8<-- "snippets/ruby/api/batch_extract_files_sync.md"
=== "R"
--8<-- "snippets/r/api/batch_extract_files_sync.md"
=== "Rust"
--8<-- "snippets/rust/api/batch_extract_files_sync.md"
=== "Swift"
--8<-- "snippets/swift/api/batch_extract_files_sync.md"
=== "Elixir"
--8<-- "snippets/elixir/core/batch_extract_files_sync.exs"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"
=== "Wasm"
--8<-- "snippets/wasm/getting-started/batch_extract_files_sync.md"
=== "Zig"
--8<-- "snippets/zig/api/batch_extract_files_sync.md"
=== "CLI"
--8<-- "snippets/cli/batch_basic.md"
Read Document Metadata
Every extraction result includes format-specific metadata — page count for PDFs, sheet names for Excel, dimensions for images:
=== "C"
--8<-- "snippets/c/metadata/metadata.md"
=== "C#"
--8<-- "snippets/csharp/metadata.md"
=== "Dart"
--8<-- "snippets/dart/metadata/metadata.md"
=== "Go"
--8<-- "snippets/go/metadata/metadata.md"
=== "Java"
--8<-- "snippets/java/metadata/metadata.md"
=== "Kotlin"
--8<-- "snippets/kotlin/metadata/metadata.md"
=== "Python"
--8<-- "snippets/python/metadata/metadata.md"
=== "Ruby"
--8<-- "snippets/ruby/metadata/metadata.md"
=== "R"
--8<-- "snippets/r/metadata/metadata.md"
=== "Rust"
--8<-- "snippets/rust/metadata/metadata.md"
=== "Swift"
--8<-- "snippets/swift/metadata/metadata.md"
=== "Elixir"
--8<-- "snippets/elixir/advanced/metadata_extraction.exs"
=== "TypeScript"
--8<-- "snippets/typescript/metadata/metadata.md"
=== "Wasm"
--8<-- "snippets/wasm/metadata/metadata.md"
=== "Zig"
--8<-- "snippets/zig/metadata/metadata.md"
=== "CLI"
Extract and parse metadata using JSON output:
```bash title="Terminal"
# Extract with metadata (JSON format includes metadata automatically)
kreuzberg extract document.pdf --format json
# Save to file and parse metadata
kreuzberg extract document.pdf --format json > result.json
# Print all metadata fields
cat result.json | jq '.metadata'
# Extract HTML metadata
kreuzberg extract page.html --format json | jq '.metadata'
# Get specific fields
kreuzberg extract document.pdf --format json | \
jq '.metadata | {page_count, authors, title}'
# Process multiple files
kreuzberg batch documents/*.pdf --format json > all_metadata.json
```
**JSON Output Structure:**
```json title="JSON"
{
"content": "Extracted text...",
"mime_type": "application/pdf",
"metadata": {
"title": "Document Title",
"authors": ["John Doe"],
"created_by": "LaTeX with hyperref package",
"format_type": "pdf",
"page_count": 10
},
"tables": []
}
```
Kreuzberg extracts format-specific metadata for:
- PDF: page count, title, authors (list), creation date, modification date
- HTML: SEO tags, Open Graph, Twitter Card, structured data, headers, links, images
- Excel: sheet count, sheet names
- Email: from, to, CC, BCC, message ID, attachments
- PowerPoint: title, author, description, fonts
- Images: dimensions, format, EXIF data
- Archives: format, file count, file list, sizes
- XML: element count, unique elements
- Text/Markdown: word count, line count, headers, links
See the Types Reference for the complete list of metadata fields.
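If you post-process the CLI's JSON output in a script rather than with `jq`, the structure shown above can be read with a standard JSON parser. The following is a minimal sketch using only Python's standard library. The helper name `summarize_metadata` is hypothetical, and the sample string simply mirrors the JSON Output Structure example; in practice you would load the contents of `result.json` instead. Because metadata is format-specific, the sketch assumes any given key may be absent and falls back to a default.

```python
import json

# Sample mirroring the "JSON Output Structure" example above.
# In practice this would be the contents of result.json.
SAMPLE = """
{
  "content": "Extracted text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["John Doe"],
    "created_by": "LaTeX with hyperref package",
    "format_type": "pdf",
    "page_count": 10
  },
  "tables": []
}
"""


def summarize_metadata(raw: str) -> dict:
    """Pick a few metadata fields (hypothetical helper, not part of Kreuzberg).

    Metadata is format-specific, so every lookup tolerates a missing key.
    """
    result = json.loads(raw)
    metadata = result.get("metadata", {})
    return {
        "mime_type": result.get("mime_type"),
        "title": metadata.get("title"),
        "authors": metadata.get("authors", []),
        "page_count": metadata.get("page_count"),
    }


summary = summarize_metadata(SAMPLE)
print(summary)
```

Using `.get()` with defaults keeps the same script usable across formats, since a field such as `page_count` may not be present for, say, an HTML page.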
Extract Tables
Tables come back as both structured cells and Markdown. Kreuzberg extracts them from PDFs, spreadsheets, and HTML:
=== "C"
--8<-- "snippets/c/metadata/tables.md"
=== "C#"
--8<-- "snippets/csharp/tables.md"
=== "Dart"
--8<-- "snippets/dart/metadata/tables.md"
=== "Go"
--8<-- "snippets/go/metadata/tables.md"
=== "Java"
--8<-- "snippets/java/metadata/tables.md"
=== "Kotlin"
--8<-- "snippets/kotlin/metadata/tables.md"
=== "Python"
--8<-- "snippets/python/utils/tables.md"
=== "Ruby"
--8<-- "snippets/ruby/metadata/tables.md"
=== "R"
--8<-- "snippets/r/metadata/tables.md"
=== "Rust"
--8<-- "snippets/rust/metadata/tables.md"
=== "Swift"
--8<-- "snippets/swift/metadata/tables.md"
=== "Elixir"
--8<-- "snippets/elixir/advanced/table_extraction.exs"
=== "TypeScript"
--8<-- "snippets/typescript/api/tables.md"
=== "Wasm"
--8<-- "snippets/wasm/api/tables.md"
=== "Zig"
--8<-- "snippets/zig/metadata/tables.md"
=== "CLI"
Extract and process tables from documents:
```bash title="Terminal"
# Extract with JSON format (includes tables when detected)
kreuzberg extract document.pdf --format json
# Save tables to JSON
kreuzberg extract spreadsheet.xlsx --format json > tables.json
# Extract and parse table markdown
kreuzberg extract document.pdf --format json | \
jq '.tables[]? | .markdown'
# Get table cells
kreuzberg extract document.pdf --format json | \
jq '.tables[]? | .cells'
# Batch extract tables from multiple files
# (in bash, recursive ** matching needs `shopt -s globstar`; zsh enables it by default)
kreuzberg batch documents/**/*.pdf --format json > all_tables.json
```
**JSON Table Structure:**
```json title="JSON"
{
"content": "...",
"tables": [
{
"cells": [
["Name", "Age", "City"],
["Alice", "30", "New York"],
["Bob", "25", "Los Angeles"]
],
"markdown": "| Name | Age | City |\n|------|-----|--------|\n| Alice | 30 | New York |\n| Bob | 25 | Los Angeles |"
}
]
}
```
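The `cells` array in the structure above is a list of rows, which makes it straightforward to turn into records in a script. The sketch below uses only Python's standard library. The helper `table_to_records` is hypothetical, and it assumes the first row of each table is a header row, as in the example; that will not hold for every document. The sample omits the `markdown` field for brevity.

```python
import json

# Sample mirroring the "JSON Table Structure" example above
# (the "markdown" field is omitted for brevity).
SAMPLE = """
{
  "content": "...",
  "tables": [
    {
      "cells": [
        ["Name", "Age", "City"],
        ["Alice", "30", "New York"],
        ["Bob", "25", "Los Angeles"]
      ]
    }
  ]
}
"""


def table_to_records(cells):
    """Map each data row to a dict keyed by the first row (hypothetical helper).

    Assumes the first row is a header; returns an empty list for an empty table.
    """
    if not cells:
        return []
    header, *rows = cells
    return [dict(zip(header, row)) for row in rows]


result = json.loads(SAMPLE)
records = [table_to_records(t.get("cells", [])) for t in result.get("tables", [])]
print(records[0])
```

Cell values in the example are strings (`"30"`, not `30`), so convert them explicitly if you need numbers.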
Going Async
Use async extraction in web servers, background workers, or anywhere you need non-blocking I/O:
=== "C"
--8<-- "snippets/c/api/extract_file_async.md"
=== "C#"
--8<-- "snippets/csharp/extract_file_async.md"
=== "Dart"
--8<-- "snippets/dart/api/extract_file_async.md"
=== "Go"
--8<-- "snippets/go/api/extract_file_async.md"
=== "Java"
--8<-- "snippets/java/api/extract_file_async.md"
=== "Kotlin"
--8<-- "snippets/kotlin/api/extract_file_async.md"
=== "Python"
--8<-- "snippets/python/api/extract_file_async.md"
=== "Ruby"
--8<-- "snippets/ruby/api/extract_file_async.md"
=== "R"
--8<-- "snippets/r/api/extract_file_async.md"
=== "Rust"
--8<-- "snippets/rust/api/extract_file_async.md"
=== "Swift"
--8<-- "snippets/swift/api/extract_file_async.md"
=== "Elixir"
--8<-- "snippets/elixir/core/extract_file_async.exs"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/extract_file_async.md"
=== "Wasm"
--8<-- "snippets/wasm/getting-started/extract_file_async.md"
=== "Zig"
--8<-- "snippets/zig/api/extract_file_async.md"
=== "CLI"
!!! note "Not Applicable"
Async extraction is an API-level feature. The CLI operates synchronously.
Use a language binding (for example Python, TypeScript, Rust, or Wasm) for async operations.
Next Steps
You've covered the core API. Go deeper:
- Configuration Guide — OCR backends, chunking, language detection, config files
- Extract from Bytes — Process in-memory data without writing to disk
- OCR Setup — Tesseract, PaddleOCR, EasyOCR backends
- Types Reference — Full metadata fields for every format
- Docker Deployment — Run Kreuzberg in containers
- API Reference — Complete API documentation