# Quick Start

This guide walks you through Kreuzberg's core API: extracting text, handling errors,
running OCR, and working with metadata. Install your binding first if you haven't:
[Installation](installation.md).

TypeScript users: use `@kreuzberg/node` for Node.js or `@kreuzberg/wasm` for browsers and edge runtimes. See [Language Support](../index.md#language-support).

## Your First Extraction

Pass a file path to get its text content. Kreuzberg detects the format automatically:

=== "C"

    --8<-- "snippets/c/api/extract_file_sync.md"

=== "C#"

    --8<-- "snippets/csharp/extract_file_sync.md"

=== "Dart"

    --8<-- "snippets/dart/api/extract_file_sync.md"

=== "Go"

    --8<-- "snippets/go/api/extract_file_sync.md"

=== "Java"

    --8<-- "snippets/java/api/extract_file_sync.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/extract_file_sync.md"

=== "Python"

    --8<-- "snippets/python/api/extract_file_sync.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/extract_file_sync.md"

=== "R"

    --8<-- "snippets/r/api/extract_file_sync.md"

=== "Rust"

    --8<-- "snippets/rust/api/extract_file_sync.md"

=== "Swift"

    --8<-- "snippets/swift/api/extract_file_sync.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/extract_file_sync.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/extract_file_sync.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/extract_file_sync.md"

=== "Zig"

    --8<-- "snippets/zig/api/extract_file_sync.md"

=== "CLI"

    --8<-- "snippets/cli/extract_basic.md"

## Handle Errors

Wrap extractions in error handling before going further. Kreuzberg raises specific
exceptions (or returns error values, depending on the language) for missing files,
parse failures, and OCR problems:

=== "C"

    --8<-- "snippets/c/api/error_handling.md"

=== "C#"

    --8<-- "snippets/csharp/error_handling.md"

=== "Dart"

    --8<-- "snippets/dart/api/error_handling.md"

=== "Go"

    --8<-- "snippets/go/api/error_handling.md"

=== "Java"

    --8<-- "snippets/java/api/error_handling.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/error_handling.md"

=== "Python"

    --8<-- "snippets/python/utils/error_handling.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/error_handling.md"

=== "R"

    --8<-- "snippets/r/api/error_handling.md"

=== "Rust"

    --8<-- "snippets/rust/api/error_handling.md"

=== "Swift"

    --8<-- "snippets/swift/api/error_handling.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/error_handling.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/api/error_handling.md"

=== "Wasm"

    --8<-- "snippets/wasm/api/error_handling.md"

=== "Zig"

    --8<-- "snippets/zig/api/error_handling.md"

## OCR for Scanned Documents

Kreuzberg runs OCR automatically when it detects an image or scanned PDF.
You can also force OCR on any document:

=== "C"

    --8<-- "snippets/c/ocr/ocr_extraction.md"

=== "C#"

    --8<-- "snippets/csharp/ocr_extraction.md"

=== "Dart"

    --8<-- "snippets/dart/ocr/ocr_extraction.md"

=== "Go"

    --8<-- "snippets/go/ocr/ocr_extraction.md"

=== "Java"

    --8<-- "snippets/java/ocr/ocr_extraction.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/ocr/ocr_extraction.md"

=== "Python"

    --8<-- "snippets/python/ocr/ocr_extraction.md"

=== "Ruby"

    --8<-- "snippets/ruby/ocr/ocr_extraction.md"

=== "R"

    --8<-- "snippets/r/ocr/ocr_extraction.md"

=== "Rust"

    --8<-- "snippets/rust/ocr/ocr_extraction.md"

=== "Swift"

    --8<-- "snippets/swift/ocr/ocr_extraction.md"

=== "Elixir"

    --8<-- "snippets/elixir/ocr/tesseract_basic.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/ocr/ocr_extraction.md"

=== "Wasm"

    --8<-- "snippets/wasm/ocr/ocr_extraction.md"

=== "Zig"

    --8<-- "snippets/zig/ocr/ocr_extraction.md"

=== "CLI"

    --8<-- "snippets/cli/ocr_basic.md"

## Process Multiple Files

Pass a list of paths to extract them in parallel:

=== "C"

    --8<-- "snippets/c/api/batch_extract_files_sync.md"

=== "C#"

    --8<-- "snippets/csharp/batch_extract_files_sync.md"

=== "Dart"

    --8<-- "snippets/dart/api/batch_extract_files_sync.md"

=== "Go"

    --8<-- "snippets/go/api/batch_extract_files_sync.md"

=== "Java"

    --8<-- "snippets/java/api/batch_extract_files_sync.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/batch_extract_files_sync.md"

=== "Python"

    --8<-- "snippets/python/api/batch_extract_files_sync.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/batch_extract_files_sync.md"

=== "R"

    --8<-- "snippets/r/api/batch_extract_files_sync.md"

=== "Rust"

    --8<-- "snippets/rust/api/batch_extract_files_sync.md"

=== "Swift"

    --8<-- "snippets/swift/api/batch_extract_files_sync.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/batch_extract_files_sync.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/batch_extract_files_sync.md"

=== "Zig"

    --8<-- "snippets/zig/api/batch_extract_files_sync.md"

=== "CLI"

    --8<-- "snippets/cli/batch_basic.md"

## Read Document Metadata

Every extraction result includes format-specific metadata, such as page count for PDFs,
sheet names for Excel, and dimensions for images:

=== "C"

    --8<-- "snippets/c/metadata/metadata.md"

=== "C#"

    --8<-- "snippets/csharp/metadata.md"

=== "Dart"

    --8<-- "snippets/dart/metadata/metadata.md"

=== "Go"

    --8<-- "snippets/go/metadata/metadata.md"

=== "Java"

    --8<-- "snippets/java/metadata/metadata.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/metadata/metadata.md"

=== "Python"

    --8<-- "snippets/python/metadata/metadata.md"

=== "Ruby"

    --8<-- "snippets/ruby/metadata/metadata.md"

=== "R"

    --8<-- "snippets/r/metadata/metadata.md"

=== "Rust"

    --8<-- "snippets/rust/metadata/metadata.md"

=== "Swift"

    --8<-- "snippets/swift/metadata/metadata.md"

=== "Elixir"

    --8<-- "snippets/elixir/advanced/metadata_extraction.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/metadata/metadata.md"

=== "Wasm"

    --8<-- "snippets/wasm/metadata/metadata.md"

=== "Zig"

    --8<-- "snippets/zig/metadata/metadata.md"

=== "CLI"

    Extract and parse metadata using JSON output:

    ```bash title="Terminal"
    # Extract with metadata (JSON format includes metadata automatically)
    kreuzberg extract document.pdf --format json

    # Save to file and parse metadata
    kreuzberg extract document.pdf --format json > result.json

    # Print all metadata fields
    jq '.metadata' result.json

    # Extract HTML metadata
    kreuzberg extract page.html --format json | jq '.metadata'

    # Get specific fields
    kreuzberg extract document.pdf --format json | \
      jq '.metadata | {page_count, authors, title}'

    # Process multiple files
    kreuzberg batch documents/*.pdf --format json > all_metadata.json
    ```

    **JSON Output Structure:**

    ```json title="JSON"
    {
      "content": "Extracted text...",
      "mime_type": "application/pdf",
      "metadata": {
        "title": "Document Title",
        "authors": ["John Doe"],
        "created_by": "LaTeX with hyperref package",
        "format_type": "pdf",
        "page_count": 10
      },
      "tables": []
    }
    ```

Kreuzberg extracts format-specific metadata for:

- **PDF**: page count, title, authors (list), creation date, modification date
- **HTML**: SEO tags, Open Graph, Twitter Card, structured data, headers, links, images
- **Excel**: sheet count, sheet names
- **Email**: from, to, CC, BCC, message ID, attachments
- **PowerPoint**: title, author, description, fonts
- **Images**: dimensions, format, EXIF data
- **Archives**: format, file count, file list, sizes
- **XML**: element count, unique elements
- **Text/Markdown**: word count, line count, headers, links

See [Types Reference](../reference/types.md) for the complete metadata reference.

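As a minimal sketch, the JSON output structure shown for the CLI can be read from Python with only the standard library. The sample payload below is copied from that example; `summarize` is a hypothetical helper, not part of Kreuzberg, and because metadata is format-specific the sketch assumes any key may be absent and reads each one with a default.

```python
import json

# Sample payload copied from the JSON output structure in this section.
# In practice this text would come from a file such as result.json.
SAMPLE = """
{
  "content": "Extracted text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["John Doe"],
    "created_by": "LaTeX with hyperref package",
    "format_type": "pdf",
    "page_count": 10
  },
  "tables": []
}
"""


def summarize(raw: str) -> dict:
    """Pull a few common fields out of one result (hypothetical helper)."""
    result = json.loads(raw)
    metadata = result.get("metadata") or {}
    return {
        "mime_type": result.get("mime_type"),
        # Metadata keys vary by format, so every lookup is optional.
        "title": metadata.get("title"),
        "authors": metadata.get("authors") or [],
        "page_count": metadata.get("page_count"),
        "table_count": len(result.get("tables") or []),
    }


summary = summarize(SAMPLE)
print(summary["title"], summary["page_count"])
```

Reading every key with `.get` keeps the same helper usable for formats that do not report, for example, a page count.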
## Extract Tables

Tables come back as both structured cells and Markdown. Kreuzberg extracts them
from PDFs, spreadsheets, and HTML:

=== "C"

    --8<-- "snippets/c/metadata/tables.md"

=== "C#"

    --8<-- "snippets/csharp/tables.md"

=== "Dart"

    --8<-- "snippets/dart/metadata/tables.md"

=== "Go"

    --8<-- "snippets/go/metadata/tables.md"

=== "Java"

    --8<-- "snippets/java/metadata/tables.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/metadata/tables.md"

=== "Python"

    --8<-- "snippets/python/utils/tables.md"

=== "Ruby"

    --8<-- "snippets/ruby/metadata/tables.md"

=== "R"

    --8<-- "snippets/r/metadata/tables.md"

=== "Rust"

    --8<-- "snippets/rust/metadata/tables.md"

=== "Swift"

    --8<-- "snippets/swift/metadata/tables.md"

=== "Elixir"

    --8<-- "snippets/elixir/advanced/table_extraction.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/api/tables.md"

=== "Wasm"

    --8<-- "snippets/wasm/api/tables.md"

=== "Zig"

    --8<-- "snippets/zig/metadata/tables.md"

=== "CLI"

    Extract and process tables from documents:

    ```bash title="Terminal"
    # Extract with JSON format (includes tables when detected)
    kreuzberg extract document.pdf --format json

    # Save tables to JSON
    kreuzberg extract spreadsheet.xlsx --format json > tables.json

    # Print each table as Markdown
    kreuzberg extract document.pdf --format json | \
      jq '.tables[]? | .markdown'

    # Get table cells
    kreuzberg extract document.pdf --format json | \
      jq '.tables[]? | .cells'

    # Batch extract tables from multiple files
    # (in bash, recursive ** globs require `shopt -s globstar`)
    kreuzberg batch documents/**/*.pdf --format json > all_tables.json
    ```

    **JSON Table Structure:**

    ```json title="JSON"
    {
      "content": "...",
      "tables": [
        {
          "cells": [
            ["Name", "Age", "City"],
            ["Alice", "30", "New York"],
            ["Bob", "25", "Los Angeles"]
          ],
          "markdown": "| Name | Age | City |\n|------|-----|--------|\n| Alice | 30 | New York |\n| Bob | 25 | Los Angeles |"
        }
      ]
    }
    ```

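The `cells` array in the JSON table structure is a plain list of rows, so it can be turned into one record per row with a few lines of Python. This is only a sketch: `table_to_records` is a hypothetical helper, not a Kreuzberg function, and it assumes the first row holds the column names, as in the example, which will not be true of every extracted table.

```python
import json

# Sample payload based on the JSON table structure in this section
# (the "markdown" field is omitted here for brevity).
SAMPLE = """
{
  "content": "...",
  "tables": [
    {
      "cells": [
        ["Name", "Age", "City"],
        ["Alice", "30", "New York"],
        ["Bob", "25", "Los Angeles"]
      ]
    }
  ]
}
"""


def table_to_records(table: dict) -> list:
    """Map each data row onto the header row (hypothetical helper).

    Assumption: the first row of `cells` contains the column names.
    """
    cells = table.get("cells") or []
    if not cells:
        return []
    header, *rows = cells
    # zip() stops at the shorter sequence, so ragged rows are truncated.
    return [dict(zip(header, row)) for row in rows]


result = json.loads(SAMPLE)
records = [table_to_records(table) for table in result.get("tables") or []]
print(records[0])
```

Cell values stay as strings, exactly as they appear in the JSON; converting `"30"` to a number is left to the caller.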
## Going Async

Use async extraction in web servers, background workers, or anywhere you need
non-blocking I/O:

=== "C"

    --8<-- "snippets/c/api/extract_file_async.md"

=== "C#"

    --8<-- "snippets/csharp/extract_file_async.md"

=== "Dart"

    --8<-- "snippets/dart/api/extract_file_async.md"

=== "Go"

    --8<-- "snippets/go/api/extract_file_async.md"

=== "Java"

    --8<-- "snippets/java/api/extract_file_async.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/extract_file_async.md"

=== "Python"

    --8<-- "snippets/python/api/extract_file_async.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/extract_file_async.md"

=== "R"

    --8<-- "snippets/r/api/extract_file_async.md"

=== "Rust"

    --8<-- "snippets/rust/api/extract_file_async.md"

=== "Swift"

    --8<-- "snippets/swift/api/extract_file_async.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/extract_file_async.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/extract_file_async.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/extract_file_async.md"

=== "Zig"

    --8<-- "snippets/zig/api/extract_file_async.md"

=== "CLI"

    !!! note "Not Applicable"
        Async extraction is an API-level feature. The CLI operates synchronously.
        Use language-specific bindings (Python, TypeScript, Rust, WASM) for async operations.

## Next Steps

You've covered the core API. Go deeper:

- **[Configuration Guide](../guides/configuration.md)**: OCR backends, chunking, language detection, config files
- **[Extract from Bytes](../reference/api-python.md#extract_bytes_sync)**: Process in-memory data without writing to disk
- **[OCR Setup](../guides/ocr.md)**: Tesseract, PaddleOCR, EasyOCR backends
- **[Types Reference](../reference/types.md)**: Full metadata fields for every format
- **[Docker Deployment](../guides/docker.md)**: Run Kreuzberg in containers
- **[API Reference](../reference/api-python.md)**: Complete API documentation