docs/getting-started/quickstart.md
# Quick Start

This guide walks you through Kreuzberg's core API: extracting text, handling errors,
running OCR, and working with metadata. Install your binding first if you haven't:
[Installation](installation.md).

TypeScript users: use `@kreuzberg/node` for Node.js or `@kreuzberg/wasm` for browsers and edge runtimes. See [Language Support](../index.md#language-support).

## Your First Extraction

Pass a file path to get its text content. Kreuzberg detects the format automatically:

=== "C"

    --8<-- "snippets/c/api/extract_file_sync.md"

=== "C#"

    --8<-- "snippets/csharp/extract_file_sync.md"

=== "Dart"

    --8<-- "snippets/dart/api/extract_file_sync.md"

=== "Go"

    --8<-- "snippets/go/api/extract_file_sync.md"

=== "Java"

    --8<-- "snippets/java/api/extract_file_sync.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/extract_file_sync.md"

=== "Python"

    --8<-- "snippets/python/api/extract_file_sync.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/extract_file_sync.md"

=== "R"

    --8<-- "snippets/r/api/extract_file_sync.md"

=== "Rust"

    --8<-- "snippets/rust/api/extract_file_sync.md"

=== "Swift"

    --8<-- "snippets/swift/api/extract_file_sync.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/extract_file_sync.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/extract_file_sync.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/extract_file_sync.md"

=== "Zig"

    --8<-- "snippets/zig/api/extract_file_sync.md"

=== "CLI"

    --8<-- "snippets/cli/extract_basic.md"

## Handle Errors

Wrap extractions in error handling before going further. Kreuzberg reports specific
error types for missing files, parse failures, and OCR problems:

=== "C"

    --8<-- "snippets/c/api/error_handling.md"

=== "C#"

    --8<-- "snippets/csharp/error_handling.md"

=== "Dart"

    --8<-- "snippets/dart/api/error_handling.md"

=== "Go"

    --8<-- "snippets/go/api/error_handling.md"

=== "Java"

    --8<-- "snippets/java/api/error_handling.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/error_handling.md"

=== "Python"

    --8<-- "snippets/python/utils/error_handling.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/error_handling.md"

=== "R"

    --8<-- "snippets/r/api/error_handling.md"

=== "Rust"

    --8<-- "snippets/rust/api/error_handling.md"

=== "Swift"

    --8<-- "snippets/swift/api/error_handling.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/error_handling.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/api/error_handling.md"

=== "Wasm"

    --8<-- "snippets/wasm/api/error_handling.md"

=== "Zig"

    --8<-- "snippets/zig/api/error_handling.md"

## OCR for Scanned Documents

Kreuzberg runs OCR automatically when it detects an image or scanned PDF.
You can also force OCR on any document:

=== "C"

    --8<-- "snippets/c/ocr/ocr_extraction.md"

=== "C#"

    --8<-- "snippets/csharp/ocr_extraction.md"

=== "Dart"

    --8<-- "snippets/dart/ocr/ocr_extraction.md"

=== "Go"

    --8<-- "snippets/go/ocr/ocr_extraction.md"

=== "Java"

    --8<-- "snippets/java/ocr/ocr_extraction.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/ocr/ocr_extraction.md"

=== "Python"

    --8<-- "snippets/python/ocr/ocr_extraction.md"

=== "Ruby"

    --8<-- "snippets/ruby/ocr/ocr_extraction.md"

=== "R"

    --8<-- "snippets/r/ocr/ocr_extraction.md"

=== "Rust"

    --8<-- "snippets/rust/ocr/ocr_extraction.md"

=== "Swift"

    --8<-- "snippets/swift/ocr/ocr_extraction.md"

=== "Elixir"

    --8<-- "snippets/elixir/ocr/tesseract_basic.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/ocr/ocr_extraction.md"

=== "Wasm"

    --8<-- "snippets/wasm/ocr/ocr_extraction.md"

=== "Zig"

    --8<-- "snippets/zig/ocr/ocr_extraction.md"

=== "CLI"

    --8<-- "snippets/cli/ocr_basic.md"

## Process Multiple Files

Pass a list of paths to extract them in parallel:

=== "C"

    --8<-- "snippets/c/api/batch_extract_files_sync.md"

=== "C#"

    --8<-- "snippets/csharp/batch_extract_files_sync.md"

=== "Dart"

    --8<-- "snippets/dart/api/batch_extract_files_sync.md"

=== "Go"

    --8<-- "snippets/go/api/batch_extract_files_sync.md"

=== "Java"

    --8<-- "snippets/java/api/batch_extract_files_sync.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/batch_extract_files_sync.md"

=== "Python"

    --8<-- "snippets/python/api/batch_extract_files_sync.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/batch_extract_files_sync.md"

=== "R"

    --8<-- "snippets/r/api/batch_extract_files_sync.md"

=== "Rust"

    --8<-- "snippets/rust/api/batch_extract_files_sync.md"

=== "Swift"

    --8<-- "snippets/swift/api/batch_extract_files_sync.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/batch_extract_files_sync.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/batch_extract_files_sync.md"

=== "Zig"

    --8<-- "snippets/zig/api/batch_extract_files_sync.md"

=== "CLI"

    --8<-- "snippets/cli/batch_basic.md"

## Read Document Metadata

Every extraction result includes format-specific metadata, such as page count for PDFs,
sheet names for Excel, and dimensions for images:

=== "C"

    --8<-- "snippets/c/metadata/metadata.md"

=== "C#"

    --8<-- "snippets/csharp/metadata.md"

=== "Dart"

    --8<-- "snippets/dart/metadata/metadata.md"

=== "Go"

    --8<-- "snippets/go/metadata/metadata.md"

=== "Java"

    --8<-- "snippets/java/metadata/metadata.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/metadata/metadata.md"

=== "Python"

    --8<-- "snippets/python/metadata/metadata.md"

=== "Ruby"

    --8<-- "snippets/ruby/metadata/metadata.md"

=== "R"

    --8<-- "snippets/r/metadata/metadata.md"

=== "Rust"

    --8<-- "snippets/rust/metadata/metadata.md"

=== "Swift"

    --8<-- "snippets/swift/metadata/metadata.md"

=== "Elixir"

    --8<-- "snippets/elixir/advanced/metadata_extraction.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/metadata/metadata.md"

=== "Wasm"

    --8<-- "snippets/wasm/metadata/metadata.md"

=== "Zig"

    --8<-- "snippets/zig/metadata/metadata.md"

=== "CLI"

    Extract and parse metadata using JSON output:

    ```bash title="Terminal"
    # Extract with metadata (JSON format includes metadata automatically)
    kreuzberg extract document.pdf --format json

    # Save the JSON output to a file
    kreuzberg extract document.pdf --format json > result.json

    # Print all metadata fields
    jq '.metadata' result.json

    # Extract HTML metadata
    kreuzberg extract page.html --format json | jq '.metadata'

    # Get specific fields
    kreuzberg extract document.pdf --format json | \
      jq '.metadata | {page_count, authors, title}'

    # Process multiple files
    kreuzberg batch documents/*.pdf --format json > all_metadata.json
    ```

    **JSON Output Structure:**

    ```json title="JSON"
    {
      "content": "Extracted text...",
      "mime_type": "application/pdf",
      "metadata": {
        "title": "Document Title",
        "authors": ["John Doe"],
        "created_by": "LaTeX with hyperref package",
        "format_type": "pdf",
        "page_count": 10
      },
      "tables": []
    }
    ```
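
If you would rather post-process the JSON output in a script than with `jq`, the field selection above can be sketched in Python using only the standard library. The sample document is copied from the structure shown above; the helper name `select_metadata` is hypothetical and not part of Kreuzberg.

```python
import json

# Sample output copied from the JSON structure above. In practice this
# would be read from a file such as result.json.
raw = """
{
  "content": "Extracted text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["John Doe"],
    "created_by": "LaTeX with hyperref package",
    "format_type": "pdf",
    "page_count": 10
  },
  "tables": []
}
"""


def select_metadata(result: dict, fields: list[str]) -> dict:
    """Hypothetical helper: pick metadata fields, None when a field is absent."""
    metadata = result.get("metadata", {})
    return {name: metadata.get(name) for name in fields}


result = json.loads(raw)
selected = select_metadata(result, ["page_count", "authors", "title"])
print(selected)
```

Like the `jq` filter, this yields an explicit empty value for a field that the format does not provide, so downstream code can treat every document the same way.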

Kreuzberg extracts format-specific metadata for:

- **PDF**: page count, title, authors (list), creation date, modification date
- **HTML**: SEO tags, Open Graph, Twitter Card, structured data, headers, links, images
- **Excel**: sheet count, sheet names
- **Email**: from, to, CC, BCC, message ID, attachments
- **PowerPoint**: title, author, description, fonts
- **Images**: dimensions, format, EXIF data
- **Archives**: format, file count, file list, sizes
- **XML**: element count, unique elements
- **Text/Markdown**: word count, line count, headers, links

See [Types Reference](../reference/types.md) for the complete metadata reference.

## Extract Tables

Tables come back as both structured cells and Markdown. Kreuzberg extracts them
from PDFs, spreadsheets, and HTML:

=== "C"

    --8<-- "snippets/c/metadata/tables.md"

=== "C#"

    --8<-- "snippets/csharp/tables.md"

=== "Dart"

    --8<-- "snippets/dart/metadata/tables.md"

=== "Go"

    --8<-- "snippets/go/metadata/tables.md"

=== "Java"

    --8<-- "snippets/java/metadata/tables.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/metadata/tables.md"

=== "Python"

    --8<-- "snippets/python/utils/tables.md"

=== "Ruby"

    --8<-- "snippets/ruby/metadata/tables.md"

=== "R"

    --8<-- "snippets/r/metadata/tables.md"

=== "Rust"

    --8<-- "snippets/rust/metadata/tables.md"

=== "Swift"

    --8<-- "snippets/swift/metadata/tables.md"

=== "Elixir"

    --8<-- "snippets/elixir/advanced/table_extraction.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/api/tables.md"

=== "Wasm"

    --8<-- "snippets/wasm/api/tables.md"

=== "Zig"

    --8<-- "snippets/zig/metadata/tables.md"

=== "CLI"

    Extract and process tables from documents:

    ```bash title="Terminal"
    # Extract with JSON format (includes tables when detected)
    kreuzberg extract document.pdf --format json

    # Save the full result, including tables, to a file
    kreuzberg extract spreadsheet.xlsx --format json > tables.json

    # Print each table as Markdown
    kreuzberg extract document.pdf --format json | \
      jq '.tables[]? | .markdown'

    # Get table cells
    kreuzberg extract document.pdf --format json | \
      jq '.tables[]? | .cells'

    # Batch extract tables from multiple files
    # (in bash, recursive ** needs: shopt -s globstar)
    kreuzberg batch documents/**/*.pdf --format json > all_tables.json
    ```

    **JSON Table Structure:**

    ```json title="JSON"
    {
      "content": "...",
      "tables": [
        {
          "cells": [
            ["Name", "Age", "City"],
            ["Alice", "30", "New York"],
            ["Bob", "25", "Los Angeles"]
          ],
          "markdown": "| Name | Age | City |\n|------|-----|--------|\n| Alice | 30 | New York |\n| Bob | 25 | Los Angeles |"
        }
      ]
    }
    ```
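
The `cells` array is convenient to reshape in code. The sketch below, in Python with only the standard library, turns each table into a list of row dictionaries. It assumes the first row of `cells` is the header, which holds for the sample above but is not guaranteed for every document. The helper name `table_to_records` is hypothetical and not part of Kreuzberg.

```python
import json

# Sample output following the JSON table structure above
# (the "markdown" value is shortened here).
raw = """
{
  "content": "...",
  "tables": [
    {
      "cells": [
        ["Name", "Age", "City"],
        ["Alice", "30", "New York"],
        ["Bob", "25", "Los Angeles"]
      ],
      "markdown": "..."
    }
  ]
}
"""


def table_to_records(cells: list[list[str]]) -> list[dict[str, str]]:
    """Hypothetical helper: treat row 0 as the header, map later rows onto it."""
    if not cells:
        return []
    header, *rows = cells
    return [dict(zip(header, row)) for row in rows]


result = json.loads(raw)
# One list of records per extracted table.
records = [table_to_records(t["cells"]) for t in result.get("tables", [])]
print(records[0])
```

Cell values stay strings, as in the JSON output; convert them to numbers or dates only where you know the column type.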

## Going Async

Use async extraction in web servers, background workers, or anywhere you need
non-blocking I/O:

=== "C"

    --8<-- "snippets/c/api/extract_file_async.md"

=== "C#"

    --8<-- "snippets/csharp/extract_file_async.md"

=== "Dart"

    --8<-- "snippets/dart/api/extract_file_async.md"

=== "Go"

    --8<-- "snippets/go/api/extract_file_async.md"

=== "Java"

    --8<-- "snippets/java/api/extract_file_async.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/extract_file_async.md"

=== "Python"

    --8<-- "snippets/python/api/extract_file_async.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/extract_file_async.md"

=== "R"

    --8<-- "snippets/r/api/extract_file_async.md"

=== "Rust"

    --8<-- "snippets/rust/api/extract_file_async.md"

=== "Swift"

    --8<-- "snippets/swift/api/extract_file_async.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/extract_file_async.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/extract_file_async.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/extract_file_async.md"

=== "Zig"

    --8<-- "snippets/zig/api/extract_file_async.md"

=== "CLI"

    !!! note "Not Applicable"
        Async extraction is an API-level feature. The CLI operates synchronously.
        Use a language binding (for example Python, TypeScript, Rust, or Wasm) for async operations.
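
To illustrate why non-blocking extraction matters in a server or worker, here is a minimal Python sketch of the general pattern: start several extractions and await them together. The `extract_text` coroutine is a hypothetical stand-in that only simulates I/O; it is not a Kreuzberg function, and you would swap in your binding's async call from the tabs above.

```python
import asyncio


async def extract_text(path: str) -> str:
    """Hypothetical stand-in for a binding's async extraction call."""
    await asyncio.sleep(0.01)  # simulates non-blocking I/O
    return f"text from {path}"


async def extract_all(paths: list[str]) -> dict[str, str]:
    """Run one extraction per path concurrently and map path -> text."""
    # gather returns results in the same order as the awaitables passed in.
    texts = await asyncio.gather(*(extract_text(p) for p in paths))
    return dict(zip(paths, texts))


results = asyncio.run(extract_all(["a.pdf", "b.docx"]))
print(results)
```

While one extraction waits on I/O, the event loop is free to serve other requests or progress the other extractions, which is the reason to prefer the async API in those settings.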

## Next Steps

You've covered the core API. Go deeper:

- **[Configuration Guide](../guides/configuration.md)**: OCR backends, chunking, language detection, config files
- **[Extract from Bytes](../reference/api-python.md#extract_bytes_sync)**: Process in-memory data without writing to disk
- **[OCR Setup](../guides/ocr.md)**: Tesseract, PaddleOCR, EasyOCR backends
- **[Types Reference](../reference/types.md)**: Full metadata fields for every format
- **[Docker Deployment](../guides/docker.md)**: Run Kreuzberg in containers
- **[API Reference](../reference/api-python.md)**: Complete API documentation