fil/docs/getting-started/quickstart.md
Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00


Quick Start

This guide walks you through Kreuzberg's core API: extracting text, handling errors, running OCR, processing batches, reading metadata and tables, and going async. If you haven't installed a binding yet, see Installation first.

TypeScript users: use @kreuzberg/node for Node.js and @kreuzberg/wasm for browsers and edge runtimes. See Language Support.

Your First Extraction

Pass a file path to get its text content. Kreuzberg detects the format automatically:

=== "C"

--8<-- "snippets/c/api/extract_file_sync.md"

=== "C#"

--8<-- "snippets/csharp/extract_file_sync.md"

=== "Dart"

--8<-- "snippets/dart/api/extract_file_sync.md"

=== "Go"

--8<-- "snippets/go/api/extract_file_sync.md"

=== "Java"

--8<-- "snippets/java/api/extract_file_sync.md"

=== "Kotlin"

--8<-- "snippets/kotlin/api/extract_file_sync.md"

=== "Python"

--8<-- "snippets/python/api/extract_file_sync.md"

=== "Ruby"

--8<-- "snippets/ruby/api/extract_file_sync.md"

=== "R"

--8<-- "snippets/r/api/extract_file_sync.md"

=== "Rust"

--8<-- "snippets/rust/api/extract_file_sync.md"

=== "Swift"

--8<-- "snippets/swift/api/extract_file_sync.md"

=== "Elixir"

--8<-- "snippets/elixir/core/extract_file_sync.exs"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/extract_file_sync.md"

=== "Wasm"

--8<-- "snippets/wasm/getting-started/extract_file_sync.md"

=== "Zig"

--8<-- "snippets/zig/api/extract_file_sync.md"

=== "CLI"

--8<-- "snippets/cli/extract_basic.md"

Handle Errors

Wrap extractions in error handling before going further. Kreuzberg raises specific exceptions for missing files, parse failures, and OCR problems:

=== "C"

--8<-- "snippets/c/api/error_handling.md"

=== "C#"

--8<-- "snippets/csharp/error_handling.md"

=== "Dart"

--8<-- "snippets/dart/api/error_handling.md"

=== "Go"

--8<-- "snippets/go/api/error_handling.md"

=== "Java"

--8<-- "snippets/java/api/error_handling.md"

=== "Kotlin"

--8<-- "snippets/kotlin/api/error_handling.md"

=== "Python"

--8<-- "snippets/python/utils/error_handling.md"

=== "Ruby"

--8<-- "snippets/ruby/api/error_handling.md"

=== "R"

--8<-- "snippets/r/api/error_handling.md"

=== "Rust"

--8<-- "snippets/rust/api/error_handling.md"

=== "Swift"

--8<-- "snippets/swift/api/error_handling.md"

=== "Elixir"

--8<-- "snippets/elixir/core/error_handling.exs"

=== "TypeScript"

--8<-- "snippets/typescript/api/error_handling.md"

=== "Wasm"

--8<-- "snippets/wasm/api/error_handling.md"

=== "Zig"

--8<-- "snippets/zig/api/error_handling.md"

OCR for Scanned Documents

Kreuzberg runs OCR automatically when it detects an image or scanned PDF. You can also force OCR on any document:

=== "C"

--8<-- "snippets/c/ocr/ocr_extraction.md"

=== "C#"

--8<-- "snippets/csharp/ocr_extraction.md"

=== "Dart"

--8<-- "snippets/dart/ocr/ocr_extraction.md"

=== "Go"

--8<-- "snippets/go/ocr/ocr_extraction.md"

=== "Java"

--8<-- "snippets/java/ocr/ocr_extraction.md"

=== "Kotlin"

--8<-- "snippets/kotlin/ocr/ocr_extraction.md"

=== "Python"

--8<-- "snippets/python/ocr/ocr_extraction.md"

=== "Ruby"

--8<-- "snippets/ruby/ocr/ocr_extraction.md"

=== "R"

--8<-- "snippets/r/ocr/ocr_extraction.md"

=== "Rust"

--8<-- "snippets/rust/ocr/ocr_extraction.md"

=== "Swift"

--8<-- "snippets/swift/ocr/ocr_extraction.md"

=== "Elixir"

--8<-- "snippets/elixir/ocr/tesseract_basic.exs"

=== "TypeScript"

--8<-- "snippets/typescript/ocr/ocr_extraction.md"

=== "Wasm"

--8<-- "snippets/wasm/ocr/ocr_extraction.md"

=== "Zig"

--8<-- "snippets/zig/ocr/ocr_extraction.md"

=== "CLI"

--8<-- "snippets/cli/ocr_basic.md"

Process Multiple Files

Pass a list of paths to extract them in parallel:

=== "C"

--8<-- "snippets/c/api/batch_extract_files_sync.md"

=== "C#"

--8<-- "snippets/csharp/batch_extract_files_sync.md"

=== "Dart"

--8<-- "snippets/dart/api/batch_extract_files_sync.md"

=== "Go"

--8<-- "snippets/go/api/batch_extract_files_sync.md"

=== "Java"

--8<-- "snippets/java/api/batch_extract_files_sync.md"

=== "Kotlin"

--8<-- "snippets/kotlin/api/batch_extract_files_sync.md"

=== "Python"

--8<-- "snippets/python/api/batch_extract_files_sync.md"

=== "Ruby"

--8<-- "snippets/ruby/api/batch_extract_files_sync.md"

=== "R"

--8<-- "snippets/r/api/batch_extract_files_sync.md"

=== "Rust"

--8<-- "snippets/rust/api/batch_extract_files_sync.md"

=== "Swift"

--8<-- "snippets/swift/api/batch_extract_files_sync.md"

=== "Elixir"

--8<-- "snippets/elixir/core/batch_extract_files_sync.exs"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"

=== "Wasm"

--8<-- "snippets/wasm/getting-started/batch_extract_files_sync.md"

=== "Zig"

--8<-- "snippets/zig/api/batch_extract_files_sync.md"

=== "CLI"

--8<-- "snippets/cli/batch_basic.md"

Read Document Metadata

Every extraction result includes format-specific metadata — page count for PDFs, sheet names for Excel, dimensions for images:

=== "C"

--8<-- "snippets/c/metadata/metadata.md"

=== "C#"

--8<-- "snippets/csharp/metadata.md"

=== "Dart"

--8<-- "snippets/dart/metadata/metadata.md"

=== "Go"

--8<-- "snippets/go/metadata/metadata.md"

=== "Java"

--8<-- "snippets/java/metadata/metadata.md"

=== "Kotlin"

--8<-- "snippets/kotlin/metadata/metadata.md"

=== "Python"

--8<-- "snippets/python/metadata/metadata.md"

=== "Ruby"

--8<-- "snippets/ruby/metadata/metadata.md"

=== "R"

--8<-- "snippets/r/metadata/metadata.md"

=== "Rust"

--8<-- "snippets/rust/metadata/metadata.md"

=== "Swift"

--8<-- "snippets/swift/metadata/metadata.md"

=== "Elixir"

--8<-- "snippets/elixir/advanced/metadata_extraction.exs"

=== "TypeScript"

--8<-- "snippets/typescript/metadata/metadata.md"

=== "Wasm"

--8<-- "snippets/wasm/metadata/metadata.md"

=== "Zig"

--8<-- "snippets/zig/metadata/metadata.md"

=== "CLI"

Extract and parse metadata using JSON output:

```bash title="Terminal"
# Extract with metadata (JSON format includes metadata automatically)
kreuzberg extract document.pdf --format json

# Save to file and parse metadata
kreuzberg extract document.pdf --format json > result.json

# Print all metadata fields
jq '.metadata' result.json

# Extract HTML metadata
kreuzberg extract page.html --format json | jq '.metadata'

# Get specific fields
kreuzberg extract document.pdf --format json | \
  jq '.metadata | {page_count, authors, title}'

# Process multiple files
kreuzberg batch documents/*.pdf --format json > all_metadata.json
```

**JSON Output Structure:**

```json title="JSON"
{
  "content": "Extracted text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["John Doe"],
    "created_by": "LaTeX with hyperref package",
    "format_type": "pdf",
    "page_count": 10
  },
  "tables": []
}
```
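If you post-process the CLI's JSON in a script rather than with `jq`, the structure above maps directly onto plain dictionaries. The following is a minimal sketch, assuming only the fields shown in the sample output; `summarize` is an illustrative helper, not part of Kreuzberg, and it treats every field as optional because the metadata varies by format.

```python
import json

# Sample mirrors the "JSON Output Structure" shown above.
SAMPLE = """
{
  "content": "Extracted text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["John Doe"],
    "created_by": "LaTeX with hyperref package",
    "format_type": "pdf",
    "page_count": 10
  },
  "tables": []
}
"""


def summarize(result: dict) -> dict:
    """Pick a few metadata fields, tolerating ones that are absent.

    Illustrative helper (not a Kreuzberg API): metadata is format-specific,
    so every lookup falls back to None or an empty list.
    """
    meta = result.get("metadata") or {}
    return {
        "mime_type": result.get("mime_type"),
        "title": meta.get("title"),
        "authors": meta.get("authors") or [],
        "page_count": meta.get("page_count"),
        "table_count": len(result.get("tables") or []),
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```

In practice you would replace `SAMPLE` with the contents of `result.json` produced by the commands above.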

Kreuzberg extracts format-specific metadata for:

  • PDF: page count, title, authors (list), creation date, modification date
  • HTML: SEO tags, Open Graph, Twitter Card, structured data, headers, links, images
  • Excel: sheet count, sheet names
  • Email: from, to, CC, BCC, message ID, attachments
  • PowerPoint: title, author, description, fonts
  • Images: dimensions, format, EXIF data
  • Archives: format, file count, file list, sizes
  • XML: element count, unique elements
  • Text/Markdown: word count, line count, headers, links

See the Types Reference for the complete list of metadata fields.

Extract Tables

Tables come back as both structured cells and Markdown. Kreuzberg extracts them from PDFs, spreadsheets, and HTML:

=== "C"

--8<-- "snippets/c/metadata/tables.md"

=== "C#"

--8<-- "snippets/csharp/tables.md"

=== "Dart"

--8<-- "snippets/dart/metadata/tables.md"

=== "Go"

--8<-- "snippets/go/metadata/tables.md"

=== "Java"

--8<-- "snippets/java/metadata/tables.md"

=== "Kotlin"

--8<-- "snippets/kotlin/metadata/tables.md"

=== "Python"

--8<-- "snippets/python/utils/tables.md"

=== "Ruby"

--8<-- "snippets/ruby/metadata/tables.md"

=== "R"

--8<-- "snippets/r/metadata/tables.md"

=== "Rust"

--8<-- "snippets/rust/metadata/tables.md"

=== "Swift"

--8<-- "snippets/swift/metadata/tables.md"

=== "Elixir"

--8<-- "snippets/elixir/advanced/table_extraction.exs"

=== "TypeScript"

--8<-- "snippets/typescript/api/tables.md"

=== "Wasm"

--8<-- "snippets/wasm/api/tables.md"

=== "Zig"

--8<-- "snippets/zig/metadata/tables.md"

=== "CLI"

Extract and process tables from documents:

```bash title="Terminal"
# Extract with JSON format (includes tables when detected)
kreuzberg extract document.pdf --format json

# Save tables to JSON
kreuzberg extract spreadsheet.xlsx --format json > tables.json

# Extract and parse table markdown
kreuzberg extract document.pdf --format json | \
  jq '.tables[]? | .markdown'

# Get table cells
kreuzberg extract document.pdf --format json | \
  jq '.tables[]? | .cells'

# Batch extract tables from multiple files (in bash, recursive ** requires: shopt -s globstar)
kreuzberg batch documents/**/*.pdf --format json > all_tables.json
```

**JSON Table Structure:**

```json title="JSON"
{
  "content": "...",
  "tables": [
    {
      "cells": [
        ["Name", "Age", "City"],
        ["Alice", "30", "New York"],
        ["Bob", "25", "Los Angeles"]
      ],
      "markdown": "| Name | Age | City |\n|------|-----|--------|\n| Alice | 30 | New York |\n| Bob | 25 | Los Angeles |"
    }
  ]
}
```
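The `cells` array shown above is a list of rows, which makes it easy to reshape in a script. Here is a small sketch that turns each table into a list of records; it assumes the first row is the header, which holds for the sample but may not hold for every document. `table_to_records` is an illustrative helper, not a Kreuzberg function.

```python
# Data copied from the "JSON Table Structure" sample above.
result = {
    "content": "...",
    "tables": [
        {
            "cells": [
                ["Name", "Age", "City"],
                ["Alice", "30", "New York"],
                ["Bob", "25", "Los Angeles"],
            ],
        }
    ],
}


def table_to_records(table: dict) -> list:
    """Convert a table's `cells` into one dict per data row.

    Assumption: the first row holds the column names. Rows shorter than
    the header simply yield fewer keys, because zip() stops early.
    """
    cells = table.get("cells") or []
    if not cells:
        return []
    header, *rows = cells
    return [dict(zip(header, row)) for row in rows]


records = [table_to_records(t) for t in result["tables"]]
for row in records[0]:
    print(row)
```

Cell values stay as strings, as in the sample; convert them (for example `int(row["Age"])`) only where you know the column's type.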

Going Async

Use async extraction in web servers, background workers, or anywhere you need non-blocking I/O:

=== "C"

--8<-- "snippets/c/api/extract_file_async.md"

=== "C#"

--8<-- "snippets/csharp/extract_file_async.md"

=== "Dart"

--8<-- "snippets/dart/api/extract_file_async.md"

=== "Go"

--8<-- "snippets/go/api/extract_file_async.md"

=== "Java"

--8<-- "snippets/java/api/extract_file_async.md"

=== "Kotlin"

--8<-- "snippets/kotlin/api/extract_file_async.md"

=== "Python"

--8<-- "snippets/python/api/extract_file_async.md"

=== "Ruby"

--8<-- "snippets/ruby/api/extract_file_async.md"

=== "R"

--8<-- "snippets/r/api/extract_file_async.md"

=== "Rust"

--8<-- "snippets/rust/api/extract_file_async.md"

=== "Swift"

--8<-- "snippets/swift/api/extract_file_async.md"

=== "Elixir"

--8<-- "snippets/elixir/core/extract_file_async.exs"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/extract_file_async.md"

=== "Wasm"

--8<-- "snippets/wasm/getting-started/extract_file_async.md"

=== "Zig"

--8<-- "snippets/zig/api/extract_file_async.md"

=== "CLI"

!!! note "Not Applicable"
    Async extraction is an API-level feature. The CLI operates synchronously.
    Use language-specific bindings (Python, TypeScript, Rust, WASM) for async operations.
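In a web server or worker you will often want to extract several files concurrently without launching an unbounded number of tasks. The sketch below shows one common pattern in Python: a semaphore caps concurrency and `asyncio.gather` collects per-file results or errors. The `extract` parameter stands in for whatever async extraction call your binding provides; `fake_extract` is a placeholder written for this example and is not a Kreuzberg API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Union


async def extract_many(
    paths: List[str],
    extract: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> Dict[str, Union[str, BaseException]]:
    """Run `extract` over all paths, at most `max_concurrency` at a time.

    Failures are returned as exception objects instead of being raised,
    so one bad file does not cancel the rest.
    """
    limit = asyncio.Semaphore(max_concurrency)

    async def one(path: str) -> str:
        async with limit:
            return await extract(path)

    outcomes = await asyncio.gather(
        *(one(p) for p in paths), return_exceptions=True
    )
    return dict(zip(paths, outcomes))


async def fake_extract(path: str) -> str:
    """Placeholder extractor (hypothetical); swap in your binding's call."""
    await asyncio.sleep(0)  # yield to the event loop like real I/O would
    if path.endswith(".missing"):
        raise FileNotFoundError(path)
    return f"text of {path}"


results = asyncio.run(
    extract_many(["a.pdf", "b.docx", "c.missing"], fake_extract)
)
for path, outcome in results.items():
    status = "error" if isinstance(outcome, BaseException) else "ok"
    print(path, status)
```

Returning exceptions rather than raising keeps the error handling from the earlier section in the caller's hands: inspect each outcome and decide per file whether to retry, skip, or report.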

Next Steps

You've covered the core API. Go deeper: