fil/docs/getting-started/quickstart.md

# Quick Start

This guide walks you through Kreuzberg's core API — extracting text, handling errors,
running OCR, and working with metadata. Install your binding first if you haven't:
[Installation](installation.md).

TypeScript users: `@kreuzberg/node` for Node.js, `@kreuzberg/wasm` for browsers and edge runtimes — see [Language Support](../index.md#language-support).

## Your First Extraction

Pass a file path to get its text content. Kreuzberg detects the format automatically:

=== "C"

    --8<-- "snippets/c/api/extract_file_sync.md"

=== "C#"

    --8<-- "snippets/csharp/extract_file_sync.md"

=== "Dart"

    --8<-- "snippets/dart/api/extract_file_sync.md"

=== "Go"

    --8<-- "snippets/go/api/extract_file_sync.md"

=== "Java"

    --8<-- "snippets/java/api/extract_file_sync.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/extract_file_sync.md"

=== "Python"

    --8<-- "snippets/python/api/extract_file_sync.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/extract_file_sync.md"

=== "R"

    --8<-- "snippets/r/api/extract_file_sync.md"

=== "Rust"

    --8<-- "snippets/rust/api/extract_file_sync.md"

=== "Swift"

    --8<-- "snippets/swift/api/extract_file_sync.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/extract_file_sync.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/extract_file_sync.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/extract_file_sync.md"

=== "Zig"

    --8<-- "snippets/zig/api/extract_file_sync.md"

=== "CLI"

    --8<-- "snippets/cli/extract_basic.md"

## Handle Errors

Wrap extractions in error handling before going further. Kreuzberg raises specific
exceptions for missing files, parse failures, and OCR problems:

=== "C"

    --8<-- "snippets/c/api/error_handling.md"

=== "C#"

    --8<-- "snippets/csharp/error_handling.md"

=== "Dart"

    --8<-- "snippets/dart/api/error_handling.md"

=== "Go"

    --8<-- "snippets/go/api/error_handling.md"

=== "Java"

    --8<-- "snippets/java/api/error_handling.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/error_handling.md"

=== "Python"

    --8<-- "snippets/python/utils/error_handling.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/error_handling.md"

=== "R"

    --8<-- "snippets/r/api/error_handling.md"

=== "Rust"

    --8<-- "snippets/rust/api/error_handling.md"

=== "Swift"

    --8<-- "snippets/swift/api/error_handling.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/error_handling.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/api/error_handling.md"

=== "Wasm"

    --8<-- "snippets/wasm/api/error_handling.md"

=== "Zig"

    --8<-- "snippets/zig/api/error_handling.md"

## OCR for Scanned Documents

Kreuzberg runs OCR automatically when it detects an image or scanned PDF.
You can also force OCR on any document:

=== "C"

    --8<-- "snippets/c/ocr/ocr_extraction.md"

=== "C#"

    --8<-- "snippets/csharp/ocr_extraction.md"

=== "Dart"

    --8<-- "snippets/dart/ocr/ocr_extraction.md"

=== "Go"

    --8<-- "snippets/go/ocr/ocr_extraction.md"

=== "Java"

    --8<-- "snippets/java/ocr/ocr_extraction.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/ocr/ocr_extraction.md"

=== "Python"

    --8<-- "snippets/python/ocr/ocr_extraction.md"

=== "Ruby"

    --8<-- "snippets/ruby/ocr/ocr_extraction.md"

=== "R"

    --8<-- "snippets/r/ocr/ocr_extraction.md"

=== "Rust"

    --8<-- "snippets/rust/ocr/ocr_extraction.md"

=== "Swift"

    --8<-- "snippets/swift/ocr/ocr_extraction.md"

=== "Elixir"

    --8<-- "snippets/elixir/ocr/tesseract_basic.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/ocr/ocr_extraction.md"

=== "Wasm"

    --8<-- "snippets/wasm/ocr/ocr_extraction.md"

=== "Zig"

    --8<-- "snippets/zig/ocr/ocr_extraction.md"

=== "CLI"

    --8<-- "snippets/cli/ocr_basic.md"

## Process Multiple Files

Pass a list of paths to extract them in parallel:

=== "C"

    --8<-- "snippets/c/api/batch_extract_files_sync.md"

=== "C#"

    --8<-- "snippets/csharp/batch_extract_files_sync.md"

=== "Dart"

    --8<-- "snippets/dart/api/batch_extract_files_sync.md"

=== "Go"

    --8<-- "snippets/go/api/batch_extract_files_sync.md"

=== "Java"

    --8<-- "snippets/java/api/batch_extract_files_sync.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/batch_extract_files_sync.md"

=== "Python"

    --8<-- "snippets/python/api/batch_extract_files_sync.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/batch_extract_files_sync.md"

=== "R"

    --8<-- "snippets/r/api/batch_extract_files_sync.md"

=== "Rust"

    --8<-- "snippets/rust/api/batch_extract_files_sync.md"

=== "Swift"

    --8<-- "snippets/swift/api/batch_extract_files_sync.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/batch_extract_files_sync.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/batch_extract_files_sync.md"

=== "Zig"

    --8<-- "snippets/zig/api/batch_extract_files_sync.md"

=== "CLI"

    --8<-- "snippets/cli/batch_basic.md"

## Read Document Metadata

Every extraction result includes format-specific metadata — page count for PDFs,
sheet names for Excel, dimensions for images:

=== "C"

    --8<-- "snippets/c/metadata/metadata.md"

=== "C#"

    --8<-- "snippets/csharp/metadata.md"

=== "Dart"

    --8<-- "snippets/dart/metadata/metadata.md"

=== "Go"

    --8<-- "snippets/go/metadata/metadata.md"

=== "Java"

    --8<-- "snippets/java/metadata/metadata.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/metadata/metadata.md"

=== "Python"

    --8<-- "snippets/python/metadata/metadata.md"

=== "Ruby"

    --8<-- "snippets/ruby/metadata/metadata.md"

=== "R"

    --8<-- "snippets/r/metadata/metadata.md"

=== "Rust"

    --8<-- "snippets/rust/metadata/metadata.md"

=== "Swift"

    --8<-- "snippets/swift/metadata/metadata.md"

=== "Elixir"

    --8<-- "snippets/elixir/advanced/metadata_extraction.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/metadata/metadata.md"

=== "Wasm"

    --8<-- "snippets/wasm/metadata/metadata.md"

=== "Zig"

    --8<-- "snippets/zig/metadata/metadata.md"

=== "CLI"

    Extract and parse metadata using JSON output:

    ```bash title="Terminal"
    # Extract with metadata (JSON format includes metadata automatically)
    kreuzberg extract document.pdf --format json

    # Save to file and parse metadata
    kreuzberg extract document.pdf --format json > result.json

    # Print all metadata fields
    cat result.json | jq '.metadata'

    # Extract HTML metadata
    kreuzberg extract page.html --format json | jq '.metadata'

    # Get specific fields
    kreuzberg extract document.pdf --format json | \
      jq '.metadata | {page_count, authors, title}'

    # Process multiple files
    kreuzberg batch documents/*.pdf --format json > all_metadata.json
    ```

    **JSON Output Structure:**

    ```json title="JSON"
    {
      "content": "Extracted text...",
      "mime_type": "application/pdf",
      "metadata": {
        "title": "Document Title",
        "authors": ["John Doe"],
        "created_by": "LaTeX with hyperref package",
        "format_type": "pdf",
        "page_count": 10
      },
      "tables": []
    }
    ```

Kreuzberg extracts format-specific metadata for:

- **PDF**: page count, title, authors (list), creation date, modification date
- **HTML**: SEO tags, Open Graph, Twitter Card, structured data, headers, links, images
- **Excel**: sheet count, sheet names
- **Email**: from, to, CC, BCC, message ID, attachments
- **PowerPoint**: title, author, description, fonts
- **Images**: dimensions, format, EXIF data
- **Archives**: format, file count, file list, sizes
- **XML**: element count, unique elements
- **Text/Markdown**: word count, line count, headers, links

See [Types Reference](../reference/types.md) for complete metadata reference.

## Extract Tables

Tables come back as both structured cells and Markdown. Kreuzberg extracts them
from PDFs, spreadsheets, and HTML:

=== "C"

    --8<-- "snippets/c/metadata/tables.md"

=== "C#"

    --8<-- "snippets/csharp/tables.md"

=== "Dart"

    --8<-- "snippets/dart/metadata/tables.md"

=== "Go"

    --8<-- "snippets/go/metadata/tables.md"

=== "Java"

    --8<-- "snippets/java/metadata/tables.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/metadata/tables.md"

=== "Python"

    --8<-- "snippets/python/utils/tables.md"

=== "Ruby"

    --8<-- "snippets/ruby/metadata/tables.md"

=== "R"

    --8<-- "snippets/r/metadata/tables.md"

=== "Rust"

    --8<-- "snippets/rust/metadata/tables.md"

=== "Swift"

    --8<-- "snippets/swift/metadata/tables.md"

=== "Elixir"

    --8<-- "snippets/elixir/advanced/table_extraction.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/api/tables.md"

=== "Wasm"

    --8<-- "snippets/wasm/api/tables.md"

=== "Zig"

    --8<-- "snippets/zig/metadata/tables.md"

=== "CLI"

    Extract and process tables from documents:

    ```bash title="Terminal"
    # Extract with JSON format (includes tables when detected)
    kreuzberg extract document.pdf --format json

    # Save tables to JSON
    kreuzberg extract spreadsheet.xlsx --format json > tables.json

    # Extract and parse table markdown
    kreuzberg extract document.pdf --format json | \
      jq '.tables[]? | .markdown'

    # Get table cells
    kreuzberg extract document.pdf --format json | \
      jq '.tables[]? | .cells'

    # Batch extract tables from multiple files
    kreuzberg batch documents/**/*.pdf --format json > all_tables.json
    ```

    **JSON Table Structure:**

    ```json title="JSON"
    {
      "content": "...",
      "tables": [
        {
          "cells": [
            ["Name", "Age", "City"],
            ["Alice", "30", "New York"],
            ["Bob", "25", "Los Angeles"]
          ],
          "markdown": "| Name | Age | City |\n|------|-----|--------|\n| Alice | 30 | New York |\n| Bob | 25 | Los Angeles |"
        }
      ]
    }
    ```

## Going Async

Use async extraction in web servers, background workers, or anywhere you need
non-blocking I/O:

=== "C"

    --8<-- "snippets/c/api/extract_file_async.md"

=== "C#"

    --8<-- "snippets/csharp/extract_file_async.md"

=== "Dart"

    --8<-- "snippets/dart/api/extract_file_async.md"

=== "Go"

    --8<-- "snippets/go/api/extract_file_async.md"

=== "Java"

    --8<-- "snippets/java/api/extract_file_async.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/extract_file_async.md"

=== "Python"

    --8<-- "snippets/python/api/extract_file_async.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/extract_file_async.md"

=== "R"

    --8<-- "snippets/r/api/extract_file_async.md"

=== "Rust"

    --8<-- "snippets/rust/api/extract_file_async.md"

=== "Swift"

    --8<-- "snippets/swift/api/extract_file_async.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/extract_file_async.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/extract_file_async.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/extract_file_async.md"

=== "Zig"

    --8<-- "snippets/zig/api/extract_file_async.md"

=== "CLI"

    !!! note "Not Applicable"
        Async extraction is an API-level feature. The CLI operates synchronously.
        Use language-specific bindings (Python, TypeScript, Rust, WASM) for async operations.

## Next Steps

You've covered the core API. Go deeper:

- **[Configuration Guide](../guides/configuration.md)** — OCR backends, chunking, language detection, config files
- **[Extract from Bytes](../reference/api-python.md#extract_bytes_sync)** — Process in-memory data without writing to disk
- **[OCR Setup](../guides/ocr.md)** — Tesseract, PaddleOCR, EasyOCR backends
- **[Types Reference](../reference/types.md)** — Full metadata fields for every format
- **[Docker Deployment](../guides/docker.md)** — Run Kreuzberg in containers
- **[API Reference](../reference/api-python.md)** — Complete API documentation