Nomad changes
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s

This commit is contained in:
Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

View File

@@ -0,0 +1,480 @@
---
description: "Install Kreuzberg — pick Python, TypeScript, Rust, Go, CLI/Docker, or any of 12 supported languages."
---
# Installation
Native bindings for 17 languages plus a standalone CLI. Every package ships **prebuilt binaries** for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows — no compile step needed.
<div class="cli-hero" markdown>
## :material-console: CLI / Docker { #cli--docker }
No SDK, no code — just your terminal.
=== "Install script"
```bash
curl -fsSL https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/scripts/install.sh | bash
```
=== "Homebrew"
```bash
brew install kreuzberg-dev/tap/kreuzberg
```
=== "Cargo"
```bash
cargo install kreuzberg-cli
```
=== "Docker (CLI image)"
```bash
docker pull ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest extract /data/document.pdf
```
=== "Docker (full image)"
```bash
docker pull ghcr.io/kreuzberg-dev/kreuzberg:latest
```
[CLI Usage](../cli/usage.md){ .install-btn .install-btn--ghost }
[API Server Guide](../guides/api-server.md){ .install-btn .install-btn--solid }
</div>
!!! Warning "x86_64 CPU — AVX/AVX2 instruction set required"
The bundled ONNX Runtime binaries require **AVX/AVX2** CPU instructions. CPUs without AVX support (e.g. Intel Atom, Celeron N5105/Jasper Lake, older pre-2011 processors) will crash with an `invalid opcode` trap when using ONNX-dependent features. The affected features are **PaddleOCR**, **layout detection**, and **embeddings**. All other Kreuzberg functionality (text extraction, Tesseract OCR, chunking, metadata, etc.) works normally on any x86_64 CPU. ARM platforms (aarch64) are unaffected.
!!! Warning "Windows — ONNX Runtime required for Go, Elixir, and C/C++"
Go, Elixir, and C/C++ bindings on Windows link against ONNX Runtime dynamically. You must have `onnxruntime.dll` on your `PATH` at runtime. Download it from the [ONNX Runtime releases](https://github.com/microsoft/onnxruntime/releases) (for example `onnxruntime-win-x64-1.24.1.zip`). Python, TypeScript, Java, C#, Ruby, PHP, and Wasm are unaffected.
## Choose your language
<div class="grid cards install-cards" markdown>
- :fontawesome-brands-python:{ .lg .middle } **Python**
***
```bash
pip install kreuzberg
```
[API Reference](../reference/api-python.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }
- :fontawesome-brands-node-js:{ .lg .middle } **TypeScript (Node.js / Bun)**
***
```bash
npm install @kreuzberg/node
```
[API Reference](../reference/api-typescript.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](#typescript){ .install-btn .install-btn--solid .install-btn--sm }
- :fontawesome-brands-js:{ .lg .middle } **TypeScript (Browser / Edge)**
***
```bash
npm install @kreuzberg/wasm
```
[API Reference](../reference/api-wasm.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](#typescript){ .install-btn .install-btn--solid .install-btn--sm }
- :fontawesome-brands-rust:{ .lg .middle } **Rust**
***
```bash
cargo add kreuzberg
```
[API Reference](../reference/api-rust.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }
- :fontawesome-brands-golang:{ .lg .middle } **Go**
***
```bash
go get github.com/kreuzberg-dev/kreuzberg/v5@latest
```
[API Reference](../reference/api-go.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }
- :fontawesome-brands-java:{ .lg .middle } **Java**
***
```gradle
implementation 'dev.kreuzberg:kreuzberg:5.0.0-rc.1'
```
[API Reference](../reference/api-java.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](#java){ .install-btn .install-btn--solid .install-btn--sm }
- :fontawesome-brands-kotlin:{ .lg .middle } **Kotlin**
***
```kotlin
implementation("dev.kreuzberg:kreuzberg-kotlin:5.0.0-rc.1")
```
[API Reference](../reference/api-kotlin.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](#kotlin){ .install-btn .install-btn--solid .install-btn--sm }
- :material-language-ruby:{ .lg .middle } **Ruby**
***
```bash
gem install kreuzberg
```
[API Reference](../reference/api-ruby.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }
- :fontawesome-brands-swift:{ .lg .middle } **Swift**
***
```swift
.package(url: "https://github.com/kreuzberg-dev/kreuzberg.git", from: "5.0.0-rc.1")
```
[API Reference](../reference/api-swift.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](#swift){ .install-btn .install-btn--solid .install-btn--sm }
- :material-language-csharp:{ .lg .middle } **C# / .NET**
***
```bash
dotnet add package Kreuzberg
```
[API Reference](../reference/api-csharp.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](../reference/api-csharp.md){ .install-btn .install-btn--solid .install-btn--sm }
- :fontawesome-brands-php:{ .lg .middle } **PHP**
***
```bash
composer require kreuzberg/kreuzberg
```
[API Reference](../reference/api-php.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }
- :simple-elixir:{ .lg .middle } **Elixir**
***
```elixir
{:kreuzberg, "~> 5.0.0-rc.1"}
```
[API Reference](../reference/api-elixir.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](#elixir){ .install-btn .install-btn--solid .install-btn--sm }
- :simple-r:{ .lg .middle } **R**
***
```r
install.packages("kreuzberg",
repos = "https://kreuzberg-dev.r-universe.dev")
```
[API Reference](../reference/api-r.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }
- :simple-cplusplus:{ .lg .middle } **C / C++**
***
```bash
cargo build -p kreuzberg-ffi
```
[API Reference](../reference/api-c.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](#c-c){ .install-btn .install-btn--solid .install-btn--sm }
- :material-language-dart:{ .lg .middle } **Dart / Flutter**
***
```bash
dart pub add kreuzberg
```
[API Reference](../reference/api-dart.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](#dart){ .install-btn .install-btn--solid .install-btn--sm }
- :material-language-zig:{ .lg .middle } **Zig**
***
```bash
zig fetch --save https://github.com/kreuzberg-dev/kreuzberg/archive/refs/tags/v5.0.0-rc.1.tar.gz
```
[API Reference](../reference/api-zig.md){ .install-api-link }
[:material-lightning-bolt: Quick Start](#zig){ .install-btn .install-btn--solid .install-btn--sm }
</div>
---
## System requirements
Only relevant if building from source or enabling OCR:
| Dependency | When you need it |
| ------------------------- | -------------------------------------------------------------------------------------- |
| AVX/AVX2 CPU instructions | Required for ONNX Runtime features (PaddleOCR, layout detection, embeddings) on x86_64 |
| Rust toolchain (`rustup`) | Building any native binding from source |
| C/C++ compiler | Building native bindings (Xcode command-line tools / `build-essential` / MSVC) |
| Tesseract OCR | Optional — `brew install tesseract` / `apt install tesseract-ocr` |
| PDFium | Auto-fetched during builds |
The Wasm package (`@kreuzberg/wasm`) has **zero** system dependencies.
### GPU Acceleration
Kreuzberg bundles a CPU-only ONNX Runtime — ML features (PaddleOCR, layout detection, embeddings) work out of the box on CPU.
For GPU acceleration, install a GPU-enabled ONNX Runtime and set `ORT_DYLIB_PATH`:
| Platform | Install | Set ORT_DYLIB_PATH |
| --------------- | ---------------------------------------------------------------------------------------- | -------------------------------------------------- |
| Linux (CUDA) | Download from [ONNX Runtime releases](https://github.com/microsoft/onnxruntime/releases) | `export ORT_DYLIB_PATH=/path/to/libonnxruntime.so` |
| Python (any OS) | `pip install onnxruntime-gpu` | Point at the pip package's `capi/` directory |
| macOS (CoreML) | Works with bundled ORT — no extra setup needed | — |
See [AccelerationConfig](../reference/configuration.md#accelerationconfig) and [ORT_DYLIB_PATH](../reference/environment-variables.md#ort_dylib_path) for details.
---
## Language-specific notes
Edge cases and alternative install methods where they come up.
### TypeScript
Two npm packages target different runtimes:
| Package | Best for | Performance |
| ----------------- | ---------------------------------- | -------------- |
| `@kreuzberg/node` | Node.js, Bun — server-side apps | Native (100%) |
| `@kreuzberg/wasm` | Browsers, Deno, Cloudflare Workers | Wasm (~60-80%) |
Both work with **pnpm** (`pnpm add`) and **Yarn** (`yarn add`) as well.
!!! Note "pnpm workspaces"
In monorepos, add this to your root `.npmrc` so platform-specific optional deps resolve correctly:
```ini
auto-install-peers=true
```
??? Note "Wasm — Browser usage"
```html
<script type="module">
import { initWasm, extractFromFile } from "@kreuzberg/wasm";
await initWasm();
const input = document.getElementById("file");
input.addEventListener("change", async (e) => {
const result = await extractFromFile(e.target.files[0]);
console.log(result.content);
});
</script>
<input type="file" id="file" />
```
??? Note "Wasm — Deno"
```typescript
import { initWasm, extractFile } from "npm:@kreuzberg/wasm";
await initWasm();
const result = await extractFile("./document.pdf");
console.log(result.content);
```
??? Note "Wasm — Cloudflare Workers"
```typescript
import { initWasm, extractBytes } from "@kreuzberg/wasm";
export default {
async fetch(request: Request): Promise<Response> {
await initWasm();
const bytes = new Uint8Array(await request.arrayBuffer());
const result = await extractBytes(bytes, "application/pdf");
return Response.json({ content: result.content });
},
};
```
**Supported runtimes:** Chrome 74+, Firefox 79+, Safari 14+, Edge 79+, Node.js 22+, Deno 1.35+, Cloudflare Workers.
!!! Warning "Wasm Platform Limitations"
The Wasm binding does not support:
- **Layout detection** (RT-DETR model inference requires ONNX Runtime unavailable in WebAssembly)
- **Hardware acceleration config** (single-threaded WASM, no GPU access)
- **Concurrency config** (single-threaded environment, `maxThreads` is ignored)
- **Email codepage config** (EmailConfig not available)
All other features (text extraction, OCR via Tesseract WASM, chunking, embeddings, metadata, tables, language detection, image extraction) work fully in WASM. See the [WASM API Reference](../reference/api-wasm.md) for details.
### Java
=== "Maven"
```xml
<dependency>
<groupId>dev.kreuzberg</groupId>
<artifactId>kreuzberg</artifactId>
<version>5.0.0-rc.1</version>
</dependency>
```
=== "Gradle"
```gradle
implementation 'dev.kreuzberg:kreuzberg:5.0.0-rc.1'
```
Requires Java 25+ (FFM/Panama API). Native libraries are bundled in the JAR.
### Elixir
Add to `mix.exs`:
```elixir
def deps do
[
{:kreuzberg, "~> 5.0.0-rc.1"}
]
end
```
```bash
mix deps.get
```
Ships prebuilt NIF binaries via RustlerPrecompiled. Falls back to compiling from source if no prebuilt matches your platform (requires Rust).
!!! Warning "Windows"
The Windows NIF links against ONNX Runtime dynamically. `onnxruntime.dll` must be on your `PATH` at runtime — see the note at the top of this page.
### Go
```bash
go get github.com/kreuzberg-dev/kreuzberg/v5@latest
```
!!! Warning "Windows"
The Go binding links against ONNX Runtime dynamically on Windows. `onnxruntime.dll` must be on your `PATH` at runtime — see the note at the top of this page.
!!! Note "Windows feature limitations"
The Go and C/C++ bindings on Windows (MinGW/GNU target) do not include **PaddleOCR**, **layout detection**, or **auto-rotate**. Tesseract OCR and all other features work normally. These limitations apply only to Windows; Linux and macOS builds include the full feature set.
### Rust
Enable features selectively in `Cargo.toml`:
```toml title="Cargo.toml"
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# Optional features: pdf, ocr, chunking
```
### C / C++
Build the FFI library from source:
```bash
cargo build --release -p kreuzberg-ffi
```
This produces `libkreuzberg_ffi.a` and a header at `crates/kreuzberg-ffi/kreuzberg.h`. Link into your project:
```makefile
HEADER_DIR = path/to/crates/kreuzberg-ffi
LIBDIR = path/to/target/release
CFLAGS = -Wall -Wextra -I$(HEADER_DIR)
LDFLAGS = -L$(LIBDIR) -lkreuzberg_ffi -lpthread -ldl -lm
my_app: my_app.c
$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
```
!!! Tip "Platform-specific linker flags"
**macOS:** add `-framework CoreFoundation -framework Security`
**Windows:** add `-lws2_32 -luserenv -lbcrypt`
!!! Warning "Windows"
The Windows FFI library links against ONNX Runtime dynamically. `onnxruntime.dll` must be on your `PATH` at runtime — see the note at the top of this page.
[API Reference →](../reference/api-c.md)
### Dart / Flutter { #dart }
Pure-Dart and Flutter consumers share the same package. Dart SDK 3.0 or higher is required. Flutter is supported on macOS, iOS, Android, Linux, and Windows; Flutter Web is not supported because the runtime is a native dynamic library delivered via flutter_rust_bridge. For Flutter projects use `flutter pub add kreuzberg` instead of `dart pub add kreuzberg`.
### Kotlin { #kotlin }
The Kotlin module sits on top of the Java facade and reuses its Foreign Function & Memory native loader, so the same bundled binaries serve both bindings. Requires JDK 25 or higher. Use the Kotlin DSL block above for `build.gradle.kts` consumers; Maven and Groovy DSL are also supported — see the README at packages/kotlin/ for both.
### Swift { #swift }
Swift Package Manager from `swift-tools-version: 6.0` upward. Targets macOS 13+ and iOS 16+; Linux is not currently declared in `Package.swift`. Once the package ships its `binaryTarget`, no manual cargo build is needed; in the interim, building the library locally requires `cargo build -p kreuzberg-swift` against the workspace.
### Zig { #zig }
Requires Zig 0.16.0 or higher (declared via `minimum_zig_version` in `build.zig.zon`). The Zig binding consumes the C FFI surface from `kreuzberg-ffi` via `linkSystemLibrary`; the build expects the consumer to provide a search path to the prebuilt `libkreuzberg_ffi` and the C header `kreuzberg.h`. The `zig fetch` command above pins the source archive in `build.zig.zon`; wire it into `build.zig` via `b.dependency("kreuzberg", ...)`.
---
## Development setup
For working on the Kreuzberg repository itself:
```bash
task setup # installs all language toolchains
task lint # linters across all languages
task dev:test # full test suite
```
See [Contributing](../contributing.md) for conventions and expectations.

View File

@@ -0,0 +1,586 @@
# Quick Start
This guide walks you through Kreuzberg's core API — extracting text, handling errors,
running OCR, and working with metadata. Install your binding first if you haven't:
[Installation](installation.md).
TypeScript users: `@kreuzberg/node` for Node.js, `@kreuzberg/wasm` for browsers and edge runtimes — see [Language Support](../index.md#language-support).
## Your First Extraction
Pass a file path to get its text content. Kreuzberg detects the format automatically:
=== "C"
--8<-- "snippets/c/api/extract_file_sync.md"
=== "C#"
--8<-- "snippets/csharp/extract_file_sync.md"
=== "Dart"
--8<-- "snippets/dart/api/extract_file_sync.md"
=== "Go"
--8<-- "snippets/go/api/extract_file_sync.md"
=== "Java"
--8<-- "snippets/java/api/extract_file_sync.md"
=== "Kotlin"
--8<-- "snippets/kotlin/api/extract_file_sync.md"
=== "Python"
--8<-- "snippets/python/api/extract_file_sync.md"
=== "Ruby"
--8<-- "snippets/ruby/api/extract_file_sync.md"
=== "R"
--8<-- "snippets/r/api/extract_file_sync.md"
=== "Rust"
--8<-- "snippets/rust/api/extract_file_sync.md"
=== "Swift"
--8<-- "snippets/swift/api/extract_file_sync.md"
=== "Elixir"
--8<-- "snippets/elixir/core/extract_file_sync.exs"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/extract_file_sync.md"
=== "Wasm"
--8<-- "snippets/wasm/getting-started/extract_file_sync.md"
=== "Zig"
--8<-- "snippets/zig/api/extract_file_sync.md"
=== "CLI"
--8<-- "snippets/cli/extract_basic.md"
## Handle Errors
Wrap extractions in error handling before going further. Kreuzberg raises specific
exceptions for missing files, parse failures, and OCR problems:
=== "C"
--8<-- "snippets/c/api/error_handling.md"
=== "C#"
--8<-- "snippets/csharp/error_handling.md"
=== "Dart"
--8<-- "snippets/dart/api/error_handling.md"
=== "Go"
--8<-- "snippets/go/api/error_handling.md"
=== "Java"
--8<-- "snippets/java/api/error_handling.md"
=== "Kotlin"
--8<-- "snippets/kotlin/api/error_handling.md"
=== "Python"
--8<-- "snippets/python/utils/error_handling.md"
=== "Ruby"
--8<-- "snippets/ruby/api/error_handling.md"
=== "R"
--8<-- "snippets/r/api/error_handling.md"
=== "Rust"
--8<-- "snippets/rust/api/error_handling.md"
=== "Swift"
--8<-- "snippets/swift/api/error_handling.md"
=== "Elixir"
--8<-- "snippets/elixir/core/error_handling.exs"
=== "TypeScript"
--8<-- "snippets/typescript/api/error_handling.md"
=== "Wasm"
--8<-- "snippets/wasm/api/error_handling.md"
=== "Zig"
--8<-- "snippets/zig/api/error_handling.md"
## OCR for Scanned Documents
Kreuzberg runs OCR automatically when it detects an image or scanned PDF.
You can also force OCR on any document:
=== "C"
--8<-- "snippets/c/ocr/ocr_extraction.md"
=== "C#"
--8<-- "snippets/csharp/ocr_extraction.md"
=== "Dart"
--8<-- "snippets/dart/ocr/ocr_extraction.md"
=== "Go"
--8<-- "snippets/go/ocr/ocr_extraction.md"
=== "Java"
--8<-- "snippets/java/ocr/ocr_extraction.md"
=== "Kotlin"
--8<-- "snippets/kotlin/ocr/ocr_extraction.md"
=== "Python"
--8<-- "snippets/python/ocr/ocr_extraction.md"
=== "Ruby"
--8<-- "snippets/ruby/ocr/ocr_extraction.md"
=== "R"
--8<-- "snippets/r/ocr/ocr_extraction.md"
=== "Rust"
--8<-- "snippets/rust/ocr/ocr_extraction.md"
=== "Swift"
--8<-- "snippets/swift/ocr/ocr_extraction.md"
=== "Elixir"
--8<-- "snippets/elixir/ocr/tesseract_basic.exs"
=== "TypeScript"
--8<-- "snippets/typescript/ocr/ocr_extraction.md"
=== "Wasm"
--8<-- "snippets/wasm/ocr/ocr_extraction.md"
=== "Zig"
--8<-- "snippets/zig/ocr/ocr_extraction.md"
=== "CLI"
--8<-- "snippets/cli/ocr_basic.md"
## Process Multiple Files
Pass a list of paths to extract them in parallel:
=== "C"
--8<-- "snippets/c/api/batch_extract_files_sync.md"
=== "C#"
--8<-- "snippets/csharp/batch_extract_files_sync.md"
=== "Dart"
--8<-- "snippets/dart/api/batch_extract_files_sync.md"
=== "Go"
--8<-- "snippets/go/api/batch_extract_files_sync.md"
=== "Java"
--8<-- "snippets/java/api/batch_extract_files_sync.md"
=== "Kotlin"
--8<-- "snippets/kotlin/api/batch_extract_files_sync.md"
=== "Python"
--8<-- "snippets/python/api/batch_extract_files_sync.md"
=== "Ruby"
--8<-- "snippets/ruby/api/batch_extract_files_sync.md"
=== "R"
--8<-- "snippets/r/api/batch_extract_files_sync.md"
=== "Rust"
--8<-- "snippets/rust/api/batch_extract_files_sync.md"
=== "Swift"
--8<-- "snippets/swift/api/batch_extract_files_sync.md"
=== "Elixir"
--8<-- "snippets/elixir/core/batch_extract_files_sync.exs"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"
=== "Wasm"
--8<-- "snippets/wasm/getting-started/batch_extract_files_sync.md"
=== "Zig"
--8<-- "snippets/zig/api/batch_extract_files_sync.md"
=== "CLI"
--8<-- "snippets/cli/batch_basic.md"
## Read Document Metadata
Every extraction result includes format-specific metadata — page count for PDFs,
sheet names for Excel, dimensions for images:
=== "C"
--8<-- "snippets/c/metadata/metadata.md"
=== "C#"
--8<-- "snippets/csharp/metadata.md"
=== "Dart"
--8<-- "snippets/dart/metadata/metadata.md"
=== "Go"
--8<-- "snippets/go/metadata/metadata.md"
=== "Java"
--8<-- "snippets/java/metadata/metadata.md"
=== "Kotlin"
--8<-- "snippets/kotlin/metadata/metadata.md"
=== "Python"
--8<-- "snippets/python/metadata/metadata.md"
=== "Ruby"
--8<-- "snippets/ruby/metadata/metadata.md"
=== "R"
--8<-- "snippets/r/metadata/metadata.md"
=== "Rust"
--8<-- "snippets/rust/metadata/metadata.md"
=== "Swift"
--8<-- "snippets/swift/metadata/metadata.md"
=== "Elixir"
--8<-- "snippets/elixir/advanced/metadata_extraction.exs"
=== "TypeScript"
--8<-- "snippets/typescript/metadata/metadata.md"
=== "Wasm"
--8<-- "snippets/wasm/metadata/metadata.md"
=== "Zig"
--8<-- "snippets/zig/metadata/metadata.md"
=== "CLI"
Extract and parse metadata using JSON output:
```bash title="Terminal"
# Extract with metadata (JSON format includes metadata automatically)
kreuzberg extract document.pdf --format json
# Save to file and parse metadata
kreuzberg extract document.pdf --format json > result.json
# Print all metadata fields
cat result.json | jq '.metadata'
# Extract HTML metadata
kreuzberg extract page.html --format json | jq '.metadata'
# Get specific fields
kreuzberg extract document.pdf --format json | \
jq '.metadata | {page_count, authors, title}'
# Process multiple files
kreuzberg batch documents/*.pdf --format json > all_metadata.json
```
**JSON Output Structure:**
```json title="JSON"
{
"content": "Extracted text...",
"mime_type": "application/pdf",
"metadata": {
"title": "Document Title",
"authors": ["John Doe"],
"created_by": "LaTeX with hyperref package",
"format_type": "pdf",
"page_count": 10
},
"tables": []
}
```
Kreuzberg extracts format-specific metadata for:
- **PDF**: page count, title, authors (list), creation date, modification date
- **HTML**: SEO tags, Open Graph, Twitter Card, structured data, headers, links, images
- **Excel**: sheet count, sheet names
- **Email**: from, to, CC, BCC, message ID, attachments
- **PowerPoint**: title, author, description, fonts
- **Images**: dimensions, format, EXIF data
- **Archives**: format, file count, file list, sizes
- **XML**: element count, unique elements
- **Text/Markdown**: word count, line count, headers, links
See [Types Reference](../reference/types.md) for complete metadata reference.
## Extract Tables
Tables come back as both structured cells and Markdown. Kreuzberg extracts them
from PDFs, spreadsheets, and HTML:
=== "C"
--8<-- "snippets/c/metadata/tables.md"
=== "C#"
--8<-- "snippets/csharp/tables.md"
=== "Dart"
--8<-- "snippets/dart/metadata/tables.md"
=== "Go"
--8<-- "snippets/go/metadata/tables.md"
=== "Java"
--8<-- "snippets/java/metadata/tables.md"
=== "Kotlin"
--8<-- "snippets/kotlin/metadata/tables.md"
=== "Python"
--8<-- "snippets/python/utils/tables.md"
=== "Ruby"
--8<-- "snippets/ruby/metadata/tables.md"
=== "R"
--8<-- "snippets/r/metadata/tables.md"
=== "Rust"
--8<-- "snippets/rust/metadata/tables.md"
=== "Swift"
--8<-- "snippets/swift/metadata/tables.md"
=== "Elixir"
--8<-- "snippets/elixir/advanced/table_extraction.exs"
=== "TypeScript"
--8<-- "snippets/typescript/api/tables.md"
=== "Wasm"
--8<-- "snippets/wasm/api/tables.md"
=== "Zig"
--8<-- "snippets/zig/metadata/tables.md"
=== "CLI"
Extract and process tables from documents:
```bash title="Terminal"
# Extract with JSON format (includes tables when detected)
kreuzberg extract document.pdf --format json
# Save tables to JSON
kreuzberg extract spreadsheet.xlsx --format json > tables.json
# Extract and parse table markdown
kreuzberg extract document.pdf --format json | \
jq '.tables[]? | .markdown'
# Get table cells
kreuzberg extract document.pdf --format json | \
jq '.tables[]? | .cells'
# Batch extract tables from multiple files
kreuzberg batch documents/**/*.pdf --format json > all_tables.json
```
**JSON Table Structure:**
```json title="JSON"
{
"content": "...",
"tables": [
{
"cells": [
["Name", "Age", "City"],
["Alice", "30", "New York"],
["Bob", "25", "Los Angeles"]
],
"markdown": "| Name | Age | City |\n|------|-----|--------|\n| Alice | 30 | New York |\n| Bob | 25 | Los Angeles |"
}
]
}
```
## Going Async
Use async extraction in web servers, background workers, or anywhere you need
non-blocking I/O:
=== "C"
--8<-- "snippets/c/api/extract_file_async.md"
=== "C#"
--8<-- "snippets/csharp/extract_file_async.md"
=== "Dart"
--8<-- "snippets/dart/api/extract_file_async.md"
=== "Go"
--8<-- "snippets/go/api/extract_file_async.md"
=== "Java"
--8<-- "snippets/java/api/extract_file_async.md"
=== "Kotlin"
--8<-- "snippets/kotlin/api/extract_file_async.md"
=== "Python"
--8<-- "snippets/python/api/extract_file_async.md"
=== "Ruby"
--8<-- "snippets/ruby/api/extract_file_async.md"
=== "R"
--8<-- "snippets/r/api/extract_file_async.md"
=== "Rust"
--8<-- "snippets/rust/api/extract_file_async.md"
=== "Swift"
--8<-- "snippets/swift/api/extract_file_async.md"
=== "Elixir"
--8<-- "snippets/elixir/core/extract_file_async.exs"
=== "TypeScript"
--8<-- "snippets/typescript/getting-started/extract_file_async.md"
=== "Wasm"
--8<-- "snippets/wasm/getting-started/extract_file_async.md"
=== "Zig"
--8<-- "snippets/zig/api/extract_file_async.md"
=== "CLI"
!!! note "Not Applicable"
Async extraction is an API-level feature. The CLI operates synchronously.
Use language-specific bindings (Python, TypeScript, Rust, WASM) for async operations.
## Next Steps
You've covered the core API. Go deeper:
- **[Configuration Guide](../guides/configuration.md)** — OCR backends, chunking, language detection, config files
- **[Extract from Bytes](../reference/api-python.md#extract_bytes_sync)** — Process in-memory data without writing to disk
- **[OCR Setup](../guides/ocr.md)** — Tesseract, PaddleOCR, EasyOCR backends
- **[Types Reference](../reference/types.md)** — Full metadata fields for every format
- **[Docker Deployment](../guides/docker.md)** — Run Kreuzberg in containers
- **[API Reference](../reference/api-python.md)** — Complete API documentation