480
docs/getting-started/installation.md
Normal file
@@ -0,0 +1,480 @@
---
description: "Install Kreuzberg — pick Python, TypeScript, Rust, Go, CLI/Docker, or any of 12 supported languages."
---

# Installation

Native bindings for 17 languages plus a standalone CLI. Every package ships **prebuilt binaries** for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows — no compile step needed.

<div class="cli-hero" markdown>

## :material-console: CLI / Docker { #cli--docker }

No SDK, no code — just your terminal.

=== "Install script"

    ```bash
    curl -fsSL https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/scripts/install.sh | bash
    ```

=== "Homebrew"

    ```bash
    brew install kreuzberg-dev/tap/kreuzberg
    ```

=== "Cargo"

    ```bash
    cargo install kreuzberg-cli
    ```

=== "Docker (CLI image)"

    ```bash
    docker pull ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
    # Quote the mount source so working directories containing spaces still work
    docker run -v "$(pwd)":/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest extract /data/document.pdf
    ```

=== "Docker (full image)"

    ```bash
    docker pull ghcr.io/kreuzberg-dev/kreuzberg:latest
    ```

[CLI Usage](../cli/usage.md){ .install-btn .install-btn--ghost }
[API Server Guide](../guides/api-server.md){ .install-btn .install-btn--solid }

</div>

!!! Warning "x86_64 CPU — AVX/AVX2 instruction set required"

    The bundled ONNX Runtime binaries require **AVX/AVX2** CPU instructions. CPUs without AVX support (e.g. Intel Atom, Celeron N5105/Jasper Lake, older pre-2011 processors) will crash with an `invalid opcode` trap when using ONNX-dependent features. The affected features are **PaddleOCR**, **layout detection**, and **embeddings**. All other Kreuzberg functionality (text extraction, Tesseract OCR, chunking, metadata, etc.) works normally on any x86_64 CPU. ARM platforms (aarch64) are unaffected.

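
On Linux, the CPU flags this warning refers to can be read from `/proc/cpuinfo`. The sketch below is one way to check them before enabling PaddleOCR, layout detection, or embeddings. The helper names `cpu_flags` and `supports_avx2` are made up for this example and are not part of Kreuzberg, and the check only applies to x86_64, since aarch64 kernels list CPU features under a different key.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags listed in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        # x86 kernels report features on a line such as "flags : fpu sse avx avx2"
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def supports_avx2(cpuinfo_text: str) -> bool:
    """True when both the avx and avx2 flags are present."""
    flags = cpu_flags(cpuinfo_text)
    return "avx" in flags and "avx2" in flags


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only
    if cpuinfo.exists():
        print("AVX and AVX2 available:", supports_avx2(cpuinfo.read_text()))
    else:
        print("No /proc/cpuinfo on this system; check the CPU's spec sheet instead.")
```

If the check reports `False` on an x86_64 machine, the non-ONNX features listed in the warning (text extraction, Tesseract OCR, chunking, metadata) are the ones to rely on.
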

!!! Warning "Windows — ONNX Runtime required for Go, Elixir, and C/C++"

    Go, Elixir, and C/C++ bindings on Windows link against ONNX Runtime dynamically. You must have `onnxruntime.dll` on your `PATH` at runtime. Download it from the [ONNX Runtime releases](https://github.com/microsoft/onnxruntime/releases) (for example `onnxruntime-win-x64-1.24.1.zip`). Python, TypeScript, Java, C#, Ruby, PHP, and Wasm are unaffected.

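
Before running a Go, Elixir, or C/C++ program on Windows, it can help to confirm that `onnxruntime.dll` really is reachable through `PATH`. The following is a minimal sketch that scans the `PATH` entries by hand; `find_on_path` is a hypothetical helper written for this page, not a Kreuzberg function.

```python
import os
from pathlib import Path


def find_on_path(filename: str, path_value: str) -> list[Path]:
    """Return every directory in a PATH-style string that contains `filename`."""
    hits = []
    for entry in path_value.split(os.pathsep):
        # Skip empty entries, which appear when PATH has a stray separator
        if entry and (Path(entry) / filename).is_file():
            hits.append(Path(entry))
    return hits


if __name__ == "__main__":
    found = find_on_path("onnxruntime.dll", os.environ.get("PATH", ""))
    if found:
        print("onnxruntime.dll found in:", ", ".join(str(p) for p in found))
    else:
        print("onnxruntime.dll is not on PATH")
```

More than one hit is worth investigating: the first matching directory in `PATH` is the one the loader will normally use.
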

## Choose your language

<div class="grid cards install-cards" markdown>

- :fontawesome-brands-python:{ .lg .middle } **Python**

    ***

    ```bash
    pip install kreuzberg
    ```

    [API Reference](../reference/api-python.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }

- :fontawesome-brands-node-js:{ .lg .middle } **TypeScript (Node.js / Bun)**

    ***

    ```bash
    npm install @kreuzberg/node
    ```

    [API Reference](../reference/api-typescript.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](#typescript){ .install-btn .install-btn--solid .install-btn--sm }

- :fontawesome-brands-js:{ .lg .middle } **TypeScript (Browser / Edge)**

    ***

    ```bash
    npm install @kreuzberg/wasm
    ```

    [API Reference](../reference/api-wasm.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](#typescript){ .install-btn .install-btn--solid .install-btn--sm }

- :fontawesome-brands-rust:{ .lg .middle } **Rust**

    ***

    ```bash
    cargo add kreuzberg
    ```

    [API Reference](../reference/api-rust.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }

- :fontawesome-brands-golang:{ .lg .middle } **Go**

    ***

    ```bash
    go get github.com/kreuzberg-dev/kreuzberg/v5@latest
    ```

    [API Reference](../reference/api-go.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }

- :fontawesome-brands-java:{ .lg .middle } **Java**

    ***

    ```gradle
    implementation 'dev.kreuzberg:kreuzberg:5.0.0-rc.1'
    ```

    [API Reference](../reference/api-java.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](#java){ .install-btn .install-btn--solid .install-btn--sm }

- :fontawesome-brands-kotlin:{ .lg .middle } **Kotlin**

    ***

    ```kotlin
    implementation("dev.kreuzberg:kreuzberg-kotlin:5.0.0-rc.1")
    ```

    [API Reference](../reference/api-kotlin.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](#kotlin){ .install-btn .install-btn--solid .install-btn--sm }

- :material-language-ruby:{ .lg .middle } **Ruby**

    ***

    ```bash
    gem install kreuzberg
    ```

    [API Reference](../reference/api-ruby.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }

- :fontawesome-brands-swift:{ .lg .middle } **Swift**

    ***

    ```swift
    .package(url: "https://github.com/kreuzberg-dev/kreuzberg.git", from: "5.0.0-rc.1")
    ```

    [API Reference](../reference/api-swift.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](#swift){ .install-btn .install-btn--solid .install-btn--sm }

- :material-language-csharp:{ .lg .middle } **C# / .NET**

    ***

    ```bash
    dotnet add package Kreuzberg
    ```

    [API Reference](../reference/api-csharp.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](../reference/api-csharp.md){ .install-btn .install-btn--solid .install-btn--sm }

- :fontawesome-brands-php:{ .lg .middle } **PHP**

    ***

    ```bash
    composer require kreuzberg/kreuzberg
    ```

    [API Reference](../reference/api-php.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }

- :simple-elixir:{ .lg .middle } **Elixir**

    ***

    ```elixir
    {:kreuzberg, "~> 5.0.0-rc.1"}
    ```

    [API Reference](../reference/api-elixir.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](#elixir){ .install-btn .install-btn--solid .install-btn--sm }

- :simple-r:{ .lg .middle } **R**

    ***

    ```r
    install.packages("kreuzberg",
      repos = "https://kreuzberg-dev.r-universe.dev")
    ```

    [API Reference](../reference/api-r.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](quickstart.md){ .install-btn .install-btn--solid .install-btn--sm }

- :simple-cplusplus:{ .lg .middle } **C / C++**

    ***

    ```bash
    cargo build -p kreuzberg-ffi
    ```

    [API Reference](../reference/api-c.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](#c-c){ .install-btn .install-btn--solid .install-btn--sm }

- :material-language-dart:{ .lg .middle } **Dart / Flutter**

    ***

    ```bash
    dart pub add kreuzberg
    ```

    [API Reference](../reference/api-dart.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](#dart){ .install-btn .install-btn--solid .install-btn--sm }

- :material-language-zig:{ .lg .middle } **Zig**

    ***

    ```bash
    zig fetch --save https://github.com/kreuzberg-dev/kreuzberg/archive/refs/tags/v5.0.0-rc.1.tar.gz
    ```

    [API Reference](../reference/api-zig.md){ .install-api-link }
    [:material-lightning-bolt: Quick Start](#zig){ .install-btn .install-btn--solid .install-btn--sm }

</div>

---

## System requirements

Only relevant if building from source or enabling OCR:

| Dependency                | When you need it                                                                       |
| ------------------------- | -------------------------------------------------------------------------------------- |
| AVX/AVX2 CPU instructions | Required for ONNX Runtime features (PaddleOCR, layout detection, embeddings) on x86_64 |
| Rust toolchain (`rustup`) | Building any native binding from source                                                |
| C/C++ compiler            | Building native bindings (Xcode command-line tools / `build-essential` / MSVC)         |
| Tesseract OCR             | Optional — `brew install tesseract` / `apt install tesseract-ocr`                      |
| PDFium                    | Auto-fetched during builds                                                             |

The Wasm package (`@kreuzberg/wasm`) has **zero** system dependencies.

### GPU Acceleration

Kreuzberg bundles a CPU-only ONNX Runtime — ML features (PaddleOCR, layout detection, embeddings) work out of the box on CPU.

For GPU acceleration, install a GPU-enabled ONNX Runtime and set `ORT_DYLIB_PATH`:

| Platform        | Install                                                                                  | Set ORT_DYLIB_PATH                                 |
| --------------- | ---------------------------------------------------------------------------------------- | -------------------------------------------------- |
| Linux (CUDA)    | Download from [ONNX Runtime releases](https://github.com/microsoft/onnxruntime/releases) | `export ORT_DYLIB_PATH=/path/to/libonnxruntime.so` |
| Python (any OS) | `pip install onnxruntime-gpu`                                                            | Point at the pip package's `capi/` directory       |
| macOS (CoreML)  | Works with bundled ORT — no extra setup needed                                           | —                                                  |

See [AccelerationConfig](../reference/configuration.md#accelerationconfig) and [ORT_DYLIB_PATH](../reference/environment-variables.md#ort_dylib_path) for details.

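
For the "Python (any OS)" row, the `capi/` directory can be located without importing the package. The sketch below assumes `onnxruntime-gpu` installs under the module name `onnxruntime` and that the shared library follows the usual `libonnxruntime.so*` / `onnxruntime.dll` naming; `find_ort_libraries` is a hypothetical helper, not a Kreuzberg API. It prints both the directory and any library files it finds, so either can be used, depending on what the `ORT_DYLIB_PATH` reference asks for.

```python
import importlib.util
from pathlib import Path


def find_ort_libraries(package_dir: Path) -> list[Path]:
    """List ONNX Runtime shared-library candidates under <package_dir>/capi."""
    capi = package_dir / "capi"
    # Naming patterns are an assumption; provider plugins such as
    # libonnxruntime_providers_cuda.so are deliberately not matched.
    patterns = ("libonnxruntime.so*", "libonnxruntime*.dylib", "onnxruntime.dll")
    found = []
    for pattern in patterns:
        found.extend(sorted(capi.glob(pattern)))
    return found


if __name__ == "__main__":
    # find_spec locates the package on disk without importing it
    spec = importlib.util.find_spec("onnxruntime")
    if spec is None or spec.origin is None:
        print("onnxruntime is not installed in this environment")
    else:
        package_dir = Path(spec.origin).parent
        print("capi directory:", package_dir / "capi")
        for lib in find_ort_libraries(package_dir):
            print("library candidate:", lib)
```
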

---

## Language-specific notes

Edge cases and alternative install methods where they come up.

### TypeScript

Two npm packages target different runtimes:

| Package           | Best for                           | Performance    |
| ----------------- | ---------------------------------- | -------------- |
| `@kreuzberg/node` | Node.js, Bun — server-side apps    | Native (100%)  |
| `@kreuzberg/wasm` | Browsers, Deno, Cloudflare Workers | Wasm (~60-80%) |

Both work with **pnpm** (`pnpm add`) and **Yarn** (`yarn add`) as well.

!!! Note "pnpm workspaces"

    In monorepos, add this to your root `.npmrc` so platform-specific optional deps resolve correctly:

    ```ini
    auto-install-peers=true
    ```

??? Note "Wasm — Browser usage"

    ```html
    <script type="module">
      import { initWasm, extractFromFile } from "@kreuzberg/wasm";

      await initWasm();

      const input = document.getElementById("file");
      input.addEventListener("change", async (e) => {
        const result = await extractFromFile(e.target.files[0]);
        console.log(result.content);
      });
    </script>

    <input type="file" id="file" />
    ```

??? Note "Wasm — Deno"

    ```typescript
    import { initWasm, extractFile } from "npm:@kreuzberg/wasm";

    await initWasm();
    const result = await extractFile("./document.pdf");
    console.log(result.content);
    ```

??? Note "Wasm — Cloudflare Workers"

    ```typescript
    import { initWasm, extractBytes } from "@kreuzberg/wasm";

    export default {
      async fetch(request: Request): Promise<Response> {
        await initWasm();
        const bytes = new Uint8Array(await request.arrayBuffer());
        const result = await extractBytes(bytes, "application/pdf");
        return Response.json({ content: result.content });
      },
    };
    ```

**Supported runtimes:** Chrome 74+, Firefox 79+, Safari 14+, Edge 79+, Node.js 22+, Deno 1.35+, Cloudflare Workers.

!!! Warning "Wasm Platform Limitations"

    The Wasm binding does not support:

    - **Layout detection** (RT-DETR model inference requires ONNX Runtime, which is unavailable in WebAssembly)
    - **Hardware acceleration config** (single-threaded Wasm, no GPU access)
    - **Concurrency config** (single-threaded environment, `maxThreads` is ignored)
    - **Email codepage config** (`EmailConfig` is not available)

    All other features (text extraction, OCR via Tesseract Wasm, chunking, embeddings, metadata, tables, language detection, image extraction) work fully in Wasm. See the [Wasm API Reference](../reference/api-wasm.md) for details.

### Java

=== "Maven"

    ```xml
    <dependency>
        <groupId>dev.kreuzberg</groupId>
        <artifactId>kreuzberg</artifactId>
        <version>5.0.0-rc.1</version>
    </dependency>
    ```

=== "Gradle"

    ```gradle
    implementation 'dev.kreuzberg:kreuzberg:5.0.0-rc.1'
    ```

Requires Java 25+ (FFM/Panama API). Native libraries are bundled in the JAR.

### Elixir

Add to `mix.exs`:

```elixir
def deps do
  [
    {:kreuzberg, "~> 5.0.0-rc.1"}
  ]
end
```

```bash
mix deps.get
```

Ships prebuilt NIF binaries via RustlerPrecompiled. Falls back to compiling from source if no prebuilt matches your platform (requires Rust).

!!! Warning "Windows"

    The Windows NIF links against ONNX Runtime dynamically. `onnxruntime.dll` must be on your `PATH` at runtime — see the note at the top of this page.

### Go

```bash
go get github.com/kreuzberg-dev/kreuzberg/v5@latest
```

!!! Warning "Windows"

    The Go binding links against ONNX Runtime dynamically on Windows. `onnxruntime.dll` must be on your `PATH` at runtime — see the note at the top of this page.

!!! Note "Windows feature limitations"

    The Go and C/C++ bindings on Windows (MinGW/GNU target) do not include **PaddleOCR**, **layout detection**, or **auto-rotate**. Tesseract OCR and all other features work normally. These limitations apply only to Windows; Linux and macOS builds include the full feature set.

### Rust

Enable features selectively in `Cargo.toml`:

```toml title="Cargo.toml"
[dependencies]
# Version matches the 5.0.0-rc.1 release used by the other bindings on this page
kreuzberg = { version = "5.0.0-rc.1", features = ["tokio-runtime"] }
# Optional features: pdf, ocr, chunking
```

### C / C++

Build the FFI library from source:

```bash
cargo build --release -p kreuzberg-ffi
```

This produces `libkreuzberg_ffi.a` and a header at `crates/kreuzberg-ffi/kreuzberg.h`. Link into your project:

```makefile
HEADER_DIR = path/to/crates/kreuzberg-ffi
LIBDIR = path/to/target/release

CFLAGS = -Wall -Wextra -I$(HEADER_DIR)
LDFLAGS = -L$(LIBDIR) -lkreuzberg_ffi -lpthread -ldl -lm

my_app: my_app.c
	$(CC) $(CFLAGS) -o $@ $< $(LDFLAGS)
```

!!! Tip "Platform-specific linker flags"

    **macOS:** add `-framework CoreFoundation -framework Security`

    **Windows:** add `-lws2_32 -luserenv -lbcrypt`

!!! Warning "Windows"

    The Windows FFI library links against ONNX Runtime dynamically. `onnxruntime.dll` must be on your `PATH` at runtime — see the note at the top of this page.

[API Reference →](../reference/api-c.md)

### Dart / Flutter { #dart }

Pure-Dart and Flutter consumers share the same package. Dart SDK 3.0 or higher is required. Flutter is supported on macOS, iOS, Android, Linux, and Windows; Flutter Web is not supported because the runtime is a native dynamic library delivered via `flutter_rust_bridge`. For Flutter projects use `flutter pub add kreuzberg` instead of `dart pub add kreuzberg`.

### Kotlin { #kotlin }

The Kotlin module sits on top of the Java facade and reuses its Foreign Function & Memory native loader, so the same bundled binaries serve both bindings. Requires JDK 25 or higher. Use the Kotlin DSL block above for `build.gradle.kts` consumers. Maven and Groovy DSL are also supported; see the README at `packages/kotlin/` for both.

### Swift { #swift }

Requires Swift Package Manager with `swift-tools-version: 6.0` or higher. Targets macOS 13+ and iOS 16+; Linux is not currently declared in `Package.swift`. Once the package ships its `binaryTarget`, no manual cargo build is needed; in the interim, building the library locally requires `cargo build -p kreuzberg-swift` against the workspace.

### Zig { #zig }

Requires Zig 0.16.0 or higher (declared via `minimum_zig_version` in `build.zig.zon`). The Zig binding consumes the C FFI surface from `kreuzberg-ffi` via `linkSystemLibrary`; the build expects the consumer to provide a search path to the prebuilt `libkreuzberg_ffi` and the C header `kreuzberg.h`. The `zig fetch` command above pins the source archive in `build.zig.zon`; wire it into `build.zig` via `b.dependency("kreuzberg", ...)`.

---

## Development setup

For working on the Kreuzberg repository itself:

```bash
task setup      # installs all language toolchains
task lint       # linters across all languages
task dev:test   # full test suite
```

See [Contributing](../contributing.md) for conventions and expectations.
586
docs/getting-started/quickstart.md
Normal file
@@ -0,0 +1,586 @@
# Quick Start

This guide walks you through Kreuzberg's core API — extracting text, handling errors,
running OCR, and working with metadata. Install your binding first if you haven't:
[Installation](installation.md).

TypeScript users: `@kreuzberg/node` for Node.js, `@kreuzberg/wasm` for browsers and edge runtimes — see [Language Support](../index.md#language-support).

## Your First Extraction

Pass a file path to get its text content. Kreuzberg detects the format automatically:

=== "C"

    --8<-- "snippets/c/api/extract_file_sync.md"

=== "C#"

    --8<-- "snippets/csharp/extract_file_sync.md"

=== "Dart"

    --8<-- "snippets/dart/api/extract_file_sync.md"

=== "Go"

    --8<-- "snippets/go/api/extract_file_sync.md"

=== "Java"

    --8<-- "snippets/java/api/extract_file_sync.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/extract_file_sync.md"

=== "Python"

    --8<-- "snippets/python/api/extract_file_sync.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/extract_file_sync.md"

=== "R"

    --8<-- "snippets/r/api/extract_file_sync.md"

=== "Rust"

    --8<-- "snippets/rust/api/extract_file_sync.md"

=== "Swift"

    --8<-- "snippets/swift/api/extract_file_sync.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/extract_file_sync.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/extract_file_sync.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/extract_file_sync.md"

=== "Zig"

    --8<-- "snippets/zig/api/extract_file_sync.md"

=== "CLI"

    --8<-- "snippets/cli/extract_basic.md"

## Handle Errors

Wrap extractions in error handling before going further. Kreuzberg raises specific
exceptions for missing files, parse failures, and OCR problems:

=== "C"

    --8<-- "snippets/c/api/error_handling.md"

=== "C#"

    --8<-- "snippets/csharp/error_handling.md"

=== "Dart"

    --8<-- "snippets/dart/api/error_handling.md"

=== "Go"

    --8<-- "snippets/go/api/error_handling.md"

=== "Java"

    --8<-- "snippets/java/api/error_handling.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/error_handling.md"

=== "Python"

    --8<-- "snippets/python/utils/error_handling.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/error_handling.md"

=== "R"

    --8<-- "snippets/r/api/error_handling.md"

=== "Rust"

    --8<-- "snippets/rust/api/error_handling.md"

=== "Swift"

    --8<-- "snippets/swift/api/error_handling.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/error_handling.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/api/error_handling.md"

=== "Wasm"

    --8<-- "snippets/wasm/api/error_handling.md"

=== "Zig"

    --8<-- "snippets/zig/api/error_handling.md"

## OCR for Scanned Documents

Kreuzberg runs OCR automatically when it detects an image or scanned PDF.
You can also force OCR on any document:

=== "C"

    --8<-- "snippets/c/ocr/ocr_extraction.md"

=== "C#"

    --8<-- "snippets/csharp/ocr_extraction.md"

=== "Dart"

    --8<-- "snippets/dart/ocr/ocr_extraction.md"

=== "Go"

    --8<-- "snippets/go/ocr/ocr_extraction.md"

=== "Java"

    --8<-- "snippets/java/ocr/ocr_extraction.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/ocr/ocr_extraction.md"

=== "Python"

    --8<-- "snippets/python/ocr/ocr_extraction.md"

=== "Ruby"

    --8<-- "snippets/ruby/ocr/ocr_extraction.md"

=== "R"

    --8<-- "snippets/r/ocr/ocr_extraction.md"

=== "Rust"

    --8<-- "snippets/rust/ocr/ocr_extraction.md"

=== "Swift"

    --8<-- "snippets/swift/ocr/ocr_extraction.md"

=== "Elixir"

    --8<-- "snippets/elixir/ocr/tesseract_basic.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/ocr/ocr_extraction.md"

=== "Wasm"

    --8<-- "snippets/wasm/ocr/ocr_extraction.md"

=== "Zig"

    --8<-- "snippets/zig/ocr/ocr_extraction.md"

=== "CLI"

    --8<-- "snippets/cli/ocr_basic.md"

## Process Multiple Files

Pass a list of paths to extract them in parallel:

=== "C"

    --8<-- "snippets/c/api/batch_extract_files_sync.md"

=== "C#"

    --8<-- "snippets/csharp/batch_extract_files_sync.md"

=== "Dart"

    --8<-- "snippets/dart/api/batch_extract_files_sync.md"

=== "Go"

    --8<-- "snippets/go/api/batch_extract_files_sync.md"

=== "Java"

    --8<-- "snippets/java/api/batch_extract_files_sync.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/batch_extract_files_sync.md"

=== "Python"

    --8<-- "snippets/python/api/batch_extract_files_sync.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/batch_extract_files_sync.md"

=== "R"

    --8<-- "snippets/r/api/batch_extract_files_sync.md"

=== "Rust"

    --8<-- "snippets/rust/api/batch_extract_files_sync.md"

=== "Swift"

    --8<-- "snippets/swift/api/batch_extract_files_sync.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/batch_extract_files_sync.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/batch_extract_files_sync.md"

=== "Zig"

    --8<-- "snippets/zig/api/batch_extract_files_sync.md"

=== "CLI"

    --8<-- "snippets/cli/batch_basic.md"

## Read Document Metadata

Every extraction result includes format-specific metadata — page count for PDFs,
sheet names for Excel, dimensions for images:

=== "C"

    --8<-- "snippets/c/metadata/metadata.md"

=== "C#"

    --8<-- "snippets/csharp/metadata.md"

=== "Dart"

    --8<-- "snippets/dart/metadata/metadata.md"

=== "Go"

    --8<-- "snippets/go/metadata/metadata.md"

=== "Java"

    --8<-- "snippets/java/metadata/metadata.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/metadata/metadata.md"

=== "Python"

    --8<-- "snippets/python/metadata/metadata.md"

=== "Ruby"

    --8<-- "snippets/ruby/metadata/metadata.md"

=== "R"

    --8<-- "snippets/r/metadata/metadata.md"

=== "Rust"

    --8<-- "snippets/rust/metadata/metadata.md"

=== "Swift"

    --8<-- "snippets/swift/metadata/metadata.md"

=== "Elixir"

    --8<-- "snippets/elixir/advanced/metadata_extraction.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/metadata/metadata.md"

=== "Wasm"

    --8<-- "snippets/wasm/metadata/metadata.md"

=== "Zig"

    --8<-- "snippets/zig/metadata/metadata.md"

=== "CLI"

    Extract and parse metadata using JSON output:

    ```bash title="Terminal"
    # Extract with metadata (JSON format includes metadata automatically)
    kreuzberg extract document.pdf --format json

    # Save to file and parse metadata
    kreuzberg extract document.pdf --format json > result.json

    # Print all metadata fields
    jq '.metadata' result.json

    # Extract HTML metadata
    kreuzberg extract page.html --format json | jq '.metadata'

    # Get specific fields
    kreuzberg extract document.pdf --format json | \
      jq '.metadata | {page_count, authors, title}'

    # Process multiple files
    kreuzberg batch documents/*.pdf --format json > all_metadata.json
    ```

    **JSON Output Structure:**

    ```json title="JSON"
    {
      "content": "Extracted text...",
      "mime_type": "application/pdf",
      "metadata": {
        "title": "Document Title",
        "authors": ["John Doe"],
        "created_by": "LaTeX with hyperref package",
        "format_type": "pdf",
        "page_count": 10
      },
      "tables": []
    }
    ```

Kreuzberg extracts format-specific metadata for:

- **PDF**: page count, title, authors (list), creation date, modification date
- **HTML**: SEO tags, Open Graph, Twitter Card, structured data, headers, links, images
- **Excel**: sheet count, sheet names
- **Email**: from, to, CC, BCC, message ID, attachments
- **PowerPoint**: title, author, description, fonts
- **Images**: dimensions, format, EXIF data
- **Archives**: format, file count, file list, sizes
- **XML**: element count, unique elements
- **Text/Markdown**: word count, line count, headers, links

See [Types Reference](../reference/types.md) for the complete metadata reference.

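
The JSON that the CLI writes can be consumed from any language with a JSON parser. As an illustration, the sketch below loads the sample payload from the "JSON Output Structure" example and builds a one-line summary. Only the field names shown in that example are used; `summarize` is a hypothetical helper, and real output may carry more fields or omit some for other formats, which is why every lookup has a fallback.

```python
import json

# Sample payload copied from the "JSON Output Structure" example.
SAMPLE = """
{
  "content": "Extracted text...",
  "mime_type": "application/pdf",
  "metadata": {
    "title": "Document Title",
    "authors": ["John Doe"],
    "created_by": "LaTeX with hyperref package",
    "format_type": "pdf",
    "page_count": 10
  },
  "tables": []
}
"""


def summarize(result: dict) -> str:
    """Build a one-line summary, tolerating missing metadata fields."""
    meta = result.get("metadata") or {}
    title = meta.get("title") or "(untitled)"
    authors = ", ".join(meta.get("authors") or []) or "(unknown author)"
    pages = meta.get("page_count", "?")
    mime = result.get("mime_type", "unknown type")
    return f"{title} by {authors}, {pages} pages [{mime}]"


summary = summarize(json.loads(SAMPLE))
print(summary)  # Document Title by John Doe, 10 pages [application/pdf]
```

In practice the string passed to `json.loads` would be the contents of `result.json` from the commands above.
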

## Extract Tables

Tables come back as both structured cells and Markdown. Kreuzberg extracts them
from PDFs, spreadsheets, and HTML:

=== "C"

    --8<-- "snippets/c/metadata/tables.md"

=== "C#"

    --8<-- "snippets/csharp/tables.md"

=== "Dart"

    --8<-- "snippets/dart/metadata/tables.md"

=== "Go"

    --8<-- "snippets/go/metadata/tables.md"

=== "Java"

    --8<-- "snippets/java/metadata/tables.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/metadata/tables.md"

=== "Python"

    --8<-- "snippets/python/utils/tables.md"

=== "Ruby"

    --8<-- "snippets/ruby/metadata/tables.md"

=== "R"

    --8<-- "snippets/r/metadata/tables.md"

=== "Rust"

    --8<-- "snippets/rust/metadata/tables.md"

=== "Swift"

    --8<-- "snippets/swift/metadata/tables.md"

=== "Elixir"

    --8<-- "snippets/elixir/advanced/table_extraction.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/api/tables.md"

=== "Wasm"

    --8<-- "snippets/wasm/api/tables.md"

=== "Zig"

    --8<-- "snippets/zig/metadata/tables.md"

=== "CLI"

    Extract and process tables from documents:

    ```bash title="Terminal"
    # Extract with JSON format (includes tables when detected)
    kreuzberg extract document.pdf --format json

    # Save tables to JSON
    kreuzberg extract spreadsheet.xlsx --format json > tables.json

    # Extract and parse table markdown
    kreuzberg extract document.pdf --format json | \
      jq '.tables[]? | .markdown'

    # Get table cells
    kreuzberg extract document.pdf --format json | \
      jq '.tables[]? | .cells'

    # Batch extract tables from multiple files
    # (in bash, `**` only recurses into subdirectories after `shopt -s globstar`)
    kreuzberg batch documents/**/*.pdf --format json > all_tables.json
    ```

    **JSON Table Structure:**

    ```json title="JSON"
    {
      "content": "...",
      "tables": [
        {
          "cells": [
            ["Name", "Age", "City"],
            ["Alice", "30", "New York"],
            ["Bob", "25", "Los Angeles"]
          ],
          "markdown": "| Name | Age | City |\n|------|-----|--------|\n| Alice | 30 | New York |\n| Bob | 25 | Los Angeles |"
        }
      ]
    }
    ```

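
The `cells` grid is convenient for post-processing. A minimal sketch, assuming the first row holds the column names as it does in the sample above: `table_to_records` (a hypothetical helper, not part of Kreuzberg) turns each data row into a dictionary keyed by header.

```python
def table_to_records(table: dict) -> list[dict[str, str]]:
    """Turn a table's `cells` grid into one dict per data row."""
    cells = table.get("cells") or []
    if not cells:
        return []
    header, *rows = cells  # assumption: the first row is the header
    return [dict(zip(header, row)) for row in rows]


# Same shape as the "JSON Table Structure" example; in practice this dict
# would come from json.load() on the CLI output.
result = {
    "content": "...",
    "tables": [
        {
            "cells": [
                ["Name", "Age", "City"],
                ["Alice", "30", "New York"],
                ["Bob", "25", "Los Angeles"],
            ]
        }
    ],
}

records = table_to_records(result["tables"][0])
for record in records:
    print(record["Name"], record["City"])
```

Tables without a header row would need a different mapping, so treat the header assumption as something to verify per document.
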

## Going Async

Use async extraction in web servers, background workers, or anywhere you need
non-blocking I/O:

=== "C"

    --8<-- "snippets/c/api/extract_file_async.md"

=== "C#"

    --8<-- "snippets/csharp/extract_file_async.md"

=== "Dart"

    --8<-- "snippets/dart/api/extract_file_async.md"

=== "Go"

    --8<-- "snippets/go/api/extract_file_async.md"

=== "Java"

    --8<-- "snippets/java/api/extract_file_async.md"

=== "Kotlin"

    --8<-- "snippets/kotlin/api/extract_file_async.md"

=== "Python"

    --8<-- "snippets/python/api/extract_file_async.md"

=== "Ruby"

    --8<-- "snippets/ruby/api/extract_file_async.md"

=== "R"

    --8<-- "snippets/r/api/extract_file_async.md"

=== "Rust"

    --8<-- "snippets/rust/api/extract_file_async.md"

=== "Swift"

    --8<-- "snippets/swift/api/extract_file_async.md"

=== "Elixir"

    --8<-- "snippets/elixir/core/extract_file_async.exs"

=== "TypeScript"

    --8<-- "snippets/typescript/getting-started/extract_file_async.md"

=== "Wasm"

    --8<-- "snippets/wasm/getting-started/extract_file_async.md"

=== "Zig"

    --8<-- "snippets/zig/api/extract_file_async.md"

=== "CLI"

    !!! note "Not Applicable"
        Async extraction is an API-level feature. The CLI operates synchronously.
        Use language-specific bindings (Python, TypeScript, Rust, WASM) for async operations.

## Next Steps

You've covered the core API. Go deeper:

- **[Configuration Guide](../guides/configuration.md)** — OCR backends, chunking, language detection, config files
- **[Extract from Bytes](../reference/api-python.md#extract_bytes_sync)** — Process in-memory data without writing to disk
- **[OCR Setup](../guides/ocr.md)** — Tesseract, PaddleOCR, EasyOCR backends
- **[Types Reference](../reference/types.md)** — Full metadata fields for every format
- **[Docker Deployment](../guides/docker.md)** — Run Kreuzberg in containers
- **[API Reference](../reference/api-python.md)** — Complete API documentation