fil/docs/cli/usage.md

# CLI Usage

Command-line access to all Kreuzberg extraction features.

## Installation

=== "Install Script (Linux/macOS)"

    --8<-- "snippets/cli/install_script.md"

=== "Homebrew (macOS/Linux)"

    --8<-- "snippets/cli/install_homebrew.md"

=== "Cargo (Cross-platform)"

    --8<-- "snippets/cli/install_cargo.md"

=== "Docker"

    --8<-- "snippets/cli/install_docker.md"

=== "Go (SDK)"

    --8<-- "snippets/cli/install_go_sdk.md"

!!! Info "Feature Availability"
**Homebrew Installation:**

    - ✅ Text extraction (PDF, Office, images, 90+ formats)
    - ✅ OCR with Tesseract
    - ✅ HTTP API server (`serve` command)
    - ✅ MCP protocol server (`mcp` command)
    - ✅ Chunking, quality scoring, language detection
    - ❌ **Embeddings** - Not available via CLI flags. Use config file or Docker image.

    **Docker Images:**

    - All features enabled including embeddings (ONNX Runtime included)

## Global Flags <span class="version-badge">v4.5.2</span>

### Log Level <span class="version-badge">v4.5.2</span>

`--log-level` controls log verbosity and overrides `RUST_LOG`.

```bash title="Terminal"
# Set log level to debug for troubleshooting
kreuzberg --log-level debug extract document.pdf

# Suppress all but error messages
kreuzberg --log-level error batch documents/*.pdf

# Trace-level logging for maximum detail
kreuzberg --log-level trace extract document.pdf
```

Valid levels: `trace`, `debug`, `info` (default), `warn`, `error`.

### Colored Output

Output is colored by default. Disable with `NO_COLOR`:

```bash title="Terminal"
# Disable colored output
NO_COLOR=1 kreuzberg extract document.pdf
```

## Basic Usage

### Extract from Single File

```bash title="Terminal"
# Extract text content to stdout
kreuzberg extract document.pdf

# Specify MIME type (auto-detected if not provided)
kreuzberg extract document.pdf --mime-type application/pdf
```

### Batch Extract Multiple Files

```bash title="Terminal"
# Extract from multiple files
kreuzberg batch doc1.pdf doc2.docx doc3.txt

# Batch extract all PDFs in directory
kreuzberg batch documents/*.pdf

# Batch extract recursively
kreuzberg batch documents/**/*.pdf
```

### Extract Structured Data <span class="version-badge">v4.8.0</span>

`extract-structured` pulls typed JSON out of a document via an LLM, using a JSON schema as constraint.

```bash title="Terminal"
# Extract invoice fields into JSON matching invoice_schema.json
kreuzberg extract-structured invoice.pdf \
  --schema invoice_schema.json \
  --model openai/gpt-4o \
  --strict
```

| Flag                    | Description                                                                                          |
| ----------------------- | ---------------------------------------------------------------------------------------------------- |
| `<PATH>` (positional)   | Document file path. Required.                                                                        |
| `--schema <PATH>`       | Path to a JSON schema file describing the desired output. Required.                                  |
| `--model <MODEL>`       | LLM model identifier, for example `openai/gpt-4o` or `anthropic/claude-sonnet-4-20250514`. Required. |
| `--api-key <KEY>`       | LLM provider API key. Falls back to `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and so on.                |
| `--prompt <TEMPLATE>`   | Custom Jinja2 prompt template overriding the built-in one.                                           |
| `--schema-name <NAME>`  | Schema identifier passed to the LLM. Default: `extraction`.                                          |
| `--strict`              | Enable OpenAI strict mode for exact schema matching.                                                 |
| `-c, --config <PATH>`   | Path to a TOML/YAML/JSON extraction config file applied to the document extraction step.             |
| `-f, --format <FORMAT>` | Wire format for the printed output: `json` (default), `text`, or `toon`.                             |

Only the structured output is printed; the underlying document text is not. Set `RUST_LOG=kreuzberg=debug` to inspect the prompt sent.

### Output Formats

```bash title="Terminal"
# Output as plain text (default for extract)
kreuzberg extract document.pdf --format text

# Output as JSON (default for batch)
kreuzberg batch documents/*.pdf --format json

# Extract single file as JSON
kreuzberg extract document.pdf --format json

# Output as TOON wire format (token-efficient alternative to JSON)
kreuzberg extract document.pdf --format toon
```

### Content Output Format

`--content-format` (alias: `--output-format`) sets the format of extracted text content:

```bash title="Terminal"
# Extract as plain text (default)
kreuzberg extract document.pdf --content-format plain

# Extract as Markdown
kreuzberg extract document.pdf --content-format markdown

# Extract as Djot markup
kreuzberg extract document.pdf --content-format djot

# Extract as HTML
kreuzberg extract document.pdf --content-format html

# Combine content format with wire format
kreuzberg extract document.pdf --content-format markdown --format toon
```

`--content-format` formats `result.content`; `--format` controls the wire format of the entire response (`text`, `json`, or `toon`).

## OCR Extraction

### Enable OCR

```bash title="Terminal"
# Enable OCR (overrides config file setting)
kreuzberg extract scanned.pdf --ocr true

# Disable OCR
kreuzberg extract document.pdf --ocr false
```

### Force OCR

Force OCR even for PDFs with text layer:

```bash title="Terminal"
# Force OCR to run regardless of existing text
kreuzberg extract document.pdf --force-ocr true
```

### OCR Language Selection

`--ocr-language` is backend-agnostic and overrides config-file or default settings.

| Backend   | Code format                  | Examples                                           |
| --------- | ---------------------------- | -------------------------------------------------- |
| Tesseract | ISO 639-3 (three-letter)     | `eng`, `fra`, `deu`, `spa`, `jpn`                  |
| PaddleOCR | short codes / language names | `en`, `ch`, `french`, `korean`, `thai`, `cyrillic` |
| EasyOCR   | similar to PaddleOCR         | —                                                  |

```bash title="Terminal"
# French OCR with Tesseract (default backend)
kreuzberg extract --ocr true --ocr-language fra document.pdf

# Chinese OCR with PaddleOCR
kreuzberg extract --ocr true --ocr-backend paddle-ocr --ocr-language ch document.pdf

# Thai OCR with PaddleOCR
kreuzberg extract --ocr true --ocr-backend paddle-ocr --ocr-language thai document.pdf

# German OCR with Tesseract
kreuzberg extract --ocr true --ocr-language deu document.pdf

# Override config file language with Spanish
kreuzberg extract document.pdf --config kreuzberg.toml --ocr-language spa
```

### OCR Configuration

OCR options live in the config file; CLI flags override:

```bash title="Terminal"
kreuzberg extract scanned.pdf --config kreuzberg.toml --ocr true
```

See [Configuration Files](#configuration-files) for backend, language, and Tesseract options.

## Configuration Files

### Using Config Files

Kreuzberg auto-discovers `kreuzberg.toml` by walking up from the current directory. For YAML or JSON, pass `--config` explicitly.

```bash title="Terminal"
kreuzberg extract document.pdf  # auto-discovers kreuzberg.toml
```

### Specify Config File

Load TOML, YAML (`.yaml`/`.yml`), or JSON via `--config`:

```bash title="Terminal"
kreuzberg extract document.pdf --config my-config.toml
kreuzberg extract document.pdf --config kreuzberg.yaml
kreuzberg extract document.pdf --config my-config.json
```

### Inline JSON Config

Inline JSON is merged after config file, before individual flags:

```bash title="Terminal"
# Inline JSON (applied after config file)
kreuzberg extract document.pdf --config-json '{"ocr":{"backend":"tesseract"},"chunking":{"max_chars":1000}}'

# Base64-encoded JSON (useful in shells where quoting is awkward)
kreuzberg extract document.pdf --config-json-base64 eyJvY3IiOnsiYmFja2VuZCI6InRlc3NlcmFjdCJ9fQ==
```

Both `extract` and `batch` support `--config-json` and `--config-json-base64`.

### Example Config Files

**kreuzberg.toml:**

```toml title="OCR configuration"
use_cache = true
enable_quality_processing = true

[ocr]
backend = "tesseract"
language = "eng"

[ocr.tesseract_config]
psm = 3

[chunking]
max_characters = 1000
overlap = 100
```

**kreuzberg.yaml:**

```yaml title="kreuzberg.yaml"
use_cache: true
enable_quality_processing: true

ocr:
  backend: tesseract
  language: eng
  tesseract_config:
    psm: 3

chunking:
  max_characters: 1000
  overlap: 100
```

**kreuzberg.json:**

```json title="kreuzberg.json"
{
  "use_cache": true,
  "enable_quality_processing": true,
  "ocr": {
    "backend": "tesseract",
    "language": "eng",
    "tesseract_config": {
      "psm": 3
    }
  },
  "chunking": {
    "max_characters": 1000,
    "overlap": 100
  }
}
```

## Batch Processing

Process multiple files with `batch`:

```bash title="Terminal"
# Extract all PDFs in directory
kreuzberg batch documents/*.pdf

# Extract PDFs recursively from subdirectories
kreuzberg batch documents/**/*.pdf

# Extract multiple file types
kreuzberg batch documents/**/*.{pdf,docx,txt}
```

### Batch with Output Formats

```bash title="Terminal"
# Output as JSON (default for batch command)
kreuzberg batch documents/*.pdf --format json

# Output as plain text
kreuzberg batch documents/*.pdf --format text
```

### Batch with OCR

```bash title="Terminal"
# Batch extract with OCR enabled
kreuzberg batch scanned/*.pdf --ocr true

# Batch extract with force OCR
kreuzberg batch documents/*.pdf --force-ocr true

# Batch extract with quality processing
kreuzberg batch documents/*.pdf --quality true
```

### Batch with Content Format

```bash title="Terminal"
# Batch extract with djot formatting
kreuzberg batch documents/*.pdf --output-format djot --format json

# Batch extract as Markdown
kreuzberg batch documents/*.pdf --output-format markdown --format json

# Batch extract as HTML
kreuzberg batch documents/*.pdf --output-format html --format json
```

## Advanced Features

### Language Detection

```bash title="Terminal"
# Extract with automatic language detection
kreuzberg extract document.pdf --detect-language true

# Disable language detection
kreuzberg extract document.pdf --detect-language false
```

### Content Chunking

```bash title="Terminal"
# Split content into chunks for LLM processing
kreuzberg extract document.pdf --chunk true

# Specify chunk size and overlap
kreuzberg extract document.pdf --chunk true --chunk-size 1000 --chunk-overlap 100

# Output chunked content as JSON
kreuzberg extract document.pdf --chunk true --format json
```

### Quality Processing

```bash title="Terminal"
# Apply quality processing for improved formatting
kreuzberg extract document.pdf --quality true

# Disable quality processing
kreuzberg extract document.pdf --quality false

# Batch extraction with quality processing
kreuzberg batch documents/*.pdf --quality true
```

### Caching

```bash title="Terminal"
# Extract with result caching enabled (default)
kreuzberg extract document.pdf

# Extract without caching results
kreuzberg extract document.pdf --no-cache true

# Clear all cached results
kreuzberg cache clear

# View cache statistics
kreuzberg cache stats
```

## Extraction Override Flags <span class="version-badge">v4.5.2</span>

`extract` and `batch` accept the flags below; they take precedence over config-file settings.

### OCR Flags

| Flag                              | Description                                                                                                                  |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| `--ocr <true\|false>`             | Enable or disable OCR. Defaults to tesseract backend when enabled.                                                           |
| `--ocr-backend <BACKEND>`         | OCR backend: `tesseract`, `paddle-ocr`, or `easyocr`.                                                                        |
| `--ocr-language <LANG>`           | OCR language code. Tesseract uses ISO 639-3 (`eng`, `fra`, `deu`). PaddleOCR/EasyOCR use short codes (`en`, `ch`, `korean`). |
| `--force-ocr <true\|false>`       | Force OCR even if the document has an existing text layer.                                                                   |
| `--ocr-auto-rotate <true\|false>` | Automatically rotate images before OCR based on detected orientation.                                                        |
| `--disable-ocr <true\|false>`     | Disable OCR entirely, even for images.                                                                                       |

```bash title="Terminal"
kreuzberg extract scanned.pdf --ocr true --ocr-backend paddle-ocr --ocr-language ch
kreuzberg extract document.pdf --force-ocr true --ocr-auto-rotate true
```

### Chunking Flags

| Flag                           | Description                                                                                                                                          |
| ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--chunk <true\|false>`        | Enable or disable text chunking.                                                                                                                     |
| `--chunk-size <N>`             | Maximum chunk size in characters (default: 1000).                                                                                                    |
| `--chunk-overlap <N>`          | Overlap between consecutive chunks in characters (default: 200).                                                                                     |
| `--chunking-tokenizer <MODEL>` | Tokenizer model for token-based chunk sizing (for example `Xenova/gpt-4o`). Implicitly enables chunking. Requires the `chunking-tokenizers` feature. |

```bash title="Terminal"
kreuzberg extract document.pdf --chunk true --chunk-size 512 --chunk-overlap 50
kreuzberg extract document.pdf --chunking-tokenizer "Xenova/gpt-4o"
```

### Output Flags

| Flag                                | Description                                                                                                                                    |
| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `--content-format <FORMAT>`         | Content output format: `plain`, `markdown`, `djot`, or `html`. Controls how extracted text is formatted. (Deprecated alias: `--output-format`) |
| `--include-structure <true\|false>` | Include hierarchical document structure in results.                                                                                            |

```bash title="Terminal"
kreuzberg extract document.pdf --content-format markdown --include-structure true
```

### Layout Detection Flags

| Flag                           | Description                                                                                                                                      |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| `--layout`                     | Enable layout detection with default settings (RT-DETR v2). Use `--layout false` to explicitly disable. Requires the `layout-detection` feature. |
| `--layout-confidence <FLOAT>`  | Layout detection confidence threshold (0.0 - 1.0).                                                                                               |
| `--layout-table-model <MODEL>` | Table structure model: `tatr` (default), `slanet_wired`, `slanet_wireless`, `slanet_plus`, `slanet_auto`, `disabled`.                            |

```bash title="Terminal"
kreuzberg extract document.pdf --layout --layout-confidence 0.7
```

### Acceleration Flags

| Flag                        | Description                                                                                          |
| --------------------------- | ---------------------------------------------------------------------------------------------------- |
| `--acceleration <PROVIDER>` | ONNX Runtime execution provider for model inference: `auto`, `cpu`, `coreml`, `cuda`, or `tensorrt`. |

```bash title="Terminal"
# Use CoreML on macOS for GPU acceleration
kreuzberg extract document.pdf --acceleration coreml

# Use CUDA on Linux with NVIDIA GPU
kreuzberg extract document.pdf --acceleration cuda
```

### Page Flags

| Flag                            | Description                                               |
| ------------------------------- | --------------------------------------------------------- |
| `--extract-pages <true\|false>` | Extract pages as a separate array in results.             |
| `--page-markers <true\|false>`  | Insert page marker comments into the main content string. |

```bash title="Terminal"
kreuzberg extract document.pdf --extract-pages true --page-markers true --format json
```

### Image Flags

| Flag                             | Description                                     |
| -------------------------------- | ----------------------------------------------- |
| `--extract-images <true\|false>` | Enable image extraction from documents.         |
| `--target-dpi <N>`               | Target DPI for image normalisation (36 - 2400). |

```bash title="Terminal"
kreuzberg extract document.pdf --extract-images true --target-dpi 300
```

### PDF Flags

| Flag                                   | Description                                                                          |
| -------------------------------------- | ------------------------------------------------------------------------------------ |
| `--pdf-password <PASSWORD>`            | Password for encrypted PDFs. Can be specified multiple times for multiple passwords. |
| `--pdf-extract-images <true\|false>`   | Extract images embedded in PDF pages.                                                |
| `--pdf-extract-metadata <true\|false>` | Extract PDF metadata (title, author, etc.).                                          |

```bash title="Terminal"
kreuzberg extract encrypted.pdf --pdf-password "secret"
kreuzberg extract document.pdf --pdf-extract-images true --pdf-extract-metadata true
```

### Token Reduction Flags

| Flag                        | Description                                                                                                                 |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| `--token-reduction <LEVEL>` | Token reduction intensity: `off`, `light`, `moderate`, `aggressive`, or `maximum`. Reduces token count for LLM consumption. |

```bash title="Terminal"
# Aggressive token reduction for cheaper LLM processing
kreuzberg extract document.pdf --token-reduction aggressive

# Maximum compression (lossy)
kreuzberg extract document.pdf --token-reduction maximum
```

### Quality and Detection Flags

| Flag                              | Description                                             |
| --------------------------------- | ------------------------------------------------------- |
| `--quality <true\|false>`         | Enable quality post-processing for improved formatting. |
| `--detect-language <true\|false>` | Enable automatic language detection on extracted text.  |

### Cache Flags

| Flag                            | Description                                        |
| ------------------------------- | -------------------------------------------------- |
| `--no-cache <true\|false>`      | Disable extraction result caching.                 |
| `--cache-namespace <NAMESPACE>` | Cache namespace for tenant isolation.              |
| `--cache-ttl-secs <SECONDS>`    | Per-request cache TTL in seconds (0 = skip cache). |

### Concurrency Flags

| Flag                   | Description                                                                                                 |
| ---------------------- | ----------------------------------------------------------------------------------------------------------- |
| `--max-concurrent <N>` | Limit parallel extractions in batch mode.                                                                   |
| `--max-threads <N>`    | Cap all internal thread pools (Rayon, ONNX intra-op, batch semaphore). Useful for constrained environments. |

```bash title="Terminal"
kreuzberg batch documents/*.pdf --max-concurrent 4 --max-threads 8
```

### Email Flags

| Flag                 | Description                                                                                                                                 |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| `--msg-codepage <N>` | Windows codepage fallback for MSG files without codepage metadata. Common values: 1250 (Central European), 1251 (Cyrillic), 1252 (Western). |

```bash title="Terminal"
kreuzberg extract message.msg --msg-codepage 1251
```

## Output Options

### Standard Output (Text Format)

```bash title="Terminal"
# Extract and print content to stdout
kreuzberg extract document.pdf

# Extract and redirect output to file
kreuzberg extract document.pdf > output.txt

# Batch extract as text
kreuzberg batch documents/*.pdf --format text
```

### JSON Output

```bash title="Terminal"
# Output as JSON
kreuzberg extract document.pdf --format json

# Batch extract as JSON (default format)
kreuzberg batch documents/*.pdf --format json
```

**JSON Output Structure:**

```json title="JSON Response"
{
  "content": "Extracted text content...",
  "metadata": {
    "mime_type": "application/pdf"
  }
}
```

## Error Handling

The CLI returns non-zero exit codes on error. Use shell idioms:

```bash title="Terminal"
# Check for extraction errors
kreuzberg extract document.pdf || echo "Extraction failed"

# Continue processing even if one file fails (bash)
for file in documents/*.pdf; do
  kreuzberg batch "$file" || continue
done
```

## Examples

### Extract Single PDF

```bash title="Extract text from PDF"
kreuzberg extract document.pdf
```

### Batch Extract All PDFs in Directory

```bash title="Extract all PDFs from directory as JSON"
kreuzberg batch documents/*.pdf --format json
```

### OCR Scanned Documents

```bash title="OCR extraction from scanned documents"
kreuzberg batch scans/*.pdf --ocr true --format json
```

### Extract with Quality Processing

```bash title="Extract with quality processing enabled"
kreuzberg extract document.pdf --quality true --format json
```

### Extract with Chunking

```bash title="Extract with chunking for LLM processing"
kreuzberg extract document.pdf --config kreuzberg.toml --chunk true --chunk-size 1000 --chunk-overlap 100 --format json
```

### Batch Extract Multiple File Types

```bash title="Extract multiple file types in batch"
kreuzberg batch documents/**/*.{pdf,docx,txt} --format json
```

### Extract with Config File

```bash title="Extract using configuration file"
kreuzberg extract document.pdf --config /path/to/kreuzberg.toml
```

### Detect MIME Type

```bash title="Detect file MIME type"
kreuzberg detect document.pdf
```

## Docker Usage

Use `ghcr.io/kreuzberg-dev/kreuzberg-cli:latest` for the CLI image, or `ghcr.io/kreuzberg-dev/kreuzberg:latest` for the full image (also includes the CLI).

### Basic Docker

```bash title="Terminal"
# Extract document using Docker with mounted directory
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
  extract /data/document.pdf

# Extract and save output to host directory using shell redirection
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
  extract /data/document.pdf > output.txt
```

### Docker with OCR

```bash title="Terminal"
# Extract with OCR using Docker
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest \
  extract /data/scanned.pdf --ocr true
```

### Docker Compose

**docker-compose.yaml:**

```yaml title="docker-compose.yaml"
version: "3.8"

services:
  kreuzberg:
    image: ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
    volumes:
      - ./documents:/input
    command: extract /input/document.pdf --ocr true
```

Run:

```bash title="Terminal"
docker-compose up
```

## Performance Tips

### Optimize Extraction Speed

```bash title="Terminal"
# Extract without quality processing for faster speed
kreuzberg extract large.pdf --quality false

# Use batch for processing multiple files
kreuzberg batch large_files/*.pdf --format json
```

### Manage Memory Usage

```bash title="Terminal"
# Disable caching to reduce memory footprint
kreuzberg extract large_file.pdf --no-cache true

# Compress output to save disk space
kreuzberg extract document.pdf | gzip > output.txt.gz
```

## Troubleshooting

### Check Installation

```bash title="Terminal"
# Display installed version
kreuzberg --version

# Display help for commands
kreuzberg --help
```

### Common Issues

**Issue: "Tesseract not found"**

When using OCR, Tesseract must be installed:

```bash title="Terminal"
# Install Tesseract OCR engine on macOS
brew install tesseract

# Install Tesseract OCR engine on Ubuntu
sudo apt-get install tesseract-ocr
```

**Issue: "File not found"**

Ensure the file path is correct and accessible:

```bash title="Terminal"
# Check if file exists and is readable
ls -la document.pdf

# Extract with absolute path
kreuzberg extract /absolute/path/to/document.pdf
```

## Server Commands

### Start API Server

`serve` starts the HTTP REST API:

```bash title="Terminal"
# Start server on default host (127.0.0.1) and port (8000)
kreuzberg serve

# Start server on specific host and port (-H / -p are short forms)
kreuzberg serve --host 0.0.0.0 --port 8000
kreuzberg serve -H 0.0.0.0 -p 8000

# Start server with custom configuration file
kreuzberg serve --config kreuzberg.toml --host 0.0.0.0 --port 8000
```

### Server Endpoints

The server provides the following endpoints:

- `POST /extract` - Extract text from uploaded files
- `POST /batch` - Batch extract from multiple files
- `GET /detect` - Detect MIME type of file
- `GET /health` - Health check
- `GET /info` - Server information
- `GET /cache/stats` - Cache statistics
- `POST /cache/clear` - Clear cache

See [API Server Guide](../guides/api-server.md) for full API details.

### Start MCP Server

`mcp` starts a Model Context Protocol server for AI agents:

```bash title="Terminal"
# Start MCP server with stdio transport (default for Claude Desktop)
kreuzberg mcp

# Start MCP server with HTTP transport
kreuzberg mcp --transport http

# Start MCP server on specific HTTP host and port
kreuzberg mcp --transport http --host 0.0.0.0 --port 8001

# Start MCP server with custom configuration file
kreuzberg mcp --config kreuzberg.toml --transport stdio
```

The MCP server provides tools for AI agents:

- `extract_file` - Extract text from a file path
- `extract_bytes` - Extract text from base64-encoded bytes
- `batch_extract` - Extract from multiple files

See [API Server Guide](../guides/api-server.md) for MCP integration details.

## Embeddings <span class="version-badge">v4.5.2</span>

Generate vector embeddings using pre-trained models. Input via `--text` or stdin.

```bash title="Terminal"
# Generate embeddings for a single text
kreuzberg embed --text "hello world" --preset balanced

# Generate embeddings with a specific preset
kreuzberg embed --text "document content" --preset fast

# Batch embed multiple texts
kreuzberg embed --text "first document" --text "second document" --preset quality

# Read from stdin
echo "hello world" | kreuzberg embed --preset balanced

# Output as text instead of JSON
kreuzberg embed --text "hello" --preset balanced --format text
```

Available presets: `fast`, `balanced` (default), `quality`, `multilingual`.

!!! Info "Feature Availability" The `embed` command requires the `embeddings` feature. It is available in Docker images but not in Homebrew installations.

## Chunking Command <span class="version-badge">v4.5.2</span>

Split text with configurable size and overlap. Input via `--text` or stdin.

```bash title="Terminal"
# Chunk text with default settings
kreuzberg chunk --text "long text content to be split into chunks..."

# Specify chunk size and overlap
kreuzberg chunk --text "long text..." --chunk-size 512 --chunk-overlap 50

# Use markdown-aware chunking
kreuzberg chunk --text "# Heading\n\nParagraph..." --chunker-type markdown

# Use a tokenizer model for token-based sizing
kreuzberg chunk --text "long text..." --chunking-tokenizer "Xenova/gpt-4o"

# Read from stdin
cat document.txt | kreuzberg chunk --chunk-size 1000

# Output as text instead of JSON
kreuzberg chunk --text "long text..." --format text

# Use a config file for chunking settings
kreuzberg chunk --text "long text..." --config kreuzberg.toml
```

## Shell Completions <span class="version-badge">v4.5.2</span>

Tab-completion scripts for bash, zsh, and fish:

```bash title="Terminal"
# Generate bash completions
kreuzberg completions bash

# Generate zsh completions
kreuzberg completions zsh

# Generate fish completions
kreuzberg completions fish

# Install bash completions
eval "$(kreuzberg completions bash)"

# Install zsh completions (add to .zshrc)
eval "$(kreuzberg completions zsh)"
```

## API Utilities <span class="version-badge">v4.5.2</span>

### Dump OpenAPI Schema <span class="version-badge">v4.5.2</span>

Output the OpenAPI 3.1 specification — useful for code generation and API client tooling.

```bash title="Terminal"
# Print OpenAPI schema as JSON
kreuzberg api schema

# Save to file
kreuzberg api schema > openapi.json
```

!!! Info "Feature Availability" The `api` subcommand requires the `api` feature.

## List Supported Formats <span class="version-badge">v4.5.2</span>

List supported formats with extensions and MIME types:

```bash title="Terminal"
# List formats as a table
kreuzberg formats

# List formats as JSON
kreuzberg formats --format json
```

## Cache Management

### View Cache Statistics

```bash title="Terminal"
# Display cache usage statistics
kreuzberg cache stats

# Display statistics for specific cache directory
kreuzberg cache stats --cache-dir /path/to/cache

# Output cache statistics as JSON
kreuzberg cache stats --format json
```

### Clear Cache

```bash title="Terminal"
# Remove all cached extraction results
kreuzberg cache clear

# Clear specific cache directory
kreuzberg cache clear --cache-dir /path/to/cache

# Clear cache and display removal details
kreuzberg cache clear --format json
```

### Warm Model Cache <span class="version-badge">v4.5.0</span>

Pre-download ML models (PaddleOCR, layout detection) for offline use — useful for containerized deployments.

Default cache directories:

- **Linux**: `~/.cache/kreuzberg/{module}` (or `$XDG_CACHE_HOME/kreuzberg/{module}`)
- **macOS**: `~/Library/Caches/kreuzberg/{module}`
- **Windows**: `%LOCALAPPDATA%/kreuzberg/{module}`

Override with `KREUZBERG_CACHE_DIR` or `--cache-dir`.

```bash title="Terminal"
# Download all OCR and layout models eagerly
kreuzberg cache warm

# Download to a specific cache directory
kreuzberg cache warm --cache-dir /path/to/cache

# Also download all 4 embedding model presets (fast, balanced, quality, multilingual)
kreuzberg cache warm --all-embeddings

# Download a specific embedding model preset
kreuzberg cache warm --embedding-model balanced

# Output download results as JSON
kreuzberg cache warm --format json
```

### Model Manifest <span class="version-badge">v4.5.0</span>

Manifest of expected model files with SHA256 checksums and sizes — for cache integrity checks or scripted pre-population.

```bash title="Terminal"
# Output manifest as JSON (default)
kreuzberg cache manifest

# Output manifest as human-readable text
kreuzberg cache manifest --format text
```

## Getting Help

### CLI Help

```bash title="Terminal"
# Display general CLI help
kreuzberg --help

# Display command-specific help
kreuzberg extract --help
kreuzberg batch --help
kreuzberg detect --help
kreuzberg formats --help
kreuzberg version --help
kreuzberg embed --help
kreuzberg chunk --help
kreuzberg completions --help
kreuzberg serve --help
kreuzberg mcp --help
kreuzberg cache --help
kreuzberg cache stats --help
kreuzberg cache clear --help
kreuzberg cache warm --help
kreuzberg cache manifest --help
kreuzberg api schema --help
```

### Version Information

```bash title="Terminal"
# Display version number
kreuzberg --version

# Show version with JSON output
kreuzberg version --format json
```

## Next Steps

- [API Server Guide](../guides/api-server.md) - API and MCP server setup
- [Advanced Features](../guides/advanced.md) - Advanced Kreuzberg features
- [Plugin Development](../guides/plugins.md) - Extend Kreuzberg functionality
- [API Reference](../reference/api-python.md) - Programmatic access