kreuzberg-cli
Command-line interface for the Kreuzberg document intelligence library.
Overview
This crate provides a production-ready CLI tool for document extraction, MIME type detection, batch processing, embeddings, chunking, and cache management. It exposes the core extraction capabilities of the Kreuzberg Rust library through an easy-to-use command-line interface.
The CLI supports 90+ file formats including PDF, DOCX, PPTX, XLSX, images, HTML, and more, with optional OCR support for scanned documents.
Architecture
Binary Structure
Kreuzberg Core Library (crates/kreuzberg)
|
Kreuzberg CLI (crates/kreuzberg-cli) <- This crate
|
Command-line interface with configuration and caching
Commands
| Command | Description |
|---|---|
extract |
Extract text from a document |
batch |
Batch extract from multiple documents |
detect |
Detect MIME type of a file |
formats |
List all supported document formats |
version |
Show version information |
cache |
Cache management (stats, clear, warm, manifest) |
serve |
Start the API server (requires api feature) |
mcp |
Start the MCP server (requires mcp feature) |
api |
API utilities (schema) (requires api feature) |
embed |
Generate embeddings for text (requires embeddings feature) v4.5.2 |
chunk |
Chunk text for processing v4.5.2 |
completions |
Generate shell completions v4.5.2 |
Platform Support
The CLI is tested and officially supported on:
- Linux x86_64 (glibc and musl static)
- Linux aarch64 / ARM64 (glibc and musl static)
- macOS aarch64 (Apple Silicon)
- Windows x86_64
All platforms receive precompiled binaries through GitHub releases. Linux musl binaries are fully statically linked with zero runtime dependencies.
Installation
Install Script (Linux / macOS)
curl -fsSL https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/scripts/install.sh | bash
Homebrew
brew install kreuzberg-dev/tap/kreuzberg
Cargo
cargo install kreuzberg-cli
Docker
docker pull ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest extract /data/document.pdf
From Source
cargo install --path crates/kreuzberg-cli
Or via the workspace:
cargo build --release -p kreuzberg-cli
Platform-Specific Requirements
ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime must be installed:
# macOS
brew install onnxruntime
# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev
# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
Without ONNX Runtime, embeddings will raise MissingDependencyError with installation instructions.
OCR Support (Optional)
To enable optical character recognition for scanned documents:
- macOS:
brew install tesseract - Ubuntu/Debian:
sudo apt-get install tesseract-ocr - Windows: Download from tesseract-ocr/tesseract
Quick Start
The CLI is available for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows with consistent behavior across all platforms.
Basic Text Extraction
# Extract text from a PDF
kreuzberg extract document.pdf
# Extract with JSON output
kreuzberg extract document.pdf --format json
Extract with OCR
# Enable OCR for scanned documents
kreuzberg extract scanned.pdf --ocr true
# Force OCR even if text extraction succeeds
kreuzberg extract mixed.pdf --force-ocr true
Batch Processing
# Process multiple documents in parallel
kreuzberg batch *.pdf --format json
# Process with custom configuration
kreuzberg batch documents/*.docx --config config.toml --format json
MIME Type Detection
# Detect file type
kreuzberg detect unknown-file
# JSON output
kreuzberg detect unknown-file --format json
Generate Embeddings (with embeddings feature)
# Embed a single text
kreuzberg embed --text "hello world" --preset balanced
# Embed multiple texts
kreuzberg embed --text "first" --text "second" --format json
# Read from stdin
echo "some text" | kreuzberg embed --preset fast
Chunk Text
# Chunk text from flag
kreuzberg chunk --text "Long document content..." --chunk-size 512
# Chunk from stdin with overlap
echo "Long document content..." | kreuzberg chunk --chunk-size 512 --chunk-overlap 64
# Use markdown-aware chunking
kreuzberg chunk --text "# Heading\nContent..." --chunker-type markdown
Cache Management
# View cache statistics
kreuzberg cache stats
# Clear the cache
kreuzberg cache clear --cache-dir /path/to/cache
# Pre-download all models
kreuzberg cache warm
# Pre-download models including all embedding presets
kreuzberg cache warm --all-embeddings
# Show model manifest (paths, checksums, sizes)
kreuzberg cache manifest
Shell Completions
# Bash
eval "$(kreuzberg completions bash)"
# Zsh
kreuzberg completions zsh > ~/.zfunc/_kreuzberg
# Fish
kreuzberg completions fish | source
API Server (with api feature)
# Start API server on localhost:8000
kreuzberg serve
# Custom host and port
kreuzberg serve --host 0.0.0.0 --port 3000
# With configuration file
kreuzberg serve --config kreuzberg.toml --host 127.0.0.1 --port 8080
API Utilities (with api feature)
# Dump the OpenAPI 3.1 schema
kreuzberg api schema
MCP Server (with mcp feature)
# Start Model Context Protocol server (stdio transport)
kreuzberg mcp
# With HTTP transport
kreuzberg mcp --transport http --host 127.0.0.1 --port 8001
# With configuration file
kreuzberg mcp --config kreuzberg.toml
Global Flags
| Flag | Description |
|---|---|
--log-level <LEVEL> |
Set log level (trace, debug, info, warn, error). Overrides RUST_LOG env var. v4.5.2 |
Configuration
The CLI supports configuration files in TOML, YAML, or JSON formats. Configuration can be:
- Explicit: Passed via
--config /path/to/config.{toml,yaml,json} - Auto-discovered: Searches for
kreuzberg.{toml,yaml,json}in current and parent directories - Inline JSON: Passed via
--config-json '{"ocr":{"backend":"tesseract"}}' - Base64 JSON: Passed via
--config-json-base64 <BASE64>(useful when shell quoting is tricky) - Default: Uses built-in defaults if no config found
Configuration precedence (highest to lowest):
- Individual CLI flags (
--ocr,--chunk-size, etc.) - Inline JSON config (
--config-jsonor--config-json-base64) - Config file (
--config path.toml) - Default values
Example Configuration (TOML)
# Basic extraction settings
use_cache = true
enable_quality_processing = true
force_ocr = false
# OCR configuration
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
enable_table_detection = true
psm = 6
min_confidence = 50.0
# Text chunking (useful for LLM processing)
[chunking]
max_chars = 1000
max_overlap = 200
# PDF-specific options
[pdf_options]
extract_images = true
extract_metadata = true
passwords = []
# Language detection
[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false
# Image extraction
[images]
extract_images = true
target_dpi = 300
max_image_dimension = 4096
auto_adjust_dpi = true
Configuration Overrides
Command-line flags override configuration file settings:
# Override OCR setting from config
kreuzberg extract document.pdf --config config.toml --ocr false
# Override chunking settings
kreuzberg extract long.pdf --chunk true --chunk-size 2000 --chunk-overlap 400
# Disable cache despite config file
kreuzberg extract document.pdf --no-cache true
# Enable language detection
kreuzberg extract multilingual.pdf --detect-language true
Command Reference
extract
Extract text, tables, and metadata from a document.
kreuzberg extract <PATH> [OPTIONS]
Options:
--config <PATH>: Configuration file (TOML, YAML, or JSON)--config-json <JSON>: Inline JSON configuration--config-json-base64 <BASE64>: Base64-encoded JSON configuration--mime-type <TYPE>: MIME type hint (auto-detected if not provided)--format <FORMAT>: Output format (textorjson), default:text
Extraction override flags v4.5.2 (also available on batch):
| Flag | Description |
|---|---|
--ocr <true|false> |
Enable/disable OCR |
--ocr-backend <BACKEND> |
OCR backend: tesseract, paddle-ocr, easyocr |
--ocr-language <LANG> |
OCR language code (e.g. eng, fra, ch) |
--ocr-auto-rotate <true|false> |
Auto-rotate images before OCR |
--force-ocr <true|false> |
Force OCR even if text extraction succeeds |
--no-cache <true|false> |
Disable result caching |
--chunk <true|false> |
Enable text chunking |
--chunk-size <SIZE> |
Maximum chunk size in characters (default: 1000) |
--chunk-overlap <SIZE> |
Overlap between chunks in characters (default: 200) |
--chunking-tokenizer <MODEL> |
Tokenizer model for token-based sizing (e.g. Xenova/gpt-4o) |
--content-format <FORMAT> |
Content format: plain, markdown, djot, html |
--include-structure <true|false> |
Include hierarchical document structure |
--quality <true|false> |
Enable quality post-processing |
--detect-language <true|false> |
Enable language detection |
--layout |
Enable layout detection (RT-DETR v2) (enables with defaults, use --layout false to disable) |
--layout-confidence <FLOAT> |
Layout confidence threshold (0.0 - 1.0) |
--acceleration <PROVIDER> |
ONNX execution provider: auto, cpu, coreml, cuda, tensorrt |
--max-concurrent <N> |
Max parallel extractions in batch mode |
--max-threads <N> |
Cap all internal thread pools |
--extract-pages <true|false> |
Extract pages as separate array |
--page-markers <true|false> |
Insert page marker comments |
--extract-images <true|false> |
Enable image extraction |
--target-dpi <DPI> |
Target DPI for images (36 - 2400) |
--pdf-password <PASS> |
Password for encrypted PDFs (repeatable) |
--pdf-extract-images <true|false> |
Extract images from PDF pages |
--pdf-extract-metadata <true|false> |
Extract PDF metadata |
--token-reduction <LEVEL> |
Token reduction: off, light, moderate, aggressive, maximum |
--layout-table-model <MODEL> |
Table structure model: tatr, slanet_wired, slanet_wireless, slanet_plus, slanet_auto, disabled |
--disable-ocr <true|false> |
Disable OCR entirely (even for images) |
--cache-namespace <NAMESPACE> |
Cache namespace for tenant isolation |
--cache-ttl-secs <SECONDS> |
Per-request cache TTL in seconds (0 = skip cache) |
--msg-codepage <CODE> |
Windows codepage fallback for MSG files |
Examples:
# Simple extraction
kreuzberg extract invoice.pdf
# With configuration and JSON output
kreuzberg extract document.pdf --config config.toml --format json
# With chunking for LLM processing
kreuzberg extract report.pdf --chunk true --chunk-size 2000
# With OCR for scanned document
kreuzberg extract scanned.pdf --ocr true --format json
# Markdown output with page markers
kreuzberg extract report.pdf --content-format markdown --page-markers true
# Layout-aware extraction with GPU acceleration
kreuzberg extract document.pdf --layout --content-format markdown --acceleration coreml
# GPU-accelerated extraction
kreuzberg extract scanned.pdf --ocr true --acceleration coreml
batch
Process multiple documents in parallel.
kreuzberg batch <PATHS>... [OPTIONS]
Options:
--config <PATH>: Configuration file (TOML, YAML, or JSON)--config-json <JSON>: Inline JSON configuration--config-json-base64 <BASE64>: Base64-encoded JSON configuration--format <FORMAT>: Output format (textorjson), default:json--file-configs <PATH>: JSON file mapping per-file extraction overrides- All extraction override flags (see
extractabove)
Examples:
# Batch process multiple files
kreuzberg batch doc1.pdf doc2.docx doc3.xlsx
# With glob patterns
kreuzberg batch *.pdf *.docx
# With custom configuration
kreuzberg batch documents/* --config batch-config.toml --format json
# With OCR and concurrency limit
kreuzberg batch scanned/*.pdf --ocr true --max-concurrent 4 --format json
# Per-file overrides
kreuzberg batch doc1.pdf doc2.pdf --file-configs overrides.json
detect
Identify the MIME type of a file.
kreuzberg detect <PATH> [OPTIONS]
Options:
--format <FORMAT>: Output format (textorjson), default:text
Examples:
# Simple detection
kreuzberg detect unknown-file
# JSON output
kreuzberg detect mystery.bin --format json
formats
List all supported document formats.
kreuzberg formats [OPTIONS]
Options:
--format <FORMAT>: Output format (textorjson), default:text
Examples:
# List formats as table
kreuzberg formats
# JSON output for tooling
kreuzberg formats --format json
cache
Manage extraction result cache and model downloads.
kreuzberg cache <COMMAND> [OPTIONS]
Subcommands:
stats
Show cache statistics.
kreuzberg cache stats [--cache-dir <DIR>] [--format <FORMAT>]
Options:
--cache-dir <DIR>: Cache directory (default:.kreuzbergin current directory)--format <FORMAT>: Output format (textorjson), default:text
clear
Clear the cache.
kreuzberg cache clear [--cache-dir <DIR>] [--format <FORMAT>]
Options:
--cache-dir <DIR>: Cache directory (default:.kreuzbergin current directory)--format <FORMAT>: Output format (textorjson), default:text
warm
Pre-download all models (OCR, layout detection, and optionally embeddings).
kreuzberg cache warm [--cache-dir <DIR>] [--format <FORMAT>] [--all-embeddings] [--embedding-model <PRESET>]
Options:
--cache-dir <DIR>: Cache directory (default:.kreuzbergorKREUZBERG_CACHE_DIR)--format <FORMAT>: Output format (textorjson), default:text--all-embeddings: Download all 4 embedding model presets (fast, balanced, quality, multilingual)--embedding-model <PRESET>: Download a specific embedding model preset
manifest
Output model manifest (expected model files, checksums, sizes).
kreuzberg cache manifest [--format <FORMAT>]
Options:
--format <FORMAT>: Output format (textorjson), default:json
Examples:
# View cache statistics
kreuzberg cache stats
# Clear cache with custom directory
kreuzberg cache clear --cache-dir ~/.kreuzberg-cache
# Pre-download all models for offline/container use
kreuzberg cache warm
# Also download embedding models
kreuzberg cache warm --all-embeddings
# Download only the fast embedding model
kreuzberg cache warm --embedding-model fast
# Get model manifest as JSON
kreuzberg cache manifest
serve (requires api feature)
Start the REST API server.
kreuzberg serve [OPTIONS]
Options:
-H, --host <HOST>: Host to bind to (default:127.0.0.1)--port <PORT>: Port to bind to (default:8000)--config <PATH>: Configuration file (TOML, YAML, or JSON)
Examples:
# Default: localhost:8000
kreuzberg serve
# Public access on port 3000
kreuzberg serve --host 0.0.0.0 --port 3000
# With custom configuration
kreuzberg serve --config server-config.toml --port 8080
mcp (requires mcp feature)
Start the Model Context Protocol server.
kreuzberg mcp [OPTIONS]
Options:
--config <PATH>: Configuration file (TOML, YAML, or JSON)--transport <MODE>: Transport mode:stdio(default) orhttp--host <HOST>: HTTP host (only for--transport http, default:127.0.0.1)--port <PORT>: HTTP port (only for--transport http, default:8001)
Examples:
# Start MCP server (stdio, for editor integration)
kreuzberg mcp
# HTTP transport for remote access
kreuzberg mcp --transport http --host 0.0.0.0 --port 8001
# With custom configuration
kreuzberg mcp --config mcp-config.toml
api (requires api feature)
API utility commands.
schema v4.5.2
Output the full OpenAPI 3.1 specification as JSON.
kreuzberg api schema
Examples:
# Dump OpenAPI spec to file
kreuzberg api schema > openapi.json
# Pipe to jq for inspection
kreuzberg api schema | jq '.paths | keys'
embed v4.5.2 (requires embeddings feature)
Generate vector embeddings for text.
kreuzberg embed [OPTIONS]
Options:
--text <TEXT>: Text to embed (repeatable for batch embedding; reads from stdin if omitted)--preset <PRESET>: Embedding preset (fast,balanced,quality,multilingual), default:balanced--format <FORMAT>: Output format (textorjson), default:json
Examples:
# Embed a single text
kreuzberg embed --text "hello world"
# Batch embed multiple texts
kreuzberg embed --text "first" --text "second"
# Use a specific preset
kreuzberg embed --text "bonjour" --preset multilingual
# Read from stdin
cat document.txt | kreuzberg embed --preset fast
chunk v4.5.2
Chunk text for processing (useful for LLM context windows).
kreuzberg chunk [OPTIONS]
Options:
--text <TEXT>: Text to chunk (reads from stdin if omitted)--config <PATH>: Configuration file (TOML, YAML, or JSON)--chunk-size <SIZE>: Chunk size in characters--chunk-overlap <SIZE>: Chunk overlap in characters--chunker-type <TYPE>: Chunker type:text(default),markdown,yaml, orsemantic--chunking-tokenizer <MODEL>: Tokenizer model for token-based sizing (e.g.Xenova/gpt-4o)--format <FORMAT>: Output format (textorjson), default:json
Examples:
# Chunk text with defaults
kreuzberg chunk --text "Long document content here..."
# Custom chunk size and overlap
kreuzberg chunk --text "..." --chunk-size 512 --chunk-overlap 64
# Markdown-aware chunking
kreuzberg chunk --text "# Title\nContent..." --chunker-type markdown
# Token-based chunking with specific tokenizer
kreuzberg chunk --text "..." --chunking-tokenizer "Xenova/gpt-4o"
# Read from stdin
cat long-document.txt | kreuzberg chunk --chunk-size 1000
completions v4.5.2
Generate shell completion scripts.
kreuzberg completions <SHELL>
Supported shells: bash, zsh, fish, elvish, powershell
Examples:
# Bash (add to .bashrc)
eval "$(kreuzberg completions bash)"
# Zsh (add to fpath)
kreuzberg completions zsh > ~/.zfunc/_kreuzberg
# Fish
kreuzberg completions fish | source
# PowerShell
kreuzberg completions powershell | Out-String | Invoke-Expression
version
Show version information.
kreuzberg version [--format <FORMAT>]
Options:
--format <FORMAT>: Output format (textorjson), default:text
Examples:
# Display version
kreuzberg version
# JSON output
kreuzberg version --format json
Output Formats
Text Format
The default human-readable format:
kreuzberg extract document.pdf
# Output:
# Document content here...
JSON Format
For programmatic integration:
kreuzberg extract document.pdf --format json
# Output:
# {
# "content": "Document content...",
# "mime_type": "application/pdf",
# "metadata": { "title": "...", "author": "..." },
# "tables": [{ "markdown": "...", "cells": [...], "page_number": 0 }]
# }
Supported File Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML |
| Text | TXT, MD, CSV, TSV, JSON, YAML, TOML |
| EML, MSG | |
| Archives | ZIP, TAR, 7Z |
| Other | 30+ additional formats |
Exit Codes
0: Successful executionNon-zero: Error occurred (check stderr for details)
Logging
Control logging verbosity with the --log-level flag or RUST_LOG environment variable:
# Using --log-level flag (overrides RUST_LOG)
kreuzberg --log-level debug extract document.pdf
kreuzberg --log-level warn batch *.pdf
# Using RUST_LOG environment variable
RUST_LOG=info kreuzberg extract document.pdf
RUST_LOG=debug kreuzberg extract document.pdf
RUST_LOG=warn kreuzberg extract document.pdf
# Show logs from specific modules
RUST_LOG=kreuzberg=debug kreuzberg extract document.pdf
Performance Tips
-
Use batch processing for multiple files instead of sequential extraction:
kreuzberg batch *.pdf # Parallel processing -
Enable caching to avoid reprocessing the same documents:
# Cache is enabled by default kreuzberg extract document.pdf -
Use appropriate chunk sizes for LLM processing:
kreuzberg extract long.pdf --chunk true --chunk-size 2000 -
Tune OCR settings for better performance:
kreuzberg extract scanned.pdf --ocr true # Adjust tesseract_config in configuration file for optimization -
Monitor cache size and clear when needed:
kreuzberg cache stats kreuzberg cache clear -
Pre-warm models for containerized deployments:
kreuzberg cache warm --all-embeddings -
Use hardware acceleration when available:
kreuzberg extract scanned.pdf --ocr true --acceleration coreml # macOS kreuzberg extract scanned.pdf --ocr true --acceleration cuda # NVIDIA GPU
Features
Default Features
None by default. The binary includes core extraction.
Optional Features
api: Enable the REST API server (kreuzberg servecommand) and API utilities (kreuzberg api schema)mcp: Enable Model Context Protocol server (kreuzberg mcpcommand)embeddings: Enable embedding generation (kreuzberg embedcommand)layout-detection: Enable layout detection flags (--layout,--layout-confidence)chunking-tokenizers: Enable token-based chunking (--chunking-tokenizer)all: Enable all features (api+mcp)
Building with Features
# Build with all features
cargo build --release -p kreuzberg-cli --features all
# Build with specific features
cargo build --release -p kreuzberg-cli --features api,mcp,embeddings
Troubleshooting
File Not Found Error
Ensure the file path is correct and the file is readable:
# Check if file exists
ls -l /path/to/document.pdf
# Try with absolute path
kreuzberg extract /absolute/path/to/document.pdf
OCR Not Working
Verify Tesseract is installed:
tesseract --version
# If not found:
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/tesseract-ocr/tesseract
Configuration File Not Found
Check that the configuration file has the correct format and location:
# Use explicit path
kreuzberg extract document.pdf --config /absolute/path/to/config.toml
# Or place kreuzberg.toml in current directory
ls -l kreuzberg.toml
Out of Memory with Large Files
Use chunking to reduce memory usage:
kreuzberg extract large-document.pdf --chunk true --chunk-size 1000
Cache Directory Permissions
Ensure write access to the cache directory:
# Check permissions
ls -ld .kreuzberg
# Or use a custom directory with appropriate permissions
kreuzberg extract document.pdf --config config.toml
# In config.toml: cache_dir = "/tmp/kreuzberg-cache"
Key Files
src/main.rs: CLI implementation with command definitions and argument parsingsrc/commands/overrides.rs: Extraction override flags (OCR, chunking, layout, acceleration, etc.)Cargo.toml: Package metadata and dependencies
Building
Development Build
cargo build -p kreuzberg-cli
Release Build
cargo build --release -p kreuzberg-cli
With All Features
cargo build --release -p kreuzberg-cli --features all
Testing
# Run CLI tests
cargo test -p kreuzberg-cli
# With logging
RUST_LOG=debug cargo test -p kreuzberg-cli -- --nocapture
Performance Characteristics
- Single file extraction: Typically 10-100ms depending on file size and format
- Batch processing: Near-linear scaling with 8 concurrent extractions by default
- OCR processing: 100-500ms per page depending on image quality and language
- Caching: Sub-millisecond retrieval for cached results
References
- Kreuzberg Core:
../kreuzberg/ - Main Documentation: https://docs.kreuzberg.dev
- GitHub Repository: https://github.com/kreuzberg-dev/kreuzberg
- Configuration Guide: See example configuration sections above
Contributing
We welcome contributions! Please see the main Kreuzberg repository for contribution guidelines.
License
Elastic License 2.0 (ELv2)