# kreuzberg-cli [![Bindings](https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6)](https://github.com/kreuzberg-dev/alef) Command-line interface for the Kreuzberg document intelligence library. ## Overview This crate provides a production-ready CLI tool for document extraction, MIME type detection, batch processing, embeddings, chunking, and cache management. It exposes the core extraction capabilities of the Kreuzberg Rust library through an easy-to-use command-line interface. The CLI supports 90+ file formats including PDF, DOCX, PPTX, XLSX, images, HTML, and more, with optional OCR support for scanned documents. ## Architecture ### Binary Structure ```text Kreuzberg Core Library (crates/kreuzberg) | Kreuzberg CLI (crates/kreuzberg-cli) <- This crate | Command-line interface with configuration and caching ``` ### Commands | Command | Description | | ------------- | ------------------------------------------------------------------------------------------------------ | | `extract` | Extract text from a document | | `batch` | Batch extract from multiple documents | | `detect` | Detect MIME type of a file | | `formats` | List all supported document formats | | `version` | Show version information | | `cache` | Cache management (stats, clear, warm, manifest) | | `serve` | Start the API server (requires `api` feature) | | `mcp` | Start the MCP server (requires `mcp` feature) | | `api` | API utilities (schema) (requires `api` feature) | | `embed` | Generate embeddings for text (requires `embeddings` feature) v4.5.2 | | `chunk` | Chunk text for processing v4.5.2 | | `completions` | Generate shell completions v4.5.2 | ### Platform Support The CLI is tested and officially supported on: - Linux x86_64 (glibc and musl static) - Linux aarch64 / ARM64 (glibc and musl static) - macOS aarch64 (Apple Silicon) - Windows x86_64 All platforms receive precompiled binaries through GitHub releases. Linux musl binaries are fully statically linked with zero runtime dependencies. ## Installation ### Install Script (Linux / macOS) ```bash curl -fsSL https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/scripts/install.sh | bash ``` ### Homebrew ```bash brew install kreuzberg-dev/tap/kreuzberg ``` ### Cargo ```bash cargo install kreuzberg-cli ``` ### Docker ```bash docker pull ghcr.io/kreuzberg-dev/kreuzberg-cli:latest docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest extract /data/document.pdf ``` ### From Source ```bash cargo install --path crates/kreuzberg-cli ``` Or via the workspace: ```bash cargo build --release -p kreuzberg-cli ``` ### Platform-Specific Requirements #### ONNX Runtime (for embeddings) If using embeddings functionality, ONNX Runtime must be installed: ```bash # macOS brew install onnxruntime # Ubuntu/Debian sudo apt install libonnxruntime libonnxruntime-dev # Windows (MSVC) scoop install onnxruntime # OR download from https://github.com/microsoft/onnxruntime/releases ``` Without ONNX Runtime, embeddings will raise `MissingDependencyError` with installation instructions. #### OCR Support (Optional) To enable optical character recognition for scanned documents: - **macOS**: `brew install tesseract` - **Ubuntu/Debian**: `sudo apt-get install tesseract-ocr` - **Windows**: Download from [tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract) ## Quick Start > The CLI is available for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows with consistent behavior across all platforms. ### Basic Text Extraction ```bash # Extract text from a PDF kreuzberg extract document.pdf # Extract with JSON output kreuzberg extract document.pdf --format json ``` ### Extract with OCR ```bash # Enable OCR for scanned documents kreuzberg extract scanned.pdf --ocr true # Force OCR even if text extraction succeeds kreuzberg extract mixed.pdf --force-ocr true ``` ### Batch Processing ```bash # Process multiple documents in parallel kreuzberg batch *.pdf --format json # Process with custom configuration kreuzberg batch documents/*.docx --config config.toml --format json ``` ### MIME Type Detection ```bash # Detect file type kreuzberg detect unknown-file # JSON output kreuzberg detect unknown-file --format json ``` ### Generate Embeddings (with `embeddings` feature) ```bash # Embed a single text kreuzberg embed --text "hello world" --preset balanced # Embed multiple texts kreuzberg embed --text "first" --text "second" --format json # Read from stdin echo "some text" | kreuzberg embed --preset fast ``` ### Chunk Text ```bash # Chunk text from flag kreuzberg chunk --text "Long document content..." --chunk-size 512 # Chunk from stdin with overlap echo "Long document content..." | kreuzberg chunk --chunk-size 512 --chunk-overlap 64 # Use markdown-aware chunking kreuzberg chunk --text "# Heading\nContent..." --chunker-type markdown ``` ### Cache Management ```bash # View cache statistics kreuzberg cache stats # Clear the cache kreuzberg cache clear --cache-dir /path/to/cache # Pre-download all models kreuzberg cache warm # Pre-download models including all embedding presets kreuzberg cache warm --all-embeddings # Show model manifest (paths, checksums, sizes) kreuzberg cache manifest ``` ### Shell Completions ```bash # Bash eval "$(kreuzberg completions bash)" # Zsh kreuzberg completions zsh > ~/.zfunc/_kreuzberg # Fish kreuzberg completions fish | source ``` ### API Server (with `api` feature) ```bash # Start API server on localhost:8000 kreuzberg serve # Custom host and port kreuzberg serve --host 0.0.0.0 --port 3000 # With configuration file kreuzberg serve --config kreuzberg.toml --host 127.0.0.1 --port 8080 ``` ### API Utilities (with `api` feature) ```bash # Dump the OpenAPI 3.1 schema kreuzberg api schema ``` ### MCP Server (with `mcp` feature) ```bash # Start Model Context Protocol server (stdio transport) kreuzberg mcp # With HTTP transport kreuzberg mcp --transport http --host 127.0.0.1 --port 8001 # With configuration file kreuzberg mcp --config kreuzberg.toml ``` ## Global Flags | Flag | Description | | --------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | | `--log-level ` | Set log level (`trace`, `debug`, `info`, `warn`, `error`). Overrides `RUST_LOG` env var. v4.5.2 | ## Configuration The CLI supports configuration files in TOML, YAML, or JSON formats. Configuration can be: 1. **Explicit**: Passed via `--config /path/to/config.{toml,yaml,json}` 2. **Auto-discovered**: Searches for `kreuzberg.{toml,yaml,json}` in current and parent directories 3. **Inline JSON**: Passed via `--config-json '{"ocr":{"backend":"tesseract"}}'` 4. **Base64 JSON**: Passed via `--config-json-base64 ` (useful when shell quoting is tricky) 5. **Default**: Uses built-in defaults if no config found Configuration precedence (highest to lowest): 1. Individual CLI flags (`--ocr`, `--chunk-size`, etc.) 2. Inline JSON config (`--config-json` or `--config-json-base64`) 3. Config file (`--config path.toml`) 4. Default values ### Example Configuration (TOML) ```toml # Basic extraction settings use_cache = true enable_quality_processing = true force_ocr = false # OCR configuration [ocr] backend = "tesseract" language = "eng" [ocr.tesseract_config] enable_table_detection = true psm = 6 min_confidence = 50.0 # Text chunking (useful for LLM processing) [chunking] max_chars = 1000 max_overlap = 200 # PDF-specific options [pdf_options] extract_images = true extract_metadata = true passwords = [] # Language detection [language_detection] enabled = true min_confidence = 0.8 detect_multiple = false # Image extraction [images] extract_images = true target_dpi = 300 max_image_dimension = 4096 auto_adjust_dpi = true ``` ### Configuration Overrides Command-line flags override configuration file settings: ```bash # Override OCR setting from config kreuzberg extract document.pdf --config config.toml --ocr false # Override chunking settings kreuzberg extract long.pdf --chunk true --chunk-size 2000 --chunk-overlap 400 # Disable cache despite config file kreuzberg extract document.pdf --no-cache true # Enable language detection kreuzberg extract multilingual.pdf --detect-language true ``` ## Command Reference ### extract Extract text, tables, and metadata from a document. ```bash kreuzberg extract [OPTIONS] ``` **Options:** - `--config `: Configuration file (TOML, YAML, or JSON) - `--config-json `: Inline JSON configuration - `--config-json-base64 `: Base64-encoded JSON configuration - `--mime-type `: MIME type hint (auto-detected if not provided) - `--format `: Output format (`text` or `json`), default: `text` **Extraction override flags** v4.5.2 (also available on `batch`): | Flag | Description | | -------------------------------------- | ---------------------------------------------------------------------------------------------------------- | | `--ocr ` | Enable/disable OCR | | `--ocr-backend ` | OCR backend: `tesseract`, `paddle-ocr`, `easyocr` | | `--ocr-language ` | OCR language code (e.g. `eng`, `fra`, `ch`) | | `--ocr-auto-rotate ` | Auto-rotate images before OCR | | `--force-ocr ` | Force OCR even if text extraction succeeds | | `--no-cache ` | Disable result caching | | `--chunk ` | Enable text chunking | | `--chunk-size ` | Maximum chunk size in characters (default: 1000) | | `--chunk-overlap ` | Overlap between chunks in characters (default: 200) | | `--chunking-tokenizer ` | Tokenizer model for token-based sizing (e.g. `Xenova/gpt-4o`) | | `--content-format ` | Content format: `plain`, `markdown`, `djot`, `html` | | `--include-structure ` | Include hierarchical document structure | | `--quality ` | Enable quality post-processing | | `--detect-language ` | Enable language detection | | `--layout` | Enable layout detection (RT-DETR v2) (enables with defaults, use `--layout false` to disable) | | `--layout-confidence ` | Layout confidence threshold (0.0 - 1.0) | | `--acceleration ` | ONNX execution provider: `auto`, `cpu`, `coreml`, `cuda`, `tensorrt` | | `--max-concurrent ` | Max parallel extractions in batch mode | | `--max-threads ` | Cap all internal thread pools | | `--extract-pages ` | Extract pages as separate array | | `--page-markers ` | Insert page marker comments | | `--extract-images ` | Enable image extraction | | `--target-dpi ` | Target DPI for images (36 - 2400) | | `--pdf-password ` | Password for encrypted PDFs (repeatable) | | `--pdf-extract-images ` | Extract images from PDF pages | | `--pdf-extract-metadata ` | Extract PDF metadata | | `--token-reduction ` | Token reduction: `off`, `light`, `moderate`, `aggressive`, `maximum` | | `--layout-table-model ` | Table structure model: `tatr`, `slanet_wired`, `slanet_wireless`, `slanet_plus`, `slanet_auto`, `disabled` | | `--disable-ocr ` | Disable OCR entirely (even for images) | | `--cache-namespace ` | Cache namespace for tenant isolation | | `--cache-ttl-secs ` | Per-request cache TTL in seconds (0 = skip cache) | | `--msg-codepage ` | Windows codepage fallback for MSG files | **Examples:** ```bash # Simple extraction kreuzberg extract invoice.pdf # With configuration and JSON output kreuzberg extract document.pdf --config config.toml --format json # With chunking for LLM processing kreuzberg extract report.pdf --chunk true --chunk-size 2000 # With OCR for scanned document kreuzberg extract scanned.pdf --ocr true --format json # Markdown output with page markers kreuzberg extract report.pdf --content-format markdown --page-markers true # Layout-aware extraction with GPU acceleration kreuzberg extract document.pdf --layout --content-format markdown --acceleration coreml # GPU-accelerated extraction kreuzberg extract scanned.pdf --ocr true --acceleration coreml ``` ### batch Process multiple documents in parallel. ```bash kreuzberg batch ... [OPTIONS] ``` **Options:** - `--config `: Configuration file (TOML, YAML, or JSON) - `--config-json `: Inline JSON configuration - `--config-json-base64 `: Base64-encoded JSON configuration - `--format `: Output format (`text` or `json`), default: `json` - `--file-configs `: JSON file mapping per-file extraction overrides - All extraction override flags (see `extract` above) **Examples:** ```bash # Batch process multiple files kreuzberg batch doc1.pdf doc2.docx doc3.xlsx # With glob patterns kreuzberg batch *.pdf *.docx # With custom configuration kreuzberg batch documents/* --config batch-config.toml --format json # With OCR and concurrency limit kreuzberg batch scanned/*.pdf --ocr true --max-concurrent 4 --format json # Per-file overrides kreuzberg batch doc1.pdf doc2.pdf --file-configs overrides.json ``` ### detect Identify the MIME type of a file. ```bash kreuzberg detect [OPTIONS] ``` **Options:** - `--format `: Output format (`text` or `json`), default: `text` **Examples:** ```bash # Simple detection kreuzberg detect unknown-file # JSON output kreuzberg detect mystery.bin --format json ``` ### formats List all supported document formats. ```bash kreuzberg formats [OPTIONS] ``` **Options:** - `--format `: Output format (`text` or `json`), default: `text` **Examples:** ```bash # List formats as table kreuzberg formats # JSON output for tooling kreuzberg formats --format json ``` ### cache Manage extraction result cache and model downloads. ```bash kreuzberg cache [OPTIONS] ``` **Subcommands:** #### stats Show cache statistics. ```bash kreuzberg cache stats [--cache-dir ] [--format ] ``` **Options:** - `--cache-dir `: Cache directory (default: `.kreuzberg` in current directory) - `--format `: Output format (`text` or `json`), default: `text` #### clear Clear the cache. ```bash kreuzberg cache clear [--cache-dir ] [--format ] ``` **Options:** - `--cache-dir `: Cache directory (default: `.kreuzberg` in current directory) - `--format `: Output format (`text` or `json`), default: `text` #### warm Pre-download all models (OCR, layout detection, and optionally embeddings). ```bash kreuzberg cache warm [--cache-dir ] [--format ] [--all-embeddings] [--embedding-model ] ``` **Options:** - `--cache-dir `: Cache directory (default: `.kreuzberg` or `KREUZBERG_CACHE_DIR`) - `--format `: Output format (`text` or `json`), default: `text` - `--all-embeddings`: Download all 4 embedding model presets (fast, balanced, quality, multilingual) - `--embedding-model `: Download a specific embedding model preset #### manifest Output model manifest (expected model files, checksums, sizes). ```bash kreuzberg cache manifest [--format ] ``` **Options:** - `--format `: Output format (`text` or `json`), default: `json` **Examples:** ```bash # View cache statistics kreuzberg cache stats # Clear cache with custom directory kreuzberg cache clear --cache-dir ~/.kreuzberg-cache # Pre-download all models for offline/container use kreuzberg cache warm # Also download embedding models kreuzberg cache warm --all-embeddings # Download only the fast embedding model kreuzberg cache warm --embedding-model fast # Get model manifest as JSON kreuzberg cache manifest ``` ### serve (requires `api` feature) Start the REST API server. ```bash kreuzberg serve [OPTIONS] ``` **Options:** - `-H, --host `: Host to bind to (default: `127.0.0.1`) - `--port `: Port to bind to (default: `8000`) - `--config `: Configuration file (TOML, YAML, or JSON) **Examples:** ```bash # Default: localhost:8000 kreuzberg serve # Public access on port 3000 kreuzberg serve --host 0.0.0.0 --port 3000 # With custom configuration kreuzberg serve --config server-config.toml --port 8080 ``` ### mcp (requires `mcp` feature) Start the Model Context Protocol server. ```bash kreuzberg mcp [OPTIONS] ``` **Options:** - `--config `: Configuration file (TOML, YAML, or JSON) - `--transport `: Transport mode: `stdio` (default) or `http` - `--host `: HTTP host (only for `--transport http`, default: `127.0.0.1`) - `--port `: HTTP port (only for `--transport http`, default: `8001`) **Examples:** ```bash # Start MCP server (stdio, for editor integration) kreuzberg mcp # HTTP transport for remote access kreuzberg mcp --transport http --host 0.0.0.0 --port 8001 # With custom configuration kreuzberg mcp --config mcp-config.toml ``` ### api (requires `api` feature) API utility commands. #### schema v4.5.2 Output the full OpenAPI 3.1 specification as JSON. ```bash kreuzberg api schema ``` **Examples:** ```bash # Dump OpenAPI spec to file kreuzberg api schema > openapi.json # Pipe to jq for inspection kreuzberg api schema | jq '.paths | keys' ``` ### embed v4.5.2 (requires `embeddings` feature) Generate vector embeddings for text. ```bash kreuzberg embed [OPTIONS] ``` **Options:** - `--text `: Text to embed (repeatable for batch embedding; reads from stdin if omitted) - `--preset `: Embedding preset (`fast`, `balanced`, `quality`, `multilingual`), default: `balanced` - `--format `: Output format (`text` or `json`), default: `json` **Examples:** ```bash # Embed a single text kreuzberg embed --text "hello world" # Batch embed multiple texts kreuzberg embed --text "first" --text "second" # Use a specific preset kreuzberg embed --text "bonjour" --preset multilingual # Read from stdin cat document.txt | kreuzberg embed --preset fast ``` ### chunk v4.5.2 Chunk text for processing (useful for LLM context windows). ```bash kreuzberg chunk [OPTIONS] ``` **Options:** - `--text `: Text to chunk (reads from stdin if omitted) - `--config `: Configuration file (TOML, YAML, or JSON) - `--chunk-size `: Chunk size in characters - `--chunk-overlap `: Chunk overlap in characters - `--chunker-type `: Chunker type: `text` (default), `markdown`, `yaml`, or `semantic` - `--chunking-tokenizer `: Tokenizer model for token-based sizing (e.g. `Xenova/gpt-4o`) - `--format `: Output format (`text` or `json`), default: `json` **Examples:** ```bash # Chunk text with defaults kreuzberg chunk --text "Long document content here..." # Custom chunk size and overlap kreuzberg chunk --text "..." --chunk-size 512 --chunk-overlap 64 # Markdown-aware chunking kreuzberg chunk --text "# Title\nContent..." --chunker-type markdown # Token-based chunking with specific tokenizer kreuzberg chunk --text "..." --chunking-tokenizer "Xenova/gpt-4o" # Read from stdin cat long-document.txt | kreuzberg chunk --chunk-size 1000 ``` ### completions v4.5.2 Generate shell completion scripts. ```bash kreuzberg completions ``` **Supported shells:** `bash`, `zsh`, `fish`, `elvish`, `powershell` **Examples:** ```bash # Bash (add to .bashrc) eval "$(kreuzberg completions bash)" # Zsh (add to fpath) kreuzberg completions zsh > ~/.zfunc/_kreuzberg # Fish kreuzberg completions fish | source # PowerShell kreuzberg completions powershell | Out-String | Invoke-Expression ``` ### version Show version information. ```bash kreuzberg version [--format ] ``` **Options:** - `--format `: Output format (`text` or `json`), default: `text` **Examples:** ```bash # Display version kreuzberg version # JSON output kreuzberg version --format json ``` ## Output Formats ### Text Format The default human-readable format: ```bash kreuzberg extract document.pdf # Output: # Document content here... ``` ### JSON Format For programmatic integration: ```bash kreuzberg extract document.pdf --format json # Output: # { # "content": "Document content...", # "mime_type": "application/pdf", # "metadata": { "title": "...", "author": "..." }, # "tables": [{ "markdown": "...", "cells": [...], "page_number": 0 }] # } ``` ## Supported File Formats | Category | Formats | | ------------- | -------------------------------------------------------- | | **Documents** | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF | | **Images** | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF | | **Web** | HTML, XHTML, XML | | **Text** | TXT, MD, CSV, TSV, JSON, YAML, TOML | | **Email** | EML, MSG | | **Archives** | ZIP, TAR, 7Z | | **Other** | 30+ additional formats | ## Exit Codes - `0`: Successful execution - `Non-zero`: Error occurred (check stderr for details) ## Logging Control logging verbosity with the `--log-level` flag or `RUST_LOG` environment variable: ```bash # Using --log-level flag (overrides RUST_LOG) kreuzberg --log-level debug extract document.pdf kreuzberg --log-level warn batch *.pdf # Using RUST_LOG environment variable RUST_LOG=info kreuzberg extract document.pdf RUST_LOG=debug kreuzberg extract document.pdf RUST_LOG=warn kreuzberg extract document.pdf # Show logs from specific modules RUST_LOG=kreuzberg=debug kreuzberg extract document.pdf ``` ## Performance Tips 1. **Use batch processing** for multiple files instead of sequential extraction: ```bash kreuzberg batch *.pdf # Parallel processing ``` 2. **Enable caching** to avoid reprocessing the same documents: ```bash # Cache is enabled by default kreuzberg extract document.pdf ``` 3. **Use appropriate chunk sizes** for LLM processing: ```bash kreuzberg extract long.pdf --chunk true --chunk-size 2000 ``` 4. **Tune OCR settings** for better performance: ```bash kreuzberg extract scanned.pdf --ocr true # Adjust tesseract_config in configuration file for optimization ``` 5. **Monitor cache size** and clear when needed: ```bash kreuzberg cache stats kreuzberg cache clear ``` 6. **Pre-warm models** for containerized deployments: ```bash kreuzberg cache warm --all-embeddings ``` 7. **Use hardware acceleration** when available: ```bash kreuzberg extract scanned.pdf --ocr true --acceleration coreml # macOS kreuzberg extract scanned.pdf --ocr true --acceleration cuda # NVIDIA GPU ``` ## Features ### Default Features None by default. The binary includes core extraction. ### Optional Features - **`api`**: Enable the REST API server (`kreuzberg serve` command) and API utilities (`kreuzberg api schema`) - **`mcp`**: Enable Model Context Protocol server (`kreuzberg mcp` command) - **`embeddings`**: Enable embedding generation (`kreuzberg embed` command) - **`layout-detection`**: Enable layout detection flags (`--layout`, `--layout-confidence`) - **`chunking-tokenizers`**: Enable token-based chunking (`--chunking-tokenizer`) - **`all`**: Enable all features (`api` + `mcp`) ### Building with Features ```bash # Build with all features cargo build --release -p kreuzberg-cli --features all # Build with specific features cargo build --release -p kreuzberg-cli --features api,mcp,embeddings ``` ## Troubleshooting ### File Not Found Error Ensure the file path is correct and the file is readable: ```bash # Check if file exists ls -l /path/to/document.pdf # Try with absolute path kreuzberg extract /absolute/path/to/document.pdf ``` ### OCR Not Working Verify Tesseract is installed: ```bash tesseract --version # If not found: # macOS: brew install tesseract # Ubuntu: sudo apt-get install tesseract-ocr # Windows: Download from https://github.com/tesseract-ocr/tesseract ``` ### Configuration File Not Found Check that the configuration file has the correct format and location: ```bash # Use explicit path kreuzberg extract document.pdf --config /absolute/path/to/config.toml # Or place kreuzberg.toml in current directory ls -l kreuzberg.toml ``` ### Out of Memory with Large Files Use chunking to reduce memory usage: ```bash kreuzberg extract large-document.pdf --chunk true --chunk-size 1000 ``` ### Cache Directory Permissions Ensure write access to the cache directory: ```bash # Check permissions ls -ld .kreuzberg # Or use a custom directory with appropriate permissions kreuzberg extract document.pdf --config config.toml # In config.toml: cache_dir = "/tmp/kreuzberg-cache" ``` ## Key Files - `src/main.rs`: CLI implementation with command definitions and argument parsing - `src/commands/overrides.rs`: Extraction override flags (OCR, chunking, layout, acceleration, etc.) - `Cargo.toml`: Package metadata and dependencies ## Building ### Development Build ```bash cargo build -p kreuzberg-cli ``` ### Release Build ```bash cargo build --release -p kreuzberg-cli ``` ### With All Features ```bash cargo build --release -p kreuzberg-cli --features all ``` ## Testing ```bash # Run CLI tests cargo test -p kreuzberg-cli # With logging RUST_LOG=debug cargo test -p kreuzberg-cli -- --nocapture ``` ## Performance Characteristics - **Single file extraction**: Typically 10-100ms depending on file size and format - **Batch processing**: Near-linear scaling with 8 concurrent extractions by default - **OCR processing**: 100-500ms per page depending on image quality and language - **Caching**: Sub-millisecond retrieval for cached results ## References - **Kreuzberg Core**: `../kreuzberg/` - **Main Documentation**: - **GitHub Repository**: - **Configuration Guide**: See example configuration sections above ## Contributing We welcome contributions! Please see the main Kreuzberg repository for contribution guidelines. ## License Elastic License 2.0 (ELv2)