Kreuzberg CLI Reference
Comprehensive command-line interface for the Kreuzberg document intelligence library.
Installation
Install from crates.io:
cargo install kreuzberg-cli
Or download pre-built binaries from GitHub Releases.
Commands
extract
Extract text and structure from a single document.
kreuzberg extract <path> [FLAGS]
Positional Arguments
- <path>: Path to the document file
Flags
- -c, --config <path>: Path to config file (TOML, YAML, or JSON). Auto-discovers kreuzberg.{toml,yaml,json} in current and parent directories if omitted.
- --config-json <json>: Inline JSON configuration (merged after config file, before CLI flags).
- --config-json-base64 <base64>: Base64-encoded JSON configuration.
- -m, --mime-type <type>: MIME type hint (auto-detected if not provided).
- -f, --format <text|json>: CLI output format (default: text). Controls how results display, not extraction content format.
- --content-format <plain|markdown|djot|html>: Extraction content format (default: plain). Controls format of extracted content. (Note: --output-format is a deprecated alias.)
- --ocr <bool>: Enable OCR processing.
- --ocr-backend <BACKEND>: OCR backend, one of tesseract, paddle-ocr, easyocr.
- --ocr-language <LANG>: OCR language code.
- --ocr-auto-rotate <bool>: Auto-rotate images before OCR.
- --force-ocr <bool>: Force OCR even if text extraction succeeds.
- --disable-ocr <bool>: Disable OCR entirely (even for images).
- --no-cache <bool>: Disable caching.
- --chunk <bool>: Enable text chunking.
- --chunk-size <n>: Chunk size in characters.
- --chunk-overlap <n>: Chunk overlap in characters.
- --chunking-tokenizer <model>: Tokenizer model for token-based sizing.
- --include-structure <bool>: Include hierarchical document structure.
- --quality <bool>: Enable quality processing.
- --detect-language <bool>: Enable language detection.
- --layout: Enable layout detection (RT-DETR v2). Use --layout false to disable.
- --layout-confidence <float>: Layout confidence threshold (0.0-1.0).
- --layout-table-model <model>: Table structure model, one of tatr, slanet_wired, slanet_wireless, slanet_plus, slanet_auto, disabled.
- --acceleration <provider>: ONNX execution provider, one of auto, cpu, coreml, cuda, tensorrt.
- --extract-pages <bool>: Extract pages as separate array.
- --page-markers <bool>: Insert page marker comments.
- --extract-images <bool>: Enable image extraction.
- --target-dpi <n>: Target DPI for images (36-2400).
- --pdf-password <pass>: Password for encrypted PDFs (repeatable).
- --pdf-extract-images <bool>: Extract images from PDF pages.
- --pdf-extract-metadata <bool>: Extract PDF metadata.
- --token-reduction <level>: Token reduction, one of off, light, moderate, aggressive, maximum.
- --msg-codepage <n>: Windows codepage fallback for MSG files.
- --max-concurrent <n>: Max parallel extractions in batch mode.
- --max-threads <n>: Cap all internal thread pools.
- --cache-namespace <name>: Cache namespace for tenant isolation.
- --cache-ttl-secs <n>: Per-request cache TTL in seconds.
Examples
# Extract with default settings
kreuzberg extract document.pdf
# Extract with OCR enabled
kreuzberg extract scanned.pdf --ocr true
# Extract with a specific content format
kreuzberg extract doc.docx --content-format markdown
# Extract with inline JSON config
kreuzberg extract file.pdf --config-json '{"ocr":{"backend":"tesseract"}}'
# Extract with base64-encoded config
kreuzberg extract file.pdf --config-json-base64 eyJvY3IiOnsiYmFja2VuZCI6InRlc3NlcmFjdCJ9fQ==
# Extract and output as JSON
kreuzberg extract doc.pdf --format json
# Extract with chunking
kreuzberg extract large-doc.pdf --chunk true --chunk-size 2000 --chunk-overlap 200
# Layout-aware markdown extraction
kreuzberg extract document.pdf --layout --content-format markdown
# With custom confidence threshold
kreuzberg extract document.pdf --layout-confidence 0.7 --content-format markdown
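The value passed to --config-json-base64 in the examples above is the same inline JSON used with --config-json, Base64-encoded. A minimal sketch of producing it in a POSIX shell, assuming a base64 utility is on the PATH; the variable names are illustrative, and the final command is only printed, not run:

```shell
# Inline JSON configuration, identical to the --config-json example.
config_json='{"ocr":{"backend":"tesseract"}}'

# Encode without a trailing newline; strip any line wrapping base64 may add.
encoded=$(printf '%s' "$config_json" | base64 | tr -d '\n')

# Print the resulting command instead of running it.
echo "kreuzberg extract file.pdf --config-json-base64 $encoded"
```

Using printf '%s' rather than echo keeps a trailing newline out of the encoded payload, which would otherwise change the Base64 string.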
batch
Batch extract from multiple documents in parallel.
kreuzberg batch <paths...> [FLAGS]
Positional Arguments
- <paths...>: One or more document file paths
Flags
- -c, --config <path>: Path to config file (TOML, YAML, or JSON). Auto-discovers kreuzberg.{toml,yaml,json} in current and parent directories if omitted.
- --config-json <json>: Inline JSON configuration (merged after config file, before CLI flags).
- --config-json-base64 <base64>: Base64-encoded JSON configuration.
- -f, --format <text|json>: CLI output format (default: json). Controls how results display, not extraction content format.
- All extraction override flags from extract are also supported (e.g., --content-format, --ocr, --layout, --force-ocr, --no-cache, --quality, --acceleration). See the extract command flags for the full list.
Notes
- The batch command defaults to JSON output format (unlike extract, which defaults to text).
- Does not support the --mime-type or --detect-language flags.
Examples
# Batch extract multiple PDFs
kreuzberg batch document1.pdf document2.pdf document3.pdf
# Batch extract with glob patterns (shell expansion)
kreuzberg batch *.pdf
# Batch extract with custom content format
kreuzberg batch doc1.pdf doc2.pdf --content-format markdown
# Batch extract with OCR
kreuzberg batch scanned*.pdf --ocr true
# Batch extract with text output format
kreuzberg batch files*.docx --format text
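Because the glob patterns above are expanded by the shell, a pattern that matches nothing is passed through unchanged in a POSIX shell, so the command would receive the literal string *.pdf as a path. A hedged sketch of a guard for scripts; the helper name count_matches is hypothetical, and the kreuzberg call is only printed:

```shell
# Count the files matching a glob in a directory (0 when nothing matches).
count_matches() {
  dir=$1
  pattern=$2
  # $pattern is left unquoted on purpose so the shell expands the glob.
  set -- "$dir"/$pattern
  # An unmatched glob stays literal, so check that the first entry exists.
  if [ -e "$1" ]; then
    echo "$#"
  else
    echo 0
  fi
}

docs_dir=.
if [ "$(count_matches "$docs_dir" '*.pdf')" -gt 0 ]; then
  echo "kreuzberg batch $docs_dir/*.pdf"
else
  echo "no PDF files in $docs_dir, skipping batch" >&2
fi
```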
detect
Identify the MIME type of a file.
kreuzberg detect <path> [FLAGS]
Positional Arguments
- <path>: Path to the file
Flags
- -f, --format <text|json>: Output format (default: text)
Examples
# Detect MIME type (text output)
kreuzberg detect unknown-file.bin
# Detect MIME type (JSON output)
kreuzberg detect file.xyz --format json
version
Display version information.
kreuzberg version [FLAGS]
Flags
- -f, --format <text|json>: Output format (default: text)
Examples
# Show version as text
kreuzberg version
# Show version as JSON
kreuzberg version --format json
cache
Manage extraction cache.
cache stats
Display cache statistics.
kreuzberg cache stats [FLAGS]
Flags
- --cache-dir <path>: Cache directory (default: .kreuzberg in current directory)
- -f, --format <text|json>: Output format (default: text)
Examples
# Show cache stats
kreuzberg cache stats
# Show cache stats as JSON
kreuzberg cache stats --format json
# Show stats for specific cache directory
kreuzberg cache stats --cache-dir /tmp/my-cache
cache clear
Clear all cached extractions.
kreuzberg cache clear [FLAGS]
Flags
- --cache-dir <path>: Cache directory (default: .kreuzberg in current directory)
- -f, --format <text|json>: Output format (default: text)
Examples
# Clear cache
kreuzberg cache clear
# Clear specific cache directory
kreuzberg cache clear --cache-dir /tmp/my-cache
serve
Start the API server (requires api feature).
kreuzberg serve [FLAGS]
Flags
- -H, --host <host>: Host to bind to (e.g., 127.0.0.1 or 0.0.0.0). CLI arg overrides config file and environment variables.
- -p, --port <port>: Port to bind to. CLI arg overrides config file and environment variables.
- -c, --config <path>: Path to config file (TOML, YAML, or JSON). Auto-discovers kreuzberg.{toml,yaml,json} in current and parent directories if omitted.
Configuration Precedence
- CLI arguments (--host, --port)
- Environment variables (KREUZBERG_HOST, KREUZBERG_PORT)
- Config file ([server] section)
- Built-in defaults (127.0.0.1:8000)
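The precedence list above amounts to "first source that provides a value wins". The following is only an illustrative shell sketch of that rule, not Kreuzberg's implementation; the resolve helper and the cli_host and config_host variables are hypothetical stand-ins for values a wrapper script might have collected:

```shell
# Hypothetical inputs (assumptions for illustration only).
cli_host=""              # value of --host; empty means the flag was not given
config_host="10.0.0.5"   # value from the [server] section of a config file

# Print the first non-empty argument; arguments are ordered highest priority first.
resolve() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# CLI argument, then environment variable, then config file, then built-in default.
host=$(resolve "$cli_host" "${KREUZBERG_HOST:-}" "$config_host" "127.0.0.1")
echo "resolved host: $host"
```

Note that this sketch treats an empty value the same as a missing one, which is a simplification.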
Examples
# Start server with defaults
kreuzberg serve
# Start server on specific host and port
kreuzberg serve --host 0.0.0.0 --port 3000
# Start server with config file
kreuzberg serve --config kreuzberg.toml
# Start server (environment variables override defaults)
KREUZBERG_HOST=192.168.1.100 KREUZBERG_PORT=8080 kreuzberg serve
mcp
Start the Model Context Protocol (MCP) server (requires mcp feature).
kreuzberg mcp [FLAGS]
Flags
- -c, --config <path>: Path to config file (TOML, YAML, or JSON). Auto-discovers kreuzberg.{toml,yaml,json} in current and parent directories if omitted.
- --transport <stdio|http>: Transport mode (default: stdio)
- --host <host>: HTTP host for http transport (default: 127.0.0.1)
- --port <port>: HTTP port for http transport (default: 8001)
Examples
# Start MCP server with stdio transport
kreuzberg mcp
# Start MCP server with HTTP transport
kreuzberg mcp --transport http
# Start MCP server on custom HTTP host/port
kreuzberg mcp --transport http --host 0.0.0.0 --port 9000
# Start MCP server with config file
kreuzberg mcp --config kreuzberg.toml
Configuration
File Format
Configuration files support three formats with automatic detection:
- TOML: .toml extension (recommended)
- YAML: .yaml or .yml extension
- JSON: .json extension
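As a rough sketch of the extension mapping listed above, assuming the format is chosen purely from the file extension (the config_format helper is hypothetical and not part of the CLI):

```shell
# Map a config file name to its format based on the extension.
config_format() {
  case $1 in
    *.toml) echo toml ;;
    *.yaml|*.yml) echo yaml ;;
    *.json) echo json ;;
    *) echo unknown; return 1 ;;
  esac
}

config_format kreuzberg.yml
```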
Configuration Precedence
Settings are applied in order from highest to lowest priority:
- Individual CLI flags (e.g., --ocr true, --content-format markdown)
- Inline JSON config (--config-json or --config-json-base64)
- Config file (explicit --config path.toml or auto-discovered)
- Default values (built-in library defaults)
Auto-Discovery
When no config file is specified, Kreuzberg searches for configuration in this order:
- kreuzberg.toml in current directory
- kreuzberg.yaml in current directory
- kreuzberg.json in current directory
- Parent directories (same search pattern, up to filesystem root)
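The search order above can be approximated in a shell script, for example to check which file would be picked up before running a command. This is a sketch of the documented order only, not the CLI's own code; find_config is a hypothetical helper name:

```shell
# Walk from a starting directory up to the filesystem root and print the
# first config file found, checking TOML, then YAML, then JSON per directory.
find_config() {
  dir=$(cd "${1:-.}" && pwd) || return 1
  while :; do
    for name in kreuzberg.toml kreuzberg.yaml kreuzberg.json; do
      if [ -f "$dir/$name" ]; then
        echo "$dir/$name"
        return 0
      fi
    done
    # Stop once the root directory itself has been checked.
    if [ "$dir" = "/" ]; then
      return 1
    fi
    dir=$(dirname "$dir")
  done
}

find_config . || echo "no kreuzberg config found" >&2
```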
Example Configuration
# Top-level extraction options
use_cache = true
enable_quality_processing = true
force_ocr = false
output_format = "markdown"
# OCR settings
[ocr]
backend = "tesseract"
language = "eng"
# Chunking settings
[chunking]
max_chars = 2000
max_overlap = 200
# Language detection
[language_detection]
enabled = true
# Server configuration (for serve command)
[server]
host = "127.0.0.1"
port = 8000
Exit Codes
- 0: Success
- Non-zero: Error (see stderr for details)
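Since the only documented contract is zero versus non-zero plus a message on stderr, a script can branch on the status and keep stderr for reporting. A minimal sketch; run_checked is a hypothetical helper, and a stand-in command is used so the snippet runs without kreuzberg installed:

```shell
# Run a command, capture its stderr, and report when the exit status is non-zero.
run_checked() {
  err_file=$(mktemp)
  "$@" 2>"$err_file"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "failed with exit code $status: $(cat "$err_file")" >&2
  fi
  rm -f "$err_file"
  return "$status"
}

# Stand-in for a failing extraction; in practice this might be:
#   run_checked kreuzberg extract document.pdf
run_checked sh -c 'echo "simulated error" >&2; exit 2' || echo "extraction failed"
```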
Error Handling
The CLI validates input and provides clear error messages:
- File not found: Verify path exists and is readable
- Invalid MIME type: Ensure file is accessible and format is supported
- Invalid JSON: Check --config-json syntax
- Invalid config file: Verify TOML/YAML/JSON format
- Invalid chunk parameters: Ensure chunk-size > 0 and overlap < chunk-size
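The chunk parameter rule above (chunk-size > 0 and overlap < chunk-size) can also be checked in a wrapper script before the CLI is invoked. A sketch under that assumption; validate_chunk_params is a hypothetical helper, and the command is only printed:

```shell
# Return 0 when size and overlap are integers with size > 0 and overlap < size.
validate_chunk_params() {
  size=$1
  overlap=$2
  for value in "$size" "$overlap"; do
    case $value in
      ''|*[!0-9]*)
        echo "chunk parameters must be non-negative integers" >&2
        return 1
        ;;
    esac
  done
  if [ "$size" -le 0 ]; then
    echo "chunk size must be greater than 0" >&2
    return 1
  fi
  if [ "$overlap" -ge "$size" ]; then
    echo "chunk overlap must be smaller than chunk size" >&2
    return 1
  fi
  return 0
}

if validate_chunk_params 2000 200; then
  echo "kreuzberg extract large-doc.pdf --chunk true --chunk-size 2000 --chunk-overlap 200"
fi
```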
Environment Variables
- RUST_LOG: Set logging level (e.g., RUST_LOG=debug)
- KREUZBERG_HOST: Server bind host (used by serve command)
- KREUZBERG_PORT: Server bind port (used by serve command)
Common Patterns
Extract with Custom Configuration
kreuzberg extract document.pdf \
--content-format markdown \
--ocr true \
--quality true
Batch Process with Config File
kreuzberg batch *.pdf --config extraction-config.toml
CI/CD Integration
# Extract to JSON for downstream processing
kreuzberg extract file.pdf --format json | jq '.content'
# Batch process with error handling
kreuzberg batch docs/*.pdf --format json || exit 1
Performance Tuning
# Disable cache for temporary processing
kreuzberg extract file.pdf --no-cache true
# Enable chunking for large documents
kreuzberg extract large-file.pdf \
--chunk true \
--chunk-size 5000 \
--chunk-overlap 500
Debugging
Enable detailed logging:
RUST_LOG=debug kreuzberg extract document.pdf
Check cache statistics:
kreuzberg cache stats --format json
Detect file MIME type:
kreuzberg detect unknown-file --format json