192 lines
5.1 KiB
Markdown
192 lines
5.1 KiB
Markdown
|
|
# MCP Integration <span class="version-badge">v5.0.0</span>
|
||
|
|
|
||
|
|
Kreuzberg speaks [Model Context Protocol](https://modelcontextprotocol.io/). That means any AI agent — Claude, Cursor, a custom LangChain pipeline — can extract documents, generate embeddings, and manage caches through a standard tool interface without writing extraction code.
|
||
|
|
|
||
|
|
Two commands to get started:
|
||
|
|
|
||
|
|
```bash title="Terminal"
|
||
|
|
pip install "kreuzberg[all]"
|
||
|
|
kreuzberg mcp
|
||
|
|
```
|
||
|
|
|
||
|
|
That's it. You now have an MCP server running over stdio, ready for any compatible client.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## How It Works
|
||
|
|
|
||
|
|
The MCP server wraps Kreuzberg's extraction engine behind standard tools, running as a child process over stdin/stdout with JSON-RPC messages — no HTTP ports or configuration needed.
|
||
|
|
|
||
|
|
```mermaid
|
||
|
|
flowchart LR
|
||
|
|
A["AI Agent\n(Claude, Cursor, etc.)"] -->|"JSON-RPC\nover stdio"| B["kreuzberg mcp"]
|
||
|
|
B --> C["Extraction Engine"]
|
||
|
|
B --> D["Embedding Engine"]
|
||
|
|
B --> E["Cache Layer"]
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Server Modes
|
||
|
|
|
||
|
|
### Stdio (Default)
|
||
|
|
|
||
|
|
The standard mode for local AI tools. The agent spawns `kreuzberg mcp` as a subprocess and communicates over pipes.
|
||
|
|
|
||
|
|
```bash title="Terminal"
|
||
|
|
kreuzberg mcp
|
||
|
|
kreuzberg mcp --config kreuzberg.toml
|
||
|
|
```
|
||
|
|
|
||
|
|
This is what Claude Desktop, Cursor, and most MCP clients expect.
|
||
|
|
|
||
|
|
### HTTP Transport
|
||
|
|
|
||
|
|
!!! Info "Feature flag: `mcp-http`" HTTP transport requires the `mcp-http` feature flag at build time.
|
||
|
|
|
||
|
|
For remote deployments or multi-client setups where stdio doesn't work — shared servers, team environments, cloud-hosted agents — HTTP transport exposes the same tool interface over the network.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Tools
|
||
|
|
|
||
|
|
Kreuzberg exposes 13 tools via MCP. All extraction tools accept an optional `config` object to override defaults:
|
||
|
|
|
||
|
|
**Extraction:** `extract_file`, `extract_bytes`, `batch_extract_files`, `detect_mime_type`, `extract_structured`
|
||
|
|
**Embeddings:** `embed_text`
|
||
|
|
**Chunking:** `chunk_text`
|
||
|
|
**Cache:** `cache_stats`, `cache_clear`, `cache_manifest`, `cache_warm`
|
||
|
|
**Metadata:** `list_formats`, `get_version`
|
||
|
|
|
||
|
|
`extract_structured` requires the server to be built with the `liter-llm` feature. Full parameter schemas are discoverable at runtime via the MCP client's `list_tools` call.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Connecting AI Tools
|
||
|
|
|
||
|
|
### Claude Desktop
|
||
|
|
|
||
|
|
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
|
||
|
|
|
||
|
|
```json title="claude_desktop_config.json"
|
||
|
|
{
|
||
|
|
"mcpServers": {
|
||
|
|
"kreuzberg": {
|
||
|
|
"command": "kreuzberg",
|
||
|
|
"args": ["mcp"]
|
||
|
|
}
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
Restart Claude. Kreuzberg's tools appear automatically — ask Claude to "extract text from invoice.pdf" and it will call `extract_file` behind the scenes.
|
||
|
|
|
||
|
|
### Cursor
|
||
|
|
|
||
|
|
Add to `.cursor/mcp.json` in your project root:
|
||
|
|
|
||
|
|
```json title=".cursor/mcp.json"
|
||
|
|
{
|
||
|
|
"mcpServers": {
|
||
|
|
"kreuzberg": {
|
||
|
|
"command": "kreuzberg",
|
||
|
|
"args": ["mcp"]
|
||
|
|
}
|
||
|
|
}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Python MCP Client
|
||
|
|
|
||
|
|
For building custom agent pipelines, use the official `mcp` Python SDK:
|
||
|
|
|
||
|
|
```python title="mcp_client.py"
|
||
|
|
import asyncio
|
||
|
|
from mcp import ClientSession, StdioServerParameters
|
||
|
|
from mcp.client.stdio import stdio_client
|
||
|
|
|
||
|
|
async def main() -> None:
|
||
|
|
server_params = StdioServerParameters(
|
||
|
|
command="kreuzberg", args=["mcp"]
|
||
|
|
)
|
||
|
|
|
||
|
|
async with stdio_client(server_params) as (read, write):
|
||
|
|
async with ClientSession(read, write) as session:
|
||
|
|
await session.initialize()
|
||
|
|
|
||
|
|
tools = await session.list_tools()
|
||
|
|
print(f"Available: {[t.name for t in tools.tools]}")
|
||
|
|
|
||
|
|
result = await session.call_tool(
|
||
|
|
"extract_file",
|
||
|
|
arguments={"path": "document.pdf"},
|
||
|
|
)
|
||
|
|
print(result)
|
||
|
|
|
||
|
|
asyncio.run(main())
|
||
|
|
```
|
||
|
|
|
||
|
|
### Spawning from Python
|
||
|
|
|
||
|
|
If your application manages the server lifecycle directly:
|
||
|
|
|
||
|
|
```python title="spawn_server.py"
|
||
|
|
import subprocess
|
||
|
|
|
||
|
|
process = subprocess.Popen(
|
||
|
|
["python", "-m", "kreuzberg", "mcp"],
|
||
|
|
stdout=subprocess.PIPE,
|
||
|
|
stderr=subprocess.PIPE,
|
||
|
|
)
|
||
|
|
print(f"MCP server running (PID {process.pid})")
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Configuration
|
||
|
|
|
||
|
|
Pass a TOML config file to set extraction defaults for all tools:
|
||
|
|
|
||
|
|
```bash title="Terminal"
|
||
|
|
kreuzberg mcp --config kreuzberg.toml
|
||
|
|
```
|
||
|
|
|
||
|
|
Individual tool calls override file defaults via a `config` parameter. See [ExtractionConfig Reference](../reference/configuration.md) for all available fields.
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Running in Docker
|
||
|
|
|
||
|
|
```bash title="Terminal"
|
||
|
|
docker run ghcr.io/kreuzberg-dev/kreuzberg:latest mcp
|
||
|
|
|
||
|
|
docker run \
|
||
|
|
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
|
||
|
|
ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||
|
|
mcp --config /config/kreuzberg.toml
|
||
|
|
```
|
||
|
|
|
||
|
|
For production, use Compose with a persistent cache volume so embedding models don't re-download on restart:
|
||
|
|
|
||
|
|
```yaml title="docker-compose.yaml"
|
||
|
|
services:
|
||
|
|
kreuzberg-mcp:
|
||
|
|
image: ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||
|
|
command: mcp --config /config/kreuzberg.toml
|
||
|
|
volumes:
|
||
|
|
- ./kreuzberg.toml:/config/kreuzberg.toml:ro
|
||
|
|
- cache-data:/app/.kreuzberg
|
||
|
|
restart: unless-stopped
|
||
|
|
|
||
|
|
volumes:
|
||
|
|
cache-data:
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## What to Read Next
|
||
|
|
|
||
|
|
- [API Server Guide](api-server.md) — the HTTP REST API and detailed MCP tool reference
|
||
|
|
- [Docker Deployment](docker.md) — container setup for all server modes
|
||
|
|
- [Configuration Reference](../reference/configuration.md) — every config option explained
|