This commit is contained in:
191
docs/guides/mcp-integration.md
Normal file
191
docs/guides/mcp-integration.md
Normal file
@@ -0,0 +1,191 @@
|
||||
# MCP Integration <span class="version-badge">v5.0.0</span>
|
||||
|
||||
Kreuzberg speaks [Model Context Protocol](https://modelcontextprotocol.io/). That means any AI agent — Claude, Cursor, a custom LangChain pipeline — can extract documents, generate embeddings, and manage caches through a standard tool interface without writing extraction code.
|
||||
|
||||
Two commands to get started:
|
||||
|
||||
```bash title="Terminal"
|
||||
pip install "kreuzberg[all]"
|
||||
kreuzberg mcp
|
||||
```
|
||||
|
||||
That's it. You now have an MCP server running over stdio, ready for any compatible client.
|
||||
|
||||
---
|
||||
|
||||
## How It Works
|
||||
|
||||
The MCP server wraps Kreuzberg's extraction engine behind standard tools, running as a child process over stdin/stdout with JSON-RPC messages — no HTTP ports or configuration needed.
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
A["AI Agent\n(Claude, Cursor, etc.)"] -->|"JSON-RPC\nover stdio"| B["kreuzberg mcp"]
|
||||
B --> C["Extraction Engine"]
|
||||
B --> D["Embedding Engine"]
|
||||
B --> E["Cache Layer"]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Server Modes
|
||||
|
||||
### Stdio (Default)
|
||||
|
||||
The standard mode for local AI tools. The agent spawns `kreuzberg mcp` as a subprocess and communicates over pipes.
|
||||
|
||||
```bash title="Terminal"
|
||||
kreuzberg mcp
|
||||
kreuzberg mcp --config kreuzberg.toml
|
||||
```
|
||||
|
||||
This is what Claude Desktop, Cursor, and most MCP clients expect.
|
||||
|
||||
### HTTP Transport
|
||||
|
||||
!!! Info "Feature flag: `mcp-http`" HTTP transport requires the `mcp-http` feature flag at build time.
|
||||
|
||||
For remote deployments or multi-client setups where stdio doesn't work — shared servers, team environments, cloud-hosted agents — HTTP transport exposes the same tool interface over the network.
|
||||
|
||||
---
|
||||
|
||||
## Tools
|
||||
|
||||
Kreuzberg exposes 13 tools via MCP. All extraction tools accept an optional `config` object to override defaults:
|
||||
|
||||
**Extraction:** `extract_file`, `extract_bytes`, `batch_extract_files`, `detect_mime_type`, `extract_structured`
|
||||
**Embeddings:** `embed_text`
|
||||
**Chunking:** `chunk_text`
|
||||
**Cache:** `cache_stats`, `cache_clear`, `cache_manifest`, `cache_warm`
|
||||
**Metadata:** `list_formats`, `get_version`
|
||||
|
||||
`extract_structured` requires the server to be built with the `liter-llm` feature. Full parameter schemas are discoverable at runtime via the MCP client's `list_tools` call.
|
||||
|
||||
---
|
||||
|
||||
## Connecting AI Tools
|
||||
|
||||
### Claude Desktop
|
||||
|
||||
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
|
||||
|
||||
```json title="claude_desktop_config.json"
|
||||
{
|
||||
"mcpServers": {
|
||||
"kreuzberg": {
|
||||
"command": "kreuzberg",
|
||||
"args": ["mcp"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Restart Claude. Kreuzberg's tools appear automatically — ask Claude to "extract text from invoice.pdf" and it will call `extract_file` behind the scenes.
|
||||
|
||||
### Cursor
|
||||
|
||||
Add to `.cursor/mcp.json` in your project root:
|
||||
|
||||
```json title=".cursor/mcp.json"
|
||||
{
|
||||
"mcpServers": {
|
||||
"kreuzberg": {
|
||||
"command": "kreuzberg",
|
||||
"args": ["mcp"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Python MCP Client
|
||||
|
||||
For building custom agent pipelines, use the official `mcp` Python SDK:
|
||||
|
||||
```python title="mcp_client.py"
|
||||
import asyncio
|
||||
from mcp import ClientSession, StdioServerParameters
|
||||
from mcp.client.stdio import stdio_client
|
||||
|
||||
async def main() -> None:
|
||||
server_params = StdioServerParameters(
|
||||
command="kreuzberg", args=["mcp"]
|
||||
)
|
||||
|
||||
async with stdio_client(server_params) as (read, write):
|
||||
async with ClientSession(read, write) as session:
|
||||
await session.initialize()
|
||||
|
||||
tools = await session.list_tools()
|
||||
print(f"Available: {[t.name for t in tools.tools]}")
|
||||
|
||||
result = await session.call_tool(
|
||||
"extract_file",
|
||||
arguments={"path": "document.pdf"},
|
||||
)
|
||||
print(result)
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Spawning from Python
|
||||
|
||||
If your application manages the server lifecycle directly:
|
||||
|
||||
```python title="spawn_server.py"
|
||||
import subprocess
|
||||
|
||||
process = subprocess.Popen(
|
||||
["python", "-m", "kreuzberg", "mcp"],
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=subprocess.PIPE,
|
||||
)
|
||||
print(f"MCP server running (PID {process.pid})")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
Pass a TOML config file to set extraction defaults for all tools:
|
||||
|
||||
```bash title="Terminal"
|
||||
kreuzberg mcp --config kreuzberg.toml
|
||||
```
|
||||
|
||||
Individual tool calls override file defaults via a `config` parameter. See [ExtractionConfig Reference](../reference/configuration.md) for all available fields.
|
||||
|
||||
---
|
||||
|
||||
## Running in Docker
|
||||
|
||||
```bash title="Terminal"
|
||||
docker run ghcr.io/kreuzberg-dev/kreuzberg:latest mcp
|
||||
|
||||
docker run \
|
||||
-v $(pwd)/kreuzberg.toml:/config/kreuzberg.toml \
|
||||
ghcr.io/kreuzberg-dev/kreuzberg:latest \
|
||||
mcp --config /config/kreuzberg.toml
|
||||
```
|
||||
|
||||
For production, use Compose with a persistent cache volume so embedding models don't re-download on restart:
|
||||
|
||||
```yaml title="docker-compose.yaml"
|
||||
services:
|
||||
kreuzberg-mcp:
|
||||
image: ghcr.io/kreuzberg-dev/kreuzberg:latest
|
||||
command: mcp --config /config/kreuzberg.toml
|
||||
volumes:
|
||||
- ./kreuzberg.toml:/config/kreuzberg.toml:ro
|
||||
- cache-data:/app/.kreuzberg
|
||||
restart: unless-stopped
|
||||
|
||||
volumes:
|
||||
cache-data:
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## What to Read Next
|
||||
|
||||
- [API Server Guide](api-server.md) — the HTTP REST API and detailed MCP tool reference
|
||||
- [Docker Deployment](docker.md) — container setup for all server modes
|
||||
- [Configuration Reference](../reference/configuration.md) — every config option explained
|
||||
Reference in New Issue
Block a user