# LLM Integration <span class="version-badge">v4.8.0</span>

Kreuzberg integrates with 143 LLM providers (including local inference engines) via [liter-llm](https://github.com/kreuzberg-dev/liter-llm) for three capabilities: VLM OCR, structured extraction, and provider-hosted embeddings.

!!! Note "Feature gate"
    Requires the `liter-llm` Cargo feature. Not included in the default feature set.

## VLM OCR

Use vision-language models as an OCR backend by rendering document pages as images and sending them to the VLM for text extraction.

### When to Use

- Low-quality scanned documents where traditional OCR struggles
- Handwritten text recognition
- Arabic, Farsi, and other scripts with poor Tesseract/PaddleOCR support
- Complex layouts where traditional OCR fails (mixed tables, forms, diagrams)
- When you need higher accuracy and can accept higher latency and API costs

### Configuration

=== "Python"

    --8<-- "snippets/python/llm/vlm_ocr.md"

=== "TypeScript"

    --8<-- "snippets/typescript/llm/vlm_ocr.md"

=== "Rust"

    ```rust title="Rust"
    use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, LlmConfig};

    let config = ExtractionConfig {
        force_ocr: true,
        ocr: Some(OcrConfig {
            backend: "vlm".to_string(),
            vlm_config: Some(LlmConfig {
                model: "openai/gpt-4o-mini".to_string(),
                ..Default::default()
            }),
            ..Default::default()
        }),
        ..Default::default()
    };
    let result = extract_file("scan.pdf", None, &config).await?;
    ```

=== "CLI"

    ```bash title="Terminal"
    kreuzberg extract scan.pdf --force-ocr true \
      --vlm-model openai/gpt-4o-mini
    ```

=== "TOML"

    ```toml title="kreuzberg.toml"
    force_ocr = true

    [ocr]
    backend = "vlm"

    [ocr.vlm_config]
    model = "openai/gpt-4o-mini"
    ```

=== "Environment Variables"

    ```bash title="Terminal"
    export KREUZBERG_VLM_OCR_MODEL=openai/gpt-4o-mini
    export OPENAI_API_KEY=sk-...
    ```

### Custom VLM Prompt

Override the default prompt template for VLM OCR:

```python title="Python"
from kreuzberg import ExtractionConfig, OcrConfig, LlmConfig

config = ExtractionConfig(
    force_ocr=True,
    ocr=OcrConfig(
        backend="vlm",
        vlm_config=LlmConfig(model="openai/gpt-4o-mini"),
        vlm_prompt="Extract all text from this document image. Preserve formatting.",
    ),
)
```

### Supported Providers

Any liter-llm vision-capable provider works as a VLM OCR backend:

| Provider          | Example Model                          |
| ----------------- | -------------------------------------- |
| OpenAI            | `openai/gpt-4o`, `openai/gpt-4o-mini`  |
| Anthropic         | `anthropic/claude-3-5-sonnet-20241022` |
| Google            | `google/gemini-2.0-flash`              |
| Groq              | `groq/llama-3.2-90b-vision-preview`    |
| Ollama (local)    | `ollama/llama3.2-vision`               |
| LM Studio (local) | `lmstudio/llava-1.5`                   |
| vLLM (local)      | `vllm/llava-next`                      |

## Structured Extraction

Extract structured JSON data from documents by providing a schema; the document text is sent to an LLM for conforming extraction.

### Basic Usage

=== "Python"

    --8<-- "snippets/python/llm/structured_extraction.md"

=== "TypeScript"

    --8<-- "snippets/typescript/llm/structured_extraction.md"

=== "Rust"

    --8<-- "snippets/rust/llm/structured_extraction.md"

=== "CLI"

    ```bash title="Terminal"
    kreuzberg extract-structured paper.pdf \
      --schema schema.json \
      --model openai/gpt-4o-mini \
      --strict
    ```

=== "TOML"

    ```toml title="kreuzberg.toml"
    [structured_extraction]
    schema_name = "paper_metadata"
    strict = true

    [structured_extraction.schema]
    type = "object"

    [structured_extraction.schema.properties.title]
    type = "string"

    [structured_extraction.schema.properties.date]
    type = "string"

    [structured_extraction.llm]
    model = "openai/gpt-4o-mini"
    ```

### Custom Prompts (Jinja2)

Override the default extraction prompt with a Jinja2 template:

```python title="Python"
from kreuzberg import ExtractionConfig, StructuredExtractionConfig, LlmConfig

config = ExtractionConfig(
    structured_extraction=StructuredExtractionConfig(
        schema={"type": "object", "properties": {"title": {"type": "string"}}},
        llm=LlmConfig(model="openai/gpt-4o-mini"),
        prompt=(
            "Analyze this document and extract key metadata.\n\n"
            "Document:\n{{ content }}\n\n"
            "Schema: {{ schema }}"
        ),
    ),
)
```

Available template variables:

| Variable                   | Description                               |
| -------------------------- | ----------------------------------------- |
| `{{ content }}`            | The extracted document text               |
| `{{ schema }}`             | The JSON schema as a formatted string     |
| `{{ schema_name }}`        | The schema name (default: `"extraction"`) |
| `{{ schema_description }}` | The schema description (may be empty)     |

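
To make the substitution concrete, the sketch below fills the four template variables into a prompt string. It is only an illustration: it uses a small regex-based stand-in rather than a real Jinja2 engine, so it handles plain `{{ name }}` placeholders and nothing else. `render_prompt` and the sample values are hypothetical and not part of Kreuzberg.

```python
import json
import re

# Matches plain placeholders such as {{ content }} or {{content}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template: str, variables: dict) -> str:
    """Fill {{ name }} placeholders; unknown names are left untouched.

    Hypothetical stand-in for Jinja2 rendering (no filters, loops, or
    conditionals), used only to show which value lands where.
    """
    def substitute(match):
        name = match.group(1)
        return variables.get(name, match.group(0))

    return PLACEHOLDER.sub(substitute, template)


schema = {"type": "object", "properties": {"title": {"type": "string"}}}

# Illustrative values for the four documented variables.
variables = {
    "content": "Example document text.",
    "schema": json.dumps(schema, indent=2),
    "schema_name": "extraction",
    "schema_description": "",
}

template = (
    "Schema name: {{ schema_name }}\n\n"
    "Document:\n{{ content }}\n\n"
    "Schema:\n{{ schema }}"
)
prompt = render_prompt(template, variables)
print(prompt)
```

A real Jinja2 template can also use conditionals, which is one way to skip `{{ schema_description }}` when it is empty.
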
### Cross-Provider Compatibility

Structured extraction handles provider differences automatically:

- **OpenAI**: Full strict mode with `additionalProperties` enforcement
- **Anthropic/Gemini**: `additionalProperties` automatically stripped (not supported by these providers)
- **All providers**: Markdown code fence wrapping in responses is automatically handled

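
The two adjustments listed above can be pictured with a short sketch. It is not Kreuzberg's implementation; `strip_additional_properties` and `strip_code_fence` are hypothetical helpers that show one plausible way to drop `additionalProperties` from a schema and to unwrap a fenced JSON response.

```python
import json
import re
from typing import Any

# A Markdown code-fence marker (three backticks), built programmatically.
TICKS = "`" * 3
FENCED = re.compile(rf"^\s*{TICKS}[\w-]*[ \t]*\n(.*?)\n?[ \t]*{TICKS}\s*$", re.DOTALL)


def strip_additional_properties(schema: Any) -> Any:
    """Return a copy of the schema without any `additionalProperties` key.

    Simplification: a property that is literally named
    `additionalProperties` would be dropped as well.
    """
    if isinstance(schema, dict):
        return {
            key: strip_additional_properties(value)
            for key, value in schema.items()
            if key != "additionalProperties"
        }
    if isinstance(schema, list):
        return [strip_additional_properties(item) for item in schema]
    return schema


def strip_code_fence(text: str) -> str:
    """Return the body of a fenced response, or the trimmed text if unfenced."""
    match = FENCED.match(text)
    return match.group(1) if match else text.strip()


schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "author": {
            "type": "object",
            "additionalProperties": False,
            "properties": {"name": {"type": "string"}},
        }
    },
}
cleaned = strip_additional_properties(schema)

# A response wrapped the way some models wrap JSON output.
raw_response = TICKS + 'json\n{"title": "Example"}\n' + TICKS
parsed = json.loads(strip_code_fence(raw_response))
```

Returning a copy instead of editing in place keeps the original schema available for providers that do enforce `additionalProperties`.
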
### Strict Mode

When `strict=True`, the LLM is instructed to produce output that exactly matches the schema. This enables OpenAI's structured output mode and adds validation on the response.

## VLM Embeddings

Use provider-hosted embedding models when you need to match your vector database model or local ONNX models are unavailable.

### Configuration

=== "Python"

    --8<-- "snippets/python/llm/vlm_embeddings.md"

=== "TypeScript"

    ```typescript title="TypeScript"
    import { embedSync } from '@kreuzberg/node';

    const embeddings = embedSync(['Hello world'], {
      model: {
        modelType: 'llm',
        value: 'openai/text-embedding-3-small',
      },
      normalize: true,
    });
    console.log(embeddings[0].length); // 1536
    ```

=== "Rust"

    ```rust title="Rust"
    use kreuzberg::{embed_texts, EmbeddingConfig, EmbeddingModelType, LlmConfig};

    let config = EmbeddingConfig {
        model: EmbeddingModelType::Llm {
            llm: LlmConfig {
                model: "openai/text-embedding-3-small".to_string(),
                ..Default::default()
            },
        },
        normalize: true,
        ..Default::default()
    };
    let embeddings = embed_texts(&["Hello world"], &config)?;
    ```

=== "CLI"

    ```bash title="Terminal"
    kreuzberg embed \
      --provider llm \
      --model openai/text-embedding-3-small \
      --text "Hello world"
    ```

### Available Models

| Model                                    | Dimensions | Provider |
| ---------------------------------------- | ---------- | -------- |
| `openai/text-embedding-3-small`          | 1536       | OpenAI   |
| `openai/text-embedding-3-large`          | 3072       | OpenAI   |
| `mistral/mistral-embed`                  | 1024       | Mistral  |
| Any liter-llm embedding-capable provider | Varies     | Various  |

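
The following sketch shows why the model and the `normalize` setting matter on the vector database side. It assumes that `normalize` means L2 normalization (scaling each vector to unit length), which this page does not state explicitly. `l2_normalize` and `dot` are hypothetical helpers, and the vectors are toy values rather than real embeddings.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    if len(a) != len(b):
        # Vectors from models with different dimensions (for example 1536
        # vs 3072) cannot be compared directly.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))


query = l2_normalize([3.0, 4.0])
stored = l2_normalize([4.0, 3.0])
similarity = dot(query, stored)
```

Under that assumption, a plain dot product doubles as cosine similarity, and the dimension check illustrates why stored and query vectors need to come from the same model.
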
## Local LLM Support

<span class="version-badge">v4.8.0</span>

Run local LLM inference engines via [liter-llm](https://github.com/kreuzberg-dev/liter-llm)'s provider routing; point to your local server without needing an API key.

### Supported Local Engines

| Engine                                                 | Prefix       | Default URL                 | Install               |
| ------------------------------------------------------ | ------------ | --------------------------- | --------------------- |
| [Ollama](https://ollama.com)                           | `ollama/`    | `http://localhost:11434/v1` | `brew install ollama` |
| [LM Studio](https://lmstudio.ai)                       | `lmstudio/`  | `http://localhost:1234/v1`  | Desktop app           |
| [vLLM](https://vllm.ai)                                | `vllm/`      | `http://localhost:8000/v1`  | `pip install vllm`    |
| [llama.cpp](https://github.com/ggerganov/llama.cpp)    | `llamacpp/`  | `http://localhost:8080/v1`  | Build from source     |
| [LocalAI](https://localai.io)                          | `localai/`   | `http://localhost:8080/v1`  | Docker                |
| [llamafile](https://github.com/Mozilla-Ocho/llamafile) | `llamafile/` | `http://localhost:8080/v1`  | Single binary         |

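
As a rough mental model of the table above, the prefix of a model string selects a default endpoint unless an explicit URL overrides it. The sketch below encodes only that idea; `split_model` and `resolve_base_url` are hypothetical helpers, not liter-llm or Kreuzberg functions, and the real routing may differ.

```python
from typing import Optional, Tuple

# Default endpoints copied from the table above.
LOCAL_ENGINE_DEFAULTS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "vllm": "http://localhost:8000/v1",
    "llamacpp": "http://localhost:8080/v1",
    "localai": "http://localhost:8080/v1",
    "llamafile": "http://localhost:8080/v1",
}


def split_model(model: str) -> Tuple[str, str]:
    """Split 'prefix/name' into its two parts, rejecting malformed strings."""
    prefix, separator, name = model.partition("/")
    if not separator or not prefix or not name:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    return prefix, name


def resolve_base_url(model: str, base_url: Optional[str] = None) -> Optional[str]:
    """Pick the endpoint for a model string.

    An explicit base_url wins; otherwise local prefixes use their default
    URL, and other providers return None (meaning: use the hosted endpoint).
    """
    if base_url is not None:
        return base_url
    prefix, _ = split_model(model)
    return LOCAL_ENGINE_DEFAULTS.get(prefix)


print(resolve_base_url("ollama/llama3.2-vision"))
print(resolve_base_url("ollama/llama3.2", base_url="http://localhost:11435/v1"))
print(resolve_base_url("openai/gpt-4o-mini"))
```

llama.cpp, LocalAI, and llamafile share the same default port in the table, so running two of them side by side needs an explicit `base_url` for at least one.
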
### Example: Ollama

=== "CLI"

    ```bash
    # Start Ollama and pull a model
    ollama pull llama3.2-vision

    # Use it for VLM OCR (no API key needed)
    kreuzberg extract scan.pdf --force-ocr true \
      --vlm-model ollama/llama3.2-vision

    # Use it for structured extraction
    kreuzberg extract-structured doc.pdf \
      --schema schema.json \
      --model ollama/llama3.2

    # Use it for embeddings
    kreuzberg embed --provider llm \
      --model ollama/all-minilm \
      --text "Hello world"
    ```

=== "Python"

    ```python
    from kreuzberg import extract_file, ExtractionConfig, StructuredExtractionConfig, LlmConfig

    config = ExtractionConfig(
        structured_extraction=StructuredExtractionConfig(
            schema={"type": "object", "properties": {"title": {"type": "string"}}},
            llm=LlmConfig(model="ollama/llama3.2"),  # No api_key needed
        ),
    )
    result = await extract_file("doc.pdf", config=config)
    ```

=== "TOML Config"

    ```toml
    [structured_extraction.llm]
    model = "ollama/llama3.2"
    # No api_key needed for local providers
    ```

!!! Tip "Custom Base URL"
    If your local server runs on a non-default port, use `base_url`:

    ```python
    LlmConfig(model="ollama/llama3.2", base_url="http://localhost:11435/v1")
    ```

## LLM Usage Tracking

Every LLM call made during extraction is tracked in the `llm_usage` field of `ExtractionResult`. Each entry records the model used, token counts, estimated cost, and why the model stopped generating.

=== "Python"

    ```python
    result = await extract_file("document.pdf", config=config)
    if result.get("llm_usage"):
        for usage in result["llm_usage"]:
            print(f"{usage['source']}: {usage['input_tokens']} in, {usage['output_tokens']} out, ${usage['estimated_cost']:.4f}")
    ```

=== "TypeScript"

    ```typescript
    const result = await extractFile("document.pdf", config);
    for (const usage of result.llmUsage ?? []) {
      console.log(`${usage.source}: ${usage.inputTokens} in, ${usage.outputTokens} out, $${usage.estimatedCost?.toFixed(4)}`);
    }
    ```

=== "Rust"

    ```rust
    let result = extract_file("document.pdf", None, &config).await?;
    if let Some(usages) = &result.llm_usage {
        for usage in usages {
            println!("{}: {} in, {} out", usage.source, usage.input_tokens.unwrap_or(0), usage.output_tokens.unwrap_or(0));
        }
    }
    ```

The `source` field indicates which pipeline stage triggered the call: `"vlm_ocr"`, `"structured_extraction"`, or `"embeddings"`.

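
Because each entry carries a `source`, usage can be totalled per pipeline stage. The sketch below does this for dict-shaped entries like the ones in the Python example above. `summarize_usage` is a hypothetical helper, the sample numbers are invented for illustration, and treating missing token counts or costs as zero is an assumption.

```python
from collections import defaultdict


def summarize_usage(entries):
    """Total calls, tokens, and estimated cost per `source`.

    Missing (None) token counts or costs are counted as zero.
    """
    totals = defaultdict(
        lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0, "estimated_cost": 0.0}
    )
    for entry in entries or []:
        bucket = totals[entry["source"]]
        bucket["calls"] += 1
        bucket["input_tokens"] += entry.get("input_tokens") or 0
        bucket["output_tokens"] += entry.get("output_tokens") or 0
        bucket["estimated_cost"] += entry.get("estimated_cost") or 0.0
    return dict(totals)


# Invented sample entries, shaped like the items of `llm_usage`.
sample_usage = [
    {"source": "vlm_ocr", "input_tokens": 1200, "output_tokens": 300, "estimated_cost": 0.0012},
    {"source": "vlm_ocr", "input_tokens": 1100, "output_tokens": 250, "estimated_cost": 0.0008},
    {"source": "structured_extraction", "input_tokens": 900, "output_tokens": None, "estimated_cost": None},
]

summary = summarize_usage(sample_usage)
for source, totals in summary.items():
    print(f"{source}: {totals['calls']} calls, ${totals['estimated_cost']:.4f}")
```
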
## API Key Configuration

API keys are resolved in the following order of precedence:

1. `api_key` field in `LlmConfig`: highest priority, per-request
2. Provider standard env vars (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, etc.)
3. Kreuzberg-specific env var (`KREUZBERG_LLM_API_KEY`): used as a fallback for any provider

!!! Note "Local providers skip API key lookup"
    Local inference engines (Ollama, LM Studio, vLLM, llama.cpp, LocalAI, llamafile) do not require an API key. If you use a local provider prefix (for example, `ollama/`), the API key fields are ignored.

```python title="Python"
from kreuzberg import LlmConfig

# Explicit API key
config = LlmConfig(model="openai/gpt-4o", api_key="sk-...")

# Custom base URL (e.g., Azure OpenAI, local proxy)
config = LlmConfig(
    model="openai/gpt-4o",
    base_url="https://my-proxy.example.com/v1",
)
```

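
The precedence rules above can be summarized in a few lines of code. This is a sketch of the documented order, not Kreuzberg's actual lookup; `resolve_api_key` is a hypothetical helper, and the provider table covers only the three environment variables named in this section.

```python
import os
from typing import Mapping, Optional

# Local prefixes, for which API keys are ignored.
LOCAL_PREFIXES = {"ollama", "lmstudio", "vllm", "llamacpp", "localai", "llamafile"}

# Only the provider variables named in this section; others are omitted.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def resolve_api_key(
    model: str,
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Apply the documented order: explicit key, provider env var, fallback."""
    env = os.environ if env is None else env
    provider = model.split("/", 1)[0]
    if provider in LOCAL_PREFIXES:
        return None  # local engines need no key; any key given is ignored
    if api_key:
        return api_key  # 1. explicit api_key field
    provider_var = PROVIDER_ENV_VARS.get(provider)
    if provider_var and env.get(provider_var):
        return env[provider_var]  # 2. provider-standard env var
    return env.get("KREUZBERG_LLM_API_KEY") or None  # 3. Kreuzberg fallback


# Placeholder values, passed explicitly instead of reading the real environment.
demo_env = {"OPENAI_API_KEY": "sk-provider", "KREUZBERG_LLM_API_KEY": "sk-fallback"}
print(resolve_api_key("openai/gpt-4o", api_key="sk-explicit", env=demo_env))
print(resolve_api_key("openai/gpt-4o", env=demo_env))
print(resolve_api_key("anthropic/claude-3-5-sonnet-20241022", env=demo_env))
print(resolve_api_key("ollama/llama3.2", api_key="ignored", env=demo_env))
```
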
## LlmConfig Reference

| Field          | Type            | Default    | Description                                                         |
| -------------- | --------------- | ---------- | ------------------------------------------------------------------- |
| `model`        | `str`           | _required_ | Provider/model in liter-llm format (for example, `"openai/gpt-4o"`) |
| `api_key`      | `str \| None`   | `None`     | API key (falls back to env vars)                                    |
| `base_url`     | `str \| None`   | `None`     | Custom endpoint URL                                                 |
| `timeout_secs` | `int \| None`   | `60`       | Request timeout in seconds                                          |
| `max_retries`  | `int \| None`   | `3`        | Maximum retry attempts                                              |
| `temperature`  | `float \| None` | `None`     | Sampling temperature                                                |
| `max_tokens`   | `int \| None`   | `None`     | Maximum tokens to generate                                          |

## REST API

### Structured Extraction

`POST /extract-structured` accepts a multipart form containing the file, the schema, and the model configuration.

```bash title="Terminal"
curl -X POST http://localhost:4000/extract-structured \
  -F "file=@invoice.pdf" \
  -F 'schema={"type":"object","properties":{"vendor":{"type":"string"},"total":{"type":"number"}}}' \
  -F "model=openai/gpt-4o-mini" \
  -F "strict=true"
```

## MCP Tools

When running Kreuzberg as an MCP server, LLM features are available as tools:

- `extract_structured`: extract structured data from a document using a JSON schema
- `embed_text`: extended with a `model` parameter for LLM-hosted embeddings

## Related

- [OCR](ocr.md): OCR backends including VLM OCR
- [Configuration Reference](configuration.md): full field reference for all config types
- [Advanced Features](advanced.md): chunking, language detection, local embeddings
- [API Server](api-server.md): REST API endpoints