20
docs/integrations/index.md
Normal file
@@ -0,0 +1,20 @@
# Integrations

Kreuzberg integrates with AI frameworks, databases, and search engines, bringing document extraction into your existing stack. Most integrations ship as standalone packages (on PyPI, or on Maven Central for Spring AI); Open WebUI support is built into Kreuzberg.

---

## Available integrations

| Integration | Framework                                         | Package                                                                                                                          | Docs                                                                                                |
| ----------- | ------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| Open WebUI  | [Open WebUI](https://openwebui.com/)              | Built-in (v4.7.0+)                                                                                                               | [Open WebUI](openwebui.md)                                                                          |
| LangChain   | [LangChain](https://www.langchain.com/)           | [`langchain-kreuzberg`](https://pypi.org/project/langchain-kreuzberg/)                                                           | [GitHub](https://github.com/kreuzberg-dev/langchain-kreuzberg)                                      |
| LlamaIndex  | [LlamaIndex](https://www.llamaindex.ai/)          | [`llama-index-readers-kreuzberg`](https://pypi.org/project/llama-index-readers-kreuzberg/)                                       | [GitHub](https://github.com/kreuzberg-dev/llama-index-kreuzberg)                                    |
| Haystack    | [Haystack](https://haystack.deepset.ai/)          | [`kreuzberg-haystack`](https://pypi.org/project/kreuzberg-haystack/)                                                             | [GitHub](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/kreuzberg) |
| CrewAI      | [CrewAI](https://www.crewai.com/)                 | [`kreuzberg-crewai`](https://pypi.org/project/kreuzberg-crewai/)                                                                 | [GitHub](https://github.com/kreuzberg-dev/kreuzberg-crewai)                                         |
| txtAI       | [txtAI](https://neuml.github.io/txtai/)           | [`kreuzberg-txtai`](https://pypi.org/project/kreuzberg-txtai/)                                                                   | [GitHub](https://github.com/kreuzberg-dev/kreuzberg-txtai)                                          |
| SurrealDB   | [SurrealDB](https://surrealdb.com/)               | [`kreuzberg-surrealdb`](https://pypi.org/project/kreuzberg-surrealdb/)                                                           | [SurrealDB](surrealdb.md)                                                                           |
| Spring AI   | [Spring AI](https://spring.io/projects/spring-ai) | [`kreuzberg-spring-ai-document-reader`](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg-spring-ai-document-reader) | [GitHub](https://github.com/kreuzberg-dev/kreuzberg-spring-ai)                                      |

!!! Tip "Building a new integration?"
    Explore existing integrations on [GitHub](https://github.com/kreuzberg-dev) for reference.
168
docs/integrations/openwebui.md
Normal file
@@ -0,0 +1,168 @@
# Open WebUI

Open WebUI supports pluggable content extraction backends. Kreuzberg implements two of those backend APIs, the **docling-serve** endpoint and the **external document loader** endpoint, so it works as a drop-in replacement without patching Open WebUI.

## How it works

1. A user uploads a document (PDF, DOCX, image, etc.) in Open WebUI.
2. Open WebUI sends the file to Kreuzberg's API endpoint.
3. Kreuzberg extracts the content, running OCR where needed, and returns Markdown.
4. Open WebUI stores the Markdown in its vector database for retrieval-augmented generation.

Kreuzberg supports [90+ file formats](../reference/formats.md) and requires no GPU.

## Prerequisites

- Docker and Docker Compose (v2)
- Open WebUI running or ready to deploy
- No GPU required: Kreuzberg runs entirely on CPU

## Setup with Docker Compose

This is the fastest way to get both services running together.

```yaml title="docker-compose.yaml"
services:
  kreuzberg:
    image: ghcr.io/kreuzberg-dev/kreuzberg:latest-core
    ports:
      - "8000:8000"
    command: ["serve", "--host", "0.0.0.0", "--port", "8000"]
    volumes:
      - kreuzberg-cache:/app/.kreuzberg
    healthcheck:
      test: ["CMD", "kreuzberg", "version"]
      interval: 10s
      timeout: 5s
      retries: 5

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      CONTENT_EXTRACTION_ENGINE: "docling"
      DOCLING_SERVER_URL: "http://kreuzberg:8000"
    depends_on:
      kreuzberg:
        condition: service_healthy

volumes:
  kreuzberg-cache:
```

Start both services in detached mode:

```bash
docker compose up -d
```

Open `http://localhost:3000`, create an account, and upload a document. The extracted text will appear in the chat context.

!!! Note "Cache volume"
    The `kreuzberg-cache` volume persists OCR models and embedding weights across restarts. Without it, models are downloaded again whenever the container is recreated (~90 MB–1.2 GB depending on configuration).

!!! Info "Already running Open WebUI?"
    Start Kreuzberg separately, then point Open WebUI to that Kreuzberg URL.

=== "Docker"

    ```bash
    docker run -d \
      --name kreuzberg \
      -p 8000:8000 \
      -v kreuzberg-cache:/app/.kreuzberg \
      ghcr.io/kreuzberg-dev/kreuzberg:latest-core \
      serve --host 0.0.0.0 --port 8000
    ```

=== "CLI (Homebrew / Cargo)"

    ```bash
    kreuzberg serve --host 0.0.0.0 --port 8000
    ```

Then configure Open WebUI using one of the two engine modes below.

## Choosing an engine mode

Kreuzberg exposes two Open WebUI–compatible APIs. Both return the same extracted content, so pick whichever fits your setup.

|                    | **Docling** (recommended) | **External**                   |
| ------------------ | ------------------------- | ------------------------------ |
| **Endpoint**       | `POST /v1/convert/file`   | `PUT /process`                 |
| **Engine setting** | `docling`                 | `external`                     |
| **URL variable**   | `DOCLING_SERVER_URL`      | `EXTERNAL_DOCUMENT_LOADER_URL` |

=== "Docling (recommended)"

    Set these environment variables on the Open WebUI container:

    ```yaml
    environment:
      CONTENT_EXTRACTION_ENGINE: "docling"
      DOCLING_SERVER_URL: "http://kreuzberg:8000"
    ```

    Or via the Admin UI: **Settings → Documents → Content Extraction Engine** → select **Docling** → set server URL to `http://kreuzberg:8000`.

=== "External"

    Set these environment variables on the Open WebUI container:

    ```yaml
    environment:
      CONTENT_EXTRACTION_ENGINE: "external"
      EXTERNAL_DOCUMENT_LOADER_URL: "http://kreuzberg:8000"
    ```

    Or via the Admin UI: **Settings → Documents → Content Extraction Engine** → select **External** → set URL to `http://kreuzberg:8000`.

!!! Tip
    If Kreuzberg runs on a different host or port, replace `http://kreuzberg:8000` with the actual address. Inside Docker Compose, use the service name (`kreuzberg`). Outside Docker, use the host IP or `localhost`.

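
The mapping between engine mode and environment variables can be summarized in a few lines of Python. This is only an illustrative sketch: the variable names and values come from this page, while the helper `open_webui_env` and the `URL_VARIABLE` table are hypothetical and not part of Kreuzberg or Open WebUI.

```python
# Hypothetical helper: which environment variables Open WebUI needs
# for each engine mode. Names and values are taken from this page.

URL_VARIABLE = {
    "docling": "DOCLING_SERVER_URL",
    "external": "EXTERNAL_DOCUMENT_LOADER_URL",
}


def open_webui_env(mode: str, kreuzberg_url: str = "http://kreuzberg:8000") -> dict[str, str]:
    """Return the environment variables for the chosen engine mode."""
    if mode not in URL_VARIABLE:
        raise ValueError(f"unknown engine mode: {mode!r}")
    return {
        "CONTENT_EXTRACTION_ENGINE": mode,
        URL_VARIABLE[mode]: kreuzberg_url,
    }


if __name__ == "__main__":
    # Outside Docker, point at the host address instead of the service name.
    for name, value in open_webui_env("external", "http://localhost:8000").items():
        print(f"{name}={value}")
```

Only one URL variable is set per mode; the other mode's variable is left out of the result.
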
## Verify it works

Test the endpoints directly before debugging through Open WebUI.

=== "Docling endpoint"

    ```bash
    curl -s -F "files=@invoice.pdf" http://localhost:8000/v1/convert/file | jq .
    ```

    ```json title="Expected response"
    {
      "document": {
        "md_content": "# Invoice\n\nDate: 2026-01-15\n..."
      },
      "status": "success"
    }
    ```

=== "External endpoint"

    ```bash
    curl -s -X PUT \
      -H "Content-Type: application/pdf" \
      -H "X-Filename: invoice.pdf" \
      --data-binary @invoice.pdf \
      http://localhost:8000/process | jq .
    ```

    ```json title="Expected response"
    {
      "page_content": "# Invoice\n\nDate: 2026-01-15\n...",
      "metadata": {
        "source": "invoice.pdf"
      }
    }
    ```

If the endpoint returns extracted text, the integration is working. Upload a document through Open WebUI to confirm end-to-end.
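
The same check can be scripted. The sketch below uses only the Python standard library and mirrors the `PUT /process` curl call and the two response shapes shown above. The function names are made up for this example, and the response fields are assumed to match the samples on this page, which may not cover every field the server returns.

```python
import json
import urllib.request


def build_external_request(base_url: str, filename: str, data: bytes,
                           content_type: str = "application/pdf") -> urllib.request.Request:
    """Build (but do not send) the PUT /process request from the curl example."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/process",
        data=data,
        method="PUT",
        headers={"Content-Type": content_type, "X-Filename": filename},
    )


def extract_markdown(payload: dict) -> str:
    """Pull the extracted text out of either response shape."""
    if "document" in payload:        # docling-style response
        return payload["document"]["md_content"]
    if "page_content" in payload:    # external-loader-style response
        return payload["page_content"]
    raise ValueError("unrecognized response shape")


if __name__ == "__main__":
    # Sample payloads shaped like the expected responses above.
    docling = json.loads('{"document": {"md_content": "# Invoice"}, "status": "success"}')
    external = json.loads('{"page_content": "# Invoice", "metadata": {"source": "invoice.pdf"}}')
    assert extract_markdown(docling) == extract_markdown(external)

    request = build_external_request("http://localhost:8000", "invoice.pdf", b"%PDF-1.7")
    print(request.get_method(), request.full_url)
    # With Kreuzberg running, send it with: urllib.request.urlopen(request)
```
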
## Next steps

- [Docker deployment guide](../guides/docker.md): image variants, volumes, security hardening
- [API server reference](../guides/api-server.md): all endpoints and configuration options
- [OCR guide](../guides/ocr.md): language packs, engine selection, tuning
- [Format support](../reference/formats.md): full list of supported file types
71
docs/integrations/surrealdb.md
Normal file
@@ -0,0 +1,71 @@
# SurrealDB

The `kreuzberg-surrealdb` package connects Kreuzberg's document extraction pipeline to [SurrealDB](https://surrealdb.com/). It handles schema creation, content deduplication, optional chunking and embedding, and index configuration.

[PyPI](https://pypi.org/project/kreuzberg-surrealdb/) · [License](https://github.com/kreuzberg-dev/kreuzberg-surrealdb/blob/main/LICENSE)

## How it works

```mermaid
flowchart LR
    Input[Documents] --> Kreuzberg[Kreuzberg Extraction]
    Kreuzberg --> Connector[Integration Connector]
    Connector --> Schema[Auto Schema Setup]
    Connector --> Dedup[Content Deduplication]
    Connector --> Store[Storage & Indexing]
    Store --> Search[Search & Retrieval]

    style Kreuzberg fill:#87CEEB
    style Connector fill:#FFD700
    style Search fill:#90EE90
```

1. **Extract**: Kreuzberg parses the source documents and runs OCR where needed.
2. **Connect**: The connector receives the extracted output and manages the SurrealDB connection.
3. **Store**: Each document is hashed (SHA-256) for deduplication, optionally chunked and embedded, then written to SurrealDB under an auto-generated schema.
4. **Search**: Full-text (BM25), vector (HNSW), and hybrid (RRF) search are available immediately after ingestion.
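
Step 4 mentions hybrid search with reciprocal rank fusion (RRF). As a rough illustration of the idea only, the sketch below fuses two ranked lists in plain Python. The function name, the sample document IDs, and the constant `k = 60` (a commonly used default) are assumptions made for this example; they are not taken from `kreuzberg-surrealdb` or SurrealDB, which may compute the fusion differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists: each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]      # hypothetical full-text ranking
    vector_hits = ["b", "c", "d"]    # hypothetical vector ranking
    for doc_id, score in reciprocal_rank_fusion([bm25_hits, vector_hits]):
        print(doc_id, round(score, 5))
```

Documents that appear in both lists accumulate two contributions, so they tend to rise above documents that rank well in only one list.
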
## Key capabilities

- **Schema management**: `setup_schema()` creates tables, indices, and analyzers. No manual DDL required.
- **Deduplication**: Deterministic record IDs derived from content hashes prevent duplicate rows across ingestion runs.
- **Flexible ingestion**: Single files, file lists, directories (with glob), or raw bytes.
- **Extraction control**: Pass Kreuzberg's `ExtractionConfig` to set OCR behavior, output format, and quality processing.
- **Batch tuning**: Adjust `insert_batch_size` to balance throughput against memory usage.
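
To make the deduplication and batching ideas concrete, here is a minimal sketch that uses an in-memory dictionary in place of SurrealDB. Everything in it is hypothetical: `content_record_id`, `batched`, `ingest`, and the `table:hash` ID format are invented for illustration and are not the package's actual implementation. Only the SHA-256 hashing and the `insert_batch_size` name come from this page.

```python
import hashlib
from collections.abc import Iterable, Iterator


def content_record_id(content: bytes, table: str = "documents") -> str:
    """Derive a deterministic ID from the SHA-256 hash of the content."""
    return f"{table}:{hashlib.sha256(content).hexdigest()}"


def batched(items: Iterable[bytes], batch_size: int) -> Iterator[list[bytes]]:
    """Yield lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[bytes] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest(store: dict[str, bytes], documents: Iterable[bytes],
           insert_batch_size: int = 100) -> int:
    """Write documents batch by batch; return how many new records were created."""
    created = 0
    for batch in batched(documents, insert_batch_size):
        for content in batch:
            record_id = content_record_id(content)
            if record_id not in store:  # same content, same ID: skip the duplicate
                store[record_id] = content
                created += 1
    return created


if __name__ == "__main__":
    store: dict[str, bytes] = {}
    docs = [b"first report", b"second report", b"first report"]
    print(ingest(store, docs, insert_batch_size=2))  # first run
    print(ingest(store, docs, insert_batch_size=2))  # re-ingesting the same files
```

Because the ID depends only on the content, running the ingestion again creates no new records. A smaller batch size holds fewer documents in memory at once at the cost of more write round trips.
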
## Installation

```bash
pip install kreuzberg-surrealdb
```

Requires Python 3.10+. You also need a running SurrealDB instance:

```bash
docker run --rm -p 8000:8000 surrealdb/surrealdb:latest start --allow-all --user root --pass root
```

## Quick start

```python
from kreuzberg_surrealdb import DocumentPipeline

# Assumes `db` is an already connected SurrealDB client and that this code
# runs inside an async function (for example, one started with asyncio.run).
pipeline = DocumentPipeline(db=db, embed=True, embedding_model="balanced")
await pipeline.setup_schema()
await pipeline.ingest_directory("./papers", glob="**/*.pdf")
```

## Choosing a class

The package provides two entry points, `DocumentConnector` and `DocumentPipeline`. Choose based on whether you need chunking and embeddings.

|            | `DocumentConnector`                 | `DocumentPipeline`                    | `DocumentPipeline(embed=False)` |
| ---------- | ----------------------------------- | ------------------------------------- | ------------------------------- |
| Stores     | Full documents                      | Documents + chunks                    | Documents + chunks              |
| Embeddings | No                                  | Yes (configurable)                    | No                              |
| Indices    | BM25 on documents                   | BM25 + HNSW on chunks                 | BM25 on chunks                  |
| Best for   | Keyword search over whole documents | Semantic or hybrid search over chunks | Keyword search over chunks      |

For the complete API reference, embedding model options, chunking configuration, and database schema details, see the [kreuzberg-surrealdb readme](https://github.com/kreuzberg-dev/kreuzberg-surrealdb). For general SurrealDB usage, see the [SurrealDB docs](https://surrealdb.com/docs).