Nomad changes

2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions
--- a/docs/integrations/index.md
+++ b/docs/integrations/index.md
@@ -0,0 +1,20 @@
+# Integrations
+
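+Kreuzberg integrates with AI frameworks, databases, and search engines, bringing document extraction into your existing stack. Most integrations are standalone packages published on PyPI; the Open WebUI integration is built into Kreuzberg, and the Spring AI document reader is published on Maven Central.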
+
+---
+
+## Available integrations
+
+| Integration | Framework                                         | Package                                                                                                                          | Docs                                                                                                |
+| ----------- | ------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
+| Open WebUI  | [Open WebUI](https://openwebui.com/)              | Built-in (v4.7.0+)                                                                                                               | [Open WebUI](openwebui.md)                                                                          |
+| LangChain   | [LangChain](https://www.langchain.com/)           | [`langchain-kreuzberg`](https://pypi.org/project/langchain-kreuzberg/)                                                           | [GitHub](https://github.com/kreuzberg-dev/langchain-kreuzberg)                                      |
+| LlamaIndex  | [LlamaIndex](https://www.llamaindex.ai/)          | [`llama-index-readers-kreuzberg`](https://pypi.org/project/llama-index-readers-kreuzberg/)                                       | [GitHub](https://github.com/kreuzberg-dev/llama-index-kreuzberg)                                    |
+| Haystack    | [Haystack](https://haystack.deepset.ai/)          | [`kreuzberg-haystack`](https://pypi.org/project/kreuzberg-haystack/)                                                             | [GitHub](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/kreuzberg) |
+| CrewAI      | [CrewAI](https://www.crewai.com/)                 | [`kreuzberg-crewai`](https://pypi.org/project/kreuzberg-crewai/)                                                                 | [GitHub](https://github.com/kreuzberg-dev/kreuzberg-crewai)                                         |
+| txtAI       | [txtAI](https://neuml.github.io/txtai/)           | [`kreuzberg-txtai`](https://pypi.org/project/kreuzberg-txtai/)                                                                   | [GitHub](https://github.com/kreuzberg-dev/kreuzberg-txtai)                                          |
+| SurrealDB   | [SurrealDB](https://surrealdb.com/)               | [`kreuzberg-surrealdb`](https://pypi.org/project/kreuzberg-surrealdb/)                                                           | [SurrealDB](surrealdb.md)                                                                           |
+| Spring AI   | [Spring AI](https://spring.io/projects/spring-ai) | [`kreuzberg-spring-ai-document-reader`](https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg-spring-ai-document-reader) | [GitHub](https://github.com/kreuzberg-dev/kreuzberg-spring-ai)                                      |
+
+!!! Tip "Building a new integration?"
+
+    Explore existing integrations on [GitHub](https://github.com/kreuzberg-dev) for reference.
--- a/docs/integrations/openwebui.md
+++ b/docs/integrations/openwebui.md
@@ -0,0 +1,168 @@
+# Open WebUI
+
+![Kreuzberg](https://img.shields.io/badge/kreuzberg-v4.7.0+-blue)
+
+Open WebUI supports pluggable content extraction backends. Kreuzberg implements two of those backend APIs, the **docling-serve** endpoint and the **external document loader** endpoint, so it works as a drop-in replacement without patching Open WebUI.
+
+## How it works
+
+1. A user uploads a document (PDF, DOCX, image, etc.) in Open WebUI.
+2. Open WebUI sends the file to Kreuzberg's API endpoint.
+3. Kreuzberg extracts the content, running OCR where needed, and returns Markdown.
+4. Open WebUI stores the Markdown in its vector database for retrieval-augmented generation.
+
+Kreuzberg supports [90+ file formats](../reference/formats.md) and requires no GPU.
+
+## Prerequisites
+
+- Docker and Docker Compose (v2)
+- Open WebUI running or ready to deploy
+- No GPU required — Kreuzberg runs entirely on CPU
+
+## Setup with Docker Compose
+
+This is the fastest way to get both services running together.
+
+```yaml title="docker-compose.yaml"
+services:
+  kreuzberg:
+    image: ghcr.io/kreuzberg-dev/kreuzberg:latest-core
+    ports:
+      - "8000:8000"
+    command: ["serve", "--host", "0.0.0.0", "--port", "8000"]
+    volumes:
+      - kreuzberg-cache:/app/.kreuzberg
+    healthcheck:
+      test: ["CMD", "kreuzberg", "version"]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+
+  open-webui:
+    image: ghcr.io/open-webui/open-webui:main
+    ports:
+      - "3000:8080"
+    environment:
+      CONTENT_EXTRACTION_ENGINE: "docling"
+      DOCLING_SERVER_URL: "http://kreuzberg:8000"
+    depends_on:
+      kreuzberg:
+        condition: service_healthy
+
+volumes:
+  kreuzberg-cache:
+```
+
+Start both services in detached mode:
+
+```bash
+docker compose up -d
+```
+
+Open `http://localhost:3000`, create an account, and upload a document. The extracted text will appear in the chat context.
+
+!!! Note "Cache volume"
+
+    The `kreuzberg-cache` volume persists OCR models and embedding weights across restarts. Without it, models re-download on every container restart (~90 MB–1.2 GB depending on configuration).
+
+!!! Info "Already running Open WebUI?"
+
+    Start Kreuzberg separately, then point Open WebUI to that Kreuzberg URL.
+
+=== "Docker"
+
+    ```bash
+    docker run -d \
+      --name kreuzberg \
+      -p 8000:8000 \
+      -v kreuzberg-cache:/app/.kreuzberg \
+      ghcr.io/kreuzberg-dev/kreuzberg:latest-core \
+      serve --host 0.0.0.0 --port 8000
+    ```
+
+=== "CLI (Homebrew / Cargo)"
+
+    ```bash
+    kreuzberg serve --host 0.0.0.0 --port 8000
+    ```
+
+Then configure Open WebUI using one of the two engine modes below.
+
+## Choosing an engine mode
+
+Kreuzberg exposes two Open WebUI–compatible APIs. Both return the same extracted content, so pick whichever fits your setup.
+
+|                    | **Docling** (recommended) | **External**                   |
+| ------------------ | ------------------------- | ------------------------------ |
+| **Endpoint**       | `POST /v1/convert/file`   | `PUT /process`                 |
+| **Engine setting** | `docling`                 | `external`                     |
+| **URL variable**   | `DOCLING_SERVER_URL`      | `EXTERNAL_DOCUMENT_LOADER_URL` |
+
+=== "Docling (recommended)"
+
+    Set these environment variables on the Open WebUI container:
+
+    ```yaml
+    environment:
+      CONTENT_EXTRACTION_ENGINE: "docling"
+      DOCLING_SERVER_URL: "http://kreuzberg:8000"
+    ```
+
+    Or via the Admin UI: **Settings → Documents → Content Extraction Engine** → select **Docling** → set server URL to `http://kreuzberg:8000`.
+
+=== "External"
+
+    Set these environment variables on the Open WebUI container:
+
+    ```yaml
+    environment:
+      CONTENT_EXTRACTION_ENGINE: "external"
+      EXTERNAL_DOCUMENT_LOADER_URL: "http://kreuzberg:8000"
+    ```
+
+    Or via the Admin UI: **Settings → Documents → Content Extraction Engine** → select **External** → set URL to `http://kreuzberg:8000`.
+
+!!! Tip
+
+    If Kreuzberg runs on a different host or port, replace `http://kreuzberg:8000` with the actual address. Inside Docker Compose, use the service name (`kreuzberg`). Outside Docker, use the host IP or `localhost`.
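+
+The two modes differ only in the engine value and in the name of the URL variable. The Python sketch below makes that explicit. The `openwebui_env` helper is hypothetical and is not part of Kreuzberg or Open WebUI; it only assembles the variables listed in the table above.
+
+```python
+# Hypothetical helper: builds the Open WebUI environment for either engine mode.
+ENGINE_URL_VARIABLE = {
+    "docling": "DOCLING_SERVER_URL",
+    "external": "EXTERNAL_DOCUMENT_LOADER_URL",
+}
+
+
+def openwebui_env(engine: str, kreuzberg_url: str = "http://kreuzberg:8000") -> dict[str, str]:
+    """Return the environment variables Open WebUI needs for the given engine mode."""
+    if engine not in ENGINE_URL_VARIABLE:
+        raise ValueError(f"unknown engine mode: {engine!r}")
+    return {
+        "CONTENT_EXTRACTION_ENGINE": engine,
+        ENGINE_URL_VARIABLE[engine]: kreuzberg_url,
+    }
+
+
+if __name__ == "__main__":
+    # Print the variables for a Kreuzberg server reachable on the host.
+    for name, value in openwebui_env("external", "http://localhost:8000").items():
+        print(f"{name}={value}")
+```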
+
+## Verify it works
+
+Test the endpoints directly before debugging through Open WebUI.
+
+=== "Docling endpoint"
+
+    ```bash
+    curl -s -F "files=@invoice.pdf" http://localhost:8000/v1/convert/file | jq .
+    ```
+
+    ```json title="Expected response"
+    {
+      "document": {
+        "md_content": "# Invoice\n\nDate: 2026-01-15\n..."
+      },
+      "status": "success"
+    }
+    ```
+
+=== "External endpoint"
+
+    ```bash
+    curl -s -X PUT \
+      -H "Content-Type: application/pdf" \
+      -H "X-Filename: invoice.pdf" \
+      --data-binary @invoice.pdf \
+      http://localhost:8000/process | jq .
+    ```
+
+    ```json title="Expected response"
+    {
+      "page_content": "# Invoice\n\nDate: 2026-01-15\n...",
+      "metadata": {
+        "source": "invoice.pdf"
+      }
+    }
+    ```
+
+If the endpoint returns extracted text, the integration is working. Upload a document through Open WebUI to confirm end-to-end.
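+
+The same check can be scripted. The sketch below uses only the Python standard library to build the `PUT /process` request from the curl example and to read the Markdown out of either response shape shown above. The function names are illustrative, and the response fields are assumed to match the expected responses exactly.
+
+```python
+import json
+import urllib.request
+from pathlib import Path
+
+
+def build_process_request(data: bytes, filename: str, content_type: str,
+                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
+    """Build the PUT /process request used by the external document loader mode."""
+    return urllib.request.Request(
+        f"{base_url}/process",
+        data=data,
+        method="PUT",
+        headers={"Content-Type": content_type, "X-Filename": filename},
+    )
+
+
+def markdown_from_response(payload: dict) -> str:
+    """Return the extracted Markdown from a docling-style or external-style response."""
+    if "document" in payload:       # docling endpoint: {"document": {"md_content": ...}}
+        return payload["document"]["md_content"]
+    return payload["page_content"]  # external endpoint: {"page_content": ...}
+
+
+def extract(path: str, content_type: str = "application/pdf") -> str:
+    """Send a local file to Kreuzberg and return the extracted Markdown."""
+    file_path = Path(path)
+    request = build_process_request(file_path.read_bytes(), file_path.name, content_type)
+    with urllib.request.urlopen(request, timeout=60) as response:
+        return markdown_from_response(json.load(response))
+```
+
+With a Kreuzberg server listening on `localhost:8000`, calling `extract("invoice.pdf")` should return the text that the curl command prints under `page_content`.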
+
+## Next steps
+
+- [Docker deployment guide](../guides/docker.md) — image variants, volumes, security hardening
+- [API server reference](../guides/api-server.md) — all endpoints and configuration options
+- [OCR guide](../guides/ocr.md) — language packs, engine selection, tuning
+- [Format support](../reference/formats.md) — full list of supported file types
--- a/docs/integrations/surrealdb.md
+++ b/docs/integrations/surrealdb.md
@@ -0,0 +1,71 @@
+# SurrealDB
+
+The `kreuzberg-surrealdb` package connects Kreuzberg's document extraction pipeline to [SurrealDB](https://surrealdb.com/). It handles schema creation, content deduplication, optional chunking and embedding, and index configuration.
+
+[![PyPI](https://img.shields.io/pypi/v/kreuzberg-surrealdb)](https://pypi.org/project/kreuzberg-surrealdb/)
+[![Python](https://img.shields.io/pypi/pyversions/kreuzberg-surrealdb)](https://pypi.org/project/kreuzberg-surrealdb/)
+[![License](https://img.shields.io/pypi/l/kreuzberg-surrealdb)](https://github.com/kreuzberg-dev/kreuzberg-surrealdb/blob/main/LICENSE)
+
+## How it works
+
+```mermaid
+flowchart LR
+    Input[Documents] --> Kreuzberg[Kreuzberg Extraction]
+    Kreuzberg --> Connector[Integration Connector]
+    Connector --> Schema[Auto Schema Setup]
+    Connector --> Dedup[Content Deduplication]
+    Connector --> Store[Storage & Indexing]
+    Store --> Search[Search & Retrieval]
+
+    style Kreuzberg fill:#87CEEB
+    style Connector fill:#FFD700
+    style Search fill:#90EE90
+```
+
+1. **Extract** — Kreuzberg parses the source documents and runs OCR where needed.
+2. **Connect** — The connector receives the extracted output and manages the SurrealDB connection.
+3. **Store** — Each document is hashed (SHA-256) for deduplication, optionally chunked and embedded, then written to SurrealDB under an auto-generated schema.
+4. **Search** — Full-text (BM25), vector (HNSW), and hybrid (RRF) search are available immediately after ingestion.
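+
+Hybrid search merges the BM25 and vector rankings with reciprocal rank fusion (RRF). The sketch below shows the standard RRF formula in plain Python to illustrate the idea. It is not the connector's or SurrealDB's implementation; the constant `k = 60` is a commonly used default and is assumed here, and the record IDs are made up.
+
+```python
+from collections import defaultdict
+
+
+def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
+    """Fuse ranked ID lists: each ID scores sum(1 / (k + rank)) over the lists containing it."""
+    scores: dict[str, float] = defaultdict(float)
+    for ranking in rankings:
+        for rank, doc_id in enumerate(ranking, start=1):
+            scores[doc_id] += 1.0 / (k + rank)
+    # Highest fused score first; ties are broken by ID for a deterministic order.
+    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
+
+
+bm25_hits = ["doc:a", "doc:b", "doc:c"]    # hypothetical full-text ranking
+vector_hits = ["doc:c", "doc:a", "doc:d"]  # hypothetical vector ranking
+print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
+```
+
+Records that rank well in both lists rise to the top, while records found by only one method still appear further down.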
+
+## Key capabilities
+
+- **Schema management** — `setup_schema()` creates tables, indices, and analyzers. No manual DDL required.
+- **Deduplication** — Deterministic record IDs derived from content hashes prevent duplicate rows across ingestion runs.
+- **Flexible ingestion** — Single files, file lists, directories (with glob), or raw bytes.
+- **Extraction control** — Pass Kreuzberg's `ExtractionConfig` to set OCR behavior, output format, and quality processing.
+- **Batch tuning** — Adjust `insert_batch_size` to balance throughput against memory usage.
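+
+Content-hash deduplication can be pictured as follows. This is a sketch of the general technique, assuming the record ID is simply the SHA-256 hex digest of the extracted content. The connector's actual ID format and table names are not specified here, so `record_id`, the `documents` table name, and the in-memory `store` are hypothetical stand-ins.
+
+```python
+import hashlib
+
+
+def record_id(content: str, table: str = "documents") -> str:
+    """Derive a deterministic record ID from the content (hypothetical ID format)."""
+    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
+    return f"{table}:{digest}"
+
+
+def ingest(store: dict[str, str], content: str) -> bool:
+    """Insert content keyed by its hash; return False if the same content is already stored."""
+    key = record_id(content)
+    if key in store:
+        return False
+    store[key] = content
+    return True
+
+
+store: dict[str, str] = {}
+print(ingest(store, "Quarterly report"))  # True: first time this content is seen
+print(ingest(store, "Quarterly report"))  # False: same hash, so no duplicate row
+print(len(store))                         # 1
+```
+
+Because the ID depends only on the content, ingesting the same file again in a later run maps to the same record instead of creating a new one.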
+
+## Installation
+
+```bash
+pip install kreuzberg-surrealdb
+```
+
+Requires Python 3.10+. You also need a running SurrealDB instance:
+
+```bash
+docker run --rm -p 8000:8000 surrealdb/surrealdb:latest start --allow-all --user root --pass root
+```
+
+## Quick start
+
+```python
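+from kreuzberg_surrealdb import DocumentPipeline
+
+
+async def ingest_papers(db) -> None:
+    # `db` is an already-connected SurrealDB client; connection setup is not shown here.
+    pipeline = DocumentPipeline(db=db, embed=True, embedding_model="balanced")
+    await pipeline.setup_schema()  # creates tables, indices, and analyzers
+    await pipeline.ingest_directory("./papers", glob="**/*.pdf")  # ingests every PDF under ./papers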
+```
+
+## Choosing a class
+
+The package provides two entry points, `DocumentConnector` and `DocumentPipeline`, and the pipeline can run with or without embeddings. Choose based on whether you need chunking and embeddings.
+
+|            | `DocumentConnector`                 | `DocumentPipeline`                    | `DocumentPipeline(embed=False)` |
+| ---------- | ----------------------------------- | ------------------------------------- | ------------------------------- |
+| Stores     | Full documents                      | Documents + chunks                    | Documents + chunks              |
+| Embeddings | No                                  | Yes (configurable)                    | No                              |
+| Indices    | BM25 on documents                   | BM25 + HNSW on chunks                 | BM25 on chunks                  |
+| Best for   | Keyword search over whole documents | Semantic or hybrid search over chunks | Keyword search over chunks      |
+
+For the complete API reference, embedding model options, chunking configuration, and database schema details, see the [kreuzberg-surrealdb readme](https://github.com/kreuzberg-dev/kreuzberg-surrealdb). For general SurrealDB usage, see the [SurrealDB docs](https://surrealdb.com/docs).