Nomad changes

2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions
--- a/docs/migration/from-unstructured.md
+++ b/docs/migration/from-unstructured.md
@@ -0,0 +1,322 @@
+# Migrating from Unstructured to Kreuzberg
+
+This guide helps you migrate from Unstructured.io to Kreuzberg for document intelligence workloads.
+
+## Quick Start
+
+**Unstructured API**:
+
+```bash
+curl -X POST "https://api.unstructured.io/general/v0/general" \
+  -F 'files=@document.pdf'
+```
+
+**Kreuzberg API**:
+
+```bash
+curl -X POST "http://localhost:8080/extract" \
+  -F 'files=@document.pdf' \
+  -F 'output_format=element_based'
+```
+
+## Output Format Comparison
+
+### Unified Output (Default)
+
+Kreuzberg's default output carries richer metadata than Unstructured's element list:
+
+**Kreuzberg Unified**:
+
+```json
+{
+  "content": "Full document text...",
+  "mime_type": "application/pdf",
+  "metadata": {
+    "title": "Document Title",
+    "authors": ["Author Name"],
+    "created_at": "2024-01-15T10:30:00Z",
+    "format": {
+      "format_type": "pdf",
+      "page_count": 10,
+      "version": "1.7"
+    }
+  },
+  "tables": [...],
+  "images": [...],
+  "pages": [...]
+}
+```
+
+### Element-Based Output
+
+**Kreuzberg** (when `output_format=element_based`):
+
+```json
+{
+  "elements": [
+    {
+      "element_id": "elem-a3f2b1c4",
+      "element_type": "title",
+      "text": "Introduction",
+      "metadata": {
+        "page_number": 1,
+        "filename": "Document Title",
+        "coordinates": {
+          "x0": 72.0,
+          "y0": 100.0,
+          "x1": 540.0,
+          "y1": 130.0
+        },
+        "element_index": 0,
+        "additional": {
+          "level": "h1",
+          "font_size": "24.0"
+        }
+      }
+    },
+    {
+      "element_type": "narrative_text",
+      "text": "This is a paragraph...",
+      "metadata": {
+        "page_number": 1
+      }
+    }
+  ]
+}
+```
+
+**Unstructured**:
+
+```json
+[
+  {
+    "type": "Title",
+    "text": "Introduction",
+    "metadata": {
+      "page_number": 1,
+      "filename": "document.pdf"
+    }
+  },
+  {
+    "type": "NarrativeText",
+    "text": "This is a paragraph...",
+    "metadata": {
+      "page_number": 1
+    }
+  }
+]
+```
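+
+If downstream code still expects the Unstructured shape, a thin adapter can bridge the two payloads during migration. The sketch below relies only on the field names shown in the two JSON samples above; `to_unstructured_shape` is a hypothetical helper, not part of either library.
+
+```python
+def to_unstructured_shape(kreuzberg_response: dict) -> list[dict]:
+    """Reshape a Kreuzberg element-based payload into an Unstructured-style list."""
+    converted = []
+    for element in kreuzberg_response.get("elements", []):
+        # snake_case to PascalCase, e.g. "narrative_text" -> "NarrativeText"
+        type_name = "".join(
+            part.capitalize() for part in element["element_type"].split("_")
+        )
+        converted.append(
+            {
+                "type": type_name,
+                "text": element["text"],
+                # Copy the metadata so the original payload is left untouched
+                "metadata": dict(element.get("metadata", {})),
+            }
+        )
+    return converted
+
+
+sample = {
+    "elements": [
+        {"element_type": "title", "text": "Introduction",
+         "metadata": {"page_number": 1}},
+        {"element_type": "narrative_text", "text": "This is a paragraph...",
+         "metadata": {"page_number": 1}},
+    ]
+}
+legacy = to_unstructured_shape(sample)
+print([element["type"] for element in legacy])
+```
+
+Kreuzberg-only metadata such as `coordinates` and `element_index` passes through unchanged, so consumers that ignore unknown keys should keep working.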
+
+## API Endpoint Mapping
+
+| Unstructured               | Kreuzberg          | Notes                             |
+| -------------------------- | ------------------ | --------------------------------- |
+| `POST /general/v0/general` | `POST /extract`    | Single/batch extraction           |
+| N/A                        | `POST /embed`      | Built-in embeddings (ONNX models) |
+| N/A                        | `GET /health`      | Health check                      |
+| N/A                        | `GET /cache/stats` | Cache statistics                  |
+
+## Element Type Mapping
+
+| Unstructured    | Kreuzberg        | Notes                               |
+| --------------- | ---------------- | ----------------------------------- |
+| `Title`         | `title`          | PDF hierarchy (h1-h6) detection     |
+| `NarrativeText` | `narrative_text` | Paragraphs split on double newlines |
+| `ListItem`      | `list_item`      | Bullets, numbered, lettered         |
+| `Table`         | `table`          | Tab-separated text representation   |
+| `Image`         | `image`          | Format, dimensions in metadata      |
+| `PageBreak`     | `page_break`     | Between pages in multi-page docs    |
+| `Header`        | `header`         | Page header text                    |
+| `Footer`        | `footer`         | Page footer text                    |
+| N/A             | `heading`        | Section headings (beyond title)     |
+| N/A             | `code_block`     | Code snippets                       |
+| N/A             | `block_quote`    | Quoted text blocks                  |
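+
+When existing code filters on Unstructured type names, the table above can be kept as a lookup so that unmapped names fail loudly instead of silently matching nothing. This is a sketch; `UNSTRUCTURED_TO_KREUZBERG` and `translate_type_filter` are hypothetical names introduced for illustration.
+
+```python
+# Rows of the mapping table that have an Unstructured counterpart
+UNSTRUCTURED_TO_KREUZBERG = {
+    "Title": "title",
+    "NarrativeText": "narrative_text",
+    "ListItem": "list_item",
+    "Table": "table",
+    "Image": "image",
+    "PageBreak": "page_break",
+    "Header": "header",
+    "Footer": "footer",
+}
+
+# Kreuzberg types the table lists as N/A on the Unstructured side
+KREUZBERG_ONLY = {"heading", "code_block", "block_quote"}
+
+
+def translate_type_filter(unstructured_types):
+    """Translate a collection of Unstructured type names into Kreuzberg names."""
+    unknown = [t for t in unstructured_types if t not in UNSTRUCTURED_TO_KREUZBERG]
+    if unknown:
+        raise KeyError(f"No Kreuzberg mapping for: {unknown}")
+    return {UNSTRUCTURED_TO_KREUZBERG[t] for t in unstructured_types}
+
+
+wanted = translate_type_filter(["Title", "ListItem"])
+print(sorted(wanted))
+```
+
+Whether a filter on `title` should also admit `heading` depends on how your documents are structured; that choice is left to the caller.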
+
+## Code Examples
+
+### Python
+
+**Unstructured**:
+
+```python
+from unstructured.partition.auto import partition
+
+elements = partition(filename="document.pdf")
+for element in elements:
+    print(f"{element.category}: {element.text}")
+```
+
+**Kreuzberg**:
+
+```python
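+from kreuzberg import extract_bytes
+
+# Read the document into memory first
+with open("document.pdf", "rb") as f:
+    pdf_bytes = f.read()
+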
+# Option 1: Element-based output
+config = {"output_format": "element_based"}
+result = extract_bytes(pdf_bytes, "application/pdf", config)
+
+for element in result.elements:
+    print(f"{element.element_type}: {element.text}")
+    if element.metadata.page_number:
+        print(f"  Page: {element.metadata.page_number}")
+
+# Option 2: Unified output (default, richer metadata)
+result = extract_bytes(pdf_bytes, "application/pdf")
+print(result.content)  # Full text
+print(result.metadata.title)  # Document metadata
+for page in result.pages:
+    print(f"Page {page.page_number}: {page.content[:100]}")
+```
+
+### TypeScript
+
+**Unstructured** (via API):
+
+```typescript
+const formData = new FormData();
+formData.append("files", fileBlob);
+
+const response = await fetch("https://api.unstructured.io/general/v0/general", {
+  method: "POST",
+  body: formData,
+});
+const elements = await response.json();
+```
+
+**Kreuzberg**:
+
+```typescript
+import { extractBytes } from "kreuzberg";
+
+// Option 1: Element-based output
+const result = await extractBytes(pdfBuffer, "application/pdf", {
+  output_format: "element_based",
+});
+
+for (const element of result.elements) {
+  console.log(`${element.element_type}: ${element.text}`);
+}
+
+// Option 2: Unified output with pages
+// (named pagedResult because `result` is already declared above)
+const pagedResult = await extractBytes(pdfBuffer, "application/pdf", {
+  pages: { extract_pages: true },
+});
+
+for (const page of pagedResult.pages) {
+  console.log(`Page ${page.page_number}:`, page.content);
+}
+```
+
+### cURL
+
+**Unstructured**:
+
+```bash
+curl -X POST "https://api.unstructured.io/general/v0/general" \
+  -H "unstructured-api-key: $API_KEY" \
+  -F 'files=@document.pdf' \
+  -F 'strategy=hi_res'
+```
+
+**Kreuzberg**:
+
+```bash
+# Element-based output
+curl -X POST "http://localhost:8080/extract" \
+  -F 'files=@document.pdf' \
+  -F 'output_format=element_based'
+
+# With configuration JSON
+curl -X POST "http://localhost:8080/extract" \
+  -F 'files=@document.pdf' \
+  -F 'config={"output_format":"element_based","pages":{"extract_pages":true}}'
+```
+
+## Feature Comparison
+
+### What Kreuzberg Adds
+
+1. **Richer Metadata**: Format-specific discriminated unions (PDF, Excel, Email, etc.)
+2. **Native Per-Page**: `PageContent` with byte offsets, hierarchy, tables, images per page
+3. **90+ Formats**: vs Unstructured's ~30 formats
+4. **Performance**: Rust-based native implementation (vs Python-based)
+5. **10 Language Bindings**: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, Rust, WASM
+6. **Built-in Embeddings**: ONNX models via `/embed` endpoint (no external API)
+7. **Smart Hierarchy**: PDF font-size clustering for h1-h6 detection
+8. **Bounding Boxes**: Preserved from PDF source in element coordinates
+
+### What Unstructured Has
+
+1. **Layout Detection Models**: ML-based layout analysis (GPU-accelerated)
+2. **Cloud API**: Hosted service (Kreuzberg requires self-hosting)
+3. **More Element Types**: More granular element classification
+4. **Mature Ecosystem**: Larger community, more integrations
+
+## Configuration Mapping
+
+| Unstructured Parameter                | Kreuzberg Config                     | Notes                              |
+| ------------------------------------- | ------------------------------------ | ---------------------------------- |
+| `strategy=hi_res`                     | `pdf_options.hierarchy.enabled=true` | PDF hierarchy extraction           |
+| `coordinates=true`                    | Always included when available       | Bounding boxes in element metadata |
+| `languages=["eng"]`                   | `ocr.language="eng"`                 | OCR language                       |
+| `extract_image_block_types=["image"]` | `images.extract_images=true`         | Image extraction                   |
+| `chunking_strategy="by_title"`        | `chunking.max_chars=1000`            | Text chunking (basic)              |
+| `embedding_model="..."`               | `chunking.embedding.model="..."`     | Embedding generation               |
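+
+The table can be applied mechanically to an existing set of Unstructured parameters. The sketch below assumes that the dotted Kreuzberg keys correspond to nested config objects and that a `languages` list collapses to its first entry; both are assumptions, and `translate_config` is a hypothetical helper.
+
+```python
+def translate_config(params: dict) -> dict:
+    """Build a Kreuzberg-style config dict from Unstructured request parameters."""
+    config: dict = {}
+    if params.get("strategy") == "hi_res":
+        config.setdefault("pdf_options", {}).setdefault("hierarchy", {})["enabled"] = True
+    languages = params.get("languages")
+    if languages:
+        # Assumption: only the first language is carried over
+        config.setdefault("ocr", {})["language"] = languages[0]
+    if params.get("extract_image_block_types"):
+        config.setdefault("images", {})["extract_images"] = True
+    if params.get("chunking_strategy"):
+        # The table maps by_title to a plain character limit (basic chunking)
+        config.setdefault("chunking", {})["max_chars"] = 1000
+    if params.get("embedding_model"):
+        chunking = config.setdefault("chunking", {})
+        chunking.setdefault("embedding", {})["model"] = params["embedding_model"]
+    # coordinates=true needs no entry: the table says they are always included
+    return config
+
+
+config = translate_config({"strategy": "hi_res", "languages": ["eng"]})
+print(config)
+```
+
+Title-aware chunking has no direct counterpart in the table, so chunk boundaries will likely differ after migration even when the rest of the config carries over.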
+
+## Migration Checklist
+
+- [ ] Update API endpoint URLs (Unstructured → Kreuzberg)
+- [ ] Add `output_format=element_based` if using element-based workflow
+- [ ] Update element type references (`Title` → `title`, PascalCase → snake_case)
+- [ ] Update metadata field references (Kreuzberg has richer metadata structure)
+- [ ] Test with sample documents to verify output equivalence
+- [ ] Update error handling (Kreuzberg uses HTTP 422 for validation errors)
+- [ ] Configure caching if needed (Kreuzberg has built-in file-based cache)
+- [ ] Set up embeddings if using RAG pipeline (Kreuzberg has built-in ONNX support)
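+
+The "verify output equivalence" step can be partly automated by comparing the two element lists position by position. This is a rough sketch: it compares only the type and whitespace-normalized text, and `normalize` and `compare_outputs` are hypothetical helpers.
+
+```python
+from itertools import zip_longest
+
+
+def normalize(element: dict) -> tuple[str, str]:
+    """Reduce an element from either tool to a comparable (type, text) pair."""
+    raw_type = element.get("element_type") or element.get("type", "")
+    # "NarrativeText" and "narrative_text" both become "narrativetext"
+    type_key = raw_type.replace("_", "").lower()
+    text = " ".join(element.get("text", "").split())
+    return type_key, text
+
+
+def compare_outputs(unstructured_elements, kreuzberg_elements):
+    """Return (index, old, new) for every position where the outputs differ."""
+    old = [normalize(e) for e in unstructured_elements]
+    new = [normalize(e) for e in kreuzberg_elements]
+    return [
+        (index, a, b)
+        for index, (a, b) in enumerate(zip_longest(old, new))
+        if a != b
+    ]
+
+
+old_output = [{"type": "Title", "text": "Introduction"}]
+new_output = [
+    {"element_type": "title", "text": "Introduction"},
+    {"element_type": "heading", "text": "Background"},
+]
+for index, old, new in compare_outputs(old_output, new_output):
+    print(index, old, new)
+```
+
+Expect some mismatches even on a healthy migration, since Kreuzberg emits types such as `heading` that Unstructured does not have.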
+
+## Advanced: Hybrid Approach
+
+You can use **both formats** simultaneously:
+
+```python
+from kreuzberg import extract_bytes
+
+# pdf_bytes holds the raw PDF; index_heading and process_section
+# stand in for your own handlers.
+result = extract_bytes(pdf_bytes, "application/pdf", {
+    "output_format": "element_based",  # Get elements
+    "pages": {"extract_pages": True}   # Also get per-page content
+})
+
+# Element-based processing
+for element in result.elements:
+    if element.element_type == "title":
+        index_heading(element.text)
+
+# Page-based processing
+for page in result.pages:
+    if page.hierarchy:
+        for block in page.hierarchy.blocks:
+            if block.level == "h1":
+                process_section(block.text)
+```
+
+## Performance Tips
+
+1. **Enable Caching**: `use_cache: true` (default) for repeated extractions
+2. **Disable OCR**: If documents are searchable PDFs, set `force_ocr: false`
+3. **Limit Page Extraction**: Only enable `pages` if you need per-page content
+4. **Batch Processing**: Send multiple files in a single request (up to 10MB total)
+5. **Use Embeddings Wisely**: Enable only for chunked content destined for vector DB
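+
+For the batch-processing tip, files can be grouped ahead of time so that each request stays under the size limit. The sketch below assumes "10MB" means 10 × 1024 × 1024 bytes and that the limit applies to the sum of the raw file sizes; `plan_batches` is a hypothetical helper.
+
+```python
+MAX_BATCH_BYTES = 10 * 1024 * 1024  # assumption: 10MB read as 10 MiB
+
+
+def plan_batches(file_sizes: dict[str, int], limit: int = MAX_BATCH_BYTES) -> list[list[str]]:
+    """Group file names, in order, into batches whose total size fits the limit."""
+    batches: list[list[str]] = []
+    current: list[str] = []
+    current_size = 0
+    for name, size in file_sizes.items():
+        if size > limit:
+            raise ValueError(f"{name} alone exceeds the batch limit")
+        # Start a new batch when adding this file would overflow the current one
+        if current and current_size + size > limit:
+            batches.append(current)
+            current, current_size = [], 0
+        current.append(name)
+        current_size += size
+    if current:
+        batches.append(current)
+    return batches
+
+
+sizes = {"a.pdf": 6_000_000, "b.pdf": 5_000_000, "c.pdf": 4_000_000}
+print(plan_batches(sizes))
+```
+
+In practice the sizes would come from `os.path.getsize`. Multipart encoding adds some overhead per file, so a slightly lower limit leaves a safety margin.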
+
+## Getting Help
+
+- **Documentation**: <https://github.com/kreuzberg-dev/Kreuzberg>
+- **Issues**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
+- **API Reference**: See `docs/api/` for endpoint documentation
+
+## Next Steps
+
+After migration:
+
+1. Review the [Kreuzberg vs Unstructured Comparison](../comparisons/kreuzberg-vs-unstructured.md)
+2. Explore Kreuzberg-specific features (hierarchy, per-page metadata, embeddings)
+3. Optimize your pipeline with native Rust performance