# Migrating from Unstructured to Kreuzberg This guide helps you migrate from Unstructured.io to Kreuzberg for document intelligence workloads. ## Quick Start **Unstructured API**: ```bash curl -X POST "https://api.unstructured.io/general/v0/general" \ -F 'files=@document.pdf' ``` **Kreuzberg API**: ```bash curl -X POST "http://localhost:8080/extract" \ -F 'files=@document.pdf' \ -F 'output_format=element_based' ``` ## Output Format Comparison ### Unified Output (Default) Kreuzberg's default output provides richer metadata than Unstructured: **Kreuzberg Unified**: ```json { "content": "Full document text...", "mime_type": "application/pdf", "metadata": { "title": "Document Title", "authors": ["Author Name"], "created_at": "2024-01-15T10:30:00Z", "format": { "format_type": "pdf", "page_count": 10, "version": "1.7" } }, "tables": [...], "images": [...], "pages": [...] } ``` ### Element-Based Output **Kreuzberg** (when `output_format=element_based`): ```json { "elements": [ { "element_id": "elem-a3f2b1c4", "element_type": "title", "text": "Introduction", "metadata": { "page_number": 1, "filename": "Document Title", "coordinates": { "x0": 72.0, "y0": 100.0, "x1": 540.0, "y1": 130.0 }, "element_index": 0, "additional": { "level": "h1", "font_size": "24.0" } } }, { "element_type": "narrative_text", "text": "This is a paragraph...", "metadata": { "page_number": 1 } } ] } ``` **Unstructured**: ```json [ { "type": "Title", "text": "Introduction", "metadata": { "page_number": 1, "filename": "document.pdf" } }, { "type": "NarrativeText", "text": "This is a paragraph...", "metadata": { "page_number": 1 } } ] ``` ## API Endpoint Mapping | Unstructured | Kreuzberg | Notes | | -------------------------- | ------------------ | --------------------------------- | | `POST /general/v0/general` | `POST /extract` | Single/batch extraction | | N/A | `POST /embed` | Built-in embeddings (ONNX models) | | N/A | `GET /health` | Health check | | N/A | `GET /cache/stats` | Cache statistics | ## Element Type Mapping | Unstructured | Kreuzberg | Notes | | --------------- | ---------------- | ----------------------------------- | | `Title` | `title` | PDF hierarchy (h1-h6) detection | | `NarrativeText` | `narrative_text` | Paragraphs split on double newlines | | `ListItem` | `list_item` | Bullets, numbered, lettered | | `Table` | `table` | Tab-separated text representation | | `Image` | `image` | Format, dimensions in metadata | | `PageBreak` | `page_break` | Between pages in multi-page docs | | `Header` | `header` | Page header text | | `Footer` | `footer` | Page footer text | | N/A | `heading` | Section headings (beyond title) | | N/A | `code_block` | Code snippets | | N/A | `block_quote` | Quoted text blocks | ## Code Examples ### Python **Unstructured**: ```python from unstructured.partition.auto import partition elements = partition(filename="document.pdf") for element in elements: print(f"{element.category}: {element.text}") ``` **Kreuzberg**: ```python from kreuzberg import extract_bytes # Option 1: Element-based output config = {"output_format": "element_based"} result = extract_bytes(pdf_bytes, "application/pdf", config) for element in result.elements: print(f"{element.element_type}: {element.text}") if element.metadata.page_number: print(f" Page: {element.metadata.page_number}") # Option 2: Unified output (default, richer metadata) result = extract_bytes(pdf_bytes, "application/pdf") print(result.content) # Full text print(result.metadata.title) # Document metadata for page in result.pages: print(f"Page {page.page_number}: {page.content[:100]}") ``` ### TypeScript **Unstructured** (via API): ```typescript const formData = new FormData(); formData.append("files", fileBlob); const response = await fetch("https://api.unstructured.io/general/v0/general", { method: "POST", body: formData, }); const elements = await response.json(); ``` **Kreuzberg**: ```typescript import { extractBytes } from "kreuzberg"; // Option 1: Element-based output const result = await extractBytes(pdfBuffer, "application/pdf", { output_format: "element_based", }); for (const element of result.elements) { console.log(`${element.element_type}: ${element.text}`); } // Option 2: Unified output with pages const result = await extractBytes(pdfBuffer, "application/pdf", { pages: { extract_pages: true }, }); for (const page of result.pages) { console.log(`Page ${page.page_number}:`, page.content); } ``` ### CURL **Unstructured**: ```bash curl -X POST "https://api.unstructured.io/general/v0/general" \ -H "unstructured-api-key: $API_KEY" \ -F 'files=@document.pdf' \ -F 'strategy=hi_res' ``` **Kreuzberg**: ```bash # Element-based output curl -X POST "http://localhost:8080/extract" \ -F 'files=@document.pdf' \ -F 'output_format=element_based' # With configuration JSON curl -X POST "http://localhost:8080/extract" \ -F 'files=@document.pdf' \ -F 'config={"output_format":"element_based","pages":{"extract_pages":true}}' ``` ## Feature Comparison ### What Kreuzberg Adds 1. **Richer Metadata**: Format-specific discriminated unions (PDF, Excel, Email, etc.) 2. **Native Per-Page**: `PageContent` with byte offsets, hierarchy, tables, images per page 3. **90+ Formats**: vs Unstructured's ~30 formats 4. **Performance**: Rust-based native implementation (vs Python-based) 5. **10 Language Bindings**: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, Rust, WASM 6. **Built-in Embeddings**: ONNX models via `/embed` endpoint (no external API) 7. **Smart Hierarchy**: PDF font-size clustering for h1-h6 detection 8. **Bounding Boxes**: Preserved from PDF source in element coordinates ### What Unstructured Has 1. **Layout Detection Models**: ML-based layout analysis (GPU-accelerated) 2. **Cloud API**: Hosted service (Kreuzberg requires self-hosting) 3. **More Element Types**: More granular element classification 4. **Mature Ecosystem**: Larger community, more integrations ## Configuration Mapping | Unstructured Parameter | Kreuzberg Config | Notes | | ------------------------------------- | ------------------------------------ | ---------------------------------- | | `strategy=hi_res` | `pdf_options.hierarchy.enabled=true` | PDF hierarchy extraction | | `coordinates=true` | Always included when available | Bounding boxes in element metadata | | `languages=["eng"]` | `ocr.language="eng"` | OCR language | | `extract_image_block_types=["image"]` | `images.extract_images=true` | Image extraction | | `chunking_strategy="by_title"` | `chunking.max_chars=1000` | Text chunking (basic) | | `embedding_model="..."` | `chunking.embedding.model="..."` | Embedding generation | ## Migration Checklist - [ ] Update API endpoint URLs (Unstructured → Kreuzberg) - [ ] Add `output_format=element_based` if using element-based workflow - [ ] Update element type references (`Title` → `title`, camelCase → snake_case) - [ ] Update metadata field references (Kreuzberg has richer metadata structure) - [ ] Test with sample documents to verify output equivalence - [ ] Update error handling (Kreuzberg uses HTTP 422 for validation errors) - [ ] Configure caching if needed (Kreuzberg has built-in file-based cache) - [ ] Set up embeddings if using RAG pipeline (Kreuzberg has built-in ONNX support) ## Advanced: Hybrid Approach You can use **both formats** simultaneously: ```python from kreuzberg import extract_bytes result = extract_bytes(pdf_bytes, "application/pdf", { "output_format": "element_based", # Get elements "pages": {"extract_pages": true} # Also get per-page content }) # Element-based processing for element in result.elements: if element.element_type == "title": index_heading(element.text) # Page-based processing for page in result.pages: if page.hierarchy: for block in page.hierarchy.blocks: if block.level == "h1": process_section(block.text) ``` ## Performance Tips 1. **Enable Caching**: `use_cache: true` (default) for repeated extractions 2. **Disable OCR**: If documents are searchable PDFs, set `force_ocr: false` 3. **Limit Page Extraction**: Only enable `pages` if you need per-page content 4. **Batch Processing**: Send multiple files in single request (up to 10MB total) 5. **Use Embeddings Wisely**: Enable only for chunked content destined for vector DB ## Getting Help - **Documentation**: - **Issues**: - **API Reference**: See `docs/api/` for endpoint documentation ## Next Steps After migration: 1. Review the [Kreuzberg vs Unstructured Comparison](../comparisons/kreuzberg-vs-unstructured.md) 2. Explore Kreuzberg-specific features (hierarchy, per-page metadata, embeddings) 3. Optimize your pipeline with native Rust performance