This commit is contained in:
322
docs/migration/from-unstructured.md
Normal file
322
docs/migration/from-unstructured.md
Normal file
@@ -0,0 +1,322 @@
|
||||
# Migrating from Unstructured to Kreuzberg
|
||||
|
||||
This guide helps you migrate from Unstructured.io to Kreuzberg for document intelligence workloads.
|
||||
|
||||
## Quick Start
|
||||
|
||||
**Unstructured API**:
|
||||
|
||||
```bash
|
||||
curl -X POST "https://api.unstructured.io/general/v0/general" \
|
||||
-F 'files=@document.pdf'
|
||||
```
|
||||
|
||||
**Kreuzberg API**:
|
||||
|
||||
```bash
|
||||
curl -X POST "http://localhost:8080/extract" \
|
||||
-F 'files=@document.pdf' \
|
||||
-F 'output_format=element_based'
|
||||
```
|
||||
|
||||
## Output Format Comparison
|
||||
|
||||
### Unified Output (Default)
|
||||
|
||||
Kreuzberg's default output provides richer metadata than Unstructured:
|
||||
|
||||
**Kreuzberg Unified**:
|
||||
|
||||
```json
|
||||
{
|
||||
"content": "Full document text...",
|
||||
"mime_type": "application/pdf",
|
||||
"metadata": {
|
||||
"title": "Document Title",
|
||||
"authors": ["Author Name"],
|
||||
"created_at": "2024-01-15T10:30:00Z",
|
||||
"format": {
|
||||
"format_type": "pdf",
|
||||
"page_count": 10,
|
||||
"version": "1.7"
|
||||
}
|
||||
},
|
||||
"tables": [...],
|
||||
"images": [...],
|
||||
"pages": [...]
|
||||
}
|
||||
```
|
||||
|
||||
### Element-Based Output
|
||||
|
||||
**Kreuzberg** (when `output_format=element_based`):
|
||||
|
||||
```json
|
||||
{
|
||||
"elements": [
|
||||
{
|
||||
"element_id": "elem-a3f2b1c4",
|
||||
"element_type": "title",
|
||||
"text": "Introduction",
|
||||
"metadata": {
|
||||
"page_number": 1,
|
||||
"filename": "Document Title",
|
||||
"coordinates": {
|
||||
"x0": 72.0,
|
||||
"y0": 100.0,
|
||||
"x1": 540.0,
|
||||
"y1": 130.0
|
||||
},
|
||||
"element_index": 0,
|
||||
"additional": {
|
||||
"level": "h1",
|
||||
"font_size": "24.0"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"element_type": "narrative_text",
|
||||
"text": "This is a paragraph...",
|
||||
"metadata": {
|
||||
"page_number": 1
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Unstructured**:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"type": "Title",
|
||||
"text": "Introduction",
|
||||
"metadata": {
|
||||
"page_number": 1,
|
||||
"filename": "document.pdf"
|
||||
}
|
||||
},
|
||||
{
|
||||
"type": "NarrativeText",
|
||||
"text": "This is a paragraph...",
|
||||
"metadata": {
|
||||
"page_number": 1
|
||||
}
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
## API Endpoint Mapping
|
||||
|
||||
| Unstructured | Kreuzberg | Notes |
|
||||
| -------------------------- | ------------------ | --------------------------------- |
|
||||
| `POST /general/v0/general` | `POST /extract` | Single/batch extraction |
|
||||
| N/A | `POST /embed` | Built-in embeddings (ONNX models) |
|
||||
| N/A | `GET /health` | Health check |
|
||||
| N/A | `GET /cache/stats` | Cache statistics |
|
||||
|
||||
## Element Type Mapping
|
||||
|
||||
| Unstructured | Kreuzberg | Notes |
|
||||
| --------------- | ---------------- | ----------------------------------- |
|
||||
| `Title` | `title` | PDF hierarchy (h1-h6) detection |
|
||||
| `NarrativeText` | `narrative_text` | Paragraphs split on double newlines |
|
||||
| `ListItem` | `list_item` | Bullets, numbered, lettered |
|
||||
| `Table` | `table` | Tab-separated text representation |
|
||||
| `Image` | `image` | Format, dimensions in metadata |
|
||||
| `PageBreak` | `page_break` | Between pages in multi-page docs |
|
||||
| `Header` | `header` | Page header text |
|
||||
| `Footer` | `footer` | Page footer text |
|
||||
| N/A | `heading` | Section headings (beyond title) |
|
||||
| N/A | `code_block` | Code snippets |
|
||||
| N/A | `block_quote` | Quoted text blocks |
|
||||
|
||||
## Code Examples
|
||||
|
||||
### Python
|
||||
|
||||
**Unstructured**:
|
||||
|
||||
```python
|
||||
from unstructured.partition.auto import partition
|
||||
|
||||
elements = partition(filename="document.pdf")
|
||||
for element in elements:
|
||||
print(f"{element.category}: {element.text}")
|
||||
```
|
||||
|
||||
**Kreuzberg**:
|
||||
|
||||
```python
|
||||
from kreuzberg import extract_bytes
|
||||
|
||||
# Option 1: Element-based output
|
||||
config = {"output_format": "element_based"}
|
||||
result = extract_bytes(pdf_bytes, "application/pdf", config)
|
||||
|
||||
for element in result.elements:
|
||||
print(f"{element.element_type}: {element.text}")
|
||||
if element.metadata.page_number:
|
||||
print(f" Page: {element.metadata.page_number}")
|
||||
|
||||
# Option 2: Unified output (default, richer metadata)
|
||||
result = extract_bytes(pdf_bytes, "application/pdf")
|
||||
print(result.content) # Full text
|
||||
print(result.metadata.title) # Document metadata
|
||||
for page in result.pages:
|
||||
print(f"Page {page.page_number}: {page.content[:100]}")
|
||||
```
|
||||
|
||||
### TypeScript
|
||||
|
||||
**Unstructured** (via API):
|
||||
|
||||
```typescript
|
||||
const formData = new FormData();
|
||||
formData.append("files", fileBlob);
|
||||
|
||||
const response = await fetch("https://api.unstructured.io/general/v0/general", {
|
||||
method: "POST",
|
||||
body: formData,
|
||||
});
|
||||
const elements = await response.json();
|
||||
```
|
||||
|
||||
**Kreuzberg**:
|
||||
|
||||
```typescript
|
||||
import { extractBytes } from "kreuzberg";
|
||||
|
||||
// Option 1: Element-based output
|
||||
const result = await extractBytes(pdfBuffer, "application/pdf", {
|
||||
output_format: "element_based",
|
||||
});
|
||||
|
||||
for (const element of result.elements) {
|
||||
console.log(`${element.element_type}: ${element.text}`);
|
||||
}
|
||||
|
||||
// Option 2: Unified output with pages
|
||||
const result = await extractBytes(pdfBuffer, "application/pdf", {
|
||||
pages: { extract_pages: true },
|
||||
});
|
||||
|
||||
for (const page of result.pages) {
|
||||
console.log(`Page ${page.page_number}:`, page.content);
|
||||
}
|
||||
```
|
||||
|
||||
### CURL
|
||||
|
||||
**Unstructured**:
|
||||
|
||||
```bash
|
||||
curl -X POST "https://api.unstructured.io/general/v0/general" \
|
||||
-H "unstructured-api-key: $API_KEY" \
|
||||
-F 'files=@document.pdf' \
|
||||
-F 'strategy=hi_res'
|
||||
```
|
||||
|
||||
**Kreuzberg**:
|
||||
|
||||
```bash
|
||||
# Element-based output
|
||||
curl -X POST "http://localhost:8080/extract" \
|
||||
-F 'files=@document.pdf' \
|
||||
-F 'output_format=element_based'
|
||||
|
||||
# With configuration JSON
|
||||
curl -X POST "http://localhost:8080/extract" \
|
||||
-F 'files=@document.pdf' \
|
||||
-F 'config={"output_format":"element_based","pages":{"extract_pages":true}}'
|
||||
```
|
||||
|
||||
## Feature Comparison
|
||||
|
||||
### What Kreuzberg Adds
|
||||
|
||||
1. **Richer Metadata**: Format-specific discriminated unions (PDF, Excel, Email, etc.)
|
||||
2. **Native Per-Page**: `PageContent` with byte offsets, hierarchy, tables, images per page
|
||||
3. **90+ Formats**: vs Unstructured's ~30 formats
|
||||
4. **Performance**: Rust-based native implementation (vs Python-based)
|
||||
5. **10 Language Bindings**: Python, TypeScript, Ruby, PHP, Go, Java, C#, Elixir, Rust, WASM
|
||||
6. **Built-in Embeddings**: ONNX models via `/embed` endpoint (no external API)
|
||||
7. **Smart Hierarchy**: PDF font-size clustering for h1-h6 detection
|
||||
8. **Bounding Boxes**: Preserved from PDF source in element coordinates
|
||||
|
||||
### What Unstructured Has
|
||||
|
||||
1. **Layout Detection Models**: ML-based layout analysis (GPU-accelerated)
|
||||
2. **Cloud API**: Hosted service (Kreuzberg requires self-hosting)
|
||||
3. **More Element Types**: More granular element classification
|
||||
4. **Mature Ecosystem**: Larger community, more integrations
|
||||
|
||||
## Configuration Mapping
|
||||
|
||||
| Unstructured Parameter | Kreuzberg Config | Notes |
|
||||
| ------------------------------------- | ------------------------------------ | ---------------------------------- |
|
||||
| `strategy=hi_res` | `pdf_options.hierarchy.enabled=true` | PDF hierarchy extraction |
|
||||
| `coordinates=true` | Always included when available | Bounding boxes in element metadata |
|
||||
| `languages=["eng"]` | `ocr.language="eng"` | OCR language |
|
||||
| `extract_image_block_types=["image"]` | `images.extract_images=true` | Image extraction |
|
||||
| `chunking_strategy="by_title"` | `chunking.max_chars=1000` | Text chunking (basic) |
|
||||
| `embedding_model="..."` | `chunking.embedding.model="..."` | Embedding generation |
|
||||
|
||||
## Migration Checklist
|
||||
|
||||
- [ ] Update API endpoint URLs (Unstructured → Kreuzberg)
|
||||
- [ ] Add `output_format=element_based` if using element-based workflow
|
||||
- [ ] Update element type references (`Title` → `title`, camelCase → snake_case)
|
||||
- [ ] Update metadata field references (Kreuzberg has richer metadata structure)
|
||||
- [ ] Test with sample documents to verify output equivalence
|
||||
- [ ] Update error handling (Kreuzberg uses HTTP 422 for validation errors)
|
||||
- [ ] Configure caching if needed (Kreuzberg has built-in file-based cache)
|
||||
- [ ] Set up embeddings if using RAG pipeline (Kreuzberg has built-in ONNX support)
|
||||
|
||||
## Advanced: Hybrid Approach
|
||||
|
||||
You can use **both formats** simultaneously:
|
||||
|
||||
```python
|
||||
from kreuzberg import extract_bytes
|
||||
|
||||
result = extract_bytes(pdf_bytes, "application/pdf", {
|
||||
"output_format": "element_based", # Get elements
|
||||
"pages": {"extract_pages": true} # Also get per-page content
|
||||
})
|
||||
|
||||
# Element-based processing
|
||||
for element in result.elements:
|
||||
if element.element_type == "title":
|
||||
index_heading(element.text)
|
||||
|
||||
# Page-based processing
|
||||
for page in result.pages:
|
||||
if page.hierarchy:
|
||||
for block in page.hierarchy.blocks:
|
||||
if block.level == "h1":
|
||||
process_section(block.text)
|
||||
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Enable Caching**: `use_cache: true` (default) for repeated extractions
|
||||
2. **Disable OCR**: If documents are searchable PDFs, set `force_ocr: false`
|
||||
3. **Limit Page Extraction**: Only enable `pages` if you need per-page content
|
||||
4. **Batch Processing**: Send multiple files in single request (up to 10MB total)
|
||||
5. **Use Embeddings Wisely**: Enable only for chunked content destined for vector DB
|
||||
|
||||
## Getting Help
|
||||
|
||||
- **Documentation**: <https://github.com/kreuzberg-dev/Kreuzberg>
|
||||
- **Issues**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
|
||||
- **API Reference**: See `docs/api/` for endpoint documentation
|
||||
|
||||
## Next Steps
|
||||
|
||||
After migration:
|
||||
|
||||
1. Review the [Kreuzberg vs Unstructured Comparison](../comparisons/kreuzberg-vs-unstructured.md)
|
||||
2. Explore Kreuzberg-specific features (hierarchy, per-page metadata, embeddings)
|
||||
3. Optimize your pipeline with native Rust performance
|
||||
Reference in New Issue
Block a user