hjess/fil

Henrik Jess Nielsen b4c07d3693

Nomad changes

2026-06-01 23:40:55 +02:00

Kreuzberg PHP Snippets

Comprehensive code examples for the Kreuzberg PHP bindings. These snippets demonstrate all major features and use cases.

Directory Structure

php/
├── installation/     # Getting started, setup, requirements
├── quickstart/       # Basic usage examples
├── configuration/    # Configuration classes and options
├── extraction/       # Document extraction examples
├── async/            # Async extraction with DeferredResult
├── ocr/              # OCR and image preprocessing
├── chunking/         # Text chunking for RAG
├── embeddings/       # Vector embeddings and semantic search
├── advanced/         # Error handling, performance tuning
├── cache/            # Caching strategies
├── cli/              # Command-line tools
└── benchmarking/     # Performance testing

Installation (3 snippets)

composer_install.php

Installing Kreuzberg via Composer and verifying the extension is loaded.

extension_setup.php

Setting up the native PHP extension (kreuzberg.so/.dll) and checking for optional dependencies (Tesseract, ONNX Runtime).

requirements_check.php

Comprehensive system requirements verification script.
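
As a rough illustration of what a requirements check can look like, the sketch below tests the PHP version and whether an extension is loaded. The extension name `kreuzberg` is an assumption inferred from the `php -m | grep kreuzberg` command in this README, and `check_requirements()` is a hypothetical helper, not part of the Kreuzberg API.

```php
<?php

declare(strict_types=1);

/**
 * Collect pass/fail results for the core requirements.
 * Assumption: the extension registers itself under the name 'kreuzberg'.
 *
 * @return array<string, bool> check label => whether it passed
 */
function check_requirements(string $minPhp = '8.1.0', string $extension = 'kreuzberg'): array
{
    return [
        "PHP >= {$minPhp}" => version_compare(PHP_VERSION, $minPhp, '>='),
        "extension '{$extension}' loaded" => extension_loaded($extension),
    ];
}

// Print a short report, one line per check.
foreach (check_requirements() as $label => $passed) {
    echo ($passed ? '[ok]   ' : '[fail] ') . $label . PHP_EOL;
}
```

Optional dependencies such as Tesseract or ONNX Runtime are not covered here; how they are best detected depends on the installation.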

Quickstart (4 snippets)

basic_extraction_oop.php

Simple document extraction using the object-oriented API.

basic_extraction_procedural.php

Simple extraction using the procedural API for more concise code.

extract_from_bytes.php

Extract content from file data in memory (useful for uploaded files).

mime_type_detection.php

Automatic MIME type detection from file paths or content.

Configuration (5 snippets)

extraction_config.php

The main ExtractionConfig class, which controls all aspects of extraction.

pdf_config.php

PDF-specific settings including image quality and extraction methods.

page_config.php

Per-page extraction and page markers for maintaining document structure.

language_detection_config.php

Automatic language detection for multilingual documents.

keyword_config.php

Automatic keyword extraction for document categorization.

Extraction (7 snippets)

pdf_extraction.php

Extract text, tables, and images from PDF files with various configurations.

docx_extraction.php

Extract content from Microsoft Word documents, including metadata and tables.

image_extraction.php

Extract embedded images from documents with optional OCR.

batch_processing.php

Process multiple documents in parallel for higher throughput.

table_extraction.php

Extract and process tables, then export them to CSV, JSON, and HTML.

metadata_extraction.php

Extract document metadata (title, author, dates, keywords).

multi_format.php

Handle various document formats with format-specific processing.
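
The CSV, JSON, and HTML table export mentioned above can be sketched in plain PHP. This is a minimal sketch that assumes a table has already been reduced to a list of rows of string cells with the header in the first row; the actual shape of Kreuzberg's table objects is not shown in this README, and the three `table_to_*` helpers are hypothetical.

```php
<?php

declare(strict_types=1);

/** @param list<list<string>> $rows Header row first, then data rows. */
function table_to_csv(array $rows): string
{
    $stream = fopen('php://memory', 'w+');
    foreach ($rows as $row) {
        // An empty escape character gives plain quote-doubling CSV.
        fputcsv($stream, $row, ',', '"', '');
    }
    rewind($stream);
    $csv = (string) stream_get_contents($stream);
    fclose($stream);

    return $csv;
}

/** @param list<list<string>> $rows Header row first, then data rows. */
function table_to_json(array $rows): string
{
    $header = array_shift($rows) ?? [];
    // array_combine() throws a ValueError if a row has a different cell count.
    $records = array_map(
        static fn (array $row): array => array_combine($header, $row),
        $rows
    );

    return json_encode($records, JSON_THROW_ON_ERROR);
}

/** @param list<list<string>> $rows Header row first, then data rows. */
function table_to_html(array $rows): string
{
    $html = "<table>\n";
    foreach ($rows as $i => $row) {
        $tag = $i === 0 ? 'th' : 'td';
        $cells = array_map(
            static fn (string $cell): string => "<{$tag}>" . htmlspecialchars($cell) . "</{$tag}>",
            $row
        );
        $html .= '  <tr>' . implode('', $cells) . "</tr>\n";
    }

    return $html . "</table>\n";
}
```

Cells are escaped with `htmlspecialchars()` in the HTML variant because extracted text may contain markup characters.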

OCR (3 snippets)

basic_ocr.php

Basic OCR with Tesseract for scanned documents and images.

advanced_ocr.php

Advanced OCR configuration with Tesseract PSM modes and table detection.

image_preprocessing.php

Image preprocessing for better OCR accuracy (denoising, deskewing, sharpening).

Chunking (1 snippet)

basic_chunking.php

Split documents into chunks for RAG applications with various strategies.
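
To make the idea of chunking concrete, here is a minimal sketch of one possible strategy: fixed-size chunks measured in words, with a small overlap between neighbours so context is not lost at the boundaries. It is written in plain PHP and is not Kreuzberg's own chunker; `chunk_words()` and its parameters are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Split text into chunks of at most $maxWords words, where each chunk
 * repeats the last $overlap words of the previous one.
 *
 * @return list<string>
 */
function chunk_words(string $text, int $maxWords = 200, int $overlap = 20): array
{
    if ($maxWords < 1 || $overlap < 0 || $overlap >= $maxWords) {
        throw new InvalidArgumentException('Require maxWords >= 1 and 0 <= overlap < maxWords.');
    }

    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $total = count($words);
    $step = $maxWords - $overlap;
    $chunks = [];

    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $maxWords));
        if ($start + $maxWords >= $total) {
            break; // this chunk already reaches the end of the text
        }
    }

    return $chunks;
}
```

For example, `chunk_words('a b c d e f g', 3, 1)` yields the chunks `a b c`, `c d e`, and `e f g`.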

Embeddings (2 snippets)

basic_embeddings.php

Generate vector embeddings for semantic search and similarity matching.

semantic_search.php

Build a semantic search system using document embeddings.
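
Once embeddings exist, semantic search usually comes down to ranking documents by vector similarity. The sketch below uses cosine similarity over plain PHP arrays; it assumes the embeddings have already been generated and are available as lists of floats. Both function names are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Cosine similarity between two vectors of equal length.
 *
 * @param list<float|int> $a
 * @param list<float|int> $b
 */
function cosine_similarity(array $a, array $b): float
{
    $a = array_values($a);
    $b = array_values($b);
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0; // a zero vector has no direction; treat it as dissimilar
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Rank documents by similarity to the query, best match first.
 *
 * @param list<float|int>                    $query
 * @param array<string, list<float|int>>     $documents document id => embedding
 * @return array<string, float> document id => score, at most $topK entries
 */
function rank_by_similarity(array $query, array $documents, int $topK = 3): array
{
    $scores = [];
    foreach ($documents as $id => $vector) {
        $scores[$id] = cosine_similarity($query, $vector);
    }
    arsort($scores);

    return array_slice($scores, 0, $topK, true);
}
```

A linear scan like this is adequate for small collections; larger ones typically call for a dedicated vector index.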

Advanced (2 snippets)

error_handling.php

Robust error handling, retry strategies, and validation.

performance_tuning.php

Performance optimization tips and techniques.
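
One common retry strategy is exponential backoff. The sketch below wraps any callable and retries it on failure; it catches the generic `Exception` because this README does not name the exception classes Kreuzberg throws, so real code would likely narrow the `catch`. `with_retry()` is a hypothetical helper.

```php
<?php

declare(strict_types=1);

/**
 * Run $operation, retrying on failure with exponential backoff.
 * The delay doubles after each failed attempt: base, 2x base, 4x base, ...
 */
function with_retry(callable $operation, int $maxAttempts = 3, int $baseDelayMs = 100): mixed
{
    if ($maxAttempts < 1) {
        throw new InvalidArgumentException('maxAttempts must be at least 1.');
    }

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e; // out of attempts: surface the last error
            }
            usleep($baseDelayMs * 1000 * 2 ** ($attempt - 1));
        }
    }
}
```

With the procedural API shown in this README, a call might look like `with_retry(fn () => extract_file('document.pdf'))`. Retrying only helps with transient failures, so errors such as a missing file are better validated up front.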

Cache (1 snippet)

disk_cache.php

File-based caching to avoid re-processing documents.
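
A file-based cache can be sketched as follows. The cache key is the SHA-256 hash of the file contents, so a changed document is re-processed automatically. The extraction itself is passed in as a callable to keep the sketch independent of any particular API; `cached_extract()` and the cache layout are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Return extracted text for $path, reusing a cached copy when the
 * file contents have not changed since the last extraction.
 *
 * @param callable(string): string $extract Receives the path, returns the text.
 */
function cached_extract(string $path, callable $extract, string $cacheDir): string
{
    if (!is_readable($path)) {
        throw new RuntimeException("Cannot read {$path}");
    }
    if (!is_dir($cacheDir) && !mkdir($cacheDir, 0775, true) && !is_dir($cacheDir)) {
        throw new RuntimeException("Cannot create cache directory {$cacheDir}");
    }

    // Key on the content hash, not the path, so edits invalidate the entry.
    $cacheFile = $cacheDir . DIRECTORY_SEPARATOR . hash_file('sha256', $path) . '.txt';

    if (is_file($cacheFile)) {
        return (string) file_get_contents($cacheFile);
    }

    $content = $extract($path);
    file_put_contents($cacheFile, $content, LOCK_EX);

    return $content;
}
```

With the procedural API from this README, the callable could be `fn (string $p): string => extract_file($p)->content`. Note that only the text is cached here; storing metadata or tables would need a richer serialization format.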

CLI (2 snippets)

basic_cli.php

Simple command-line interface for document extraction.

cli_with_config.php

Advanced CLI with support for various extraction options.

Benchmarking (1 snippet)

simple_benchmark.php

Benchmark extraction performance across different configurations.
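
A basic timing harness for comparing configurations might look like the sketch below. It times any callable with `hrtime()` and reports minimum, median, and maximum wall-clock time plus peak memory. `benchmark()` is a hypothetical helper, and the numbers it produces depend on the machine and the documents used.

```php
<?php

declare(strict_types=1);

/**
 * Run $task $iterations times and summarize the timings.
 *
 * @return array{min_ms: float, median_ms: float, max_ms: float, peak_memory_mb: float}
 */
function benchmark(callable $task, int $iterations = 5): array
{
    if ($iterations < 1) {
        throw new InvalidArgumentException('iterations must be at least 1.');
    }

    $timesMs = [];
    for ($i = 0; $i < $iterations; $i++) {
        $start = hrtime(true); // monotonic clock, nanoseconds
        $task();
        $timesMs[] = (hrtime(true) - $start) / 1e6;
    }
    sort($timesMs);

    return [
        'min_ms' => $timesMs[0],
        // Upper median when the iteration count is even.
        'median_ms' => $timesMs[intdiv(count($timesMs), 2)],
        'max_ms' => $timesMs[count($timesMs) - 1],
        'peak_memory_mb' => memory_get_peak_usage(true) / 1048576,
    ];
}
```

To compare two configurations, call `benchmark()` once per configuration with a closure that performs the extraction, then compare the medians rather than single runs.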

Usage Patterns

Basic Extraction

use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');
echo $result->content;

With Configuration

use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;

$config = new ExtractionConfig(
    ocr: new OcrConfig(backend: 'tesseract', language: 'eng'),
    extractTables: true
);

$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile('scanned.pdf');

Procedural API

use function Kreuzberg\extract_file;

$result = extract_file('document.pdf');
echo $result->content;

Batch Processing

use function Kreuzberg\batch_extract_files;

$files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
$results = batch_extract_files($files);

Async Extraction (4 snippets)

async_extract_file.php

Async file extraction with DeferredResult polling and blocking patterns.

async_batch.php

Async batch extraction with timeout-based waiting.

async_amp_bridge.php

Integration with the Amp v3+ framework using AmpBridge::toFuture().

async_react_bridge.php

Integration with the ReactPHP framework using ReactBridge::toPromise().
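
The polling and timeout-based waiting patterns mentioned above can be illustrated generically. Because this README does not show DeferredResult's method names, the sketch below takes a plain callable that reports readiness instead of calling any DeferredResult method; `wait_until_ready()` is a hypothetical helper.

```php
<?php

declare(strict_types=1);

/**
 * Poll $isReady until it returns true or the timeout elapses.
 *
 * @param callable(): bool $isReady Cheap, non-blocking readiness check.
 * @return bool True if the work became ready in time, false on timeout.
 */
function wait_until_ready(callable $isReady, float $timeoutSeconds, int $pollIntervalMs = 50): bool
{
    $deadline = microtime(true) + $timeoutSeconds;

    while (true) {
        if ($isReady()) {
            return true;
        }
        if (microtime(true) >= $deadline) {
            return false;
        }
        // Sleep between checks so the loop does not spin the CPU.
        usleep($pollIntervalMs * 1000);
    }
}
```

Polling suits scripts that have other work to do between checks; when nothing else is pending, a blocking wait or one of the event-loop bridges is usually simpler.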

Key Features Demonstrated

90+ File Formats: PDF, DOCX, XLSX, PPTX, images, HTML, and more
Async Extraction: Non-blocking extraction with DeferredResult pattern
OCR Support: Tesseract integration with preprocessing
Table Extraction: Extract structured tables with multiple export formats
Metadata: Rich metadata extraction for all formats
Batch Processing: Parallel processing for high throughput
Text Chunking: Intelligent segmentation for RAG applications
Embeddings: Vector embeddings for semantic search
Type Safety: Full PHP 8.1+ type hints and readonly properties
Error Handling: Comprehensive error handling patterns
Performance: Optimization techniques and benchmarking

Requirements

PHP 8.1.0 or higher
Kreuzberg PHP extension (kreuzberg.so/.dll)
Composer package: kreuzberg/kreuzberg
Optional: Tesseract OCR (for OCR functionality)
Optional: ONNX Runtime (for embeddings)

Testing Snippets

Each snippet is designed to be self-contained and runnable. To test:

Install dependencies:
```
composer require kreuzberg/kreuzberg
```
Ensure the extension is loaded:
```
php -m | grep kreuzberg
```

Run any snippet:

php docs/snippets/php/quickstart/basic_extraction_oop.php

Best Practices

Use batch processing for multiple files
Disable unnecessary features (OCR, embeddings) if not needed
Implement caching for frequently accessed documents
Handle errors gracefully with try-catch blocks
Monitor memory usage for large documents
Use type hints for better IDE support and safety

Contributing

These snippets follow these conventions:

All files use declare(strict_types=1)
Code is wrapped in ```php markdown code blocks
Clear comments explain what each snippet demonstrates
Both OOP and procedural APIs are shown where applicable
Examples are realistic and production-ready
