# Kreuzberg PHP Snippets

Comprehensive code examples for the Kreuzberg PHP bindings. These snippets demonstrate all major features and use cases.

## Directory Structure

```text
php/
├── installation/    # Getting started, setup, requirements
├── quickstart/      # Basic usage examples
├── configuration/   # Configuration classes and options
├── extraction/      # Document extraction examples
├── async/           # Async extraction with DeferredResult
├── ocr/             # OCR and image preprocessing
├── chunking/        # Text chunking for RAG
├── embeddings/      # Vector embeddings and semantic search
├── advanced/        # Error handling, performance tuning
├── cache/           # Caching strategies
├── cli/             # Command-line tools
└── benchmarking/    # Performance testing
```

## Installation (3 snippets)

### composer_install.php

Installing Kreuzberg via Composer and verifying that the extension is loaded.

### extension_setup.php

Setting up the native PHP extension (`kreuzberg.so` or `kreuzberg.dll`) and checking for optional dependencies (Tesseract, ONNX Runtime).

### requirements_check.php

Comprehensive system requirements verification script.
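
As a rough illustration of what such a check might look like, the sketch below tests the two hard requirements listed in this README: the PHP version and the native extension. It assumes the extension registers itself under the name `kreuzberg` (inferred from the `php -m | grep kreuzberg` step); `check_requirements` is a hypothetical helper, not part of the Kreuzberg API.

```php
<?php

declare(strict_types=1);

// Minimum PHP version, taken from the Requirements section of this README.
const MIN_PHP_VERSION = '8.1.0';

/**
 * Returns a list of human-readable problems; an empty list means every check passed.
 * Hypothetical helper for illustration only.
 *
 * @return list<string>
 */
function check_requirements(string $phpVersion, bool $extensionLoaded): array
{
    $problems = [];

    if (version_compare($phpVersion, MIN_PHP_VERSION, '<')) {
        $problems[] = "PHP {$phpVersion} is older than " . MIN_PHP_VERSION;
    }

    if (!$extensionLoaded) {
        $problems[] = 'The kreuzberg extension is not loaded';
    }

    return $problems;
}

// Assumption: the extension is registered under the name "kreuzberg".
$problems = check_requirements(PHP_VERSION, extension_loaded('kreuzberg'));

echo $problems === []
    ? "All requirements met\n"
    : implode("\n", $problems) . "\n";
```

Passing the version and the extension state in as arguments keeps the check easy to unit test without changing the running interpreter.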

## Quickstart (4 snippets)

### basic_extraction_oop.php

Simple document extraction using the object-oriented API.

### basic_extraction_procedural.php

Simple extraction using the procedural API for more concise code.

### extract_from_bytes.php

Extract content from file data held in memory (useful for uploaded files).

### mime_type_detection.php

Automatic MIME type detection from file paths or content.

## Configuration (5 snippets)

### extraction_config.php

The main `ExtractionConfig` class, which controls all aspects of extraction.

### pdf_config.php

PDF-specific settings including image quality and extraction methods.

### page_config.php

Per-page extraction and page markers for maintaining document structure.

### language_detection_config.php

Automatic language detection for multilingual documents.

### keyword_config.php

Automatic keyword extraction for document categorization.

## Extraction (7 snippets)

### pdf_extraction.php

Extract text, tables, and images from PDF files with various configurations.

### docx_extraction.php

Extract content from Microsoft Word documents, including metadata and tables.

### image_extraction.php

Extract embedded images from documents with optional OCR.

### batch_processing.php

Process multiple documents in parallel for higher throughput.

### table_extraction.php

Extract and process tables, then export them to CSV, JSON, and HTML formats.
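
The export step can be sketched independently of the extraction API. The example below assumes a table has already been reduced to a plain array of rows, each row an array of cell strings; that shape and the `rows_to_csv` helper are assumptions for illustration, not the structure Kreuzberg returns.

```php
<?php

declare(strict_types=1);

/**
 * Serializes rows of cells to a CSV string using PHP's built-in fputcsv().
 * Hypothetical helper; the row/cell shape is an assumption.
 *
 * @param list<list<string>> $rows
 */
function rows_to_csv(array $rows): string
{
    // Write to an in-memory stream so no temporary file is needed.
    $stream = fopen('php://memory', 'w+');

    foreach ($rows as $row) {
        // An empty escape character selects standard quote-doubling behavior.
        fputcsv($stream, $row, ',', '"', '');
    }

    rewind($stream);
    $csv = stream_get_contents($stream);
    fclose($stream);

    return $csv;
}

$rows = [
    ['Name', 'Qty'],
    ['Widget, large', '2'],
];

echo rows_to_csv($rows);

// The same rows can be exported as JSON with the standard encoder.
echo json_encode($rows, JSON_THROW_ON_ERROR), "\n";
```

Cells that contain the delimiter are quoted automatically, which is the main reason to prefer `fputcsv()` over joining cells with `implode()`.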

### metadata_extraction.php

Extract document metadata (title, author, dates, keywords).

### multi_format.php

Handle various document formats with format-specific processing.

## OCR (3 snippets)

### basic_ocr.php

Basic OCR with Tesseract for scanned documents and images.

### advanced_ocr.php

Advanced OCR configuration with Tesseract PSM modes and table detection.

### image_preprocessing.php

Image preprocessing for better OCR accuracy (denoising, deskewing, sharpening).

## Chunking (1 snippet)

### basic_chunking.php

Split documents into chunks for RAG applications with various strategies.
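
To make the idea concrete, here is a minimal sketch of one common strategy: fixed-size chunks with overlap. It is plain PHP and does not use Kreuzberg's chunking configuration; `chunk_text` and its parameters are hypothetical names, and sizes are measured in bytes rather than characters or tokens.

```php
<?php

declare(strict_types=1);

/**
 * Splits text into fixed-size chunks where each chunk repeats the last
 * $overlap bytes of the previous one. Hypothetical helper for illustration.
 * Byte-based: for multibyte text, swap in mb_strlen()/mb_substr().
 *
 * @return list<string>
 */
function chunk_text(string $text, int $maxBytes, int $overlap = 0): array
{
    if ($maxBytes <= 0 || $overlap < 0 || $overlap >= $maxBytes) {
        throw new InvalidArgumentException('Require maxBytes > 0 and 0 <= overlap < maxBytes');
    }

    $chunks = [];
    $length = strlen($text);
    $step = $maxBytes - $overlap; // how far the window advances each time

    for ($start = 0; $start < $length; $start += $step) {
        $chunks[] = substr($text, $start, $maxBytes);

        // Stop once the current window already reaches the end of the text.
        if ($start + $maxBytes >= $length) {
            break;
        }
    }

    return $chunks;
}

print_r(chunk_text('abcdefghij', 4, 1));
```

The overlap keeps a little shared context between neighboring chunks, which tends to help retrieval when a relevant sentence straddles a chunk boundary.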

## Embeddings (2 snippets)

### basic_embeddings.php

Generate vector embeddings for semantic search and similarity matching.

### semantic_search.php

Build a semantic search system using document embeddings.
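
The ranking step of such a system can be sketched without the embedding model. The example below assumes embeddings are already available as equal-length arrays of floats and ranks documents by cosine similarity to a query vector; `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and the vectors are toy values, not real model output.

```php
<?php

declare(strict_types=1);

/**
 * Cosine similarity between two equal-length vectors.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosine_similarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;

    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }

    // A zero vector has no direction; treat it as unrelated to everything.
    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Returns document IDs mapped to scores, best match first.
 *
 * @param list<float>                $query
 * @param array<string, list<float>> $documents
 * @return array<string, float>
 */
function rank_by_similarity(array $query, array $documents): array
{
    $scores = [];
    foreach ($documents as $id => $vector) {
        $scores[$id] = cosine_similarity($query, $vector);
    }

    arsort($scores); // sort by score descending, keeping the ID keys

    return $scores;
}

// Toy vectors standing in for real embeddings.
$documents = [
    'invoice.pdf' => [0.0, 1.0],
    'contract.docx' => [1.0, 0.0],
];

print_r(rank_by_similarity([1.0, 0.0], $documents));
```

A linear scan like this is adequate for small collections; larger ones usually call for a dedicated vector index.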

## Advanced (2 snippets)

### error_handling.php

Robust error handling, retry strategies, and validation.
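
One possible retry strategy is exponential backoff around any callable. The sketch below is generic PHP: `with_retries` is a hypothetical helper, and it catches the base `Exception` class only because this README does not name Kreuzberg's exception types. Real code would narrow the `catch` to the errors that are actually transient.

```php
<?php

declare(strict_types=1);

/**
 * Runs $operation, retrying with exponential backoff when it throws.
 * Hypothetical helper for illustration only.
 */
function with_retries(callable $operation, int $maxAttempts = 3, int $baseDelayMs = 100): mixed
{
    if ($maxAttempts < 1 || $baseDelayMs < 0) {
        throw new InvalidArgumentException('Require maxAttempts >= 1 and baseDelayMs >= 0');
    }

    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (Exception $error) {
            // Out of attempts: surface the last error to the caller.
            if ($attempt >= $maxAttempts) {
                throw $error;
            }

            // Wait baseDelay, then 2x, 4x, ... before the next attempt.
            usleep($baseDelayMs * 1000 * 2 ** ($attempt - 1));
        }
    }
}

// Demo with a simulated flaky operation that succeeds on the third call.
$calls = 0;
$result = with_retries(function () use (&$calls): string {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException('temporary failure');
    }
    return 'extracted text';
}, maxAttempts: 3, baseDelayMs: 10);

echo "{$result} after {$calls} attempts\n";
```

With the procedural API shown in this README, a call might look like `with_retries(fn () => extract_file('document.pdf'))`.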

### performance_tuning.php

Performance optimization tips and techniques.

## Cache (1 snippet)

### disk_cache.php

File-based caching to avoid re-processing documents.
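
A minimal sketch of the idea, assuming the cache key is the SHA-256 hash of the file contents so that an edited document is re-processed automatically. `cached_extract` is a hypothetical helper, and the extractor is passed in as a callable that returns a string, so the sketch does not depend on any particular Kreuzberg call.

```php
<?php

declare(strict_types=1);

/**
 * Returns the extracted text for $path, reusing a cached copy when the
 * file contents have not changed. Hypothetical helper for illustration.
 *
 * @param callable(string): string $extract
 */
function cached_extract(string $path, string $cacheDir, callable $extract): string
{
    // Key on file contents, not on path or mtime.
    $hash = hash_file('sha256', $path);
    if ($hash === false) {
        throw new RuntimeException("Cannot read {$path}");
    }

    if (!is_dir($cacheDir) && !mkdir($cacheDir, 0775, true) && !is_dir($cacheDir)) {
        throw new RuntimeException("Cannot create cache directory {$cacheDir}");
    }

    $cacheFile = $cacheDir . '/' . $hash . '.txt';

    if (is_file($cacheFile)) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached; // cache hit
        }
    }

    // Cache miss: run the (expensive) extraction and store the result.
    $content = $extract($path);
    file_put_contents($cacheFile, $content, LOCK_EX);

    return $content;
}
```

With the procedural API shown in this README, the callable might be `fn (string $p): string => extract_file($p)->content`.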

## CLI (2 snippets)

### basic_cli.php

Simple command-line interface for document extraction.

### cli_with_config.php

Advanced CLI with support for various extraction options.

## Benchmarking (1 snippet)

### simple_benchmark.php

Benchmark extraction performance across different configurations.
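
A timing harness for this kind of comparison can be sketched with PHP's monotonic clock. `benchmark` is a hypothetical helper that times any callable; what goes inside the callable (for example, an extraction with a given configuration) is up to the caller.

```php
<?php

declare(strict_types=1);

/**
 * Runs $operation $iterations times and reports timings in milliseconds.
 * Hypothetical helper for illustration only.
 *
 * @return array{min_ms: float, median_ms: float, max_ms: float}
 */
function benchmark(callable $operation, int $iterations = 5): array
{
    if ($iterations < 1) {
        throw new InvalidArgumentException('Require iterations >= 1');
    }

    $timingsMs = [];

    for ($i = 0; $i < $iterations; $i++) {
        $start = hrtime(true); // monotonic clock, nanoseconds
        $operation();
        $timingsMs[] = (hrtime(true) - $start) / 1e6;
    }

    sort($timingsMs);

    return [
        'min_ms' => $timingsMs[0],
        // Middle element; for an even count this is the upper of the two middle values.
        'median_ms' => $timingsMs[intdiv($iterations, 2)],
        'max_ms' => $timingsMs[$iterations - 1],
    ];
}

// Demo: time a stand-in workload.
print_r(benchmark(fn () => usleep(1000), 5));
```

Reporting the median alongside the minimum and maximum makes a single slow run (for example, a cold cache on the first iteration) easier to spot.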

## Usage Patterns

### Basic Extraction

```php title="Basic Extraction"
use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');
echo $result->content;
```

### With Configuration

```php title="With Configuration"
use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;

// Enable Tesseract OCR (English) and table extraction.
$config = new ExtractionConfig(
    ocr: new OcrConfig(backend: 'tesseract', language: 'eng'),
    extractTables: true
);

$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile('scanned.pdf');
```

### Procedural API

```php title="Procedural API"
use function Kreuzberg\extract_file;

$result = extract_file('document.pdf');
echo $result->content;
```

### Batch Processing

```php title="Batch Processing"
use function Kreuzberg\batch_extract_files;

$files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
$results = batch_extract_files($files);
```

## Async Extraction (4 snippets)

### async_extract_file.php

Async file extraction with `DeferredResult` polling and blocking patterns.

### async_batch.php

Async batch extraction with timeout-based waiting.

### async_amp_bridge.php

Integration with the Amp v3+ framework using `AmpBridge::toFuture()`.

### async_react_bridge.php

Integration with the ReactPHP framework using `ReactBridge::toPromise()`.

## Key Features Demonstrated

- **90+ File Formats**: PDF, DOCX, XLSX, PPTX, images, HTML, and more
- **Async Extraction**: Non-blocking extraction with DeferredResult pattern
- **OCR Support**: Tesseract integration with preprocessing
- **Table Extraction**: Extract structured tables with multiple export formats
- **Metadata**: Rich metadata extraction for all formats
- **Batch Processing**: Parallel processing for high throughput
- **Text Chunking**: Intelligent segmentation for RAG applications
- **Embeddings**: Vector embeddings for semantic search
- **Type Safety**: Full PHP 8.1+ type hints and readonly classes
- **Error Handling**: Comprehensive error handling patterns
- **Performance**: Optimization techniques and benchmarking

## Requirements

- PHP 8.1.0 or higher
- Kreuzberg PHP extension (`kreuzberg.so` or `kreuzberg.dll`)
- Composer package: `kreuzberg/kreuzberg`
- Optional: Tesseract OCR (for OCR functionality)
- Optional: ONNX Runtime (for embeddings)

## Testing Snippets

Each snippet is designed to be self-contained and runnable. To test:

1. Install dependencies:

```bash
composer require kreuzberg/kreuzberg
```

2. Ensure the extension is loaded:

```bash
php -m | grep kreuzberg
```

3. Run any snippet:

```bash
php docs/snippets/php/quickstart/basic_extraction_oop.php
```

## Best Practices

1. **Use batch processing** for multiple files
2. **Disable unnecessary features** (OCR, embeddings) if not needed
3. **Implement caching** for frequently accessed documents
4. **Handle errors gracefully** with try-catch blocks
5. **Monitor memory usage** for large documents
6. **Use type hints** for better IDE support and safety

## Contributing

These snippets follow these conventions:

- All files use `declare(strict_types=1)`
- Code is wrapped in `` ```php `` Markdown code blocks
- Clear comments explain what each snippet demonstrates
- Both OOP and procedural APIs are shown where applicable
- Examples are realistic and production-ready

## Links

- **Documentation**: <https://kreuzberg.dev>
- **GitHub**: <https://github.com/kreuzberg-dev/Kreuzberg>
- **Issues**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
- **Package**: <https://packagist.org/packages/kreuzberg/Kreuzberg>