Nomad changes

Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

docs/snippets/php/README.md Normal file

@@ -0,0 +1,300 @@
# Kreuzberg PHP Snippets
Comprehensive code examples for the Kreuzberg PHP bindings. These snippets demonstrate all major features and use cases.
## Directory Structure
```text
php/
├── installation/ # Getting started, setup, requirements
├── quickstart/ # Basic usage examples
├── configuration/ # Configuration classes and options
├── extraction/ # Document extraction examples
├── async/ # Async extraction with DeferredResult
├── ocr/ # OCR and image preprocessing
├── chunking/ # Text chunking for RAG
├── embeddings/ # Vector embeddings and semantic search
├── advanced/ # Error handling, performance tuning
├── cache/ # Caching strategies
├── cli/ # Command-line tools
└── benchmarking/ # Performance testing
```
## Installation (3 snippets)
### composer_install.php
Installing Kreuzberg via Composer and verifying that the extension is loaded.
### extension_setup.php
Setting up the native PHP extension (kreuzberg.so/.dll) and checking for optional dependencies (Tesseract, ONNX Runtime).
### requirements_check.php
Script that verifies the system requirements.
## Quickstart (4 snippets)
### basic_extraction_oop.php
Simple document extraction using the object-oriented API.
### basic_extraction_procedural.php
Simple extraction using the procedural API for more concise code.
### extract_from_bytes.php
Extract content from file data held in memory (useful for uploaded files).
### mime_type_detection.php
Automatic MIME type detection from file paths or content.
## Configuration (5 snippets)
### extraction_config.php
The main ExtractionConfig class, which controls all aspects of extraction.
### pdf_config.php
PDF-specific settings, including image quality and extraction methods.
### page_config.php
Per-page extraction and page markers for maintaining document structure.
### language_detection_config.php
Automatic language detection for multilingual documents.
### keyword_config.php
Automatic keyword extraction for document categorization.
## Extraction (7 snippets)
### pdf_extraction.php
Extract text, tables, and images from PDF files with various configurations.
### docx_extraction.php
Extract content from Microsoft Word documents, including metadata and tables.
### image_extraction.php
Extract embedded images from documents with optional OCR.
### batch_processing.php
Process multiple documents in parallel for higher throughput.
### table_extraction.php
Extract and process tables; export to CSV, JSON, and HTML formats.
### metadata_extraction.php
Extract document metadata (title, author, dates, keywords).
### multi_format.php
Handle various document formats with format-specific processing.
## OCR (3 snippets)
### basic_ocr.php
Basic OCR with Tesseract for scanned documents and images.
### advanced_ocr.php
Advanced OCR configuration with Tesseract PSM modes and table detection.
### image_preprocessing.php
Image preprocessing for better OCR accuracy (denoising, deskewing, sharpening).
## Chunking (1 snippet)
### basic_chunking.php
Split documents into chunks for RAG applications with various strategies.
## Embeddings (2 snippets)
### basic_embeddings.php
Generate vector embeddings for semantic search and similarity matching.
### semantic_search.php
Build a semantic search system using document embeddings.
## Advanced (2 snippets)
### error_handling.php
Robust error handling, retry strategies, and validation.
### performance_tuning.php
Performance optimization tips and techniques.
## Cache (1 snippet)
### disk_cache.php
File-based caching to avoid re-processing documents.
## CLI (2 snippets)
### basic_cli.php
Simple command-line interface for document extraction.
### cli_with_config.php
Advanced CLI with support for various extraction options.
## Benchmarking (1 snippet)
### simple_benchmark.php
Benchmark extraction performance across different configurations.
## Usage Patterns
### Basic Extraction
```php title="Basic Extraction"
use Kreuzberg\Kreuzberg;
$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');
echo $result->content;
```
### With Configuration
```php title="With Configuration"
use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;
use Kreuzberg\Kreuzberg;

// Enable Tesseract OCR (English) and table extraction.
$config = new ExtractionConfig(
    ocr: new OcrConfig(backend: 'tesseract', language: 'eng'),
    extractTables: true
);
$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile('scanned.pdf');
```
### Procedural API
```php title="Procedural API"
use function Kreuzberg\extract_file;
$result = extract_file('document.pdf');
echo $result->content;
```
### Batch Processing
```php title="Batch Processing"
use function Kreuzberg\batch_extract_files;
$files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
$results = batch_extract_files($files);
```
## Async Extraction (4 snippets)
### async_extract_file.php
Async file extraction with DeferredResult polling and blocking patterns.
### async_batch.php
Async batch extraction with timeout-based waiting.
### async_amp_bridge.php
Integration with the Amp v3+ framework using AmpBridge::toFuture().
### async_react_bridge.php
Integration with the ReactPHP framework using ReactBridge::toPromise().
## Key Features Demonstrated
- **90+ File Formats**: PDF, DOCX, XLSX, PPTX, images, HTML, and more
- **Async Extraction**: Non-blocking extraction with DeferredResult pattern
- **OCR Support**: Tesseract integration with preprocessing
- **Table Extraction**: Extract structured tables with multiple export formats
- **Metadata**: Rich metadata extraction for all formats
- **Batch Processing**: Parallel processing for high throughput
- **Text Chunking**: Intelligent segmentation for RAG applications
- **Embeddings**: Vector embeddings for semantic search
- **Type Safety**: Full PHP 8.1+ type hints and readonly properties
- **Error Handling**: Comprehensive error handling patterns
- **Performance**: Optimization techniques and benchmarking
## Requirements
- PHP 8.1.0 or higher
- Kreuzberg PHP extension (kreuzberg.so/.dll)
- Composer package: kreuzberg/kreuzberg
- Optional: Tesseract OCR (for OCR functionality)
- Optional: ONNX Runtime (for embeddings)
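
The two mandatory requirements above can be checked with PHP built-ins. The following is a minimal sketch, not the repository's `requirements_check.php`: the function name `check_requirements` and its parameters are hypothetical, and it assumes the extension registers itself under the name `kreuzberg` (the name matched by `php -m | grep kreuzberg`).

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: reports whether the mandatory requirements are met.
 * The parameters exist only so the checks can be exercised with fixed values.
 *
 * @return array<string, bool> requirement label => satisfied?
 */
function check_requirements(string $phpVersion = PHP_VERSION, ?bool $extensionLoaded = null): array
{
    return [
        'PHP >= 8.1.0' => version_compare($phpVersion, '8.1.0', '>='),
        // Assumption: the native extension is registered as "kreuzberg".
        'kreuzberg extension' => $extensionLoaded ?? extension_loaded('kreuzberg'),
    ];
}

// Print one status line per requirement for the running interpreter.
foreach (check_requirements() as $label => $ok) {
    echo ($ok ? '[ok]      ' : '[missing] ') . $label . PHP_EOL;
}
```

The optional dependencies (Tesseract, ONNX Runtime) are left out because how they are detected depends on the installation.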
## Testing Snippets
Each snippet is designed to be self-contained and runnable. To test:
1. Install dependencies:
   ```bash
   composer require kreuzberg/kreuzberg
   ```
2. Ensure the extension is loaded:
   ```bash
   php -m | grep kreuzberg
   ```
3. Run any snippet:
   ```bash
   php docs/snippets/php/quickstart/basic_extraction_oop.php
   ```
## Best Practices
1. **Use batch processing** for multiple files
2. **Disable unnecessary features** (OCR, embeddings) if not needed
3. **Implement caching** for frequently accessed documents
4. **Handle errors gracefully** with try-catch blocks
5. **Monitor memory usage** for large documents
6. **Use type hints** for better IDE support and safety
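
Practices 3 and 4 can be combined in one small wrapper. The following is a minimal sketch under stated assumptions, not the repository's `disk_cache.php`: `cached_extract` and its parameters are hypothetical names, and the extractor is passed in as a callable so the code runs without the extension. In real use the callable might wrap `extract_file()` and return `$result->content`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns the extracted text for $path, reusing a disk
 * cache keyed on the file's contents. Returns null if the file cannot be read
 * or the extractor throws.
 *
 * @param callable(string): string $extract Assumed to return the extracted text.
 */
function cached_extract(string $path, string $cacheDir, callable $extract): ?string
{
    if (!is_file($path) || !is_readable($path)) {
        return null;
    }

    // Key on the file contents so an edited file is extracted again.
    $cacheFile = $cacheDir . '/' . hash_file('sha256', $path) . '.txt';
    if (is_file($cacheFile)) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }

    try {
        $content = $extract($path);
    } catch (Throwable $e) {
        // Log and degrade gracefully instead of aborting the whole run.
        error_log("Extraction failed for {$path}: {$e->getMessage()}");
        return null;
    }

    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0775, true);
    }
    file_put_contents($cacheFile, $content, LOCK_EX);

    return $content;
}

// Usage with a stand-in extractor that upper-cases the file contents.
$source = tempnam(sys_get_temp_dir(), 'doc');
file_put_contents($source, 'hello');
$demoCache = sys_get_temp_dir() . '/kreuzberg-cache-demo';
echo cached_extract(
    $source,
    $demoCache,
    fn (string $p): string => strtoupper((string) file_get_contents($p))
), PHP_EOL; // prints "HELLO"
unlink($source);
```

Keying the cache on a content hash rather than the file path means renamed copies share one entry and modified files are never served stale text, at the cost of hashing the file on every call.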
## Contributing
These snippets follow these conventions:
- All files use `declare(strict_types=1)`
- Code is wrapped in `` ```php `` Markdown code blocks
- Clear comments explain what each snippet demonstrates
- Both OOP and procedural APIs are shown where applicable
- Examples are realistic and production-ready
## Links
- **Documentation**: <https://kreuzberg.dev>
- **GitHub**: <https://github.com/kreuzberg-dev/kreuzberg>
- **Issues**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
- **Package**: <https://packagist.org/packages/kreuzberg/kreuzberg>