# Kreuzberg PHP Snippets

Comprehensive code examples for the Kreuzberg PHP bindings. These snippets demonstrate all major features and use cases.

## Directory Structure

```text
php/
├── installation/          # Getting started, setup, requirements
├── quickstart/           # Basic usage examples
├── configuration/        # Configuration classes and options
├── extraction/           # Document extraction examples
├── async/               # Async extraction with DeferredResult
├── ocr/                 # OCR and image preprocessing
├── chunking/            # Text chunking for RAG
├── embeddings/          # Vector embeddings and semantic search
├── advanced/            # Error handling, performance tuning
├── cache/               # Caching strategies
├── cli/                 # Command-line tools
└── benchmarking/        # Performance testing
```

## Installation (3 snippets)

### Composer_install.php

Installing Kreuzberg via Composer and verifying the extension is loaded.

### Extension_setup.php

Setting up the native PHP extension (kreuzberg.so/.dll) and checking for optional dependencies (Tesseract, ONNX Runtime).

### Requirements_check.php

Comprehensive system requirements verification script.

## Quickstart (4 snippets)

### Basic_extraction_oop.php

Simple document extraction using the object-oriented API.

### Basic_extraction_procedural.php

Simple extraction using the procedural API for more concise code.

### Extract_from_bytes.php

Extract content from file data in memory (useful for uploaded files).

### Mime_type_detection.php

Automatic MIME type detection from file paths or content.

## Configuration (5 snippets)

### Extraction_config.php

Main ExtractionConfig class - controlling all aspects of extraction.

### Pdf_config.php

PDF-specific settings including image quality and extraction methods.

### Page_config.php

Per-page extraction and page markers for maintaining document structure.

### Language_detection_config.php

Automatic language detection for multilingual documents.

### Keyword_config.php

Automatic keyword extraction for document categorization.

## Extraction (7 snippets)

### Pdf_extraction.php

Extract text, tables, and images from PDF files with various configurations.

### Docx_extraction.php

Extract content from Microsoft Word documents including metadata and tables.

### Image_extraction.php

Extract embedded images from documents with optional OCR.

### Batch_processing.php

Process multiple documents in parallel for maximum performance.

### Table_extraction.php

Extract and process tables, export to CSV, JSON, and HTML formats.

### Metadata_extraction.php

Extract document metadata (title, author, dates, keywords).

### Multi_format.php

Handle various document formats with format-specific processing.

## OCR (3 snippets)

### Basic_ocr.php

Basic OCR with Tesseract for scanned documents and images.

### Advanced_ocr.php

Advanced OCR configuration with Tesseract PSM modes and table detection.

### Image_preprocessing.php

Image preprocessing for better OCR accuracy (denoising, deskewing, sharpening).

## Chunking (1 snippet)

### Basic_chunking.php

Split documents into chunks for RAG applications with various strategies.

## Embeddings (2 snippets)

### Basic_embeddings.php

Generate vector embeddings for semantic search and similarity matching.

### Semantic_search.php

Build a semantic search system using document embeddings.

## Advanced (2 snippets)

### Error_handling.php

Robust error handling, retry strategies, and validation.

### Performance_tuning.php

Performance optimization tips and techniques.

## Cache (1 snippet)

### Disk_cache.php

File-based caching to avoid re-processing documents.

## CLI (2 snippets)

### Basic_cli.php

Simple command-line interface for document extraction.

### Cli_with_config.php

Advanced CLI with support for various extraction options.

## Benchmarking (1 snippet)

### Simple_benchmark.php

Benchmark extraction performance across different configurations.

## Usage Patterns

### Basic Extraction

```php title="Basic Extraction"
use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');
echo $result->content;
```

### With Configuration

```php title="With Configuration"
use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;

$config = new ExtractionConfig(
    ocr: new OcrConfig(backend: 'tesseract', language: 'eng'),
    extractTables: true
);

$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile('scanned.pdf');
```

### Procedural API

```php title="Procedural API"
use function Kreuzberg\extract_file;

$result = extract_file('document.pdf');
echo $result->content;
```

### Batch Processing

```php title="Batch Processing"
use function Kreuzberg\batch_extract_files;

$files = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx'];
$results = batch_extract_files($files);
```

## Async Extraction (4 snippets)

### Async_extract_file.php

Async file extraction with DeferredResult polling and blocking patterns.

### Async_batch.php

Async batch extraction with timeout-based waiting.

### Async_amp_bridge.php

Integration with Amp v3+ framework using AmpBridge::toFuture().

### Async_react_bridge.php

Integration with ReactPHP framework using ReactBridge::toPromise().

## Key Features Demonstrated

- **90+ File Formats**: PDF, DOCX, XLSX, PPTX, images, HTML, and more
- **Async Extraction**: Non-blocking extraction with DeferredResult pattern
- **OCR Support**: Tesseract integration with preprocessing
- **Table Extraction**: Extract structured tables with multiple export formats
- **Metadata**: Rich metadata extraction for all formats
- **Batch Processing**: Parallel processing for high throughput
- **Text Chunking**: Intelligent segmentation for RAG applications
- **Embeddings**: Vector embeddings for semantic search
- **Type Safety**: Full PHP 8.1+ type hints and readonly classes
- **Error Handling**: Comprehensive error handling patterns
- **Performance**: Optimization techniques and benchmarking

## Requirements

- PHP 8.1.0 or higher
- Kreuzberg PHP extension (kreuzberg.so/.dll)
- Composer package: kreuzberg/Kreuzberg
- Optional: Tesseract OCR (for OCR functionality)
- Optional: ONNX Runtime (for embeddings)

## Testing Snippets

Each snippet is designed to be self-contained and runnable. To test:

1. Install dependencies:

   ```bash
   composer require kreuzberg/kreuzberg
   ```

2. Ensure the extension is loaded:

   ```bash
   php -m | grep kreuzberg
   ```

3. Run any snippet:

   ```bash
   php docs/snippets/php/quickstart/basic_extraction_oop.php
   ```

## Best Practices

1. **Use batch processing** for multiple files
2. **Disable unnecessary features** (OCR, embeddings) if not needed
3. **Implement caching** for often accessed documents
4. **Handle errors gracefully** with try-catch blocks
5. **Monitor memory usage** for large documents
6. **Use type hints** for better IDE support and safety

## Contributing

These snippets follow these conventions:

- All files use `declare(strict_types=1)`
- Code is wrapped in ````php` markdown code blocks
- Clear comments explain what each snippet demonstrates
- Both OOP and procedural APIs are shown where applicable
- Examples are realistic and production-ready

## Links

- **Documentation**: <https://kreuzberg.dev>
- **GitHub**: <https://github.com/kreuzberg-dev/Kreuzberg>
- **Issues**: <https://github.com/kreuzberg-dev/kreuzberg/issues>
- **Package**: <https://packagist.org/packages/kreuzberg/Kreuzberg>