Nomad changes
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s

Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

93
packages/php/LICENSE Normal file

@@ -0,0 +1,93 @@
Elastic License 2.0 (ELv2)
Copyright 2025-2026 Kreuzberg, Inc.
Acceptance
By using the software, you agree to all of the terms and conditions below.
Copyright License
The licensor grants you a non-exclusive, royalty-free, worldwide,
non-sublicensable, non-transferable license to use, copy, distribute, make
available, and prepare derivative works of the software, in each case subject to
the limitations and conditions below.
Limitations
You may not provide the software to third parties as a hosted or managed
service, where the service provides users with access to any substantial set of
the features or functionality of the software.
You may not move, change, disable, or circumvent the license key functionality
in the software, and you may not remove or obscure any functionality in the
software that is protected by the license key.
You may not alter, remove, or obscure any licensing, copyright, or other notices
of the licensor in the software. Any use of the licensor's trademarks is subject
to applicable law.
Patents
The licensor grants you a license, under any patent claims the licensor can
license, or becomes able to license, to make, have made, use, sell, offer for
sale, import and have imported the software, in each case subject to the
limitations and conditions in this license. This license does not cover any
patent claims that you cause to be infringed by modifications or additions to the
software. If you or your company make any written claim that the software
infringes or contributes to infringement of any patent, your patent license for
the software granted under these terms ends immediately. If your company makes
such a claim, your patent license ends immediately for work on behalf of your
company.
Notices
You must ensure that anyone who gets a copy of any part of the software from you
also gets a copy of these terms.
If you modify the software, you must include in any modified copies of the
software prominent notices stating that you have modified the software.
No Other Rights
These terms do not imply any licenses other than those expressly granted in
these terms.
Termination
If you use the software in violation of these terms, such use is not licensed,
and your licenses will automatically terminate. If the licensor provides you with
a notice of your violation, and you cease all violation of this license no later
than 30 days after you receive that notice, your licenses will be reinstated
retroactively. However, if you violate these terms after such reinstatement, any
additional violation of these terms will cause your licenses to terminate
automatically and permanently.
No Liability
As far as the law allows, the software comes as is, without any warranty or
condition, and the licensor will not be liable to you for any damages arising out
of these terms or the use or nature of the software, under any kind of legal
claim.
Definitions
The licensor is the entity offering these terms, and the software is the
software the licensor makes available under these terms, including any portion
of it.
you refers to the individual or entity agreeing to these terms.
your company is any legal entity, sole proprietorship, or other kind of
organization that you work for, plus all organizations that have control over,
are under the control of, or are under common control with that organization.
control means ownership of substantially all the assets of an entity, or the
power to direct its management and policies by vote, contract, or otherwise.
Control can be direct or indirect.
your licenses are all the licenses granted to you for the software under these
terms.
use means anything you do with the software requiring one of your licenses.
trademark means trademarks, service marks, and similar rights.

907
packages/php/README.md Normal file

@@ -0,0 +1,907 @@
# PHP
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
<a href="https://github.com/kreuzberg-dev/alef">
<img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
</a>
<!-- Language Bindings -->
<a href="https://crates.io/crates/kreuzberg">
<img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
</a>
<a href="https://pypi.org/project/kreuzberg/">
<img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
</a>
<a href="https://www.npmjs.com/package/@kreuzberg/node">
<img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
</a>
<a href="https://www.npmjs.com/package/@kreuzberg/wasm">
<img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
</a>
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5">
<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v5*" alt="Go">
</a>
<a href="https://www.nuget.org/packages/Kreuzberg/">
<img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
</a>
<a href="https://packagist.org/packages/kreuzberg/kreuzberg">
<img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
</a>
<a href="https://rubygems.org/gems/kreuzberg">
<img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
</a>
<a href="https://hex.pm/packages/kreuzberg">
<img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
</a>
<a href="https://kreuzberg-dev.r-universe.dev/kreuzberg">
<img src="https://img.shields.io/badge/R-kreuzberg-007ec6" alt="R">
</a>
<a href="https://pub.dev/packages/kreuzberg">
<img src="https://img.shields.io/pub/v/kreuzberg?label=Dart&color=007ec6" alt="Dart">
</a>
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg-android">
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/swift">
<img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/zig">
<img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
<img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/kreuzberg">
<img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/charts%2Fkreuzberg">
<img src="https://img.shields.io/badge/Helm-ghcr.io-007ec6?logo=helm&logoColor=white" alt="Helm">
</a>
<!-- Project Info -->
<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
<img src="https://img.shields.io/badge/License-Elastic--2.0-007ec6" alt="License">
</a>
<a href="https://docs.kreuzberg.dev">
<img src="https://img.shields.io/badge/Docs-kreuzberg-007ec6" alt="Documentation">
</a>
<a href="https://huggingface.co/Kreuzberg">
<img src="https://img.shields.io/badge/Hugging%20Face-Kreuzberg-007ec6" alt="Hugging Face">
</a>
</div>
<div align="center" style="margin: 24px 0 0;">
<a href="https://kreuzberg.dev">
<img alt="Kreuzberg" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
</a>
</div>
<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
<a href="https://discord.gg/xt9WY3GnKR">
<img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
</a>
<a href="https://docs.kreuzberg.dev/demo.html">
<img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
</a>
</div>
Extract text, tables, images, and metadata from 90+ file formats, including PDF, Office documents, and images, and analyze source code in 300+ programming languages. These PHP bindings target PHP 8.2+ and expose a type-safe API.
## What This Package Provides
- **Document intelligence core** — extract text, tables, images, metadata, entities, keywords, and code intelligence from one API.
- **Format coverage** — PDF, Office, images, HTML/XML, email, archives, notebooks, citations, scientific formats, and plain text.
- **OCR choices** — Tesseract, PaddleOCR, EasyOCR where supported, VLM OCR through liter-llm, and plugin hooks for custom backends.
- **Same engine as every binding** — Rust, Python, Node.js, Go, Java, PHP, Ruby, .NET, Elixir, R, WASM, Kotlin Android, Swift, Dart, Zig, and C FFI share the same Rust implementation.
- **PHP package** — PHP 8.2+ API with generated types.
## Installation
### Package Installation
Install via Composer:
```bash
composer require kreuzberg/kreuzberg
```
### System Requirements
- **PHP 8.2+** required
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
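A script can check the PHP requirement above before it touches the library. The following is only a minimal sketch: the helper name `meetsMinimumPhp` is hypothetical and not part of Kreuzberg; it relies solely on PHP's built-in `version_compare`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Kreuzberg): returns true when the given
 * PHP version satisfies the 8.2+ requirement listed above.
 */
function meetsMinimumPhp(string $version, string $minimum = '8.2.0'): bool
{
    return version_compare($version, $minimum, '>=');
}

// Report the result for the interpreter that is running this script.
if (meetsMinimumPhp(PHP_VERSION)) {
    echo 'PHP ' . PHP_VERSION . " satisfies the 8.2+ requirement\n";
} else {
    echo 'PHP 8.2 or newer is required, found ' . PHP_VERSION . "\n";
}
```

The optional dependencies (ONNX Runtime, Tesseract) are not checked here, since how they are discovered depends on the platform.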
## Quick Start
### Basic Extraction
Extract text, metadata, and structure from any supported document format:
```php title="basic_extraction_oop.php"
<?php
declare(strict_types=1);

/**
 * Basic Document Extraction (OOP API)
 *
 * This example demonstrates the simplest way to extract text from a document
 * using the object-oriented API.
 */
require_once __DIR__ . '/vendor/autoload.php';

use Kreuzberg\Kreuzberg;

$kreuzberg = new Kreuzberg();
$result = $kreuzberg->extractFile('document.pdf');

echo "Extracted Content:\n";
echo "==================\n";
echo $result->content . "\n\n";

echo "Metadata:\n";
echo "=========\n";
echo "Title: " . ($result->metadata->title ?? 'N/A') . "\n";
echo "Authors: " . (isset($result->metadata->authors) ? implode(', ', $result->metadata->authors) : 'N/A') . "\n";
echo "Pages: " . ($result->metadata->pageCount ?? 'N/A') . "\n";
echo "Format: " . $result->mimeType . "\n\n";

if (count($result->tables) > 0) {
    echo "Tables Found: " . count($result->tables) . "\n";
    foreach ($result->tables as $index => $table) {
        echo "\nTable " . ($index + 1) . " (Page {$table->pageNumber}):\n";
        echo $table->markdown . "\n";
    }
}
```
### Common Use Cases
#### Extract with Custom Configuration
Most use cases benefit from configuration to control extraction behavior:
**With OCR (for scanned documents):**
```php title="basic_ocr.php"
<?php
declare(strict_types=1);
/**
* Basic OCR with Tesseract
*
* Extract text from scanned PDFs and images using Tesseract OCR.
*/
require_once __DIR__ . '/vendor/autoload.php';
use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;
$config = new ExtractionConfig(
ocr: new OcrConfig(
backend: 'tesseract',
language: 'eng'
)
);
$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile('scanned_document.pdf');
echo "OCR Extraction Results:\n";
echo str_repeat('=', 60) . "\n";
echo $result->content . "\n\n";
$multilingualConfig = new ExtractionConfig(
ocr: new OcrConfig(
backend: 'tesseract',
language: 'eng+fra+deu'
)
);
$kreuzberg = new Kreuzberg($multilingualConfig);
$result = $kreuzberg->extractFile('multilingual_scan.pdf');
echo "Multilingual OCR:\n";
echo str_repeat('=', 60) . "\n";
echo substr($result->content, 0, 500) . "...\n\n";
$imageConfig = new ExtractionConfig(
ocr: new OcrConfig(
backend: 'tesseract',
language: 'eng'
)
);
$kreuzberg = new Kreuzberg($imageConfig);
$imageFormats = ['png', 'jpg', 'tiff'];
foreach ($imageFormats as $format) {
$file = "scan.$format";
if (file_exists($file)) {
echo "Processing $file...\n";
$result = $kreuzberg->extractFile($file);
echo "Extracted " . strlen($result->content) . " characters\n";
echo "Preview: " . substr($result->content, 0, 100) . "...\n\n";
}
}
$languages = [
'spa' => 'Spanish document',
'fra' => 'French document',
'deu' => 'German document',
'ita' => 'Italian document',
'por' => 'Portuguese document',
'rus' => 'Russian document',
'jpn' => 'Japanese document',
'chi_sim' => 'Chinese (Simplified) document',
];
foreach ($languages as $lang => $description) {
$file = strtolower(str_replace(' ', '_', $description)) . '.pdf';
if (file_exists($file)) {
$config = new ExtractionConfig(
ocr: new OcrConfig(
backend: 'tesseract',
language: $lang
)
);
$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile($file);
echo "$description ($lang):\n";
echo " Characters extracted: " . mb_strlen($result->content) . "\n\n";
}
}
use function Kreuzberg\extract_file;
$config = new ExtractionConfig(
ocr: new OcrConfig(backend: 'tesseract', language: 'eng')
);
$result = extract_file('invoice_scan.pdf', config: $config);
echo "Invoice OCR:\n";
echo str_repeat('=', 60) . "\n";
echo $result->content . "\n";
$result = $kreuzberg->extractFile('scanned.pdf');
$contentLength = strlen($result->content);
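// Treat a missing or zero page count as one page so the average below cannot divide by zero.
$pageCount = max(1, $result->metadata->pageCount ?? 1);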
$avgCharsPerPage = $contentLength / $pageCount;
echo "\nOCR Quality Assessment:\n";
echo "Total characters: $contentLength\n";
echo "Pages: $pageCount\n";
echo "Average chars/page: " . number_format($avgCharsPerPage) . "\n";
if ($avgCharsPerPage < 100) {
echo "Warning: Low character count may indicate poor scan quality\n";
echo "Consider using image preprocessing or higher DPI settings.\n";
} elseif ($avgCharsPerPage > 2000) {
echo "Pass: Good - Adequate text extracted\n";
} else {
echo "Pass: Moderate - Text extracted successfully\n";
}
```
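The quality check at the end of the example above can be factored into a small pure function, which makes the thresholds easier to test. This is a sketch only: `assessOcrQuality` is a hypothetical helper, not part of Kreuzberg, and the cut-offs (below 100 and above 2000 characters per page) are simply the ones used in the example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Kreuzberg): classify OCR output by the
 * average amount of text per page, using the thresholds from the example
 * above (below 100 = poor, above 2000 = good, otherwise moderate).
 */
function assessOcrQuality(string $content, ?int $pageCount): string
{
    // Treat a missing or zero page count as a single page to avoid dividing by zero.
    $pages = max(1, $pageCount ?? 1);

    // strlen counts bytes, as in the example above; for multibyte text,
    // mb_strlen (mbstring extension) would count characters instead.
    $avgCharsPerPage = strlen($content) / $pages;

    return match (true) {
        $avgCharsPerPage < 100  => 'poor',
        $avgCharsPerPage > 2000 => 'good',
        default                 => 'moderate',
    };
}

echo assessOcrQuality(str_repeat('a', 50), 1) . "\n";   // prints "poor"
echo assessOcrQuality(str_repeat('a', 5000), 2) . "\n"; // prints "good"
```

In real use the arguments would come from an extraction result, for example `$result->content` and `$result->metadata->pageCount`.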
#### Table Extraction
See [Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/) for table extraction options.
#### Processing Multiple Files
```php title="batch_processing.php"
<?php
declare(strict_types=1);
/**
* Batch Document Processing
*
* Process multiple documents in parallel for maximum performance.
* Kreuzberg's batch API uses multiple threads to extract documents concurrently.
*/
require_once __DIR__ . '/vendor/autoload.php';
use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;
use function Kreuzberg\batch_extract_files;
use function Kreuzberg\batch_extract_bytes;
$files = [
'document1.pdf',
'document2.docx',
'document3.xlsx',
'presentation.pptx',
];
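// array_filter preserves keys; reindex so $files[$index] lines up with the results below.
$files = array_values(array_filter($files, 'file_exists'));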
if (!empty($files)) {
echo "Processing " . count($files) . " files in batch...\n\n";
$start = microtime(true);
$results = batch_extract_files($files);
$elapsed = microtime(true) - $start;
echo "Batch extraction completed in " . number_format($elapsed, 3) . " seconds\n";
echo "Average: " . number_format($elapsed / count($files), 3) . " seconds per file\n\n";
foreach ($results as $index => $result) {
$filename = basename($files[$index]);
echo "$filename:\n";
echo " Content: " . strlen($result->content) . " chars\n";
echo " Tables: " . count($result->tables) . "\n";
echo " MIME: " . $result->mimeType . "\n\n";
}
}
$config = new ExtractionConfig(
extractTables: true,
extractImages: false
);
$kreuzberg = new Kreuzberg($config);
$pdfFiles = glob('*.pdf');
if (!empty($pdfFiles)) {
echo "Processing " . count($pdfFiles) . " PDF files...\n";
$start = microtime(true);
$results = $kreuzberg->batchExtractFiles($pdfFiles, $config);
$elapsed = microtime(true) - $start;
echo "Completed in " . number_format($elapsed, 2) . " seconds\n";
echo "Throughput: " . number_format(count($pdfFiles) / $elapsed, 2) . " files/second\n\n";
$totalChars = 0;
$totalTables = 0;
foreach ($results as $result) {
$totalChars += strlen($result->content);
$totalTables += count($result->tables);
}
echo "Total content: " . number_format($totalChars) . " characters\n";
echo "Total tables: $totalTables\n";
}
$uploadedFiles = [
['data' => file_get_contents('file1.pdf'), 'mime' => 'application/pdf'],
['data' => file_get_contents('file2.docx'), 'mime' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'],
];
$dataList = array_column($uploadedFiles, 'data');
$mimeTypes = array_column($uploadedFiles, 'mime');
$results = batch_extract_bytes($dataList, $mimeTypes);
echo "\nProcessed " . count($results) . " files from memory\n";
function processDirectory(string $dir, Kreuzberg $kreuzberg): array
{
$results = [];
$iterator = new RecursiveIteratorIterator(
new RecursiveDirectoryIterator($dir)
);
$files = [];
foreach ($iterator as $file) {
if ($file->isFile()) {
$ext = strtolower($file->getExtension());
if (in_array($ext, ['pdf', 'docx', 'xlsx', 'pptx', 'txt'], true)) {
$files[] = $file->getPathname();
}
}
}
if (empty($files)) {
return $results;
}
$batches = array_chunk($files, 10);
foreach ($batches as $batchIndex => $batch) {
echo "Processing batch " . ($batchIndex + 1) . "/" . count($batches) . "...\n";
$batchResults = $kreuzberg->batchExtractFiles($batch);
$results = array_merge($results, $batchResults);
}
return $results;
}
$directory = './documents';
if (is_dir($directory)) {
echo "\nProcessing directory: $directory\n";
$results = processDirectory($directory, $kreuzberg);
echo "Processed " . count($results) . " files\n";
}
$mixedFiles = ['valid.pdf', 'nonexistent.pdf', 'another.docx'];
try {
$results = batch_extract_files($mixedFiles);
} catch (\Kreuzberg\Exceptions\KreuzbergException $e) {
echo "Batch processing error: " . $e->getMessage() . "\n";
}
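// glob() returns false on error; fall back to an empty list so array_chunk() receives an array.
$allFiles = glob('documents/*.{pdf,docx,xlsx}', GLOB_BRACE) ?: [];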
$batchSize = 5;
$batches = array_chunk($allFiles, $batchSize);
$totalProcessed = 0;
echo "\nProcessing " . count($allFiles) . " files in " . count($batches) . " batches...\n";
foreach ($batches as $index => $batch) {
$progress = (($index + 1) / count($batches)) * 100;
echo sprintf("\rProgress: %.1f%% [%d/%d batches]",
$progress, $index + 1, count($batches));
$results = $kreuzberg->batchExtractFiles($batch);
$totalProcessed += count($results);
}
echo "\n\nCompleted! Processed $totalProcessed files.\n";
```
### Next Steps
- **[Installation Guide](https://docs.kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
- **[API Documentation](https://docs.kreuzberg.dev/reference/api-python/)** - Complete API reference
- **[Examples & Guides](https://docs.kreuzberg.dev/)** - Full code examples and usage guides
- **[Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/)** - Advanced configuration options
## Features
### Supported File Formats (90+)
90+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
#### Office Documents
| Category | Formats | Capabilities |
|----------|---------|--------------|
| **Word Processing** | `.docx`, `.docm`, `.dotx`, `.dotm`, `.dot`, `.odt` | Full text, tables, images, metadata, styles |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods` | Sheet data, formulas, cell metadata, charts |
| **Presentations** | `.pptx`, `.pptm`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.ppt` | Slides, speaker notes, images, metadata |
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
| **Database** | `.dbf` | Table data extraction, field type support |
| **Hangul** | `.hwp`, `.hwpx` | Korean document format, text extraction |
#### Images (OCR-Enabled)
| Category | Formats | Features |
|----------|---------|----------|
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
#### Web & Data
| Category | Formats | Features |
|----------|---------|----------|
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, Djot, reStructuredText, Org Mode |
#### Email & Archives
| Category | Formats | Features |
|----------|---------|----------|
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
#### Academic & Scientific
| Category | Formats | Features |
|----------|---------|----------|
| **Citations** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl` | Structured parsing of RIS, PubMed/MEDLINE, EndNote XML, BibTeX, and CSL JSON |
| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
#### Code Intelligence (300+ Languages)
| Feature | Description |
|---------|-------------|
| **Structure Extraction** | Functions, classes, methods, structs, interfaces, enums |
| **Import/Export Analysis** | Module dependencies, re-exports, wildcard imports |
| **Symbol Extraction** | Variables, constants, type aliases, properties |
| **Docstring Parsing** | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| **Diagnostics** | Parse errors with line/column positions |
| **Syntax-Aware Chunking** | Split code by semantic boundaries, not arbitrary byte offsets |
Powered by [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — [documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev).
**[Complete Format Reference](https://docs.kreuzberg.dev/reference/formats/)**
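When collecting files for extraction, it can help to pre-filter them by extension. The sketch below maps a path to one of the categories in the tables above. It is illustrative only: the constant `FORMAT_CATEGORIES` and the function `formatCategory` are hypothetical names, the list covers just a subset of the extensions, and Kreuzberg performs its own format detection independently of this.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical lookup (not part of Kreuzberg), built from a subset of the
 * extensions listed in the format tables above.
 */
const FORMAT_CATEGORIES = [
    'Word Processing' => ['docx', 'docm', 'dotx', 'dotm', 'dot', 'odt'],
    'Presentations'   => ['pptx', 'pptm', 'ppsx', 'potx', 'potm', 'pot', 'ppt'],
    'PDF'             => ['pdf'],
    'Raster'          => ['png', 'jpg', 'jpeg', 'gif', 'webp', 'bmp', 'tiff', 'tif'],
    'Email'           => ['eml', 'msg'],
    'Archives'        => ['zip', 'tar', 'tgz', 'gz', '7z'],
];

/**
 * Return the category for a path based on its (case-insensitive) extension,
 * or null when the extension is not in the subset above.
 */
function formatCategory(string $path): ?string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    foreach (FORMAT_CATEGORIES as $category => $extensions) {
        if (in_array($ext, $extensions, true)) {
            return $category;
        }
    }
    return null;
}

echo formatCategory('reports/Q1.DOCX') . "\n"; // prints "Word Processing"
echo formatCategory('backup.tar.gz') . "\n";   // prints "Archives"
```

A `null` result only means the extension is missing from this illustrative subset, not that the format is unsupported.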
### Key Capabilities
- **Text Extraction** - Extract all text content with position and formatting information
- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
- **Table Extraction** - Parse tables with structure and cell content preservation
- **Image Extraction** - Extract embedded images and render page previews
- **OCR Support** - Integrate multiple OCR backends for scanned documents
- **Plugin System** - Extensible post-processing for custom text transformation
- **Embeddings** - Generate vector embeddings using ONNX Runtime models
- **Batch Processing** - Efficiently process multiple documents in parallel
- **Memory Efficient** - Stream large files without loading entirely into memory
- **Language Detection** - Detect and support multiple languages in documents
- **Code Intelligence** - Extract structure, imports, exports, symbols, and docstrings from [300+ programming languages](https://docs.tree-sitter-language-pack.kreuzberg.dev) via tree-sitter
- **Configuration** - Fine-grained control over extraction behavior
### Performance Characteristics
| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| **PDF (text)** | 10-100 MB/s | ~50MB per doc | Fastest extraction |
| **Office docs** | 20-200 MB/s | ~100MB per doc | DOCX, XLSX, PPTX |
| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
| **Archives** | 5-50 MB/s | ~200MB per doc | ZIP, TAR, etc. |
| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
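The throughput ranges in the table translate into rough time estimates by dividing the input size by the speed. The sketch below does that arithmetic; `estimateSeconds` is a hypothetical helper, and the figures are the ranges stated in the table, not independent measurements.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Kreuzberg): estimate best-case and
 * worst-case processing time in seconds from a throughput range in MB/s.
 */
function estimateSeconds(float $sizeMb, float $minMbPerSec, float $maxMbPerSec): array
{
    if ($minMbPerSec <= 0 || $maxMbPerSec < $minMbPerSec) {
        throw new InvalidArgumentException('Throughput range must be positive and ordered.');
    }

    return [
        'best'  => $sizeMb / $maxMbPerSec, // fastest speed gives the shortest time
        'worst' => $sizeMb / $minMbPerSec, // slowest speed gives the longest time
    ];
}

// 500 MB of text PDFs at the table's 10-100 MB/s range.
$estimate = estimateSeconds(500.0, 10.0, 100.0);
printf("Between %.1f and %.1f seconds\n", $estimate['best'], $estimate['worst']);
// prints "Between 5.0 and 50.0 seconds"
```

Actual times depend on hardware, document structure, and configuration, so the result is best read as an order-of-magnitude guide.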
## OCR Support
Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
- **Tesseract**
- **PaddleOCR**
### OCR Configuration Example
```php title="basic_ocr.php"
<?php
declare(strict_types=1);
/**
* Basic OCR with Tesseract
*
* Extract text from scanned PDFs and images using Tesseract OCR.
*/
require_once __DIR__ . '/vendor/autoload.php';
use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;
use Kreuzberg\Config\OcrConfig;
$config = new ExtractionConfig(
ocr: new OcrConfig(
backend: 'tesseract',
language: 'eng'
)
);
$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile('scanned_document.pdf');
echo "OCR Extraction Results:\n";
echo str_repeat('=', 60) . "\n";
echo $result->content . "\n\n";
$multilingualConfig = new ExtractionConfig(
ocr: new OcrConfig(
backend: 'tesseract',
language: 'eng+fra+deu'
)
);
$kreuzberg = new Kreuzberg($multilingualConfig);
$result = $kreuzberg->extractFile('multilingual_scan.pdf');
echo "Multilingual OCR:\n";
echo str_repeat('=', 60) . "\n";
echo substr($result->content, 0, 500) . "...\n\n";
$imageConfig = new ExtractionConfig(
ocr: new OcrConfig(
backend: 'tesseract',
language: 'eng'
)
);
$kreuzberg = new Kreuzberg($imageConfig);
$imageFormats = ['png', 'jpg', 'tiff'];
foreach ($imageFormats as $format) {
$file = "scan.$format";
if (file_exists($file)) {
echo "Processing $file...\n";
$result = $kreuzberg->extractFile($file);
echo "Extracted " . strlen($result->content) . " characters\n";
echo "Preview: " . substr($result->content, 0, 100) . "...\n\n";
}
}
$languages = [
'spa' => 'Spanish document',
'fra' => 'French document',
'deu' => 'German document',
'ita' => 'Italian document',
'por' => 'Portuguese document',
'rus' => 'Russian document',
'jpn' => 'Japanese document',
'chi_sim' => 'Chinese (Simplified) document',
];
foreach ($languages as $lang => $description) {
$file = strtolower(str_replace(' ', '_', $description)) . '.pdf';
if (file_exists($file)) {
$config = new ExtractionConfig(
ocr: new OcrConfig(
backend: 'tesseract',
language: $lang
)
);
$kreuzberg = new Kreuzberg($config);
$result = $kreuzberg->extractFile($file);
echo "$description ($lang):\n";
echo " Characters extracted: " . mb_strlen($result->content) . "\n\n";
}
}
use function Kreuzberg\extract_file;
$config = new ExtractionConfig(
ocr: new OcrConfig(backend: 'tesseract', language: 'eng')
);
$result = extract_file('invoice_scan.pdf', config: $config);
echo "Invoice OCR:\n";
echo str_repeat('=', 60) . "\n";
echo $result->content . "\n";
$result = $kreuzberg->extractFile('scanned.pdf');
$contentLength = strlen($result->content);
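// Treat a missing or zero page count as one page so the average below cannot divide by zero.
$pageCount = max(1, $result->metadata->pageCount ?? 1);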
$avgCharsPerPage = $contentLength / $pageCount;
echo "\nOCR Quality Assessment:\n";
echo "Total characters: $contentLength\n";
echo "Pages: $pageCount\n";
echo "Average chars/page: " . number_format($avgCharsPerPage) . "\n";
if ($avgCharsPerPage < 100) {
echo "Warning: Low character count may indicate poor scan quality\n";
echo "Consider using image preprocessing or higher DPI settings.\n";
} elseif ($avgCharsPerPage > 2000) {
echo "Pass: Good - Adequate text extracted\n";
} else {
echo "Pass: Moderate - Text extracted successfully\n";
}
```
## Plugin System
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
For detailed plugin documentation, visit [Plugin System Guide](https://docs.kreuzberg.dev/guides/plugins/).
## Embeddings Support
Generate vector embeddings for extracted text using the built-in ONNX Runtime support. Requires ONNX Runtime installation.
**[Embeddings Guide](https://docs.kreuzberg.dev/features/#embeddings)**
## Batch Processing
Process multiple documents efficiently:
```php title="batch_processing.php"
<?php
declare(strict_types=1);
/**
* Batch Document Processing
*
* Process multiple documents in parallel for maximum performance.
* Kreuzberg's batch API uses multiple threads to extract documents concurrently.
*/
require_once __DIR__ . '/vendor/autoload.php';
use Kreuzberg\Kreuzberg;
use Kreuzberg\Config\ExtractionConfig;
use function Kreuzberg\batch_extract_files;
use function Kreuzberg\batch_extract_bytes;
$files = [
'document1.pdf',
'document2.docx',
'document3.xlsx',
'presentation.pptx',
];
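// array_filter preserves keys; reindex so $files[$index] lines up with the results below.
$files = array_values(array_filter($files, 'file_exists'));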
if (!empty($files)) {
echo "Processing " . count($files) . " files in batch...\n\n";
$start = microtime(true);
$results = batch_extract_files($files);
$elapsed = microtime(true) - $start;
echo "Batch extraction completed in " . number_format($elapsed, 3) . " seconds\n";
echo "Average: " . number_format($elapsed / count($files), 3) . " seconds per file\n\n";
foreach ($results as $index => $result) {
$filename = basename($files[$index]);
echo "$filename:\n";
echo " Content: " . strlen($result->content) . " chars\n";
echo " Tables: " . count($result->tables) . "\n";
echo " MIME: " . $result->mimeType . "\n\n";
}
}
$config = new ExtractionConfig(
extractTables: true,
extractImages: false
);
$kreuzberg = new Kreuzberg($config);
$pdfFiles = glob('*.pdf');
if (!empty($pdfFiles)) {
echo "Processing " . count($pdfFiles) . " PDF files...\n";
$start = microtime(true);
$results = $kreuzberg->batchExtractFiles($pdfFiles, $config);
$elapsed = microtime(true) - $start;
echo "Completed in " . number_format($elapsed, 2) . " seconds\n";
echo "Throughput: " . number_format(count($pdfFiles) / $elapsed, 2) . " files/second\n\n";
$totalChars = 0;
$totalTables = 0;
foreach ($results as $result) {
$totalChars += strlen($result->content);
$totalTables += count($result->tables);
}
echo "Total content: " . number_format($totalChars) . " characters\n";
echo "Total tables: $totalTables\n";
}
$uploadedFiles = [
['data' => file_get_contents('file1.pdf'), 'mime' => 'application/pdf'],
['data' => file_get_contents('file2.docx'), 'mime' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'],
];
$dataList = array_column($uploadedFiles, 'data');
$mimeTypes = array_column($uploadedFiles, 'mime');
$results = batch_extract_bytes($dataList, $mimeTypes);
echo "\nProcessed " . count($results) . " files from memory\n";
function processDirectory(string $dir, Kreuzberg $kreuzberg): array
{
$results = [];
$iterator = new RecursiveIteratorIterator(
new RecursiveDirectoryIterator($dir)
);
$files = [];
foreach ($iterator as $file) {
if ($file->isFile()) {
$ext = strtolower($file->getExtension());
if (in_array($ext, ['pdf', 'docx', 'xlsx', 'pptx', 'txt'], true)) {
$files[] = $file->getPathname();
}
}
}
if (empty($files)) {
return $results;
}
$batches = array_chunk($files, 10);
foreach ($batches as $batchIndex => $batch) {
echo "Processing batch " . ($batchIndex + 1) . "/" . count($batches) . "...\n";
$batchResults = $kreuzberg->batchExtractFiles($batch);
$results = array_merge($results, $batchResults);
}
return $results;
}
$directory = './documents';
if (is_dir($directory)) {
echo "\nProcessing directory: $directory\n";
$results = processDirectory($directory, $kreuzberg);
echo "Processed " . count($results) . " files\n";
}
$mixedFiles = ['valid.pdf', 'nonexistent.pdf', 'another.docx'];
try {
$results = batch_extract_files($mixedFiles);
} catch (\Kreuzberg\Exceptions\KreuzbergException $e) {
echo "Batch processing error: " . $e->getMessage() . "\n";
}
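// glob() returns false on error; fall back to an empty list so array_chunk() receives an array.
$allFiles = glob('documents/*.{pdf,docx,xlsx}', GLOB_BRACE) ?: [];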
$batchSize = 5;
$batches = array_chunk($allFiles, $batchSize);
$totalProcessed = 0;
echo "\nProcessing " . count($allFiles) . " files in " . count($batches) . " batches...\n";
foreach ($batches as $index => $batch) {
$progress = (($index + 1) / count($batches)) * 100;
echo sprintf("\rProgress: %.1f%% [%d/%d batches]",
$progress, $index + 1, count($batches));
$results = $kreuzberg->batchExtractFiles($batch);
$totalProcessed += count($results);
}
echo "\n\nCompleted! Processed $totalProcessed files.\n";
```
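The chunked loops in the example above can be wrapped in a helper that also keeps each result paired with its source path. This is a minimal sketch under two assumptions: `processInBatches` is a hypothetical function, not part of Kreuzberg, and the batch extractor is assumed to return results in the same order as its input, as the example's `$files[$index]` lookup also assumes. The extractor is passed in as a callable so the helper can be exercised without the library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Kreuzberg): split paths into batches,
 * run $extractBatch on each batch, and return results keyed by path.
 * Assumes $extractBatch returns one result per path, in input order.
 */
function processInBatches(array $paths, int $batchSize, callable $extractBatch): array
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }

    $resultsByPath = [];
    // array_chunk reindexes each batch from 0 by default, so gaps left by
    // array_filter or other key-preserving functions do not matter here.
    foreach (array_chunk($paths, $batchSize) as $batch) {
        $batchResults = array_values($extractBatch($batch));
        foreach ($batch as $i => $path) {
            $resultsByPath[$path] = $batchResults[$i] ?? null;
        }
    }

    return $resultsByPath;
}

// Stand-in extractor for demonstration. With the library installed, one could
// pass fn (array $batch) => $kreuzberg->batchExtractFiles($batch) instead.
$fakeExtractor = fn (array $batch): array => array_map('strtoupper', $batch);

$results = processInBatches([3 => 'a.pdf', 7 => 'b.docx', 9 => 'c.xlsx'], 2, $fakeExtractor);
print_r($results);
```

Keying by path avoids the index mismatch that appears when the input list has non-sequential keys.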
## Configuration
For advanced configuration options including language detection, table extraction, OCR settings, and more:
**[Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/)**
## Documentation
- **[Official Documentation](https://docs.kreuzberg.dev/)**
- **[API Reference](https://docs.kreuzberg.dev/reference/api-python/)**
- **[Examples & Guides](https://docs.kreuzberg.dev/)**
## Contributing
Contributions are welcome! See [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
## Part of Kreuzberg.dev
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
## License
Elastic-2.0 License — see [LICENSE](../../LICENSE) for details.
## Support
- **Discord Community**: [Join our Discord](https://discord.gg/xt9WY3GnKR)
- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)


@@ -0,0 +1,41 @@
{
    "name": "kreuzberg-dev/kreuzberg",
    "description": "High-performance document intelligence library",
    "license": "Elastic-2.0",
    "type": "php-ext",
    "require": {
        "php": ">=8.2"
    },
    "require-dev": {
        "phpstan/phpstan": "^2.1",
        "friendsofphp/php-cs-fixer": "^3.95",
        "phpunit/phpunit": "^13.1"
    },
    "autoload": {
        "psr-4": {
            "Kreuzberg\\": "src/"
        }
    },
    "scripts": {
        "phpstan": "php -d detect_unicode=0 vendor/bin/phpstan --configuration=phpstan.neon --memory-limit=512M",
        "format": "php vendor/bin/php-cs-fixer fix --quiet",
        "format:check": "php vendor/bin/php-cs-fixer fix --dry-run --quiet",
        "test": "php vendor/bin/phpunit",
        "lint": "@phpstan",
        "lint:fix": "php vendor/bin/php-cs-fixer fix --quiet && php -d detect_unicode=0 vendor/bin/phpstan --configuration=phpstan.neon --memory-limit=512M"
    },
    "php-ext": {
        "extension-name": "kreuzberg",
        "support-zts": true,
        "support-nts": true,
        "download-url-method": ["pre-packaged-binary", "composer-default"]
    },
    "keywords": ["document", "extraction", "pdf", "ocr", "text"],
    "extra": {
        "pie": {
            "binary": {
                "url-template": "https://github.com/kreuzberg-dev/kreuzberg/releases/download/v{Version}/php_kreuzberg-{Version}_php{PhpVersion}-{Arch}-{OS}-{Libc}-{TSMode}.tgz"
            }
        }
    }
}

4696
packages/php/composer.lock generated Normal file

File diff suppressed because it is too large


@@ -0,0 +1,25 @@
<?php

declare(strict_types=1);

$finder = (new PhpCsFixer\Finder())
    ->in(array_filter([
        __DIR__ . '/src',
        is_dir(__DIR__ . '/tests') ? __DIR__ . '/tests' : null,
    ]));

return (new PhpCsFixer\Config())
    ->setRules([
        '@PSR12' => true,
        '@PHP82Migration' => true,
        'array_syntax' => ['syntax' => 'short'],
        'single_quote' => true,
        'trailing_comma_in_multiline' => [
            'elements' => ['arrays', 'arguments', 'parameters'],
        ],
        'declare_strict_types' => true,
        'ordered_imports' => ['sort_algorithm' => 'alpha'],
        'no_unused_imports' => true,
    ])
    ->setFinder($finder)
    ->setRiskyAllowed(true);


@@ -0,0 +1,2 @@
parameters:
    ignoreErrors: []

12
packages/php/phpstan.neon Normal file

@@ -0,0 +1,12 @@
includes:
    - phpstan-baseline.neon

parameters:
    level: max
    paths:
        - src
    scanFiles:
        - stubs/kreuzberg_extension.php
    treatPhpDocTypesAsCertain: false
    reportUnmatchedIgnoredErrors: false
    tmpDir: var/cache/phpstan


@@ -0,0 +1,64 @@
<?php
declare(strict_types=1);
namespace Kreuzberg;
/**
* Plugin interface for DocumentExtractor.
*
* Implement this interface and register an instance with the corresponding
* registration function to provide custom behavior for extraction.
*/
interface DocumentExtractor
{
/**
* Extract content from a byte array.
*
* @param mixed $content
* @param string $mime_type
* @param mixed $config
* @return mixed Return value from the plugin method
*/
public function extract_bytes(mixed $content, string $mime_type, mixed $config): mixed;
/**
* Extract content from a file.
*
* @param mixed $path
* @param string $mime_type
* @param mixed $config
* @return mixed Return value from the plugin method
*/
public function extract_file(mixed $path, string $mime_type, mixed $config): mixed;
/**
* Get the list of MIME types supported by this extractor.
*
* @return mixed Return value from the plugin method
*/
public function supported_mime_types(): mixed;
/**
* Get the priority of this extractor.
*
* @return mixed Return value from the plugin method
*/
public function priority(): mixed;
/**
* Optional: Check if this extractor can handle a specific file.
*
* @param mixed $_path
* @param string $_mime_type
* @return mixed Return value from the plugin method
*/
public function can_handle(mixed $_path, string $_mime_type): mixed;
}


@@ -0,0 +1,33 @@
<?php
declare(strict_types=1);
namespace Kreuzberg;
/**
* Plugin interface for EmbeddingBackend.
*
* Implement this interface and register an instance with the corresponding
* registration function to provide custom behavior for extraction.
*/
interface EmbeddingBackend
{
/**
* Embedding vector dimension. Must be `> 0` and must match the length of
* each vector returned by `embed()`.
*
* @return mixed Return value from the plugin method
*/
public function dimensions(): mixed;
/**
* Embed a batch of texts, returning one vector per input in order.
*
* @param mixed $texts
* @return mixed Return value from the plugin method
*/
public function embed(mixed $texts): mixed;
}


@@ -0,0 +1,587 @@
<?php
// This file is auto-generated by alef — DO NOT EDIT.
// alef:hash:4e15143f4af1ae8bafbdb1506ef057da924484c66a19483966333558ad437e75
// To regenerate: alef generate
// To verify freshness: alef verify --exit-code
// Issues & docs: https://github.com/kreuzberg-dev/alef
declare(strict_types=1);
namespace Kreuzberg;
final class Kreuzberg
{
/**
* Extract content from a byte array.
*
* This is the main entry point for in-memory extraction. It performs the following steps:
* 1. Validate MIME type
* 2. Handle legacy format conversion if needed
* 3. Select appropriate extractor from registry
* 4. Extract content
* 5. Run post-processing pipeline
*
* @param string $content
* @param string $mime_type
* @param ?ExtractionConfig $config Uses a default ExtractionConfig when null
* @return ExtractionResult
* @throws \Kreuzberg\KreuzbergException
*/
public static function extractBytes(
string $content, string $mime_type, ?ExtractionConfig $config = null): ExtractionResult
{
return \Kreuzberg\KreuzbergApi::extractBytes($content, $mime_type, $config ?? new ExtractionConfig()); // delegate to native extension class
}
/**
* Extract content from a file.
*
* This is the main entry point for file-based extraction. It performs the following steps:
* 1. Check cache for existing result (if caching enabled)
* 2. Detect or validate MIME type
* 3. Select appropriate extractor from registry
* 4. Extract content
* 5. Run post-processing pipeline
* 6. Store result in cache (if caching enabled)
*
* @param string $path
* @param ?string $mime_type
* @param ?ExtractionConfig $config Uses a default ExtractionConfig when null
* @return ExtractionResult
* @throws \Kreuzberg\KreuzbergException
*/
public static function extractFile(
string $path, ?string $mime_type = null, ?ExtractionConfig $config = null): ExtractionResult
{
return \Kreuzberg\KreuzbergApi::extractFile($path, $mime_type, $config ?? new ExtractionConfig()); // delegate to native extension class
}
/**
* Synchronous wrapper for `extractFile`.
*
* This is a convenience function that blocks the current thread until extraction completes.
* For async code, use `extractFile` directly.
*
* Uses the global Tokio runtime for 100x+ performance improvement over creating
* a new runtime per call. Always uses the global runtime to avoid nested runtime issues.
*
* This function is only available with the `tokio-runtime` feature. For WASM targets,
* use a truly synchronous extraction approach instead.
*
* @param string $path
* @param ?string $mime_type
* @param ?ExtractionConfig $config Uses a default ExtractionConfig when null
* @return ExtractionResult
* @throws \Kreuzberg\KreuzbergException
*/
public static function extractFileSync(
string $path, ?string $mime_type = null, ?ExtractionConfig $config = null): ExtractionResult
{
return \Kreuzberg\KreuzbergApi::extractFileSync($path, $mime_type, $config ?? new ExtractionConfig()); // delegate to native extension class
}
/**
* Synchronous wrapper for `extractBytes`.
*
* Uses the global Tokio runtime for 100x+ performance improvement over creating
* a new runtime per call.
*
* With the `tokio-runtime` feature, this blocks the current thread using the global
* Tokio runtime. Without it (WASM), this calls a truly synchronous implementation.
*
* @param string $content
* @param string $mime_type
* @param ?ExtractionConfig $config Uses a default ExtractionConfig when null
* @return ExtractionResult
* @throws \Kreuzberg\KreuzbergException
*/
public static function extractBytesSync(
string $content, string $mime_type, ?ExtractionConfig $config = null): ExtractionResult
{
return \Kreuzberg\KreuzbergApi::extractBytesSync($content, $mime_type, $config ?? new ExtractionConfig()); // delegate to native extension class
}
/**
* Synchronous wrapper for `batchExtractFiles`.
*
* Uses the global Tokio runtime for optimal performance.
* Only available with `tokio-runtime` (WASM has no filesystem).
*
* @param array<BatchFileItem> $items
* @param ?ExtractionConfig $config Uses a default ExtractionConfig when null
* @return array<ExtractionResult>
* @throws \Kreuzberg\KreuzbergException
*/
public static function batchExtractFilesSync(
array $items, ?ExtractionConfig $config = null): array
{
return \Kreuzberg\KreuzbergApi::batchExtractFilesSync($items, $config ?? new ExtractionConfig()); // delegate to native extension class
}
/**
* Synchronous wrapper for `batchExtractBytes`.
*
* Uses the global Tokio runtime for optimal performance.
* With the `tokio-runtime` feature, this blocks the current thread using the global
* Tokio runtime. Without it (WASM), this calls a truly synchronous implementation
* that iterates through items and calls `extract_bytes_sync()`.
*
* @param array<BatchBytesItem> $items
* @param ?ExtractionConfig $config Uses a default ExtractionConfig when null
* @return array<ExtractionResult>
* @throws \Kreuzberg\KreuzbergException
*/
public static function batchExtractBytesSync(
array $items, ?ExtractionConfig $config = null): array
{
return \Kreuzberg\KreuzbergApi::batchExtractBytesSync($items, $config ?? new ExtractionConfig()); // delegate to native extension class
}
/**
* Extract content from multiple files concurrently.
*
* This function processes multiple files in parallel, automatically managing
* concurrency to prevent resource exhaustion. The concurrency limit can be
* configured via `ExtractionConfig::max_concurrent_extractions` or defaults
* to `(num_cpus * 1.5).ceil()`.
*
* Each file can optionally specify a `FileExtractionConfig` that overrides specific
* fields from the batch-level `config`. Pass `null` for a file to use the batch defaults.
* Batch-level settings like `max_concurrent_extractions` and `use_cache` are always
* taken from the batch-level `config`.
*
* @param array<BatchFileItem> $items
* @param ?ExtractionConfig $config Uses a default ExtractionConfig when null
* @return array<ExtractionResult>
* @throws \Kreuzberg\KreuzbergException
*/
public static function batchExtractFiles(
array $items, ?ExtractionConfig $config = null): array
{
return \Kreuzberg\KreuzbergApi::batchExtractFiles($items, $config ?? new ExtractionConfig()); // delegate to native extension class
}
/**
* Extract content from multiple byte arrays concurrently.
*
* This function processes multiple byte arrays in parallel, automatically managing
* concurrency to prevent resource exhaustion. The concurrency limit can be
* configured via `ExtractionConfig::max_concurrent_extractions` or defaults
* to `(num_cpus * 1.5).ceil()`.
*
* Each item can optionally specify a `FileExtractionConfig` that overrides specific
* fields from the batch-level `config`. Pass `null` as the config to use
* the batch-level defaults for that item.
*
* @param array<BatchBytesItem> $items
* @param ?ExtractionConfig $config Uses a default ExtractionConfig when null
* @return array<ExtractionResult>
* @throws \Kreuzberg\KreuzbergException
*/
public static function batchExtractBytes(
array $items, ?ExtractionConfig $config = null): array
{
return \Kreuzberg\KreuzbergApi::batchExtractBytes($items, $config ?? new ExtractionConfig()); // delegate to native extension class
}
/**
* Detect MIME type from raw file bytes.
*
* Uses magic byte signatures to detect file type from content.
* Falls back to `infer` crate for comprehensive detection.
*
* For ZIP-based files, inspects contents to distinguish Office Open XML
* formats (DOCX, XLSX, PPTX) from plain ZIP archives.
*
* @param string $content
* @return string
* @throws \Kreuzberg\KreuzbergException
*/
public static function detectMimeTypeFromBytes(
string $content): string
{
return \Kreuzberg\KreuzbergApi::detectMimeTypeFromBytes($content); // delegate to native extension class
}
/**
* Get file extensions for a given MIME type.
*
* Returns all known file extensions that map to the specified MIME type.
*
* @param string $mime_type
* @return array<string>
* @throws \Kreuzberg\KreuzbergException
*/
public static function getExtensionsForMime(
string $mime_type): array
{
return \Kreuzberg\KreuzbergApi::getExtensionsForMime($mime_type); // delegate to native extension class
}
/**
* List the names of all registered embedding backends.
*
* Used by `kreuzberg-cli`, the api/mcp endpoints, and generated language
* bindings.
*
* @return array<string>
* @throws \Kreuzberg\KreuzbergException
*/
public static function listEmbeddingBackends(
): array
{
return \Kreuzberg\KreuzbergApi::listEmbeddingBackends(); // delegate to native extension class
}
/**
* List names of all registered document extractors.
*
* @return array<string>
* @throws \Kreuzberg\KreuzbergException
*/
public static function listDocumentExtractors(
): array
{
return \Kreuzberg\KreuzbergApi::listDocumentExtractors(); // delegate to native extension class
}
/**
* List all registered OCR backends.
*
* Returns the names of all OCR backends currently registered in the global registry.
*
* @return array<string>
* @throws \Kreuzberg\KreuzbergException
*/
public static function listOcrBackends(
): array
{
return \Kreuzberg\KreuzbergApi::listOcrBackends(); // delegate to native extension class
}
/**
* List all registered post-processor names.
*
* Returns a vector of all post-processor names currently registered in the
* global registry.
*
* @return array<string>
* @throws \Kreuzberg\KreuzbergException
*/
public static function listPostProcessors(
): array
{
return \Kreuzberg\KreuzbergApi::listPostProcessors(); // delegate to native extension class
}
/**
* List names of all registered renderers.
*
* @return array<string>
* @throws \Kreuzberg\KreuzbergException
*/
public static function listRenderers(
): array
{
return \Kreuzberg\KreuzbergApi::listRenderers(); // delegate to native extension class
}
/**
* List names of all registered validators.
*
* @return array<string>
* @throws \Kreuzberg\KreuzbergException
*/
public static function listValidators(
): array
{
return \Kreuzberg\KreuzbergApi::listValidators(); // delegate to native extension class
}
/**
* Compare two extraction results and return a structured diff.
*
* The comparison is purely structural: no I/O, no side effects. All fields
* of `ExtractionDiff` are populated according to the provided `DiffOptions`.
*
* @param ExtractionResult $a
* @param ExtractionResult $b
* @param DiffOptions $opts
* @return ExtractionDiff
*/
public static function compare(
ExtractionResult $a, ExtractionResult $b, DiffOptions $opts): ExtractionDiff
{
return \Kreuzberg\KreuzbergApi::compare($a, $b, $opts); // delegate to native extension class
}
/**
* Generate embeddings asynchronously for a list of text strings.
*
* This is the async counterpart to `embedTexts`. It offloads the blocking
* ONNX inference work to a dedicated blocking thread pool via Tokio's
* `spawn_blocking`, keeping the async executor free.
*
* Returns one embedding vector per input text in the same order.
*
* @param array<string> $texts
* @param ?EmbeddingConfig $config Uses a default EmbeddingConfig when null
* @return array<array<float>>
* @throws \Kreuzberg\KreuzbergException
*/
public static function embedTextsAsync(
array $texts, ?EmbeddingConfig $config = null): array
{
return \Kreuzberg\KreuzbergApi::embedTextsAsync($texts, $config ?? new EmbeddingConfig()); // delegate to native extension class
}
/**
* Render a single PDF page to PNG bytes.
*
* Returns raw PNG-encoded bytes for the specified page at the given DPI.
* Uses pdf_oxide with tiny-skia for pure-Rust rendering.
*
* @param string $pdf_bytes
* @param int $page_index
* @param ?int $dpi
* @param ?string $password
* @return string
* @throws \Kreuzberg\KreuzbergException
*/
public static function renderPdfPageToPng(
string $pdf_bytes, int $page_index, ?int $dpi = null, ?string $password = null): string
{
return \Kreuzberg\KreuzbergApi::renderPdfPageToPng($pdf_bytes, $page_index, $dpi, $password); // delegate to native extension class
}
/**
* Detect the MIME type of a file at the given path.
*
* Uses the file extension and optionally the file content to determine the MIME type.
* Set `check_exists` to `true` to verify the file exists before detection.
*
* @param string $path
* @param bool $check_exists
* @return string
* @throws \Kreuzberg\KreuzbergException
*/
public static function detectMimeType(
string $path, bool $check_exists): string
{
return \Kreuzberg\KreuzbergApi::detectMimeType($path, $check_exists); // delegate to native extension class
}
/**
* Embed a list of texts using the configured embedding model.
*
* Returns a 2D vector where each inner vector is the embedding for the corresponding text.
*
* @param array<string> $texts
* @param ?EmbeddingConfig $config Uses a default EmbeddingConfig when null
* @return array<array<float>>
* @throws \Kreuzberg\KreuzbergException
*/
public static function embedTexts(
array $texts, ?EmbeddingConfig $config = null): array
{
return \Kreuzberg\KreuzbergApi::embedTexts($texts, $config ?? new EmbeddingConfig()); // delegate to native extension class
}
/**
* Get an embedding preset by name.
*
* Returns `null` if no preset with the given name exists. Returns an owned
* clone so the value is safe to pass across FFI boundaries.
*
* @param string $name
* @return ?EmbeddingPreset
*/
public static function getEmbeddingPreset(
string $name): ?EmbeddingPreset
{
return \Kreuzberg\KreuzbergApi::getEmbeddingPreset($name); // delegate to native extension class
}
/**
* List the names of all available embedding presets.
*
* Returns owned `String`s so the values are safe to pass across FFI boundaries.
*
* @return array<string>
*/
public static function listEmbeddingPresets(
): array
{
return \Kreuzberg\KreuzbergApi::listEmbeddingPresets(); // delegate to native extension class
}
/**
* Register a custom OCR backend.
*
* @param OcrBackend $backend
* @return void
*/
public static function registerOcrBackend(
OcrBackend $backend) : void
{
\Kreuzberg\KreuzbergApi::registerOcrBackend($backend); // delegate to native extension class
}
/**
* Unregister the OCR backend registered under the given name.
*
* @param string $name
* @return void
*/
public static function unregisterOcrBackend(
string $name) : void
{
\Kreuzberg\KreuzbergApi::unregisterOcrBackend($name); // delegate to native extension class
}
/**
* Remove all registered OCR backends.
*
* @return void
*/
public static function clearOcrBackends(
) : void
{
\Kreuzberg\KreuzbergApi::clearOcrBackends(); // delegate to native extension class
}
/**
* Register a custom post-processor.
*
* @param PostProcessor $backend
* @return void
*/
public static function registerPostProcessor(
PostProcessor $backend) : void
{
\Kreuzberg\KreuzbergApi::registerPostProcessor($backend); // delegate to native extension class
}
/**
* Unregister the post-processor registered under the given name.
*
* @param string $name
* @return void
*/
public static function unregisterPostProcessor(
string $name) : void
{
\Kreuzberg\KreuzbergApi::unregisterPostProcessor($name); // delegate to native extension class
}
/**
* Remove all registered post-processors.
*
* @return void
*/
public static function clearPostProcessors(
) : void
{
\Kreuzberg\KreuzbergApi::clearPostProcessors(); // delegate to native extension class
}
/**
* Register a custom validator.
*
* @param Validator $backend
* @return void
*/
public static function registerValidator(
Validator $backend) : void
{
\Kreuzberg\KreuzbergApi::registerValidator($backend); // delegate to native extension class
}
/**
* Unregister the validator registered under the given name.
*
* @param string $name
* @return void
*/
public static function unregisterValidator(
string $name) : void
{
\Kreuzberg\KreuzbergApi::unregisterValidator($name); // delegate to native extension class
}
/**
* Remove all registered validators.
*
* @return void
*/
public static function clearValidators(
) : void
{
\Kreuzberg\KreuzbergApi::clearValidators(); // delegate to native extension class
}
/**
* Register a custom embedding backend.
*
* @param EmbeddingBackend $backend
* @return void
*/
public static function registerEmbeddingBackend(
EmbeddingBackend $backend) : void
{
\Kreuzberg\KreuzbergApi::registerEmbeddingBackend($backend); // delegate to native extension class
}
/**
* Unregister the embedding backend registered under the given name.
*
* @param string $name
* @return void
*/
public static function unregisterEmbeddingBackend(
string $name) : void
{
\Kreuzberg\KreuzbergApi::unregisterEmbeddingBackend($name); // delegate to native extension class
}
/**
* Remove all registered embedding backends.
*
* @return void
*/
public static function clearEmbeddingBackends(
) : void
{
\Kreuzberg\KreuzbergApi::clearEmbeddingBackends(); // delegate to native extension class
}
/**
* Register a custom document extractor.
*
* @param DocumentExtractor $backend
* @return void
*/
public static function registerDocumentExtractor(
DocumentExtractor $backend) : void
{
\Kreuzberg\KreuzbergApi::registerDocumentExtractor($backend); // delegate to native extension class
}
/**
* Unregister the document extractor registered under the given name.
*
* @param string $name
* @return void
*/
public static function unregisterDocumentExtractor(
string $name) : void
{
\Kreuzberg\KreuzbergApi::unregisterDocumentExtractor($name); // delegate to native extension class
}
/**
* Remove all registered document extractors.
*
* @return void
*/
public static function clearDocumentExtractors(
) : void
{
\Kreuzberg\KreuzbergApi::clearDocumentExtractors(); // delegate to native extension class
}
/**
* Register a custom renderer.
*
* @param Renderer $backend
* @return void
*/
public static function registerRenderer(
Renderer $backend) : void
{
\Kreuzberg\KreuzbergApi::registerRenderer($backend); // delegate to native extension class
}
/**
* Unregister the renderer registered under the given name.
*
* @param string $name
* @return void
*/
public static function unregisterRenderer(
string $name) : void
{
\Kreuzberg\KreuzbergApi::unregisterRenderer($name); // delegate to native extension class
}
/**
* Remove all registered renderers.
*
* @return void
*/
public static function clearRenderers(
) : void
{
\Kreuzberg\KreuzbergApi::clearRenderers(); // delegate to native extension class
}
}


@@ -0,0 +1,87 @@
<?php
declare(strict_types=1);
namespace Kreuzberg;
/**
* Plugin interface for OcrBackend.
*
* Implement this interface and register an instance with the corresponding
* registration function to provide custom behavior for extraction.
*/
interface OcrBackend
{
/**
* Process an image and extract text via OCR.
*
* @param mixed $image_bytes
* @param mixed $config
* @return mixed Return value from the plugin method
*/
public function process_image(mixed $image_bytes, mixed $config): mixed;
/**
* Process a file and extract text via OCR.
*
* @param mixed $path
* @param mixed $config
* @return mixed Return value from the plugin method
*/
public function process_image_file(mixed $path, mixed $config): mixed;
/**
* Check if this backend supports a given language code.
*
* @param string $lang
* @return mixed Return value from the plugin method
*/
public function supports_language(string $lang): mixed;
/**
* Get the backend type identifier.
*
* @return mixed Return value from the plugin method
*/
public function backend_type(): mixed;
/**
* Optional: Get a list of all supported languages.
*
* @return mixed Return value from the plugin method
*/
public function supported_languages(): mixed;
/**
* Optional: Check if the backend supports table detection.
*
* @return mixed Return value from the plugin method
*/
public function supports_table_detection(): mixed;
/**
* Check if the backend supports direct document-level processing (e.g. for PDFs).
*
* @return mixed Return value from the plugin method
*/
public function supports_document_processing(): mixed;
/**
* Process a document file directly via OCR.
*
* @param mixed $_path
* @param mixed $_config
* @return mixed Return value from the plugin method
*/
public function process_document(mixed $_path, mixed $_config): mixed;
}


@@ -0,0 +1,61 @@
<?php
declare(strict_types=1);
namespace Kreuzberg;
/**
* Plugin interface for PostProcessor.
*
* Implement this interface and register an instance with the corresponding
* registration function to provide custom behavior for extraction.
*/
interface PostProcessor
{
/**
* Process an extraction result.
*
* @param mixed $result
* @param mixed $config
* @return mixed Return value from the plugin method
*/
public function process(mixed $result, mixed $config): mixed;
/**
* Get the processing stage for this post-processor.
*
* @return mixed Return value from the plugin method
*/
public function processing_stage(): mixed;
/**
* Optional: Check if this processor should run for a given result.
*
* @param mixed $_result
* @param mixed $_config
* @return mixed Return value from the plugin method
*/
public function should_process(mixed $_result, mixed $_config): mixed;
/**
* Optional: Estimate processing time in milliseconds.
*
* @param mixed $_result
* @return mixed Return value from the plugin method
*/
public function estimated_duration_ms(mixed $_result): mixed;
/**
* Execution priority within the processing stage.
*
* @return mixed Return value from the plugin method
*/
public function priority(): mixed;
}


@@ -0,0 +1,25 @@
<?php
declare(strict_types=1);
namespace Kreuzberg;
/**
* Plugin interface for Renderer.
*
* Implement this interface and register an instance with the corresponding
* registration function to provide custom behavior for extraction.
*/
interface Renderer
{
/**
* Render an `InternalDocument` to the output format.
*
* @param mixed $doc
* @return mixed Return value from the plugin method
*/
public function render(mixed $doc): mixed;
}


@@ -0,0 +1,44 @@
<?php
declare(strict_types=1);
namespace Kreuzberg;
/**
* Plugin interface for Validator.
*
* Implement this interface and register an instance with the corresponding
* registration function to provide custom behavior for extraction.
*/
interface Validator
{
/**
* Validate an extraction result.
*
* @param mixed $result
* @param mixed $config
* @return mixed Return value from the plugin method
*/
public function validate(mixed $result, mixed $config): mixed;
/**
* Optional: Check if this validator should run for a given result.
*
* @param mixed $_result
* @param mixed $_config
* @return mixed Return value from the plugin method
*/
public function should_validate(mixed $_result, mixed $_config): mixed;
/**
* Optional: Get the validation priority.
*
* @return mixed Return value from the plugin method
*/
public function priority(): mixed;
}

File diff suppressed because it is too large