---
title: "Java API Reference"
---

## Java API Reference <span class="version-badge">v5.0.0-rc.3</span>

### Functions

#### extractBytes()

Extract content from a byte array.

This is the main entry point for in-memory extraction. It performs the following steps:

1. Validate MIME type
2. Handle legacy format conversion if needed
3. Select appropriate extractor from registry
4. Extract content
5. Run post-processing pipeline
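
The steps above can be pictured with a small, self-contained sketch. Everything in it is hypothetical and exists only for illustration: the `PipelineSketch` class, the map-based registry, and the use of plain strings as results are assumptions, not the library's implementation. Step 2 (legacy format conversion) is left out.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for the in-memory extraction pipeline; not the library's code. */
final class PipelineSketch {
    // Hypothetical registry: MIME type -> extractor that turns bytes into text.
    private final Map<String, Function<byte[], String>> extractors;
    // Hypothetical post-processors, applied in order to the extracted text.
    private final List<UnaryOperator<String>> postProcessors;

    PipelineSketch(Map<String, Function<byte[], String>> extractors,
                   List<UnaryOperator<String>> postProcessors) {
        this.extractors = extractors;
        this.postProcessors = postProcessors;
    }

    String extract(byte[] content, String mimeType) {
        // Step 1: validate the MIME type (here: a non-empty "type/subtype" shape).
        if (mimeType == null || !mimeType.matches("[^/\\s]+/[^/\\s]+")) {
            throw new IllegalArgumentException("invalid MIME type: " + mimeType);
        }
        // Step 3: select an extractor from the registry.
        Function<byte[], String> extractor = extractors.get(mimeType);
        if (extractor == null) {
            throw new UnsupportedOperationException("unsupported format: " + mimeType);
        }
        // Step 4: extract content.
        String text = extractor.apply(content);
        // Step 5: run the post-processing pipeline.
        for (UnaryOperator<String> processor : postProcessors) {
            text = processor.apply(text);
        }
        return text;
    }
}
```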

**Returns:**

An `ExtractionResult` containing the extracted content and metadata.

**Errors:**

- `KreuzbergError.Validation` if the MIME type is invalid.
- `KreuzbergError.UnsupportedFormat` if the MIME type is not supported.

**Signature:**

```java
public static ExtractionResult extractBytes(byte[] content, String mimeType, ExtractionConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `content` | `byte[]` | Yes | The byte array to extract |
| `mimeType` | `String` | Yes | MIME type of the content |
| `config` | `ExtractionConfig` | Yes | Extraction configuration |

**Returns:** `ExtractionResult`
**Errors:** Throws `ErrorException`.

---

#### extractFile()

Extract content from a file.

This is the main entry point for file-based extraction. It performs the following steps:

1. Check cache for existing result (if caching enabled)
2. Detect or validate MIME type
3. Select appropriate extractor from registry
4. Extract content
5. Run post-processing pipeline
6. Store result in cache (if caching enabled)

**Returns:**

An `ExtractionResult` containing the extracted content and metadata.

**Errors:**

- `KreuzbergError.Io` if the file doesn't exist (NotFound) or for other file I/O errors.
- `KreuzbergError.UnsupportedFormat` if the MIME type is not supported.

**Signature:**

```java
public static ExtractionResult extractFile(String path, String mimeType, ExtractionConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `path` | `String` | Yes | Path to the file to extract |
| `mimeType` | `Optional<String>` | No | Optional MIME type override. Auto-detected when not provided |
| `config` | `ExtractionConfig` | Yes | Extraction configuration |

**Returns:** `ExtractionResult`
**Errors:** Throws `ErrorException`.

---

#### extractFileSync()

Synchronous wrapper for `extractFile()`.

This is a convenience function that blocks the current thread until extraction completes.
For async code, use `extractFile()` directly.

Uses the global Tokio runtime for 100x+ performance improvement over creating
a new runtime per call. Always uses the global runtime to avoid nested runtime issues.

This function is only available with the `tokio-runtime` feature. For WASM targets,
use a truly synchronous extraction approach instead.

**Signature:**

```java
public static ExtractionResult extractFileSync(String path, String mimeType, ExtractionConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `path` | `String` | Yes | Path to the file |
| `mimeType` | `Optional<String>` | No | Optional MIME type override |
| `config` | `ExtractionConfig` | Yes | Extraction configuration |

**Returns:** `ExtractionResult`
**Errors:** Throws `ErrorException`.

---

#### extractBytesSync()

Synchronous wrapper for `extractBytes()`.

Uses the global Tokio runtime for 100x+ performance improvement over creating
a new runtime per call.

With the `tokio-runtime` feature, this blocks the current thread using the global
Tokio runtime. Without it (WASM), this calls a truly synchronous implementation.

**Signature:**

```java
public static ExtractionResult extractBytesSync(byte[] content, String mimeType, ExtractionConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `content` | `byte[]` | Yes | The byte array to extract |
| `mimeType` | `String` | Yes | MIME type of the content |
| `config` | `ExtractionConfig` | Yes | Extraction configuration |

**Returns:** `ExtractionResult`
**Errors:** Throws `ErrorException`.

---

#### batchExtractFilesSync()

Synchronous wrapper for `batchExtractFiles()`.

Uses the global Tokio runtime for optimal performance.
Only available with `tokio-runtime` (WASM has no filesystem).

**Signature:**

```java
public static List<ExtractionResult> batchExtractFilesSync(List<BatchFileItem> items, ExtractionConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `List<BatchFileItem>` | Yes | The files to extract, each with a path and optional per-file configuration overrides |
| `config` | `ExtractionConfig` | Yes | Batch-level extraction configuration |

**Returns:** `List<ExtractionResult>`
**Errors:** Throws `ErrorException`.

---

#### batchExtractBytesSync()

Synchronous wrapper for `batchExtractBytes()`.

Uses the global Tokio runtime for optimal performance.
With the `tokio-runtime` feature, this blocks the current thread using the global
Tokio runtime. Without it (WASM), this calls a truly synchronous implementation
that iterates through items and calls `extractBytesSync()`.

**Signature:**

```java
public static List<ExtractionResult> batchExtractBytesSync(List<BatchBytesItem> items, ExtractionConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `List<BatchBytesItem>` | Yes | The items to extract, each with content bytes, a MIME type, and optional per-item configuration overrides |
| `config` | `ExtractionConfig` | Yes | Batch-level extraction configuration |

**Returns:** `List<ExtractionResult>`
**Errors:** Throws `ErrorException`.

---

#### batchExtractFiles()

Extract content from multiple files concurrently.

This function processes multiple files in parallel, automatically managing
concurrency to prevent resource exhaustion. The concurrency limit can be
configured via `ExtractionConfig.max_concurrent_extractions`; otherwise it defaults
to `(num_cpus * 1.5).ceil()`.
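
As a worked example of the default above, the following sketch computes `ceil(num_cpus * 1.5)` in Java. The `ConcurrencyDefault` class and its method names are hypothetical, and treating `null` as "not configured" is an assumption made for the sketch.

```java
/** Sketch of the documented default: ceil(num_cpus * 1.5). Names are hypothetical. */
final class ConcurrencyDefault {
    static int defaultLimit(int cpuCount) {
        return (int) Math.ceil(cpuCount * 1.5);
    }

    /** An explicit setting wins; null stands for "not configured" (assumption). */
    static int effectiveLimit(Integer configured, int cpuCount) {
        return configured != null ? configured : defaultLimit(cpuCount);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("default limit on this machine: " + defaultLimit(cpus));
    }
}
```

For instance, 4 CPUs give a limit of 6 and 3 CPUs give a limit of 5.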

Each file can optionally specify a `FileExtractionConfig` that overrides specific
fields from the batch-level `config`. Pass `null` for a file to use the batch defaults.
Batch-level settings like `max_concurrent_extractions` and `use_cache` are always
taken from the batch-level `config`.

- `items` - List of `BatchFileItem` objects, each containing a path and optional
  per-file configuration overrides.
- `config` - Batch-level extraction configuration (provides defaults and batch settings)

**Returns:**

A list of `ExtractionResult` in the same order as the input items.

**Errors:**

Individual file errors are captured in the result metadata. System errors
(IO, RuntimeError equivalents) will bubble up and fail the entire batch.

**Signature:**

```java
public static List<ExtractionResult> batchExtractFiles(List<BatchFileItem> items, ExtractionConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `List<BatchFileItem>` | Yes | List of `BatchFileItem` objects, each containing a path and optional per-file configuration overrides |
| `config` | `ExtractionConfig` | Yes | Batch-level extraction configuration (provides defaults and batch settings) |

**Returns:** `List<ExtractionResult>`
**Errors:** Throws `ErrorException`.

---

#### batchExtractBytes()

Extract content from multiple byte arrays concurrently.

This function processes multiple byte arrays in parallel, automatically managing
concurrency to prevent resource exhaustion. The concurrency limit can be
configured via `ExtractionConfig.max_concurrent_extractions`; otherwise it defaults
to `(num_cpus * 1.5).ceil()`.

Each item can optionally specify a `FileExtractionConfig` that overrides specific
fields from the batch-level `config`. Pass `null` as the config to use
the batch-level defaults for that item.

- `items` - List of `BatchBytesItem` objects, each containing content bytes,
  MIME type, and optional per-item configuration overrides.
- `config` - Batch-level extraction configuration

**Returns:**

A list of `ExtractionResult` in the same order as the input items.
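
One common way to combine bounded concurrency with input-ordered results is to submit every item to a fixed-size pool and then collect the futures in submission order. The sketch below shows that general technique with the standard `java.util.concurrent` API. It is not the library's implementation: `OrderedBatch` is a hypothetical name, and unlike the behavior described for batch extraction, a failing item here fails the whole call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper: bounded parallelism with results in input order. */
final class OrderedBatch {
    static <I, O> List<O> run(List<I> inputs, Function<I, O> work, int maxConcurrent) {
        // At most maxConcurrent items are processed at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        try {
            List<Future<O>> futures = new ArrayList<>(inputs.size());
            for (I input : inputs) {
                Callable<O> task = () -> work.apply(input);
                futures.add(pool.submit(task));
            }
            // Collecting in submission order keeps output aligned with input,
            // regardless of which task finishes first.
            List<O> results = new ArrayList<>(inputs.size());
            for (Future<O> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("batch interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("batch item failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```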

**Signature:**

```java
public static List<ExtractionResult> batchExtractBytes(List<BatchBytesItem> items, ExtractionConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `List<BatchBytesItem>` | Yes | List of `BatchBytesItem` objects, each containing content bytes, MIME type, and optional per-item configuration overrides |
| `config` | `ExtractionConfig` | Yes | Batch-level extraction configuration |

**Returns:** `List<ExtractionResult>`
**Errors:** Throws `ErrorException`.

---

#### detectMimeTypeFromBytes()

Detect MIME type from raw file bytes.

Uses magic byte signatures to detect file type from content.
Falls back to the `infer` crate for comprehensive detection.

For ZIP-based files, inspects contents to distinguish Office Open XML
formats (DOCX, XLSX, PPTX) from plain ZIP archives.
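
To illustrate what a magic-byte check looks like, here is a minimal sketch covering three well-known signatures (PDF, PNG, ZIP). The `MagicBytes` class is hypothetical and far less complete than the detection described above; in particular, it does not look inside ZIP containers.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical magic-byte sniffer for a handful of formats; illustration only. */
final class MagicBytes {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    static Optional<String> sniff(byte[] content) {
        if (startsWith(content, PDF)) return Optional.of("application/pdf");
        if (startsWith(content, PNG)) return Optional.of("image/png");
        // A fuller detector would inspect the archive entries at this point
        // to tell DOCX/XLSX/PPTX apart from a plain ZIP archive.
        if (startsWith(content, ZIP)) return Optional.of("application/zip");
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```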

**Returns:**

The detected MIME type string.

**Errors:**

- `KreuzbergError.UnsupportedFormat` if the MIME type cannot be determined.

**Signature:**

```java
public static String detectMimeTypeFromBytes(byte[] content) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `content` | `byte[]` | Yes | Raw file bytes |

**Returns:** `String`
**Errors:** Throws `ErrorException`.

---

#### getExtensionsForMime()

Get file extensions for a given MIME type.

Returns all known file extensions that map to the specified MIME type.

**Returns:**

A list of file extensions (without leading dot) for the MIME type.

**Signature:**

```java
public static List<String> getExtensionsForMime(String mimeType) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `mimeType` | `String` | Yes | The MIME type to look up |

**Returns:** `List<String>`
**Errors:** Throws `ErrorException`.

---

#### clearEmbeddingBackends()

Clear all embedding backends from the global registry.

Calls `shutdown()` on every registered backend, then empties the registry.

**Errors:**

- Any error returned by a backend's `shutdown()` method. The first error
  encountered stops processing of remaining backends.
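
The shutdown-then-clear behavior can be sketched as follows. `BackendRegistrySketch` and its `Backend` interface are hypothetical; only the `shutdown()` call and the stop-at-first-error rule come from the description above. Leaving the registry untouched when a shutdown fails is an assumption of this sketch, since the text does not say what happens to the remaining entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical registry illustrating "shut down everything, then clear". */
final class BackendRegistrySketch {
    /** Hypothetical backend interface; only shutdown() is taken from the text. */
    interface Backend {
        void shutdown() throws Exception;
    }

    private final Map<String, Backend> backends = new LinkedHashMap<>();

    synchronized void register(String name, Backend backend) {
        backends.put(name, backend);
    }

    synchronized int size() {
        return backends.size();
    }

    /** Shuts down each backend in registration order, then empties the registry. */
    synchronized void clear() throws Exception {
        for (Backend backend : backends.values()) {
            // The first failure propagates; later backends are not shut down.
            backend.shutdown();
        }
        // Reached only when every shutdown() succeeded (assumption of this sketch).
        backends.clear();
    }
}
```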

**Signature:**

```java
public static void clearEmbeddingBackends() throws Error
```

**Returns:** `void`
**Errors:** Throws `ErrorException`.

---

#### listEmbeddingBackends()

List the names of all registered embedding backends.

Used by `kreuzberg-cli`, the API/MCP endpoints, and generated language
bindings.

**Signature:**

```java
public static List<String> listEmbeddingBackends() throws Error
```

**Returns:** `List<String>`
**Errors:** Throws `ErrorException`.

---

#### listDocumentExtractors()

List names of all registered document extractors.

**Signature:**

```java
public static List<String> listDocumentExtractors() throws Error
```

**Returns:** `List<String>`
**Errors:** Throws `ErrorException`.

---

#### clearDocumentExtractors()

Clear all document extractors from the global registry.

Calls `shutdown()` on every registered extractor, then empties the registry.

**Errors:**

- Any error returned by an extractor's `shutdown()` method. The first error
  encountered stops processing of remaining extractors.

**Signature:**

```java
public static void clearDocumentExtractors() throws Error
```

**Returns:** `void`
**Errors:** Throws `ErrorException`.

---

#### listOcrBackends()

List all registered OCR backends.

Returns the names of all OCR backends currently registered in the global registry.

**Returns:**

A list of OCR backend names.

**Signature:**

```java
public static List<String> listOcrBackends() throws Error
```

**Returns:** `List<String>`
**Errors:** Throws `ErrorException`.

---

#### clearOcrBackends()

Clear all OCR backends from the global registry.

Removes all OCR backends and calls their `shutdown()` methods.

**Returns:**

- Returns normally if all backends were cleared successfully
- Throws if any `shutdown()` method failed

**Signature:**

```java
public static void clearOcrBackends() throws Error
```

**Returns:** `void`
**Errors:** Throws `ErrorException`.

---

#### listPostProcessors()

List all registered post-processor names.

Returns a list of all post-processor names currently registered in the
global registry.

**Returns:**

- A list of post-processor names on success
- Throws if the registry lock is poisoned

**Signature:**

```java
public static List<String> listPostProcessors() throws Error
```

**Returns:** `List<String>`
**Errors:** Throws `ErrorException`.

---

#### clearPostProcessors()

Remove all registered post-processors.

**Signature:**

```java
public static void clearPostProcessors() throws Error
```

**Returns:** `void`
**Errors:** Throws `ErrorException`.

---

#### listRenderers()

List names of all registered renderers.

**Errors:**

Returns an error if the registry lock is poisoned.

**Signature:**

```java
public static List<String> listRenderers() throws Error
```

**Returns:** `List<String>`
**Errors:** Throws `ErrorException`.

---

#### clearRenderers()

Clear all renderers from the global registry.

Removes every renderer, including the built-in defaults (markdown, html,
djot, plain). After calling this, no renderers are registered; re-register
as needed.

**Errors:**

Returns an error if the registry lock is poisoned.

**Signature:**

```java
public static void clearRenderers() throws Error
```

**Returns:** `void`
**Errors:** Throws `ErrorException`.

---

#### listValidators()

List names of all registered validators.

**Signature:**

```java
public static List<String> listValidators() throws Error
```

**Returns:** `List<String>`
**Errors:** Throws `ErrorException`.

---

#### clearValidators()

Remove all registered validators.

**Signature:**

```java
public static void clearValidators() throws Error
```

**Returns:** `void`
**Errors:** Throws `ErrorException`.

---

#### compare()

Compare two extraction results and return a structured diff.

The comparison is purely structural: no I/O, no side effects. All fields
of `ExtractionDiff` are populated according to the provided `DiffOptions`.

**Signature:**

```java
public static ExtractionDiff compare(ExtractionResult a, ExtractionResult b, DiffOptions opts)
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `a` | `ExtractionResult` | Yes | The first extraction result to compare |
| `b` | `ExtractionResult` | Yes | The second extraction result to compare |
| `opts` | `DiffOptions` | Yes | Options controlling how the diff is computed |

**Returns:** `ExtractionDiff`

---

#### embedTextsAsync()

Generate embeddings asynchronously for a list of text strings.

This is the async counterpart to `embedTexts()`. It offloads the blocking
ONNX inference work to a dedicated blocking thread pool via Tokio's
`spawn_blocking`, keeping the async executor free.

Returns one embedding vector per input text in the same order.

**Errors:**

- `KreuzbergError.MissingDependency` if ONNX Runtime is not installed
- `KreuzbergError.Embedding` if the preset name is unknown, model download fails,
  or the blocking inference task panics

**Signature:**

```java
public static List<List<Float>> embedTextsAsync(List<String> texts, EmbeddingConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `texts` | `List<String>` | Yes | List of strings to embed (owned, sent to blocking thread) |
| `config` | `EmbeddingConfig` | Yes | Embedding configuration specifying model, batch size, and normalization |

**Returns:** `List<List<Float>>`
**Errors:** Throws `ErrorException`.

---

#### renderPdfPageToPng()

Render a single PDF page to PNG bytes.

Returns raw PNG-encoded bytes for the specified page at the given DPI.
Uses pdf_oxide with tiny-skia for pure-Rust rendering.
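
To give a feel for how the DPI setting affects output size: PDF page dimensions are expressed in points, with 72 points per inch, so a pixel dimension is roughly `points * dpi / 72`. The sketch below applies that arithmetic. `RenderSize` is a hypothetical helper, and the rounding the renderer actually applies is an assumption.

```java
/** Hypothetical helper: approximate pixel size of a PDF page at a given DPI. */
final class RenderSize {
    /** PDF page sizes are expressed in points; one point is 1/72 inch. */
    static final double POINTS_PER_INCH = 72.0;

    static int pixels(double points, int dpi) {
        // Rounding to the nearest pixel is an assumption of this sketch.
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }
}
```

With these assumptions, a US Letter page (612 x 792 points) at the default 150 DPI comes out at 1275 x 1650 pixels.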

**Errors:**

- `KreuzbergError.Parsing` if the PDF cannot be opened, authenticated,
  or rendered, or if `pageIndex` is out of range.

**Signature:**

```java
public static byte[] renderPdfPageToPng(byte[] pdfBytes, long pageIndex, int dpi, String password) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `pdfBytes` | `byte[]` | Yes | Raw PDF file bytes |
| `pageIndex` | `long` | Yes | Zero-based page index |
| `dpi` | `Optional<Integer>` | No | Resolution in dots per inch (default: 150) |
| `password` | `Optional<String>` | No | Optional password for encrypted PDFs |

**Returns:** `byte[]`
**Errors:** Throws `ErrorException`.

---

#### detectMimeType()

Detect the MIME type of a file at the given path.

Uses the file extension and optionally the file content to determine the MIME type.
Set `checkExists` to `true` to verify the file exists before detection.

**Signature:**

```java
public static String detectMimeType(String path, boolean checkExists) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `path` | `String` | Yes | Path to the file |
| `checkExists` | `boolean` | Yes | Whether to verify that the file exists before detection |

**Returns:** `String`
**Errors:** Throws `ErrorException`.

---

#### embedTexts()

Embed a list of texts using the configured embedding model.

Returns a list of lists where each inner list is the embedding for the corresponding text.
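
A typical next step with such embeddings is comparing two of them. The sketch below computes cosine similarity over `List<Float>` values, matching the documented return type. The `Cosine` class is hypothetical and not part of the API; returning 0 for a zero-length vector is a convention chosen for this sketch.

```java
import java.util.List;

/** Hypothetical helper: cosine similarity between two embedding vectors. */
final class Cosine {
    static double similarity(List<Float> a, List<Float> b) {
        if (a.size() != b.size()) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.size(); i++) {
            double x = a.get(i);
            double y = b.get(i);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        // Convention for this sketch: a zero vector is similar to nothing.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```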

**Signature:**

```java
public static List<List<Float>> embedTexts(List<String> texts, EmbeddingConfig config) throws Error
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `texts` | `List<String>` | Yes | The texts to embed |
| `config` | `EmbeddingConfig` | Yes | Embedding configuration |

**Returns:** `List<List<Float>>`
**Errors:** Throws `ErrorException`.

---

#### getEmbeddingPreset()

Get an embedding preset by name.

Returns `null` if no preset with the given name exists. Returns an owned
clone so the value is safe to pass across FFI boundaries.

**Signature:**

```java
public static Optional<EmbeddingPreset> getEmbeddingPreset(String name)
```

**Parameters:**

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `name` | `String` | Yes | Name of the preset to look up |

**Returns:** `Optional<EmbeddingPreset>`

---

#### listEmbeddingPresets()

List the names of all available embedding presets.

Returns owned `String`s so the values are safe to pass across FFI boundaries.

**Signature:**

```java
public static List<String> listEmbeddingPresets()
```

**Returns:** `List<String>`

---

### Types

#### AccelerationConfig

Hardware acceleration configuration for ONNX Runtime models.

Controls which execution provider (CPU, CoreML, CUDA, TensorRT) is used
for inference in layout detection and embedding generation.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `provider` | `ExecutionProviderType` | `ExecutionProviderType.AUTO` | Execution provider to use for ONNX inference. |
| `deviceId` | `int` | — | GPU device ID (for CUDA/TensorRT). Ignored for CPU/CoreML/Auto. |

---

#### ArchiveEntry

A single file extracted from an archive.

When archives (ZIP, TAR, 7Z, GZIP) are extracted with recursive extraction
enabled, each processable file produces its own full `ExtractionResult`.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `path` | `String` | — | Archive-relative file path (e.g. "folder/document.pdf"). |
| `mimeType` | `String` | — | Detected MIME type of the file. |
| `result` | `ExtractionResult` | — | Full extraction result for this file. |

---

#### ArchiveMetadata

Archive (ZIP/TAR/7Z) metadata.

Extracted from compressed archive files; contains the file list and size information.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `format` | `String` | — | Archive format ("ZIP", "TAR", "7Z", etc.) |
| `fileCount` | `int` | — | Total number of files in the archive |
| `fileList` | `List<String>` | `Collections.emptyList()` | List of file paths within the archive |
| `totalSize` | `long` | — | Total uncompressed size in bytes |
| `compressedSize` | `Optional<Long>` | `null` | Compressed size in bytes (if available) |

---

#### BBox

Bounding box in original image coordinates: (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `x1` | `float` | — | X-coordinate of the top-left corner |
| `y1` | `float` | — | Y-coordinate of the top-left corner |
| `x2` | `float` | — | X-coordinate of the bottom-right corner |
| `y2` | `float` | — | Y-coordinate of the bottom-right corner |

---

#### BatchBytesItem

Batch item for byte array extraction.

Used with `batchExtractBytes()` and `batchExtractBytesSync()`
to represent a single item in a batch extraction job.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `byte[]` | — | The content bytes to extract from |
| `mimeType` | `String` | — | MIME type of the content (e.g., "application/pdf", "text/html") |
| `config` | `Optional<FileExtractionConfig>` | `null` | Per-item configuration overrides (`null` uses batch-level defaults) |

---

#### BatchFileItem

Batch item for file extraction.

Used with `batchExtractFiles()` and `batchExtractFilesSync()`
to represent a single file in a batch extraction job.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `path` | `String` | — | Path to the file to extract from |
| `config` | `Optional<FileExtractionConfig>` | `null` | Per-file configuration overrides (`null` uses batch-level defaults) |

---

#### BibtexMetadata

BibTeX bibliography metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `entryCount` | `long` | — | Number of entries in the bibliography. |
| `citationKeys` | `List<String>` | `Collections.emptyList()` | Citation keys |
| `authors` | `List<String>` | `Collections.emptyList()` | Authors |
| `yearRange` | `Optional<YearRange>` | `null` | Year range |
| `entryTypes` | `Optional<Map<String, Long>>` | `Collections.emptyMap()` | Entry types |

---

#### BoundingBox

Bounding box coordinates for element positioning.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `x0` | `double` | — | Left x-coordinate |
| `y0` | `double` | — | Bottom y-coordinate |
| `x1` | `double` | — | Right x-coordinate |
| `y1` | `double` | — | Top y-coordinate |
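
The field layout above suggests a coordinate system whose y-axis grows upward, unlike image coordinates where y grows downward. The sketch below mirrors the four fields and flips a box into a top-left-origin system for a given page height. The `Boxes` class, the nested `BoundingBox` mirror, and `toTopLeftOrigin` are hypothetical, and the assumption that `y0 <= y1` with a bottom-left origin is inferred from the field descriptions rather than stated by them.

```java
/** Hypothetical helpers around the documented BoundingBox fields. */
final class Boxes {
    /** Mirrors the documented fields: x0 left, y0 bottom, x1 right, y1 top. */
    static final class BoundingBox {
        final double x0;
        final double y0;
        final double x1;
        final double y1;

        BoundingBox(double x0, double y0, double x1, double y1) {
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }

        double width() {
            return x1 - x0;
        }

        double height() {
            return y1 - y0;
        }
    }

    /**
     * Returns {left, top, right, bottom} measured from the top-left corner,
     * assuming the input uses a bottom-left origin with y growing upward.
     */
    static double[] toTopLeftOrigin(BoundingBox box, double pageHeight) {
        return new double[] {box.x0, pageHeight - box.y1, box.x1, pageHeight - box.y0};
    }
}
```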

---

#### CacheStats

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `totalFiles` | `long` | — | Total number of files |
| `totalSizeMb` | `double` | — | Total size in MB |
| `availableSpaceMb` | `double` | — | Available space in MB |
| `oldestFileAgeDays` | `double` | — | Age of the oldest file in days |
| `newestFileAgeDays` | `double` | — | Age of the newest file in days |

---

#### CellChange

A single changed cell within a table.

Defined here (rather than only in `crate.diff`) so `RevisionDelta` can
reference it unconditionally, without requiring the `diff` Cargo feature.
`crate.diff` re-exports this type verbatim.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `row` | `long` | — | Zero-based row index. |
| `col` | `long` | — | Zero-based column index. |
| `from` | `String` | — | Value before the change. |
| `to` | `String` | — | Value after the change. |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
#### Chunk

A text chunk with optional embedding and metadata.

Chunks are created when chunking is enabled in `ExtractionConfig`. Each chunk
contains the text content, optional embedding vector (if embedding generation
is configured), and metadata about its position in the document.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | The text content of this chunk. |
| `chunkType` | `ChunkType` | `/* serde(default) */` | Semantic structural classification of this chunk. Assigned by the heuristic classifier based on content patterns and heading context. Defaults to `ChunkType.Unknown` when no rule matches. |
| `embedding` | `Optional<List<Float>>` | `null` | Optional embedding vector for this chunk. Only populated when `EmbeddingConfig` is provided in chunking configuration. The dimensionality depends on the chosen embedding model. |
| `metadata` | `ChunkMetadata` | — | Metadata about this chunk's position and properties. |

---

#### ChunkMetadata

Metadata about a chunk's position in the original document.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `byteStart` | `long` | — | Byte offset where this chunk starts in the original text (UTF-8 valid boundary). |
| `byteEnd` | `long` | — | Byte offset where this chunk ends in the original text (UTF-8 valid boundary). |
| `tokenCount` | `Optional<Long>` | `null` | Number of tokens in this chunk (if available). This is calculated by the embedding model's tokenizer if embeddings are enabled. |
| `chunkIndex` | `long` | — | Zero-based index of this chunk in the document. |
| `totalChunks` | `long` | — | Total number of chunks in the document. |
| `firstPage` | `Optional<Integer>` | `null` | First page number this chunk spans (1-indexed). Only populated when page tracking is enabled in extraction configuration. |
| `lastPage` | `Optional<Integer>` | `null` | Last page number this chunk spans (1-indexed, equal to `firstPage` for single-page chunks). Only populated when page tracking is enabled in extraction configuration. |
| `headingContext` | `Optional<HeadingContext>` | `/* serde(default) */` | Heading context when using Markdown chunker. Contains the heading hierarchy this chunk falls under. Only populated when `ChunkerType.Markdown` is used. |
| `imageIndices` | `List<Integer>` | `/* serde(default) */` | Indices into `ExtractionResult.images` for images on pages covered by this chunk. Contains zero-based indices into the top-level `images` collection for every image whose `page_number` falls within `[firstPage, lastPage]`. Empty when image extraction is disabled or the chunk spans no pages with images. |

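Because `byteStart` and `byteEnd` are byte offsets into the UTF-8 encoding rather than `String` indices, recovering a chunk's text from the original means slicing the encoded bytes. The sketch below shows one way to do that with the standard library only. `slice` is a hypothetical helper, and treating the range as half-open (`byteEnd` exclusive) is an assumption; the table above does not state it.

```java
import java.nio.charset.StandardCharsets;

final class ChunkSliceSketch {
    /**
     * Returns the text covered by the byte range [byteStart, byteEnd) of the
     * UTF-8 encoding of {@code text}. Assumes both offsets fall on UTF-8
     * character boundaries, as the field descriptions state.
     */
    static String slice(String text, long byteStart, long byteEnd) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        int start = Math.toIntExact(byteStart);
        int end = Math.toIntExact(byteEnd);
        return new String(utf8, start, end - start, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00e9 and \u00f6 each occupy two bytes in UTF-8, so byte offsets
        // and String indices diverge after the first accented character.
        String text = "h\u00e9llo w\u00f6rld";
        System.out.println(slice(text, 0, 6));
        System.out.println(slice(text, 7, 13));
    }
}
```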
---
#### ChunkingConfig

Chunking configuration.

Configures text chunking for document content, including chunk size,
overlap, trimming behavior, and optional embeddings.

Start from `defaultOptions()` when constructing, to allow for future field additions.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `maxCharacters` | `long` | `1000` | Maximum size per chunk (in units determined by `sizing`). When `sizing` is `Characters` (default), this is the max character count. When using token-based sizing, this is the max token count. Default: 1000 |
| `overlap` | `long` | `200` | Overlap between chunks (in units determined by `sizing`). Default: 200 |
| `trim` | `boolean` | `true` | Whether to trim whitespace from chunk boundaries. Default: true |
| `chunkerType` | `ChunkerType` | `ChunkerType.TEXT` | Type of chunker to use (Text or Markdown). Default: Text |
| `embedding` | `Optional<EmbeddingConfig>` | `null` | Optional embedding configuration for chunk embeddings. |
| `preset` | `Optional<String>` | `null` | Use a preset configuration (overrides individual settings if provided). |
| `sizing` | `ChunkSizing` | `ChunkSizing.CHARACTERS` | How to measure chunk size. Default: `Characters` (Unicode character count). Enable `chunking-tiktoken` or `chunking-tokenizers` features for token-based sizing. |
| `prependHeadingContext` | `boolean` | `false` | When `true` and `chunkerType` is `Markdown`, prepend the heading hierarchy path (e.g. `"# Title > ## Section\n\n"`) to each chunk's content string. This is useful for RAG pipelines where each chunk needs self-contained context about its position in the document structure. Default: `false` |
| `topicThreshold` | `Optional<Float>` | `null` | Optional cosine similarity threshold for semantic topic boundary detection. Only used when `chunkerType` is `Semantic` and an `EmbeddingConfig` is provided. You almost never need to set this. When omitted, defaults to `0.75`, which works well for most documents. Lower values detect more topic boundaries (more, smaller chunks); higher values detect fewer. Range: `0.0` to `1.0` inclusive. |

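To make the interaction between `maxCharacters` and `overlap` concrete, here is a minimal sliding-window sketch. It is not the library's chunker: `OverlapChunkSketch` and `chunk` are hypothetical names, the window is measured in Java `char` units (UTF-16 code units) rather than Unicode characters, and no trimming or boundary detection is attempted. It only shows that each window starts `maxCharacters - overlap` units after the previous one.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapChunkSketch {
    /**
     * Splits text into windows of at most maxCharacters chars, where each
     * window repeats the last `overlap` chars of the previous one.
     */
    static List<String> chunk(String text, int maxCharacters, int overlap) {
        if (maxCharacters <= 0 || overlap < 0 || overlap >= maxCharacters) {
            throw new IllegalArgumentException("require 0 <= overlap < maxCharacters");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxCharacters - overlap; // distance between window starts
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxCharacters, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window already reaches the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Window of 4 with overlap 1: starts at 0, 3, 6.
        System.out.println(chunk("abcdefghij", 4, 1)); // prints "[abcd, defg, ghij]"
    }
}
```

With the documented defaults (`1000` and `200`), a scheme like this would start a new window every 800 units.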
### Methods

#### defaultOptions()

**Signature:**

```java
public static ChunkingConfig defaultOptions()
```

---

#### CitationMetadata

Citation file metadata (RIS, PubMed, EndNote).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `citationCount` | `long` | — | Number of citations |
| `format` | `Optional<String>` | `null` | Format |
| `authors` | `List<String>` | `Collections.emptyList()` | Authors |
| `yearRange` | `Optional<YearRange>` | `null` | Year range |
| `dois` | `List<String>` | `Collections.emptyList()` | DOIs |
| `keywords` | `List<String>` | `Collections.emptyList()` | Keywords |

---

#### ContentFilterConfig

Cross-extractor content filtering configuration.

Controls whether "furniture" content (headers, footers, page numbers,
watermarks, repeating text) is included in or stripped from extraction
results. Applies across all extractors (PDF, DOCX, RTF, ODT, HTML, etc.)
with format-specific implementation.

When `null` on `ExtractionConfig`, each extractor uses its current
default behavior unchanged.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `includeHeaders` | `boolean` | `false` | Include running headers in extraction output. PDF: disables top-margin furniture stripping and prevents the layout model from treating `PageHeader`-classified regions as furniture. DOCX: includes document headers in text output. RTF/ODT: headers are already included, so setting this to `true` is a no-op. HTML/EPUB: keeps `<header>` element content. Default: `false` (headers are stripped or excluded). |
| `includeFooters` | `boolean` | `false` | Include running footers in extraction output. PDF: disables bottom-margin furniture stripping and prevents the layout model from treating `PageFooter`-classified regions as furniture. DOCX: includes document footers in text output. RTF/ODT: footers are already included, so setting this to `true` is a no-op. HTML/EPUB: keeps `<footer>` element content. Default: `false` (footers are stripped or excluded). |
| `stripRepeatingText` | `boolean` | `true` | Enable the heuristic cross-page repeating text detector. When `true` (default), text that repeats verbatim across a supermajority of pages is classified as furniture and stripped. Disable this if brand names or repeated headings are being incorrectly removed by the heuristic. Note: when a layout-detection model is active, the model may independently classify page-header / page-footer regions as furniture on a per-page basis. To preserve those regions, set `includeHeaders` to `true`, `includeFooters` to `true`, or both, in addition to disabling this flag. Primarily affects PDF extraction. Default: `true`. |
| `includeWatermarks` | `boolean` | `false` | Include watermark text in extraction output. PDF: keeps watermark artifacts and arXiv identifiers. Other formats: no effect currently. Default: `false` (watermarks are stripped). |

### Methods

#### defaultOptions()

**Signature:**

```java
public static ContentFilterConfig defaultOptions()
```

---

#### ContributorRole

JATS contributor with role.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `name` | `String` | — | The name |
| `role` | `Optional<String>` | `null` | Role |

---

#### CoreProperties

Dublin Core metadata from docProps/core.xml

Contains standard metadata fields defined by the Dublin Core standard
and Office-specific extensions.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `title` | `Optional<String>` | `null` | Document title |
| `subject` | `Optional<String>` | `null` | Document subject/topic |
| `creator` | `Optional<String>` | `null` | Document creator/author |
| `keywords` | `Optional<String>` | `null` | Keywords or tags |
| `description` | `Optional<String>` | `null` | Document description/abstract |
| `lastModifiedBy` | `Optional<String>` | `null` | User who last modified the document |
| `revision` | `Optional<String>` | `null` | Revision number |
| `created` | `Optional<String>` | `null` | Creation timestamp (ISO 8601) |
| `modified` | `Optional<String>` | `null` | Last modification timestamp (ISO 8601) |
| `category` | `Optional<String>` | `null` | Document category |
| `contentStatus` | `Optional<String>` | `null` | Content status (Draft, Final, etc.) |
| `language` | `Optional<String>` | `null` | Document language |
| `identifier` | `Optional<String>` | `null` | Unique identifier |
| `version` | `Optional<String>` | `null` | Document version |
| `lastPrinted` | `Optional<String>` | `null` | Last print timestamp (ISO 8601) |

---

#### CsvMetadata

CSV/TSV file metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `rowCount` | `int` | — | Number of rows |
| `columnCount` | `int` | — | Number of columns |
| `delimiter` | `Optional<String>` | `null` | Delimiter |
| `hasHeader` | `boolean` | — | Whether the file has a header row |
| `columnTypes` | `Optional<List<String>>` | `Collections.emptyList()` | Column types |

---

#### DbfFieldInfo

dBASE field information.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `name` | `String` | — | The name |
| `fieldType` | `String` | — | Field type |

---

#### DbfMetadata

dBASE (DBF) file metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `recordCount` | `long` | — | Number of records |
| `fieldCount` | `long` | — | Number of fields |
| `fields` | `List<DbfFieldInfo>` | `Collections.emptyList()` | Fields |

---

#### DetectResponse

MIME type detection response.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mimeType` | `String` | — | Detected MIME type |
| `filename` | `Optional<String>` | `null` | Original filename (if provided) |

---

#### DetectionResult

Page-level detection result containing all detections and page metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pageWidth` | `int` | — | Page width |
| `pageHeight` | `int` | — | Page height |
| `detections` | `List<LayoutDetection>` | — | Detections |

---

#### DiffHunk

A single contiguous hunk in a unified diff.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `fromLine` | `long` | — | Starting line number in the old content (0-indexed). |
| `fromCount` | `long` | — | Number of lines from the old content in this hunk. |
| `toLine` | `long` | — | Starting line number in the new content (0-indexed). |
| `toCount` | `long` | — | Number of lines from the new content in this hunk. |
| `lines` | `List<DiffLine>` | — | Lines that make up this hunk. |

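The four numeric fields map naturally onto the `@@ -start,count +start,count @@` header of the unified diff format, which uses 1-based line numbers. The sketch below shows that conversion from the 0-indexed values. `HunkHeaderSketch` and `header` are hypothetical and not part of the library. For an empty range the unified format conventionally reports the line preceding the hunk; the sketch assumes a 0-indexed start with a count of zero marks the insertion position, which the table above does not confirm.

```java
final class HunkHeaderSketch {
    /** Converts a 0-indexed start to the 1-based start used in unified diff headers. */
    static long headerStart(long zeroBasedLine, long count) {
        // Assumption: for an empty range the start is the line before the hunk,
        // which equals the 0-indexed insertion position.
        return count == 0 ? zeroBasedLine : zeroBasedLine + 1;
    }

    /** Builds a unified-diff hunk header from 0-indexed starts and line counts. */
    static String header(long fromLine, long fromCount, long toLine, long toCount) {
        return "@@ -" + headerStart(fromLine, fromCount) + "," + fromCount
                + " +" + headerStart(toLine, toCount) + "," + toCount + " @@";
    }

    public static void main(String[] args) {
        System.out.println(header(4, 3, 4, 5)); // prints "@@ -5,3 +5,5 @@"
    }
}
```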
---
#### DiffOptions

Options controlling how two `ExtractionResult` values are compared.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `includeMetadata` | `boolean` | `true` | Include metadata changes in the diff. Default: `true`. |
| `includeEmbedded` | `boolean` | `true` | Include embedded-children changes in the diff. Default: `true`. |
| `maxContentChars` | `Optional<Long>` | `null` | Truncate content to this many characters before diffing. Useful for very large documents where only the first N characters matter. `null` means no truncation. |

### Methods

#### defaultOptions()

**Signature:**

```java
public static DiffOptions defaultOptions()
```

---

#### DjotContent

Comprehensive Djot document structure with semantic preservation.

This type captures the full richness of Djot markup, including:

- Block-level structures (headings, lists, blockquotes, code blocks, etc.)
- Inline formatting (emphasis, strong, highlight, subscript, superscript, etc.)
- Attributes (classes, IDs, key-value pairs)
- Links, images, footnotes
- Math expressions (inline and display)
- Tables with full structure

Available when the `djot` feature is enabled.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `plainText` | `String` | — | Plain text representation for backwards compatibility |
| `blocks` | `List<FormattedBlock>` | — | Structured block-level content |
| `metadata` | `Metadata` | — | Metadata from YAML frontmatter |
| `tables` | `List<Table>` | — | Extracted tables as structured data |
| `images` | `List<DjotImage>` | — | Extracted images with metadata |
| `links` | `List<DjotLink>` | — | Extracted links with URLs |
| `footnotes` | `List<Footnote>` | — | Footnote definitions |
| `attributes` | `List<String>` | `/* serde(default) */` | Attributes mapped by element identifier (if present) |

---

#### DjotImage

Image element in Djot.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `src` | `String` | — | Image source URL or path |
| `alt` | `String` | — | Alternative text |
| `title` | `Optional<String>` | `null` | Optional title |
| `attributes` | `Optional<String>` | `null` | Element attributes |

---

#### DjotLink

Link element in Djot.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `url` | `String` | — | Link URL |
| `text` | `String` | — | Link text content |
| `title` | `Optional<String>` | `null` | Optional title |
| `attributes` | `Optional<String>` | `null` | Element attributes |

---

#### DocumentExtractor

Trait for document extractor plugins.

Implement this trait to add support for new document formats or to override
built-in extraction behavior with custom logic.

### Return Type

Extractors return `InternalDocument`, a flat intermediate representation.
The pipeline converts this into the public `ExtractionResult` via the
derivation step.

### Priority System

When multiple extractors support the same MIME type, the registry selects
the extractor with the highest priority value. Use this to:

- Override built-in extractors (priority > 50)
- Provide fallback extractors (priority < 50)
- Implement specialized extractors for specific use cases

Default priority is 50.

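The selection rule described above can be pictured with a small standalone sketch. `Candidate` and `select` are hypothetical stand-ins, not the library's registry API; the sketch only models "among the extractors that list the MIME type, pick the one with the highest priority". How the real registry breaks ties is not stated here, so the sketch makes no claim about it.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class PrioritySketch {
    /** Hypothetical stand-in for a registered extractor. */
    record Candidate(String name, List<String> mimeTypes, int priority) {}

    /** Picks the highest-priority candidate that lists the given MIME type. */
    static Optional<Candidate> select(List<Candidate> registry, String mimeType) {
        return registry.stream()
                .filter(c -> c.mimeTypes().contains(mimeType))
                .max(Comparator.comparingInt(Candidate::priority));
    }

    public static void main(String[] args) {
        List<Candidate> registry = List.of(
                new Candidate("built-in-pdf", List.of("application/pdf"), 50),
                new Candidate("custom-pdf", List.of("application/pdf"), 60),
                new Candidate("fallback-pdf", List.of("application/pdf"), 10));
        // The custom extractor wins because 60 > 50 > 10.
        System.out.println(select(registry, "application/pdf").map(Candidate::name).orElse("none"));
    }
}
```

Registering a custom extractor at priority 60, as in the usage above, is the "override built-in extractors" case from the list; the priority 10 entry is only reached if nothing higher supports the type.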
### Thread Safety

Extractors must be thread-safe (`Send + Sync`) to support concurrent extraction.

### Methods

#### extractBytes()

Extract content from a byte array.

This is the core extraction method that processes in-memory document data.

**Returns:**

An `InternalDocument` containing the extracted elements, metadata, and tables.
The pipeline will convert this into the public `ExtractionResult`.

**Errors:**

- `KreuzbergError.Parsing` - Document parsing failed
- `KreuzbergError.Validation` - Invalid document structure
- `KreuzbergError.Io` - I/O errors (these always bubble up)
- `KreuzbergError.MissingDependency` - Required dependency not available

**Signature:**

```java
public InternalDocument extractBytes(byte[] content, String mimeType, ExtractionConfig config) throws Error
```

#### extractFile()

Extract content from a file.

Default implementation reads the file and calls `extractBytes()`.
Override for custom file handling, streaming, or memory optimizations.

**Returns:**

An `InternalDocument` containing the extracted elements, metadata, and tables.

**Errors:**

Same as `extractBytes()`, plus file I/O errors.

**Signature:**

```java
public InternalDocument extractFile(String path, String mimeType, ExtractionConfig config) throws Error
```

#### supportedMimeTypes()

Get the list of MIME types supported by this extractor.

Can include exact MIME types and prefix patterns:

- Exact: `"application/pdf"`, `"text/plain"`
- Prefix: `"image/*"` (matches any image type)

**Returns:**

A list of MIME type strings.

**Signature:**

```java
public List<String> supportedMimeTypes()
```

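The two pattern forms returned by `supportedMimeTypes()` can be matched with a few lines of string handling. The sketch below is an illustration only: `MimePatternSketch.matches` is a hypothetical helper, not the registry's actual matching code, and it compares case-sensitively for simplicity.

```java
final class MimePatternSketch {
    /**
     * Returns true when mimeType matches the pattern. A pattern ending in "/*"
     * is treated as a prefix pattern; anything else must match exactly.
     */
    static boolean matches(String pattern, String mimeType) {
        if (pattern.endsWith("/*")) {
            // Drop only the '*' so the prefix keeps its trailing slash ("image/").
            String prefix = pattern.substring(0, pattern.length() - 1);
            return mimeType.startsWith(prefix);
        }
        return pattern.equals(mimeType);
    }

    public static void main(String[] args) {
        System.out.println(matches("image/*", "image/png"));                 // prints "true"
        System.out.println(matches("image/*", "text/plain"));                // prints "false"
        System.out.println(matches("application/pdf", "application/pdf"));   // prints "true"
    }
}
```

Keeping the trailing slash in the prefix avoids a pattern such as `"image/*"` accidentally matching a hypothetical type like `"imagery/x"`.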
#### priority()

Get the priority of this extractor.

Higher priority extractors are preferred when multiple extractors
support the same MIME type.

### Priority Guidelines

- **0-25**: Fallback/low-quality extractors
- **26-49**: Alternative extractors
- **50**: Default priority (built-in extractors)
- **51-75**: Premium/enhanced extractors
- **76-100**: Specialized/high-priority extractors

**Returns:**

Priority value (default: 50)

**Signature:**

```java
public int priority()
```

#### canHandle()

Optional: Check if this extractor can handle a specific file.

Allows for more sophisticated detection beyond MIME types.
Defaults to `true` (rely on MIME type matching).

**Returns:**

`true` if the extractor can handle this file, `false` otherwise.

**Signature:**

```java
public boolean canHandle(String path, String mimeType)
```

---

#### DocumentNode

A single node in the document tree.

Each node has deterministic `id`, typed `content`, optional `parent`/`children`
for tree structure, and metadata like page number, bounding box, and content layer.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `id` | `String` | — | Deterministic identifier (hash of content + position). |
| `content` | `NodeContent` | — | Node content: tagged enum, type-specific data only. |
| `parent` | `Optional<Integer>` | `null` | Parent node index (`null` = root-level node). |
| `children` | `List<Integer>` | `/* serde(default) */` | Child node indices in reading order. |
| `contentLayer` | `ContentLayer` | `/* serde(default) */` | Content layer classification. |
| `page` | `Optional<Integer>` | `null` | Page number where this node starts (1-indexed). |
| `pageEnd` | `Optional<Integer>` | `null` | Page number where this node ends (for multi-page tables/sections). |
| `bbox` | `Optional<BoundingBox>` | `null` | Bounding box in document coordinates. |
| `annotations` | `List<TextAnnotation>` | `/* serde(default) */` | Inline annotations (formatting, links) on this node's text content. Only meaningful for text-carrying nodes; empty for containers. |
| `attributes` | `Optional<Map<String, String>>` | `null` | Format-specific key-value attributes. Extensible bag for miscellaneous data without a dedicated typed field: CSS classes, LaTeX environment names, Excel cell formulas, slide layout names, etc. |

---

#### DocumentRelationship

A resolved relationship between two nodes in the document tree.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `source` | `int` | — | Source node index (the referencing node). |
| `target` | `int` | — | Target node index (the referenced node). |
| `kind` | `RelationshipKind` | — | Semantic kind of the relationship. |

---

#### DocumentRevision

A single tracked change embedded in a document.

Populated by per-format extractors that understand change-tracking metadata
(DOCX `w:ins`/`w:del`/`w:rPrChange`, ODT `text:change-*`, …). Every
extractor defaults to `ExtractionResult.revisions = None` until a
format-specific implementation is added.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `revisionId` | `String` | — | Format-specific revision identifier. For DOCX this is the `w:id` attribute value on the change element (e.g. `"42"`). When the attribute is absent a synthetic fallback is generated (`"docx-ins-0"`, `"docx-del-3"`, …). |
| `author` | `Optional<String>` | `null` | Display name of the author who made this change, when available. |
| `timestamp` | `Optional<String>` | `null` | ISO-8601 timestamp of the change, when available. Stored as a plain string so this type remains FFI-friendly and unconditionally available without the `chrono` optional dep. DOCX populates this from the `w:date` attribute (e.g. `"2024-03-15T10:30:00Z"`). |
| `kind` | `RevisionKind` | — | Semantic kind of this revision. |
| `anchor` | `Optional<RevisionAnchor>` | `null` | Best-effort document location for this revision. Resolution is format-dependent and may be `null` when the location cannot be determined (e.g. changes inside table cells before table-cell anchor support is added). |
| `delta` | `RevisionDelta` | — | The content changes that make up this revision. |

---

#### DocumentStructure

Top-level structured document representation.

A flat array of nodes with index-based parent/child references forming a tree.
Root-level nodes have a `null` `parent`. Use `body_roots()` and `furniture_roots()`
to iterate over top-level content by layer.

### Validation

Call `validate()` after construction to verify all node indices are in bounds
and parent-child relationships are bidirectionally consistent.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `nodes` | `List<DocumentNode>` | `Collections.emptyList()` | All nodes in document/reading order. |
| `sourceFormat` | `Optional<String>` | `null` | Origin format identifier (e.g. "docx", "pptx", "html", "pdf"). Allows renderers to apply format-aware heuristics when converting the document tree to output formats. |
| `relationships` | `List<DocumentRelationship>` | `Collections.emptyList()` | Resolved relationships between nodes (footnote refs, citations, anchor links, etc.). Populated during derivation from the internal document representation. Empty when no relationships are detected. |
| `nodeTypes` | `List<String>` | `Collections.emptyList()` | Sorted, deduplicated list of node type names present in this document. Each value is the snake_case `node_type` tag of the corresponding `NodeContent` variant (e.g. `"paragraph"`, `"heading"`, `"table"`, …). Computed from `nodes` via `DocumentStructure.finalizeNodeTypes()`. Empty until that method is called (internal construction paths call it at the end of derivation). |

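To show what "in bounds and bidirectionally consistent" means for a flat, index-based tree, here is a minimal standalone check. It is a sketch of the idea only, not the library's `validate()` implementation: `Node` is a hypothetical stand-in that keeps just the `parent` and `children` indices, with `null` marking a root-level node.

```java
import java.util.List;

final class TreeValidationSketch {
    /** Hypothetical stand-in: parent is null for roots, children are indices into the node list. */
    record Node(Integer parent, List<Integer> children) {}

    /** True when every index is in bounds and parent/child links agree in both directions. */
    static boolean isConsistent(List<Node> nodes) {
        for (int i = 0; i < nodes.size(); i++) {
            Node node = nodes.get(i);
            Integer parent = node.parent();
            if (parent != null) {
                if (parent < 0 || parent >= nodes.size()) return false;
                // The parent must list this node among its children.
                if (!nodes.get(parent).children().contains(i)) return false;
            }
            for (int child : node.children()) {
                if (child < 0 || child >= nodes.size()) return false;
                // Each child must point back at this node.
                Integer back = nodes.get(child).parent();
                if (back == null || back != i) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Node> tree = List.of(
                new Node(null, List.of(1, 2)), // root with two children
                new Node(0, List.of()),
                new Node(0, List.of()));
        System.out.println(isConsistent(tree)); // prints "true"
    }
}
```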
### Methods

#### finalizeNodeTypes()

Compute and populate the `nodeTypes` field from the current `nodes`.

Call this after all nodes have been added to the structure. Internal
construction paths (builder, derivation) call this automatically.

**Signature:**

```java
public void finalizeNodeTypes()
```

#### isEmpty()

Check if the document structure is empty.

**Signature:**

```java
public boolean isEmpty()
```

#### defaultOptions()

**Signature:**

```java
public static DocumentStructure defaultOptions()
```

---

#### DocxAppProperties

Application properties from docProps/app.xml for DOCX

Contains Word-specific document statistics and metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `application` | `Optional<String>` | `null` | Application name (e.g., "Microsoft Office Word") |
| `appVersion` | `Optional<String>` | `null` | Application version |
| `template` | `Optional<String>` | `null` | Template filename |
| `totalTime` | `Optional<Integer>` | `null` | Total editing time in minutes |
| `pages` | `Optional<Integer>` | `null` | Number of pages |
| `words` | `Optional<Integer>` | `null` | Number of words |
| `characters` | `Optional<Integer>` | `null` | Number of characters (excluding spaces) |
| `charactersWithSpaces` | `Optional<Integer>` | `null` | Number of characters (including spaces) |
| `lines` | `Optional<Integer>` | `null` | Number of lines |
| `paragraphs` | `Optional<Integer>` | `null` | Number of paragraphs |
| `company` | `Optional<String>` | `null` | Company name |
| `docSecurity` | `Optional<Integer>` | `null` | Document security level |
| `scaleCrop` | `Optional<Boolean>` | `null` | Scale crop flag |
| `linksUpToDate` | `Optional<Boolean>` | `null` | Links up to date flag |
| `sharedDoc` | `Optional<Boolean>` | `null` | Shared document flag |
| `hyperlinksChanged` | `Optional<Boolean>` | `null` | Hyperlinks changed flag |

---

#### DocxMetadata

Word document metadata.

Extracted from DOCX files using shared Office Open XML metadata extraction.
Integrates with `office_metadata` module for core/app/custom properties.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `coreProperties` | `Optional<CoreProperties>` | `null` | Core properties from docProps/core.xml (Dublin Core metadata). Contains title, creator, subject, keywords, dates, etc. Shared format across DOCX/PPTX/XLSX documents. |
| `appProperties` | `Optional<DocxAppProperties>` | `null` | Application properties from docProps/app.xml (Word-specific statistics). Contains word count, page count, paragraph count, editing time, etc. DOCX-specific variant of Office application properties. |
| `customProperties` | `Optional<Map<String, Object>>` | `Collections.emptyMap()` | Custom properties from docProps/custom.xml (user-defined properties). Contains key-value pairs defined by users or applications. Values can be strings, numbers, booleans, or dates. |

---

#### Element

Semantic element extracted from document.

Represents a logical unit of content with semantic classification,
unique identifier, and metadata for tracking origin and position.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `elementId` | `String` | — | Unique element identifier |
| `elementType` | `ElementType` | — | Semantic type of this element |
| `text` | `String` | — | Text content of the element |
| `metadata` | `ElementMetadata` | — | Metadata about the element |

---

#### ElementMetadata

Metadata for a semantic element.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pageNumber` | `Optional<Integer>` | `null` | Page number (1-indexed) |
| `filename` | `Optional<String>` | `null` | Source filename or document name |
| `coordinates` | `Optional<BoundingBox>` | `null` | Bounding box coordinates if available |
| `elementIndex` | `Optional<Long>` | `null` | Position index in the element sequence |
| `additional` | `Map<String, String>` | — | Additional custom metadata |

---

#### EmailAttachment

Email attachment representation.

Contains metadata and optionally the content of an email attachment.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `name` | `Optional<String>` | `null` | Attachment name (from Content-Disposition header) |
| `filename` | `Optional<String>` | `null` | Filename of the attachment |
| `mimeType` | `Optional<String>` | `null` | MIME type of the attachment |
| `size` | `Optional<Long>` | `null` | Size in bytes |
| `isImage` | `boolean` | — | Whether this attachment is an image |
| `data` | `Optional<byte[]>` | `null` | Attachment data (if extracted). Backed by `bytes::Bytes` in the Rust core for cheap cloning of large buffers. |

---

#### EmailConfig

Configuration for email extraction.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `msgFallbackCodepage` | `Optional<Integer>` | `null` | Windows codepage number to use when an MSG file contains no codepage property. Defaults to `null`, which falls back to windows-1252. If an unrecognized or invalid codepage number is supplied (including 0), extraction silently falls back to windows-1252, the same as when the MSG file itself contains an unrecognized codepage. No error or warning is emitted, so verify the output when supplying unusual values. Common values: 1250 (Central European: Polish, Czech, Hungarian, etc.); 1251 (Cyrillic: Russian, Ukrainian, Bulgarian, etc.); 1252 (Western European, default); 1253 (Greek); 1254 (Turkish); 1255 (Hebrew); 1256 (Arabic); 932 (Japanese, Shift-JIS); 936 (Simplified Chinese, GBK). |
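
The fallback rule described for `msgFallbackCodepage` can be sketched in plain Java. This is only an illustration: `MsgCodepageFallback` and `effectiveCodepage` are hypothetical names, and the set of "recognized" codepages below contains just the common values listed in the table, whereas the library's real set is presumably larger.

```java
import java.util.Set;

/** Hypothetical helper mirroring the documented fallback to windows-1252. */
final class MsgCodepageFallback {
    static final int DEFAULT_CODEPAGE = 1252;

    // Assumption: only the "common values" from the table count as recognized here.
    private static final Set<Integer> KNOWN =
            Set.of(1250, 1251, 1252, 1253, 1254, 1255, 1256, 932, 936);

    private MsgCodepageFallback() {}

    /** Returns the codepage that would be used for an MSG file with no codepage property. */
    static int effectiveCodepage(Integer configured) {
        if (configured == null || !KNOWN.contains(configured)) {
            // null, 0, and unknown numbers all fall back silently, as documented.
            return DEFAULT_CODEPAGE;
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(effectiveCodepage(1251));
        System.out.println(effectiveCodepage(0));
        System.out.println(effectiveCodepage(null));
    }
}
```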

---

#### EmailExtractionResult

Email extraction result.

Complete representation of an extracted email message (.eml or .msg) including headers, body content, and attachments.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `subject` | `Optional<String>` | `null` | Email subject line |
| `fromEmail` | `Optional<String>` | `null` | Sender email address |
| `toEmails` | `List<String>` | — | Primary recipient email addresses |
| `ccEmails` | `List<String>` | — | CC recipient email addresses |
| `bccEmails` | `List<String>` | — | BCC recipient email addresses |
| `date` | `Optional<String>` | `null` | Email date/timestamp |
| `messageId` | `Optional<String>` | `null` | Message-ID header value |
| `plainText` | `Optional<String>` | `null` | Plain text version of the email body |
| `htmlContent` | `Optional<String>` | `null` | HTML version of the email body |
| `content` | `String` | — | Cleaned/processed text content. Aliased as `cleaned_text` for backward compatibility. |
| `attachments` | `List<EmailAttachment>` | — | List of email attachments |
| `metadata` | `Map<String, String>` | — | Additional email headers and metadata |

---

#### EmailMetadata

Email metadata extracted from .eml and .msg files.

Includes sender/recipient information, message ID, and attachment list.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `fromEmail` | `Optional<String>` | `null` | Sender's email address |
| `fromName` | `Optional<String>` | `null` | Sender's display name |
| `toEmails` | `List<String>` | `Collections.emptyList()` | Primary recipients |
| `ccEmails` | `List<String>` | `Collections.emptyList()` | CC recipients |
| `bccEmails` | `List<String>` | `Collections.emptyList()` | BCC recipients |
| `messageId` | `Optional<String>` | `null` | Message-ID header value |
| `attachments` | `List<String>` | `Collections.emptyList()` | List of attachment filenames |

---

#### EmbeddedChanges

Changes to embedded archive children between two results.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `added` | `List<ArchiveEntry>` | — | Children present in `b` but not in `a` (matched by `path`). |
| `removed` | `List<ArchiveEntry>` | — | Children present in `a` but not in `b` (matched by `path`). |
| `changed` | `List<EmbeddedDiff>` | — | Children present in both but with differing content (matched by `path`). Each entry holds the diff of the nested `ExtractionResult`. |

---

#### EmbeddedDiff

Diff for a single embedded archive entry that appears in both results.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `path` | `String` | — | Archive-relative path identifying this entry. |
| `diff` | `ExtractionDiff` | — | The recursive diff of the entry's extraction result. |

---

#### EmbeddedFile

Embedded file descriptor extracted from the PDF name tree.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `name` | `String` | — | The filename as stored in the PDF name tree. |
| `data` | `byte[]` | — | Raw file bytes from the embedded stream (already decompressed by lopdf). |
| `compressedSize` | `long` | — | Compressed byte count of the original stream (before decompression). Used by callers to compute the decompression ratio and detect zip-bomb-style attacks that embed a tiny compressed stream expanding to gigabytes of data. |
| `mimeType` | `Optional<String>` | `null` | MIME type if specified in the filespec, otherwise `null`. |
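
The decompression-ratio check that `compressedSize` enables could look roughly like the following. This is a sketch under stated assumptions: `DecompressionRatioCheck` is a hypothetical helper, and the threshold of 100 is an arbitrary illustrative value, not a limit documented for the library.

```java
/** Hypothetical caller-side guard built from compressedSize and data.length. */
final class DecompressionRatioCheck {
    // Assumption: an illustrative threshold; choose one that suits your workload.
    static final double MAX_RATIO = 100.0;

    private DecompressionRatioCheck() {}

    /** Uncompressed bytes divided by compressed bytes. */
    static double ratio(long compressedSize, long uncompressedSize) {
        if (compressedSize <= 0) {
            // Avoid division by zero: any payload from an empty stream is suspicious.
            return uncompressedSize > 0 ? Double.POSITIVE_INFINITY : 0.0;
        }
        return (double) uncompressedSize / (double) compressedSize;
    }

    /** True when the embedded file expanded by more than MAX_RATIO. */
    static boolean looksLikeBomb(long compressedSize, byte[] data) {
        return ratio(compressedSize, data.length) > MAX_RATIO;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1_000];
        System.out.println(looksLikeBomb(500, data));
        System.out.println(looksLikeBomb(1, data));
    }
}
```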

---

#### EmbeddingBackend

Trait for in-process embedding backend plugins.

Async to match the convention used by `OcrBackend`, `DocumentExtractor`, and `PostProcessor`. Host-language bridges (PyO3, napi-rs, Rustler, extendr, magnus, ext-php-rs, C FFI, etc.) wrap their synchronous host callables in `spawn_blocking` or the equivalent to satisfy the async signature.

### Thread safety

Backends must be `Send + Sync + 'static`. They are stored in `Arc<dyn EmbeddingBackend>` and called concurrently from kreuzberg's chunking pipeline. If the backend's underlying model isn't thread-safe, the backend itself must serialize access internally (e.g. via `Mutex<Inner>`).

### Contract

- `embed(texts)` MUST return exactly `texts.len()` vectors, each of length `self.dimensions()`. The dispatcher in `embed_texts` validates this before returning to downstream consumers; a non-conforming backend surfaces as a `KreuzbergError.Validation`, not a panic.
- `embed` may be called from any thread. Its future must be `Send` (enforced by `async_trait` when `#[async_trait]` is used on non-WASM targets).
- `dimensions()` is called exactly once at registration, immediately after `initialize()` succeeds. The returned value is cached by the registry and used for all subsequent shape validation. Lazy-loading implementations can defer model loading into `initialize()` and report the real dimension afterwards. Later mutations of the backend's reported dimension are not observed by kreuzberg; implementations that need to change dimension must unregister and re-register.
- `shutdown()` (inherited from `Plugin`) may be invoked concurrently with an in-flight `embed()` call. Implementations must tolerate this, e.g. by letting in-flight calls finish using resources held via the `Arc<dyn EmbeddingBackend>` reference, and only releasing shared state that isn't needed by `embed`.

### Runtime

The synchronous `embed_texts` entry uses `tokio::task::block_in_place` to await the trait's async `embed`, which requires a multi-thread tokio runtime. Callers running inside a `current_thread` runtime (e.g. `#[tokio::test]` without `flavor = "multi_thread"`, or `tokio::runtime::Builder::new_current_thread()`) must use `embed_texts_async` instead, which awaits directly without `block_in_place`.

### Methods

#### dimensions()

Embedding vector dimension. Must be `> 0` and must match the length of every vector returned by `embed`.

**Signature:**

```java
public long dimensions()
```

#### embed()

Embed a batch of texts, returning one vector per input in order.

**Errors:**

Implementations should return `Plugin` for backend-specific failures. The dispatcher layers its own validation (length, per-vector dimension) on top.

**Signature:**

```java
public List<List<Float>> embed(List<String> texts) throws Error
```
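
The shape contract for `embed` (one vector per input, each of length `dimensions()`) can be checked on the caller side before handing results on. The sketch below is a stand-alone illustration: `EmbeddingShapeCheck` is a hypothetical helper, not the library's dispatcher, and it throws standard Java exceptions rather than `KreuzbergError.Validation`.

```java
import java.util.List;

/** Hypothetical stand-in for the length and per-vector dimension checks. */
final class EmbeddingShapeCheck {
    private EmbeddingShapeCheck() {}

    static void validate(List<String> texts, List<List<Float>> vectors, long dimensions) {
        if (dimensions <= 0) {
            throw new IllegalArgumentException("dimensions must be > 0, got " + dimensions);
        }
        // Contract: exactly one vector per input text.
        if (vectors.size() != texts.size()) {
            throw new IllegalStateException(
                    "expected " + texts.size() + " vectors, got " + vectors.size());
        }
        // Contract: every vector has the registered dimension.
        for (int i = 0; i < vectors.size(); i++) {
            if (vectors.get(i).size() != dimensions) {
                throw new IllegalStateException(
                        "vector " + i + " has length " + vectors.get(i).size()
                                + ", expected " + dimensions);
            }
        }
    }

    public static void main(String[] args) {
        validate(List.of("a", "b"), List.of(List.of(0.1f, 0.2f), List.of(0.3f, 0.4f)), 2);
        System.out.println("shape ok");
    }
}
```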

---

#### EmbeddingConfig

Embedding configuration for text chunks.

Configures embedding generation using ONNX models via the vendored embedding engine. Requires the `embeddings` feature to be enabled.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `model` | `EmbeddingModelType` | `EmbeddingModelType.PRESET` | The embedding model to use (defaults to the "balanced" preset if not specified) |
| `normalize` | `boolean` | `true` | Whether to normalize embedding vectors (recommended for cosine similarity) |
| `batchSize` | `long` | `32` | Batch size for embedding generation |
| `showDownloadProgress` | `boolean` | `false` | Show model download progress |
| `cacheDir` | `Optional<String>` | `null` | Custom cache directory for model files. Defaults to `~/.cache/kreuzberg/embeddings/` if not specified. Allows full customization of model download location. |
| `acceleration` | `Optional<AccelerationConfig>` | `null` | Hardware acceleration for the embedding ONNX model. When set, controls which execution provider (CPU, CUDA, CoreML, TensorRT) is used for inference. Defaults to `null` (auto-select per platform). |
| `maxEmbedDurationSecs` | `Optional<Long>` | `null` | Maximum wall-clock duration (in seconds) for a single `embed()` call when using `EmbeddingModelType.Plugin`. Applies only to the in-process plugin path, where it protects against hung host-language backends (e.g. a Python callback deadlocked on the GIL, a model stuck on CUDA OOM retries, etc.). On timeout, the dispatcher returns `Plugin` instead of blocking forever. `null` disables the timeout. The default (60 seconds) is conservative for common in-process inference; increase for large batches on slow hardware. |

### Methods

#### defaultOptions()

**Signature:**

```java
public static EmbeddingConfig defaultOptions()
```
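
To illustrate why `normalize` is recommended for cosine similarity: once vectors have unit length, their dot product equals their cosine similarity. The sketch below assumes L2 normalization, which this reference does not spell out, and `L2Normalize` is a hypothetical helper rather than part of the binding.

```java
/** Hypothetical helper: L2-normalizes a vector so dot products act as cosine similarity. */
final class L2Normalize {
    private L2Normalize() {}

    static float[] normalize(float[] v) {
        double sumSquares = 0.0;
        for (float x : v) {
            sumSquares += (double) x * x;
        }
        double norm = Math.sqrt(sumSquares);
        float[] out = new float[v.length];
        if (norm == 0.0) {
            return out; // a zero vector has no direction; leave it as zeros
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = normalize(new float[] {3f, 4f});
        float[] b = normalize(new float[] {4f, 3f});
        // For unit vectors this dot product is the cosine similarity.
        System.out.println(dot(a, b));
    }
}
```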

---

#### EmbeddingPreset

Preset configurations for common RAG use cases.

Each preset combines chunk size, overlap, and embedding model to provide an optimized configuration for specific scenarios.

All string fields are owned `String` for FFI compatibility, so instances are safe to clone and pass across language boundaries.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `name` | `String` | — | Preset name |
| `chunkSize` | `long` | — | Chunk size |
| `overlap` | `long` | — | Overlap between consecutive chunks |
| `modelRepo` | `String` | — | HuggingFace repository name for the model. |
| `pooling` | `String` | — | Pooling strategy: "cls" or "mean". |
| `modelFile` | `String` | — | Path to the ONNX model file within the repo. |
| `dimensions` | `long` | — | Embedding vector dimensions |
| `description` | `String` | — | Human-readable description |

---

#### EpubMetadata

EPUB metadata (Dublin Core extensions).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `coverage` | `Optional<String>` | `null` | Dublin Core coverage |
| `dcFormat` | `Optional<String>` | `null` | Dublin Core format |
| `relation` | `Optional<String>` | `null` | Dublin Core relation |
| `source` | `Optional<String>` | `null` | Dublin Core source |
| `dcType` | `Optional<String>` | `null` | Dublin Core type |
| `coverImage` | `Optional<String>` | `null` | Cover image |

---

#### ErrorMetadata

Error metadata (for batch operations).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `errorType` | `String` | — | Error type |
| `message` | `String` | — | Error message |

---

#### ExcelMetadata

Excel/spreadsheet format metadata.

Identifies the document as a spreadsheet source via the `FormatMetadata.Excel` discriminant. Sheet count and sheet names are stored inside this struct.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `sheetCount` | `Optional<Integer>` | `null` | Number of sheets in the workbook. |
| `sheetNames` | `Optional<List<String>>` | `Collections.emptyList()` | Names of all sheets in the workbook. |

---

#### ExcelSheet

Single Excel worksheet.

Represents one sheet from an Excel workbook with its content converted to Markdown format and dimensional statistics.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `name` | `String` | — | Sheet name as it appears in Excel |
| `markdown` | `String` | — | Sheet content converted to Markdown tables |
| `rowCount` | `long` | — | Number of rows |
| `colCount` | `long` | — | Number of columns |
| `cellCount` | `long` | — | Total number of non-empty cells |
| `tableCells` | `Optional<List<List<String>>>` | `null` | Pre-extracted table cells (2D list of cell values). Populated during Markdown generation to avoid re-parsing Markdown. `null` for empty sheets. |
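
As a rough illustration of how the 2D `tableCells` structure relates to the dimensional statistics, the sketch below derives row, column, and non-empty cell counts from a list of rows. `SheetStats` is a hypothetical helper, and treating the column count as the widest row is an assumption; the library may compute `colCount` differently.

```java
import java.util.List;

/** Hypothetical helper deriving sheet statistics from a 2D list of cell values. */
final class SheetStats {
    private SheetStats() {}

    static long rowCount(List<List<String>> cells) {
        return cells.size();
    }

    // Assumption: rows may be ragged, so the widest row defines the column count.
    static long colCount(List<List<String>> cells) {
        long max = 0;
        for (List<String> row : cells) {
            max = Math.max(max, row.size());
        }
        return max;
    }

    static long nonEmptyCells(List<List<String>> cells) {
        long count = 0;
        for (List<String> row : cells) {
            for (String cell : row) {
                if (cell != null && !cell.isEmpty()) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> cells = List.of(List.of("Name", "Qty"), List.of("Widget", ""));
        System.out.println(rowCount(cells) + " x " + colCount(cells)
                + ", non-empty: " + nonEmptyCells(cells));
    }
}
```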

---

#### ExcelWorkbook

Excel workbook representation.

Contains all sheets from an Excel file (.xlsx, .xls, etc.) with extracted content and metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `sheets` | `List<ExcelSheet>` | — | All sheets in the workbook |
| `metadata` | `Map<String, String>` | — | Workbook-level metadata (author, creation date, etc.) |
| `revisions` | `Optional<List<DocumentRevision>>` | `null` | Collaborative-edit revision headers from `xl/revisions/revisionHeaders.xml`. Populated for legacy shared-workbook `.xlsx` files that contain the `xl/revisions/` directory. Each `<header>` element maps to one `DocumentRevision { kind: FormatChange }` carrying the header's `guid` (→ `revision_id`), `userName` (→ `author`), and `dateTime` (→ `timestamp`). `anchor` and `delta` are `null`/empty for v1 (per-cell log parsing is a follow-up). `null` when `xl/revisions/revisionHeaders.xml` is absent. |

---

#### ExtractedImage

Extracted image from a document.

Contains raw image data, metadata, and optional nested OCR results. Raw bytes allow cross-language compatibility: users can convert to PIL.Image (Python), Sharp (Node.js), or other formats as needed.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `data` | `byte[]` | — | Raw image data (PNG, JPEG, WebP, etc. bytes). Backed by `bytes::Bytes` in the Rust core for cheap cloning of large buffers. |
| `format` | `String` | — | Image format (e.g., "jpeg", "png", "webp"). The Rust core uses `Cow<'static, str>` to avoid allocation for static literals. |
| `imageIndex` | `int` | — | Zero-indexed position of this image in the document/page |
| `pageNumber` | `Optional<Integer>` | `null` | Page/slide number where image was found (1-indexed) |
| `width` | `Optional<Integer>` | `null` | Image width in pixels |
| `height` | `Optional<Integer>` | `null` | Image height in pixels |
| `colorspace` | `Optional<String>` | `null` | Colorspace information (e.g., "RGB", "CMYK", "Gray") |
| `bitsPerComponent` | `Optional<Integer>` | `null` | Bits per color component (e.g., 8, 16) |
| `isMask` | `boolean` | `false` | Whether this image is a mask image |
| `description` | `Optional<String>` | `null` | Optional description of the image |
| `ocrResult` | `Optional<ExtractionResult>` | `null` | Nested OCR extraction result (if image was OCRed). When OCR is performed on this image, the result is embedded here rather than in a separate collection, making the relationship explicit. |
| `boundingBox` | `Optional<BoundingBox>` | `null` | Bounding box of the image on the page (PDF coordinates: x0=left, y0=bottom, x1=right, y1=top). Only populated for PDF-extracted images when position data is available from the PDF extractor. |
| `sourcePath` | `Optional<String>` | `null` | Original source path of the image within the document archive (e.g., "media/image1.png" in DOCX). Used for rendering image references when the binary data is not extracted. |
| `imageKind` | `Optional<ImageKind>` | `null` | Heuristic classification of what this image likely depicts. `null` if classification was disabled or inconclusive. |
| `kindConfidence` | `Optional<Float>` | `null` | Confidence score for `imageKind`, in the range 0.0 to 1.0. |
| `clusterId` | `Optional<Integer>` | `null` | Identifier shared across images that form a single logical figure (e.g. all raster tiles of one technical drawing). `null` for singletons. |
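
Because `boundingBox` uses PDF coordinates with the origin at the bottom-left, drawing it on a rendered page image usually means flipping the vertical axis. The following sketch shows that conversion under stated assumptions: `PdfBoxToTopLeft` is a hypothetical helper, the page height must come from elsewhere, and both are expressed in the same units as the box.

```java
/** Hypothetical conversion from PDF (bottom-left origin) to top-left origin rectangles. */
final class PdfBoxToTopLeft {
    private PdfBoxToTopLeft() {}

    /**
     * Returns {left, top, width, height} measured from the top-left corner of the page.
     * x0/y0/x1/y1 follow the documented meaning: left, bottom, right, top.
     */
    static double[] toTopLeft(double x0, double y0, double x1, double y1, double pageHeight) {
        double left = x0;
        double top = pageHeight - y1; // distance from the top edge of the page
        double width = x1 - x0;
        double height = y1 - y0;
        return new double[] {left, top, width, height};
    }

    public static void main(String[] args) {
        double[] rect = toTopLeft(10, 20, 110, 70, 800);
        System.out.println(java.util.Arrays.toString(rect));
    }
}
```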

---

#### ExtractedUri

A URI extracted from a document.

Represents any link, reference, or resource pointer found during extraction. The `kind` field classifies the URI semantically, while `label` carries optional human-readable display text.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `url` | `String` | — | The URL or path string. |
| `label` | `Optional<String>` | `null` | Optional display text / label for the link. |
| `page` | `Optional<Integer>` | `null` | Optional page number where the URI was found (1-indexed). |
| `kind` | `UriKind` | — | Semantic classification of the URI. |

---

#### ExtractionConfig

Main extraction configuration.

This struct contains all configuration options for the extraction process. It can be loaded from TOML, YAML, or JSON files, or created programmatically.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `useCache` | `boolean` | `true` | Enable caching of extraction results |
| `enableQualityProcessing` | `boolean` | `true` | Enable quality post-processing |
| `ocr` | `Optional<OcrConfig>` | `null` | OCR configuration (`null` = OCR disabled) |
| `forceOcr` | `boolean` | `false` | Force OCR even for searchable PDFs |
| `forceOcrPages` | `Optional<List<Integer>>` | `null` | Force OCR on specific pages only (1-indexed page numbers, must be >= 1). When set, only the listed pages are OCR'd regardless of text layer quality. Unlisted pages use native text extraction. Ignored when `forceOcr` is `true`. Only applies to PDF documents. Duplicate page numbers are removed automatically. An `ocr` config is recommended for backend/language selection; defaults are used if absent. |
| `disableOcr` | `boolean` | `false` | Disable OCR entirely, even for images. When `true`, OCR is skipped for all document types. Images return metadata only (dimensions, format, EXIF) without text extraction. PDFs use only native text extraction without OCR fallback. Cannot be `true` simultaneously with `forceOcr`. *Added in v4.7.0.* |
| `chunking` | `Optional<ChunkingConfig>` | `null` | Text chunking configuration (`null` = chunking disabled) |
| `contentFilter` | `Optional<ContentFilterConfig>` | `null` | Content filtering configuration (`null` = use extractor defaults). Controls whether document "furniture" (headers, footers, watermarks, repeating text) is included in or stripped from extraction results. See `ContentFilterConfig` for per-field documentation. |
| `images` | `Optional<ImageExtractionConfig>` | `null` | Image extraction configuration (`null` = no image extraction) |
| `pdfOptions` | `Optional<PdfConfig>` | `null` | PDF-specific options (`null` = use defaults) |
| `tokenReduction` | `Optional<TokenReductionOptions>` | `null` | Token reduction configuration (`null` = no token reduction) |
| `languageDetection` | `Optional<LanguageDetectionConfig>` | `null` | Language detection configuration (`null` = no language detection) |
| `pages` | `Optional<PageConfig>` | `null` | Page extraction configuration (`null` = no page tracking) |
| `keywords` | `Optional<KeywordConfig>` | `null` | Keyword extraction configuration (`null` = no keyword extraction) |
| `postprocessor` | `Optional<PostProcessorConfig>` | `null` | Post-processor configuration (`null` = use defaults) |
| `htmlOptions` | `Optional<String>` | `null` | HTML to Markdown conversion options (`null` = use defaults). Configure how HTML documents are converted to Markdown, including heading styles, list formatting, code block styles, and preprocessing options. |
| `htmlOutput` | `Optional<HtmlOutputConfig>` | `null` | Styled HTML output configuration. When set alongside `outputFormat = OutputFormat.Html`, the extraction pipeline uses `StyledHtmlRenderer`, which emits stable `kb-*` CSS class hooks on every structural element and optionally embeds theme CSS or user-supplied CSS in a `<style>` block. When `null`, the existing plain comrak-based HTML renderer is used. |
| `extractionTimeoutSecs` | `Optional<Long>` | `null` | Default per-file timeout in seconds for batch extraction. When set, each file in a batch will be canceled after this duration unless overridden by `FileExtractionConfig.timeout_secs`. Defaults to 60 seconds to prevent pathological files (e.g. deeply nested archives, documents with millions of cells) from running indefinitely and exhausting caller resources. Set to `null` to disable the timeout for trusted input or long-running workloads. |
| `maxConcurrentExtractions` | `Optional<Long>` | `null` | Maximum concurrent extractions in batch operations. Limits parallelism to prevent resource exhaustion when processing large batches. Defaults to ceil(num_cpus × 1.5) when `null`. |
| `resultFormat` | `ResultFormat` | `ResultFormat.UNIFIED` | Result structure format. Controls whether results are returned in unified format (default) with all content in the `content` field, or element-based format with semantic elements (for Unstructured-compatible output). |
| `securityLimits` | `Optional<SecurityLimits>` | `null` | Security limits for archive extraction. Controls maximum archive size, compression ratio, file count, and other security thresholds to prevent decompression bomb attacks. Also caps nesting depth, iteration count, entity / token length, total content size, and table cell count for every extraction path that ingests user-controlled bytes. When `null`, default limits are used. |
| `maxEmbeddedFileBytes` | `Optional<Long>` | `null` | Maximum uncompressed size in bytes for a single embedded file before recursive extraction is attempted (default: 50 MiB). Applies to embedded objects inside OOXML containers (DOCX, PPTX) and to email attachments processed via recursive extraction. Files that exceed this limit are skipped with a `ProcessingWarning` rather than passed to the extraction pipeline, preventing a single oversized embedded object from consuming unbounded memory or time. Set to `null` to disable the per-embedded-file cap (falls back to `security_limits.max_archive_size` as the only guard). |
| `outputFormat` | `OutputFormat` | `OutputFormat.PLAIN` | Content text format (default: Plain). Controls the format of the extracted content: `Plain` (raw extracted text, default); `Markdown` (Markdown formatted output); `Djot` (Djot markup format, requires the djot feature); `Html` (HTML formatted output). When set to a structured format, extraction results will include formatted output. The `formatted_content` field may be populated when format conversion is applied. |
| `layout` | `Optional<LayoutDetectionConfig>` | `null` | Layout detection configuration (`null` = layout detection disabled). When set, PDF pages and images are analyzed for document structure (headings, code, formulas, tables, figures, etc.) using RT-DETR models via ONNX Runtime. For PDFs, layout hints override paragraph classification in the markdown pipeline. For images, per-region OCR is performed with markdown formatting based on detected layout classes. Requires the `layout-detection` feature to run inference; the field is present whenever the `layout-types` feature is active (which includes `layout-detection` as well as the no-ORT target groups). |
| `useLayoutForMarkdown` | `boolean` | `false` | Run layout detection on the non-OCR PDF markdown path. When `true` and `layout` is set, layout regions inform heading, table, list, and figure detection in the structure pipeline that would otherwise rely on font-clustering heuristics alone. Significantly improves SF1 (structural F1) at the cost of inference latency (~150-300ms/page CPU, ~20-50ms/page GPU). Default: `false`. Requires the `layout-detection` feature. |
| `includeDocumentStructure` | `boolean` | `false` | Enable structured document tree output. When true, populates the `document` field on `ExtractionResult` with a hierarchical `DocumentStructure` containing heading-driven section nesting, table grids, content layer classification, and inline annotations. Independent of `resultFormat`; can be combined with Unified or ElementBased. |
| `acceleration` | `Optional<AccelerationConfig>` | `null` | Hardware acceleration configuration for ONNX Runtime models. Controls execution provider selection for layout detection and embedding models. When `null`, uses platform defaults (CoreML on macOS, CUDA on Linux, CPU on Windows). |
| `cacheNamespace` | `Optional<String>` | `null` | Cache namespace for tenant isolation. When set, cache entries are stored under `{cache_dir}/{namespace}/`. Must be alphanumeric, hyphens, or underscores only (max 64 chars). Different namespaces have isolated cache spaces on the same filesystem. |
| `cacheTtlSecs` | `Optional<Long>` | `null` | Per-request cache TTL in seconds. Overrides the global `max_age_days` for this specific extraction. When `0`, caching is completely skipped (no read or write). When `null`, the global TTL applies. |
| `email` | `Optional<EmailConfig>` | `null` | Email extraction configuration (`null` = use defaults). Currently supports configuring the fallback codepage for MSG files that do not specify one. See `EmailConfig` for details. |
| `concurrency` | `Optional<String>` | `null` | Concurrency limits for constrained environments (`null` = use defaults). Controls Rayon thread pool size, ONNX Runtime intra-op threads, and (when `maxConcurrentExtractions` is unset) the batch concurrency semaphore. See `ConcurrencyConfig` for details. |
| `maxArchiveDepth` | `long` | — | Maximum recursion depth for archive extraction (default: 3). Set to 0 to disable recursive extraction (legacy behavior). |
| `treeSitter` | `Optional<TreeSitterConfig>` | `null` | Tree-sitter language pack configuration (`null` = tree-sitter disabled). When set, enables code file extraction using tree-sitter parsers. Controls grammar download behavior and code analysis options. |
| `structuredExtraction` | `Optional<StructuredExtractionConfig>` | `null` | Structured extraction via LLM (`null` = disabled). When set, the extracted document content is sent to an LLM with the provided JSON schema. The structured response is stored in `ExtractionResult.structured_output`. |
| `cancelToken` | `Optional<String>` | `null` | Cancellation token for this extraction (`null` = no external cancellation). Pass a `CancellationToken` clone here and call `CancellationToken.cancel` from another thread / task to abort the extraction in progress. The extractor checks the token at safe checkpoints (before lock acquisition, between pages, between batch items) and returns `KreuzbergError.Cancelled` when set. The field is excluded from serialization because `CancellationToken` is a runtime handle, not a configuration value. |
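
Two of the rules in the table are easy to pre-check on the caller side: the `cacheNamespace` character and length constraint, and the default for `maxConcurrentExtractions`. The sketch below is illustrative only. `ExtractionConfigRules` is a hypothetical helper, "alphanumeric" is assumed to mean ASCII letters and digits, and an empty namespace is assumed to be invalid; none of this is confirmed by the reference.

```java
import java.util.regex.Pattern;

/** Hypothetical caller-side checks for two documented ExtractionConfig rules. */
final class ExtractionConfigRules {
    // Assumption: ASCII alphanumerics, hyphens, underscores; 1 to 64 characters.
    private static final Pattern NAMESPACE = Pattern.compile("[A-Za-z0-9_-]{1,64}");

    private ExtractionConfigRules() {}

    static boolean isValidCacheNamespace(String namespace) {
        return namespace != null && NAMESPACE.matcher(namespace).matches();
    }

    /** ceil(num_cpus x 1.5), the documented default when the field is unset. */
    static long defaultMaxConcurrentExtractions(int cpus) {
        return (long) Math.ceil(cpus * 1.5);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("default concurrency: " + defaultMaxConcurrentExtractions(cpus));
        System.out.println("tenant-a_1 valid: " + isValidCacheNamespace("tenant-a_1"));
        System.out.println("a/b valid: " + isValidCacheNamespace("a/b"));
    }
}
```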

### Methods

#### defaultOptions()

**Signature:**

```java
public static ExtractionConfig defaultOptions()
```

#### needsImageProcessing()

Check if image processing is needed by examining OCR and image extraction settings.

Returns `true` if either OCR is enabled or image extraction is configured,
indicating that image decompression and processing should occur.
Returns `false` if both are disabled, allowing optimization to skip unnecessary
image decompression for text-only extraction workflows.

### Optimization Impact

For text-only extractions (no OCR, no image extraction), skipping image
decompression can improve CPU utilization by 5-10% by avoiding wasteful
image I/O and processing when results won't be used.

**Signature:**

```java
public boolean needsImageProcessing()
```
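
The rule that `needsImageProcessing()` documents reduces to a logical OR over two settings. The sketch below assumes a hypothetical `Settings` record with two boolean flags; the real `ExtractionConfig` derives these from its OCR and image extraction fields, which are not modelled here.

```java
class NeedsImageProcessingSketch {
    // Hypothetical, simplified view of the two settings the check depends on.
    record Settings(boolean ocrEnabled, boolean imageExtractionConfigured) {
        // True if either feature is on, false only when both are off.
        boolean needsImageProcessing() {
            return ocrEnabled || imageExtractionConfigured;
        }
    }

    public static void main(String[] args) {
        Settings textOnly = new Settings(false, false);
        Settings withOcr = new Settings(true, false);
        System.out.println("text-only needs images: " + textOnly.needsImageProcessing());
        System.out.println("OCR needs images: " + withOcr.needsImageProcessing());
    }
}
```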

---

#### ExtractionDiff

The complete diff between two `ExtractionResult` values.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `contentDiff` | `List<DiffHunk>` | — | Unified-diff hunks for the `content` field. Empty when the content is identical. |
| `tablesAdded` | `List<Table>` | — | Tables present in `b` but not in `a` (by index position, excess right-side tables). |
| `tablesRemoved` | `List<Table>` | — | Tables present in `a` but not in `b` (by index position, excess left-side tables). |
| `tablesChanged` | `List<TableDiff>` | — | Cell-level changes for table pairs that share the same index and dimensions. |
| `metadataChanged` | `Object` | — | Metadata difference, encoded as a JSON object with three top-level keys: `added` (keys present in `b` but not `a`), `removed` (keys present in `a` but not `b`), and `changed` (keys whose values differ; each entry is `{ "from": <value-in-a>, "to": <value-in-b> }`). This is NOT RFC 6902 JSON Patch: the flatter shape was chosen deliberately to avoid pulling in a json-patch crate. If you need RFC 6902 semantics (with JSON Pointer paths), feed `a.metadata` and `b.metadata` to your preferred json-patch implementation directly. |
| `embeddedChanges` | `EmbeddedChanges` | — | Changes to embedded archive children. |
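
The flat `added` / `removed` / `changed` shape of `metadataChanged` can be reproduced over two plain maps, as in the sketch below. It is an illustration only: `MetadataDiffSketch.diff` is a hypothetical helper, and storing each added or removed key together with its value is an assumption, since the description above only says which keys appear in each group.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

class MetadataDiffSketch {
    // Builds the flat {added, removed, changed} shape from two metadata maps.
    static Map<String, Map<String, Object>> diff(Map<String, Object> a, Map<String, Object> b) {
        Map<String, Object> added = new TreeMap<>();
        Map<String, Object> removed = new TreeMap<>();
        Map<String, Object> changed = new TreeMap<>();
        for (Map.Entry<String, Object> e : b.entrySet()) {
            String key = e.getKey();
            if (!a.containsKey(key)) {
                added.put(key, e.getValue()); // only in b
            } else if (!Objects.equals(a.get(key), e.getValue())) {
                Map<String, Object> fromTo = new LinkedHashMap<>();
                fromTo.put("from", a.get(key));
                fromTo.put("to", e.getValue());
                changed.put(key, fromTo); // in both, values differ
            }
        }
        for (Map.Entry<String, Object> e : a.entrySet()) {
            if (!b.containsKey(e.getKey())) {
                removed.put(e.getKey(), e.getValue()); // only in a
            }
        }
        Map<String, Map<String, Object>> out = new LinkedHashMap<>();
        out.put("added", added);
        out.put("removed", removed);
        out.put("changed", changed);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> a = Map.of("title", "Draft", "pages", 3);
        Map<String, Object> b = Map.of("title", "Final", "author", "x");
        System.out.println(diff(a, b));
    }
}
```

Unlike RFC 6902, this shape carries no operation order and no JSON Pointer paths, so it only describes top-level key differences.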

---

#### ExtractionResult

General extraction result used by the core extraction API.

This is the main result type returned by all extraction functions.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | The extracted text content |
| `mimeType` | `String` | — | The detected MIME type |
| `metadata` | `Metadata` | — | Document metadata |
| `extractionMethod` | `Optional<ExtractionMethod>` | `null` | Extraction strategy used to produce the returned text. Populated when the extractor can reliably distinguish native text extraction, OCR-only extraction, or mixed native/OCR output. |
| `tables` | `List<Table>` | `Collections.emptyList()` | Tables extracted from the document |
| `detectedLanguages` | `Optional<List<String>>` | `Collections.emptyList()` | Detected languages |
| `chunks` | `Optional<List<Chunk>>` | `Collections.emptyList()` | Text chunks when chunking is enabled. When chunking configuration is provided, the content is split into overlapping chunks for efficient processing. Each chunk contains the text, optional embeddings (if enabled), and metadata about its position. |
| `images` | `Optional<List<ExtractedImage>>` | `Collections.emptyList()` | Extracted images from the document. When image extraction is enabled via `ImageExtractionConfig`, this field contains all images found in the document with their raw data and metadata. Each image may optionally contain a nested `ocr_result` if OCR was performed. |
| `pages` | `Optional<List<PageContent>>` | `Collections.emptyList()` | Per-page content when page extraction is enabled. When page extraction is configured, the document is split into per-page content with tables and images mapped to their respective pages. |
| `elements` | `Optional<List<Element>>` | `Collections.emptyList()` | Semantic elements when element-based result format is enabled. When `resultFormat` is set to `ElementBased`, this field contains semantic elements with type classification, unique identifiers, and metadata for Unstructured-compatible element-based processing. |
| `djotContent` | `Optional<DjotContent>` | `null` | Rich Djot content structure (when extracting Djot documents). When extracting Djot documents with structured extraction enabled, this field contains the full semantic structure, including block-level elements with nesting, inline formatting with attributes, links, images, footnotes, math expressions, and complete attribute information. The `content` field still contains plain text for backward compatibility. Always `null` for non-Djot documents. |
| `ocrElements` | `Optional<List<OcrElement>>` | `Collections.emptyList()` | OCR elements with full spatial and confidence metadata. When OCR is performed with element extraction enabled, this field contains the structured representation of detected text, including bounding geometry (rectangles or quadrilaterals), confidence scores (detection and recognition), rotation information, and hierarchical relationships (Tesseract only). This field preserves all metadata that would otherwise be lost when converting to plain text or markdown output formats. Only populated when `OcrElementConfig.include_elements` is true. |
| `document` | `Optional<DocumentStructure>` | `null` | Structured document tree (when document structure extraction is enabled). When `includeDocumentStructure` is true in `ExtractionConfig`, this field contains the full hierarchical representation of the document, including heading-driven section nesting, table grids with cell-level metadata, content layer classification (body, header, footer, footnote), inline text annotations (formatting, links), and bounding boxes and page numbers. Independent of `resultFormat`; can be combined with `Unified` or `ElementBased`. |
| `extractedKeywords` | `Optional<List<Keyword>>` | `Collections.emptyList()` | Extracted keywords when keyword extraction is enabled. When keyword extraction (RAKE or YAKE) is configured, this field contains the extracted keywords with scores, algorithm info, and position data. Previously stored in `metadata.additional["keywords"]`. |
| `qualityScore` | `Optional<Double>` | `null` | Document quality score from quality analysis. A value between 0.0 and 1.0 indicating the overall text quality. Previously stored in `metadata.additional["quality_score"]`. |
| `processingWarnings` | `List<ProcessingWarning>` | `Collections.emptyList()` | Non-fatal warnings collected during processing pipeline stages. Captures errors from optional pipeline features (embedding, chunking, language detection, output formatting) that don't prevent extraction but may indicate degraded results. Previously stored as individual keys in `metadata.additional`. |
| `annotations` | `Optional<List<PdfAnnotation>>` | `Collections.emptyList()` | PDF annotations extracted from the document. When annotation extraction is enabled via `PdfConfig.extract_annotations`, this field contains text notes, highlights, links, stamps, and other annotations found in PDF documents. |
| `children` | `Optional<List<ArchiveEntry>>` | `Collections.emptyList()` | Nested extraction results from archive contents. When extracting archives, each processable file inside produces its own full extraction result. Set to `null` for non-archive formats. Use `max_archive_depth` in config to control recursion depth. |
| `uris` | `Optional<List<ExtractedUri>>` | `Collections.emptyList()` | URIs/links discovered during document extraction. Contains hyperlinks, image references, citations, email addresses, and other URI-like references found in the document. Always extracted when present in the source document. |
| `revisions` | `Optional<List<DocumentRevision>>` | `Collections.emptyList()` | Tracked changes embedded in the source document. Populated by per-format extractors that understand change-tracking metadata (DOCX `w:ins`/`w:del`/`w:rPrChange`, ODT `text:change-*`, …). Every extractor defaults to `null` until its format-specific implementation is added. Extractors that do populate this field follow the "accepted-changes" convention: inserted text is present in `content`, deleted text is absent; the revision list is the separate audit trail. |
| `structuredOutput` | `Optional<Object>` | `null` | Structured extraction output from LLM-based JSON schema extraction. When `structuredExtraction` is configured in `ExtractionConfig`, the extracted document content is sent to a VLM with the provided JSON schema. The response is parsed and stored here as a JSON value matching the schema. |
| `codeIntelligence` | `Optional<Object>` | `null` | Code intelligence results from tree-sitter analysis. Populated when extracting source code files with the `tree-sitter` feature. Contains metrics, structural analysis, imports/exports, comments, docstrings, symbols, diagnostics, and optionally chunked code segments. Stored as an opaque JSON value so that all language bindings (Go, Java, C#, …) can deserialize it as a raw JSON object rather than a typed struct. The underlying type is `tree_sitter_language_pack.ProcessResult`. |
| `llmUsage` | `Optional<List<LlmUsage>>` | `Collections.emptyList()` | LLM token usage and cost data for all LLM calls made during this extraction. Contains one entry per LLM call. Multiple entries are produced when VLM OCR, structured extraction, or LLM embeddings run during the same extraction. `null` when no LLM was used. |
| `formattedContent` | `Optional<String>` | `null` | Pre-rendered content in the requested output format. Populated during `derive_extraction_result` before tree derivation consumes element data. `apply_output_format` swaps this into `content` at the end of the pipeline, after post-processors have operated on plain text. |
| `ocrInternalDocument` | `Optional<String>` | `null` | Structured hOCR document for the OCR+layout pipeline. When Tesseract produces hOCR output, the parsed `InternalDocument` carries paragraph structure with bounding boxes and confidence scores. The layout classification step enriches these elements before final rendering. |

### Methods

#### fromOcr()

Convert from an OCR result.

**Signature:**

```java
public static ExtractionResult fromOcr(OcrExtractionResult ocr)
```

---

#### FictionBookMetadata

FictionBook (FB2) metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `genres` | `List<String>` | `Collections.emptyList()` | Genres |
| `sequences` | `List<String>` | `Collections.emptyList()` | Sequences |
| `annotation` | `Optional<String>` | `null` | Annotation |

---

#### FileExtractionConfig

Per-file extraction configuration overrides for batch processing.

All fields are `Optional<T>`; `null` means "use the batch-level default."
This type is used with `batch_extract_files` and
`batch_extract_bytes` to allow heterogeneous
extraction settings within a single batch.

### Excluded Fields

The following `ExtractionConfig` fields are batch-level only and
cannot be overridden per file:

- `max_concurrent_extractions`: controls batch parallelism
- `use_cache`: global caching policy
- `acceleration`: shared ONNX execution provider
- `security_limits`: global archive security policy

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enableQualityProcessing` | `Optional<Boolean>` | `null` | Override quality post-processing for this file. |
| `ocr` | `Optional<OcrConfig>` | `null` | Override OCR configuration for this file (`null` = use batch default). |
| `forceOcr` | `Optional<Boolean>` | `null` | Override force OCR for this file. |
| `forceOcrPages` | `Optional<List<Integer>>` | `Collections.emptyList()` | Override force OCR pages for this file (1-indexed page numbers). |
| `disableOcr` | `Optional<Boolean>` | `null` | Override disable OCR for this file. |
| `chunking` | `Optional<ChunkingConfig>` | `null` | Override chunking configuration for this file. |
| `contentFilter` | `Optional<ContentFilterConfig>` | `null` | Override content filtering configuration for this file. |
| `images` | `Optional<ImageExtractionConfig>` | `null` | Override image extraction configuration for this file. |
| `pdfOptions` | `Optional<PdfConfig>` | `null` | Override PDF options for this file. |
| `tokenReduction` | `Optional<TokenReductionOptions>` | `null` | Override token reduction for this file. |
| `languageDetection` | `Optional<LanguageDetectionConfig>` | `null` | Override language detection for this file. |
| `pages` | `Optional<PageConfig>` | `null` | Override page extraction for this file. |
| `keywords` | `Optional<KeywordConfig>` | `null` | Override keyword extraction for this file. |
| `postprocessor` | `Optional<PostProcessorConfig>` | `null` | Override post-processor for this file. |
| `htmlOptions` | `Optional<String>` | `null` | Override HTML conversion options for this file. |
| `resultFormat` | `Optional<ResultFormat>` | `null` | Override result format for this file. |
| `outputFormat` | `Optional<OutputFormat>` | `null` | Override output content format for this file. |
| `includeDocumentStructure` | `Optional<Boolean>` | `null` | Override document structure output for this file. |
| `layout` | `Optional<LayoutDetectionConfig>` | `null` | Override layout detection for this file. |
| `timeoutSecs` | `Optional<Long>` | `null` | Override per-file extraction timeout in seconds. When set, the extraction for this file will be canceled after the specified duration. A timed-out file produces an error result without affecting other files in the batch. |
| `treeSitter` | `Optional<TreeSitterConfig>` | `null` | Override tree-sitter configuration for this file. |
| `structuredExtraction` | `Optional<StructuredExtractionConfig>` | `null` | Override structured extraction configuration for this file. When set, enables LLM-based structured extraction with a JSON schema for this specific file. The extracted content is sent to a VLM/LLM and the response is parsed according to the provided schema. |
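
The "`null` means use the batch-level default" rule amounts to a per-field fallback. The sketch below shows that resolution with a hypothetical `resolve` helper; how the library actually merges a `FileExtractionConfig` into the batch `ExtractionConfig` is not shown in this reference, so treat this as an assumption about the semantics rather than the implementation.

```java
import java.util.Optional;

class OverrideSketch {
    // Hypothetical helper: a per-file value wins when present,
    // otherwise the batch-level default applies.
    static <T> T resolve(T perFile, T batchDefault) {
        return Optional.ofNullable(perFile).orElse(batchDefault);
    }

    public static void main(String[] args) {
        Boolean batchForceOcr = Boolean.FALSE;
        Long batchTimeoutSecs = 120L;

        // This file overrides only the timeout; forceOcr is left unset (null).
        Boolean fileForceOcr = null;
        Long fileTimeoutSecs = 30L;

        System.out.println("forceOcr = " + resolve(fileForceOcr, batchForceOcr));
        System.out.println("timeoutSecs = " + resolve(fileTimeoutSecs, batchTimeoutSecs));
    }
}
```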

---

#### Footnote

Footnote in Djot.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `label` | `String` | — | Footnote label |
| `content` | `List<FormattedBlock>` | — | Footnote content blocks |

---

#### FormattedBlock

Block-level element in a Djot document.

Represents structural elements like headings, paragraphs, lists, code blocks, etc.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `blockType` | `BlockType` | — | Type of block element |
| `level` | `Optional<Long>` | `null` | Heading level (1-6) for headings, or nesting level for lists |
| `inlineContent` | `List<InlineElement>` | — | Inline content within the block |
| `attributes` | `Optional<String>` | `null` | Element attributes (classes, IDs, key-value pairs) |
| `language` | `Optional<String>` | `null` | Language identifier for code blocks |
| `code` | `Optional<String>` | `null` | Raw code content for code blocks |
| `children` | `List<FormattedBlock>` | `/* serde(default) */` | Nested blocks for containers (blockquotes, list items, divs) |

---

#### GridCell

Individual grid cell with position and span metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Cell text content. |
| `row` | `int` | — | Zero-indexed row position. |
| `col` | `int` | — | Zero-indexed column position. |
| `rowSpan` | `int` | `/* serde(default) */` | Number of rows this cell spans. |
| `colSpan` | `int` | `/* serde(default) */` | Number of columns this cell spans. |
| `isHeader` | `boolean` | `/* serde(default) */` | Whether this is a header cell. |
| `bbox` | `Optional<BoundingBox>` | `null` | Bounding box for this cell (if available). |
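
To see how `row`, `col`, `rowSpan`, and `colSpan` fit together, the sketch below expands a list of cells into a dense grid. `Cell` and `toGrid` are hypothetical names that mirror only the position and span fields above, and treating a span of 0 as "one row or column" is an assumption, because the default span value is not stated in this reference.

```java
import java.util.List;

class GridSketch {
    // Hypothetical minimal cell; mirrors the documented position/span fields only.
    record Cell(String content, int row, int col, int rowSpan, int colSpan) {}

    // Expands spanning cells into a dense rows x cols text grid.
    static String[][] toGrid(List<Cell> cells, int rows, int cols) {
        String[][] grid = new String[rows][cols];
        for (Cell c : cells) {
            int rs = Math.max(1, c.rowSpan()); // assumption: 0 means a single row
            int cs = Math.max(1, c.colSpan()); // assumption: 0 means a single column
            for (int r = c.row(); r < Math.min(rows, c.row() + rs); r++) {
                for (int k = c.col(); k < Math.min(cols, c.col() + cs); k++) {
                    grid[r][k] = c.content();
                }
            }
        }
        return grid;
    }

    public static void main(String[] args) {
        List<Cell> cells = List.of(
                new Cell("Header", 0, 0, 1, 2), // spans both columns of row 0
                new Cell("a", 1, 0, 1, 1),
                new Cell("b", 1, 1, 1, 1));
        for (String[] row : toGrid(cells, 2, 2)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```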

---

#### HeaderMetadata

Header/heading element metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `level` | `byte` | — | Header level: 1 (h1) through 6 (h6) |
| `text` | `String` | — | Normalized text content of the header |
| `id` | `Optional<String>` | `null` | HTML id attribute if present |
| `depth` | `int` | — | Document tree depth at the header element |
| `htmlOffset` | `int` | — | Byte offset in original HTML document |

---

#### HeadingContext

Heading context for a chunk within a Markdown document.

Contains the heading hierarchy from document root to this chunk's section.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `headings` | `List<HeadingLevel>` | — | The heading hierarchy from document root to this chunk's section. Index 0 is the outermost (h1), last element is the most specific. |
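
One way to picture the `headings` list is as a stack that is updated while reading Markdown lines: a new heading pops every entry at the same or a deeper level, then pushes itself. The sketch below illustrates that idea with hypothetical `Heading` and `contextAfter` names. It handles only ATX headings (`#` through `######`) and ignores fenced code blocks, so it is a simplification rather than the chunker's actual logic.

```java
import java.util.ArrayList;
import java.util.List;

class HeadingContextSketch {
    // Hypothetical counterpart of HeadingLevel: depth plus text.
    record Heading(int level, String text) {}

    // Returns the heading path in effect after reading all given Markdown lines.
    static List<Heading> contextAfter(List<String> lines) {
        List<Heading> stack = new ArrayList<>();
        for (String line : lines) {
            int level = 0;
            while (level < line.length() && line.charAt(level) == '#') {
                level++;
            }
            boolean isHeading = level >= 1 && level <= 6
                    && level < line.length() && line.charAt(level) == ' ';
            if (!isHeading) {
                continue;
            }
            // Drop headings at the same or a deeper level before pushing the new one.
            while (!stack.isEmpty() && stack.get(stack.size() - 1).level() >= level) {
                stack.remove(stack.size() - 1);
            }
            stack.add(new Heading(level, line.substring(level + 1).trim()));
        }
        return stack;
    }

    public static void main(String[] args) {
        List<String> doc = List.of(
                "# Guide", "## Install", "some text", "## Usage", "### CLI", "chunk body");
        System.out.println(contextAfter(doc));
    }
}
```

For the sample document the resulting path is Guide, then Usage, then CLI: index 0 is the outermost heading and the last element is the most specific, matching the field description.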

---

#### HeadingLevel

A single heading in the hierarchy.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `level` | `byte` | — | Heading depth (1 = h1, 2 = h2, etc.) |
| `text` | `String` | — | The text content of the heading. |

---

#### HierarchicalBlock

A text block with hierarchy level assignment.

Represents a block of text with semantic heading information extracted from
font size clustering and hierarchical analysis.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `text` | `String` | — | The text content of this block |
| `fontSize` | `float` | — | The font size of the text in this block |
| `level` | `String` | — | The hierarchy level of this block (H1-H6 or Body). Levels correspond to HTML heading tags: "h1" (top-level heading), "h2" (secondary heading), "h3" (tertiary heading), "h4" (quaternary heading), "h5" (quinary heading), "h6" (senary heading), and "body" (body text, no heading level). |
| `bbox` | `Optional<List<Float>>` | `null` | Bounding box information for the block. Contains coordinates as (left, top, right, bottom) in PDF units. |

---

#### HierarchyConfig

Hierarchy extraction configuration for PDF text structure analysis.

Enables extraction of document hierarchy levels (H1-H6) based on font size
clustering and semantic analysis. When enabled, hierarchical blocks are
included in page content.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | `boolean` | `true` | Enable hierarchy extraction |
| `kClusters` | `long` | `3` | Number of font size clusters to use for hierarchy levels (1-7). Default: 6, which provides H1-H6 heading levels with body text. Larger values create more fine-grained hierarchy levels. |
| `includeBbox` | `boolean` | `true` | Include bounding box information in hierarchy blocks |
| `ocrCoverageThreshold` | `Optional<Float>` | `null` | OCR coverage threshold for smart OCR triggering (0.0-1.0). Determines when OCR should be triggered based on text block coverage. OCR is triggered when text blocks cover less than this fraction of the page. Default: 0.5 (trigger OCR if less than 50% of page has text). |
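
The `ocrCoverageThreshold` rule compares the fraction of the page covered by text blocks against the threshold. The sketch below is one plausible reading of that rule, not the library's implementation: `coverage` and `shouldTriggerOcr` are hypothetical helpers, blocks use the (left, top, right, bottom) layout described for `HierarchicalBlock.bbox`, and overlapping blocks are simply counted twice.

```java
class OcrCoverageSketch {
    // Fraction of the page area covered by text blocks given as {left, top, right, bottom}.
    // Overlaps are not merged, so the result is capped at 1.0.
    static double coverage(float[][] blocks, float pageWidth, float pageHeight) {
        double covered = 0;
        for (float[] b : blocks) {
            // abs() keeps the area positive whichever way the vertical axis points.
            covered += (double) Math.abs(b[2] - b[0]) * Math.abs(b[3] - b[1]);
        }
        return Math.min(1.0, covered / ((double) pageWidth * pageHeight));
    }

    // OCR is triggered when text covers less than the threshold fraction of the page.
    static boolean shouldTriggerOcr(double coverage, double threshold) {
        return coverage < threshold;
    }

    public static void main(String[] args) {
        float[][] sparse = {{0f, 0f, 50f, 100f}}; // a quarter of a 200 x 100 page
        double c = coverage(sparse, 200f, 100f);
        System.out.println("coverage = " + c + ", trigger OCR = " + shouldTriggerOcr(c, 0.5));
    }
}
```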

### Methods

#### defaultOptions()

**Signature:**

```java
public static HierarchyConfig defaultOptions()
```

---

#### HtmlMetadata

HTML metadata extracted from HTML documents.

Includes document-level metadata, Open Graph data, Twitter Card metadata,
and extracted structural elements (headers, links, images, structured data).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `title` | `Optional<String>` | `null` | Document title from `<title>` tag |
| `description` | `Optional<String>` | `null` | Document description from `<meta name="description">` tag |
| `keywords` | `List<String>` | `Collections.emptyList()` | Document keywords from `<meta name="keywords">` tag, split on commas |
| `author` | `Optional<String>` | `null` | Document author from `<meta name="author">` tag |
| `canonicalUrl` | `Optional<String>` | `null` | Canonical URL from `<link rel="canonical">` tag |
| `baseHref` | `Optional<String>` | `null` | Base URL from `<base href="">` tag for resolving relative URLs |
| `language` | `Optional<String>` | `null` | Document language from `lang` attribute |
| `textDirection` | `Optional<TextDirection>` | `null` | Document text direction from `dir` attribute |
| `openGraph` | `Map<String, String>` | `Collections.emptyMap()` | Open Graph metadata (og:* properties) for social media. Keys like "title", "description", "image", "url", etc. |
| `twitterCard` | `Map<String, String>` | `Collections.emptyMap()` | Twitter Card metadata (twitter:* properties). Keys like "card", "site", "creator", "title", "description", "image", etc. |
| `metaTags` | `Map<String, String>` | `Collections.emptyMap()` | Additional meta tags not covered by specific fields. Keys are meta name/property attributes, values are content. |
| `headers` | `List<HeaderMetadata>` | `Collections.emptyList()` | Extracted header elements with hierarchy |
| `links` | `List<LinkMetadata>` | `Collections.emptyList()` | Extracted hyperlinks with type classification |
| `images` | `List<ImageMetadataType>` | `Collections.emptyList()` | Extracted images with source and dimensions |
| `structuredData` | `List<StructuredData>` | `Collections.emptyList()` | Extracted structured data blocks |

---

#### HtmlOutputConfig

Configuration for styled HTML output.

When set on `ExtractionConfig.html_output` alongside
`output_format = OutputFormat.Html`, the pipeline builds a
`StyledHtmlRenderer` instead of
the plain comrak-based renderer.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `css` | `Optional<String>` | `null` | Inline CSS string injected into the output after the theme stylesheet. Concatenated after `cssFile` content when both are set. |
| `cssFile` | `Optional<String>` | `null` | Path to a CSS file loaded once at renderer construction time. Concatenated before `css` when both are set. |
| `theme` | `HtmlTheme` | `HtmlTheme.UNSTYLED` | Built-in colour/typography theme. Default: `HtmlTheme.UNSTYLED`. |
| `classPrefix` | `String` | — | CSS class prefix applied to every emitted class name. Default: `"kb-"`. Change this if your host application already uses classes that start with `kb-`. |
| `embedCss` | `boolean` | `true` | When `true` (default), write the resolved CSS into a `<style>` block immediately after the opening `<div class="{prefix}doc">`. Set to `false` to emit only the structural markup and wire up your own stylesheet targeting the `kb-*` class names. |
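
The ordering rules in the table (theme stylesheet first, then the CSS file content, then the inline `css` string, all optionally embedded in a `<style>` block) can be sketched as plain string assembly. `resolveCss` and `render` below are hypothetical helpers, not `StyledHtmlRenderer`, and placing the theme before the CSS file content is an assumption inferred from the field descriptions.

```java
class CssOrderSketch {
    // Joins the non-empty parts in order: theme, then css file content, then inline css.
    static String resolveCss(String themeCss, String cssFileContent, String inlineCss) {
        StringBuilder sb = new StringBuilder();
        for (String part : new String[] {themeCss, cssFileContent, inlineCss}) {
            if (part != null && !part.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(part);
            }
        }
        return sb.toString();
    }

    // Wraps body markup in the root div, optionally embedding a <style> block first.
    static String render(String body, String css, String classPrefix, boolean embedCss) {
        String style = (embedCss && !css.isEmpty()) ? "<style>" + css + "</style>" : "";
        return "<div class=\"" + classPrefix + "doc\">" + style + body + "</div>";
    }

    public static void main(String[] args) {
        String css = resolveCss("", ".kb-doc{margin:0}", ".kb-doc p{color:#333}");
        System.out.println(render("<p>Hello</p>", css, "kb-", true));
        System.out.println(render("<p>Hello</p>", css, "kb-", false));
    }
}
```

Because later rules win in CSS when specificity is equal, putting the inline `css` string last lets it override both the theme and the file-based stylesheet.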

### Methods

#### defaultOptions()

**Signature:**

```java
public static HtmlOutputConfig defaultOptions()
```

---

#### ImageExtractionConfig

Image extraction configuration.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `extractImages` | `boolean` | `true` | Extract images from documents |
| `targetDpi` | `int` | `300` | Target DPI for image normalization |
| `maxImageDimension` | `int` | `4096` | Maximum dimension for images (width or height) |
| `injectPlaceholders` | `boolean` | `true` | Whether to inject image reference placeholders into markdown output. When `true` (default), image reference placeholders are appended to the markdown. Set to `false` to extract images as data without polluting the markdown output. |
| `autoAdjustDpi` | `boolean` | `true` | Automatically adjust DPI based on image content |
| `minDpi` | `int` | `72` | Minimum DPI threshold |
| `maxDpi` | `int` | `600` | Maximum DPI threshold |
| `maxImagesPerPage` | `Optional<Integer>` | `null` | Maximum number of image objects to extract per PDF page. Some PDFs (e.g. technical diagrams stored as thousands of raster fragments) can trigger extremely long or indefinite extraction times when every image object on a dense page is decoded individually via the PDF extractor. Setting this limit causes kreuzberg to stop collecting individual images once the count per page reaches the cap and emit a warning instead. `null` (default) means no limit; all images are extracted. |
| `classify` | `boolean` | `true` | When `true` (default), extracted images are classified by kind and grouped into clusters where they appear to belong to one figure. |
| `includePageRasters` | `boolean` | `false` | When `true`, full-page renders produced during OCR preprocessing are captured and returned as `ImageKind.PageRaster` entries in `ExtractionResult.images`. **PDF + OCR only.** No rasters are captured for non-PDF inputs or when the document-level OCR bypass is active (whole-document backend). When OCR is enabled and this flag is set but the active backend skips per-page rendering, a `ProcessingWarning` is emitted in `ExtractionResult.processingWarnings`. Defaults to `false`. Enable when downstream consumers need page thumbnails (e.g. citation previews, visual grounding). |
| `runOcrOnImages` | `boolean` | `true` | Run OCR on extracted images and include the recognized text in the document content. When `true` (default) and `ExtractionConfig.ocr` is configured, extracted images are processed with the configured OCR backend. Set to `false` to extract images without OCR processing, even when OCR is enabled. |
| `ocrTextOnly` | `boolean` | `false` | When `true`, image OCR results are rendered as plain text without the image markdown placeholder. Only takes effect when `runOcrOnImages` is also `true`. |
| `appendOcrText` | `boolean` | `false` | When `true` and `ocrTextOnly` is `false`, append the OCR text after the image placeholder in the rendered output. |

### Methods

#### defaultOptions()

**Signature:**

```java
public static ImageExtractionConfig defaultOptions()
```

---

#### ImageMetadata

Image metadata extracted from image files.

Includes dimensions, format, and EXIF data.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `width` | `int` | — | Image width in pixels |
| `height` | `int` | — | Image height in pixels |
| `format` | `String` | — | Image format (e.g., "PNG", "JPEG", "TIFF") |
| `exif` | `Map<String, String>` | `Collections.emptyMap()` | EXIF metadata tags |

---

#### ImageMetadataType

Image element metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `src` | `String` | — | Image source (URL, data URI, or SVG content) |
| `alt` | `Optional<String>` | `null` | Alternative text from alt attribute |
| `title` | `Optional<String>` | `null` | Title attribute |
| `dimensions` | `Optional<List<Integer>>` | `null` | Image dimensions as (width, height) if available |
| `imageType` | `ImageType` | — | Image type classification |
| `attributes` | `List<List<String>>` | — | Additional attributes as key-value pairs |

---

#### ImagePreprocessingConfig

Image preprocessing configuration for OCR.

These settings control how images are preprocessed before OCR to improve
text recognition quality. Different preprocessing strategies work better
for different document types.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `targetDpi` | `int` | `300` | Target DPI for the image (300 is standard, 600 for small text). |
| `autoRotate` | `boolean` | `true` | Auto-detect and correct image rotation. |
| `deskew` | `boolean` | `true` | Correct skew (tilted images). |
| `denoise` | `boolean` | `false` | Remove noise from the image. |
| `contrastEnhance` | `boolean` | `false` | Enhance contrast for better text visibility. |
| `binarizationMethod` | `String` | `"otsu"` | Binarization method: "otsu", "sauvola", "adaptive". |
| `invertColors` | `boolean` | `false` | Invert colors (white text on black → black on white). |

### Methods

#### defaultOptions()

**Signature:**

```java
public static ImagePreprocessingConfig defaultOptions()
```

---

#### ImagePreprocessingMetadata

Image preprocessing metadata.

Tracks the transformations applied to an image during OCR preprocessing,
including DPI normalization, resizing, and resampling.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `originalDimensions` | `List<Long>` | — | Original image dimensions (width, height) in pixels |
| `originalDpi` | `List<Double>` | — | Original image DPI (horizontal, vertical) |
| `targetDpi` | `int` | — | Target DPI from configuration |
| `scaleFactor` | `double` | — | Scaling factor applied to the image |
| `autoAdjusted` | `boolean` | — | Whether DPI was auto-adjusted based on content |
| `finalDpi` | `int` | — | Final DPI after processing |
| `newDimensions` | `Optional<List<Long>>` | `null` | New dimensions after resizing (if resized) |
| `resampleMethod` | `String` | — | Resampling algorithm used ("LANCZOS3", "CATMULLROM", etc.) |
| `dimensionClamped` | `boolean` | — | Whether dimensions were clamped to `maxImageDimension` |
| `calculatedDpi` | `Optional<Integer>` | `null` | Calculated optimal DPI (if `autoAdjustDpi` enabled) |
| `skippedResize` | `boolean` | — | Whether resize was skipped (dimensions already optimal) |
| `resizeError` | `Optional<String>` | `null` | Error message if resize failed |
|
|||
|
|
|
|||
|
|
---

#### InlineElement

Inline element within a block.

Represents text with formatting, links, images, etc.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `elementType` | `InlineType` | — | Type of inline element |
| `content` | `String` | — | Text content |
| `attributes` | `Optional<String>` | `null` | Element attributes |
| `metadata` | `Optional<Map<String, String>>` | `null` | Additional metadata (e.g., href for links, src/alt for images) |

---

#### JatsMetadata

JATS (Journal Article Tag Suite) metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `copyright` | `Optional<String>` | `null` | Copyright |
| `license` | `Optional<String>` | `null` | License |
| `historyDates` | `Map<String, String>` | `Collections.emptyMap()` | History dates |
| `contributorRoles` | `List<ContributorRole>` | `Collections.emptyList()` | Contributor roles |

---

#### Keyword

Extracted keyword with metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `text` | `String` | — | The keyword text. |
| `score` | `float` | — | Relevance score (higher is better, algorithm-specific range). |
| `algorithm` | `KeywordAlgorithm` | — | Algorithm that extracted this keyword. |
| `positions` | `Optional<List<Long>>` | `null` | Optional positions where keyword appears in text (character offsets). |

---

#### KeywordConfig

Keyword extraction configuration.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `algorithm` | `KeywordAlgorithm` | `KeywordAlgorithm.YAKE` | Algorithm to use for extraction. |
| `maxKeywords` | `long` | `10` | Maximum number of keywords to extract (default: 10). |
| `minScore` | `float` | `0` | Minimum score threshold (0.0-1.0, default: 0.0). Keywords with scores below this threshold are filtered out. Note: Score ranges differ between algorithms. |
| `ngramRange` | `List<Long>` | `Collections.emptyList()` | N-gram range for keyword extraction (min, max): (1, 1) = unigrams only; (1, 2) = unigrams and bigrams; (1, 3) = unigrams, bigrams, and trigrams (default). |
| `language` | `Optional<String>` | `null` | Language code for stopword filtering (e.g., "en", "de", "fr"). If `null`, no stopword filtering is applied. |
| `yakeParams` | `Optional<YakeParams>` | `null` | YAKE-specific tuning parameters. |
| `rakeParams` | `Optional<RakeParams>` | `null` | RAKE-specific tuning parameters. |

### Methods

#### defaultOptions()

**Signature:**

```java
public static KeywordConfig defaultOptions()
```

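
The interaction of `minScore` and `maxKeywords` can be illustrated with a small standalone sketch. It assumes that candidates below `minScore` are dropped first and that the remaining ones are ranked by descending score before the `maxKeywords` cap is applied; the ranking step is an assumption based on "higher is better" in the `Keyword` table, not a statement about the library's internals. `KeywordFilterSketch` and `select` are hypothetical names, not part of the API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the library API.
final class KeywordFilterSketch {

    // Keeps keywords scoring at least minScore, orders them by descending
    // score (assumed ranking), and returns at most maxKeywords texts.
    static List<String> select(Map<String, Float> scored, float minScore, long maxKeywords) {
        List<Map.Entry<String, Float>> kept = new ArrayList<>();
        for (Map.Entry<String, Float> entry : scored.entrySet()) {
            if (entry.getValue() >= minScore) {
                kept.add(entry);
            }
        }
        // Stable sort: ties keep their insertion order.
        kept.sort(Map.Entry.<String, Float>comparingByValue().reversed());

        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Float> entry : kept) {
            if (selected.size() >= maxKeywords) {
                break;
            }
            selected.add(entry.getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Float> scored = new LinkedHashMap<>();
        scored.put("extraction", 0.9f);
        scored.put("pdf", 0.4f);
        scored.put("the", 0.05f);
        scored.put("ocr", 0.7f);
        System.out.println(select(scored, 0.1f, 2)); // prints [extraction, ocr]
    }
}
```
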
---

#### LanguageDetectionConfig

Language detection configuration.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | `boolean` | `true` | Enable language detection |
| `minConfidence` | `double` | `0.8` | Minimum confidence threshold (0.0-1.0) |
| `detectMultiple` | `boolean` | `false` | Detect multiple languages in the document |

### Methods

#### defaultOptions()

**Signature:**

```java
public static LanguageDetectionConfig defaultOptions()
```

---

#### LayoutDetection

A single layout detection result.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `className` | `LayoutClass` | — | Detected layout class |
| `confidence` | `float` | — | Detection confidence score |
| `bbox` | `BBox` | — | Bounding box of the detected region |

---

#### LayoutDetectionConfig

Layout detection configuration.

Controls layout detection behavior in the extraction pipeline.
When set on `ExtractionConfig`, layout detection
is enabled for PDF extraction.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `confidenceThreshold` | `Optional<Float>` | `null` | Confidence threshold override (`null` = use model default). |
| `applyHeuristics` | `boolean` | `true` | Whether to apply postprocessing heuristics (default: true). |
| `tableModel` | `TableModel` | `TableModel.TATR` | Table structure recognition model. Controls which model is used for table cell detection within layout-detected table regions. Defaults to `TableModel.TATR`. |
| `acceleration` | `Optional<AccelerationConfig>` | `null` | Hardware acceleration for ONNX models (layout detection + table structure). When set, controls which execution provider (CPU, CUDA, CoreML, TensorRT) is used for inference. Defaults to `null` (auto-select per platform). |

### Methods

#### defaultOptions()

**Signature:**

```java
public static LayoutDetectionConfig defaultOptions()
```

---

#### LayoutRegion

A detected layout region on a page.

When layout detection is enabled, each page may have layout regions
identifying different content types (text, pictures, tables, etc.)
with confidence scores and spatial positions.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `className` | `String` | — | Layout class name (e.g. "picture", "table", "text", "section_header"). |
| `confidence` | `double` | — | Confidence score from the layout detection model (0.0 to 1.0). |
| `boundingBox` | `BoundingBox` | — | Bounding box in document coordinate space. |
| `areaFraction` | `double` | — | Fraction of the page area covered by this region (0.0 to 1.0). |

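
As a rough illustration of what `areaFraction` expresses, the sketch below divides a box area by the page area. The corner parameters (`x0`, `y0`, `x1`, `y1`) and the page size arguments are assumptions made for this example; the actual fields of `BoundingBox` are not listed in this section, and `AreaFractionSketch` is a hypothetical class.

```java
// Hypothetical helper for illustration; not part of the library API.
final class AreaFractionSketch {

    // Fraction of the page covered by the box [x0, x1] x [y0, y1],
    // clamped to the documented 0.0 to 1.0 range.
    static double areaFraction(double x0, double y0, double x1, double y1,
                               double pageWidth, double pageHeight) {
        double pageArea = pageWidth * pageHeight;
        if (pageArea <= 0.0) {
            return 0.0; // degenerate page: nothing can be covered
        }
        double boxArea = Math.max(0.0, x1 - x0) * Math.max(0.0, y1 - y0);
        return Math.min(1.0, boxArea / pageArea);
    }

    public static void main(String[] args) {
        // A box spanning half the width and half the height of a 612 x 792 page.
        System.out.println(areaFraction(0, 0, 306, 396, 612, 792)); // prints 0.25
    }
}
```
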
---

#### LinkMetadata

Link element metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `href` | `String` | — | The href URL value |
| `text` | `String` | — | Link text content (normalized) |
| `title` | `Optional<String>` | `null` | Optional title attribute |
| `linkType` | `LinkType` | — | Link type classification |
| `rel` | `List<String>` | — | Rel attribute values |
| `attributes` | `List<List<String>>` | — | Additional attributes as key-value pairs |

---

#### LlmConfig

Configuration for an LLM provider/model via liter-llm.

Each feature (VLM OCR, VLM embeddings, structured extraction) carries
its own `LlmConfig`, allowing different providers per feature.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `model` | `String` | — | Provider/model string using liter-llm routing format. Examples: `"openai/gpt-4o"`, `"anthropic/claude-sonnet-4-20250514"`, `"groq/llama-3.1-70b-versatile"`. |
| `apiKey` | `Optional<String>` | `null` | API key for the provider. When `null`, liter-llm falls back to the provider's standard environment variable (e.g., `OPENAI_API_KEY`). |
| `baseUrl` | `Optional<String>` | `null` | Custom base URL override for the provider endpoint. |
| `timeoutSecs` | `Optional<Long>` | `null` | Request timeout in seconds (default: 60). |
| `maxRetries` | `Optional<Integer>` | `null` | Maximum retry attempts (default: 3). |
| `temperature` | `Optional<Double>` | `null` | Sampling temperature for generation tasks. |
| `maxTokens` | `Optional<Long>` | `null` | Maximum tokens to generate. |

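
The table lists `null` as the default for `timeoutSecs` and `maxRetries` while the descriptions mention effective defaults of 60 and 3. The sketch below shows one way to read that: an unset value resolves to the documented default. It also splits a `provider/model` string at the first slash, which is an assumption about the routing format based only on the examples above. `LlmConfigSketch` and its methods are hypothetical and not part of the library.

```java
import java.util.Optional;

// Hypothetical helper for illustration; not part of the library API.
final class LlmConfigSketch {

    // Splits "provider/model" at the first slash and returns {provider, model}.
    static String[] splitModel(String model) {
        int slash = model.indexOf('/');
        if (slash <= 0 || slash == model.length() - 1) {
            throw new IllegalArgumentException("expected \"provider/model\", got: " + model);
        }
        return new String[] { model.substring(0, slash), model.substring(slash + 1) };
    }

    // Resolves an unset timeout to the documented default of 60 seconds.
    static long effectiveTimeoutSecs(Optional<Long> timeoutSecs) {
        return timeoutSecs.orElse(60L);
    }

    // Resolves an unset retry count to the documented default of 3.
    static int effectiveMaxRetries(Optional<Integer> maxRetries) {
        return maxRetries.orElse(3);
    }

    public static void main(String[] args) {
        String[] parts = splitModel("openai/gpt-4o");
        System.out.println(parts[0] + " -> " + parts[1]);           // prints openai -> gpt-4o
        System.out.println(effectiveTimeoutSecs(Optional.empty())); // prints 60
        System.out.println(effectiveMaxRetries(Optional.of(5)));    // prints 5
    }
}
```
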
---

#### LlmUsage

Token usage and cost data for a single LLM call made during extraction.

Populated when VLM OCR, structured extraction, or LLM-based embeddings
are used. Multiple entries may be present when multiple LLM calls occur
within one extraction (e.g. VLM OCR + structured extraction).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `model` | `String` | — | The LLM model identifier (e.g. "openai/gpt-4o", "anthropic/claude-sonnet-4-20250514"). |
| `source` | `String` | — | The pipeline stage that triggered this LLM call (e.g. "vlm_ocr", "structured_extraction", "embeddings"). |
| `inputTokens` | `Optional<Long>` | `null` | Number of input/prompt tokens consumed. |
| `outputTokens` | `Optional<Long>` | `null` | Number of output/completion tokens generated. |
| `totalTokens` | `Optional<Long>` | `null` | Total tokens (input + output). |
| `estimatedCost` | `Optional<Double>` | `null` | Estimated cost in USD based on the provider's published pricing. |
| `finishReason` | `Optional<String>` | `null` | Why the model stopped generating (e.g. "stop", "length", "content_filter"). |

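
Because one extraction can produce several usage entries and most numeric fields are optional, totals have to tolerate missing values. The sketch below sums tokens and cost over a list of entries, treating absent values as zero. `Usage` is a hypothetical stand-in that carries only two of the documented fields; how the real `LlmUsage` exposes them (getters, record accessors) is not shown in this section.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration; not part of the library API.
final class LlmUsageTotalsSketch {

    // Stand-in for LlmUsage with only the two fields summed below.
    static final class Usage {
        final Optional<Long> totalTokens;
        final Optional<Double> estimatedCost;

        Usage(Optional<Long> totalTokens, Optional<Double> estimatedCost) {
            this.totalTokens = totalTokens;
            this.estimatedCost = estimatedCost;
        }
    }

    // Sums total tokens, counting entries without a value as 0.
    static long sumTokens(List<Usage> entries) {
        long sum = 0L;
        for (Usage usage : entries) {
            sum += usage.totalTokens.orElse(0L);
        }
        return sum;
    }

    // Sums estimated cost in USD, counting entries without a value as 0.0.
    static double sumCost(List<Usage> entries) {
        double sum = 0.0;
        for (Usage usage : entries) {
            sum += usage.estimatedCost.orElse(0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Usage> entries = List.of(
            new Usage(Optional.of(1200L), Optional.of(0.006)),
            new Usage(Optional.empty(), Optional.empty()),
            new Usage(Optional.of(300L), Optional.of(0.0015)));
        System.out.println(sumTokens(entries)); // prints 1500
        System.out.printf("%.4f%n", sumCost(entries));
    }
}
```
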
---

#### Metadata

Extraction result metadata.

Contains common fields applicable to all formats, format-specific metadata
via a discriminated union, and additional custom fields from postprocessors.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `title` | `Optional<String>` | `null` | Document title |
| `subject` | `Optional<String>` | `null` | Document subject or description |
| `authors` | `Optional<List<String>>` | `Collections.emptyList()` | Primary author(s); always a list for consistency |
| `keywords` | `Optional<List<String>>` | `Collections.emptyList()` | Keywords/tags; always a list for consistency |
| `language` | `Optional<String>` | `null` | Primary language (ISO 639 code) |
| `createdAt` | `Optional<String>` | `null` | Creation timestamp (ISO 8601 format) |
| `modifiedAt` | `Optional<String>` | `null` | Last modification timestamp (ISO 8601 format) |
| `createdBy` | `Optional<String>` | `null` | User who created the document |
| `modifiedBy` | `Optional<String>` | `null` | User who last modified the document |
| `pages` | `Optional<PageStructure>` | `null` | Page/slide/sheet structure with boundaries |
| `format` | `Optional<FormatMetadata>` | `null` | Format-specific metadata (discriminated union). Contains detailed metadata specific to the document format. Serialized as a nested `"format"` object with a `format_type` discriminator field. |
| `imagePreprocessing` | `Optional<ImagePreprocessingMetadata>` | `null` | Image preprocessing metadata (when OCR preprocessing was applied) |
| `jsonSchema` | `Optional<Object>` | `null` | JSON schema (for structured data extraction) |
| `error` | `Optional<ErrorMetadata>` | `null` | Error metadata (for batch operations) |
| `extractionDurationMs` | `Optional<Long>` | `null` | Extraction duration in milliseconds (for benchmarking). This field is populated by batch extraction to provide per-file timing information. It is `null` for single-file extraction (which uses external timing). |
| `category` | `Optional<String>` | `null` | Document category (from frontmatter or classification). |
| `tags` | `Optional<List<String>>` | `Collections.emptyList()` | Document tags (from frontmatter). |
| `documentVersion` | `Optional<String>` | `null` | Document version string (from frontmatter). |
| `abstractText` | `Optional<String>` | `null` | Abstract or summary text (from frontmatter). |
| `outputFormat` | `Optional<String>` | `null` | Output format identifier (e.g., "markdown", "html", "text"). Set by the output format pipeline stage when format conversion is applied. Previously stored in `metadata.additional["output_format"]`. |
| `ocrUsed` | `boolean` | — | Whether OCR was used during extraction. Set to `true` whenever the extraction pipeline ran an OCR backend (Tesseract, PaddleOCR, VLM, etc.) and used that output as the primary or fallback text. `false` means native text extraction was used exclusively. |
| `additional` | `Map<String, Object>` | `Collections.emptyMap()` | Additional custom fields from postprocessors. Serialized as a nested `"additional"` object (not flattened at root level). In the Rust core, keys are `Cow<'static, str>` so static string keys avoid allocation. |

### Methods

#### isEmpty()

Returns `true` when no metadata fields, format-specific metadata, or
additional postprocessor fields are populated.

**Signature:**

```java
public boolean isEmpty()
```

---

#### ModelPaths

Combined paths to all models needed for OCR (backward compatibility).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `detModel` | `String` | — | Path to the detection model directory. |
| `clsModel` | `String` | — | Path to the classification model directory. |
| `recModel` | `String` | — | Path to the recognition model directory. |
| `dictFile` | `String` | — | Path to the character dictionary file. |

---

#### OcrBackend

Interface for OCR backend plugins.

Implement this interface to add custom OCR capabilities. OCR backends can be:

- Native Rust implementations (like Tesseract)
- FFI bridges to Python libraries (like EasyOCR, PaddleOCR)
- Cloud-based OCR services (Google Vision, AWS Textract, etc.)

### Thread Safety

OCR backends must be thread-safe (`Send + Sync` in the Rust core) to support concurrent processing.

### Methods

#### processImage()

Process an image and extract text via OCR.

**Returns:**

An `ExtractionResult` containing the extracted text and metadata.

**Errors:**

- `KreuzbergError.Ocr` - OCR processing failed
- `KreuzbergError.Validation` - Invalid image format or configuration
- `KreuzbergError.Io` - I/O errors (these always bubble up)

**Reading `backendOptions`:**

Backends that support runtime tuning can read `config.backendOptions` and
deserialize only the keys they care about. Unknown keys are silently ignored,
so multiple backends can coexist in a pipeline without key conflicts.

**Signature:**

```java
public ExtractionResult processImage(byte[] imageBytes, OcrConfig config) throws Error
```

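
The "read only the keys you care about" convention for backend options can be sketched as follows. The example assumes the options value has already been parsed into a `Map<String, Object>`; the documented field type is `Optional<Object>`, so that conversion step is an assumption. The key names `mode` and `timeout_ms` come from the JSON example in the `OcrConfig` table, while the fallback values and the `BackendOptionsSketch` and `Tuning` names are hypothetical.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the library API.
final class BackendOptionsSketch {

    // The two settings this imaginary backend understands.
    static final class Tuning {
        final String mode;
        final long timeoutMs;

        Tuning(String mode, long timeoutMs) {
            this.mode = mode;
            this.timeoutMs = timeoutMs;
        }
    }

    // Reads known keys with type checks; unknown keys are never touched,
    // and missing or mistyped values fall back to placeholder defaults.
    static Tuning read(Map<String, Object> options) {
        String mode = "accurate"; // placeholder default, chosen for this sketch
        long timeoutMs = 30_000L; // placeholder default, chosen for this sketch
        if (options != null) {
            Object rawMode = options.get("mode");
            if (rawMode instanceof String) {
                mode = (String) rawMode;
            }
            Object rawTimeout = options.get("timeout_ms");
            if (rawTimeout instanceof Number) {
                timeoutMs = ((Number) rawTimeout).longValue();
            }
        }
        return new Tuning(mode, timeoutMs);
    }

    public static void main(String[] args) {
        // "enable_layout" belongs to some other backend and is simply ignored here.
        Map<String, Object> options =
            Map.of("mode", "fast", "enable_layout", true, "timeout_ms", 5000);
        Tuning tuning = read(options);
        System.out.println(tuning.mode + " " + tuning.timeoutMs); // prints fast 5000
    }
}
```
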
#### processImageFile()

Process a file and extract text via OCR.

Default implementation reads the file and calls `processImage()`.
Override for custom file handling or optimizations.

**Errors:**

Same as `processImage()`, plus file I/O errors.

**Signature:**

```java
public ExtractionResult processImageFile(String path, OcrConfig config) throws Error
```

#### supportsLanguage()

Check if this backend supports a given language code.

**Returns:**

`true` if the language is supported, `false` otherwise.

**Signature:**

```java
public boolean supportsLanguage(String lang)
```

#### backendType()

Get the backend type identifier.

**Returns:**

The backend type enum value.

**Signature:**

```java
public OcrBackendType backendType()
```

#### supportedLanguages()

Optional: Get a list of all supported languages.

Defaults to an empty list. Override to provide comprehensive language support info.

**Signature:**

```java
public List<String> supportedLanguages()
```

#### supportsTableDetection()

Optional: Check if the backend supports table detection.

Defaults to `false`. Override if your backend can detect and extract tables.

**Signature:**

```java
public boolean supportsTableDetection()
```

#### supportsDocumentProcessing()

Check if the backend supports direct document-level processing (e.g. for PDFs).

Defaults to `false`. Override if the backend has optimized document processing.

**Signature:**

```java
public boolean supportsDocumentProcessing()
```

#### processDocument()

Process a document file directly via OCR.

Only called if `supportsDocumentProcessing()` returns `true`.

**Signature:**

```java
public ExtractionResult processDocument(String path, OcrConfig config) throws Error
```

---

#### OcrConfidence

Confidence scores for an OCR element.

Separates detection confidence (how confident that text exists at this location)
from recognition confidence (how confident about the actual text content).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `detection` | `Optional<Double>` | `null` | Detection confidence: how confident the OCR engine is that text exists here. PaddleOCR provides this as `box_score`; Tesseract has no direct equivalent. Range: 0.0 to 1.0 (`null` if not available). |
| `recognition` | `double` | — | Recognition confidence: how confident about the text content. Range: 0.0 to 1.0. |

---

#### OcrConfig

OCR configuration.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | `boolean` | `true` | Whether OCR is enabled. Setting `enabled: false` is a shorthand for `disable_ocr: true` on the parent `ExtractionConfig`. Images return metadata only; PDFs use native text extraction without OCR fallback. Defaults to `true`. When `false`, all other OCR settings are ignored. |
| `backend` | `String` | — | OCR backend name, e.g. `"tesseract"`, `"easyocr"`, `"paddleocr"`, or `"vlm"` |
| `language` | `String` | — | Language code (e.g., "eng", "deu") |
| `tesseractConfig` | `Optional<TesseractConfig>` | `null` | Tesseract-specific configuration (optional) |
| `outputFormat` | `Optional<OutputFormat>` | `null` | Output format for OCR results (optional, for format conversion) |
| `paddleOcrConfig` | `Optional<Object>` | `null` | PaddleOCR-specific configuration (optional, JSON passthrough) |
| `backendOptions` | `Optional<Object>` | `null` | Arbitrary per-call options passed through to the backend unchanged. Custom OCR backends and built-in backends that support runtime tuning can read this value and deserialize the keys they care about. Keys unknown to the backend are silently ignored. This is the recommended extension point for per-call parameters that are not covered by the typed fields above (e.g. mode switching, preprocessing flags, inference batch size). **Scope:** when `pipeline` is `null`, this value is propagated to the primary stage of the auto-constructed pipeline. When `pipeline` is explicitly set, this field has **no effect**; the caller must set `OcrPipelineStage.backendOptions` directly on the relevant stage(s) instead. Example: `{ "mode": "fast", "enable_layout": true, "timeout_ms": 5000 }` |
| `elementConfig` | `Optional<OcrElementConfig>` | `null` | OCR element extraction configuration |
| `qualityThresholds` | `Optional<OcrQualityThresholds>` | `null` | Quality thresholds for the native-text-to-OCR fallback decision. When `null`, uses compiled defaults (matching previous hardcoded behavior). |
| `pipeline` | `Optional<OcrPipelineConfig>` | `null` | Multi-backend OCR pipeline configuration. When set, enables weighted fallback across multiple OCR backends based on output quality. When `null`, uses the single `backend` field. |
| `autoRotate` | `boolean` | `false` | Enable automatic page rotation based on orientation detection. When enabled, uses Tesseract's `DetectOrientationScript()` to detect page orientation (0/90/180/270 degrees) before OCR. If the page is rotated with high confidence, the image is corrected before recognition. This is critical for handling rotated scanned documents. |
| `vlmConfig` | `Optional<LlmConfig>` | `null` | VLM (Vision Language Model) OCR configuration. Required when `backend` is `"vlm"`. Uses liter-llm to send page images to a vision model for text extraction. |
| `vlmPrompt` | `Optional<String>` | `null` | Custom Jinja2 prompt template for VLM OCR. When `null`, uses the default template. Available variables: `{{ language }}` (the document language code, e.g., "eng", "deu"). |
| `acceleration` | `Optional<AccelerationConfig>` | `null` | Hardware acceleration for ONNX Runtime models (e.g. PaddleOCR, layout detection). Not user-configurable via config files; injected at runtime from `ExtractionConfig.acceleration` before each `processImage()` call. |
| `tessdataBytes` | `Optional<Map<String, byte[]>>` | `null` | Caller-supplied Tesseract `traineddata` bytes per language code. Primary use case is the WASM build, which has no filesystem and cannot download tessdata at runtime. Native builds typically rely on `TessdataManager` and ignore this field. When present, the WASM Tesseract backend prefers these bytes over its compile-time-bundled English data. Excluded from serialization to keep config files small; supply via the typed API at runtime. |

### Methods

#### defaultOptions()

**Signature:**

```java
public static OcrConfig defaultOptions()
```

---

#### OcrElement

A unified OCR element representing detected text with full metadata.

This is the primary type for structured OCR output, preserving all information
from both Tesseract and PaddleOCR backends.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `text` | `String` | — | The recognized text content. |
| `geometry` | `OcrBoundingGeometry` | `OcrBoundingGeometry.RECTANGLE` | Bounding geometry (rectangle or quadrilateral). |
| `confidence` | `OcrConfidence` | — | Confidence scores for detection and recognition. |
| `level` | `OcrElementLevel` | `OcrElementLevel.LINE` | Hierarchical level (word, line, block, page). |
| `rotation` | `Optional<OcrRotation>` | `null` | Rotation information (if detected). |
| `pageNumber` | `int` | — | Page number (1-indexed). |
| `parentId` | `Optional<String>` | `null` | Parent element ID for hierarchical relationships. Only used for Tesseract output which has word -> line -> block hierarchy. |
| `backendMetadata` | `Map<String, Object>` | `Collections.emptyMap()` | Backend-specific metadata that doesn't fit the unified schema. |

---

#### OcrElementConfig

Configuration for OCR element extraction.

Controls how OCR elements are extracted and filtered.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `includeElements` | `boolean` | — | Whether to include OCR elements in the extraction result. When true, the `ocr_elements` field in `ExtractionResult` will be populated. |
| `minLevel` | `OcrElementLevel` | `OcrElementLevel.LINE` | Minimum hierarchical level to include. Elements below this level (e.g., words when `minLevel` is `LINE`) will be excluded. |
| `minConfidence` | `double` | — | Minimum recognition confidence threshold (0.0-1.0). Elements with confidence below this threshold will be filtered out. |
| `buildHierarchy` | `boolean` | — | Whether to build hierarchical relationships between elements. When true, `parentId` fields will be populated based on spatial containment. Only meaningful for Tesseract output. |

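
The combined effect of `minLevel` and `minConfidence` can be shown with a small standalone filter. The sketch assumes the level order word < line < block < page (taken from the `OcrElement.level` description) and compares against recognition confidence, as the table states. `Level` and `Element` are hypothetical stand-ins for `OcrElementLevel` and `OcrElement`, whose accessors are not shown in this section.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the library API.
final class OcrElementFilterSketch {

    // Declared from lowest to highest level so ordinal() reflects the hierarchy.
    enum Level { WORD, LINE, BLOCK, PAGE }

    // Stand-in for OcrElement with only the fields the filter needs.
    static final class Element {
        final String text;
        final Level level;
        final double recognition;

        Element(String text, Level level, double recognition) {
            this.text = text;
            this.level = level;
            this.recognition = recognition;
        }
    }

    // Keeps elements at or above minLevel whose recognition confidence
    // is at least minConfidence, and returns their text in input order.
    static List<String> filter(List<Element> elements, Level minLevel, double minConfidence) {
        List<String> kept = new ArrayList<>();
        for (Element element : elements) {
            boolean levelOk = element.level.ordinal() >= minLevel.ordinal();
            boolean confidenceOk = element.recognition >= minConfidence;
            if (levelOk && confidenceOk) {
                kept.add(element.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Element> elements = List.of(
            new Element("Hello", Level.WORD, 0.99),        // dropped: below LINE
            new Element("Hello world", Level.LINE, 0.95),
            new Element("blurry line", Level.LINE, 0.40),  // dropped: low confidence
            new Element("Paragraph", Level.BLOCK, 0.90));
        System.out.println(filter(elements, Level.LINE, 0.5)); // prints [Hello world, Paragraph]
    }
}
```
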
---

#### OcrExtractionResult

OCR extraction result.

Result of performing OCR on an image or scanned document,
including recognized text and detected tables.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Recognized text content |
| `mimeType` | `String` | — | Original MIME type of the processed image |
| `metadata` | `Map<String, Object>` | — | OCR processing metadata (confidence scores, language, etc.) |
| `tables` | `List<OcrTable>` | — | Tables detected and extracted via OCR |
| `ocrElements` | `Optional<List<OcrElement>>` | `/* serde(default) */` | Structured OCR elements with bounding boxes and confidence scores. Available when TSV output is requested or table detection is enabled. |
| `internalDocument` | `Optional<String>` | `null` | Structured document produced from hOCR parsing. Carries paragraph structure, bounding boxes, and confidence scores that the flattened `content` string discards. |

---

#### OcrMetadata

OCR processing metadata.

Captures information about OCR processing configuration and results.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `language` | `String` | — | OCR language code(s) used |
| `psm` | `int` | — | Tesseract Page Segmentation Mode (PSM) |
| `outputFormat` | `String` | — | Output format (e.g., "text", "hocr") |
| `tableCount` | `int` | — | Number of tables detected |
| `tableRows` | `Optional<Integer>` | `null` | Number of table rows (if available) |
| `tableCols` | `Optional<Integer>` | `null` | Number of table columns (if available) |

---

#### OcrPipelineConfig

Multi-backend OCR pipeline with quality-based fallback.

Backends are tried in priority order (highest first). After each backend
produces output, quality is evaluated. If it meets `qualityThresholds.pipelineMinQuality`,
the result is accepted. Otherwise the next backend is tried.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `stages` | `List<OcrPipelineStage>` | — | Ordered list of backends to try. Sorted by priority (descending) at runtime. |
| `qualityThresholds` | `OcrQualityThresholds` | `/* serde(default) */` | Quality thresholds for deciding whether to accept a result or try the next backend. |

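
The fallback loop described above can be sketched in a few lines: sort stages by descending priority, run each backend, and accept the first result whose quality reaches `pipelineMinQuality`. In this sketch the backend call returns its own quality score and the method returns an empty `Optional` when no stage qualifies; both are simplifying assumptions, since this section does not say how quality is computed or what happens when every stage falls short. `Stage`, `Attempt`, and `run` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of the library API.
final class OcrPipelineSketch {

    // Stand-in for OcrPipelineStage with only backend name and priority.
    static final class Stage {
        final String backend;
        final int priority;

        Stage(String backend, int priority) {
            this.backend = backend;
            this.priority = priority;
        }
    }

    // Output of one backend run plus an already-computed quality score.
    static final class Attempt {
        final String text;
        final double quality;

        Attempt(String text, double quality) {
            this.text = text;
            this.quality = quality;
        }
    }

    // Tries stages from highest to lowest priority and returns the first
    // attempt whose quality is at least pipelineMinQuality.
    static Optional<Attempt> run(List<Stage> stages,
                                 Function<String, Attempt> runBackend,
                                 double pipelineMinQuality) {
        List<Stage> ordered = new ArrayList<>(stages);
        ordered.sort(Comparator.comparingInt((Stage s) -> s.priority).reversed());
        for (Stage stage : ordered) {
            Attempt attempt = runBackend.apply(stage.backend);
            if (attempt.quality >= pipelineMinQuality) {
                return Optional.of(attempt);
            }
        }
        return Optional.empty(); // assumption: no stage met the threshold
    }

    public static void main(String[] args) {
        List<Stage> stages = List.of(new Stage("tesseract", 1), new Stage("paddleocr", 10));
        Optional<Attempt> accepted = run(
            stages,
            backend -> backend.equals("paddleocr")
                ? new Attempt("g@rbl3d", 0.3)      // tried first, rejected
                : new Attempt("clean text", 0.8),  // tried second, accepted
            0.5);
        System.out.println(accepted.map(a -> a.text).orElse("<none>")); // prints clean text
    }
}
```
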
---

#### OcrPipelineStage

A single backend stage in the OCR pipeline.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `backend` | `String` | — | Backend name: "tesseract", "paddleocr", "easyocr", or a custom registered name. |
| `priority` | `int` | `/* serde(default) */` | Priority weight (higher = tried first). Stages are sorted by priority descending. |
| `language` | `Optional<String>` | `/* serde(default) */` | Language override for this stage (`null` = use the parent `OcrConfig.language`). |
| `tesseractConfig` | `Optional<TesseractConfig>` | `/* serde(default) */` | Tesseract-specific config override for this stage. |
| `paddleOcrConfig` | `Optional<Object>` | `/* serde(default) */` | PaddleOCR-specific config for this stage. |
| `vlmConfig` | `Optional<LlmConfig>` | `/* serde(default) */` | VLM config override for this pipeline stage. |
| `backendOptions` | `Optional<Object>` | `/* serde(default) */` | Arbitrary per-call options passed through to the backend unchanged. Backends that support runtime tuning (mode switching, preprocessing flags, inference parameters, etc.) read this value and deserialize the keys they care about. Keys unknown to the backend are silently ignored, so options from different backends can coexist in the same config without conflict. Example (custom backend): `{ "mode": "fast", "enable_layout": true }` |

---

#### OcrQualityThresholds

Quality thresholds for OCR fallback decisions and pipeline quality gating.

All fields default to the values that match the previous hardcoded behavior,
so `OcrQualityThresholds.defaultOptions()` preserves existing semantics exactly.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `minTotalNonWhitespace` | `long` | `64` | Minimum total non-whitespace characters to consider text substantive. |
| `minNonWhitespacePerPage` | `double` | `32` | Minimum average number of non-whitespace characters per page. |
| `minMeaningfulWordLen` | `long` | `4` | Minimum character count for a word to be "meaningful". |
| `minMeaningfulWords` | `long` | `3` | Minimum count of meaningful words before text is accepted. |
| `minAlnumRatio` | `double` | `0.3` | Minimum alphanumeric ratio (fraction of non-whitespace characters that are alphanumeric). |
| `minGarbageChars` | `long` | `5` | Minimum number of Unicode replacement characters (U+FFFD) to trigger OCR fallback. |
| `maxFragmentedWordRatio` | `double` | `0.6` | Maximum fraction of short (1-2 char) words before text is considered fragmented. |
| `criticalFragmentedWordRatio` | `double` | `0.8` | Critical fragmentation threshold: triggers OCR regardless of meaningful words. Normal English text has ~20-30% short words; 80% or more is definitive garbage. |
| `minAvgWordLength` | `double` | `2` | Minimum average word length. An average below this, given enough words (see `minWordsForAvgLengthCheck`), indicates garbled extraction. |
| `minWordsForAvgLengthCheck` | `long` | `50` | Minimum word count before the average word length check applies. |
| `minConsecutiveRepeatRatio` | `double` | `0.08` | Minimum consecutive word repetition ratio to detect column scrambling. |
| `minWordsForRepeatCheck` | `long` | `50` | Minimum word count before the consecutive repetition check applies. |
| `substantiveMinChars` | `long` | `100` | Minimum character count for the "substantive markdown" OCR skip gate. |
| `nonTextMinChars` | `long` | `20` | Minimum character count for the "non-text content" OCR skip gate. |
| `alnumWsRatioThreshold` | `double` | `0.4` | Alphanumeric+whitespace ratio threshold for skip decisions. |
| `pipelineMinQuality` | `double` | `0.5` | Minimum quality score (0.0-1.0) for a pipeline stage result to be accepted. If a backend's result scores below this, the next backend is tried. |

### Methods

#### defaultOptions()

**Signature:**

```java
public static OcrQualityThresholds defaultOptions()
```

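### Example - Ratio Checks

To make the ratio-based fields more concrete, here is a minimal sketch of how two of the metrics described in the table could be computed and compared with the documented defaults. `TextQualitySketch` and its methods are hypothetical helpers written for this illustration; they are not part of the library, and the way the two checks are combined in `looksGarbled` is an assumption rather than the actual fallback logic.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the library: illustrates two of the ratios
// that the thresholds in the table are compared against.
final class TextQualitySketch {
    // Defaults copied from the OcrQualityThresholds table.
    static final double MIN_ALNUM_RATIO = 0.3;
    static final double MAX_FRAGMENTED_WORD_RATIO = 0.6;

    // Fraction of non-whitespace characters that are letters or digits.
    static double alnumRatio(String text) {
        long nonWhitespace = text.chars().filter(c -> !Character.isWhitespace(c)).count();
        if (nonWhitespace == 0) {
            return 0.0;
        }
        long alnum = text.chars().filter(Character::isLetterOrDigit).count();
        return (double) alnum / nonWhitespace;
    }

    // Fraction of whitespace-separated words that are only 1-2 characters long.
    static double fragmentedWordRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] words = trimmed.split("\\s+");
        long shortWords = Arrays.stream(words).filter(w -> w.length() <= 2).count();
        return (double) shortWords / words.length;
    }

    // Assumed combination for illustration: failing either check marks the text as suspect.
    static boolean looksGarbled(String text) {
        return alnumRatio(text) < MIN_ALNUM_RATIO
                || fragmentedWordRatio(text) > MAX_FRAGMENTED_WORD_RATIO;
    }
}
```
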
---

#### OcrRotation

Rotation information for an OCR element.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `angleDegrees` | `double` | — | Rotation angle in degrees (0, 90, 180, 270 for PaddleOCR). |
| `confidence` | `Optional<Double>` | `null` | Confidence score for the rotation detection. |

---

#### OcrTable

Table detected via OCR.

Represents a table structure recognized during OCR processing.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `cells` | `List<List<String>>` | — | Table cells as a 2D list (rows × columns) |
| `markdown` | `String` | — | Markdown representation of the table |
| `pageNumber` | `int` | — | Page number where the table was found (1-indexed) |
| `boundingBox` | `Optional<OcrTableBoundingBox>` | `/* serde(default) */` | Bounding box of the table in pixel coordinates (from OCR word positions). |

---

#### OcrTableBoundingBox

Bounding box for an OCR-detected table in pixel coordinates.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `left` | `int` | — | Left x-coordinate (pixels) |
| `top` | `int` | — | Top y-coordinate (pixels) |
| `right` | `int` | — | Right x-coordinate (pixels) |
| `bottom` | `int` | — | Bottom y-coordinate (pixels) |

---

#### OrientationResult

Document orientation detection result.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `degrees` | `int` | — | Detected orientation in degrees (0, 90, 180, or 270). |
| `confidence` | `float` | — | Confidence score (0.0-1.0). |

---

#### PaddleOcrConfig

Configuration for PaddleOCR backend.

Configures PaddleOCR text detection and recognition with multi-language support.
Uses a builder pattern for convenient configuration.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `language` | `String` | — | Language code (e.g., "en", "ch", "jpn", "kor", "deu", "fra") |
| `cacheDir` | `Optional<String>` | `null` | Optional custom cache directory for model files |
| `useAngleCls` | `boolean` | — | Enable angle classification for rotated text (default: false). Can misfire on short text regions, rotating crops incorrectly before recognition. |
| `enableTableDetection` | `boolean` | — | Enable table structure detection (default: false) |
| `detDbThresh` | `float` | — | Binarization threshold for DB (Differentiable Binarization) text detection (default: 0.3). Range: 0.0-1.0; higher values require more confident detections. |
| `detDbBoxThresh` | `float` | — | Box threshold for text bounding box refinement (default: 0.5). Range: 0.0-1.0. |
| `detDbUnclipRatio` | `float` | — | Unclip ratio for expanding text bounding boxes (default: 1.6). Controls the expansion of detected text regions. |
| `detLimitSideLen` | `int` | — | Maximum side length for the detection image (default: 960). Larger images may be resized to this limit for faster inference. |
| `recBatchNum` | `int` | — | Batch size for recognition inference (default: 6). Number of text regions to process simultaneously. |
| `padding` | `int` | — | Padding in pixels added around the image before detection (default: 10). Large values can include surrounding content like table gridlines. |
| `dropScore` | `float` | — | Minimum recognition confidence score for text lines (default: 0.5). Text regions with recognition confidence below this threshold are discarded. Matches PaddleOCR Python's `drop_score` parameter. Range: 0.0-1.0. |
| `modelTier` | `String` | — | Model tier controlling the detection/recognition model size and accuracy trade-off. `"mobile"` (default): lightweight models (~4.5MB detection, ~16.5MB recognition), fast download and inference. `"server"`: large, high-accuracy models (~88MB detection, ~84MB recognition), best for GPU or complex documents. |

### Methods

#### withCacheDir()

Sets a custom cache directory for model files.

**Signature:**

```java
public PaddleOcrConfig withCacheDir(String path)
```

#### withTableDetection()

Enables or disables table structure detection.

**Signature:**

```java
public PaddleOcrConfig withTableDetection(boolean enable)
```

#### withAngleCls()

Enables or disables angle classification for rotated text.

**Signature:**

```java
public PaddleOcrConfig withAngleCls(boolean enable)
```

#### withDetDbThresh()

Sets the DB (Differentiable Binarization) threshold for text detection.

**Signature:**

```java
public PaddleOcrConfig withDetDbThresh(float threshold)
```

#### withDetDbBoxThresh()

Sets the box threshold for text bounding box refinement.

**Signature:**

```java
public PaddleOcrConfig withDetDbBoxThresh(float threshold)
```

#### withDetDbUnclipRatio()

Sets the unclip ratio for expanding text bounding boxes.

**Signature:**

```java
public PaddleOcrConfig withDetDbUnclipRatio(float ratio)
```

#### withDetLimitSideLen()

Sets the maximum side length for detection images.

**Signature:**

```java
public PaddleOcrConfig withDetLimitSideLen(int length)
```

#### withRecBatchNum()

Sets the batch size for recognition inference.

**Signature:**

```java
public PaddleOcrConfig withRecBatchNum(int batchSize)
```

#### withDropScore()

Sets the minimum recognition confidence threshold.

**Signature:**

```java
public PaddleOcrConfig withDropScore(float score)
```

#### withPadding()

Sets padding in pixels added around images before detection.

**Signature:**

```java
public PaddleOcrConfig withPadding(int padding)
```

#### withModelTier()

Sets the model tier controlling detection/recognition model size.

**Signature:**

```java
public PaddleOcrConfig withModelTier(String tier)
```

#### defaultOptions()

Creates a default configuration with English language support.

**Signature:**

```java
public static PaddleOcrConfig defaultOptions()
```

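### Example - Drop Score Filtering

The `dropScore` rule (lines whose recognition confidence falls below the threshold are discarded) can be pictured with the sketch below. `RecognizedLineSketch` and `DropScoreSketch` are hypothetical stand-ins written for this illustration and are not library types; the backend applies the threshold internally, so this only shows the effect of the setting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a recognized text line; not a library type.
final class RecognizedLineSketch {
    final String text;
    final float confidence;

    RecognizedLineSketch(String text, float confidence) {
        this.text = text;
        this.confidence = confidence;
    }
}

// Hypothetical helper showing the effect of a drop-score threshold.
final class DropScoreSketch {
    // Keeps lines whose confidence is at least dropScore; lower-scoring lines are discarded.
    static List<RecognizedLineSketch> applyDropScore(List<RecognizedLineSketch> lines, float dropScore) {
        List<RecognizedLineSketch> kept = new ArrayList<>();
        for (RecognizedLineSketch line : lines) {
            if (line.confidence >= dropScore) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```
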
---

#### PageBoundary

Byte offset boundary for a page.

Tracks where a specific page's content starts and ends in the main content string,
enabling mapping from byte positions to page numbers. Offsets are guaranteed to fall on
valid UTF-8 character boundaries when the content is built with standard Rust `String` methods (`push_str`, `push`, etc.).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `byteStart` | `long` | — | Byte offset where this page starts in the content string (UTF-8 valid boundary, inclusive) |
| `byteEnd` | `long` | — | Byte offset where this page ends in the content string (UTF-8 valid boundary, exclusive) |
| `pageNumber` | `int` | — | Page number (1-indexed) |

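### Example - Slicing a Page by Byte Offsets

Java strings are indexed by UTF-16 code units, so UTF-8 byte offsets cannot be passed to `String.substring` directly once the content contains non-ASCII characters. The sketch below shows one way to recover a page's text, assuming the offsets refer to the UTF-8 encoding of the content string as described in the table. `PageSliceSketch` is a hypothetical helper written for this illustration, not part of the library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the library.
final class PageSliceSketch {
    // Returns the text between two UTF-8 byte offsets (start inclusive, end exclusive).
    static String slice(String content, long byteStart, long byteEnd) {
        byte[] utf8 = content.getBytes(StandardCharsets.UTF_8);
        byte[] page = Arrays.copyOfRange(
                utf8, Math.toIntExact(byteStart), Math.toIntExact(byteEnd));
        return new String(page, StandardCharsets.UTF_8);
    }
}
```
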
---

#### PageConfig

Page extraction and tracking configuration.

Controls how pages are extracted, tracked, and represented in the extraction results.
When `null`, page tracking is disabled.

Page range tracking in chunk metadata (`first_page`/`last_page`) is automatically enabled
when page boundaries are available and chunking is configured.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `extractPages` | `boolean` | `false` | Extract pages as a separate array (`ExtractionResult.pages`) |
| `insertPageMarkers` | `boolean` | `false` | Insert page markers in the main content string |
| `markerFormat` | `String` | `"<!-- PAGE {page_num} -->"` | Page marker format (use the `{page_num}` placeholder). Default: `"\n\n<!-- PAGE {page_num} -->\n\n"` |

### Methods

#### defaultOptions()

**Signature:**

```java
public static PageConfig defaultOptions()
```

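### Example - Rendering a Page Marker

The `{page_num}` placeholder in `markerFormat` stands for the page number. A minimal sketch of that substitution is shown below; `PageMarkerSketch` is a hypothetical helper for illustration only, and the library performs the substitution itself when `insertPageMarkers` is enabled.

```java
// Hypothetical helper, not part of the library.
final class PageMarkerSketch {
    // Default format as given in the markerFormat description.
    static final String DEFAULT_MARKER_FORMAT = "\n\n<!-- PAGE {page_num} -->\n\n";

    // Replaces the {page_num} placeholder with the given page number.
    static String renderMarker(String markerFormat, int pageNumber) {
        return markerFormat.replace("{page_num}", Integer.toString(pageNumber));
    }
}
```
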
---

#### PageContent

Content for a single page/slide.

When page extraction is enabled, documents are split into per-page content
with associated tables and images mapped to each page.

### Performance

The Rust core uses Arc-wrapped tables and images for memory efficiency:

- `Vec<Arc<Table>>` enables zero-copy sharing of table data
- `Vec<Arc<ExtractedImage>>` enables zero-copy sharing of image data
- Maintains exact JSON compatibility via custom Serialize/Deserialize

This reduces memory overhead for documents with shared tables/images
by avoiding redundant copies during serialization.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pageNumber` | `int` | — | Page number (1-indexed) |
| `content` | `String` | — | Text content for this page |
| `tables` | `List<Table>` | `/* serde(default) */` | Tables found on this page (uses Arc for memory efficiency). Serializes as `Vec<Table>` for JSON compatibility while maintaining Arc semantics in-memory for zero-copy sharing. |
| `imageIndices` | `List<Integer>` | `/* serde(default) */` | Indices into `ExtractionResult.images` for images found on this page. Each value is a zero-based index into the top-level `images` collection. Only populated when `extract_images = true` in the extraction config. |
| `hierarchy` | `Optional<PageHierarchy>` | `null` | Hierarchy information for the page (when hierarchy extraction is enabled). Contains text hierarchy levels (H1-H6) extracted from the page content. |
| `isBlank` | `Optional<Boolean>` | `null` | Whether this page is blank (no meaningful text content). Determined during extraction based on text content analysis. A page is blank if it has fewer than 3 non-whitespace characters and contains no tables or images. |
| `layoutRegions` | `Optional<List<LayoutRegion>>` | `null` | Layout detection regions for this page (when layout detection is enabled). Contains detected layout regions with class, confidence, bounding box, and area fraction. Only populated when layout detection is configured. |
| `speakerNotes` | `Optional<String>` | `null` | Speaker notes for this slide (PPTX only). Contains the text from the slide's notes pane (`ppt/notesSlides/notesSlide{N}.xml`). Only populated when the source is a PPTX file and notes are present. |
| `sectionName` | `Optional<String>` | `null` | Section name this slide belongs to (PPTX only). PowerPoint sections group slides into logical chapters (`<p:sectionLst>` in `ppt/presentation.xml`). Only populated when the source is a PPTX file and the slide belongs to a named section. |
| `sheetName` | `Optional<String>` | `null` | Sheet name for this page (XLSX/ODS only). Each spreadsheet sheet maps to one `PageContent` entry. This field carries the sheet's display name as it appears in the workbook. `null` for all non-spreadsheet formats and for sheets with an empty name. |

---

#### PageHierarchy

Page hierarchy structure containing heading levels and block information.

Used when PDF text hierarchy extraction is enabled. Contains hierarchical
blocks with heading levels (H1-H6) for semantic document structure.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `blockCount` | `int` | — | Number of hierarchy blocks on this page |
| `blocks` | `List<HierarchicalBlock>` | `/* serde(default) */` | Hierarchical blocks with heading levels |

---

#### PageInfo

Metadata for an individual page/slide/sheet.

Captures per-page information including dimensions, content counts,
and visibility state (for presentations).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `number` | `int` | — | Page number (1-indexed) |
| `title` | `Optional<String>` | `null` | Page title (usually for presentations) |
| `dimensions` | `Optional<List<Double>>` | `null` | Dimensions in points (PDF) or pixels (images): (width, height) |
| `imageCount` | `Optional<Integer>` | `null` | Number of images on this page |
| `tableCount` | `Optional<Integer>` | `null` | Number of tables on this page |
| `hidden` | `Optional<Boolean>` | `null` | Whether this page is hidden (e.g., in presentations) |
| `isBlank` | `Optional<Boolean>` | `null` | Whether this page is blank (no meaningful text, no images, no tables). A page is considered blank if it has fewer than 3 non-whitespace characters and contains no tables or images. This is useful for filtering out empty pages in scanned documents or PDFs with blank separator pages. |
| `hasVectorGraphics` | `boolean` | `/* serde(default) */` | Whether this page contains non-trivial vector graphics (paths, shapes, curves). Indicates the presence of vector-drawn content such as charts, diagrams, or geometric shapes (e.g., from Adobe InDesign, LaTeX TikZ). These are invisible to `ExtractionResult.images` since they are not embedded as raster XObjects. Set to `true` when the path count exceeds a heuristic threshold, signaling that downstream consumers may want to rasterize the page to capture this content. Only populated for PDFs; left at its default for other document types. |

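### Example - Blank Page Rule

The `isBlank` rule stated in the table (fewer than 3 non-whitespace characters and no tables or images) can be expressed as the sketch below. `BlankPageSketch` is a hypothetical helper written for this illustration; the library computes the flag during extraction, and its exact whitespace definition may differ from `Character.isWhitespace`.

```java
// Hypothetical helper, not part of the library.
final class BlankPageSketch {
    // A page counts as blank when it has fewer than 3 non-whitespace characters
    // and carries no tables and no images.
    static boolean isBlank(String text, int tableCount, int imageCount) {
        long nonWhitespace = text.chars().filter(c -> !Character.isWhitespace(c)).count();
        return nonWhitespace < 3 && tableCount == 0 && imageCount == 0;
    }
}
```
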
---

#### PageStructure

Unified page structure for documents.

Supports different page types (PDF pages, PPTX slides, Excel sheets)
with byte offset boundaries for chunk-to-page mapping.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `totalCount` | `int` | — | Total number of pages/slides/sheets |
| `unitType` | `PageUnitType` | — | Type of paginated unit |
| `boundaries` | `Optional<List<PageBoundary>>` | `null` | Byte offset boundaries for each page. Maps byte ranges in the extracted content to page numbers. Used for chunk page range calculation. |
| `pages` | `Optional<List<PageInfo>>` | `null` | Detailed per-page metadata (optional, only when needed) |

---

#### PdfAnnotation

A PDF annotation extracted from a document page.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `annotationType` | `PdfAnnotationType` | — | The type of annotation. |
| `content` | `Optional<String>` | `null` | Text content of the annotation (e.g., comment text, link URL). |
| `pageNumber` | `int` | — | Page number where the annotation appears (1-indexed). |
| `boundingBox` | `Optional<BoundingBox>` | `null` | Bounding box of the annotation on the page. |

---

#### PdfConfig

PDF-specific configuration.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `extractImages` | `boolean` | `false` | Extract images from PDF |
| `extractTables` | `boolean` | `true` | Extract tables from PDF. When `true` (default), runs pdf_oxide's native grid detector and, if it finds nothing, falls back to the heuristic text-layer reconstruction in `pdf.oxide.table.extract_tables_heuristic`. Set to `false` to skip both passes; `tables` will then be empty in the result. |
| `passwords` | `Optional<List<String>>` | `null` | List of passwords to try when opening encrypted PDFs |
| `extractMetadata` | `boolean` | `true` | Extract PDF metadata |
| `hierarchy` | `Optional<HierarchyConfig>` | `null` | Hierarchy extraction configuration (`null` disables hierarchy extraction) |
| `extractAnnotations` | `boolean` | `false` | Extract PDF annotations (text notes, highlights, links, stamps). Default: false |
| `topMarginFraction` | `Optional<Float>` | `null` | Top margin fraction (0.0–1.0) of page height to exclude headers/running heads. Default: 0.06 (6%) |
| `bottomMarginFraction` | `Optional<Float>` | `null` | Bottom margin fraction (0.0–1.0) of page height to exclude footers/page numbers. Default: 0.05 (5%) |
| `allowSingleColumnTables` | `boolean` | `false` | Allow single-column pseudo tables in extraction results. By default, tables with fewer than 2 columns (layout-guided) or 3 columns (heuristic) are rejected. When `true`, the minimum column count is relaxed to 1, allowing single-column structured data (glossaries, itemized lists) to be emitted as tables. Other quality filters (density, sparsity, prose detection) still apply. |
| `ocrInlineImages` | `boolean` | `false` | Perform OCR on inline images extracted from PDF pages and attach the recognized text to each `ExtractedImage.ocr_result`. Requires Tesseract to be available; if `ExtractionConfig.ocr` is `null` the extractor falls back to the default `TesseractConfig`. Per-image failures degrade gracefully (the image is returned without OCR text rather than failing the whole extraction). Default: `false`. |

### Methods

#### defaultOptions()

**Signature:**

```java
public static PdfConfig defaultOptions()
```

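### Example - Margin Bands

As a rough illustration of `topMarginFraction` and `bottomMarginFraction`, the sketch below converts the fractions into excluded bands for a given page height. With the documented defaults and a 792 pt page, the top band is about 47.5 pt and the bottom band about 39.6 pt. `MarginBandSketch` is a hypothetical helper, not part of the library, and it assumes `y` is measured in points from the top edge of the page; the extractor's own coordinate system may differ.

```java
// Hypothetical helper, not part of the library.
final class MarginBandSketch {
    // Returns true when a vertical position lies outside both excluded bands.
    // Assumption: y is measured in points from the top edge of the page.
    static boolean isInBody(double y, double pageHeight,
                            double topMarginFraction, double bottomMarginFraction) {
        double topLimit = pageHeight * topMarginFraction;
        double bottomLimit = pageHeight * (1.0 - bottomMarginFraction);
        return y >= topLimit && y <= bottomLimit;
    }
}
```
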
---

#### PdfMetadata

PDF-specific metadata.

Contains metadata fields specific to PDF documents that are not in the common
`Metadata` structure. Common fields like title, authors, keywords, and dates
are at the `Metadata` level.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pdfVersion` | `Optional<String>` | `null` | PDF version (e.g., "1.7", "2.0") |
| `producer` | `Optional<String>` | `null` | PDF producer (application that created the PDF) |
| `isEncrypted` | `Optional<Boolean>` | `null` | Whether the PDF is encrypted/password-protected |
| `width` | `Optional<Long>` | `null` | First page width in points (1/72 inch) |
| `height` | `Optional<Long>` | `null` | First page height in points (1/72 inch) |
| `pageCount` | `Optional<Integer>` | `null` | Total number of pages in the PDF document |

---

#### Plugin

Base trait that all plugins must implement.

This trait provides common functionality for plugin lifecycle management,
identification, and metadata.

### Thread Safety

All plugins must be thread-safe (`Send + Sync` in the Rust core) to support concurrent usage across threads.

### Methods

#### name()

Returns the unique name/identifier for this plugin.

The name should be:

- Unique across all plugins
- Lowercase with hyphens (e.g., "my-custom-plugin")
- URL-safe characters only

**Signature:**

```java
public String name()
```

#### version()

Returns the semantic version of this plugin.

Should follow semver format: `MAJOR.MINOR.PATCH`

Defaults to the kreuzberg crate version.

**Signature:**

```java
public String version()
```

#### initialize()

Initialize the plugin.

Called once when the plugin is registered. Use this to:

- Load configuration
- Initialize resources (connections, caches, etc.)
- Validate dependencies

### Thread Safety

In the Rust core, this method takes `&self` instead of `&mut self` to work with `Arc<dyn Plugin>`.
Plugins needing mutable state during initialization should use interior mutability
patterns (Mutex, RwLock, OnceCell, etc. in Rust; locks or atomics in Java).

**Errors:**

Should throw an error if initialization fails. The plugin will not be
registered if this method throws.

Defaults to a no-op for stateless plugins.

**Signature:**

```java
public void initialize() throws Error
```

#### shutdown()

Shutdown the plugin.

Called when the plugin is being unregistered or the application is shutting down.
Use this to:

- Close connections
- Flush caches
- Release resources

### Thread Safety

In the Rust core, this method takes `&self` instead of `&mut self` to work with `Arc<dyn Plugin>`.
Plugins needing mutable state during shutdown should use interior mutability
patterns (Mutex, RwLock, etc. in Rust; locks or atomics in Java).

**Errors:**

Errors during shutdown are logged but don't prevent the shutdown process.

Defaults to a no-op for stateless plugins.

**Signature:**

```java
public void shutdown() throws Error
```

#### description()

Optional plugin description for debugging and logging.

Defaults to an empty string if not overridden.

**Signature:**

```java
public String description()
```

#### author()

Optional plugin author information.

Defaults to an empty string if not overridden.

**Signature:**

```java
public String author()
```

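### Example - Checking a Plugin Name

The naming rules listed under `name()` (lowercase, hyphen-separated, URL-safe) can be checked before registration with a small pattern. The sketch below is one reasonable reading of those rules; `PluginNameSketch` is a hypothetical helper, not part of the library, and the library itself may accept a wider set of URL-safe characters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library.
final class PluginNameSketch {
    // Lowercase alphanumeric segments separated by single hyphens, e.g. "my-custom-plugin".
    private static final Pattern NAME = Pattern.compile("[a-z0-9]+(-[a-z0-9]+)*");

    // Returns true when the whole name matches the pattern above.
    static boolean isValidName(String name) {
        return name != null && NAME.matcher(name).matches();
    }
}
```
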
---

#### PostProcessor

Trait for post-processor plugins.

Post-processors transform or enrich extraction results after the initial
extraction is complete. They can:

- Clean and normalize text
- Add metadata (language, keywords, entities)
- Split content into chunks
- Score quality
- Apply custom transformations

### Processing Order

Post-processors are executed in stage order:

1. **Early** - Language detection, entity extraction
2. **Middle** - Keyword extraction, token reduction
3. **Late** - Custom hooks, final validation

Within each stage, processors are executed in registration order.

### Error Handling

Post-processor errors are non-fatal by default - they're captured in metadata
and execution continues. To make errors fatal, throw an error from `process()`.

### Thread Safety

Post-processors must be thread-safe (`Send + Sync` in the Rust core).

### Methods

#### process()

Process an extraction result.

Transform or enrich the extraction result. Can modify:

- `content` - The extracted text
- `metadata` - Add or update metadata fields
- `tables` - Modify or enhance table data

**Returns:**

Returns normally if processing succeeded and throws for fatal failures (`Ok(())` and `Err(...)` in the Rust core).

**Errors:**

Throw errors for fatal processing failures. Non-fatal errors should be
captured in metadata directly on the result.

### Performance

In the Rust core, this signature avoids unnecessary cloning of large extraction results by
taking a mutable reference instead of ownership. Processors modify the
result in place.

### Example - Text Cleaning

```rust
async fn process(&self, result: &mut ExtractionResult, config: &ExtractionConfig)
    -> Result<()> {
    // Remove excessive whitespace
    result.content = result
        .content
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ");

    Ok(())
}
```

**Signature:**

```java
public void process(ExtractionResult result, ExtractionConfig config) throws Error
```

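### Example - Text Cleaning (Java Sketch)

For readers working from Java, the whitespace-collapsing step of the Rust text-cleaning example can be sketched as follows. `WhitespaceCleanerSketch` is a hypothetical helper written for this illustration; it only covers the string transformation and deliberately leaves out reading from and writing back to `ExtractionResult`, since those accessors are not shown in this section.

```java
// Hypothetical helper, not part of the library.
final class WhitespaceCleanerSketch {
    // Collapses every run of whitespace into a single space and trims both ends.
    // Note: trim() and \s handle ASCII whitespace only, whereas Rust's
    // split_whitespace() is Unicode-aware.
    static String collapseWhitespace(String content) {
        String trimmed = content.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return String.join(" ", trimmed.split("\\s+"));
    }
}
```
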
#### processingStage()

Get the processing stage for this post-processor.

Determines when this processor runs in the pipeline.

**Returns:**

The `ProcessingStage` (Early, Middle, or Late).

**Signature:**

```java
public ProcessingStage processingStage()
```

#### shouldProcess()

Optional: Check if this processor should run for a given result.

Allows conditional processing based on MIME type, metadata, or content.
Defaults to `true` (always run).

**Returns:**

`true` if the processor should run, `false` to skip.

**Signature:**

```java
public boolean shouldProcess(ExtractionResult result, ExtractionConfig config)
```

#### estimatedDurationMs()

Optional: Estimate processing time in milliseconds.

Used for logging and debugging. Defaults to 0 (unknown).

**Returns:**

Estimated processing time in milliseconds.

**Signature:**

```java
public long estimatedDurationMs(ExtractionResult result)
```

#### priority()

Execution priority within the processing stage.

Higher values run first within the same `ProcessingStage`. Defaults to 50.
Use 0-49 for fallback processors, 50 for normal processors, and 51-255
for high-priority processors that should run early in their stage.

**Signature:**

```java
public int priority()
```

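### Example - Execution Order

The ordering rules described for `PostProcessor` (stages run Early, then Middle, then Late; higher `priority()` values run first within a stage) can be sketched as a sort. `StageSketch`, `ProcessorEntry`, and `ProcessorOrderSketch` are hypothetical stand-ins written for this illustration and do not use the library's own types. Breaking ties by registration order is an assumption that combines the two statements in this section; the library's actual tie-breaking may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the processing stages; declaration order is the run order.
enum StageSketch { EARLY, MIDDLE, LATE }

// Hypothetical stand-in for a registered post-processor.
final class ProcessorEntry {
    final String name;
    final StageSketch stage;
    final int priority;

    ProcessorEntry(String name, StageSketch stage, int priority) {
        this.name = name;
        this.stage = stage;
        this.priority = priority;
    }
}

final class ProcessorOrderSketch {
    // Returns processor names in the order they would run under the assumed rules.
    static List<String> executionOrder(List<ProcessorEntry> registered) {
        List<ProcessorEntry> sorted = new ArrayList<>(registered);
        // List.sort is stable, so entries with equal stage and priority
        // keep their registration order.
        sorted.sort(Comparator
                .comparing((ProcessorEntry e) -> e.stage)
                .thenComparing(Comparator.comparingInt((ProcessorEntry e) -> e.priority).reversed()));
        List<String> names = new ArrayList<>();
        for (ProcessorEntry entry : sorted) {
            names.add(entry.name);
        }
        return names;
    }
}
```
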
---
|
|||
|
|
|
|||
|
|
#### PostProcessorConfig
|
|||
|
|
|
|||
|
|
Post-processor configuration.
|
|||
|
|
|
|||
|
|
| Field | Type | Default | Description |
|
|||
|
|
|-------|------|---------|-------------|
|
|||
|
|
| `enabled` | `boolean` | `true` | Enable post-processors |
|
|||
|
|
| `enabledProcessors` | `Optional<List<String>>` | `null` | Whitelist of processor names to run (None = all enabled) |
|
|||
|
|
| `disabledProcessors` | `Optional<List<String>>` | `null` | Blacklist of processor names to skip (None = none disabled) |
|
|||
|
|
| `enabledSet` | `Optional<List<String>>` | `null` | Pre-computed AHashSet for O(1) enabled processor lookup |
|
|||
|
|
| `disabledSet` | `Optional<List<String>>` | `null` | Pre-computed AHashSet for O(1) disabled processor lookup |
|
|||
|
|
|
|||
|
|
### Methods
|
|||
|
|
|
|||
|
|
#### defaultOptions()
|
|||
|
|
|
|||
|
|
**Signature:**
|
|||
|
|
|
|||
|
|
```java
|
|||
|
|
public static PostProcessorConfig defaultOptions()
|
|||
|
|
```

---

#### PptxAppProperties

Application properties from `docProps/app.xml` for PPTX.

Contains PowerPoint-specific document metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `application` | `Optional<String>` | `null` | Application name (e.g., "Microsoft Office PowerPoint") |
| `appVersion` | `Optional<String>` | `null` | Application version |
| `totalTime` | `Optional<Integer>` | `null` | Total editing time in minutes |
| `company` | `Optional<String>` | `null` | Company name |
| `docSecurity` | `Optional<Integer>` | `null` | Document security level |
| `scaleCrop` | `Optional<Boolean>` | `null` | Scale crop flag |
| `linksUpToDate` | `Optional<Boolean>` | `null` | Links up to date flag |
| `sharedDoc` | `Optional<Boolean>` | `null` | Shared document flag |
| `hyperlinksChanged` | `Optional<Boolean>` | `null` | Hyperlinks changed flag |
| `slides` | `Optional<Integer>` | `null` | Number of slides |
| `notes` | `Optional<Integer>` | `null` | Number of notes |
| `hiddenSlides` | `Optional<Integer>` | `null` | Number of hidden slides |
| `multimediaClips` | `Optional<Integer>` | `null` | Number of multimedia clips |
| `presentationFormat` | `Optional<String>` | `null` | Presentation format (e.g., "Widescreen", "Standard") |
| `slideTitles` | `List<String>` | `Collections.emptyList()` | Slide titles |

---

#### PptxExtractionResult

PowerPoint (PPTX) extraction result.

Contains extracted slide content, metadata, and embedded images/tables.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Extracted text content from all slides |
| `metadata` | `PptxMetadata` | — | Presentation metadata |
| `slideCount` | `long` | — | Total number of slides |
| `imageCount` | `long` | — | Total number of embedded images |
| `tableCount` | `long` | — | Total number of tables |
| `images` | `List<ExtractedImage>` | — | Extracted images from the presentation |
| `pageStructure` | `Optional<PageStructure>` | `null` | Slide structure with boundaries (when page tracking is enabled) |
| `pageContents` | `Optional<List<PageContent>>` | `null` | Per-slide content (when page tracking is enabled) |
| `document` | `Optional<DocumentStructure>` | `null` | Structured document representation |
| `hyperlinks` | `List<String>` | `/* serde(default) */` | Hyperlinks discovered in slides as (url, optional_label) pairs. |
| `officeMetadata` | `Map<String, String>` | `/* serde(default) */` | Office metadata extracted from docProps/core.xml and docProps/app.xml. Contains keys like "title", "author", "created_by", "subject", "keywords", "modified_by", "created_at", "modified_at", etc. |
| `revisions` | `Optional<List<DocumentRevision>>` | `/* serde(default) */` | Slide comments as revisions. Each `<p:cm>` element in `ppt/comments/comment{N}.xml` becomes a `DocumentRevision { kind: Comment }` with author (resolved from `ppt/commentAuthors.xml`), ISO-8601 timestamp, and `RevisionAnchor.Slide { index }`. `null` when no comment XML parts exist. |

---

#### PptxMetadata

PowerPoint presentation metadata.

Extracted from PPTX files; contains slide counts and presentation details.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `slideCount` | `int` | — | Total number of slides in the presentation |
| `slideNames` | `List<String>` | `Collections.emptyList()` | Names of slides (if available) |
| `imageCount` | `Optional<Integer>` | `null` | Number of embedded images |
| `tableCount` | `Optional<Integer>` | `null` | Number of tables |

---

#### ProcessingWarning

A non-fatal warning from a processing pipeline stage.

Captures errors from optional features that don't prevent extraction
but may indicate degraded results.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `source` | `String` | — | The pipeline stage or feature that produced this warning (e.g., "embedding", "chunking", "language_detection", "output_format"). |
| `message` | `String` | — | Human-readable description of what went wrong. |

---

#### PstMetadata

Outlook PST archive metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `messageCount` | `long` | — | Number of messages |

---

#### RakeParams

RAKE-specific parameters.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `minWordLength` | `long` | `1` | Minimum word length to consider (default: 1). |
| `maxWordsPerPhrase` | `long` | `3` | Maximum words in a keyword phrase (default: 3). |

### Methods

#### defaultOptions()

**Signature:**

```java
public static RakeParams defaultOptions()
```
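To make the two parameters concrete, here is a rough sketch of RAKE-style candidate generation: text is split at punctuation and stopwords, and the remaining word runs become candidate phrases. This is only an illustration under stated assumptions. The class name, the tiny stopword list, and the choices to treat too-short words as phrase breaks and to drop over-long phrases are hypothetical, and the library may apply `minWordLength` and `maxWordsPerPhrase` differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the library's RAKE implementation.
class RakeCandidateSketch {
    // Tiny illustrative stopword list; a real extractor would use a full one.
    static final Set<String> STOPWORDS =
            Set.of("a", "an", "and", "for", "in", "is", "of", "the", "to");

    /** Splits text at punctuation and stopwords, keeping phrases within the limits. */
    static List<String> candidates(String text, int minWordLength, int maxWordsPerPhrase) {
        List<String> phrases = new ArrayList<>();
        for (String segment : text.toLowerCase().split("[.,;:!?]+")) {
            List<String> current = new ArrayList<>();
            for (String word : segment.trim().split("\\s+")) {
                // Assumption: stopwords and too-short words both end the current phrase.
                if (word.isEmpty() || STOPWORDS.contains(word) || word.length() < minWordLength) {
                    flush(current, maxWordsPerPhrase, phrases);
                } else {
                    current.add(word);
                }
            }
            flush(current, maxWordsPerPhrase, phrases);
        }
        return phrases;
    }

    // Assumption: phrases longer than the limit are dropped rather than truncated.
    private static void flush(List<String> current, int maxWords, List<String> out) {
        if (!current.isEmpty() && current.size() <= maxWords) {
            out.add(String.join(" ", current));
        }
        current.clear();
    }
}
```

Full RAKE then scores each candidate from word frequency and co-occurrence degree; that step is omitted here.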

---

#### RecognizedTable

Pre-computed table markdown for a table detection region.

Produced by the TATR-based table structure recognizer and surfaced as part of
layout-aware OCR results. The struct lives here (under `layout-types`, pure-Rust)
so that consumers who do not enable `layout-detection` (ORT) can still reference
the type in their own code.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `detectionBbox` | `BBox` | — | Detection bbox that this table corresponds to (for matching). |
| `cells` | `List<List<String>>` | — | Table cells as a 2D vector (rows × columns). |
| `markdown` | `String` | — | Rendered markdown table. |

---

#### Renderer

Trait for document renderers that convert `InternalDocument` to output strings.

Renderers are typically stateless converters that transform the internal
document representation into a specific output format (Markdown, HTML,
Djot, plain text, etc.). They participate in the standard `Plugin`
lifecycle so custom renderers can be registered from any supported binding
language.

The format name is exposed via `Plugin.name`. For stateless renderers
the `Plugin` lifecycle methods (`version`, `initialize`, `shutdown`) all
have no-op defaults and need not be overridden.

### Thread Safety

Renderers must be `Send + Sync` (inherited from `Plugin`).

### Methods

#### render()

Render an `InternalDocument` to the output format.

**Returns:**

The rendered output as a string.

**Errors:**

Throws an error if rendering fails.

**Signature:**

```java
public String render(InternalDocument doc) throws Error
```

---

#### RevisionDelta

The content changes that make up a single revision.

For insertions and deletions, the `content` field carries the added/removed
lines as `DiffLine.Added` / `DiffLine.Removed` entries. For format
changes, `content` is empty; the property diff is left as a TODO for a
later enrichment pass.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `List<DiffLine>` | `Collections.emptyList()` | Line-level content changes for this revision. |
| `tableChanges` | `List<CellChange>` | `Collections.emptyList()` | Cell-level table changes for this revision. |

---

#### SecurityLimits

Configuration for security limits across extractors.

All limits are intentionally conservative to prevent DoS attacks
while still supporting legitimate documents.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `maxArchiveSize` | `long` | `524288000` | Maximum uncompressed size for archives (500 MB) |
| `maxCompressionRatio` | `long` | `100` | Maximum compression ratio before flagging as potential bomb (100:1) |
| `maxFilesInArchive` | `long` | `10000` | Maximum number of files in archive (10,000) |
| `maxNestingDepth` | `long` | `1024` | Maximum nesting depth for structures |
| `maxEntityLength` | `long` | `1048576` | Maximum length of any single XML entity / attribute / token (1 MiB). This is a per-token cap, not a total cap: billion-laughs class attacks where a single entity expands to hundreds of MB are caught here, while normal long text content (a paragraph, a CDATA block) is caught by `maxContentSize` instead. |
| `maxContentSize` | `long` | `104857600` | Maximum string growth per document (100 MB) |
| `maxIterations` | `long` | `10000000` | Maximum iterations per operation |
| `maxXmlDepth` | `long` | `1024` | Maximum XML depth |
| `maxTableCells` | `long` | `100000` | Maximum cells per table (100,000) |

### Methods

#### defaultOptions()

**Signature:**

```java
public static SecurityLimits defaultOptions()
```
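As an illustration of how the archive limits above might be enforced, the sketch below checks a compressed/uncompressed size pair against the documented defaults for `maxArchiveSize` and `maxCompressionRatio`. The class, method, and reason strings are hypothetical; the library's actual checks are not shown in this reference.

```java
import java.util.Optional;

// Hypothetical guard; not part of the Kreuzberg API.
class ArchiveLimitSketch {
    static final long MAX_ARCHIVE_SIZE = 524_288_000L; // maxArchiveSize default (500 MB)
    static final long MAX_COMPRESSION_RATIO = 100L;    // maxCompressionRatio default (100:1)

    /** Returns an empty Optional when the archive passes, otherwise a short reason. */
    static Optional<String> check(long compressedBytes, long uncompressedBytes) {
        if (uncompressedBytes > MAX_ARCHIVE_SIZE) {
            return Optional.of("archive too large");
        }
        // Multiply instead of dividing so a zero compressed size cannot divide by zero.
        // (A production check would also guard this multiplication against overflow.)
        if (uncompressedBytes > compressedBytes * MAX_COMPRESSION_RATIO) {
            return Optional.of("compression ratio too high");
        }
        return Optional.empty();
    }
}
```

Checking the ratio as well as the absolute size matters because a small archive can still expand far beyond what its compressed size suggests.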

---

#### ServerConfig

API server configuration.

This struct holds all configuration options for the Kreuzberg API server,
including host/port settings, CORS configuration, and upload limits.

### Defaults

- `host`: "127.0.0.1" (localhost only)
- `port`: 8000
- `corsOrigins`: empty list (allows all origins)
- `maxRequestBodyBytes`: 104_857_600 (100 MB)
- `maxMultipartFieldBytes`: 104_857_600 (100 MB)

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `host` | `String` | — | Server host address (e.g., "127.0.0.1", "0.0.0.0") |
| `port` | `short` | — | Server port number |
| `corsOrigins` | `List<String>` | `Collections.emptyList()` | Allowed CORS origins. If the list is empty, the server accepts requests from any origin. If populated with specific origins (e.g., `"https://example.com"`), only those origins are allowed. |
| `maxRequestBodyBytes` | `long` | — | Maximum size of request body in bytes (default: 100 MB) |
| `maxMultipartFieldBytes` | `long` | — | Maximum size of multipart fields in bytes (default: 100 MB) |

### Methods

#### defaultOptions()

**Signature:**

```java
public static ServerConfig defaultOptions()
```

#### listenAddr()

Get the server listen address (host:port).

**Signature:**

```java
public String listenAddr()
```

#### corsAllowsAll()

Check if CORS allows all origins.

Returns `true` if the `corsOrigins` list is empty, meaning all origins
are allowed. Returns `false` if specific origins are configured.

**Signature:**

```java
public boolean corsAllowsAll()
```

#### isOriginAllowed()

Check if a given origin is allowed by CORS configuration.

Returns `true` if:

- CORS allows all origins (empty origins list), or
- The given origin is in the allowed origins list

**Signature:**

```java
public boolean isOriginAllowed(String origin)
```

#### maxRequestBodyMb()

Get maximum request body size in megabytes (rounded up).

**Signature:**

```java
public long maxRequestBodyMb()
```

#### maxMultipartFieldMb()

Get maximum multipart field size in megabytes (rounded up).

**Signature:**

```java
public long maxMultipartFieldMb()
```
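The helper methods above are simple enough to sketch. The following stand-alone class reproduces their documented behaviour under a few assumptions: one megabyte is taken as 1,048,576 bytes (consistent with 104,857,600 bytes being listed as 100 MB), origins are compared as exact strings, and the class and method names are hypothetical rather than the binding's own.

```java
import java.util.List;

// Hypothetical stand-in for the documented ServerConfig helpers.
class ServerConfigSketch {
    static final long BYTES_PER_MB = 1024L * 1024L;

    /** host:port, as described for listenAddr(). */
    static String listenAddr(String host, int port) {
        return host + ":" + port;
    }

    /** An empty list allows every origin; otherwise the origin must be listed exactly. */
    static boolean isOriginAllowed(List<String> corsOrigins, String origin) {
        return corsOrigins.isEmpty() || corsOrigins.contains(origin);
    }

    /** Ceiling division: any partial megabyte counts as a whole one. */
    static long toMbRoundedUp(long bytes) {
        return (bytes + BYTES_PER_MB - 1) / BYTES_PER_MB;
    }
}
```

With these definitions, the default 104,857,600-byte limit reports as 100 MB, and a limit one byte larger reports as 101 MB.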

---

#### StructuredData

Structured data (Schema.org, microdata, RDFa) block.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `dataType` | `StructuredDataType` | — | Type of structured data |
| `rawJson` | `String` | — | Raw JSON string representation |
| `schemaType` | `Optional<String>` | `null` | Schema type if detectable (e.g., "Article", "Event", "Product") |

---

#### StructuredDataResult

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | The extracted text content |
| `format` | `String` | — | Format |
| `metadata` | `Map<String, String>` | — | Document metadata |
| `textFields` | `List<String>` | — | Text fields |

---

#### StructuredExtractionConfig

Configuration for LLM-based structured data extraction.

Sends extracted document content to a VLM with a JSON schema,
returning structured data that conforms to the schema.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `schema` | `Object` | — | JSON Schema defining the desired output structure. |
| `schemaName` | `String` | `/* serde(default) */` | Schema name passed to the LLM's structured output mode. |
| `schemaDescription` | `Optional<String>` | `/* serde(default) */` | Optional schema description for the LLM. |
| `strict` | `boolean` | `/* serde(default) */` | Enable strict mode: output must exactly match the schema. |
| `prompt` | `Optional<String>` | `/* serde(default) */` | Custom Jinja2 extraction prompt template. When `null`, a default template is used. Available template variables: `{{ content }}` (the extracted document text), `{{ schema }}` (the JSON schema as a formatted string), `{{ schema_name }}` (the schema name), `{{ schema_description }}` (the schema description; may be empty). |
| `llm` | `LlmConfig` | — | LLM configuration for the extraction. |

---

#### SupportedFormat

A supported document format entry.

Represents a file extension and its corresponding MIME type that Kreuzberg can process.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `extension` | `String` | — | File extension (without leading dot), e.g., "pdf", "docx" |
| `mimeType` | `String` | — | MIME type string, e.g., "application/pdf" |

---

#### Table

Extracted table structure.

Represents a table detected and extracted from a document (PDF, image, etc.).
Tables are converted to both structured cell data and Markdown format.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `cells` | `List<List<String>>` | `Collections.emptyList()` | Table cells as a 2D vector (rows × columns) |
| `markdown` | `String` | — | Markdown representation of the table |
| `pageNumber` | `int` | — | Page number where the table was found (1-indexed) |
| `boundingBox` | `Optional<BoundingBox>` | `null` | Bounding box of the table on the page (PDF coordinates: x0=left, y0=bottom, x1=right, y1=top). Only populated for PDF-extracted tables when position data is available. |
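
To show how the `cells` and `markdown` fields relate, here is a minimal sketch that renders a rows × columns list as a Markdown table. It assumes the first row is the header and escapes pipe characters inside cells; the class and method are hypothetical, and the library's own rendering may differ in spacing or alignment.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical renderer; not the library's Markdown output code.
class TableMarkdownSketch {
    /** Renders rows x columns cell data as a Markdown table, first row as header. */
    static String toMarkdown(List<List<String>> cells) {
        if (cells.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < cells.size(); r++) {
            sb.append(row(cells.get(r)));
            if (r == 0) {
                // Separator line after the (assumed) header row
                sb.append(row(cells.get(0).stream().map(c -> "---").collect(Collectors.toList())));
            }
        }
        return sb.toString();
    }

    private static String row(List<String> values) {
        return values.stream()
                .map(v -> v.replace("|", "\\|")) // keep literal pipes from splitting cells
                .collect(Collectors.joining(" | ", "| ", " |\n"));
    }
}
```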

---

#### TableCell

Individual table cell with content and optional styling.

Future extension point for rich table support with cell-level metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Cell content as text |
| `rowSpan` | `int` | — | Row span (number of rows this cell spans) |
| `colSpan` | `int` | — | Column span (number of columns this cell spans) |
| `isHeader` | `boolean` | — | Whether this is a header cell |

---

#### TableDiff

Cell-level changes for a pair of tables that share the same index.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `fromIndex` | `long` | — | Zero-based index of the table in both `a.tables` and `b.tables`. |
| `toIndex` | `long` | — | Zero-based index in `b.tables` (equal to `fromIndex` for same-dimension tables). |
| `cellChanges` | `List<CellChange>` | — | Cell-level changes within the table. |

---

#### TableGrid

Structured table grid with cell-level metadata.

Stores row/column dimensions and a flat list of cells with position info.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `rows` | `int` | — | Number of rows in the table. |
| `cols` | `int` | — | Number of columns in the table. |
| `cells` | `List<GridCell>` | `Collections.emptyList()` | All cells in row-major order. |
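
Row-major order means the flat list stores row 0 first, then row 1, and so on. A minimal sketch of the index arithmetic follows; it assumes exactly one list entry per grid position (no merged cells collapsing entries), and the helper class is hypothetical.

```java
import java.util.List;

// Hypothetical helper illustrating row-major lookup.
class RowMajorSketch {
    /** Flat index of the cell at (row, col) in a grid with `cols` columns. */
    static int indexOf(int row, int col, int cols) {
        return row * cols + col;
    }

    /** Looks up the cell at (row, col) in a row-major flat list. */
    static <T> T cellAt(List<T> cells, int cols, int row, int col) {
        return cells.get(indexOf(row, col, cols));
    }
}
```

For a 2 × 3 grid stored as `[a, b, c, d, e, f]`, the cell at row 1, column 2 is at index 5.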

---

#### TesseractConfig

Tesseract OCR configuration.

Provides fine-grained control over Tesseract OCR engine parameters.
Most users can use the defaults, but these settings allow optimization
for specific document types (invoices, handwriting, etc.).

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `language` | `String` | `"eng"` | Language code (e.g., "eng", "deu", "fra") |
| `psm` | `int` | `3` | Page Segmentation Mode (0-13). Common values: 3 = fully automatic page segmentation (native default); 6 = assume a single uniform block of text (WASM default; avoids layout-analysis hang); 11 = sparse text with no particular order. |
| `outputFormat` | `String` | `"markdown"` | Output format ("text" or "markdown") |
| `oem` | `int` | `3` | OCR Engine Mode (0-3): 0 = legacy engine only; 1 = neural nets (LSTM) only (usually best); 2 = legacy + LSTM; 3 = default (based on what's available). |
| `minConfidence` | `double` | `0` | Minimum confidence threshold (0.0-100.0). Words with confidence below this threshold may be rejected or flagged. |
| `preprocessing` | `Optional<ImagePreprocessingConfig>` | `null` | Image preprocessing configuration. Controls how images are preprocessed before OCR. Can significantly improve quality for scanned documents or low-quality images. |
| `enableTableDetection` | `boolean` | `true` | Enable automatic table detection and reconstruction |
| `tableMinConfidence` | `double` | `0` | Minimum confidence threshold for table detection (0.0-1.0) |
| `tableColumnThreshold` | `int` | `50` | Column threshold for table detection (pixels) |
| `tableRowThresholdRatio` | `double` | `0.5` | Row threshold ratio for table detection (0.0-1.0) |
| `useCache` | `boolean` | `true` | Enable OCR result caching |
| `classifyUsePreAdaptedTemplates` | `boolean` | `true` | Use pre-adapted templates for character classification |
| `languageModelNgramOn` | `boolean` | `false` | Enable N-gram language model |
| `tesseditDontBlkrejGoodWds` | `boolean` | `true` | Don't reject good words during block-level processing |
| `tesseditDontRowrejGoodWds` | `boolean` | `true` | Don't reject good words during row-level processing |
| `tesseditEnableDictCorrection` | `boolean` | `true` | Enable dictionary correction |
| `tesseditCharWhitelist` | `String` | `""` | Whitelist of allowed characters (empty = all allowed) |
| `tesseditCharBlacklist` | `String` | `""` | Blacklist of forbidden characters (empty = none forbidden) |
| `tesseditUsePrimaryParamsModel` | `boolean` | `true` | Use primary language params model |
| `textordSpaceSizeIsVariable` | `boolean` | `true` | Variable-width space detection |
| `thresholdingMethod` | `boolean` | `false` | Use adaptive thresholding method |

### Methods

#### defaultOptions()

**Signature:**

```java
public static TesseractConfig defaultOptions()
```

---

#### TextAnnotation

Inline text annotation: byte-range based formatting and links.

Annotations reference byte offsets into the node's text content,
enabling precise identification of formatted regions.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `start` | `int` | — | Start byte offset in the node's text content (inclusive). |
| `end` | `int` | — | End byte offset in the node's text content (exclusive). |
| `kind` | `AnnotationKind` | — | Annotation type. |
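
Because `start` and `end` are byte offsets, they cannot be passed directly to `String.substring`, which indexes UTF-16 code units. The sketch below slices the annotated region by going through the encoded bytes. It assumes the offsets refer to the UTF-8 encoding of the text and fall on character boundaries; the helper class is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; assumes UTF-8 byte offsets on character boundaries.
class ByteRangeSketch {
    /** Returns the text covered by the byte range [start, end). */
    static String slice(String text, int start, int end) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(Arrays.copyOfRange(utf8, start, end), StandardCharsets.UTF_8);
    }
}
```

For ASCII-only text the byte and character offsets coincide; they diverge as soon as the text contains multi-byte characters.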

---

#### TextExtractionResult

Plain text and Markdown extraction result.

Contains the extracted text along with statistics and,
for Markdown files, structural elements like headers and links.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Extracted text content |
| `lineCount` | `long` | — | Number of lines |
| `wordCount` | `long` | — | Number of words |
| `characterCount` | `long` | — | Number of characters |
| `headers` | `Optional<List<String>>` | `null` | Markdown headers (text only, Markdown files only) |
| `links` | `Optional<List<List<String>>>` | `null` | Markdown links as (text, URL) tuples (Markdown files only) |
| `codeBlocks` | `Optional<List<List<String>>>` | `null` | Code blocks as (language, code) tuples (Markdown files only) |

---

#### TextMetadata

Text/Markdown metadata.

Extracted from plain text and Markdown files. Includes word counts and,
for Markdown, structural elements like headers and links.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `lineCount` | `int` | — | Number of lines in the document |
| `wordCount` | `int` | — | Number of words |
| `characterCount` | `int` | — | Number of characters |
| `headers` | `Optional<List<String>>` | `Collections.emptyList()` | Markdown headers (headings text only, for Markdown files) |
| `links` | `Optional<List<List<String>>>` | `Collections.emptyList()` | Markdown links as (text, url) tuples (for Markdown files) |
| `codeBlocks` | `Optional<List<List<String>>>` | `Collections.emptyList()` | Code blocks as (language, code) tuples (for Markdown files) |

---

#### TokenReductionConfig

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `level` | `ReductionLevel` | `ReductionLevel.MODERATE` | Reduction level |
| `languageHint` | `Optional<String>` | `null` | Language hint |
| `preserveMarkdown` | `boolean` | `false` | Preserve Markdown |
| `preserveCode` | `boolean` | `true` | Preserve code |
| `semanticThreshold` | `float` | `0.3` | Semantic threshold |
| `enableParallel` | `boolean` | `true` | Enable parallel processing |
| `useSimd` | `boolean` | `true` | Use SIMD |
| `customStopwords` | `Optional<Map<String, List<String>>>` | `null` | Custom stopwords |
| `preservePatterns` | `List<String>` | `Collections.emptyList()` | Patterns to preserve |
| `targetReduction` | `Optional<Float>` | `null` | Target reduction |
| `enableSemanticClustering` | `boolean` | `false` | Enable semantic clustering |

### Methods

#### defaultOptions()

**Signature:**

```java
public static TokenReductionConfig defaultOptions()
```

---

#### TokenReductionOptions

Token reduction configuration.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mode` | `String` | — | Reduction mode: "off", "light", "moderate", "aggressive", "maximum" |
| `preserveImportantWords` | `boolean` | `true` | Preserve important words (capitalized, technical terms) |

### Methods

#### defaultOptions()

**Signature:**

```java
public static TokenReductionOptions defaultOptions()
```

---

#### TreeSitterConfig

Configuration for tree-sitter language pack integration.

Controls grammar download behavior and code analysis options.

### Example (TOML)

```toml
[tree_sitter]
languages = ["python", "rust"]
groups = ["web"]

[tree_sitter.process]
structure = true
comments = true
docstrings = true
```

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enabled` | `boolean` | `true` | Enable code intelligence processing (default: true). When `false`, tree-sitter analysis is completely skipped even if the config section is present. |
| `cacheDir` | `Optional<String>` | `null` | Custom cache directory for downloaded grammars. When `null`, uses the default: `~/.cache/tree-sitter-language-pack/v{version}/libs/`. |
| `languages` | `Optional<List<String>>` | `null` | Languages to pre-download on init (e.g., `["python", "rust"]`). |
| `groups` | `Optional<List<String>>` | `null` | Language groups to pre-download (e.g., `["web", "systems", "scripting"]`). |
| `process` | `TreeSitterProcessConfig` | — | Processing options for code analysis. |

### Methods

#### defaultOptions()

**Signature:**

```java
public static TreeSitterConfig defaultOptions()
```

---

#### TreeSitterProcessConfig

Processing options for tree-sitter code analysis.

Controls which analysis features are enabled when extracting code files.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `structure` | `boolean` | `true` | Extract structural items (functions, classes, structs, etc.). Default: true. |
| `imports` | `boolean` | `true` | Extract import statements. Default: true. |
| `exports` | `boolean` | `true` | Extract export statements. Default: true. |
| `comments` | `boolean` | `false` | Extract comments. Default: false. |
| `docstrings` | `boolean` | `false` | Extract docstrings. Default: false. |
| `symbols` | `boolean` | `false` | Extract symbol definitions. Default: false. |
| `diagnostics` | `boolean` | `false` | Include parse diagnostics. Default: false. |
| `chunkMaxSize` | `Optional<Long>` | `null` | Maximum chunk size in bytes. `null` disables chunking. |
| `contentMode` | `CodeContentMode` | `CodeContentMode.CHUNKS` | Content rendering mode for code extraction. |

### Methods

#### defaultOptions()

**Signature:**

```java
public static TreeSitterProcessConfig defaultOptions()
```

---

#### Validator

Trait for validator plugins.

Validators check extraction results for quality, completeness, or correctness.
Unlike post-processors, validator errors **fail fast**: if a validator returns
an error, the extraction fails immediately.

### Use Cases

- **Quality Gates**: Ensure extracted content meets minimum quality standards
- **Compliance**: Verify content meets regulatory requirements
- **Content Filtering**: Reject documents containing unwanted content
- **Format Validation**: Verify extracted content structure
- **Security Checks**: Scan for malicious content

### Error Handling

Validator errors are **fatal**: they cause the extraction to fail and bubble up
to the caller. Use validators for hard requirements that must be met.

For non-fatal checks, use post-processors instead.

### Thread Safety

Validators must be thread-safe (`Send + Sync`).

### Methods

#### validate()

Validate an extraction result.

Check the extraction result and return `Ok(())` if valid, or an error
if validation fails. In the Java binding, the method returns `void` on
success and throws on failure.

**Returns:**

- `Ok(())` if validation passes
- `Err(...)` if validation fails (extraction will fail)

**Errors:**

- `KreuzbergError.Validation` - Validation failed
- Any other error type appropriate for the failure

### Example - Content Length Validation

```rust
async fn validate(&self, result: &ExtractionResult, config: &ExtractionConfig)
    -> Result<()> {
    // `len()` returns the length in bytes, not characters
    let length = result.content.len();

    if length < self.min {
        return Err(KreuzbergError::validation(format!(
            "Content too short: {} < {} bytes",
            length, self.min
        )));
    }

    if length > self.max {
        return Err(KreuzbergError::validation(format!(
            "Content too long: {} > {} bytes",
            length, self.max
        )));
    }

    Ok(())
}
```

### Example - Quality Score Validation

```rust
async fn validate(&self, result: &ExtractionResult, config: &ExtractionConfig)
    -> Result<()> {
    // Read quality_score from metadata, treating a missing value as 0.0
    let score = result.metadata
        .additional
        .get("quality_score")
        .and_then(|v| v.as_f64())
        .unwrap_or(0.0);

    if score < self.min_score {
        return Err(KreuzbergError::validation(format!(
            "Quality score too low: {} < {}",
            score, self.min_score
        )));
    }

    Ok(())
}
```

### Example - Security Validation

```rust
async fn validate(&self, result: &ExtractionResult, config: &ExtractionConfig)
    -> Result<()> {
    // Check for blocked patterns
    for pattern in &self.blocked_patterns {
        if result.content.contains(pattern) {
            return Err(KreuzbergError::validation(format!(
                "Content contains blocked pattern: {}",
                pattern
            )));
        }
    }

    Ok(())
}
```

**Signature:**

```java
|
|||
|
|
public void validate(ExtractionResult result, ExtractionConfig config) throws Error
|
|||
|
|
```
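
The examples above use the Rust core API. As a rough Java counterpart to the content-length example, the sketch below follows the "return normally or throw" shape of the Java signature. `ResultStub`, `ValidationFailure`, and `LengthValidatorSketch` are hypothetical stand-ins invented for illustration; they are not Kreuzberg classes, and the real `ExtractionResult`, `ExtractionConfig`, and error types may look different.

```java
// Hypothetical stand-in for ExtractionResult: only the text content is modelled.
final class ResultStub {
    final String content;

    ResultStub(String content) {
        this.content = content;
    }
}

// Hypothetical stand-in for the binding's validation error (unchecked for brevity).
final class ValidationFailure extends RuntimeException {
    ValidationFailure(String message) {
        super(message);
    }
}

final class LengthValidatorSketch {
    private final int min;
    private final int max;

    LengthValidatorSketch(int min, int max) {
        this.min = min;
        this.max = max;
    }

    // Returns normally when the content length is within [min, max]; throws otherwise.
    void validate(ResultStub result) {
        int length = result.content.length(); // UTF-16 code units in Java
        if (length < min) {
            throw new ValidationFailure(
                "Content too short: " + length + " < " + min + " characters");
        }
        if (length > max) {
            throw new ValidationFailure(
                "Content too long: " + length + " > " + max + " characters");
        }
    }
}
```

A caller would construct the validator once, for example `new LengthValidatorSketch(10, 100_000)`, and let a thrown failure abort the extraction.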

#### shouldValidate()

Optional: Check if this validator should run for a given result.

Allows conditional validation based on MIME type, metadata, or content.
Defaults to `true` (always run).

**Returns:**

`true` if the validator should run, `false` to skip.

**Signature:**

```java
public boolean shouldValidate(ExtractionResult result, ExtractionConfig config)
```
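
A minimal sketch of MIME-type gating, assuming the result exposes its MIME type as a plain string (how `ExtractionResult` actually exposes it is not shown in this section). `MimeGateSketch` is a hypothetical helper, not part of the binding.

```java
import java.util.Locale;

final class MimeGateSketch {
    // Returns true only for PDF input, so a PDF-specific validator is skipped
    // for every other document type.
    static boolean shouldValidate(String mimeType) {
        return mimeType != null
            && mimeType.toLowerCase(Locale.ROOT).startsWith("application/pdf");
    }
}
```

An implementation of `shouldValidate()` could delegate to a check like this and return `false` for results it does not apply to.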

#### priority()

Optional: Get the validation priority.

Higher priority validators run first. Useful for ordering validation checks
(e.g., run cheap validations before expensive ones).

Default priority is 50.

**Returns:**

Priority value (higher = runs earlier).

**Signature:**

```java
public int priority()
```
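
To illustrate "higher = runs earlier", the sketch below sorts a list of validators by descending priority. `PrioritizedSketch` and `PriorityOrderSketch` are hypothetical names used only here; how Kreuzberg orders validators with equal priority is not stated in this reference, so the stable sort below simply keeps their registration order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a validator: a name plus a priority.
interface PrioritizedSketch {
    String name();

    // Mirrors the documented default priority of 50.
    default int priority() {
        return 50;
    }
}

final class PriorityOrderSketch {
    // Returns a new list with the highest priority first; the input is not modified.
    static List<PrioritizedSketch> runOrder(List<PrioritizedSketch> validators) {
        List<PrioritizedSketch> ordered = new ArrayList<>(validators);
        ordered.sort(Comparator.comparingInt(PrioritizedSketch::priority).reversed());
        return ordered;
    }

    // Convenience factory for the example.
    static PrioritizedSketch of(String name, int priority) {
        return new PrioritizedSketch() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public int priority() {
                return priority;
            }
        };
    }
}
```

Giving a cheap check a priority such as 90 and an expensive one 10 makes the cheap check run first, so a failure there avoids the expensive work.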

---

#### XlsxAppProperties

Application properties from `docProps/app.xml` for XLSX.

Contains Excel-specific document metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `application` | `Optional<String>` | `null` | Application name (e.g., "Microsoft Excel") |
| `appVersion` | `Optional<String>` | `null` | Application version |
| `docSecurity` | `Optional<Integer>` | `null` | Document security level |
| `scaleCrop` | `Optional<Boolean>` | `null` | Scale crop flag |
| `linksUpToDate` | `Optional<Boolean>` | `null` | Links up to date flag |
| `sharedDoc` | `Optional<Boolean>` | `null` | Shared document flag |
| `hyperlinksChanged` | `Optional<Boolean>` | `null` | Hyperlinks changed flag |
| `company` | `Optional<String>` | `null` | Company name |
| `worksheetNames` | `List<String>` | `Collections.emptyList()` | Worksheet names |

---

#### XmlExtractionResult

XML extraction result.

Contains extracted text content from XML files along with
structural statistics about the XML document.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `String` | — | Extracted text content (XML structure filtered out) |
| `elementCount` | `long` | — | Total number of XML elements processed |
| `uniqueElements` | `List<String>` | — | List of unique element names found (sorted) |

---

#### XmlMetadata

XML metadata extracted during XML parsing.

Provides statistics about XML document structure.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `elementCount` | `int` | — | Total number of XML elements processed |
| `uniqueElements` | `List<String>` | `Collections.emptyList()` | List of unique element tag names (sorted) |

---

#### YakeParams

YAKE-specific parameters.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `windowSize` | `long` | `2` | Window size for co-occurrence analysis. Controls the context window for computing co-occurrence statistics. |

### Methods

#### defaultOptions()

Create a `YakeParams` with the default values listed above.

**Signature:**

```java
public static YakeParams defaultOptions()
```
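
To give a feel for what a co-occurrence window means, the sketch below counts unordered token pairs that appear within `windowSize` positions of each other. This is generic windowed co-occurrence counting and only an assumption about the role of `windowSize`; it is not YAKE's full scoring and not Kreuzberg's implementation. `CooccurrenceSketch` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CooccurrenceSketch {
    // Counts unordered token pairs whose positions differ by at most windowSize.
    static Map<String, Integer> count(List<String> tokens, int windowSize) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            int end = Math.min(tokens.size(), i + windowSize + 1);
            for (int j = i + 1; j < end; j++) {
                String a = tokens.get(i);
                String b = tokens.get(j);
                // Order the pair so ("a", "b") and ("b", "a") share one key.
                String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

With a larger window, each token is paired with more neighbours, so pair counts can only stay the same or grow.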

---

#### YearRange

Year range for bibliographic metadata.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `min` | `Optional<Integer>` | `null` | Minimum year |
| `max` | `Optional<Integer>` | `null` | Maximum year |
| `years` | `List<Integer>` | `Collections.emptyList()` | Individual years |

---

### Enums

#### ExecutionProviderType

ONNX Runtime execution provider type.

Determines which hardware backend is used for model inference.
`AUTO` (default) selects the best available provider per platform.

| Value | Description |
|-------|-------------|
| `AUTO` | Auto-select: CoreML on macOS, CUDA on Linux, CPU elsewhere. |
| `CPU` | CPU execution provider (always available). |
| `CORE_ML` | Apple CoreML (macOS/iOS Neural Engine + GPU). |
| `CUDA` | NVIDIA CUDA GPU acceleration. |
| `TENSOR_RT` | NVIDIA TensorRT (optimized CUDA inference). |
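
The platform mapping described for `AUTO` can be sketched as follows. This only mirrors the table row; it makes no claim about how the library probes for available hardware. `ProviderSketch` and `AutoProviderSketch` are hypothetical names, and the OS string is assumed to look like the value of `System.getProperty("os.name")`.

```java
import java.util.Locale;

// Hypothetical mirror of the providers that AUTO can resolve to.
enum ProviderSketch { CPU, CORE_ML, CUDA }

final class AutoProviderSketch {
    // Maps an OS name to the provider named in the AUTO row:
    // CoreML on macOS, CUDA on Linux, CPU elsewhere.
    static ProviderSketch resolve(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac")) {
            return ProviderSketch.CORE_ML;
        }
        if (os.contains("linux")) {
            return ProviderSketch.CUDA;
        }
        return ProviderSketch.CPU;
    }
}
```

Calling `AutoProviderSketch.resolve(System.getProperty("os.name"))` would apply the mapping to the current machine.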

---

#### OutputFormat

Output format for extraction results.

Controls the format of the `content` field in `ExtractionResult`.
When set to `MARKDOWN`, `DJOT`, or `HTML`, the output uses that format.
`PLAIN` returns the raw extracted text.
`STRUCTURED` returns JSON with full OCR element data including bounding
boxes and confidence scores.

| Value | Description |
|-------|-------------|
| `PLAIN` | Plain text content only (default) |
| `MARKDOWN` | Markdown format |
| `DJOT` | Djot markup format |
| `HTML` | HTML format |
| `JSON` | JSON tree format with heading-driven sections. |
| `STRUCTURED` | Structured JSON format with full OCR element metadata. |
| `CUSTOM` | Custom renderer registered via the RendererRegistry. The string is the renderer name (e.g., "docx", "latex"). — Fields: `0`: `String` |

---

#### HtmlTheme

Built-in HTML theme selection.

| Value | Description |
|-------|-------------|
| `DEFAULT` | Sensible defaults: system font stack, neutral colours, readable line measure. CSS custom properties (`--kb-*`) are all defined so user CSS can override individual values. |
| `GIT_HUB` | GitHub Markdown-inspired palette and spacing. |
| `DARK` | Dark background, light text. |
| `LIGHT` | Minimal light theme with generous whitespace. |
| `UNSTYLED` | No built-in stylesheet emitted. CSS custom properties are still defined on `:root` so user stylesheets can reference `var(--kb-*)` tokens. |

---

#### TableModel

Which table structure recognition model to use.

Controls the model used for table cell detection within layout-detected
table regions. Wire format is snake_case in all serializers (JSON, TOML,
YAML).

| Value | Description |
|-------|-------------|
| `TATR` | TATR (Table Transformer) -- default, 30MB, DETR-based row/column detection. |
| `SLANET_WIRED` | SLANeXT wired variant -- 365MB, optimized for bordered tables. |
| `SLANET_WIRELESS` | SLANeXT wireless variant -- 365MB, optimized for borderless tables. |
| `SLANET_PLUS` | SLANet-plus -- 7.78MB, lightweight general-purpose. |
| `SLANET_AUTO` | Classifier-routed SLANeXT: auto-select wired/wireless per table. Uses PP-LCNet classifier (6.78MB) + both SLANeXT variants (730MB total). |
| `DISABLED` | Disable table structure model inference entirely; use heuristic path only. |

---

#### ChunkerType

Type of text chunker to use.

| Value | Description |
|-------|-------------|
| `TEXT` | Generic text splitter; splits on whitespace and punctuation. |
| `MARKDOWN` | Markdown-aware splitter; preserves formatting and structure. |
| `YAML` | YAML-aware splitter; creates one chunk per top-level key. |
| `SEMANTIC` | Topic-aware chunker. With an `EmbeddingConfig`, splits at embedding-based topic shifts tuned by `topic_threshold` (default 0.75, lower = more splits). Without an embedding, falls back to a structural-boundary heuristic (ALL-CAPS headers, numbered sections, blank-line paragraphs) and merges groups into chunks capped at `max_characters` (default 1000). `topic_threshold` has no effect in the fallback path. For best results, pair with an embedding model. |

---

#### ChunkSizing

How chunk size is measured.

Defaults to `CHARACTERS` (Unicode character count). When using token-based sizing,
chunks are sized by token count according to the specified tokenizer.

Token-based sizing uses HuggingFace tokenizers loaded at runtime. Any tokenizer
available on HuggingFace Hub can be used, including OpenAI-compatible tokenizers
(e.g., `Xenova/gpt-4o`, `Xenova/cl100k_base`).

| Value | Description |
|-------|-------------|
| `CHARACTERS` | Size measured in Unicode characters (default). |
| `TOKENIZER` | Size measured in tokens from a HuggingFace tokenizer. — Fields: `model`: `String`, `cacheDir`: `String` |
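
When estimating character-based chunk sizes from Java, note that `String.length()` counts UTF-16 code units rather than Unicode characters. The sketch below counts code points instead, on the assumption that "Unicode characters" here means code points; the exact unit Kreuzberg uses is not spelled out in this table. `CharCountSketch` is a hypothetical helper.

```java
final class CharCountSketch {
    // Number of Unicode code points in the text (an emoji outside the BMP counts once).
    static int unicodeLength(String text) {
        return text.codePointCount(0, text.length());
    }

    // First maxChars code points of the text, never splitting a surrogate pair.
    static String head(String text, int maxChars) {
        int count = Math.min(maxChars, unicodeLength(text));
        return text.substring(0, text.offsetByCodePoints(0, count));
    }
}
```

For text that stays within the Basic Multilingual Plane the two counts agree; they differ only when supplementary characters such as most emoji are present.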

---

#### EmbeddingModelType

Embedding model types supported by Kreuzberg.

| Value | Description |
|-------|-------------|
| `PRESET` | Use a preset model configuration (recommended) — Fields: `name`: `String` |
| `CUSTOM` | Use a custom ONNX model from HuggingFace — Fields: `modelId`: `String`, `dimensions`: `long` |
| `LLM` | Provider-hosted embedding model via liter-llm. Uses the model specified in the nested `LlmConfig` (e.g., `"openai/text-embedding-3-small"`). — Fields: `llm`: `LlmConfig` |
| `PLUGIN` | In-process embedding backend registered via the plugin system. The caller registers an `EmbeddingBackend` once (e.g. a wrapper around an already-loaded `llama-cpp-python`, `sentence-transformers`, or tuned ONNX model), then references it by name in config. Kreuzberg calls back into the registered backend during chunking and standalone embed requests — no HuggingFace download, no ONNX Runtime requirement, no HTTP sidecar. When this variant is selected, only the following `EmbeddingConfig` fields apply: `normalize` (post-call L2 normalization) and `max_embed_duration_secs` (dispatcher timeout). Model-loading fields (`batch_size`, `cache_dir`, `show_download_progress`, `acceleration`) are ignored — the host owns the model lifecycle. Semantic chunking falls back to `ChunkingConfig.max_characters` when this variant is used, since there is no preset to look a chunk-size ceiling up against — size your context window via `max_characters` directly. See `register_embedding_backend`. — Fields: `name`: `String` |

---

#### CodeContentMode

Content rendering mode for code extraction.

Controls how extracted code content is represented in the `content` field
of `ExtractionResult`.

| Value | Description |
|-------|-------------|
| `CHUNKS` | Use TSLP semantic chunks as content (default). |
| `RAW` | Use raw source code as content. |
| `STRUCTURE` | Emit function/class headings + docstrings (no code bodies). |

---

#### ListType

Type of list detection.

| Value | Description |
|-------|-------------|
| `BULLET` | Bullet points (-, *, •, etc.) |
| `NUMBERED` | Numbered lists (1., 2., etc.) |
| `LETTERED` | Lettered lists (a., b., A., B., etc.) |
| `INDENTED` | Indented items |

---

#### ReductionLevel

| Value | Description |
|-------|-------------|
| `OFF` | Off |
| `LIGHT` | Light |
| `MODERATE` | Moderate |
| `AGGRESSIVE` | Aggressive |
| `MAXIMUM` | Maximum |

---

#### PdfAnnotationType

Type of PDF annotation.

| Value | Description |
|-------|-------------|
| `TEXT` | Sticky note / text annotation |
| `HIGHLIGHT` | Highlighted text region |
| `LINK` | Hyperlink annotation |
| `STAMP` | Rubber stamp annotation |
| `UNDERLINE` | Underline text markup |
| `STRIKE_OUT` | Strikeout text markup |
| `OTHER` | Any other annotation type |

---

#### BlockType

Types of block-level elements in Djot.

| Value | Description |
|-------|-------------|
| `PARAGRAPH` | Paragraph element |
| `HEADING` | Heading element |
| `BLOCKQUOTE` | Blockquote element |
| `CODE_BLOCK` | Code block |
| `LIST_ITEM` | List item |
| `ORDERED_LIST` | Ordered list |
| `BULLET_LIST` | Bullet list |
| `TASK_LIST` | Task list |
| `DEFINITION_LIST` | Definition list |
| `DEFINITION_TERM` | Definition term |
| `DEFINITION_DESCRIPTION` | Definition description |
| `DIV` | Div |
| `SECTION` | Section element |
| `THEMATIC_BREAK` | Thematic break |
| `RAW_BLOCK` | Raw block |
| `MATH_DISPLAY` | Math display |

---

#### InlineType

Types of inline elements in Djot.

| Value | Description |
|-------|-------------|
| `TEXT` | Plain text |
| `STRONG` | Strong |
| `EMPHASIS` | Emphasis |
| `HIGHLIGHT` | Highlight |
| `SUBSCRIPT` | Subscript |
| `SUPERSCRIPT` | Superscript |
| `INSERT` | Insert |
| `DELETE` | Delete |
| `CODE` | Code |
| `LINK` | Link |
| `IMAGE` | Image element |
| `SPAN` | Span |
| `MATH` | Math |
| `RAW_INLINE` | Raw inline |
| `FOOTNOTE_REF` | Footnote reference |
| `SYMBOL` | Symbol |

---

#### RelationshipKind

Semantic kind of a relationship between document elements.

| Value | Description |
|-------|-------------|
| `FOOTNOTE_REFERENCE` | Footnote marker -> footnote definition. |
| `CITATION_REFERENCE` | Citation marker -> bibliography entry. |
| `INTERNAL_LINK` | Internal anchor link (`#id`) -> target heading/element. |
| `CAPTION` | Caption paragraph -> figure/table it describes. |
| `LABEL` | Label -> labeled element (HTML `<label for>`, LaTeX `\label{}`). |
| `TOC_ENTRY` | TOC entry -> target section. |
| `CROSS_REFERENCE` | Cross-reference (LaTeX `\ref{}`, DOCX cross-reference field). |

---

#### ContentLayer

Content layer classification for document nodes.

Replaces separate body/furniture arrays with per-node granularity.

| Value | Description |
|-------|-------------|
| `BODY` | Main document body content. |
| `HEADER` | Page/section header (running header). |
| `FOOTER` | Page/section footer (running footer). |
| `FOOTNOTE` | Footnote content. |

---

#### NodeContent

Tagged enum for node content. Each variant carries only type-specific data.

Uses `#[serde(tag = "node_type")]` to avoid "type" keyword collision in
Go/Java/TypeScript bindings.

| Value | Description |
|-------|-------------|
| `TITLE` | Document title. — Fields: `text`: `String` |
| `HEADING` | Section heading with level (1-6). — Fields: `level`: `byte`, `text`: `String` |
| `PARAGRAPH` | Body text paragraph. — Fields: `text`: `String` |
| `LIST` | List container — children are `ListItem` nodes. — Fields: `ordered`: `boolean` |
| `LIST_ITEM` | Individual list item. — Fields: `text`: `String` |
| `TABLE` | Table with structured cell grid. — Fields: `grid`: `TableGrid` |
| `IMAGE` | Image reference. — Fields: `description`: `String`, `imageIndex`: `int`, `src`: `String` |
| `CODE` | Code block. — Fields: `text`: `String`, `language`: `String` |
| `QUOTE` | Block quote — container, children carry the quoted content. |
| `FORMULA` | Mathematical formula / equation. — Fields: `text`: `String` |
| `FOOTNOTE` | Footnote reference content. — Fields: `text`: `String` |
| `GROUP` | Logical grouping container (section, key-value area). `heading_level` + `heading_text` capture the section heading directly rather than relying on a first-child positional convention. — Fields: `label`: `String`, `headingLevel`: `byte`, `headingText`: `String` |
| `PAGE_BREAK` | Page break marker. |
| `SLIDE` | Presentation slide container — children are the slide's content nodes. — Fields: `number`: `int`, `title`: `String` |
| `DEFINITION_LIST` | Definition list container — children are `DefinitionItem` nodes. |
| `DEFINITION_ITEM` | Individual definition list entry with term and definition. — Fields: `term`: `String`, `definition`: `String` |
| `CITATION` | Citation or bibliographic reference. — Fields: `key`: `String`, `text`: `String` |
| `ADMONITION` | Admonition / callout container (note, warning, tip, etc.). Children carry the admonition body content. — Fields: `kind`: `String`, `title`: `String` |
| `RAW_BLOCK` | Raw block preserved verbatim from the source format. Used for content that cannot be mapped to a semantic node type (e.g. JSX in MDX, raw LaTeX in markdown, embedded HTML). — Fields: `format`: `String`, `content`: `String` |
| `METADATA_BLOCK` | Structured metadata block (email headers, YAML frontmatter, etc.). — Fields: `entries`: `List<List<String>>` |

---

#### AnnotationKind

Types of inline text annotations.

| Value | Description |
|-------|-------------|
| `BOLD` | Bold |
| `ITALIC` | Italic |
| `UNDERLINE` | Underline |
| `STRIKETHROUGH` | Strikethrough |
| `CODE` | Code |
| `SUBSCRIPT` | Subscript |
| `SUPERSCRIPT` | Superscript |
| `LINK` | Link — Fields: `url`: `String`, `title`: `String` |
| `HIGHLIGHT` | Highlighted text (PDF highlights, HTML `<mark>`). |
| `COLOR` | Text color (CSS-compatible value, e.g. "#ff0000", "red"). — Fields: `value`: `String` |
| `FONT_SIZE` | Font size with units (e.g. "12pt", "1.2em", "16px"). — Fields: `value`: `String` |
| `CUSTOM` | Extensible annotation for format-specific styling. — Fields: `name`: `String`, `value`: `String` |

---

#### ExtractionMethod

How the extracted text was produced.

| Value | Description |
|-------|-------------|
| `NATIVE` | Native |
| `OCR` | OCR |
| `MIXED` | Mixed |

---

#### ChunkType

Semantic structural classification of a text chunk.

Assigned by the heuristic classifier in `chunking.classifier`.
Defaults to `UNKNOWN` when no rule matches.
Designed to be extended in future versions without breaking changes.

| Value | Description |
|-------|-------------|
| `HEADING` | Section heading or document title. |
| `PARTY_LIST` | Party list: names, addresses, and signatories. |
| `DEFINITIONS` | Definition clause ("X means…", "X shall mean…"). |
| `OPERATIVE_CLAUSE` | Operative clause containing legal/contractual action verbs. |
| `SIGNATURE_BLOCK` | Signature block with signatures, names, and dates. |
| `SCHEDULE` | Schedule, annex, appendix, or exhibit section. |
| `TABLE_LIKE` | Table-like content with aligned columns or repeated patterns. |
| `FORMULA` | Mathematical formula or equation. |
| `CODE_BLOCK` | Code block or preformatted content. |
| `IMAGE` | Embedded or referenced image content. |
| `ORG_CHART` | Organizational chart or hierarchy diagram. |
| `DIAGRAM` | Diagram, figure, or visual illustration. |
| `UNKNOWN` | Unclassified or mixed content. |

---

#### ImageKind

Heuristic classification of what an image likely depicts.

| Value | Description |
|-------|-------------|
| `PHOTOGRAPH` | Photographic image (natural scene, photograph) |
| `DIAGRAM` | Technical or schematic diagram |
| `CHART` | Chart, graph, or plot |
| `DRAWING` | Freehand or technical drawing |
| `TEXT_BLOCK` | Text-heavy image (scanned text, document) |
| `DECORATION` | Decorative element or border |
| `LOGO` | Logo or brand mark |
| `ICON` | Small icon |
| `TILE_FRAGMENT` | Fragment of a larger tiled image (tile of a technical drawing) |
| `MASK` | Mask or transparency map |
| `PAGE_RASTER` | Full-page render produced during OCR preprocessing; used as a citation thumbnail. |
| `UNKNOWN` | Could not classify with reasonable confidence |

---

#### ResultFormat

Result-shape selection for extraction results.

Distinct from `OutputFormat` (which controls rendering: Plain, Markdown,
HTML, etc.). `ResultFormat` controls the *shape* of the result: a unified content
blob vs. an element-based decomposition.

| Value | Description |
|-------|-------------|
| `UNIFIED` | Unified format with all content in `content` field |
| `ELEMENT_BASED` | Element-based format with semantic element extraction |

---

#### ElementType

Semantic element type classification.

Categorizes text content into semantic units for downstream processing.
Supports the element types commonly found in Unstructured documents.

| Value | Description |
|-------|-------------|
| `TITLE` | Document title |
| `NARRATIVE_TEXT` | Main narrative text body |
| `HEADING` | Section heading |
| `LIST_ITEM` | List item (bullet, numbered, etc.) |
| `TABLE` | Table element |
| `IMAGE` | Image element |
| `PAGE_BREAK` | Page break marker |
| `CODE_BLOCK` | Code block |
| `BLOCK_QUOTE` | Block quote |
| `FOOTER` | Footer text |
| `HEADER` | Header text |

---

#### FormatMetadata

Format-specific metadata (discriminated union).

Only one format type can exist per extraction result. This provides
type-safe, clean metadata without nested optionals.

| Value | Description |
|-------|-------------|
| `PDF` | PDF format — Fields: `0`: `PdfMetadata` |
| `DOCX` | DOCX format — Fields: `0`: `DocxMetadata` |
| `EXCEL` | Excel format — Fields: `0`: `ExcelMetadata` |
| `EMAIL` | Email format — Fields: `0`: `EmailMetadata` |
| `PPTX` | PPTX format — Fields: `0`: `PptxMetadata` |
| `ARCHIVE` | Archive format — Fields: `0`: `ArchiveMetadata` |
| `IMAGE` | Image format — Fields: `0`: `ImageMetadata` |
| `XML` | XML format — Fields: `0`: `XmlMetadata` |
| `TEXT` | Text format — Fields: `0`: `TextMetadata` |
| `HTML` | HTML format — Fields: `0`: `HtmlMetadata` |
| `OCR` | OCR — Fields: `0`: `OcrMetadata` |
| `CSV` | CSV format — Fields: `0`: `CsvMetadata` |
| `BIBTEX` | BibTeX format — Fields: `0`: `BibtexMetadata` |
| `CITATION` | Citation — Fields: `0`: `CitationMetadata` |
| `FICTION_BOOK` | FictionBook format — Fields: `0`: `FictionBookMetadata` |
| `DBF` | DBF format — Fields: `0`: `DbfMetadata` |
| `JATS` | JATS format — Fields: `0`: `JatsMetadata` |
| `EPUB` | EPUB format — Fields: `0`: `EpubMetadata` |
| `PST` | PST format — Fields: `0`: `PstMetadata` |
| `CODE` | Code — Fields: `0`: `String` |

---

#### TextDirection

Text direction enumeration for HTML documents.

| Value | Description |
|-------|-------------|
| `LEFT_TO_RIGHT` | Left-to-right text direction |
| `RIGHT_TO_LEFT` | Right-to-left text direction |
| `AUTO` | Automatic text direction detection |

---

#### LinkType

Link type classification.

| Value | Description |
|-------|-------------|
| `ANCHOR` | Anchor link (#section) |
| `INTERNAL` | Internal link (same domain) |
| `EXTERNAL` | External link (different domain) |
| `EMAIL` | Email link (mailto:) |
| `PHONE` | Phone link (tel:) |
| `OTHER` | Other link type |
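
The categories in this table can be approximated with a few prefix and host checks. The sketch below is an assumption about reasonable rules, not Kreuzberg's actual classifier; `LinkKindSketch`, `LinkClassifierSketch`, and the `pageHost` parameter are hypothetical. Relative URLs are treated as internal, and non-HTTP schemes fall through to `OTHER`.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical mirror of the LinkType values.
enum LinkKindSketch { ANCHOR, INTERNAL, EXTERNAL, EMAIL, PHONE, OTHER }

final class LinkClassifierSketch {
    // Classifies an href relative to the host of the page it was found on.
    static LinkKindSketch classify(String href, String pageHost) {
        String lower = href.toLowerCase(Locale.ROOT);
        if (lower.startsWith("#")) return LinkKindSketch.ANCHOR;
        if (lower.startsWith("mailto:")) return LinkKindSketch.EMAIL;
        if (lower.startsWith("tel:")) return LinkKindSketch.PHONE;
        try {
            URI uri = new URI(href);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            boolean web = scheme == null
                || scheme.equalsIgnoreCase("http")
                || scheme.equalsIgnoreCase("https");
            if (!web) return LinkKindSketch.OTHER;
            if (host == null) {
                // No host: a relative URL stays on the same site.
                return scheme == null ? LinkKindSketch.INTERNAL : LinkKindSketch.OTHER;
            }
            return host.equalsIgnoreCase(pageHost)
                ? LinkKindSketch.INTERNAL
                : LinkKindSketch.EXTERNAL;
        } catch (URISyntaxException e) {
            // Unparseable hrefs are not guessed at.
            return LinkKindSketch.OTHER;
        }
    }
}
```

The prefix checks run first because `mailto:` and `tel:` URIs have no host component to compare.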

---

#### ImageType

Image type classification.

| Value | Description |
|-------|-------------|
| `DATA_URI` | Data URI image |
| `INLINE_SVG` | Inline SVG |
| `EXTERNAL` | External image URL |
| `RELATIVE` | Relative path image |

---

#### StructuredDataType

Structured data type classification.

| Value | Description |
|-------|-------------|
| `JSON_LD` | JSON-LD structured data |
| `MICRODATA` | Microdata |
| `RDFA` | RDFa |

---

#### OcrBoundingGeometry

Bounding geometry for an OCR element.

Supports both axis-aligned rectangles (from Tesseract) and 4-point quadrilaterals
(from PaddleOCR and rotated text detection).

| Value | Description |
|-------|-------------|
| `RECTANGLE` | Axis-aligned bounding box (typical for Tesseract output). — Fields: `left`: `int`, `top`: `int`, `width`: `int`, `height`: `int` |
| `QUADRILATERAL` | 4-point quadrilateral for rotated/skewed text (PaddleOCR). Points are in clockwise order starting from top-left: `[top_left, top_right, bottom_right, bottom_left]` — Fields: `points`: `String` |
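
To compare the two geometries, a rectangle can be expanded into the same clockwise point order that the quadrilateral variant documents. The sketch below assumes image coordinates with `y` growing downward and represents each point as an `{x, y}` pair; `QuadSketch` is a hypothetical helper, and the binding's own encoding of `points` (listed above as `String`) is not reproduced here.

```java
final class QuadSketch {
    // Returns {top_left, top_right, bottom_right, bottom_left}, each as {x, y}.
    static int[][] fromRectangle(int left, int top, int width, int height) {
        int right = left + width;
        int bottom = top + height;
        return new int[][] {
            {left, top},
            {right, top},
            {right, bottom},
            {left, bottom}
        };
    }
}
```

Normalising rectangles this way lets downstream code handle Tesseract and PaddleOCR output through a single four-point code path.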

---

#### OcrElementLevel

Hierarchical level of an OCR element.

Maps to Tesseract's page segmentation hierarchy and provides
equivalent semantics for PaddleOCR.

| Value | Description |
|-------|-------------|
| `WORD` | Individual word |
| `LINE` | Line of text (default for PaddleOCR) |
| `BLOCK` | Paragraph or text block |
| `PAGE` | Page-level element |

---

#### PageUnitType

Type of paginated unit in a document.

Distinguishes between different types of "pages" (PDF pages, presentation slides, spreadsheet sheets).

| Value | Description |
|-------|-------------|
| `PAGE` | Standard document pages (PDF, DOCX, images) |
| `SLIDE` | Presentation slides (PPTX, ODP) |
| `SHEET` | Spreadsheet sheets (XLSX, ODS) |

---

#### DiffLine

A single line in a unified-diff hunk.

Defined here (rather than only in `crate.diff`) so `RevisionDelta` can
reference it unconditionally, without requiring the `diff` Cargo feature.
`crate.diff` re-exports this type verbatim.

| Value | Description |
|-------|-------------|
| `CONTEXT` | Unchanged context line. — Fields: `0`: `String` |
| `ADDED` | Line added in the "after" version. — Fields: `0`: `String` |
| `REMOVED` | Line removed from the "before" version. — Fields: `0`: `String` |
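
In unified-diff text, context lines are prefixed with a space, added lines with `+`, and removed lines with `-`. The sketch below renders a hunk body that way. `DiffLineSketch` is a hypothetical stand-in that pairs a kind with its text; how the Java binding actually models the single `String` field of each variant is not shown in this reference.

```java
import java.util.List;

final class DiffLineSketch {
    enum Kind { CONTEXT, ADDED, REMOVED }

    final Kind kind;
    final String text;

    DiffLineSketch(Kind kind, String text) {
        this.kind = kind;
        this.text = text;
    }

    // Applies the unified-diff prefix for this line's kind.
    String render() {
        switch (kind) {
            case ADDED:
                return "+" + text;
            case REMOVED:
                return "-" + text;
            default:
                return " " + text;
        }
    }

    // Renders the lines of one hunk, one per output line.
    static String renderHunk(List<DiffLineSketch> lines) {
        StringBuilder out = new StringBuilder();
        for (DiffLineSketch line : lines) {
            out.append(line.render()).append('\n');
        }
        return out.toString();
    }
}
```

The hunk header (`@@ -a,b +c,d @@`) is left out because it depends on line numbers that `DiffLine` itself does not carry.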
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
#### RevisionKind
|
|||
|
|
|
|||
|
|
Semantic classification of a tracked change.
|
|||
|
|
|
|||
|
|
| Value | Description |
|
|||
|
|
|-------|-------------|
|
|||
|
|
| `INSERTION` | Text or content was inserted. |
|
|||
|
|
| `DELETION` | Text or content was deleted. |
|
|||
|
|
| `FORMAT_CHANGE` | Run-level formatting (font, size, colour, …) was changed. |
|
|||
|
|
| `COMMENT` | A reviewer comment or annotation. |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
#### RevisionAnchor
|
|||
|
|
|
|||
|
|
Best-effort document location for a revision.
|
|||
|
|
|
|||
|
|
| Value | Description |
|
|||
|
|
|-------|-------------|
|
|||
|
|
| `PARAGRAPH` | Body paragraph, identified by its zero-based index in the document flow. — Fields: `index`: `long` |
|
|||
|
|
| `TABLE_CELL` | Cell inside a table. — Fields: `row`: `long`, `col`: `long`, `tableIndex`: `long` |
|
|||
|
|
| `PAGE` | Page, identified by its zero-based index. — Fields: `index`: `long` |
|
|||
|
|
| `SLIDE` | Presentation slide, identified by its zero-based index. — Fields: `index`: `long` |
|
|||
|
|
| `SHEET` | Spreadsheet cell or range, identified by sheet index and optional name. — Fields: `index`: `long`, `name`: `String` |
|
|||
|
|
|
|||
|
|
---

#### UriKind

Semantic classification of an extracted URI.

| Value | Description |
|-------|-------------|
| `HYPERLINK` | A clickable hyperlink (web URL, file link). |
| `IMAGE` | An image or media resource reference. |
| `ANCHOR` | An internal anchor or cross-reference target. |
| `CITATION` | A citation or bibliographic reference (DOI, academic ref). |
| `REFERENCE` | A general reference (e.g. `\ref{}` in LaTeX, `:ref:` in RST). |
| `EMAIL` | An email address (`mailto:` link or bare email). |

---

#### KeywordAlgorithm

Selects the keyword extraction algorithm.

| Value | Description |
|-------|-------------|
| `YAKE` | YAKE (Yet Another Keyword Extractor), a statistical approach. |
| `RAKE` | RAKE (Rapid Automatic Keyword Extraction), a co-occurrence-based approach. |

---

#### PsmMode

Page segmentation mode (PSM) for Tesseract OCR.

| Value | Description |
|-------|-------------|
| `OSD_ONLY` | Orientation and script detection (OSD) only. |
| `AUTO_OSD` | Automatic page segmentation with OSD. |
| `AUTO_ONLY` | Automatic page segmentation, without OSD or OCR. |
| `AUTO` | Fully automatic page segmentation, without OSD. |
| `SINGLE_COLUMN` | A single column of text of variable sizes. |
| `SINGLE_BLOCK_VERTICAL` | A single uniform block of vertically aligned text. |
| `SINGLE_BLOCK` | A single uniform block of text. |
| `SINGLE_LINE` | A single text line. |
| `SINGLE_WORD` | A single word. |
| `CIRCLE_WORD` | A single word in a circle. |
| `SINGLE_CHAR` | A single character. |

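
The eleven values listed here appear in the same order as Tesseract's numeric `--psm` values 0 through 10. The sketch below makes that correspondence explicit. It uses a local, hypothetical `Mode` enum rather than the library's `PsmMode`, and the assumption that Kreuzberg maps each constant to exactly this number is not confirmed by this reference. `cliArgs` is an illustrative helper, not part of the API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical mirror of PsmMode that pairs each constant with the numeric
// page segmentation mode used by the Tesseract command line (--psm N).
final class PsmModeSketch {
    enum Mode {
        OSD_ONLY(0),
        AUTO_OSD(1),
        AUTO_ONLY(2),
        AUTO(3),
        SINGLE_COLUMN(4),
        SINGLE_BLOCK_VERTICAL(5),
        SINGLE_BLOCK(6),
        SINGLE_LINE(7),
        SINGLE_WORD(8),
        CIRCLE_WORD(9),
        SINGLE_CHAR(10);

        // Assumed mapping: declaration order equals Tesseract's numbering.
        final int tesseractValue;

        Mode(int tesseractValue) {
            this.tesseractValue = tesseractValue;
        }
    }

    // Builds the two command-line tokens that select the given mode.
    static List<String> cliArgs(Mode mode) {
        return Arrays.asList("--psm", Integer.toString(mode.tesseractValue));
    }

    public static void main(String[] args) {
        System.out.println(cliArgs(Mode.SINGLE_LINE));
    }
}
```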
---

#### PaddleLanguage

Supported languages in PaddleOCR.

Maps user-friendly language codes to paddle-ocr-rs language identifiers.

| Value | Description |
|-------|-------------|
| `ENGLISH` | English |
| `CHINESE` | Simplified Chinese |
| `JAPANESE` | Japanese |
| `KOREAN` | Korean |
| `GERMAN` | German |
| `FRENCH` | French |
| `LATIN` | Latin script (covers most European languages) |
| `CYRILLIC` | Cyrillic (Russian and related) |
| `TRADITIONAL_CHINESE` | Traditional Chinese |
| `THAI` | Thai |
| `GREEK` | Greek |
| `EAST_SLAVIC` | East Slavic (Russian, Ukrainian, Belarusian) |
| `ARABIC` | Arabic (Arabic, Persian, Urdu) |
| `DEVANAGARI` | Devanagari (Hindi, Marathi, Sanskrit, Nepali) |
| `TAMIL` | Tamil |
| `TELUGU` | Telugu |

---

#### LayoutClass

The 17 canonical document layout classes.

All model backends (RT-DETR, YOLO, etc.) map their native class IDs to this shared set. Models with fewer classes (DocLayNet: 11, PubLayNet: 5) map to the closest equivalent.

Wire format is snake_case in all serializers (JSON, TOML, YAML).

| Value | Description |
|-------|-------------|
| `CAPTION` | Caption element |
| `FOOTNOTE` | Footnote element |
| `FORMULA` | Formula |
| `LIST_ITEM` | List item |
| `PAGE_FOOTER` | Page footer |
| `PAGE_HEADER` | Page header |
| `PICTURE` | Picture |
| `SECTION_HEADER` | Section header |
| `TABLE` | Table element |
| `TEXT` | Text element |
| `TITLE` | Title element |
| `DOCUMENT_INDEX` | Document index |
| `CODE` | Code |
| `CHECKBOX_SELECTED` | Checkbox selected |
| `CHECKBOX_UNSELECTED` | Checkbox unselected |
| `FORM` | Form |
| `KEY_VALUE_REGION` | Key value region |

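
Because the wire format is snake_case while the Java constants are upper case, serialized output will not match `name()` directly. The sketch below shows one way to convert between the two forms. It assumes the wire name is simply the lower-cased constant name (for example `key_value_region`), which this reference implies but does not spell out. The local enum and the helpers `toWire` and `fromWire` are hypothetical and not part of the library.

```java
import java.util.Locale;

// Hypothetical mirror of LayoutClass with helpers for the snake_case wire form.
final class LayoutClassWireSketch {
    enum LayoutClass {
        CAPTION, FOOTNOTE, FORMULA, LIST_ITEM, PAGE_FOOTER, PAGE_HEADER,
        PICTURE, SECTION_HEADER, TABLE, TEXT, TITLE, DOCUMENT_INDEX, CODE,
        CHECKBOX_SELECTED, CHECKBOX_UNSELECTED, FORM, KEY_VALUE_REGION
    }

    // Assumption: the wire name is the constant name in lower case.
    static String toWire(LayoutClass layoutClass) {
        return layoutClass.name().toLowerCase(Locale.ROOT);
    }

    // Throws IllegalArgumentException if the wire name matches no constant.
    static LayoutClass fromWire(String wireName) {
        return LayoutClass.valueOf(wireName.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(toWire(LayoutClass.KEY_VALUE_REGION));
        System.out.println(fromWire("page_header"));
    }
}
```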
---

### Errors

#### KreuzbergError

Main error type for all Kreuzberg operations.

All errors in Kreuzberg use this enum, which preserves error chains and provides context for debugging.

**Variants:**

- `Io` - File system and I/O errors (always bubble up)
- `Parsing` - Document parsing errors (corrupt files, unsupported features)
- `Ocr` - OCR processing errors
- `Validation` - Input validation errors (invalid paths, config, parameters)
- `Cache` - Cache operation errors (non-fatal, can be ignored)
- `ImageProcessing` - Image manipulation errors
- `Serialization` - JSON/MessagePack serialization errors
- `MissingDependency` - Missing optional dependencies (tesseract, etc.)
- `Plugin` - Plugin-specific errors
- `LockPoisoned` - Mutex/RwLock poisoning (should not happen in normal operation)
- `UnsupportedFormat` - Unsupported MIME type or file format
- `Other` - Catch-all for uncommon errors

| Variant | Message format |
|---------|----------------|
| `IO` | IO error: {0} |
| `PARSING` | Parsing error: {message} |
| `OCR` | OCR error: {message} |
| `VALIDATION` | Validation error: {message} |
| `CACHE` | Cache error: {message} |
| `IMAGE_PROCESSING` | Image processing error: {message} |
| `SERIALIZATION` | Serialization error: {message} |
| `MISSING_DEPENDENCY` | Missing dependency: {0} |
| `PLUGIN` | Plugin error in '{plugin_name}': {message} |
| `LOCK_POISONED` | Lock poisoned: {0} |
| `UNSUPPORTED_FORMAT` | Unsupported format: {0} |
| `EMBEDDING` | Embedding error: {message} |
| `TIMEOUT` | Extraction timed out after {elapsed_ms}ms (limit: {limit_ms}ms) |
| `CANCELLED` | Extraction cancelled |
| `SECURITY` | Security violation: {message} |
| `OTHER` | {0} |

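
The entries in the variant table are message templates: `{0}` stands for a positional value and names such as `{message}` or `{elapsed_ms}` stand for named fields. The sketch below illustrates how such a template expands into the final message. It is a plain string substitution written for illustration, with the hypothetical names `ErrorMessageSketch` and `render`. It does not show how the library builds its messages internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: expands {placeholder} tokens in a message template.
final class ErrorMessageSketch {
    // Replaces each "{key}" in the template with the matching field value.
    // Placeholders without a matching field are left untouched.
    static String render(String template, Map<String, String> fields) {
        String out = template;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            out = out.replace("{" + field.getKey() + "}", field.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("elapsed_ms", "1500");
        fields.put("limit_ms", "1000");
        // Template copied from the TIMEOUT row of the table.
        System.out.println(render(
                "Extraction timed out after {elapsed_ms}ms (limit: {limit_ms}ms)", fields));
    }
}
```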
---
|