---
title: "Java API Reference"
---
## Java API Reference v5.0.0-rc.3
### Functions
#### extractBytes()
Extract content from a byte array.
This is the main entry point for in-memory extraction. It performs the following steps:
1. Validate MIME type
2. Handle legacy format conversion if needed
3. Select appropriate extractor from registry
4. Extract content
5. Run post-processing pipeline
**Returns:**
An `ExtractionResult` containing the extracted content and metadata.
**Errors:**
Returns `KreuzbergError.Validation` if MIME type is invalid.
Returns `KreuzbergError.UnsupportedFormat` if MIME type is not supported.
**Signature:**
```java
public static ExtractionResult extractBytes(byte[] content, String mimeType, ExtractionConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `content` | `byte[]` | Yes | The byte array to extract |
| `mimeType` | `String` | Yes | MIME type of the content |
| `config` | `ExtractionConfig` | Yes | Extraction configuration |
**Returns:** `ExtractionResult`
**Errors:** Throws `ErrorException`.
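
The five steps listed above can be pictured with a small self-contained sketch. Everything in it is a hypothetical stand-in (the `ExtractBytesSketch` class, its registry, the MIME pattern, and the exception types are assumptions for illustration); it is not the library's implementation and does not call the real `extractBytes()`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical model of the documented steps; not the library's code. */
final class ExtractBytesSketch {
    // Assumed registry: maps a MIME type to an extractor function.
    private static final Map<String, Function<byte[], String>> REGISTRY = Map.of(
        "text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));

    static String extract(byte[] content, String mimeType) {
        // Step 1: validate the MIME type (assumed shape: "type/subtype").
        if (mimeType == null || !mimeType.matches("[\\w.+-]+/[\\w.+-]+")) {
            throw new IllegalArgumentException("invalid MIME type: " + mimeType);
        }
        // Step 2 (legacy format conversion) is omitted in this sketch.
        // Step 3: select an extractor from the registry.
        Function<byte[], String> extractor = REGISTRY.get(mimeType);
        if (extractor == null) {
            throw new UnsupportedOperationException("unsupported MIME type: " + mimeType);
        }
        // Step 4: extract. Step 5: post-process (here, only whitespace trimming).
        return extractor.apply(content).strip();
    }

    public static void main(String[] args) {
        byte[] data = "  hello  ".getBytes(StandardCharsets.UTF_8);
        System.out.println(extract(data, "text/plain"));
    }
}
```

The two exception types stand in for the validation and unsupported-format errors described above.
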
---
#### extractFile()
Extract content from a file.
This is the main entry point for file-based extraction. It performs the following steps:
1. Check cache for existing result (if caching enabled)
2. Detect or validate MIME type
3. Select appropriate extractor from registry
4. Extract content
5. Run post-processing pipeline
6. Store result in cache (if caching enabled)
**Returns:**
An `ExtractionResult` containing the extracted content and metadata.
**Errors:**
Returns `KreuzbergError.Io` if the file doesn't exist (NotFound) or for other file I/O errors.
Returns `KreuzbergError.UnsupportedFormat` if MIME type is not supported.
**Signature:**
```java
public static ExtractionResult extractFile(String path, String mimeType, ExtractionConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `path` | `String` | Yes | Path to the file to extract |
| `mimeType` | `Optional` | No | Optional MIME type override. If not provided, the MIME type is auto-detected |
| `config` | `ExtractionConfig` | Yes | Extraction configuration |
**Returns:** `ExtractionResult`
**Errors:** Throws `ErrorException`.
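
The cache check and cache store around the extraction (steps 1 and 6 above) can be sketched as follows. This is a minimal model under stated assumptions: the `CachedExtractionSketch` class is hypothetical, and keying the cache on the path alone is an assumption made for brevity. The real cache key is not described in this reference.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical cache-around-extraction model; not the library's code. */
final class CachedExtractionSketch {
    // Assumption: the cache is keyed on the path string only.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> extractor;
    final AtomicInteger extractorCalls = new AtomicInteger();

    CachedExtractionSketch(Function<String, String> extractor) {
        this.extractor = extractor;
    }

    String extract(String path, boolean useCache) {
        if (useCache) {
            // Step 1: return an existing result when one is cached.
            String cached = cache.get(path);
            if (cached != null) {
                return cached;
            }
        }
        // Steps 2 to 5 are collapsed into a single extractor call here.
        extractorCalls.incrementAndGet();
        String result = extractor.apply(path);
        if (useCache) {
            // Step 6: store the result for later calls.
            cache.put(path, result);
        }
        return result;
    }

    public static void main(String[] args) {
        CachedExtractionSketch sketch = new CachedExtractionSketch(p -> "content of " + p);
        sketch.extract("report.txt", true);
        sketch.extract("report.txt", true);
        System.out.println("extractor calls: " + sketch.extractorCalls.get());
    }
}
```
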
---
#### extractFileSync()
Synchronous wrapper for `extractFile`.
This is a convenience function that blocks the current thread until extraction completes.
For async code, use `extractFile` directly.
Uses the global Tokio runtime for 100x+ performance improvement over creating
a new runtime per call. Always uses the global runtime to avoid nested runtime issues.
This function is only available with the `tokio-runtime` feature. For WASM targets,
use a truly synchronous extraction approach instead.
**Signature:**
```java
public static ExtractionResult extractFileSync(String path, String mimeType, ExtractionConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `path` | `String` | Yes | Path to the file |
| `mimeType` | `Optional` | No | Optional MIME type override. If not provided, the MIME type is auto-detected |
| `config` | `ExtractionConfig` | Yes | The configuration options |
**Returns:** `ExtractionResult`
**Errors:** Throws `ErrorException`.
---
#### extractBytesSync()
Synchronous wrapper for `extractBytes`.
Uses the global Tokio runtime for 100x+ performance improvement over creating
a new runtime per call.
With the `tokio-runtime` feature, this blocks the current thread using the global
Tokio runtime. Without it (WASM), this calls a truly synchronous implementation.
**Signature:**
```java
public static ExtractionResult extractBytesSync(byte[] content, String mimeType, ExtractionConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `content` | `byte[]` | Yes | The byte array to extract |
| `mimeType` | `String` | Yes | MIME type of the content |
| `config` | `ExtractionConfig` | Yes | The configuration options |
**Returns:** `ExtractionResult`
**Errors:** Throws `ErrorException`.
---
#### batchExtractFilesSync()
Synchronous wrapper for `batchExtractFiles`.
Uses the global Tokio runtime for optimal performance.
Only available with `tokio-runtime` (WASM has no filesystem).
**Signature:**
```java
public static List batchExtractFilesSync(List items, ExtractionConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `List` | Yes | List of `BatchFileItem` entries, each containing a path and optional per-file configuration overrides |
| `config` | `ExtractionConfig` | Yes | The configuration options |
**Returns:** `List`
**Errors:** Throws `ErrorException`.
---
#### batchExtractBytesSync()
Synchronous wrapper for `batchExtractBytes`.
Uses the global Tokio runtime for optimal performance.
With the `tokio-runtime` feature, this blocks the current thread using the global
Tokio runtime. Without it (WASM), this calls a truly synchronous implementation
that iterates through items and calls `extractBytesSync()`.
**Signature:**
```java
public static List batchExtractBytesSync(List items, ExtractionConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
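| `items` | `List` | Yes | List of `BatchBytesItem` entries, each containing content bytes, a MIME type, and optional per-item configuration overrides |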
| `config` | `ExtractionConfig` | Yes | The configuration options |
**Returns:** `List`
**Errors:** Throws `ErrorException`.
---
#### batchExtractFiles()
Extract content from multiple files concurrently.
This function processes multiple files in parallel, automatically managing
concurrency to prevent resource exhaustion. The concurrency limit can be
configured via `ExtractionConfig.max_concurrent_extractions` or defaults
to `(num_cpus * 1.5).ceil()`.
Each file can optionally specify a `FileExtractionConfig` that overrides specific
fields from the batch-level `config`. Pass `null` for a file to use the batch defaults.
Batch-level settings like `max_concurrent_extractions` and `use_cache` are always
taken from the batch-level `config`.
**Returns:**
A list of `ExtractionResult` in the same order as the input items.
**Errors:**
Individual file errors are captured in the result metadata. System errors
(IO, RuntimeError equivalents) will bubble up and fail the entire batch.
**Signature:**
```java
public static List batchExtractFiles(List items, ExtractionConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `List` | Yes | List of `BatchFileItem` entries, each containing a path and optional per-file configuration overrides |
| `config` | `ExtractionConfig` | Yes | Batch-level extraction configuration (provides defaults and batch settings) |
**Returns:** `List`
**Errors:** Throws `ErrorException`.
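
As a worked example of the default concurrency limit described above, the sketch below computes `ceil(cpus * 1.5)` in plain Java. The `ConcurrencyLimitSketch` class and the use of `null` to mean "not configured" are assumptions for illustration; the library computes its own default internally.

```java
/** Hypothetical helper mirroring the documented default; not the library's code. */
final class ConcurrencyLimitSketch {
    /** Default limit: the CPU count times 1.5, rounded up. */
    static int defaultLimit(int cpus) {
        return (int) Math.ceil(cpus * 1.5);
    }

    /** Assumption: a null configured value means "use the default". */
    static int effectiveLimit(Integer configured, int cpus) {
        return configured != null ? configured : defaultLimit(cpus);
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("cpus=" + cpus + " default limit=" + defaultLimit(cpus));
    }
}
```

On a 4-core machine this gives a limit of 6, and on a 3-core machine a limit of 5.
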
---
#### batchExtractBytes()
Extract content from multiple byte arrays concurrently.
This function processes multiple byte arrays in parallel, automatically managing
concurrency to prevent resource exhaustion. The concurrency limit can be
configured via `ExtractionConfig.max_concurrent_extractions` or defaults
to `(num_cpus * 1.5).ceil()`.
Each item can optionally specify a `FileExtractionConfig` that overrides specific
fields from the batch-level `config`. Pass `null` as the config to use
the batch-level defaults for that item.
**Returns:**
A list of `ExtractionResult` in the same order as the input items.
**Signature:**
```java
public static List batchExtractBytes(List items, ExtractionConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `items` | `List` | Yes | List of `BatchBytesItem` entries, each containing content bytes, a MIME type, and optional per-item configuration overrides |
| `config` | `ExtractionConfig` | Yes | Batch-level extraction configuration |
**Returns:** `List`
**Errors:** Throws `ErrorException`.
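
The per-item override rule described above (item-level fields win where set, `null` falls back to the batch defaults) can be sketched like this. The `BatchDefaults` and `ItemOverrides` classes and their `language` and `dpi` fields are hypothetical placeholders, not the real fields of `ExtractionConfig` or `FileExtractionConfig`.

```java
/** Hypothetical model of override resolution; field names are illustrative only. */
final class OverrideMergeSketch {
    static final class BatchDefaults {
        final String language;
        final int dpi;

        BatchDefaults(String language, int dpi) {
            this.language = language;
            this.dpi = dpi;
        }
    }

    static final class ItemOverrides {
        final String language; // null means "inherit from the batch"
        final Integer dpi;     // null means "inherit from the batch"

        ItemOverrides(String language, Integer dpi) {
            this.language = language;
            this.dpi = dpi;
        }
    }

    /** A null overrides object means the item uses the batch defaults as-is. */
    static BatchDefaults resolve(BatchDefaults defaults, ItemOverrides overrides) {
        if (overrides == null) {
            return defaults;
        }
        return new BatchDefaults(
            overrides.language != null ? overrides.language : defaults.language,
            overrides.dpi != null ? overrides.dpi : defaults.dpi);
    }

    public static void main(String[] args) {
        BatchDefaults defaults = new BatchDefaults("eng", 150);
        BatchDefaults merged = resolve(defaults, new ItemOverrides(null, 300));
        System.out.println(merged.language + " @ " + merged.dpi + " dpi");
    }
}
```
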
---
#### detectMimeTypeFromBytes()
Detect MIME type from raw file bytes.
Uses magic byte signatures to detect file type from content.
Falls back to `infer` crate for comprehensive detection.
For ZIP-based files, inspects contents to distinguish Office Open XML
formats (DOCX, XLSX, PPTX) from plain ZIP archives.
**Returns:**
The detected MIME type string.
**Errors:**
Returns `KreuzbergError.UnsupportedFormat` if MIME type cannot be determined.
**Signature:**
```java
public static String detectMimeTypeFromBytes(byte[] content) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `content` | `byte[]` | Yes | Raw file bytes |
**Returns:** `String`
**Errors:** Throws `ErrorException`.
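
To illustrate what magic-byte detection means, the sketch below checks three well-known file signatures (PDF, PNG, ZIP). The `MagicByteSketch` class is a hypothetical stand-in and covers far fewer formats than the library; in particular it stops at a generic ZIP result and does not inspect ZIP contents to recognize DOCX, XLSX, or PPTX.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical signature sniffer; not the library's detection logic. */
final class MagicByteSketch {
    /** Returns true when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    static Optional<String> sniff(byte[] data) {
        // "%PDF" at offset 0.
        if (startsWith(data, 0x25, 0x50, 0x44, 0x46)) {
            return Optional.of("application/pdf");
        }
        // Eight-byte PNG signature.
        if (startsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) {
            return Optional.of("image/png");
        }
        // ZIP local file header "PK\3\4".
        if (startsWith(data, 0x50, 0x4B, 0x03, 0x04)) {
            return Optional.of("application/zip");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHeader).orElse("unknown"));
    }
}
```
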
---
#### getExtensionsForMime()
Get file extensions for a given MIME type.
Returns all known file extensions that map to the specified MIME type.
**Returns:**
A list of file extensions (without leading dot) for the MIME type.
**Signature:**
```java
public static List getExtensionsForMime(String mimeType) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `mimeType` | `String` | Yes | The MIME type to look up |
**Returns:** `List`
**Errors:** Throws `ErrorException`.
---
#### clearEmbeddingBackends()
Clear all embedding backends from the global registry.
Calls `shutdown()` on every registered backend, then empties the registry.
**Errors:**
- Any error returned by a backend's `shutdown()` method. The first error
encountered stops processing of remaining backends.
**Signature:**
```java
public static void clearEmbeddingBackends() throws Error
```
**Returns:** `void`
**Errors:** Throws `ErrorException`.
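
The shutdown-then-clear behavior described above, including the stop-at-first-error rule, can be modeled as below. The `RegistryClearSketch` class and its `Backend` interface are hypothetical. Whether the real registry is emptied when a shutdown fails is not stated in this reference, so the sketch assumes it is left untouched in that case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry model; not the library's registry. */
final class RegistryClearSketch {
    interface Backend {
        void shutdown() throws Exception;
    }

    private final Map<String, Backend> backends = new LinkedHashMap<>();

    void register(String name, Backend backend) {
        backends.put(name, backend);
    }

    List<String> names() {
        return new ArrayList<>(backends.keySet());
    }

    void clear() throws Exception {
        for (Backend backend : backends.values()) {
            // The first failure propagates, so later backends are not shut down.
            backend.shutdown();
        }
        // Assumption: reached only when every shutdown succeeded.
        backends.clear();
    }

    public static void main(String[] args) throws Exception {
        RegistryClearSketch registry = new RegistryClearSketch();
        registry.register("a", () -> System.out.println("a shut down"));
        registry.register("b", () -> System.out.println("b shut down"));
        registry.clear();
        System.out.println("remaining: " + registry.names().size());
    }
}
```
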
---
#### listEmbeddingBackends()
List the names of all registered embedding backends.
Used by `kreuzberg-cli`, the api/mcp endpoints, and generated language
bindings.
**Signature:**
```java
public static List listEmbeddingBackends() throws Error
```
**Returns:** `List`
**Errors:** Throws `ErrorException`.
---
#### listDocumentExtractors()
List names of all registered document extractors.
**Signature:**
```java
public static List listDocumentExtractors() throws Error
```
**Returns:** `List`
**Errors:** Throws `ErrorException`.
---
#### clearDocumentExtractors()
Clear all document extractors from the global registry.
Calls `shutdown()` on every registered extractor, then empties the registry.
**Errors:**
- Any error returned by an extractor's `shutdown()` method. The first error
encountered stops processing of remaining extractors.
**Signature:**
```java
public static void clearDocumentExtractors() throws Error
```
**Returns:** `void`
**Errors:** Throws `ErrorException`.
---
#### listOcrBackends()
List all registered OCR backends.
Returns the names of all OCR backends currently registered in the global registry.
**Returns:**
A list of OCR backend names.
**Signature:**
```java
public static List listOcrBackends() throws Error
```
**Returns:** `List`
**Errors:** Throws `ErrorException`.
---
#### clearOcrBackends()
Clear all OCR backends from the global registry.
Removes all OCR backends and calls their `shutdown()` methods.
**Returns:**
- Returns normally if all backends were cleared successfully
- Throws if any `shutdown()` method failed
**Signature:**
```java
public static void clearOcrBackends() throws Error
```
**Returns:** `void`
**Errors:** Throws `ErrorException`.
---
#### listPostProcessors()
List all registered post-processor names.
Returns a list of all post-processor names currently registered in the
global registry.
**Returns:**
- A list of post-processor names on success
- Throws if the registry lock is poisoned
**Signature:**
```java
public static List listPostProcessors() throws Error
```
**Returns:** `List`
**Errors:** Throws `ErrorException`.
---
#### clearPostProcessors()
Remove all registered post-processors.
**Signature:**
```java
public static void clearPostProcessors() throws Error
```
**Returns:** `void`
**Errors:** Throws `ErrorException`.
---
#### listRenderers()
List names of all registered renderers.
**Errors:**
Returns an error if the registry lock is poisoned.
**Signature:**
```java
public static List listRenderers() throws Error
```
**Returns:** `List`
**Errors:** Throws `ErrorException`.
---
#### clearRenderers()
Clear all renderers from the global registry.
Removes every renderer, including the built-in defaults (markdown, html,
djot, plain). After calling this no renderers are registered; re-register
as needed.
**Errors:**
Returns an error if the registry lock is poisoned.
**Signature:**
```java
public static void clearRenderers() throws Error
```
**Returns:** `void`
**Errors:** Throws `ErrorException`.
---
#### listValidators()
List names of all registered validators.
**Signature:**
```java
public static List listValidators() throws Error
```
**Returns:** `List`
**Errors:** Throws `ErrorException`.
---
#### clearValidators()
Remove all registered validators.
**Signature:**
```java
public static void clearValidators() throws Error
```
**Returns:** `void`
**Errors:** Throws `ErrorException`.
---
#### compare()
Compare two extraction results and return a structured diff.
The comparison is purely structural — no I/O, no side effects. All fields
of `ExtractionDiff` are populated according to the provided `DiffOptions`.
**Signature:**
```java
public static ExtractionDiff compare(ExtractionResult a, ExtractionResult b, DiffOptions opts)
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `a` | `ExtractionResult` | Yes | The first extraction result to compare |
| `b` | `ExtractionResult` | Yes | The second extraction result to compare |
| `opts` | `DiffOptions` | Yes | The options to use |
**Returns:** `ExtractionDiff`
---
#### embedTextsAsync()
Generate embeddings asynchronously for a list of text strings.
This is the async counterpart to `embedTexts`. It offloads the blocking
ONNX inference work to a dedicated blocking thread pool via Tokio's
`spawn_blocking`, keeping the async executor free.
Returns one embedding vector per input text in the same order.
**Errors:**
- `KreuzbergError.MissingDependency` if ONNX Runtime is not installed
- `KreuzbergError.Embedding` if the preset name is unknown, model download fails,
or the blocking inference task panics
**Signature:**
```java
public static List> embedTextsAsync(List texts, EmbeddingConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `texts` | `List` | Yes | List of strings to embed (owned, sent to a blocking thread) |
| `config` | `EmbeddingConfig` | Yes | Embedding configuration specifying model, batch size, and normalization |
**Returns:** `List>`
**Errors:** Throws `ErrorException`.
---
#### renderPdfPageToPng()
Render a single PDF page to PNG bytes.
Returns raw PNG-encoded bytes for the specified page at the given DPI.
Uses pdf_oxide with tiny-skia for pure-Rust rendering.
**Errors:**
Returns `KreuzbergError.Parsing` if the PDF cannot be opened, authenticated,
or rendered, or if `pageIndex` is out of range.
**Signature:**
```java
public static byte[] renderPdfPageToPng(byte[] pdfBytes, long pageIndex, int dpi, String password) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `pdfBytes` | `byte[]` | Yes | Raw PDF file bytes |
| `pageIndex` | `long` | Yes | Zero-based page index |
| `dpi` | `Optional` | No | Resolution in dots per inch (default: 150) |
| `password` | `Optional` | No | Optional password for encrypted PDFs |
**Returns:** `byte[]`
**Errors:** Throws `ErrorException`.
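
As a worked example of how the `dpi` parameter relates to output size: PDF page dimensions are expressed in points of 1/72 inch by default, so a rendered dimension in pixels is roughly `points * dpi / 72`. The `DpiSketch` helper below is hypothetical, and its rounding is an assumption; the renderer may round differently.

```java
/** Hypothetical helper for estimating rendered size; not part of the API. */
final class DpiSketch {
    static final double POINTS_PER_INCH = 72.0;

    /** Converts a length in PDF points to pixels at the given DPI. */
    static int pixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points; 150 is the documented default DPI.
        System.out.println(pixels(612, 150) + " x " + pixels(792, 150));
    }
}
```

At the default of 150 DPI, a US Letter page works out to 1275 by 1650 pixels under this estimate.
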
---
#### detectMimeType()
Detect the MIME type of a file at the given path.
Uses the file extension and optionally the file content to determine the MIME type.
Set `checkExists` to `true` to verify the file exists before detection.
**Signature:**
```java
public static String detectMimeType(String path, boolean checkExists) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `path` | `String` | Yes | Path to the file |
| `checkExists` | `boolean` | Yes | Whether to verify that the file exists before detection |
**Returns:** `String`
**Errors:** Throws `ErrorException`.
---
#### embedTexts()
Embed a list of texts using the configured embedding model.
Returns a list of vectors where each inner vector is the embedding for the corresponding text.
**Signature:**
```java
public static List> embedTexts(List texts, EmbeddingConfig config) throws Error
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `texts` | `List` | Yes | The texts to embed |
| `config` | `EmbeddingConfig` | Yes | Embedding configuration specifying model, batch size, and normalization |
**Returns:** `List>`
**Errors:** Throws `ErrorException`.
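
Because each returned vector corresponds to the input text at the same index, results can be paired back with their texts by position. The sketch below shows one way to do that. The `EmbeddingPairingSketch` class is hypothetical, and the element type `List<List<Float>>` is an assumption, since the generic parameters are not shown in the signature above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that pairs texts with embeddings by index. */
final class EmbeddingPairingSketch {
    static Map<String, List<Float>> pair(List<String> texts, List<List<Float>> embeddings) {
        if (texts.size() != embeddings.size()) {
            throw new IllegalArgumentException("expected one embedding per text");
        }
        // LinkedHashMap keeps input order; duplicate texts keep the last vector.
        Map<String, List<Float>> byText = new LinkedHashMap<>();
        for (int i = 0; i < texts.size(); i++) {
            byText.put(texts.get(i), embeddings.get(i));
        }
        return byText;
    }

    public static void main(String[] args) {
        // Made-up vectors standing in for real model output.
        List<String> texts = List.of("first", "second");
        List<List<Float>> vectors = List.of(List.of(0.1f, 0.2f), List.of(0.3f, 0.4f));
        System.out.println(pair(texts, vectors).keySet());
    }
}
```
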
---
#### getEmbeddingPreset()
Get an embedding preset by name.
Returns an empty `Optional` if no preset with the given name exists. Returns an owned
clone so the value is safe to pass across FFI boundaries.
**Signature:**
```java
public static Optional getEmbeddingPreset(String name)
```
**Parameters:**
| Name | Type | Required | Description |
|------|------|----------|-------------|
| `name` | `String` | Yes | Name of the preset to look up |
**Returns:** `Optional`
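
Since the signature returns an `Optional`, callers would typically branch on presence instead of comparing against `null`. The sketch below shows that pattern with a hypothetical lookup table; the `PresetLookupSketch` class, the preset name `small-demo`, and the value stored for it are made up for illustration and are not real presets.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup that mimics an Optional-returning preset query. */
final class PresetLookupSketch {
    // Made-up table: preset name to embedding dimension.
    private static final Map<String, Integer> DIMENSIONS = Map.of("small-demo", 384);

    static Optional<Integer> find(String name) {
        return Optional.ofNullable(DIMENSIONS.get(name));
    }

    public static void main(String[] args) {
        String known = find("small-demo").map(d -> "dimension " + d).orElse("no such preset");
        String unknown = find("missing").map(d -> "dimension " + d).orElse("no such preset");
        System.out.println(known);
        System.out.println(unknown);
    }
}
```
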
---
#### listEmbeddingPresets()
List the names of all available embedding presets.
Returns owned `String`s so the values are safe to pass across FFI boundaries.
**Signature:**
```java
public static List listEmbeddingPresets()
```
**Returns:** `List`
---
### Types
#### AccelerationConfig
Hardware acceleration configuration for ONNX Runtime models.
Controls which execution provider (CPU, CoreML, CUDA, TensorRT) is used
for inference in layout detection and embedding generation.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `provider` | `ExecutionProviderType` | `ExecutionProviderType.AUTO` | Execution provider to use for ONNX inference. |
| `deviceId` | `int` | — | GPU device ID (for CUDA/TensorRT). Ignored for CPU/CoreML/Auto. |
---
#### ArchiveEntry
A single file extracted from an archive.
When archives (ZIP, TAR, 7Z, GZIP) are extracted with recursive extraction
enabled, each processable file produces its own full `ExtractionResult`.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `path` | `String` | — | Archive-relative file path (e.g. "folder/document.pdf"). |
| `mimeType` | `String` | — | Detected MIME type of the file. |
| `result` | `ExtractionResult` | — | Full extraction result for this file. |
---
#### ArchiveMetadata
Archive (ZIP/TAR/7Z) metadata.
Extracted from compressed archive files containing file lists and size information.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `format` | `String` | — | Archive format ("ZIP", "TAR", "7Z", etc.) |
| `fileCount` | `int` | — | Total number of files in the archive |
| `fileList` | `List` | `Collections.emptyList()` | List of file paths within the archive |
| `totalSize` | `long` | — | Total uncompressed size in bytes |
| `compressedSize` | `Optional` | `null` | Compressed size in bytes (if available) |
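
As an example of using these fields together, the sketch below derives a compression ratio from `totalSize` and `compressedSize`. The `CompressionRatioSketch` helper is hypothetical, and treating `compressedSize` as an `Optional<Long>` is an assumption based on the table above.

```java
import java.util.Optional;

/** Hypothetical helper; not part of the API. */
final class CompressionRatioSketch {
    /** Returns compressed/uncompressed size, or empty when it cannot be computed. */
    static Optional<Double> ratio(long totalSize, Optional<Long> compressedSize) {
        if (totalSize <= 0) {
            return Optional.empty();
        }
        return compressedSize.map(c -> (double) c / totalSize);
    }

    public static void main(String[] args) {
        System.out.println(ratio(1000, Optional.of(250L)).orElse(Double.NaN));
    }
}
```
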
---
#### BBox
Bounding box in original image coordinates (x1, y1) top-left, (x2, y2) bottom-right.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
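| `x1` | `float` | — | X coordinate of the top-left corner |
| `y1` | `float` | — | Y coordinate of the top-left corner |
| `x2` | `float` | — | X coordinate of the bottom-right corner |
| `y2` | `float` | — | Y coordinate of the bottom-right corner |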
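
With (x1, y1) as the top-left corner and (x2, y2) as the bottom-right corner, width and height follow by subtraction. The `BBoxSketch` class below is a hypothetical mirror of the four fields, assuming image coordinates in which y grows downward.

```java
/** Hypothetical mirror of the BBox fields with derived dimensions. */
final class BBoxSketch {
    final float x1;
    final float y1;
    final float x2;
    final float y2;

    BBoxSketch(float x1, float y1, float x2, float y2) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
    }

    float width() {
        return x2 - x1;
    }

    float height() {
        return y2 - y1;
    }

    public static void main(String[] args) {
        BBoxSketch box = new BBoxSketch(10f, 20f, 110f, 70f);
        System.out.println(box.width() + " x " + box.height());
    }
}
```
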
---
#### BatchBytesItem
Batch item for byte array extraction.
Used with `batchExtractBytes` and `batchExtractBytesSync`
to represent a single item in a batch extraction job.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `content` | `byte[]` | — | The content bytes to extract from |
| `mimeType` | `String` | — | MIME type of the content (e.g., "application/pdf", "text/html") |
| `config` | `Optional` | `null` | Per-item configuration overrides (`null` uses batch-level defaults) |
---
#### BatchFileItem
Batch item for file extraction.
Used with `batchExtractFiles` and `batchExtractFilesSync`
to represent a single file in a batch extraction job.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `path` | `String` | — | Path to the file to extract from |
| `config` | `Optional` | `null` | Per-file configuration overrides (`null` uses batch-level defaults) |
---
#### BibtexMetadata
BibTeX bibliography metadata.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `entryCount` | `long` | — | Number of entries in the bibliography. |
| `citationKeys` | `List` | `Collections.emptyList()` | Citation keys |
| `authors` | `List` | `Collections.emptyList()` | Authors |
| `yearRange` | `Optional` | `null` | Year range |
| `entryTypes` | `Optional