Henrik Jess Nielsen b4c07d3693

Nomad changes

2026-06-01 23:40:55 +02:00

Extraction Basics

Eight core extraction functions are available, organized by input type (file path vs bytes), cardinality (single vs batch), and execution model (sync vs async).

| Input | Single sync | Single async | Batch sync | Batch async |
|-------|-------------|--------------|------------|-------------|
| File path | `extract_file_sync` | `extract_file` | `batch_extract_files_sync` | `batch_extract_files` |
| Bytes | `extract_bytes_sync` | `extract_bytes` | `batch_extract_bytes_sync` | `batch_extract_bytes` |
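
The naming scheme in the table is regular enough to compose from its three axes. The sketch below only illustrates that pattern; `extraction_function_name` is a hypothetical helper and is not part of Kreuzberg.

```python
def extraction_function_name(source: str, batch: bool, use_async: bool) -> str:
    """Compose a function name from the three axes in the table above.

    Hypothetical helper for illustration only; Kreuzberg does not ship it.
    """
    if source not in ("file", "bytes"):
        raise ValueError(f"unknown source: {source!r}")
    # Batch variants pluralise the file noun: extract_file -> batch_extract_files.
    noun = "files" if (batch and source == "file") else source
    name = f"batch_extract_{noun}" if batch else f"extract_{noun}"
    # Async is the unsuffixed form; sync variants carry a _sync suffix.
    return name if use_async else f"{name}_sync"
```

Note that the async variant is the unsuffixed name, so calling `extract_file` from synchronous code without awaiting it would yield a coroutine rather than a result.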

!!! Tip "Sync vs Async"
    Use async variants when you're already in an async context or processing multiple files concurrently. For scripts and simple pipelines, sync variants are simpler and just as fast for single files.
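
To make "processing multiple files concurrently" concrete, here is a minimal sketch of the async pattern using only the standard library. `fake_extract` is a hypothetical stand-in for an async extraction call, so the example runs without Kreuzberg installed.

```python
import asyncio


async def fake_extract(path: str) -> str:
    """Stand-in for an async extraction call (hypothetical, not a Kreuzberg API)."""
    await asyncio.sleep(0.01)  # simulate I/O-bound work
    return f"text of {path}"


async def extract_all(paths: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_extract(p) for p in paths))


results = asyncio.run(extract_all(["a.pdf", "b.docx", "c.html"]))
```

In real code the stand-in would presumably be replaced by one of the async variants from the table above.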

Extract from Files

Pass a file path. Kreuzberg detects the MIME type from the extension and selects the right parser automatically.

Synchronous

=== "Python"

--8<-- "snippets/python/api/extract_file_sync.md"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/extract_file_sync.md"

=== "Rust"

--8<-- "snippets/rust/api/extract_file_sync.md"

=== "Go"

--8<-- "snippets/go/api/extract_file_sync.md"

=== "Java"

--8<-- "snippets/java/api/extract_file_sync.md"

=== "C#"

--8<-- "snippets/csharp/extract_file_sync.md"

=== "Ruby"

--8<-- "snippets/ruby/api/extract_file_sync.md"

=== "R"

--8<-- "snippets/r/api/extract_file_sync.md"

=== "C"

--8<-- "snippets/c/api/extract_file_sync.md"

=== "Wasm"

--8<-- "snippets/wasm/api/extract_file_sync.md"

Asynchronous

=== "Python"

--8<-- "snippets/python/api/extract_file_async.md"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/extract_file_async.md"

=== "Rust"

--8<-- "snippets/rust/api/extract_file_async.md"

=== "Go"

--8<-- "snippets/go/api/extract_file_async.md"

=== "Java"

--8<-- "snippets/java/api/extract_file_async.md"

=== "C#"

--8<-- "snippets/csharp/extract_file_async.md"

=== "Ruby"

--8<-- "snippets/ruby/api/extract_file_async.md"

=== "R"

--8<-- "snippets/r/api/extract_file_async.md"

=== "C"

--8<-- "snippets/c/api/extract_file_async.md"

=== "Wasm"

--8<-- "snippets/wasm/api/extract_file_async.md"

Extract from Bytes

When the file is already loaded in memory (for example, from an upload or network response), pass the byte array with its MIME type. Unlike file extraction, the MIME type is required since there's no file extension to infer it from.
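
For uploads, the original filename is often still available even though the content arrives as bytes. The sketch below derives a MIME type from that name with Python's standard `mimetypes` module before calling a bytes function. `mime_type_for_upload` is a hypothetical helper, and the fallback value is an assumption; an application may prefer to reject unknown types instead.

```python
import mimetypes
from typing import Optional


def mime_type_for_upload(
    filename: Optional[str],
    default: str = "application/octet-stream",
) -> str:
    """Best-effort MIME type from an uploaded file's original name."""
    if not filename:
        return default
    # guess_type() returns (type, encoding); type is None for unknown extensions.
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default
```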

Synchronous

=== "Python"

--8<-- "snippets/python/api/extract_bytes_sync.md"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/extract_bytes_sync.md"

=== "Rust"

--8<-- "snippets/rust/api/extract_bytes_sync.md"

=== "Go"

--8<-- "snippets/go/api/extract_bytes_sync.md"

=== "Java"

--8<-- "snippets/java/api/extract_bytes_sync.md"

=== "C#"

--8<-- "snippets/csharp/extract_bytes_sync.md"

=== "Ruby"

--8<-- "snippets/ruby/api/extract_bytes_sync.md"

=== "R"

--8<-- "snippets/r/api/extract_bytes_sync.md"

=== "C"

--8<-- "snippets/c/api/extract_bytes_sync.md"

=== "Wasm"

--8<-- "snippets/wasm/api/extract_bytes_sync.md"

Asynchronous

=== "Python"

--8<-- "snippets/python/api/extract_bytes_async.md"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/extract_bytes_async.md"

=== "Rust"

--8<-- "snippets/rust/api/extract_bytes_async.md"

=== "Go"

--8<-- "snippets/go/api/extract_bytes_async.md"

=== "Java"

--8<-- "snippets/java/api/extract_bytes_async.md"

=== "C#"

--8<-- "snippets/csharp/extract_bytes_async.md"

=== "Ruby"

--8<-- "snippets/ruby/api/extract_bytes_async.md"

=== "R"

--8<-- "snippets/r/api/extract_bytes_async.md"

=== "C"

--8<-- "snippets/c/api/extract_bytes_async.md"

=== "Wasm"

--8<-- "snippets/wasm/api/extract_bytes_async.md"

Batch Processing

Batch functions accept an array of file paths (or byte arrays) and process them concurrently. This is typically 2-5x faster than looping over single-file functions because Kreuzberg parallelizes internally.
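
The difference between looping and batching can be sketched with a thread pool. This is not how Kreuzberg parallelizes internally; `fake_extract` is a hypothetical stand-in, and the example only shows the shape of the two approaches.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_extract(path: str) -> str:
    """Stand-in for a single-file extraction call (hypothetical)."""
    time.sleep(0.01)  # simulate per-file work
    return path.upper()


paths = ["a.pdf", "b.pdf", "c.pdf", "d.pdf"]

# Sequential loop: total time is roughly the sum of the per-file times.
sequential = [fake_extract(p) for p in paths]

# Concurrent map: files overlap, and results still come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    concurrent = list(pool.map(fake_extract, paths))
```

Both approaches produce the same results in the same order; only the wall-clock time differs.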

Batch Extract Files

=== "Python"

--8<-- "snippets/python/api/batch_extract_files_sync.md"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/batch_extract_files_sync.md"

=== "Rust"

--8<-- "snippets/rust/api/batch_extract_files_sync.md"

=== "Go"

--8<-- "snippets/go/api/batch_extract_files_sync.md"

=== "Java"

--8<-- "snippets/java/api/batch_extract_files_sync.md"

=== "C#"

--8<-- "snippets/csharp/batch_extract_files_sync.md"

=== "Ruby"

--8<-- "snippets/ruby/api/batch_extract_files_sync.md"

=== "R"

--8<-- "snippets/r/api/batch_extract_files_sync.md"

=== "C"

--8<-- "snippets/c/api/batch_extract_files_sync.md"

=== "Wasm"

--8<-- "snippets/wasm/api/batch_extract_files_sync.md"

Batch Extract Bytes

=== "Python"

--8<-- "snippets/python/api/batch_extract_bytes_sync.md"

=== "TypeScript"

--8<-- "snippets/typescript/getting-started/batch_extract_bytes_sync.md"

=== "Rust"

--8<-- "snippets/rust/api/batch_extract_bytes_sync.md"

=== "Go"

--8<-- "snippets/go/api/batch_extract_bytes_sync.md"

=== "Java"

--8<-- "snippets/java/api/batch_extract_bytes_sync.md"

=== "C#"

--8<-- "snippets/csharp/batch_extract_bytes_sync.md"

=== "Ruby"

--8<-- "snippets/ruby/api/batch_extract_bytes_sync.md"

=== "R"

--8<-- "snippets/r/api/batch_extract_bytes_sync.md"

=== "C"

--8<-- "snippets/c/api/batch_extract_bytes_sync.md"

=== "Wasm"

--8<-- "snippets/wasm/api/batch_extract_bytes_sync.md"

Per-File Configuration v4.5.0

When a batch contains a mix of document types that need different settings (for example, scanned images needing OCR alongside text-based PDFs), use FileExtractionConfig to override options per file while sharing a common batch config.

=== "Python"

```python title="mixed_batch.py"
from kreuzberg import (
    batch_extract_files_sync,
    ExtractionConfig,
    FileExtractionConfig,
    OcrConfig,
)

config = ExtractionConfig(output_format="markdown")

paths = ["report.pdf", "scan.tiff", "notes.html"]
file_configs = [
    None,
    FileExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(backend="tesseract", language="deu"),
    ),
    FileExtractionConfig(output_format="plain"),
]

results = batch_extract_files_sync(paths, config, file_configs=file_configs)
```

=== "TypeScript"

```typescript title="mixed_batch.ts"
import { batchExtractFilesSync } from '@kreuzberg/node';

const results = batchExtractFilesSync(
  ['report.pdf', 'scan.tiff', 'notes.html'],
  { outputFormat: 'markdown' },
  [
    null,
    { forceOcr: true, ocr: { backend: 'tesseract', language: 'deu' } },
    { outputFormat: 'plain' },
  ],
);
```

=== "Rust"

```rust title="mixed_batch.rs"
use kreuzberg::{
    batch_extract_files, ExtractionConfig, FileExtractionConfig,
    OcrConfig, OutputFormat,
};
use std::path::PathBuf;

let config = ExtractionConfig {
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

let paths = vec![
    PathBuf::from("report.pdf"),
    PathBuf::from("scan.tiff"),
    PathBuf::from("notes.html"),
];

let file_configs = vec![
    None,
    Some(FileExtractionConfig {
        force_ocr: Some(true),
        ocr: Some(OcrConfig {
            backend: "tesseract".to_string(),
            language: "deu".to_string(),
            ..Default::default()
        }),
        ..Default::default()
    }),
    Some(FileExtractionConfig {
        output_format: Some(OutputFormat::Plain),
        ..Default::default()
    }),
];

let results = batch_extract_files(paths, &config, Some(&file_configs)).await?;
```

Fields set to None in FileExtractionConfig inherit the batch default. Batch-level concerns like max_concurrent_extractions, use_cache, and security_limits cannot be overridden per file. See the Configuration Reference for the full list of overridable fields.
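
The inheritance rule can be pictured as a field-by-field merge. The sketch below uses two hypothetical dataclasses, `BatchDefaults` and `FileOverrides`, as simplified stand-ins for the real config types; it illustrates the semantics described above and is not Kreuzberg's implementation.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BatchDefaults:
    """Hypothetical stand-in for the shared batch config."""
    output_format: str = "markdown"
    force_ocr: bool = False


@dataclass
class FileOverrides:
    """Hypothetical stand-in for a per-file config; None means 'inherit'."""
    output_format: Optional[str] = None
    force_ocr: Optional[bool] = None


def resolve(defaults: BatchDefaults, overrides: Optional[FileOverrides]) -> BatchDefaults:
    """Return the effective settings for one file."""
    if overrides is None:  # a None entry in the list: use the batch config as-is
        return defaults
    merged = {}
    for f in fields(defaults):
        value = getattr(overrides, f.name)
        # A field left as None falls back to the batch default.
        merged[f.name] = getattr(defaults, f.name) if value is None else value
    return BatchDefaults(**merged)
```

With these stand-ins, `resolve(BatchDefaults(), FileOverrides(force_ocr=True))` keeps `output_format="markdown"` and only flips `force_ocr`.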

Content Filtering v4.8.0

Kreuzberg strips running headers, footers, watermarks, and cross-page repeating text by default so that downstream RAG and LLM pipelines see clean body content. ContentFilterConfig lets you opt back in to any of these when you need them: for example, when extracting legal forms where the header carries the case number, or when the repeating-text heuristic incorrectly removes a brand name from a PDF you want to analyze.

See ContentFilterConfig for field-level defaults and per-format behavior.

=== "Python"

```python title="keep_headers_footers.py"
from kreuzberg import (
    extract_file_sync,
    ContentFilterConfig,
    ExtractionConfig,
)

# Legal/forms work: keep header and footer text
config = ExtractionConfig(
    content_filter=ContentFilterConfig(
        include_headers=True,
        include_footers=True,
    ),
)

result = extract_file_sync("contract.pdf", config=config)
```

=== "TypeScript"

```typescript title="disable_repeating_text.ts"
import { extract } from "@kreuzberg/node";

// Disable cross-page deduplication so brand names aren't stripped
const result = await extract("brochure.pdf", {
  contentFilter: {
    stripRepeatingText: false,
  },
});
```

=== "Rust"

```rust title="content_filter.rs"
use kreuzberg::{extract_file_sync, ContentFilterConfig, ExtractionConfig};

let config = ExtractionConfig {
    content_filter: Some(ContentFilterConfig {
        include_headers: true,
        include_footers: true,
        strip_repeating_text: true,
        include_watermarks: false,
    }),
    ..Default::default()
};

let result = extract_file_sync("contract.pdf", None, &config)?;
```

When a layout-detection model is active, it can independently classify regions as page headers or footers and strip them per page. Setting include_headers=True / include_footers=True also disables that per-page stripping. See the reference page for the full field semantics and per-format behavior.
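
To see why a repeating-text filter can remove a brand name, consider a deliberately naive version of the idea. Kreuzberg's actual heuristic is not described on this page; `strip_repeating_lines` and its `min_fraction` parameter are hypothetical and exist only for illustration.

```python
from collections import Counter


def strip_repeating_lines(pages: list[str], min_fraction: float = 1.0) -> list[str]:
    """Drop lines that occur on at least `min_fraction` of the pages.

    Naive illustration only; not Kreuzberg's real heuristic.
    """
    if len(pages) < 2:
        return list(pages)  # nothing can repeat across a single page
    # Count each distinct line once per page.
    pages_containing = Counter()
    for page in pages:
        pages_containing.update(set(page.splitlines()))
    threshold = min_fraction * len(pages)
    repeated = {
        line for line, count in pages_containing.items()
        if line.strip() and count >= threshold
    }
    return [
        "\n".join(line for line in page.splitlines() if line not in repeated)
        for page in pages
    ]


pages = [
    "ACME Corp\nQuarterly results were strong.",
    "ACME Corp\nOutlook remains positive.",
]
cleaned = strip_repeating_lines(pages)
```

Here the line "ACME Corp" is dropped from both pages because it appears on every one, which is the kind of false positive that disabling repeating-text stripping avoids.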

Supported Formats

Kreuzberg supports 90+ file formats across 8 categories:

| Category | Extensions | Notes |
|----------|------------|-------|
| PDF | `.pdf` | Native text + OCR for scanned pages |
| Images | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.webp` | Requires OCR backend |
| Office | `.docx`, `.pptx`, `.xlsx` | Modern formats via native parsers |
| Legacy Office | `.doc`, `.ppt` | Native OLE/CFB parsing |
| Email | `.eml`, `.msg` | Full support including attachments |
| Web | `.html`, `.htm` | Converted to Markdown with metadata |
| Text | `.md`, `.txt`, `.xml`, `.json`, `.yaml`, `.toml`, `.csv` | Direct extraction |
| Archives | `.zip`, `.tar`, `.tar.gz`, `.tar.bz2` | Recursive extraction |
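
When routing files before extraction (for example, to decide whether an OCR backend is needed), the table can be turned into a lookup. The sketch below covers only the extensions listed in the table, not the full set of supported formats, and `category_for` is a hypothetical helper rather than a Kreuzberg function.

```python
from typing import Optional

# Extensions copied from the table above; the full format list is longer.
EXTENSIONS_BY_CATEGORY = {
    "PDF": [".pdf"],
    "Images": [".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"],
    "Office": [".docx", ".pptx", ".xlsx"],
    "Legacy Office": [".doc", ".ppt"],
    "Email": [".eml", ".msg"],
    "Web": [".html", ".htm"],
    "Text": [".md", ".txt", ".xml", ".json", ".yaml", ".toml", ".csv"],
    "Archives": [".zip", ".tar", ".tar.gz", ".tar.bz2"],
}

CATEGORY_BY_EXTENSION = {
    ext: category
    for category, extensions in EXTENSIONS_BY_CATEGORY.items()
    for ext in extensions
}


def category_for(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if it is not listed."""
    name = filename.lower()
    # Try the longest suffix first so multi-part suffixes like ".tar.gz" win.
    for ext in sorted(CATEGORY_BY_EXTENSION, key=len, reverse=True):
        if name.endswith(ext):
            return CATEGORY_BY_EXTENSION[ext]
    return None
```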

Page Tracking

Kreuzberg can track page boundaries and extract per-page content. Page tracking availability depends on the format:

PDF — Full byte-accurate page tracking with O(1) lookup
PPTX — Slide boundary tracking (each slide = one page)
DOCX — Best-effort detection using explicit <w:br type="page"/> tags
Other formats — No page tracking

Enable page extraction with PageConfig:

from kreuzberg import ExtractionConfig, PageConfig

config = ExtractionConfig(
    pages=PageConfig(
        insert_page_markers=True,
        marker_format="\n\n<!-- PAGE {page_num} -->\n\n"
    )
)

Page markers like `<!-- PAGE 1 -->` are inserted at boundaries in the content field, which is useful for LLMs that need to understand document layout. When both page tracking and chunking are enabled, chunks automatically include first_page and last_page metadata.
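
Once markers are in the content, downstream code can split on them. The sketch below parses the marker format from the config above with a regular expression. It assumes each marker precedes the text of the page it names; `split_by_page` is a hypothetical helper, not a Kreuzberg API.

```python
import re

# Matches the marker format configured above: <!-- PAGE {page_num} -->
PAGE_MARKER = re.compile(r"<!-- PAGE (\d+) -->")


def split_by_page(content: str) -> dict[int, str]:
    """Map page number -> text following that page's marker."""
    parts = PAGE_MARKER.split(content)
    # With one capture group, split() yields [before, num, text, num, text, ...].
    pages = {}
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1].strip()
    return pages


sample = "\n\n<!-- PAGE 1 -->\n\nIntro text.\n\n<!-- PAGE 2 -->\n\nMore text."
by_page = split_by_page(sample)
```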

See PageConfig Reference for all options and Advanced Page Tracking for chunk-to-page mapping examples.

Code File Extraction

Source code files (.py, .rs, .ts, .go, etc.) go through tree-sitter and produce a ProcessResult on ExtractionResult.code_intelligence (structure, imports/exports, symbols, docstrings, diagnostics, semantic chunks). Code files bypass text chunking: the function- and class-aware CodeChunks from tree-sitter-language-pack (TSLP) map directly to Kreuzberg Chunks with semantic chunk_type and heading context.

See Code Intelligence for usage and TreeSitterProcessConfig for fields.

PDF Page Rendering

Render individual PDF pages as PNG images. Unlike the extraction pipeline (which parses text, tables, metadata), this API produces raw pixel data for thumbnails, vision model input, or custom OCR pipelines.

Two Approaches

| API | When to use |
|-----|-------------|
| `render_pdf_page` | You know which page you need, or only need a few pages |
| `PdfPageIterator` | Process every page sequentially without loading all images into memory |

DPI Configuration

| DPI | Pixel size (US Letter) | Use case |
|-----|------------------------|----------|
| 72 | 612 x 792 | Thumbnails, quick previews |
| 150 (default) | 1275 x 1650 | General-purpose, screen display |
| 300 | 2550 x 3300 | OCR input, print quality |

Tip: Use 300 DPI when rendering pages for OCR or vision models. The default 150 DPI may reduce recognition accuracy on small text.
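
The pixel sizes in the table follow directly from page size in inches multiplied by DPI (US Letter is 8.5 x 11 inches). A small sketch for estimating output dimensions at other resolutions or page sizes; `pixel_size` is a hypothetical helper, not part of the rendering API.

```python
US_LETTER_INCHES = (8.5, 11.0)


def pixel_size(
    dpi: int,
    page_inches: tuple[float, float] = US_LETTER_INCHES,
) -> tuple[int, int]:
    """Pixel dimensions (width, height) of a page rendered at the given DPI."""
    width_in, height_in = page_inches
    return round(width_in * dpi), round(height_in * dpi)
```

Because pixel count grows with the square of the DPI, a 300 DPI render holds four times as many pixels as a 150 DPI render of the same page.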

MIME Type Detection

When extracting from bytes, Kreuzberg requires an explicit MIME type since there's no file extension to infer it from. For file paths, the MIME type is detected from the extension automatically, and you can override it when the extension is missing or misleading.

Example: Override MIME Type

from kreuzberg import ExtractionConfig, extract_file_sync

config = ExtractionConfig()

# File without extension: provide the MIME type explicitly
result = extract_file_sync("document_copy", mime_type="application/pdf", config=config)
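
When neither an extension nor a filename is available, one option is to inspect the first bytes of the content before choosing a MIME type. The sketch below checks a few well-known file signatures. It is an application-side approach under the assumption that you want to pick the type yourself, and does not describe what Kreuzberg does internally; `sniff_mime_type` is a hypothetical helper.

```python
from typing import Optional

# A few well-known file signatures (magic numbers).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Return a MIME type if the content starts with a known signature."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None  # unknown: let the caller decide how to proceed
```

ZIP-based formats are left out on purpose: `.docx`, `.pptx`, and `.xlsx` files all share the ZIP signature, so the leading bytes alone cannot tell them apart.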

Error Handling

All extraction functions raise typed exceptions on failure. Catch specific exceptions to handle different failure modes:

=== "Python"

--8<-- "snippets/python/utils/error_handling.md"

=== "TypeScript"

--8<-- "snippets/typescript/api/error_handling.md"

=== "Rust"

--8<-- "snippets/rust/api/error_handling.md"

=== "Go"

--8<-- "snippets/go/api/error_handling.md"

=== "Java"

--8<-- "snippets/java/api/error_handling.md"

=== "C#"

--8<-- "snippets/csharp/error_handling.md"

=== "Ruby"

--8<-- "snippets/ruby/api/error_handling.md"

=== "R"

--8<-- "snippets/r/api/error_handling.md"

=== "C"

--8<-- "snippets/c/api/error_handling.md"

=== "Wasm"

--8<-- "snippets/wasm/api/error_handling_wasm.md"

!!! Warning "System Errors"
    OSError (Python), std::io::Error (Rust), and other system-level errors always propagate. These indicate real system problems (permissions, disk space, etc.) that your application should handle.
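
The split between extraction errors and system errors can be sketched as follows. `ExtractionFailure` and `fake_extract` are hypothetical stand-ins so the example runs on its own; substitute the exception types shown in the snippets above in real code.

```python
from typing import Optional


class ExtractionFailure(Exception):
    """Hypothetical stand-in for a library-specific extraction error."""


def fake_extract(path: str) -> str:
    """Stand-in extraction call that can fail in two different ways."""
    if path.endswith(".corrupt"):
        raise ExtractionFailure(f"cannot parse {path}")
    if path.startswith("/root/"):
        raise PermissionError(f"cannot read {path}")  # subclass of OSError
    return "ok"


def extract_or_none(path: str) -> Optional[str]:
    try:
        return fake_extract(path)
    except ExtractionFailure:
        # Document-level problem: skip this file and carry on.
        return None
    # OSError is deliberately not caught here, so it reaches the caller.
```

Catching only the document-level error keeps a batch job running past bad files while still surfacing permission or disk problems to the caller.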

Next Steps

Configuration — all configuration options and file formats
OCR Guide — set up optical character recognition
Advanced Features — chunking, language detection, embeddings
Element-Based Output — structured element arrays for RAG
Document Structure — hierarchical tree output
