fil/crates/kreuzberg-wasm/README.md
Henrik Jess Nielsen b4c07d3693
Nomad changes
2026-06-01 23:40:55 +02:00


WebAssembly

Extract text, tables, images, and metadata from 90+ file formats, including PDF, Office documents, and images, and analyze source code in 300+ programming languages. These WebAssembly bindings target browsers, Deno, and Cloudflare Workers, with portable deployment and multi-threading support.

What This Package Provides

  • Document intelligence core — extract text, tables, images, metadata, entities, keywords, and code intelligence from one API.
  • Format coverage — PDF, Office, images, HTML/XML, email, archives, notebooks, citations, scientific formats, and plain text.
  • OCR choices — Tesseract, PaddleOCR, EasyOCR where supported, VLM OCR through liter-llm, and plugin hooks for custom backends.
  • Same engine as every binding — Rust, Python, Node.js, Go, Java, PHP, Ruby, .NET, Elixir, R, WASM, Kotlin Android, Swift, Dart, Zig, and C FFI share the same Rust implementation.
  • WASM package — browser and edge-compatible extraction where native libraries are unavailable.

Installation

Package Installation

pnpm add @kreuzberg/wasm

System Requirements

  • Modern browser with WebAssembly support, or Deno 1.0+, or Cloudflare Workers
  • Optional: Tesseract WASM for OCR functionality

Quick Start

Basic Extraction

Extract text, metadata, and structure from any supported document format:

import { extractBytes, initWasm } from "@kreuzberg/wasm";

async function main() {
  await initWasm();

  const buffer = await fetch("document.pdf").then((r) => r.arrayBuffer());
  const bytes = new Uint8Array(buffer);

  const result = await extractBytes(bytes, "application/pdf");

  console.log("Extracted content:");
  console.log(result.content);
  console.log("MIME type:", result.mimeType);
  console.log("Metadata:", result.metadata);
}

main().catch(console.error);
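
The second argument to `extractBytes` is a MIME type. When only a file name is available, a small lookup can supply one. The sketch below is a hypothetical helper: `guessMimeType` and its table are not part of `@kreuzberg/wasm`, the table covers only a few of the formats listed in this README, and whether the package accepts each of these MIME strings is an assumption worth verifying.

```typescript
// Hypothetical helper: maps a file extension to a standard MIME type.
// Neither the table nor the function is part of @kreuzberg/wasm.
const MIME_BY_EXTENSION = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"],
  ["xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
  ["pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation"],
  ["png", "image/png"],
  ["jpg", "image/jpeg"],
  ["jpeg", "image/jpeg"],
  ["html", "text/html"],
  ["json", "application/json"],
  ["csv", "text/csv"],
  ["txt", "text/plain"],
]);

function guessMimeType(fileName: string): string | undefined {
  const dot = fileName.lastIndexOf(".");
  // No extension, or a trailing dot: nothing to look up.
  if (dot < 0 || dot === fileName.length - 1) return undefined;
  const extension = fileName.slice(dot + 1).toLowerCase();
  return MIME_BY_EXTENSION.get(extension);
}
```

A caller could fall back to an error or a default when the helper returns `undefined`, rather than passing a guess to `extractBytes`.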

Common Use Cases

Extract with Custom Configuration

Most use cases benefit from configuration to control extraction behavior:

With OCR (for scanned documents):

import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";

async function extractWithOcr() {
  await initWasm();

  try {
    await enableOcr();
    console.log("OCR enabled successfully");
  } catch (error) {
    console.error("Failed to enable OCR:", error);
    return;
  }

  const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));

  const result = await extractBytes(bytes, "image/png", {
    ocr: {
      backend: "tesseract-wasm",
      language: "eng",
    },
  });

  console.log("Extracted text:");
  console.log(result.content);
}

extractWithOcr().catch(console.error);
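
OCR is comparatively slow, so one possible pattern is to try plain extraction first and retry with OCR only when no text comes back. The sketch below takes the extraction function as a parameter so it stays self-contained; `extractWithOcrFallback`, `ExtractFn`, and the `minChars` threshold are hypothetical, and it assumes that extracting a scanned document without OCR yields empty content rather than throwing.

```typescript
// These shapes mirror only the fields used in this README; the real types may differ.
interface OcrOptions {
  ocr: { backend: string; language: string };
}

type ExtractFn = (
  bytes: Uint8Array,
  mimeType: string,
  options?: OcrOptions,
) => Promise<{ content: string }>;

// Hypothetical helper: plain extraction first, OCR only if too little text was found.
async function extractWithOcrFallback(
  extract: ExtractFn,
  bytes: Uint8Array,
  mimeType: string,
  minChars = 1,
): Promise<{ content: string; usedOcr: boolean }> {
  const plain = await extract(bytes, mimeType);
  if (plain.content.trim().length >= minChars) {
    return { content: plain.content, usedOcr: false };
  }

  // Same OCR settings as the example above.
  const scanned = await extract(bytes, mimeType, {
    ocr: { backend: "tesseract-wasm", language: "eng" },
  });
  return { content: scanned.content, usedOcr: true };
}
```

In an application, `extract` would presumably be `extractBytes`, passed in after `initWasm()` and `enableOcr()` have completed.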

Table Extraction

See Configuration Guide for table extraction options.

Processing Multiple Files

import { extractBytes, initWasm } from "@kreuzberg/wasm";

interface DocumentJob {
  name: string;
  bytes: Uint8Array;
  mimeType: string;
}

async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
  await initWasm();

  const results: Record<string, string> = {};
  const queue = [...documents];

  const workers = Array(concurrency)
    .fill(null)
    .map(async () => {
      while (queue.length > 0) {
        const doc = queue.shift();
        if (!doc) break;

        try {
          const result = await extractBytes(doc.bytes, doc.mimeType);
          results[doc.name] = result.content;
        } catch (error) {
          console.error(`Failed to process ${doc.name}:`, error);
        }
      }
    });

  await Promise.all(workers);
  return results;
}
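
The batch example above logs failures and drops them from the result. A variant that reports failures to the caller might look like the sketch below. It is an illustration only: `processBatchWithReport` and `BatchReport` are hypothetical names, and the extraction function is passed in as a parameter so the block does not depend on the package.

```typescript
interface BatchJob {
  name: string;
  bytes: Uint8Array;
  mimeType: string;
}

interface BatchReport {
  succeeded: Record<string, string>; // document name -> extracted content
  failed: Record<string, string>; // document name -> error message
}

// Hypothetical helper: bounded-concurrency batch that keeps track of failures.
async function processBatchWithReport(
  extract: (bytes: Uint8Array, mimeType: string) => Promise<{ content: string }>,
  jobs: BatchJob[],
  concurrency = 3,
): Promise<BatchReport> {
  const report: BatchReport = { succeeded: {}, failed: {} };
  let next = 0; // index of the next unclaimed job

  const worker = async (): Promise<void> => {
    while (next < jobs.length) {
      const job = jobs[next++];
      if (!job) break;
      try {
        const result = await extract(job.bytes, job.mimeType);
        report.succeeded[job.name] = result.content;
      } catch (error) {
        report.failed[job.name] = error instanceof Error ? error.message : String(error);
      }
    }
  };

  // Never start more workers than there are jobs.
  const workerCount = Math.max(1, Math.min(concurrency, jobs.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return report;
}
```

Because the counter is only read and incremented between `await` points, two workers cannot claim the same job.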

Async Processing

For non-blocking document processing:

import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";

async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
  const caps = getWasmCapabilities();
  if (!caps.hasWasm) {
    throw new Error("WebAssembly not supported");
  }

  await initWasm();

  const results = await Promise.all(
    files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])),
  );

  return results.map((r) => ({
    content: r.content,
    pageCount: r.metadata?.pageCount,
  }));
}

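// Placeholder bytes for illustration only: these three bytes are not a valid PDF, so replace them with real document data.
const fileBytes = [new Uint8Array([1, 2, 3])];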
const mimes = ["application/pdf"];

extractDocuments(fileBytes, mimes)
  .then((results) => console.log(results))
  .catch(console.error);
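
`Promise.all` rejects as soon as any one extraction fails, so a single bad file discards every result in the example above. If partial results matter, `Promise.allSettled` is one alternative. The sketch below is an assumption-laden illustration: `extractAllSettled` and `SettledDocument` are hypothetical names, and the extraction function is injected rather than imported.

```typescript
interface SettledDocument {
  index: number;
  content?: string; // set when extraction succeeded
  error?: string; // set when extraction failed
}

// Hypothetical helper: runs every extraction and reports each outcome separately.
async function extractAllSettled(
  extract: (bytes: Uint8Array, mimeType: string) => Promise<{ content: string }>,
  files: Uint8Array[],
  mimeTypes: string[],
): Promise<SettledDocument[]> {
  if (files.length !== mimeTypes.length) {
    throw new Error("files and mimeTypes must have the same length");
  }

  const settled = await Promise.allSettled(
    // The length check above guarantees an entry at every index.
    files.map((bytes, index) => extract(bytes, mimeTypes[index] as string)),
  );

  return settled.map((outcome, index): SettledDocument => {
    if (outcome.status === "fulfilled") {
      return { index, content: outcome.value.content };
    }
    const reason: unknown = outcome.reason;
    return { index, error: reason instanceof Error ? reason.message : String(reason) };
  });
}
```

The explicit length check also guards against a file being paired with an `undefined` MIME type.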

Next Steps

Features

Supported File Formats (90+)

90+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

| Category | Formats | Capabilities |
| --- | --- | --- |
| Word Processing | .docx, .docm, .dotx, .dotm, .dot, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .pptm, .ppsx, .potx, .potm, .pot, .ppt | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
| Database | .dbf | Table data extraction, field type support |
| Hangul | .hwp, .hwpx | Korean document format, text extraction |

Images (OCR-Enabled)

| Category | Formats | Features |
| --- | --- | --- |
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

Web & Data

| Category | Formats | Features |
| --- | --- | --- |
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .rst, .org, .rtf | CommonMark, GFM, Djot, reStructuredText, Org Mode |

Email & Archives

| Category | Formats | Features |
| --- | --- | --- |
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |

Academic & Scientific

| Category | Formats | Features |
| --- | --- | --- |
| Citations | .bib, .biblatex, .ris, .nbib, .enw, .csl | Structured parsing of RIS, PubMed/MEDLINE, EndNote XML, BibTeX, and CSL JSON |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |

Code Intelligence (300+ Languages)

| Feature | Description |
| --- | --- |
| Structure Extraction | Functions, classes, methods, structs, interfaces, enums |
| Import/Export Analysis | Module dependencies, re-exports, wildcard imports |
| Symbol Extraction | Variables, constants, type aliases, properties |
| Docstring Parsing | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| Diagnostics | Parse errors with line/column positions |
| Syntax-Aware Chunking | Split code by semantic boundaries, not arbitrary byte offsets |

Powered by tree-sitter-language-pack (see its documentation).

Complete Format Reference

Key Capabilities

  • Text Extraction - Extract all text content with position and formatting information
  • Metadata Extraction - Retrieve document properties, creation date, author, etc.
  • Table Extraction - Parse tables with structure and cell content preservation
  • Image Extraction - Extract embedded images and render page previews
  • OCR Support - Integrate multiple OCR backends for scanned documents
  • Async/Await - Non-blocking document processing with concurrent operations
  • Plugin System - Extensible post-processing for custom text transformation
  • Batch Processing - Efficiently process multiple documents in parallel
  • Memory Efficient - Stream large files without loading entirely into memory
  • Language Detection - Detect and support multiple languages in documents
  • Code Intelligence - Extract structure, imports, exports, symbols, and docstrings from 300+ programming languages via tree-sitter
  • Configuration - Fine-grained control over extraction behavior

Performance Characteristics

| Format | Speed | Memory | Notes |
| --- | --- | --- | --- |
| PDF (text) | 10-100 MB/s | ~50 MB per doc | Fastest extraction |
| Office docs | 20-200 MB/s | ~100 MB per doc | DOCX, XLSX, PPTX |
| Images (OCR) | 1-5 MB/s | Variable | Depends on OCR backend |
| Archives | 5-50 MB/s | ~200 MB per doc | ZIP, TAR, etc. |
| Web formats | 50-200 MB/s | Streaming | HTML, XML, JSON |
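
The throughput ranges above translate into rough time bounds by dividing file size by speed. The sketch below does only that arithmetic, using the figures from the table. `estimateSeconds` and `THROUGHPUT` are hypothetical names, and the result ignores startup cost such as loading the WASM module, so treat it as an order-of-magnitude guide.

```typescript
interface ThroughputRange {
  minMBps: number;
  maxMBps: number;
}

// Speed ranges copied from the table above.
const THROUGHPUT = {
  pdfText: { minMBps: 10, maxMBps: 100 },
  office: { minMBps: 20, maxMBps: 200 },
  imagesOcr: { minMBps: 1, maxMBps: 5 },
  archives: { minMBps: 5, maxMBps: 50 },
  web: { minMBps: 50, maxMBps: 200 },
} as const;

// Hypothetical helper: best case uses the fastest speed, worst case the slowest.
function estimateSeconds(
  sizeMB: number,
  range: ThroughputRange,
): { best: number; worst: number } {
  return { best: sizeMB / range.maxMBps, worst: sizeMB / range.minMBps };
}

// A 25 MB text PDF: between 0.25 s and 2.5 s.
console.log(estimateSeconds(25, THROUGHPUT.pdfText));
// A 25 MB scanned image through OCR: between 5 s and 25 s.
console.log(estimateSeconds(25, THROUGHPUT.imagesOcr));
```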

OCR Support

Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images. For the WASM package, this README lists the following backend:

  • Tesseract-Wasm

OCR Configuration Example

import { enableOcr, extractBytes, initWasm } from "@kreuzberg/wasm";

async function extractWithOcr() {
  await initWasm();

  try {
    await enableOcr();
    console.log("OCR enabled successfully");
  } catch (error) {
    console.error("Failed to enable OCR:", error);
    return;
  }

  const bytes = new Uint8Array(await fetch("scanned-page.png").then((r) => r.arrayBuffer()));

  const result = await extractBytes(bytes, "image/png", {
    ocr: {
      backend: "tesseract-wasm",
      language: "eng",
    },
  });

  console.log("Extracted text:");
  console.log(result.content);
}

extractWithOcr().catch(console.error);

Async Support

This binding provides full async/await support for non-blocking document processing:

import { extractBytes, getWasmCapabilities, initWasm } from "@kreuzberg/wasm";

async function extractDocuments(files: Uint8Array[], mimeTypes: string[]) {
  const caps = getWasmCapabilities();
  if (!caps.hasWasm) {
    throw new Error("WebAssembly not supported");
  }

  await initWasm();

  const results = await Promise.all(
    files.map((bytes, index) => extractBytes(bytes, mimeTypes[index])),
  );

  return results.map((r) => ({
    content: r.content,
    pageCount: r.metadata?.pageCount,
  }));
}

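// Placeholder bytes for illustration only: these three bytes are not a valid PDF, so replace them with real document data.
const fileBytes = [new Uint8Array([1, 2, 3])];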
const mimes = ["application/pdf"];

extractDocuments(fileBytes, mimes)
  .then((results) => console.log(results))
  .catch(console.error);

Plugin System

Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.

For detailed plugin documentation, visit Plugin System Guide.
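
The plugin API itself is covered in the Plugin System Guide and is not shown in this README. Purely to illustrate what post-processing of extracted text can mean, the sketch below chains plain functions over a string. `TextTransform`, `applyTransforms`, and the two sample transforms are hypothetical and do not reflect Kreuzberg's plugin interface.

```typescript
// Hypothetical illustration only; this is NOT the Kreuzberg plugin API.
type TextTransform = (text: string) => string;

// Collapse runs of spaces/tabs and limit blank-line runs to one blank line.
const collapseWhitespace: TextTransform = (text) =>
  text.replace(/[ \t]+/g, " ").replace(/\n{3,}/g, "\n\n").trim();

// Drop lines that contain nothing but digits, such as stray page numbers.
const dropPageNumberLines: TextTransform = (text) =>
  text
    .split("\n")
    .filter((line) => !/^\s*\d+\s*$/.test(line))
    .join("\n");

// Apply each transform in order, feeding the output of one into the next.
function applyTransforms(text: string, transforms: TextTransform[]): string {
  return transforms.reduce((current, transform) => transform(current), text);
}

const cleaned = applyTransforms("Title\n\n12\n\nBody   text\n", [
  dropPageNumberLines,
  collapseWhitespace,
]);
console.log(JSON.stringify(cleaned)); // "Title\n\nBody text"
```

Order matters here: removing the page-number line first leaves three consecutive newlines, which the whitespace transform then collapses.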

Batch Processing

Process multiple documents efficiently:

import { extractBytes, initWasm } from "@kreuzberg/wasm";

interface DocumentJob {
  name: string;
  bytes: Uint8Array;
  mimeType: string;
}

async function _processBatch(documents: DocumentJob[], concurrency: number = 3) {
  await initWasm();

  const results: Record<string, string> = {};
  const queue = [...documents];

  const workers = Array(concurrency)
    .fill(null)
    .map(async () => {
      while (queue.length > 0) {
        const doc = queue.shift();
        if (!doc) break;

        try {
          const result = await extractBytes(doc.bytes, doc.mimeType);
          results[doc.name] = result.content;
        } catch (error) {
          console.error(`Failed to process ${doc.name}:`, error);
        }
      }
    });

  await Promise.all(workers);
  return results;
}

Configuration

For advanced configuration options including language detection, table extraction, OCR settings, and more:

Configuration Guide

Documentation

Contributing

Contributions are welcome! See Contributing Guide.

Part of Kreuzberg.dev

  • Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
  • kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
  • html-to-markdown — fast, lossless HTML→Markdown engine.
  • liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
  • tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
  • alef — the polyglot binding generator that produces this README and all per-language bindings.
  • Discord — community, roadmap, announcements.

License

Elastic-2.0 License — see LICENSE for details.

Support