Nomad changes
All checks were successful
Deploy fil (kreuzberg) / deploy (push) Successful in 49s

Henrik Jess Nielsen
2026-06-01 23:40:55 +02:00
parent 72b1a0a6ed
commit b4c07d3693
5723 changed files with 1130655 additions and 0 deletions

6
packages/r/.lintr generated Normal file

@@ -0,0 +1,6 @@
linters: linters_with_defaults(
    line_length_linter(120),
    object_name_linter = NULL,
    object_usage_linter = NULL,
    commented_code_linter = NULL
  )

23
packages/r/DESCRIPTION generated Normal file
View File

@@ -0,0 +1,23 @@
Package: kreuzberg
Title: High-performance document intelligence library
Version: 5.0.0.9003
Authors@R: person("Na'aman", "Hirschfeld", email = "naaman@kreuzberg.dev", role = c("aut", "cre"))
Description: High-performance document intelligence library
    Rust bindings generated with extendr.
URL: https://github.com/kreuzberg-dev/kreuzberg
BugReports: https://github.com/kreuzberg-dev/kreuzberg/issues
License: Elastic-2.0
Depends: R (>= 4.2)
Imports: jsonlite
Suggests:
    testthat (>= 3.0.0),
    withr,
    roxygen2,
    lintr,
    styler
SystemRequirements: Cargo (Rust's package manager), rustc (>= 1.91)
Config/rextendr/version: 0.4.2
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.3
Config/testthat/edition: 3

93
packages/r/LICENSE generated Normal file
View File

@@ -0,0 +1,93 @@
Elastic License 2.0 (ELv2)
Copyright 2025-2026 Kreuzberg, Inc.
Acceptance
By using the software, you agree to all of the terms and conditions below.
Copyright License
The licensor grants you a non-exclusive, royalty-free, worldwide,
non-sublicensable, non-transferable license to use, copy, distribute, make
available, and prepare derivative works of the software, in each case subject to
the limitations and conditions below.
Limitations
You may not provide the software to third parties as a hosted or managed
service, where the service provides users with access to any substantial set of
the features or functionality of the software.
You may not move, change, disable, or circumvent the license key functionality
in the software, and you may not remove or obscure any functionality in the
software that is protected by the license key.
You may not alter, remove, or obscure any licensing, copyright, or other notices
of the licensor in the software. Any use of the licensor's trademarks is subject
to applicable law.
Patents
The licensor grants you a license, under any patent claims the licensor can
license, or becomes able to license, to make, have made, use, sell, offer for
sale, import and have imported the software, in each case subject to the
limitations and conditions in this license. This license does not cover any
patent claims that you cause to be infringed by modifications or additions to the
software. If you or your company make any written claim that the software
infringes or contributes to infringement of any patent, your patent license for
the software granted under these terms ends immediately. If your company makes
such a claim, your patent license ends immediately for work on behalf of your
company.
Notices
You must ensure that anyone who gets a copy of any part of the software from you
also gets a copy of these terms.
If you modify the software, you must include in any modified copies of the
software prominent notices stating that you have modified the software.
No Other Rights
These terms do not imply any licenses other than those expressly granted in
these terms.
Termination
If you use the software in violation of these terms, such use is not licensed,
and your licenses will automatically terminate. If the licensor provides you with
a notice of your violation, and you cease all violation of this license no later
than 30 days after you receive that notice, your licenses will be reinstated
retroactively. However, if you violate these terms after such reinstatement, any
additional violation of these terms will cause your licenses to terminate
automatically and permanently.
No Liability
As far as the law allows, the software comes as is, without any warranty or
condition, and the licensor will not be liable to you for any damages arising out
of these terms or the use or nature of the software, under any kind of legal
claim.
Definitions
The licensor is the entity offering these terms, and the software is the
software the licensor makes available under these terms, including any portion
of it.
you refers to the individual or entity agreeing to these terms.
your company is any legal entity, sole proprietorship, or other kind of
organization that you work for, plus all organizations that have control over,
are under the control of, or are under common control with that organization.
control means ownership of substantially all the assets of an entity, or the
power to direct its management and policies by vote, contract, or otherwise.
Control can be direct or indirect.
your licenses are all the licenses granted to you for the software under these
terms.
use means anything you do with the software requiring one of your licenses.
trademark means trademarks, service marks, and similar rights.

429
packages/r/NAMESPACE generated Normal file
View File

@@ -0,0 +1,429 @@
# Generated by alef — do not edit.
useDynLib(kreuzberg, .registration = TRUE)
export(extract_bytes)
export(extract_file)
export(extract_file_sync)
export(extract_bytes_sync)
export(batch_extract_files_sync)
export(batch_extract_bytes_sync)
export(batch_extract_files)
export(batch_extract_bytes)
export(detect_mime_type_from_bytes)
export(get_extensions_for_mime)
export(list_embedding_backends)
export(list_document_extractors)
export(list_ocr_backends)
export(list_post_processors)
export(list_renderers)
export(list_validators)
export(compare)
export(embed_texts_async)
export(render_pdf_page_to_png)
export(detect_mime_type)
export(embed_texts)
export(get_embedding_preset)
export(list_embedding_presets)
export(register_ocr_backend)
export(unregister_ocr_backend)
export(clear_ocr_backends)
export(register_post_processor)
export(unregister_post_processor)
export(clear_post_processors)
export(register_validator)
export(unregister_validator)
export(clear_validators)
export(register_embedding_backend)
export(unregister_embedding_backend)
export(clear_embedding_backends)
export(register_document_extractor)
export(unregister_document_extractor)
export(clear_document_extractors)
export(register_renderer)
export(unregister_renderer)
export(clear_renderers)
export(CacheStats)
S3method("$", CacheStats)
S3method("[[", CacheStats)
export(AccelerationConfig)
S3method("$", AccelerationConfig)
S3method("[[", AccelerationConfig)
export(ContentFilterConfig)
S3method("$", ContentFilterConfig)
S3method("[[", ContentFilterConfig)
export(EmailConfig)
S3method("$", EmailConfig)
S3method("[[", EmailConfig)
export(ExtractionConfig)
S3method("$", ExtractionConfig)
S3method("[[", ExtractionConfig)
export(FileExtractionConfig)
S3method("$", FileExtractionConfig)
S3method("[[", FileExtractionConfig)
export(BatchBytesItem)
S3method("$", BatchBytesItem)
S3method("[[", BatchBytesItem)
export(BatchFileItem)
S3method("$", BatchFileItem)
S3method("[[", BatchFileItem)
export(ImageExtractionConfig)
S3method("$", ImageExtractionConfig)
S3method("[[", ImageExtractionConfig)
export(TokenReductionOptions)
S3method("$", TokenReductionOptions)
S3method("[[", TokenReductionOptions)
export(LanguageDetectionConfig)
S3method("$", LanguageDetectionConfig)
S3method("[[", LanguageDetectionConfig)
export(HtmlOutputConfig)
S3method("$", HtmlOutputConfig)
S3method("[[", HtmlOutputConfig)
export(LayoutDetectionConfig)
S3method("$", LayoutDetectionConfig)
S3method("[[", LayoutDetectionConfig)
export(LlmConfig)
S3method("$", LlmConfig)
S3method("[[", LlmConfig)
export(StructuredExtractionConfig)
S3method("$", StructuredExtractionConfig)
S3method("[[", StructuredExtractionConfig)
export(OcrQualityThresholds)
S3method("$", OcrQualityThresholds)
S3method("[[", OcrQualityThresholds)
export(OcrPipelineStage)
S3method("$", OcrPipelineStage)
S3method("[[", OcrPipelineStage)
export(OcrConfig)
S3method("$", OcrConfig)
S3method("[[", OcrConfig)
export(PageConfig)
S3method("$", PageConfig)
S3method("[[", PageConfig)
export(PdfConfig)
S3method("$", PdfConfig)
S3method("[[", PdfConfig)
export(HierarchyConfig)
S3method("$", HierarchyConfig)
S3method("[[", HierarchyConfig)
export(PostProcessorConfig)
S3method("$", PostProcessorConfig)
S3method("[[", PostProcessorConfig)
export(ChunkingConfig)
S3method("$", ChunkingConfig)
S3method("[[", ChunkingConfig)
export(EmbeddingConfig)
S3method("$", EmbeddingConfig)
S3method("[[", EmbeddingConfig)
export(TreeSitterConfig)
S3method("$", TreeSitterConfig)
S3method("[[", TreeSitterConfig)
export(TreeSitterProcessConfig)
S3method("$", TreeSitterProcessConfig)
S3method("[[", TreeSitterProcessConfig)
export(SupportedFormat)
S3method("$", SupportedFormat)
S3method("[[", SupportedFormat)
export(ServerConfig)
S3method("$", ServerConfig)
S3method("[[", ServerConfig)
export(StructuredDataResult)
S3method("$", StructuredDataResult)
S3method("[[", StructuredDataResult)
export(DocxAppProperties)
S3method("$", DocxAppProperties)
S3method("[[", DocxAppProperties)
export(XlsxAppProperties)
S3method("$", XlsxAppProperties)
S3method("[[", XlsxAppProperties)
export(PptxAppProperties)
S3method("$", PptxAppProperties)
S3method("[[", PptxAppProperties)
export(CoreProperties)
S3method("$", CoreProperties)
S3method("[[", CoreProperties)
export(SecurityLimits)
S3method("$", SecurityLimits)
S3method("[[", SecurityLimits)
export(TokenReductionConfig)
S3method("$", TokenReductionConfig)
S3method("[[", TokenReductionConfig)
export(PdfAnnotation)
S3method("$", PdfAnnotation)
S3method("[[", PdfAnnotation)
export(InlineElement)
S3method("$", InlineElement)
S3method("[[", InlineElement)
export(DjotImage)
S3method("$", DjotImage)
S3method("[[", DjotImage)
export(DjotLink)
S3method("$", DjotLink)
S3method("[[", DjotLink)
export(DocumentRelationship)
S3method("$", DocumentRelationship)
S3method("[[", DocumentRelationship)
export(GridCell)
S3method("$", GridCell)
S3method("[[", GridCell)
export(TextAnnotation)
S3method("$", TextAnnotation)
S3method("[[", TextAnnotation)
export(ArchiveEntry)
S3method("$", ArchiveEntry)
S3method("[[", ArchiveEntry)
export(ProcessingWarning)
S3method("$", ProcessingWarning)
S3method("[[", ProcessingWarning)
export(LlmUsage)
S3method("$", LlmUsage)
S3method("[[", LlmUsage)
export(Chunk)
S3method("$", Chunk)
S3method("[[", Chunk)
export(HeadingLevel)
S3method("$", HeadingLevel)
S3method("[[", HeadingLevel)
export(ChunkMetadata)
S3method("$", ChunkMetadata)
S3method("[[", ChunkMetadata)
export(ExtractedImage)
S3method("$", ExtractedImage)
S3method("[[", ExtractedImage)
export(BoundingBox)
S3method("$", BoundingBox)
S3method("[[", BoundingBox)
export(ElementMetadata)
S3method("$", ElementMetadata)
S3method("[[", ElementMetadata)
export(Element)
S3method("$", Element)
S3method("[[", Element)
export(XmlExtractionResult)
S3method("$", XmlExtractionResult)
S3method("[[", XmlExtractionResult)
export(EmailAttachment)
S3method("$", EmailAttachment)
S3method("[[", EmailAttachment)
export(OcrTableBoundingBox)
S3method("$", OcrTableBoundingBox)
S3method("[[", OcrTableBoundingBox)
export(ImagePreprocessingConfig)
S3method("$", ImagePreprocessingConfig)
S3method("[[", ImagePreprocessingConfig)
export(TesseractConfig)
S3method("$", TesseractConfig)
S3method("[[", TesseractConfig)
export(ImagePreprocessingMetadata)
S3method("$", ImagePreprocessingMetadata)
S3method("[[", ImagePreprocessingMetadata)
export(Metadata)
S3method("$", Metadata)
S3method("[[", Metadata)
export(ExcelMetadata)
S3method("$", ExcelMetadata)
S3method("[[", ExcelMetadata)
export(EmailMetadata)
S3method("$", EmailMetadata)
S3method("[[", EmailMetadata)
export(ArchiveMetadata)
S3method("$", ArchiveMetadata)
S3method("[[", ArchiveMetadata)
export(ImageMetadata)
S3method("$", ImageMetadata)
S3method("[[", ImageMetadata)
export(XmlMetadata)
S3method("$", XmlMetadata)
S3method("[[", XmlMetadata)
export(HeaderMetadata)
S3method("$", HeaderMetadata)
S3method("[[", HeaderMetadata)
export(StructuredData)
S3method("$", StructuredData)
S3method("[[", StructuredData)
export(OcrMetadata)
S3method("$", OcrMetadata)
S3method("[[", OcrMetadata)
export(ErrorMetadata)
S3method("$", ErrorMetadata)
S3method("[[", ErrorMetadata)
export(PptxMetadata)
S3method("$", PptxMetadata)
S3method("[[", PptxMetadata)
export(DocxMetadata)
S3method("$", DocxMetadata)
S3method("[[", DocxMetadata)
export(CsvMetadata)
S3method("$", CsvMetadata)
S3method("[[", CsvMetadata)
export(BibtexMetadata)
S3method("$", BibtexMetadata)
S3method("[[", BibtexMetadata)
export(CitationMetadata)
S3method("$", CitationMetadata)
S3method("[[", CitationMetadata)
export(YearRange)
S3method("$", YearRange)
S3method("[[", YearRange)
export(FictionBookMetadata)
S3method("$", FictionBookMetadata)
S3method("[[", FictionBookMetadata)
export(DbfFieldInfo)
S3method("$", DbfFieldInfo)
S3method("[[", DbfFieldInfo)
export(ContributorRole)
S3method("$", ContributorRole)
S3method("[[", ContributorRole)
export(EpubMetadata)
S3method("$", EpubMetadata)
S3method("[[", EpubMetadata)
export(PstMetadata)
S3method("$", PstMetadata)
S3method("[[", PstMetadata)
export(OcrConfidence)
S3method("$", OcrConfidence)
S3method("[[", OcrConfidence)
export(OcrRotation)
S3method("$", OcrRotation)
S3method("[[", OcrRotation)
export(OcrElement)
S3method("$", OcrElement)
S3method("[[", OcrElement)
export(OcrElementConfig)
S3method("$", OcrElementConfig)
S3method("[[", OcrElementConfig)
export(PageBoundary)
S3method("$", PageBoundary)
S3method("[[", PageBoundary)
export(PageInfo)
S3method("$", PageInfo)
S3method("[[", PageInfo)
export(LayoutRegion)
S3method("$", LayoutRegion)
S3method("[[", LayoutRegion)
export(HierarchicalBlock)
S3method("$", HierarchicalBlock)
S3method("[[", HierarchicalBlock)
export(CellChange)
S3method("$", CellChange)
S3method("[[", CellChange)
export(DocumentRevision)
S3method("$", DocumentRevision)
S3method("[[", DocumentRevision)
export(TableCell)
S3method("$", TableCell)
S3method("[[", TableCell)
export(ExtractedUri)
S3method("$", ExtractedUri)
S3method("[[", ExtractedUri)
export(DetectResponse)
S3method("$", DetectResponse)
S3method("[[", DetectResponse)
export(DiffOptions)
S3method("$", DiffOptions)
S3method("[[", DiffOptions)
export(DiffHunk)
S3method("$", DiffHunk)
S3method("[[", DiffHunk)
export(EmbeddedDiff)
S3method("$", EmbeddedDiff)
S3method("[[", EmbeddedDiff)
export(EmbeddingPreset)
S3method("$", EmbeddingPreset)
S3method("[[", EmbeddingPreset)
export(YakeParams)
S3method("$", YakeParams)
S3method("[[", YakeParams)
export(RakeParams)
S3method("$", RakeParams)
S3method("[[", RakeParams)
export(KeywordConfig)
S3method("$", KeywordConfig)
S3method("[[", KeywordConfig)
export(Keyword)
S3method("$", Keyword)
S3method("[[", Keyword)
export(PaddleOcrConfig)
S3method("$", PaddleOcrConfig)
S3method("[[", PaddleOcrConfig)
export(ModelPaths)
S3method("$", ModelPaths)
S3method("[[", ModelPaths)
export(OrientationResult)
S3method("$", OrientationResult)
S3method("[[", OrientationResult)
export(BBox)
S3method("$", BBox)
S3method("[[", BBox)
export(LayoutDetection)
S3method("$", LayoutDetection)
S3method("[[", LayoutDetection)
export(EmbeddedFile)
S3method("$", EmbeddedFile)
S3method("[[", EmbeddedFile)
export(PdfMetadata)
S3method("$", PdfMetadata)
S3method("[[", PdfMetadata)
export(OutputFormat)
S3method("$", OutputFormat)
S3method("[[", OutputFormat)
export(FormatMetadata)
S3method("$", FormatMetadata)
S3method("[[", FormatMetadata)
export(DiffLine)
S3method("$", DiffLine)
S3method("[[", DiffLine)
export(ChunkSizing)
S3method("$", ChunkSizing)
S3method("[[", ChunkSizing)
export(EmbeddingModelType)
S3method("$", EmbeddingModelType)
S3method("[[", EmbeddingModelType)
export(NodeContent)
S3method("$", NodeContent)
S3method("[[", NodeContent)
export(AnnotationKind)
S3method("$", AnnotationKind)
S3method("[[", AnnotationKind)
export(OcrBoundingGeometry)
S3method("$", OcrBoundingGeometry)
S3method("[[", OcrBoundingGeometry)
export(RevisionAnchor)
S3method("$", RevisionAnchor)
S3method("[[", RevisionAnchor)
export(cors_allows_all)
export(is_empty)
export(is_origin_allowed)
export(listen_addr)
export(max_multipart_field_mb)
export(max_request_body_mb)
export(needs_image_processing)
export(with_angle_cls)
export(with_cache_dir)
export(with_det_db_box_thresh)
export(with_det_db_thresh)
export(with_det_db_unclip_ratio)
export(with_det_limit_side_len)
export(with_drop_score)
export(with_model_tier)
export(with_padding)
export(with_rec_batch_num)
export(with_table_detection)
S3method(needs_image_processing, ExtractionConfig)
S3method(listen_addr, ServerConfig)
S3method(cors_allows_all, ServerConfig)
S3method(is_origin_allowed, ServerConfig)
S3method(max_request_body_mb, ServerConfig)
S3method(max_multipart_field_mb, ServerConfig)
S3method(is_empty, Metadata)
S3method(with_cache_dir, PaddleOcrConfig)
S3method(with_table_detection, PaddleOcrConfig)
S3method(with_angle_cls, PaddleOcrConfig)
S3method(with_det_db_thresh, PaddleOcrConfig)
S3method(with_det_db_box_thresh, PaddleOcrConfig)
S3method(with_det_db_unclip_ratio, PaddleOcrConfig)
S3method(with_det_limit_side_len, PaddleOcrConfig)
S3method(with_rec_batch_num, PaddleOcrConfig)
S3method(with_drop_score, PaddleOcrConfig)
S3method(with_padding, PaddleOcrConfig)
S3method(with_model_tier, PaddleOcrConfig)

3518
packages/r/R/extendr-wrappers.R generated Normal file

File diff suppressed because it is too large.

8
packages/r/R/kreuzberg.R generated Normal file

@@ -0,0 +1,8 @@
# This file is auto-generated by alef — DO NOT EDIT.
# alef:hash:4e15143f4af1ae8bafbdb1506ef057da924484c66a19483966333558ad437e75
# To regenerate: alef generate
# To verify freshness: alef verify --exit-code
# Issues & docs: https://github.com/kreuzberg-dev/alef
#' @useDynLib kreuzberg, .registration = TRUE
NULL

395
packages/r/README.md generated Normal file

@@ -0,0 +1,395 @@
# R
<div align="center" style="display: flex; flex-wrap: wrap; gap: 8px; justify-content: center; margin: 20px 0;">
<a href="https://github.com/kreuzberg-dev/alef">
<img src="https://img.shields.io/badge/Bindings-alef%20%D7%90-007ec6" alt="Bindings">
</a>
<!-- Language Bindings -->
<a href="https://crates.io/crates/kreuzberg">
<img src="https://img.shields.io/crates/v/kreuzberg?label=Rust&color=007ec6" alt="Rust">
</a>
<a href="https://pypi.org/project/kreuzberg/">
<img src="https://img.shields.io/pypi/v/kreuzberg?label=Python&color=007ec6" alt="Python">
</a>
<a href="https://www.npmjs.com/package/@kreuzberg/node">
<img src="https://img.shields.io/npm/v/@kreuzberg/node?label=Node.js&color=007ec6" alt="Node.js">
</a>
<a href="https://www.npmjs.com/package/@kreuzberg/wasm">
<img src="https://img.shields.io/npm/v/@kreuzberg/wasm?label=WASM&color=007ec6" alt="WASM">
</a>
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg">
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg?label=Java&color=007ec6" alt="Java">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/go/v5">
<img src="https://img.shields.io/github/v/tag/kreuzberg-dev/kreuzberg?label=Go&color=007ec6&filter=v5*" alt="Go">
</a>
<a href="https://www.nuget.org/packages/Kreuzberg/">
<img src="https://img.shields.io/nuget/v/Kreuzberg?label=C%23&color=007ec6" alt="C#">
</a>
<a href="https://packagist.org/packages/kreuzberg/kreuzberg">
<img src="https://img.shields.io/packagist/v/kreuzberg/kreuzberg?label=PHP&color=007ec6" alt="PHP">
</a>
<a href="https://rubygems.org/gems/kreuzberg">
<img src="https://img.shields.io/gem/v/kreuzberg?label=Ruby&color=007ec6" alt="Ruby">
</a>
<a href="https://hex.pm/packages/kreuzberg">
<img src="https://img.shields.io/hexpm/v/kreuzberg?label=Elixir&color=007ec6" alt="Elixir">
</a>
<a href="https://kreuzberg-dev.r-universe.dev/kreuzberg">
<img src="https://img.shields.io/badge/R-kreuzberg-007ec6" alt="R">
</a>
<a href="https://pub.dev/packages/kreuzberg">
<img src="https://img.shields.io/pub/v/kreuzberg?label=Dart&color=007ec6" alt="Dart">
</a>
<a href="https://central.sonatype.com/artifact/dev.kreuzberg/kreuzberg-android">
<img src="https://img.shields.io/maven-central/v/dev.kreuzberg/kreuzberg-android?label=Kotlin&color=007ec6" alt="Kotlin">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/swift">
<img src="https://img.shields.io/badge/Swift-SPM-007ec6" alt="Swift">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/tree/main/packages/zig">
<img src="https://img.shields.io/badge/Zig-package-007ec6" alt="Zig">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/releases">
<img src="https://img.shields.io/badge/C-FFI-007ec6" alt="C FFI">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/kreuzberg">
<img src="https://img.shields.io/badge/Docker-ghcr.io-007ec6?logo=docker&logoColor=white" alt="Docker">
</a>
<a href="https://github.com/kreuzberg-dev/kreuzberg/pkgs/container/charts%2Fkreuzberg">
<img src="https://img.shields.io/badge/Helm-ghcr.io-007ec6?logo=helm&logoColor=white" alt="Helm">
</a>
<!-- Project Info -->
<a href="https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE">
<img src="https://img.shields.io/badge/License-Elastic--2.0-007ec6" alt="License">
</a>
<a href="https://docs.kreuzberg.dev">
<img src="https://img.shields.io/badge/Docs-kreuzberg-007ec6" alt="Documentation">
</a>
<a href="https://huggingface.co/Kreuzberg">
<img src="https://img.shields.io/badge/Hugging%20Face-Kreuzberg-007ec6" alt="Hugging Face">
</a>
</div>
<div align="center" style="margin: 24px 0 0;">
<a href="https://kreuzberg.dev">
<img alt="Kreuzberg" src="https://github.com/user-attachments/assets/419fc06c-8313-4324-b159-4b4d3cfce5c0" />
</a>
</div>
<div align="center" style="display: flex; flex-wrap: wrap; gap: 12px; justify-content: center; margin: 28px 0 24px;">
<a href="https://discord.gg/xt9WY3GnKR">
<img height="22" src="https://img.shields.io/badge/Discord-Chat-007ec6?logo=discord&logoColor=white" alt="Join Discord">
</a>
<a href="https://docs.kreuzberg.dev/demo.html">
<img height="22" src="https://img.shields.io/badge/Live%20Demo-Open-007ec6?logo=webassembly&logoColor=white" alt="Live Demo">
</a>
</div>
Extract text, tables, images, and metadata from 90+ file formats, including PDF, Office documents, and images, with code intelligence for 300+ programming languages. R bindings with a native R API, data frame integration, and high-performance document extraction.
## What This Package Provides
- **Document intelligence core** — extract text, tables, images, metadata, entities, keywords, and code intelligence from one API.
- **Format coverage** — PDF, Office, images, HTML/XML, email, archives, notebooks, citations, scientific formats, and plain text.
- **OCR choices** — Tesseract, PaddleOCR, EasyOCR where supported, VLM OCR through liter-llm, and plugin hooks for custom backends.
- **Same engine as every binding** — Rust, Python, Node.js, Go, Java, PHP, Ruby, .NET, Elixir, R, WASM, Kotlin Android, Swift, Dart, Zig, and C FFI share the same Rust implementation.
- **R package** — data workflow binding with data-frame-friendly extracted structures.
## Installation
### Package Installation
Install from the kreuzberg R-universe:
```r
install.packages("kreuzberg",
  repos = c("https://kreuzberg-dev.r-universe.dev", getOption("repos")))
```
### System Requirements
- **R 4.2+** required (extendr bindings; the package `DESCRIPTION` declares `Depends: R (>= 4.2)`)
- Optional: [ONNX Runtime](https://github.com/microsoft/onnxruntime/releases) version 1.22.x for embeddings support
- Optional: [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for OCR functionality
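
The requirements above can be checked from an R session before installing. The sketch below is a hypothetical helper, not part of kreuzberg: it compares the running R version against the `R (>= 4.2)` floor declared in the package `DESCRIPTION` and looks for `cargo`, `rustc`, and `tesseract` on the `PATH`. It only tests for presence, so it does not verify the `rustc (>= 1.91)` version, and whether a Rust toolchain is needed at all when installing from R-universe is not stated here.

```r
# Hypothetical helper (not part of kreuzberg): report whether the documented
# prerequisites are visible from the current R session.
check_prerequisites <- function(min_r = "4.2") {
  # Sys.which() returns "" for tools that are not found on the PATH.
  on_path <- function(tool) nzchar(unname(Sys.which(tool)))
  list(
    r_ok = getRversion() >= min_r,
    cargo = on_path("cargo"),
    rustc = on_path("rustc"),
    tesseract = on_path("tesseract")
  )
}

str(check_prerequisites())
```

A `FALSE` for `tesseract` only means the binary is not on the `PATH`; it says nothing about how kreuzberg itself locates its OCR backend.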
## Quick Start
### Basic Extraction
Extract text, metadata, and structure from any supported document format:
```r
library(kreuzberg)
# Extract text from a PDF file
result <- extract_file_sync("document.pdf")
cat(result$content)
```
### Common Use Cases
#### Extract with Custom Configuration
Most use cases benefit from configuration to control extraction behavior:
**With OCR (for scanned documents):**
```r title="R"
library(kreuzberg)
# Configure Tesseract OCR
config <- list(
  force_ocr = TRUE,
  ocr = list(backend = "tesseract", language = "eng")
)
# Extract text from a scanned image
json <- extract_file_sync("scan.png", "image/png", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
```
#### Table Extraction
See [Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/) for table extraction options.
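#### Processing Multiple Files
The example below shows the single-file call with a reusable OCR configuration. The package also exports `batch_extract_files_sync` and `batch_extract_bytes_sync` for batch work.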
```r title="R"
library(kreuzberg)
# Configure OCR settings via a plain list mirroring the config JSON.
config <- list(
  force_ocr = TRUE,
  ocr = list(
    backend = "tesseract",
    language = "eng"
  )
)
# Extract an image file with OCR enabled
json <- extract_file_sync("image.png", "image/png", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)
cat("Extracted text from image:\n")
cat(result$content)
```
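
The single-file call above can be repeated over several paths. The sketch below is a minimal illustration, not the package's batch API: `extract_many` is a hypothetical helper, and it assumes the `extract_file_sync(path, mime_type, config)` call shape used in this README. Because the examples here differ on whether the return value is a JSON string or an already-parsed object, the helper parses only when it receives a single string.

```r
# Hypothetical helper (not part of kreuzberg): apply a single-file extractor
# to several paths and collect one result per path.
# `extract` defaults to kreuzberg::extract_file_sync, called with the
# (path, mime_type, config) shape assumed from the example above.
extract_many <- function(paths, mime_type, config,
                         extract = kreuzberg::extract_file_sync) {
  results <- lapply(paths, function(path) {
    tryCatch({
      out <- extract(path, mime_type, config)
      # Assumption: the call may return a JSON string; parse it if so.
      if (is.character(out) && length(out) == 1L) {
        out <- jsonlite::fromJSON(out, simplifyVector = FALSE)
      }
      out
    }, error = function(e) {
      # Keep going on failure and record the message for this path.
      list(error = conditionMessage(e))
    })
  })
  names(results) <- paths
  results
}
```

Usage would look like `extract_many(c("a.png", "b.png"), "image/png", config)`. Passing a different function as `extract` makes the loop easy to exercise without real documents.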
#### Async Processing
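The package exports `extract_file` and `extract_bytes` alongside the `_sync` variants. The example below uses the synchronous call and inspects the result: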
```r title="R"
library(kreuzberg)
# Extract a file and inspect the result
result <- extract_file_sync("document.pdf")
# Print result information
cat(sprintf("MIME type: %s\n", mime_type(result)))
cat(sprintf("Content length: %d characters\n", nchar(content(result))))
cat(sprintf("Page count: %d\n", page_count(result)))
# View additional metadata
cat(sprintf("Detected language: %s\n", detected_language(result)))
```
### Next Steps
- **[Installation Guide](https://docs.kreuzberg.dev/getting-started/installation/)** - Platform-specific setup
- **[API Documentation](https://docs.kreuzberg.dev/reference/api-python/)** - Complete API reference
- **[Examples & Guides](https://docs.kreuzberg.dev/)** - Full code examples and usage guides
- **[Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/)** - Advanced configuration options
## Features
### Supported File Formats (90+)
90+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
#### Office Documents
| Category | Formats | Capabilities |
|----------|---------|--------------|
| **Word Processing** | `.docx`, `.docm`, `.dotx`, `.dotm`, `.dot`, `.odt` | Full text, tables, images, metadata, styles |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.xltx`, `.xlt`, `.ods` | Sheet data, formulas, cell metadata, charts |
| **Presentations** | `.pptx`, `.pptm`, `.ppsx`, `.potx`, `.potm`, `.pot`, `.ppt` | Slides, speaker notes, images, metadata |
| **PDF** | `.pdf` | Text, tables, images, metadata, OCR support |
| **eBooks** | `.epub`, `.fb2` | Chapters, metadata, embedded resources |
| **Database** | `.dbf` | Table data extraction, field type support |
| **Hangul** | `.hwp`, `.hwpx` | Korean document format, text extraction |
#### Images (OCR-Enabled)
| Category | Formats | Features |
|----------|---------|----------|
| **Raster** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif` | OCR, table detection, EXIF metadata, dimensions, color space |
| **Advanced** | `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm` | OCR via hayro-jpeg2000 (pure Rust decoder), JBIG2 support, table detection, format-specific metadata |
| **Vector** | `.svg` | DOM parsing, embedded text, graphics metadata |
#### Web & Data
| Category | Formats | Features |
|----------|---------|----------|
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml`, `.svg` | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| **Structured Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` | Schema detection, nested structures, validation |
| **Text & Markdown** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` | CommonMark, GFM, Djot, reStructuredText, Org Mode |
#### Email & Archives
| Category | Formats | Features |
|----------|---------|----------|
| **Email** | `.eml`, `.msg` | Headers, body (HTML/plain), attachments, threading |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` | File listing, nested archives, metadata |
#### Academic & Scientific
| Category | Formats | Features |
|----------|---------|----------|
| **Citations** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl` | Structured parsing of BibTeX, RIS, PubMed/MEDLINE, EndNote XML, and CSL JSON |
| **Scientific** | `.tex`, `.latex`, `.typst`, `.jats`, `.ipynb`, `.docbook` | LaTeX, Jupyter notebooks, PubMed JATS |
| **Documentation** | `.opml`, `.pod`, `.mdoc`, `.troff` | Technical documentation formats |
#### Code Intelligence (300+ Languages)
| Feature | Description |
|---------|-------------|
| **Structure Extraction** | Functions, classes, methods, structs, interfaces, enums |
| **Import/Export Analysis** | Module dependencies, re-exports, wildcard imports |
| **Symbol Extraction** | Variables, constants, type aliases, properties |
| **Docstring Parsing** | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ other formats |
| **Diagnostics** | Parse errors with line/column positions |
| **Syntax-Aware Chunking** | Split code by semantic boundaries, not arbitrary byte offsets |
Powered by [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — [documentation](https://docs.tree-sitter-language-pack.kreuzberg.dev).
**[Complete Format Reference](https://docs.kreuzberg.dev/reference/formats/)**
### Key Capabilities
- **Text Extraction** - Extract all text content with position and formatting information
- **Metadata Extraction** - Retrieve document properties, creation date, author, etc.
- **Table Extraction** - Parse tables with structure and cell content preservation
- **Image Extraction** - Extract embedded images and render page previews
- **OCR Support** - Integrate multiple OCR backends for scanned documents
- **Plugin System** - Extensible post-processing for custom text transformation
- **Embeddings** - Generate vector embeddings using ONNX Runtime models
- **Batch Processing** - Efficiently process multiple documents in parallel
- **Memory Efficient** - Stream large files without loading entirely into memory
- **Language Detection** - Detect the languages present in a document
- **Code Intelligence** - Extract structure, imports, exports, symbols, and docstrings from [300+ programming languages](https://docs.tree-sitter-language-pack.kreuzberg.dev) via tree-sitter
- **Configuration** - Fine-grained control over extraction behavior
### Performance Characteristics
| Format | Speed | Memory | Notes |
|--------|-------|--------|-------|
| **PDF (text)** | 10-100 MB/s | ~50 MB per doc | Fastest extraction |
| **Office docs** | 20-200 MB/s | ~100 MB per doc | DOCX, XLSX, PPTX |
| **Images (OCR)** | 1-5 MB/s | Variable | Depends on OCR backend |
| **Archives** | 5-50 MB/s | ~200 MB per doc | ZIP, TAR, etc. |
| **Web formats** | 50-200 MB/s | Streaming | HTML, XML, JSON |
## OCR Support
Kreuzberg supports multiple OCR backends for extracting text from scanned documents and images:
- **Tesseract**
- **PaddleOCR**
### OCR Configuration Example
```r title="R"
library(kreuzberg)
# Configure Tesseract OCR
config <- list(
  force_ocr = TRUE,
  ocr = list(backend = "tesseract", language = "eng")
)
# Extract text from a scanned image
json <- extract_file_sync("scan.png", "image/png", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)
cat(sprintf("Extracted %d characters\n", nchar(result$content)))
cat("Content preview:\n")
cat(substr(result$content, 1, 200))
```
## Plugin System
Kreuzberg supports extensible post-processing plugins for custom text transformation and filtering.
For detailed plugin documentation, visit [Plugin System Guide](https://docs.kreuzberg.dev/guides/plugins/).
## Embeddings Support
Generate vector embeddings for extracted text with ONNX Runtime models. ONNX Runtime is not bundled and must be installed separately (version 1.22.x, as listed under System Requirements).
**[Embeddings Guide](https://docs.kreuzberg.dev/features/#embeddings)**
## Batch Processing
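Process multiple documents efficiently. The example below shows the per-file call (OCR on a single image); repeat it for each file in a batch: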
```r title="R"
library(kreuzberg)
# Configure OCR settings via a plain list mirroring the config JSON.
config <- list(
  force_ocr = TRUE,
  ocr = list(
    backend = "tesseract",
    language = "eng"
  )
)
# Extract an image file with OCR enabled
json <- extract_file_sync("image.png", "image/png", config)
result <- jsonlite::fromJSON(json, simplifyVector = FALSE)
cat("Extracted text from image:\n")
cat(result$content)
```
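The example above handles a single image. As a minimal sketch of applying the same call to several files, the helper below loops over paths and MIME types and keeps going when one file fails. `extract_many` and its `extract_fn` argument are hypothetical names introduced for illustration, not part of the kreuzberg API; the sketch assumes only that `extract_file_sync(path, mime_type, config)` returns a JSON string, as in the example above. It runs sequentially, so it does not demonstrate parallel processing.

```r title="R"
# Hypothetical helper (not part of the kreuzberg API): run an extractor over
# several files and collect the parsed results, skipping files that fail.
extract_many <- function(paths, mime_types, config,
                         extract_fn = kreuzberg::extract_file_sync) {
  stopifnot(length(paths) == length(mime_types))
  results <- Map(function(path, mime_type) {
    tryCatch(
      # Assumption: extract_fn returns a JSON string, as extract_file_sync does above.
      jsonlite::fromJSON(extract_fn(path, mime_type, config), simplifyVector = FALSE),
      error = function(e) {
        warning(sprintf("Failed to extract %s: %s", path, conditionMessage(e)))
        NULL
      }
    )
  }, paths, mime_types)
  names(results) <- paths
  results
}

# Usage (file names are placeholders):
# config <- list(ocr = list(backend = "tesseract", language = "eng"))
# results <- extract_many(
#   c("report.pdf", "scan.png"),
#   c("application/pdf", "image/png"),
#   config
# )
# cat(results[["scan.png"]]$content)
```

Failed files come back as `NULL` entries with a warning, so one unreadable document does not abort the rest of the batch. Passing a different `extract_fn` makes the loop easy to test without real documents.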
## Configuration
For advanced configuration options, including language detection, table extraction, and OCR settings, see:
**[Configuration Guide](https://docs.kreuzberg.dev/guides/configuration/)**
## Documentation
- **[Official Documentation](https://docs.kreuzberg.dev/)**
- **[API Reference](https://docs.kreuzberg.dev/reference/api-python/)**
- **[Examples & Guides](https://docs.kreuzberg.dev/)**
## Contributing
Contributions are welcome! See the [Contributing Guide](https://github.com/kreuzberg-dev/kreuzberg/blob/main/CONTRIBUTING.md).
## Part of Kreuzberg.dev
- [Kreuzberg Cloud](https://github.com/kreuzberg-dev/kreuzberg-cloud) — managed extraction API with SDKs, dashboards, and observability.
- [kreuzcrawl](https://github.com/kreuzberg-dev/kreuzcrawl) — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- [html-to-markdown](https://github.com/kreuzberg-dev/html-to-markdown) — fast, lossless HTML→Markdown engine.
- [liter-llm](https://github.com/kreuzberg-dev/liter-llm) — universal LLM API client with native bindings for 14 languages and 143 providers.
- [tree-sitter-language-pack](https://github.com/kreuzberg-dev/tree-sitter-language-pack) — tree-sitter grammars and code-intelligence primitives.
- [alef](https://github.com/kreuzberg-dev/alef) — the polyglot binding generator that produces this README and all per-language bindings.
- [Discord](https://discord.gg/xt9WY3GnKR) — community, roadmap, announcements.
## License
Elastic-2.0 License — see [LICENSE](../../LICENSE) for details.
## Support
- **Discord Community**: [Join our Discord](https://discord.gg/xt9WY3GnKR)
- **GitHub Issues**: [Report bugs](https://github.com/kreuzberg-dev/kreuzberg/issues)
- **Discussions**: [Ask questions](https://github.com/kreuzberg-dev/kreuzberg/discussions)

14
packages/r/src/Makevars generated Normal file

@@ -0,0 +1,14 @@
CARGO_BUILD_ARGS = --release
STATLIB = ./rust/target/release/libkreuzberg_r.a
PKG_LIBS = -L./rust/target/release -lkreuzberg_r $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)

all: $(SHLIB)

$(STATLIB):
	cargo build --manifest-path ./rust/Cargo.toml $(CARGO_BUILD_ARGS)

$(SHLIB): $(STATLIB)

clean:
	rm -f $(SHLIB) $(STATLIB)
	cargo clean --manifest-path ./rust/Cargo.toml

14
packages/r/src/Makevars.in generated Normal file

@@ -0,0 +1,14 @@
CARGO_BUILD_ARGS = --release
STATLIB = ./rust/target/release/libkreuzberg_r.a
PKG_LIBS = -L./rust/target/release -lkreuzberg_r $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)

all: $(SHLIB)

$(STATLIB):
	cargo build --manifest-path ./rust/Cargo.toml $(CARGO_BUILD_ARGS)

$(SHLIB): $(STATLIB)

clean:
	rm -f $(SHLIB) $(STATLIB)
	cargo clean --manifest-path ./rust/Cargo.toml

14
packages/r/src/Makevars.win.in generated Normal file

@@ -0,0 +1,14 @@
CARGO_BUILD_ARGS = --release --target x86_64-pc-windows-gnu
STATLIB = ./rust/target/x86_64-pc-windows-gnu/release/libkreuzberg_r.a
PKG_LIBS = -L./rust/target/x86_64-pc-windows-gnu/release -lkreuzberg_r -lws2_32 -ladvapi32 -luserenv -lbcrypt -lntdll

all: $(SHLIB)

$(STATLIB):
	cargo build --manifest-path ./rust/Cargo.toml $(CARGO_BUILD_ARGS)

$(SHLIB): $(STATLIB)

clean:
	rm -f $(SHLIB) $(STATLIB)
	cargo clean --manifest-path ./rust/Cargo.toml

9
packages/r/src/entrypoint.c generated Normal file

@@ -0,0 +1,9 @@
// Generated entrypoint: forwards to the extendr-generated init function.
// Do not edit — regenerate with `alef generate`.
#include <R_ext/Visibility.h>

void R_init_kreuzberg_extendr(void *dll);

void attribute_visible R_init_kreuzberg(void *dll) {
    R_init_kreuzberg_extendr(dll);
}

7387
packages/r/src/rust/Cargo.lock generated Normal file

File diff suppressed because it is too large

20
packages/r/src/rust/Cargo.toml generated Normal file

@@ -0,0 +1,20 @@
[package]
name = "kreuzberg-r"
version = "5.0.0-rc.3"
edition = "2024"
license = "Elastic-2.0"
description = "High-performance document intelligence library"
readme = false
keywords = ["document", "extraction", "ocr", "pdf", "text"]
categories = ["text-processing"]

[lib]
crate-type = ["staticlib", "lib"]

[dependencies]
async-trait = "0.1"
extendr-api = "0.9"
kreuzberg = { version = "5.0.0-rc.3", path = "../../../../crates/kreuzberg", features = ["full", "pdf", "ocr", "paddle-ocr", "paddle-ocr-types", "layout-detection", "layout-types", "embeddings", "embedding-presets", "chunking", "keywords-yake", "keywords-rake", "language-detection", "html", "tree-sitter", "office", "email", "archives", "stopwords", "auto-rotate", "auto-rotate-types", "tokio-runtime", "api", "mcp", "liter-llm", "quality"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tokio = { version = "1", features = ["rt-multi-thread"] }

93
packages/r/src/rust/src/LICENSE generated Normal file

@@ -0,0 +1,93 @@
Elastic License 2.0 (ELv2)
Copyright 2025-2026 Kreuzberg, Inc.
Acceptance
By using the software, you agree to all of the terms and conditions below.
Copyright License
The licensor grants you a non-exclusive, royalty-free, worldwide,
non-sublicensable, non-transferable license to use, copy, distribute, make
available, and prepare derivative works of the software, in each case subject to
the limitations and conditions below.
Limitations
You may not provide the software to third parties as a hosted or managed
service, where the service provides users with access to any substantial set of
the features or functionality of the software.
You may not move, change, disable, or circumvent the license key functionality
in the software, and you may not remove or obscure any functionality in the
software that is protected by the license key.
You may not alter, remove, or obscure any licensing, copyright, or other notices
of the licensor in the software. Any use of the licensor's trademarks is subject
to applicable law.
Patents
The licensor grants you a license, under any patent claims the licensor can
license, or becomes able to license, to make, have made, use, sell, offer for
sale, import and have imported the software, in each case subject to the
limitations and conditions in this license. This license does not cover any
patent claims that you cause to be infringed by modifications or additions to the
software. If you or your company make any written claim that the software
infringes or contributes to infringement of any patent, your patent license for
the software granted under these terms ends immediately. If your company makes
such a claim, your patent license ends immediately for work on behalf of your
company.
Notices
You must ensure that anyone who gets a copy of any part of the software from you
also gets a copy of these terms.
If you modify the software, you must include in any modified copies of the
software prominent notices stating that you have modified the software.
No Other Rights
These terms do not imply any licenses other than those expressly granted in
these terms.
Termination
If you use the software in violation of these terms, such use is not licensed,
and your licenses will automatically terminate. If the licensor provides you with
a notice of your violation, and you cease all violation of this license no later
than 30 days after you receive that notice, your licenses will be reinstated
retroactively. However, if you violate these terms after such reinstatement, any
additional violation of these terms will cause your licenses to terminate
automatically and permanently.
No Liability
As far as the law allows, the software comes as is, without any warranty or
condition, and the licensor will not be liable to you for any damages arising out
of these terms or the use or nature of the software, under any kind of legal
claim.
Definitions
The licensor is the entity offering these terms, and the software is the
software the licensor makes available under these terms, including any portion
of it.
you refers to the individual or entity agreeing to these terms.
your company is any legal entity, sole proprietorship, or other kind of
organization that you work for, plus all organizations that have control over,
are under the control of, or are under common control with that organization.
control means ownership of substantially all the assets of an entity, or the
power to direct its management and policies by vote, contract, or otherwise.
Control can be direct or indirect.
your licenses are all the licenses granted to you for the software under these
terms.
use means anything you do with the software requiring one of your licenses.
trademark means trademarks, service marks, and similar rights.

13407
packages/r/src/rust/src/lib.rs generated Normal file

File diff suppressed because it is too large