Henrik Jess Nielsen b4c07d3693
2026-06-01 23:40:55 +02:00

Attributions

This document acknowledges the sources of test documents and baseline data used in the Kreuzberg project.

Pandoc Test Suite

Test documents and reference baseline outputs derived from the Pandoc test suite:

  • Source: https://github.com/jgm/pandoc
  • License: GPL-2.0-or-later
  • Usage: Test documents and reference baselines only (no code copied from Pandoc)
  • Attribution: John MacFarlane and Pandoc contributors
  • Purpose: Baseline reference testing - used to validate that our native Rust extractors work correctly on the same documents that Pandoc processes

Test Documents from Pandoc

The following test documents were copied from the Pandoc repository to /test_documents/:

Org Mode

  • org-select-tags.org - SELECT_TAGS and EXCLUDE_TAGS testing
  • pandoc-tables.org - Org Mode table formats
  • pandoc-writer.org - Comprehensive Pandoc test suite in Org Mode format

Typst

  • typst-reader.typ - Fibonacci sequence with mathematical formulas
  • undergradmath.typ - Comprehensive undergraduate mathematics document (16KB)

DocBook

  • docbook-chapter.docbook - Recursive section hierarchy (7 nested levels)
  • docbook-reader.docbook - Comprehensive DocBook 4.4 test suite (36KB, 1704 lines)
  • docbook-xref.docbook - Cross-reference (xref) functionality testing

JATS

  • jats-reader.xml - Comprehensive JATS (Z39.96) Journal Archiving test document (38KB, 1460 lines)

FictionBook

  • test_documents/fictionbook/pandoc/ - 13 FictionBook test files including:
    • basic.fb2 - Basic FictionBook structure
    • images-embedded.fb2 - Embedded base64 images
    • math.fb2 - Mathematical content
    • meta.fb2 - Document metadata testing
    • reader/emphasis.fb2 - Text emphasis testing
    • reader/epigraph.fb2 - Epigraph/quote elements
    • reader/meta.fb2 - Document metadata and title info
    • reader/notes.fb2 - Footnotes/endnotes with cross-references
    • reader/poem.fb2 - Poem/verse structure
    • reader/titles.fb2 - Section titles and heading hierarchy
    • And others

OPML

  • opml-reader.opml - OPML 2.0 outline structure (US states example)
  • pandoc-writer.opml - Comprehensive Pandoc test suite in OPML format

Baseline Outputs Generated

For each test document listed above, three baseline outputs were generated using Pandoc 3.8.3:

  1. Plain Text (*_pandoc_baseline.txt) - Raw text content extraction
  2. JSON Metadata (*_pandoc_meta.json) - Full Pandoc AST with document structure and metadata
  3. Markdown (*_pandoc_markdown.md) - Markdown representation for format comparison

Total: 132 baseline files for 44 documents across 6 formats
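
The naming scheme above can be sketched in Rust. This is a minimal illustration, assuming the `*` in each pattern stands for the document's file stem; the helper name `baseline_names` is hypothetical and not part of Kreuzberg.

```rust
use std::path::Path;

/// The three baseline suffixes listed above.
const BASELINE_SUFFIXES: [&str; 3] = [
    "_pandoc_baseline.txt",
    "_pandoc_meta.json",
    "_pandoc_markdown.md",
];

/// Derive the expected baseline file names for one test document.
/// Assumption: the `*` in each pattern is the document's file stem.
fn baseline_names(document: &str) -> Vec<String> {
    let stem = Path::new(document)
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or(document);
    BASELINE_SUFFIXES
        .iter()
        .map(|suffix| format!("{stem}{suffix}"))
        .collect()
}

fn main() {
    for name in baseline_names("pandoc-tables.org") {
        println!("{name}");
    }
    // 44 documents with 3 outputs each matches the stated total of 132.
    assert_eq!(44 * BASELINE_SUFFIXES.len(), 132);
}
```

A check like this could be used in a test to confirm that every document has all three baselines on disk.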

GPL Compliance Statement

We acknowledge that Pandoc is licensed under GPL-2.0-or-later. We have:

  • ✓ Used Pandoc's test documents (test data is allowed under GPL)
  • ✓ Generated baseline outputs using Pandoc for comparison purposes
  • ✓ NOT copied any Pandoc source code
  • ✓ Implemented our extractors independently in Rust
  • ✓ Used Pandoc only as a behavioral baseline for testing

Our Rust extractors are independently implemented and do not contain any GPL-licensed code from Pandoc.

Verification

Baselines can be regenerated from the test documents at any time using:

./generate_pandoc_baselines.sh

This script processes all test documents and generates fresh baselines using the installed version of Pandoc.
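
The script's contents are not reproduced here. As a rough sketch of what regenerating one document's baselines might involve, the Rust snippet below builds (without running) three pandoc invocations using pandoc's `plain`, `json`, and `markdown` writers. The mapping of writers to suffixes, the input path, and the helper names are assumptions for illustration only.

```rust
use std::process::Command;

/// Pandoc writer name paired with the baseline suffix it is assumed to produce.
const OUTPUTS: [(&str, &str); 3] = [
    ("plain", "_pandoc_baseline.txt"),
    ("json", "_pandoc_meta.json"),
    ("markdown", "_pandoc_markdown.md"),
];

/// Build (but do not run) one pandoc invocation.
fn pandoc_command(input: &str, writer: &str, output: &str) -> Command {
    let mut cmd = Command::new("pandoc");
    cmd.arg(input).args(["-t", writer]).args(["-o", output]);
    cmd
}

/// Render the arguments as strings, e.g. for logging or a dry run.
fn args_of(cmd: &Command) -> Vec<String> {
    cmd.get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect()
}

fn main() {
    // Hypothetical input path and stem, chosen only for the example.
    let (input, stem) = ("test_documents/pandoc-tables.org", "pandoc-tables");
    for (writer, suffix) in OUTPUTS {
        let cmd = pandoc_command(input, writer, &format!("{stem}{suffix}"));
        // Dry run: print the command line instead of executing it.
        println!("pandoc {}", args_of(&cmd).join(" "));
    }
}
```

Calling `.status()` on each command would execute it, which requires pandoc to be installed.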

docx-lite

DOCX XML parser vendored into crates/kreuzberg/src/extraction/docx/parser.rs:

  • Source: https://github.com/v-lawyer/docx-lite
  • License: MIT OR Apache-2.0
  • Authors: V-Lawyer Team
  • Version: 0.2.0 (vendored with modifications)
  • Usage: DOCX text extraction parser inlined into kreuzberg core
  • Modifications:
    • Fixed Paragraph::to_text() joining text runs without whitespace (#359)
    • Adapted to kreuzberg's quick-xml v0.39 and zip v7.x APIs
    • Removed file-path based APIs (only bytes/reader needed)

hwpers

Vendored HWP text extraction code from the hwpers crate:

  • Source: https://github.com/Indosaram/hwpers
  • License: MIT OR Apache-2.0
  • Authors: HWP Parser Contributors
  • Vendored Version: 0.5.0
  • Location: crates/kreuzberg/src/extraction/hwp/
  • Purpose: Text extraction from Korean Hangul Word Processor (.hwp) files
  • Scope: Minimal subset — CFB reader, binary record parser, text extraction only
  • Excluded: HWPX (XML/ZIP), writer, renderer, crypto, preview modules

paddle-ocr-rs

Source code vendored from the paddle-ocr-rs crate for PaddleOCR integration via ONNX Runtime.

Vendored Files

The following source files were vendored from paddle-ocr-rs:

  • ocr_lite.rs - Core OCR pipeline and high-level API
  • db_net.rs - DBNet text detection network
  • crnn_net.rs - CRNN text recognition network
  • angle_net.rs - Text angle detection network
  • base_net.rs - Base network trait
  • ocr_utils.rs - Image preprocessing utilities
  • ocr_result.rs - Result type definitions
  • scale_param.rs - Scaling parameter calculations
  • ocr_error.rs - Error type definitions

Modifications

The vendored code has been modified for Kreuzberg integration:

  • Updated to Rust 2024 edition
  • Aligned with Kreuzberg workspace dependencies
  • License changed to MIT with dual copyright (original author retained)

License Compatibility

The original Apache-2.0 license is compatible with MIT relicensing. The original copyright and attribution are preserved in the vendored crate's LICENSE file.


fastembed-rs

Text embedding inference pipeline vendored into crates/kreuzberg/src/embeddings/engine.rs:

  • Source: https://github.com/Anush008/fastembed-rs
  • License: Apache-2.0
  • Author: Anush008 and contributors
  • Vendored Version: Based on 0.2.x
  • Location: crates/kreuzberg/src/embeddings/engine.rs
  • Purpose: ONNX-based text embedding inference with thread-safe concurrent embedding generation

Modifications

The vendored code has been modified from the original fastembed-rs:

  • Changed embed() method signature from &mut self to &self for thread-safe concurrent inference without mutex contention
  • Adapted to Kreuzberg's ONNX Runtime integration and error handling
  • Integrated with Kreuzberg's embedding configuration and model management
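
To illustrate why the `&self` signature matters, here is a minimal sketch with a hypothetical stand-in `Embedder` type (the real engine runs an ONNX model, which is not shown). Because `embed()` borrows immutably, one instance can be shared across threads through an `Arc` with no `Mutex` around it.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the embedding engine.
struct Embedder {
    dim: usize,
}

impl Embedder {
    /// Takes `&self`, so many threads can call it through a shared reference.
    fn embed(&self, texts: &[&str]) -> Vec<Vec<f32>> {
        texts
            .iter()
            // Toy "embedding": the text length repeated `dim` times.
            .map(|t| vec![t.len() as f32; self.dim])
            .collect()
    }
}

/// Embed each batch on its own thread, sharing one embedder without a lock.
fn embed_concurrently(
    embedder: Arc<Embedder>,
    batches: Vec<Vec<&'static str>>,
) -> Vec<Vec<Vec<f32>>> {
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            let embedder = Arc::clone(&embedder);
            thread::spawn(move || embedder.embed(&batch))
        })
        .collect();
    // Joining in spawn order keeps results aligned with the input batches.
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    let embedder = Arc::new(Embedder { dim: 2 });
    let out = embed_concurrently(embedder, vec![vec!["ab"], vec!["abc", "d"]]);
    println!("{out:?}");
}
```

With a `&mut self` signature, the same sharing would need an `Arc<Mutex<Embedder>>`, and every call would serialize on the lock.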

License Compatibility

The original Apache-2.0 license is fully compatible with Kreuzberg's Elastic License 2.0 (ELv2). The original copyright and attribution are preserved in the vendored code's comments.


numbers-parser Test Fixtures

Test documents derived from the numbers-parser test suite:

  • Source: https://github.com/masaccio/numbers-parser
  • License: MIT
  • Author: Jon Connell (masaccio)
  • Usage: Test documents and reference baselines only (no code copied)
  • Modifications: Fixtures downloaded directly for integration testing.
  • Location: test_documents/iwork/


yake-rust

YAKE keyword extraction algorithm vendored into kreuzberg:

  • Source: https://github.com/quesurifn/yake-rust
  • License: MIT
  • Authors: Kyle Fahey, Anton Vikstrom, Igor Strebz
  • Vendored Version: 1.0.3
  • Location: crates/kreuzberg/src/keywords/yake/
  • Purpose: YAKE (Yet Another Keyword Extractor) statistical keyword extraction

Modifications

  • Replaced segtok dependency with custom memchr-based sentence splitter (fixes #676 BacktrackLimitExceeded on large files)
  • Integrated with kreuzberg's stopwords module (64 languages vs original 34)
  • Replaced hashbrown with ahash, inlined streaming-stats and levenshtein
  • Optimized punctuation checks with byte lookup tables
  • Removed itertools dependency (manual dedup)
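
The byte lookup table idea can be sketched as follows. This is an illustration of the general technique, not the vendored code: the table is built at compile time, and the choice of ASCII punctuation as the character set is an assumption.

```rust
/// Build a 256-entry table at compile time; `true` marks ASCII punctuation bytes.
/// Assumption: the real table may cover a different set of characters.
const fn build_punct_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = (i as u8).is_ascii_punctuation();
        i += 1;
    }
    table
}

static PUNCT: [bool; 256] = build_punct_table();

/// Single array index instead of a chain of comparisons.
#[inline]
fn is_punct(byte: u8) -> bool {
    PUNCT[byte as usize]
}

/// True when the word is non-empty and consists only of punctuation bytes.
fn is_all_punct(word: &str) -> bool {
    !word.is_empty() && word.bytes().all(is_punct)
}

fn main() {
    for word in ["...", "a.", "?!"] {
        println!("{word:?} -> {}", is_all_punct(word));
    }
}
```

Indexing with a `u8` cast to `usize` can never go out of bounds on a 256-entry array, so the lookup is branch-light and panic-free.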

License Compatibility

The original MIT license is compatible with Kreuzberg's Elastic License 2.0 (ELv2).

text-splitter (inlined)

The chunking submodule crates/kreuzberg/src/chunking/text_splitter/ is a trimmed inline copy of text-splitter v0.30.1 by Benjamin Brandt. We inlined it because upstream pins tokenizers = "0.22", which conflicts with kreuzberg's direct tokenizers 0.23 dependency and pulls a duplicate copy of tokenizers into the build graph (breaking the Tokenizer: ChunkSizer bound in chunking::core).
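
A likely reason a duplicate crate copy breaks a trait bound is that Rust treats the same type from two crate versions as two unrelated types. The sketch below simulates this with two modules standing in for the two tokenizers versions; the module names and the simplified `ChunkSizer` trait are hypothetical and only illustrate the mechanism.

```rust
/// Stand-ins for two copies of the same crate at different versions.
mod tokenizers_v0_22 {
    pub struct Tokenizer;
}
mod tokenizers_v0_23 {
    pub struct Tokenizer;
}

/// Hypothetical simplification of a chunk-sizing trait.
trait ChunkSizer {
    fn size(&self, chunk: &str) -> usize;
}

// The splitter implements the trait only for the version it was built against.
impl ChunkSizer for tokenizers_v0_22::Tokenizer {
    fn size(&self, chunk: &str) -> usize {
        // Toy sizing rule: count whitespace-separated words.
        chunk.split_whitespace().count()
    }
}

/// Generic code that requires the bound, as a chunking core might.
fn chunk_size<S: ChunkSizer>(sizer: &S, chunk: &str) -> usize {
    sizer.size(chunk)
}

fn main() {
    let old = tokenizers_v0_22::Tokenizer;
    println!("{}", chunk_size(&old, "three short words"));

    let _new = tokenizers_v0_23::Tokenizer;
    // chunk_size(&_new, "...") would not compile: the bound
    // `tokenizers_v0_23::Tokenizer: ChunkSizer` is not satisfied,
    // even though both types are named `Tokenizer`.
}
```

Rebuilding the inlined splitter against the single version in the dependency graph leaves only one `Tokenizer` type, so the bound is satisfied.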

Modifications

  • Dropped the code (tree-sitter) splitter — kreuzberg has its own tree-sitter integration and does not use the upstream code splitter.
  • Dropped the tiktoken-rs sizer — unused.
  • Rebuilt against tokenizers 0.23.
  • Renamed feature gate tokenizers → chunking-tokenizers; the markdown splitter is always available because pulldown-cmark is already a non-optional kreuzberg dependency.
  • Tightened visibility on internal types to pub(crate).
  • Path rewiring: upstream crate::* paths inside the inlined module rewritten relative to the new submodule root.

License Compatibility

The MIT license is compatible with Kreuzberg's Elastic License 2.0 (ELv2). The full upstream license text is reproduced below:

MIT License

Copyright (c) 2023 Benjamin Brandt

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Test Documents from text-splitter

The following test inputs were copied from the text-splitter repository to /test_documents/text_splitter/:

  • text/romeo_and_juliet.txt — Shakespeare, public domain (Project Gutenberg)
  • text/room_with_a_view.txt — E. M. Forster, public domain (Project Gutenberg)
  • markdown/commonmark_spec.md — CommonMark spec, CC-BY-SA-4.0
  • markdown/github_flavored.md — GitHub Flavored Markdown spec, CC-BY-4.0

Last Updated: April 9, 2026
Pandoc Version Used: 3.8.3
Baseline Generation Date: December 6, 2025