Attributions
This document acknowledges the sources of test documents and baseline data used in the Kreuzberg project.
Pandoc Test Suite
Test documents and reference baseline outputs derived from the Pandoc test suite:
- Source: https://github.com/jgm/pandoc
- License: GPL-2.0-or-later
- Usage: Test documents and reference baselines only (no code copied from Pandoc)
- Attribution: John MacFarlane and Pandoc contributors
- Purpose: Baseline reference testing, used to validate that our native Rust extractors work correctly on the same documents that Pandoc processes
Test Documents from Pandoc
The following test documents were copied from the Pandoc repository to /test_documents/:
Org Mode
- org-select-tags.org - SELECT_TAGS and EXCLUDE_TAGS testing
- pandoc-tables.org - Org Mode table formats
- pandoc-writer.org - Comprehensive Pandoc test suite in Org Mode format
Typst
- typst-reader.typ - Fibonacci sequence with mathematical formulas
- undergradmath.typ - Comprehensive undergraduate mathematics document (16KB)
DocBook
- docbook-chapter.docbook - Recursive section hierarchy (7 nested levels)
- docbook-reader.docbook - Comprehensive DocBook 4.4 test suite (36KB, 1704 lines)
- docbook-xref.docbook - Cross-reference (xref) functionality testing
JATS
- jats-reader.xml - Comprehensive JATS (Z39.96) Journal Archiving test document (38KB, 1460 lines)
FictionBook
- test_documents/fictionbook/pandoc/ - 13 FictionBook test files including:
  - basic.fb2 - Basic FictionBook structure
  - images-embedded.fb2 - Embedded base64 images
  - math.fb2 - Mathematical content
  - meta.fb2 - Document metadata testing
  - reader/emphasis.fb2 - Text emphasis testing
  - reader/epigraph.fb2 - Epigraph/quote elements
  - reader/meta.fb2 - Document metadata and title info
  - reader/notes.fb2 - Footnotes/endnotes with cross-references
  - reader/poem.fb2 - Poem/verse structure
  - reader/titles.fb2 - Section titles and heading hierarchy
  - And others
OPML
- opml-reader.opml - OPML 2.0 outline structure (US states example)
- pandoc-writer.opml - Comprehensive Pandoc test suite in OPML format
Baseline Outputs Generated
For each test document listed above, three baseline outputs were generated using Pandoc 3.8.3:
- Plain Text (*_pandoc_baseline.txt) - Raw text content extraction
- JSON Metadata (*_pandoc_meta.json) - Full Pandoc AST with document structure and metadata
- Markdown (*_pandoc_markdown.md) - Markdown representation for format comparison
Total: 132 baseline files for 44 documents across 6 formats
GPL Compliance Statement
We acknowledge that Pandoc is licensed under GPL-2.0-or-later. We have:
- ✓ Used Pandoc's test documents (test data is allowed under GPL)
- ✓ Generated baseline outputs using Pandoc for comparison purposes
- ✓ NOT copied any Pandoc source code
- ✓ Implemented our extractors independently in Rust
- ✓ Used Pandoc only as a behavioral baseline for testing
Our Rust extractors are independently implemented and do not contain any GPL-licensed code from Pandoc.
Verification
Test documents and baselines can be regenerated at any time using:
./generate_pandoc_baselines.sh
This script processes all test documents and generates fresh baselines using the installed version of Pandoc.
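As a rough illustration of what regenerating the three baselines for one document involves, the Rust sketch below derives the baseline file names and prepares one pandoc invocation per output. It is not the contents of generate_pandoc_baselines.sh: the helper names, the choice to write baselines next to the input file, and the exact flags are assumptions. Only the file suffixes come from the list above, and `-t plain`, `-t json`, `-t markdown`, and `-o` are standard Pandoc options. Some formats may also need an explicit `-f` reader, which is omitted here.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// The three baseline kinds: (pandoc writer, file suffix).
/// Suffixes follow the naming scheme listed in this document.
const BASELINE_KINDS: [(&str, &str); 3] = [
    ("plain", "_pandoc_baseline.txt"),
    ("json", "_pandoc_meta.json"),
    ("markdown", "_pandoc_markdown.md"),
];

/// Builds the baseline path next to `input` (assumed layout),
/// e.g. `a/doc.org` -> `a/doc_pandoc_meta.json`.
fn baseline_path(input: &Path, suffix: &str) -> PathBuf {
    let stem = input
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or("document");
    input.with_file_name(format!("{stem}{suffix}"))
}

/// Prepares (but does not run) one pandoc invocation per baseline kind.
fn baseline_commands(input: &Path) -> Vec<Command> {
    BASELINE_KINDS
        .iter()
        .map(|&(writer, suffix)| {
            let mut cmd = Command::new("pandoc");
            cmd.arg(input)
                .arg("-t")
                .arg(writer)
                .arg("-o")
                .arg(baseline_path(input, suffix));
            cmd
        })
        .collect()
}
```

Calling `status()` on each returned command would run Pandoc and write the corresponding baseline file, provided a `pandoc` binary is on the PATH.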
docx-lite
DOCX XML parser vendored into crates/kreuzberg/src/extraction/docx/parser.rs:
- Source: https://github.com/v-lawyer/docx-lite
- License: MIT OR Apache-2.0
- Authors: V-Lawyer Team
- Version: 0.2.0 (vendored with modifications)
- Usage: DOCX text extraction parser inlined into kreuzberg core
- Modifications:
  - Fixed Paragraph::to_text() joining text runs without whitespace (#359)
  - Adapted to kreuzberg's quick-xml v0.39 and zip v7.x APIs
  - Removed file-path based APIs (only bytes/reader needed)
hwpers
Vendored HWP text extraction code from the hwpers crate:
- Source: https://github.com/Indosaram/hwpers
- License: MIT OR Apache-2.0
- Authors: HWP Parser Contributors
- Vendored Version: 0.5.0
- Location: crates/kreuzberg/src/extraction/hwp/
- Purpose: Text extraction from Korean Hangul Word Processor (.hwp) files
- Scope: Minimal subset — CFB reader, binary record parser, text extraction only
- Excluded: HWPX (XML/ZIP), writer, renderer, crypto, preview modules
paddle-ocr-rs
Vendored source code from the paddle-ocr-rs crate for PaddleOCR via ONNX Runtime integration:
- Source: https://github.com/mg-chao/paddle-ocr-rs
- Original License: Apache-2.0
- Author: mg-chao (chao@mgchao.top)
- Vendored Version: 0.6.1
- Location: crates/kreuzberg-paddle-ocr/
- Purpose: Text detection and recognition using PaddlePaddle's OCR models via ONNX Runtime
Vendored Files
The following source files were vendored from paddle-ocr-rs:
- ocr_lite.rs - Core OCR pipeline and high-level API
- db_net.rs - DBNet text detection network
- crnn_net.rs - CRNN text recognition network
- angle_net.rs - Text angle detection network
- base_net.rs - Base network trait
- ocr_utils.rs - Image preprocessing utilities
- ocr_result.rs - Result type definitions
- scale_param.rs - Scaling parameter calculations
- ocr_error.rs - Error type definitions
Modifications
The vendored code has been modified for Kreuzberg integration:
- Updated to Rust 2024 edition
- Aligned with Kreuzberg workspace dependencies
- License changed to MIT with dual copyright (original author retained)
License Compatibility
The original Apache-2.0 license is compatible with MIT relicensing. The original copyright and attribution are preserved in the vendored crate's LICENSE file.
fastembed-rs
Text embedding inference pipeline vendored into crates/kreuzberg/src/embeddings/engine.rs:
- Source: https://github.com/Anush008/fastembed-rs
- License: Apache-2.0
- Author: Anush008 and contributors
- Vendored Version: Based on 0.2.x
- Location: crates/kreuzberg/src/embeddings/engine.rs
- Purpose: ONNX-based text embedding inference with thread-safe concurrent embedding generation
Modifications
The vendored code has been modified from the original fastembed-rs:
- Changed embed() method signature from &mut self to &self for thread-safe concurrent inference without mutex contention
- Adapted to Kreuzberg's ONNX Runtime integration and error handling
- Integrated with Kreuzberg's embedding configuration and model management
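To make the effect of the signature change concrete, here is a minimal sketch of why a `&self` method allows concurrent calls without a mutex. The `Embedder` type and its toy arithmetic are hypothetical stand-ins, not the fastembed-rs or Kreuzberg engine. With `&mut self`, callers on different threads would need exclusive access (typically a `Mutex` around the engine), which serializes inference; with `&self`, a plain shared reference is enough as long as the type is `Sync`.

```rust
use std::thread;

/// Hypothetical stand-in for an embedding engine (illustration only).
struct Embedder {
    dim: usize,
}

impl Embedder {
    /// Takes `&self`, so many threads can call it through a shared reference.
    fn embed(&self, texts: &[&str]) -> Vec<Vec<f32>> {
        texts
            .iter()
            .map(|t| {
                // Toy "embedding" derived from the byte length, only to keep the sketch runnable.
                (0..self.dim).map(|i| (t.len() + i) as f32).collect()
            })
            .collect()
    }
}

/// Embeds each batch on its own scoped thread, sharing one engine without a Mutex.
/// Results are returned in the same order as the input batches.
fn embed_concurrently(engine: &Embedder, batches: &[Vec<&str>]) -> Vec<Vec<Vec<f32>>> {
    thread::scope(|scope| {
        let handles: Vec<_> = batches
            .iter()
            .map(|batch| scope.spawn(move || engine.embed(batch)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

If `embed` took `&mut self` instead, the closure passed to `scope.spawn` would not compile, because several threads cannot hold a mutable borrow of the same engine at once.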
License Compatibility
The original Apache-2.0 license is fully compatible with Kreuzberg's Elastic License 2.0 (ELv2). The original copyright and attribution are preserved in the vendored code's comments.
numbers-parser Test Fixtures
Test documents derived from the numbers-parser test suite:
- Source: https://github.com/masaccio/numbers-parser
- License: MIT
- Author: Jon Connell (masaccio)
- Usage: Test documents and reference baselines only (no code copied)
- Modifications: Fixtures downloaded directly for integration testing.
- Location: test_documents/iwork/
yake-rust
YAKE keyword extraction algorithm vendored into kreuzberg:
- Source: https://github.com/quesurifn/yake-rust
- License: MIT
- Authors: Kyle Fahey, Anton Vikstrom, Igor Strebz
- Vendored Version: 1.0.3
- Location: crates/kreuzberg/src/keywords/yake/
- Purpose: YAKE (Yet Another Keyword Extractor) statistical keyword extraction
Modifications
- Replaced segtok dependency with custom memchr-based sentence splitter (fixes #676 BacktrackLimitExceeded on large files)
- Integrated with kreuzberg's stopwords module (64 languages vs original 34)
- Replaced hashbrown with ahash, inlined streaming-stats and levenshtein
- Optimized punctuation checks with byte lookup tables
- Removed itertools dependency (manual dedup)
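The byte lookup table technique mentioned in the list above can be sketched as follows. This is a generic illustration, not the vendored code: the table name, the helper names, and the choice of ASCII punctuation as the character set are assumptions. The idea is to precompute a 256-entry table once so that each check is a single array index rather than a chain of comparisons.

```rust
/// Builds a 256-entry table at compile time: `true` for ASCII punctuation bytes.
/// (The exact punctuation set used by the vendored code is not stated; ASCII is assumed here.)
const fn build_punct_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut b = 0usize;
    while b < 256 {
        table[b] = (b as u8).is_ascii_punctuation();
        b += 1;
    }
    table
}

static PUNCT: [bool; 256] = build_punct_table();

/// Returns true if `token` is non-empty and every byte is ASCII punctuation.
fn is_punctuation(token: &str) -> bool {
    !token.is_empty() && token.bytes().all(|b| PUNCT[b as usize])
}
```

Because every `u8` value is a valid index into a 256-entry array, the lookup never goes out of bounds, and non-ASCII bytes simply map to `false`.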
License Compatibility
The original MIT license is compatible with Kreuzberg's Elastic License 2.0 (ELv2).
text-splitter (inlined)
The chunking submodule crates/kreuzberg/src/chunking/text_splitter/ is a trimmed inline copy of text-splitter v0.30.1 by Benjamin Brandt. We inlined it because upstream pins tokenizers = "0.22", which conflicts with kreuzberg's direct tokenizers 0.23 dependency and pulls a duplicate copy of tokenizers into the build graph (breaking the Tokenizer: ChunkSizer bound in chunking::core).
- Source: https://github.com/benbrandt/text-splitter @ v0.30.1
- License: MIT
- Copyright: © 2023 Benjamin Brandt benjamin.j.brandt@gmail.com
- Location: crates/kreuzberg/src/chunking/text_splitter/
Modifications
- Dropped the code (tree-sitter) splitter; kreuzberg has its own tree-sitter integration and does not use the upstream code splitter.
- Dropped the tiktoken-rs sizer (unused).
- Rebuilt against tokenizers 0.23.
- Renamed feature gate tokenizers → chunking-tokenizers; the markdown splitter is always available because pulldown-cmark is already a non-optional kreuzberg dependency.
- Tightened visibility on internal types to pub(crate).
- Path rewiring: upstream crate::* paths inside the inlined module rewritten relative to the new submodule root.
License Compatibility
The MIT license is compatible with Kreuzberg's Elastic License 2.0 (ELv2). The full upstream license text is reproduced below:
MIT License
Copyright (c) 2023 Benjamin Brandt
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Test Documents from text-splitter
The following test inputs were copied from the text-splitter repository to /test_documents/text_splitter/:
- text/romeo_and_juliet.txt - Shakespeare, public domain (Project Gutenberg)
- text/room_with_a_view.txt - E. M. Forster, public domain (Project Gutenberg)
- markdown/commonmark_spec.md - CommonMark spec, CC-BY-SA-4.0
- markdown/github_flavored.md - GitHub Flavored Markdown spec, CC-BY-4.0
Last Updated: April 9, 2026
Pandoc Version Used: 3.8.3
Baseline Generation Date: December 6, 2025