Attributions

This document acknowledges the sources of test documents and baseline data used in the Kreuzberg project.

Pandoc Test Suite

Test documents and reference baseline outputs derived from the Pandoc test suite:

Source: https://github.com/jgm/pandoc
License: GPL-2.0-or-later
Usage: Test documents and reference baselines only (no code copied from Pandoc)
Attribution: John MacFarlane and Pandoc contributors
Purpose: Baseline reference testing, used to validate that our native Rust extractors work correctly on the same documents that Pandoc processes

Test Documents from Pandoc

The following test documents were copied from the Pandoc repository to /test_documents/:

Org Mode

org-select-tags.org - SELECT_TAGS and EXCLUDE_TAGS testing
pandoc-tables.org - Org Mode table formats
pandoc-writer.org - Comprehensive Pandoc test suite in Org Mode format

Typst

typst-reader.typ - Fibonacci sequence with mathematical formulas
undergradmath.typ - Comprehensive undergraduate mathematics document (16KB)

DocBook

docbook-chapter.docbook - Recursive section hierarchy (7 nested levels)
docbook-reader.docbook - Comprehensive DocBook 4.4 test suite (36KB, 1704 lines)
docbook-xref.docbook - Cross-reference (xref) functionality testing

JATS

jats-reader.xml - Comprehensive JATS (Z39.96) Journal Archiving test document (38KB, 1460 lines)

FictionBook

test_documents/fictionbook/pandoc/ - 13 FictionBook test files including:
- basic.fb2 - Basic FictionBook structure
- images-embedded.fb2 - Embedded base64 images
- math.fb2 - Mathematical content
- meta.fb2 - Document metadata testing
- reader/emphasis.fb2 - Text emphasis testing
- reader/epigraph.fb2 - Epigraph/quote elements
- reader/meta.fb2 - Document metadata and title info
- reader/notes.fb2 - Footnotes/endnotes with cross-references
- reader/poem.fb2 - Poem/verse structure
- reader/titles.fb2 - Section titles and heading hierarchy
- And others

OPML

opml-reader.opml - OPML 2.0 outline structure (US states example)
pandoc-writer.opml - Comprehensive Pandoc test suite in OPML format

Baseline Outputs Generated

For each test document listed above, three baseline outputs were generated using Pandoc 3.8.3:

Plain Text (*_pandoc_baseline.txt) - Raw text content extraction
JSON Metadata (*_pandoc_meta.json) - Full Pandoc AST with document structure and metadata
Markdown (*_pandoc_markdown.md) - Markdown representation for format comparison

Total: 132 baseline files for 44 documents across 6 formats
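
The naming scheme and the total can be sketched as follows. This is a minimal illustration, not code from the repository: the helper names `baseline_names` and `total_baselines` are hypothetical, and it assumes each suffix is appended to the document's file stem, which the actual script may or may not do.

```rust
use std::path::Path;

/// Hypothetical helper: derive the three baseline file names for one test document.
/// Assumption: the suffix is appended to the file stem (the name without its extension).
fn baseline_names(document: &str) -> [String; 3] {
    let stem = Path::new(document)
        .file_stem()
        .and_then(|s| s.to_str())
        .unwrap_or(document);
    [
        format!("{stem}_pandoc_baseline.txt"), // plain text
        format!("{stem}_pandoc_meta.json"),    // Pandoc AST as JSON
        format!("{stem}_pandoc_markdown.md"),  // Markdown
    ]
}

/// Three baselines are generated per document.
fn total_baselines(documents: usize) -> usize {
    documents * 3
}
```

Under these assumptions, `total_baselines(44)` gives the 132 files stated above, and `baseline_names("jats-reader.xml")` starts with `jats-reader_pandoc_baseline.txt`.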

GPL Compliance Statement

We acknowledge that Pandoc is licensed under GPL-2.0-or-later. We have:

✓ Used Pandoc's test documents (test data is allowed under GPL)
✓ Generated baseline outputs using Pandoc for comparison purposes
✓ NOT copied any Pandoc source code
✓ Implemented our extractors independently in Rust
✓ Used Pandoc only as a behavioral baseline for testing

Our Rust extractors are independently implemented and do not contain any GPL-licensed code from Pandoc.

Verification

Baselines can be regenerated from the test documents at any time using:

./generate_pandoc_baselines.sh

This script processes all test documents and generates fresh baselines using the installed version of Pandoc.
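
For readers who want to reproduce a single baseline by hand, one Pandoc invocation could be assembled roughly as below. This is a sketch under assumptions: the function `pandoc_command` is hypothetical, and the script's real flags are not shown in this document. It relies only on Pandoc's standard `-f` (reader), `-t` (writer), and `-o` (output file) options.

```rust
use std::process::Command;

/// Hypothetical helper: build (but do not run) one pandoc invocation.
/// `reader` is a Pandoc input format such as "org" or "docbook";
/// `writer` is an output format such as "plain", "json", or "markdown".
fn pandoc_command(input: &str, reader: &str, writer: &str, output: &str) -> Command {
    let mut cmd = Command::new("pandoc");
    cmd.args(["-f", reader, "-t", writer, "-o", output, input]);
    cmd
}
```

Calling `.status()` on the returned command would actually run Pandoc, which must be installed and on the `PATH`; the output then depends on the installed Pandoc version, as noted above.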

docx-lite

DOCX XML parser vendored into crates/kreuzberg/src/extraction/docx/parser.rs:

Source: https://github.com/v-lawyer/docx-lite
License: MIT OR Apache-2.0
Authors: V-Lawyer Team
Version: 0.2.0 (vendored with modifications)
Usage: DOCX text extraction parser inlined into kreuzberg core
Modifications:
- Fixed Paragraph::to_text() joining text runs without whitespace (#359)
- Adapted to kreuzberg's quick-xml v0.39 and zip v7.x APIs
- Removed file-path based APIs (only bytes/reader needed)

hwpers

Vendored HWP text extraction code from the hwpers crate:

Source: https://github.com/Indosaram/hwpers
License: MIT OR Apache-2.0
Authors: HWP Parser Contributors
Vendored Version: 0.5.0
Location: crates/kreuzberg/src/extraction/hwp/
Purpose: Text extraction from Korean Hangul Word Processor (.hwp) files
Scope: Minimal subset — CFB reader, binary record parser, text extraction only
Excluded: HWPX (XML/ZIP), writer, renderer, crypto, preview modules

paddle-ocr-rs

Source code vendored from the paddle-ocr-rs crate, which runs PaddleOCR models via ONNX Runtime:

Source: https://github.com/mg-chao/paddle-ocr-rs
Original License: Apache-2.0
Author: mg-chao (chao@mgchao.top)
Vendored Version: 0.6.1
Location: crates/kreuzberg-paddle-ocr/
Purpose: Text detection and recognition using PaddlePaddle's OCR models via ONNX Runtime

Vendored Files

The following source files were vendored from paddle-ocr-rs:

ocr_lite.rs - Core OCR pipeline and high-level API
db_net.rs - DBNet text detection network
crnn_net.rs - CRNN text recognition network
angle_net.rs - Text angle detection network
base_net.rs - Base network trait
ocr_utils.rs - Image preprocessing utilities
ocr_result.rs - Result type definitions
scale_param.rs - Scaling parameter calculations
ocr_error.rs - Error type definitions

Modifications

The vendored code has been modified for Kreuzberg integration:

Updated to Rust 2024 edition
Aligned with Kreuzberg workspace dependencies
License changed to MIT with dual copyright (original author retained)

License Compatibility

The original Apache-2.0 license is compatible with MIT relicensing. The original copyright and attribution are preserved in the vendored crate's LICENSE file.

fastembed-rs

Text embedding inference pipeline vendored into crates/kreuzberg/src/embeddings/engine.rs:

Source: https://github.com/Anush008/fastembed-rs
License: Apache-2.0
Author: Anush008 and contributors
Vendored Version: Based on 0.2.x
Location: crates/kreuzberg/src/embeddings/engine.rs
Purpose: ONNX-based text embedding inference with thread-safe concurrent embedding generation

Modifications

The vendored code has been modified from the original fastembed-rs:

Changed embed() method signature from &mut self to &self for thread-safe concurrent inference without mutex contention
Adapted to Kreuzberg's ONNX Runtime integration and error handling
Integrated with Kreuzberg's embedding configuration and model management
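
Why the `&self` signature matters can be illustrated with a toy example. The type `ToyEmbedder` and the function `embed_concurrently` are hypothetical stand-ins, not Kreuzberg's or fastembed-rs's actual API, and the "embedding" is a placeholder computation rather than ONNX inference. The point is only that a method taking `&self` can be called from several threads through one shared reference, with no `Mutex` around the engine.

```rust
use std::thread;

/// Hypothetical stand-in for an embedding engine.
struct ToyEmbedder {
    dim: usize, // assumed to be greater than zero
}

impl ToyEmbedder {
    /// Takes `&self`, so callers only need a shared reference.
    fn embed(&self, text: &str) -> Vec<f32> {
        // Placeholder computation: add each byte value into one of `dim` slots.
        let mut v = vec![0.0f32; self.dim];
        for (i, b) in text.bytes().enumerate() {
            v[i % self.dim] += b as f32;
        }
        v
    }
}

/// Embed every text on its own scoped thread, sharing one engine by reference.
/// Results come back in input order because handles are joined in order.
fn embed_concurrently(engine: &ToyEmbedder, texts: &[&str]) -> Vec<Vec<f32>> {
    thread::scope(|s| {
        let handles: Vec<_> = texts
            .iter()
            .map(|t| s.spawn(move || engine.embed(t)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Had `embed` required `&mut self`, the borrow checker would reject sharing `engine` across these threads, and callers would have to serialize access, for example behind a mutex.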

License Compatibility

The original Apache-2.0 license is fully compatible with Kreuzberg's Elastic License 2.0 (ELv2). The original copyright and attribution are preserved in the vendored code's comments.

numbers-parser Test Fixtures

Test documents derived from the numbers-parser test suite:

Source: https://github.com/masaccio/numbers-parser
License: MIT
Author: Jon Connell (masaccio)
Usage: Test documents and reference baselines only (no code copied)
Modifications: Fixtures downloaded directly for integration testing.
Location: test_documents/iwork/

yake-rust

YAKE keyword extraction algorithm vendored into kreuzberg:

Source: https://github.com/quesurifn/yake-rust
License: MIT
Authors: Kyle Fahey, Anton Vikstrom, Igor Strebz
Vendored Version: 1.0.3
Location: crates/kreuzberg/src/keywords/yake/
Purpose: YAKE (Yet Another Keyword Extractor) statistical keyword extraction

Modifications

Replaced segtok dependency with custom memchr-based sentence splitter (fixes #676 BacktrackLimitExceeded on large files)
Integrated with kreuzberg's stopwords module (64 languages vs original 34)
Replaced hashbrown with ahash, inlined streaming-stats and levenshtein
Optimized punctuation checks with byte lookup tables
Removed itertools dependency (manual dedup)
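
The byte lookup table technique mentioned above can be sketched as follows. This is an illustration of the general idea, not the vendored code: the names `PUNCT`, `is_punct_byte`, and `is_all_punct` are hypothetical, and the table here covers ASCII punctuation only, which may differ from the set the vendored module checks.

```rust
/// Build a 256-entry table at compile time: `true` for ASCII punctuation bytes.
const fn build_punct_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = (i as u8).is_ascii_punctuation();
        i += 1;
    }
    table
}

static PUNCT: [bool; 256] = build_punct_table();

/// A single indexed load replaces a chain of comparisons or a regex match.
fn is_punct_byte(b: u8) -> bool {
    PUNCT[b as usize]
}

/// True when the string is non-empty and consists only of ASCII punctuation.
/// Bytes of multi-byte UTF-8 characters are never marked as punctuation here.
fn is_all_punct(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(is_punct_byte)
}
```

Because any `u8` value is a valid index into a 256-entry array, the lookup cannot go out of bounds.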

License Compatibility

The original MIT license is compatible with Kreuzberg's Elastic License 2.0 (ELv2).

text-splitter (inlined)

The chunking submodule crates/kreuzberg/src/chunking/text_splitter/ is a trimmed inline copy of text-splitter v0.30.1 by Benjamin Brandt. We inlined it because upstream pins tokenizers = "0.22", which conflicts with kreuzberg's direct tokenizers 0.23 dependency and pulls a duplicate copy of tokenizers into the build graph (breaking the Tokenizer: ChunkSizer bound in chunking::core).

Source: https://github.com/benbrandt/text-splitter @ v0.30.1
License: MIT
Location: crates/kreuzberg/src/chunking/text_splitter/

Modifications

Dropped the code (tree-sitter) splitter — kreuzberg has its own tree-sitter integration and does not use the upstream code splitter.
Dropped the tiktoken-rs sizer — unused.
Rebuilt against tokenizers 0.23.
Renamed feature gate tokenizers → chunking-tokenizers; the markdown splitter is always available because pulldown-cmark is already a non-optional kreuzberg dependency.
Tightened visibility on internal types to pub(crate).
Path rewiring: upstream crate::* paths inside the inlined module rewritten relative to the new submodule root.

License Compatibility

The MIT license is compatible with Kreuzberg's Elastic License 2.0 (ELv2). The full upstream license text is reproduced below:

MIT License

Copyright (c) 2023 Benjamin Brandt

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Test Documents from text-splitter

The following test inputs were copied from the text-splitter repository to /test_documents/text_splitter/:

text/romeo_and_juliet.txt — Shakespeare, public domain (Project Gutenberg)
text/room_with_a_view.txt — E. M. Forster, public domain (Project Gutenberg)
markdown/commonmark_spec.md — CommonMark spec, CC-BY-SA-4.0
markdown/github_flavored.md — GitHub Flavored Markdown spec, CC-BY-4.0

Last Updated: April 9, 2026
Pandoc Version Used: 3.8.3
Baseline Generation Date: December 6, 2025
