1.2 KiB
1.2 KiB
Ground Truth Generation
Pandoc Commands
pandoc <source_file> -t gfm --wrap=none -o <gt_file.md>
pandoc <source_file> -t plain --wrap=none -o <gt_file.txt>
Artifact Removal
sed -i '' 's/ {#[^}]*}//g' "$file" # Remove {#id} attributes
sed -i '' 's/ {[^}]*}//g' "$file" # Remove {.class} attributes
sed -i '' '/^:::/d' "$file" # Remove fenced div markers
sed -i '' 's/\\\$/$/g' "$file" # Unescape dollar signs
sed -i '' "s/\\\\'/'/g" "$file" # Unescape quotes
Cleanup Rules
- Convert ALL HTML to markdown equivalents where possible
- For colspan/rowspan, put content in first cell, leave others empty
- Remove
<!-- -->comments - Strip
<u>,<sup>,<sub>tags (keep text content) - Convert
<img>to - Collapse 3+ consecutive blank lines to 2
- Never use our own extractor output as GT
Fixture JSON Structure
{
"document": "relative/path/to/source.ext",
"file_type": "docx",
"file_size": 12345,
"expected_frameworks": ["kreuzberg"],
"metadata": { "description": "...", "source": "pandoc-generated" },
"ground_truth": {
"text_file": "relative/path/to/gt.txt",
"markdown_file": "relative/path/to/gt.md",
"source": "pandoc"
}
}