# Ground Truth Generation

## Pandoc Commands

```bash
pandoc -t gfm --wrap=none -o
pandoc -t plain --wrap=none -o
```

## Artifact Removal

```bash
sed -i '' 's/ {#[^}]*}//g' "$file"   # Remove {#id} attributes
sed -i '' 's/ {[^}]*}//g' "$file"    # Remove {.class} attributes
sed -i '' '/^:::/d' "$file"          # Remove fenced div markers
sed -i '' 's/\\\$/$/g' "$file"       # Unescape dollar signs
sed -i '' "s/\\\\'/'/g" "$file"      # Unescape quotes
```

## Cleanup Rules

1. Convert ALL HTML to markdown equivalents where possible
2. For colspan/rowspan, put content in first cell, leave others empty
3. Remove `` comments
4. Strip ``, ``, `` tags (keep text content)
5. Convert `` to `![alt](src)`
6. Collapse 3+ consecutive blank lines to 2
7. Never use our own extractor output as GT

## Fixture JSON Structure

```json
{
  "document": "relative/path/to/source.ext",
  "file_type": "docx",
  "file_size": 12345,
  "expected_frameworks": ["kreuzberg"],
  "metadata": {
    "description": "...",
    "source": "pandoc-generated"
  },
  "ground_truth": {
    "text_file": "relative/path/to/gt.txt",
    "markdown_file": "relative/path/to/gt.md",
    "source": "pandoc"
  }
}
```
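
A fixture in the shape shown above can be sanity-checked before it is committed. The following is a minimal Python sketch, not part of any existing tooling: the helper name `check_fixture` is hypothetical, and treating every key from the example as required is an assumption, since the document does not say which keys are optional.

```python
import json

# Keys copied from the fixture example. Treating all of them as required
# is an assumption; the document does not say which are optional.
REQUIRED_TOP_LEVEL = (
    "document",
    "file_type",
    "file_size",
    "expected_frameworks",
    "metadata",
    "ground_truth",
)
REQUIRED_GROUND_TRUTH = ("text_file", "markdown_file", "source")


def check_fixture(fixture):
    """Return a list of problems found in a fixture dict (empty if none)."""
    problems = []

    # Report every top-level key that is absent.
    for key in REQUIRED_TOP_LEVEL:
        if key not in fixture:
            problems.append(f"missing key: {key}")

    # The ground_truth entry must be an object with its own required keys.
    ground_truth = fixture.get("ground_truth", {})
    if not isinstance(ground_truth, dict):
        problems.append("ground_truth must be an object")
    else:
        for key in REQUIRED_GROUND_TRUTH:
            if key not in ground_truth:
                problems.append(f"missing key: ground_truth.{key}")

    # Basic type checks, applied only when the key is present.
    if "file_size" in fixture and not isinstance(fixture["file_size"], int):
        problems.append("file_size must be an integer")
    if "expected_frameworks" in fixture and not isinstance(
        fixture["expected_frameworks"], list
    ):
        problems.append("expected_frameworks must be a list")

    return problems


# The example fixture from this section, parsed from its JSON text.
example = json.loads("""
{
  "document": "relative/path/to/source.ext",
  "file_type": "docx",
  "file_size": 12345,
  "expected_frameworks": ["kreuzberg"],
  "metadata": {"description": "...", "source": "pandoc-generated"},
  "ground_truth": {
    "text_file": "relative/path/to/gt.txt",
    "markdown_file": "relative/path/to/gt.md",
    "source": "pandoc"
  }
}
""")

print(check_fixture(example))  # prints []
```

The function returns a list of messages instead of raising on the first problem, so a single run reports everything wrong with a fixture.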