This commit is contained in:
45
.ai-rulez/ground-truth-generation.md
Normal file
45
.ai-rulez/ground-truth-generation.md
Normal file
@@ -0,0 +1,45 @@
|
||||
# Ground Truth Generation
|
||||
|
||||
## Pandoc Commands
|
||||
|
||||
```bash
|
||||
pandoc <source_file> -t gfm --wrap=none -o <gt_file.md>
|
||||
pandoc <source_file> -t plain --wrap=none -o <gt_file.txt>
|
||||
```
|
||||
|
||||
## Artifact Removal
|
||||
|
||||
```bash
|
||||
sed -i '' 's/ {#[^}]*}//g' "$file" # Remove {#id} attributes
|
||||
sed -i '' 's/ {[^}]*}//g' "$file" # Remove {.class} attributes
|
||||
sed -i '' '/^:::/d' "$file" # Remove fenced div markers
|
||||
sed -i '' 's/\\\$/$/g' "$file" # Unescape dollar signs
|
||||
sed -i '' "s/\\\\'/'/g" "$file" # Unescape quotes
|
||||
```
|
||||
|
||||
## Cleanup Rules
|
||||
|
||||
1. Convert ALL HTML to markdown equivalents where possible
|
||||
2. For colspan/rowspan, put content in first cell, leave others empty
|
||||
3. Remove `<!-- -->` comments
|
||||
4. Strip `<u>`, `<sup>`, `<sub>` tags (keep text content)
|
||||
5. Convert `<img>` to ``
|
||||
6. Collapse 3+ consecutive blank lines to 2
|
||||
7. Never use our own extractor output as GT
|
||||
|
||||
## Fixture JSON Structure
|
||||
|
||||
```json
|
||||
{
|
||||
"document": "relative/path/to/source.ext",
|
||||
"file_type": "docx",
|
||||
"file_size": 12345,
|
||||
"expected_frameworks": ["kreuzberg"],
|
||||
"metadata": { "description": "...", "source": "pandoc-generated" },
|
||||
"ground_truth": {
|
||||
"text_file": "relative/path/to/gt.txt",
|
||||
"markdown_file": "relative/path/to/gt.md",
|
||||
"source": "pandoc"
|
||||
}
|
||||
}
|
||||
```
|
||||
Reference in New Issue
Block a user