System architecture for synthetic document generation with pixel-perfect annotations
The Synthetic Document Dataset (SDD) generates realistic documents with precise ground truth through a multi-stage pipeline. HTML is an intermediate representation—the final outputs are rasterized document images with pixel-accurate bounding box annotations.
This approach ensures documents look authentic while providing exact coordinates for every text element, enabling training of document understanding models with perfect supervision.
HTML is a means, not an end. We use HTML + CSS as an intermediate representation because a browser can both render it with real fonts and layout and report the exact position of every element.
Jinja2 templates define document structure with CSS styling. Faker generates realistic data (names, addresses, amounts). RDFa attributes add semantic markup (DoCO-FD + Schema.org).
Jinja2: HTML templates with variable substitution for dynamic content generation
Faker: realistic synthetic data for names, companies, addresses, financials, medical codes
RDFa: annotations linking visual elements to ontological types
Output: HTML document with data-label attributes on every annotatable element
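As a rough illustration of what Stage 1 emits, the sketch below fills a small template in which every field carries a data-label attribute. It deliberately uses only the Python standard library: `string.Template` stands in for Jinja2 and a seeded random generator stands in for Faker, so the template, the field names, and the value pools are all hypothetical rather than taken from the pipeline.

```python
import random
from html import escape
from string import Template

# Hypothetical invoice template; in the pipeline this role is played by Jinja2.
INVOICE_TEMPLATE = Template(
    '<div class="invoice">\n'
    '  <span data-label="invoice_number">$invoice_number</span>\n'
    '  <span data-label="customer_name">$customer_name</span>\n'
    '  <span data-label="total_amount">$total_amount</span>\n'
    '</div>'
)


def fake_invoice_fields(rng):
    """Stand-in for Faker: draw plausible values from small fixed pools."""
    names = ["Ada Moreno", "Li Wei", "Tom O'Brien"]
    return {
        "invoice_number": f"INV-2024-{rng.randint(1, 999):03d}",
        "customer_name": rng.choice(names),
        "total_amount": f"{rng.uniform(10, 5000):.2f}",
    }


def render_invoice(fields):
    """Substitute field values, escaping them so text cannot break the markup."""
    return INVOICE_TEMPLATE.substitute({k: escape(v) for k, v in fields.items()})


rng = random.Random(0)  # seeded so a given document can be regenerated
html_doc = render_invoice(fake_invoice_fields(rng))
print(html_doc)
```

Seeding the generator per document is one way to keep generation reproducible, which helps when an annotation needs to be traced back to the data that produced it.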
Playwright loads HTML in headless Chromium and executes JavaScript to extract bounding box coordinates. This is the critical step that makes pixel-perfect ground truth possible.
Render document in headless browser with all CSS applied
JavaScript walks the DOM and calls getBoundingClientRect() on each data-label element
Capture [x1, y1, x2, y2] in pixels relative to the page (the viewport-relative rectangle plus the scroll offset) and map each box to its entity type
Output: coordinate mapping of element_id → [x1, y1, x2, y2, entity_type, text_content]
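The extraction step can be sketched in two parts: a JavaScript snippet that collects one record per data-label element, and a Python function that turns those records into the mapping described above. The snippet is held as a string and not executed here; with Playwright's sync API it could be passed to `page.evaluate(EXTRACT_JS)`. The record keys, the fallback element ids, and the rule that the entity type is the upper-cased label are assumptions made for illustration.

```python
# JavaScript to run inside the rendered page (shown as a string, not executed here).
# getBoundingClientRect() is viewport-relative, so scroll offsets are added.
EXTRACT_JS = """
() => Array.from(document.querySelectorAll('[data-label]')).map((el, i) => {
  const r = el.getBoundingClientRect();
  return {
    id: el.id || `el_${i}`,
    label: el.dataset.label,
    text: el.textContent.trim(),
    x: r.left + window.scrollX,
    y: r.top + window.scrollY,
    width: r.width,
    height: r.height,
  };
})
"""


def to_coordinate_mapping(records):
    """Build element_id -> [x1, y1, x2, y2, entity_type, text_content]."""
    mapping = {}
    for rec in records:
        x1, y1 = rec["x"], rec["y"]
        x2, y2 = x1 + rec["width"], y1 + rec["height"]
        # Assumption: the entity type is simply the upper-cased data-label value.
        entity_type = rec["label"].upper()
        mapping[rec["id"]] = [x1, y1, x2, y2, entity_type, rec["text"]]
    return mapping


# Hand-written record in the shape the snippet above would return.
sample = [{"id": "el_0", "label": "invoice_number", "text": "INV-2024-001",
           "x": 180.0, "y": 245.0, "width": 240.0, "height": 45.0}]
print(to_coordinate_mapping(sample))
```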
Paged.js handles CSS pagination (@page rules, page breaks). WeasyPrint renders paginated HTML to PDF. This produces print-ready documents with realistic page layouts.
Paged.js: client-side pagination library that polyfills the CSS Paged Media specification
WeasyPrint: converts HTML/CSS to PDF with proper font rendering and layout
Output: Multi-page PDF document (preserved as intermediate artifact)
pdf2image converts PDF pages to high-resolution images using poppler-utils. The pipeline renders at 300 DPI by default (pdf2image's own default is 200 DPI), which yields document images sharp enough for OCR and ML training.
Resolution: 300 DPI (default) produces 2480×3508px for A4 pages
Format: JPEG or PNG
Quality: Suitable for OCR and document understanding models
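The A4 pixel dimensions quoted above follow directly from the page size and the DPI. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
MM_PER_INCH = 25.4


def page_pixels(width_mm, height_mm, dpi=300):
    """Pixel dimensions of a page of the given physical size rasterized at dpi."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


print(page_pixels(210, 297))       # A4 at 300 DPI -> (2480, 3508)
print(page_pixels(210, 297, 150))  # A4 at 150 DPI -> (1240, 1754)
```

The rasterizer's actual output can differ from this by a pixel depending on how the PDF page box is rounded, so it is safer to read the final width and height from the image itself than to assume them.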
The key innovation: coordinates extracted from Stage 2 (HTML) are scaled to match Stage 4 (Image) dimensions.
Because the same pipeline controls both coordinate extraction and rendering, each bounding box can be mapped onto the final image with one scale factor per axis: the image dimension in pixels divided by the corresponding HTML viewport dimension. The boxes align with the image only when the page layout at extraction time matches the layout of the rendered PDF page.
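The scaling step can be sketched as follows. The function name and the rounding to whole pixels are assumptions; the only thing taken from the description is that the factor is the ratio of image dimensions to viewport dimensions.

```python
def scale_box(box, viewport_size, image_size):
    """Map [x1, y1, x2, y2] from viewport (CSS pixel) space to image pixels."""
    viewport_w, viewport_h = viewport_size
    image_w, image_h = image_size
    sx = image_w / viewport_w  # horizontal scale factor
    sy = image_h / viewport_h  # vertical scale factor
    x1, y1, x2, y2 = box
    return [round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy)]


# An 800x1000 viewport rendered to a 2400x3000 image scales by 3 on both axes.
print(scale_box([10, 20, 110, 40], (800, 1000), (2400, 3000)))  # [30, 60, 330, 120]
```

Keeping separate factors for x and y avoids a systematic drift when the viewport and the image do not share exactly the same aspect ratio.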
| File | Format | Description |
|---|---|---|
| {id}.jpg | JPEG/PNG | Rasterized document image (300 DPI) |
| {id}.json | JSON | Bounding box annotations with entity types |
| {id}.ttl | Turtle | RDF graph with DoCO-FD + Schema.org semantics |
| {id}.jsonld | JSON-LD | Linked Data format for web publishing |
| {id}.pdf (optional) | PDF | Intermediate PDF artifact |
Each document element with a data-label attribute becomes an annotation:
{
  "document_id": "INV_000001",
  "template_type": "invoice",
  "width": 2480,
  "height": 3508,
  "annotations": [
    {
      "label": "invoice_number",
      "text_content": "INV-2024-001",
      "box_2d": [180, 245, 420, 290],
      "entity_type": "INVOICE_NUMBER",
      "generic_type": "IDENTIFIER"
    }
  ]
}
box_2d format: [x1, y1, x2, y2] in pixels, 0-indexed from top-left
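Given this format, a consumer can sanity-check an annotation file before training on it. The sketch below is one possible check, not part of the pipeline: it verifies that each box is non-empty and lies inside the image. The function name and the sample annotation values are hypothetical.

```python
import json


def validate_annotations(doc):
    """Return a list of human-readable problems found in one annotation file."""
    problems = []
    width, height = doc["width"], doc["height"]
    for i, ann in enumerate(doc["annotations"]):
        x1, y1, x2, y2 = ann["box_2d"]
        if not (x1 < x2 and y1 < y2):
            problems.append(f"annotation {i}: empty or inverted box")
        if x1 < 0 or y1 < 0 or x2 > width or y2 > height:
            problems.append(f"annotation {i}: box outside {width}x{height} image")
    return problems


# Hypothetical annotation file contents, following the schema shown above.
sample = json.loads("""
{
  "document_id": "INV_000001",
  "width": 2480,
  "height": 3508,
  "annotations": [
    {"label": "total_amount", "box_2d": [1800, 3000, 2200, 3060]},
    {"label": "footer_note", "box_2d": [100, 3490, 900, 3600]}
  ]
}
""")
for problem in validate_annotations(sample):
    print(problem)
```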
Browser-reported element positions guarantee accurate bounding boxes, unlike heuristics or manual annotation
Actual browser rendering with real fonts, spacing, and layout—not synthetic approximations
Every annotation has entity type, generic type, and RDF semantic markup
Generate thousands of unique documents with automatic ground truth—no manual labeling
Pagination produces realistic multi-page documents with per-page annotations
Uses web standards (HTML, CSS, RDFa) with established tooling
| Approach | GT Accuracy | Doc Realism | Scalability |
|---|---|---|---|
| SDD Pipeline | ✓ Perfect | ✓ High | ✓ Unlimited |
| Manual Annotation | ~95% (human error) | ✓ Real documents | Expensive, slow |
| Template + PIL/Pillow | ✓ Perfect | ✗ Unrealistic fonts/layout | ✓ Fast |
| GAN Generation | ✗ No GT | ~Realistic | ✓ Fast |
Bottom Line: This pipeline uniquely combines the realism of web rendering with the precision of programmatic coordinate extraction, producing datasets suitable for training production-grade document understanding models.