Template & Ground Truth Pipeline

System architecture for synthetic document generation with pixel-perfect annotations

System Overview

The Synthetic Document Dataset (SDD) pipeline generates realistic documents with precise ground truth in several stages. HTML is only an intermediate representation: the final outputs are rasterized document images with pixel-accurate bounding box annotations.

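Because a real browser renders each document, the result looks authentic, and because the browser also reports the position of every text element, each annotation carries exact coordinates. Document understanding models can therefore be trained on automatically generated labels with no manual annotation.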

Why This Architecture?

HTML is a means, not an end. We use HTML + CSS as an intermediate representation because it provides:

Declarative layout - Complex documents defined with semantic markup
Browser-grade rendering - Realistic typography, spacing, and visual appearance
DOM access - JavaScript can query element positions for exact coordinates
Standard tooling - Well-established PDF generation and rasterization pipelines

Pipeline Stages

1 Template Engine

Jinja2 templates define document structure with CSS styling. Faker generates realistic data (names, addresses, amounts). RDFa attributes add semantic markup (DoCO-FD + Schema.org).

Jinja2 Templates

HTML templates with variable substitution for dynamic content generation

Technology: Jinja2

Data Generation

Realistic synthetic data for names, companies, addresses, financials, medical codes

Technology: Faker

Semantic Markup

RDFa annotations linking visual elements to ontological types

Technology: RDFa · DoCO-FD · Schema.org

Output: HTML document with data-label attributes on every annotatable element
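The templating step can be sketched as follows. This is a minimal illustration, not the project's actual template: the markup, the field names, and the `render_invoice` helper are hypothetical, and the record is hard-coded where the pipeline would draw values from Faker. RDFa attributes are omitted for brevity.

```python
from jinja2 import Environment

# Hypothetical template fragment: every annotatable element carries a
# data-label attribute so a later stage can locate it in the DOM.
TEMPLATE = """\
<div class="invoice">
  <span data-label="invoice_number">{{ invoice_number }}</span>
  <span data-label="customer_name">{{ customer_name }}</span>
  <span data-label="total_amount">{{ total_amount }}</span>
</div>
"""


def render_invoice(record: dict) -> str:
    """Render one document; autoescape keeps generated values HTML-safe."""
    env = Environment(autoescape=True)
    return env.from_string(TEMPLATE).render(**record)


# Hard-coded stand-in for Faker-generated data.
record = {
    "invoice_number": "INV-2024-001",
    "customer_name": "A & B Ltd",
    "total_amount": "1,250.00",
}
html = render_invoice(record)
```

Enabling autoescape matters here because generated names and addresses may contain characters such as `&` that would otherwise produce invalid markup.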

↓

2 Browser Engine (Coordinate Extraction)

Playwright loads HTML in headless Chromium and executes JavaScript to extract bounding box coordinates. This is the critical step that makes pixel-perfect ground truth possible.

🌐 Load HTML

Render document in headless browser with all CSS applied

📐 Query Element Positions

JavaScript walks DOM, queries getBoundingClientRect() for each data-label element

📊 Store Coordinates

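Capture [x1, y1, x2, y2] in CSS pixels relative to the page (getBoundingClientRect() returns viewport-relative values, so scroll offsets must be added), then map each box to its entity type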

Output: Coordinate mapping: element_id → [x1, y1, x2, y2, entity_type, text_content]
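A rough sketch of the extraction step is shown below. The JavaScript is held in a Python string that could be passed to Playwright's `page.evaluate()`; it is not executed here, so the example runs without a browser. The `rects_to_boxes` helper, the `el_0`-style element ids, and the rule that derives the entity type by uppercasing the label are all assumptions made for illustration.

```python
# JavaScript intended for Playwright's page.evaluate(). It collects one
# record per [data-label] element and adds the scroll offsets so the
# coordinates are relative to the page rather than the viewport.
EXTRACT_JS = """
() => Array.from(document.querySelectorAll('[data-label]')).map(el => {
  const r = el.getBoundingClientRect();
  return {
    label: el.dataset.label,
    text: el.textContent.trim(),
    x: r.x + window.scrollX,
    y: r.y + window.scrollY,
    width: r.width,
    height: r.height,
  };
})
"""


def rects_to_boxes(rects: list) -> dict:
    """Convert browser rects (x, y, width, height) into the mapping
    element_id -> [x1, y1, x2, y2, entity_type, text_content]."""
    boxes = {}
    for i, r in enumerate(rects):
        x1, y1 = r["x"], r["y"]
        x2, y2 = x1 + r["width"], y1 + r["height"]
        entity_type = r["label"].upper()  # assumed naming rule
        boxes[f"el_{i}"] = [x1, y1, x2, y2, entity_type, r["text"]]
    return boxes


# Stand-in for the list that page.evaluate(EXTRACT_JS) would return.
sample = [{"label": "invoice_number", "text": "INV-2024-001",
           "x": 10.0, "y": 20.0, "width": 30.0, "height": 5.0}]
boxes = rects_to_boxes(sample)
```

With a live page, the `sample` list would be replaced by the result of `page.evaluate(EXTRACT_JS)`.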

↓

3 PDF Generation (Pagination)

Paged.js handles CSS pagination (@page rules, page breaks). WeasyPrint renders paginated HTML to PDF. This produces print-ready documents with realistic page layouts.

Paged.js

Client-side pagination library that polyfills CSS Paged Media specification

Technology: JavaScript

WeasyPrint

Converts HTML/CSS to PDF with proper font rendering and layout

Technology: Python

Output: Multi-page PDF document (preserved as intermediate artifact)
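The PDF step might look roughly like the sketch below, which covers only the WeasyPrint call and a CSS `@page` rule. The helper names `page_css` and `render_pdf`, the 20 mm margin, and the A4 default are assumptions; the Paged.js step is not shown. WeasyPrint is imported lazily so the pure helper can be used without it installed.

```python
def page_css(size: str = "A4", margin_mm: int = 20) -> str:
    """Build a CSS @page rule that fixes the page size and margins."""
    return f"@page {{ size: {size}; margin: {margin_mm}mm; }}"


def render_pdf(html: str, out_path: str, size: str = "A4") -> None:
    """Render an HTML string to a PDF file using WeasyPrint."""
    # Imported here so that page_css() works without WeasyPrint installed.
    from weasyprint import CSS, HTML

    stylesheet = CSS(string=page_css(size))
    HTML(string=html).write_pdf(out_path, stylesheets=[stylesheet])
```

For example, `render_pdf(html, "INV_000001.pdf")` would write a paginated A4 PDF for an HTML string produced by the template stage.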

↓

4 Rasterization (Image Generation)

pdf2image converts PDF pages to high-resolution images using poppler-utils. Default 300 DPI produces publication-quality document images suitable for ML training.

Resolution: 300 DPI (default) produces 2480×3508px for A4 pages
Format: JPEG or PNG
Quality: Suitable for OCR and document understanding models
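The pixel dimensions quoted above follow directly from the page size and DPI. The sketch below reproduces that arithmetic and wraps pdf2image's `convert_from_path`; the helper names are hypothetical, and the rasterizer's actual output may differ from the computed size by a pixel depending on how it rounds.

```python
MM_PER_INCH = 25.4


def page_pixels(width_mm: float, height_mm: float, dpi: int = 300) -> tuple:
    """Expected image size in pixels for a page of the given physical size."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def rasterize(pdf_path: str, dpi: int = 300) -> list:
    """Convert each PDF page to a PIL image (requires poppler-utils)."""
    # Imported lazily so page_pixels() works without pdf2image installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=dpi)


a4_at_300 = page_pixels(210, 297)  # A4 is 210 mm x 297 mm
```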

↓

Stage 5: Ground Truth Alignment

The key innovation: coordinates extracted from Stage 2 (HTML) are scaled to match Stage 4 (Image) dimensions.

Because the same pipeline controls both coordinate extraction and rendering, every bounding box can be mapped deterministically onto the final image. The scaling factor is the ratio of the final image's pixel dimensions to the HTML viewport's dimensions.
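As a minimal sketch of the scaling step, assume a single-page document whose viewport maps onto the whole image. The `scale_box` helper and the choice to round to the nearest pixel are assumptions; a multi-page document would also need a per-page offset, which is not shown.

```python
def scale_box(box, viewport_size, image_size):
    """Scale [x1, y1, x2, y2] from viewport (CSS) pixels to image pixels."""
    viewport_w, viewport_h = viewport_size
    image_w, image_h = image_size
    sx = image_w / viewport_w  # horizontal scale factor
    sy = image_h / viewport_h  # vertical scale factor
    x1, y1, x2, y2 = box
    return [round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy)]


# Illustrative numbers only: an 800x1000 viewport mapped to a 1600x3000 image.
scaled = scale_box([10, 20, 30, 40], (800, 1000), (1600, 3000))
```

Keeping separate horizontal and vertical factors means the mapping still holds if the viewport and the image differ slightly in aspect ratio.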

6 Final Output Package

File	Format	Description
`{id}.jpg`	JPEG/PNG	Rasterized document image (300 DPI)
`{id}.json`	JSON	Bounding box annotations with entity types
`{id}.ttl`	Turtle	RDF graph with DoCO-FD + Schema.org semantics
`{id}.jsonld`	JSON-LD	Linked Data format for web publishing
`{id}.pdf` (optional)	PDF	Intermediate PDF artifact

Bounding Box Ground Truth

Annotation Structure

Each document element with a data-label attribute becomes an annotation:

{
  "document_id": "INV_000001",
  "template_type": "invoice",
  "width": 2480,
  "height": 3508,
  "annotations": [
    {
      "label": "invoice_number",
      "text_content": "INV-2024-001",
      "box_2d": [180, 245, 420, 290],
      "entity_type": "INVOICE_NUMBER",
      "generic_type": "IDENTIFIER"
    }
  ]
}

box_2d format: [x1, y1, x2, y2] in pixels, 0-indexed from top-left
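A consumer of these files may want to check each annotation against the stated format before training. The validator below is a small sketch under that assumption: `validate_document` is a hypothetical helper, and the sample record uses made-up values.

```python
def validate_document(doc: dict) -> list:
    """Return a list of problems; an empty list means every box is a
    non-degenerate [x1, y1, x2, y2] rectangle inside the image."""
    problems = []
    width, height = doc["width"], doc["height"]
    for i, ann in enumerate(doc["annotations"]):
        x1, y1, x2, y2 = ann["box_2d"]
        inside = 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height
        if not inside:
            problems.append(
                f"annotation {i} ({ann['label']}): box {ann['box_2d']} "
                f"is degenerate or outside {width}x{height}"
            )
    return problems


# Hypothetical record used only to exercise the check.
doc = {
    "document_id": "INV_000001",
    "width": 2480,
    "height": 3508,
    "annotations": [
        {"label": "total_amount", "box_2d": [1800, 3000, 2200, 3060]},
    ],
}
problems = validate_document(doc)
```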

Key Advantages

🎯 Pixel-Perfect GT

Browser-reported element positions guarantee accurate bounding boxes, unlike heuristics or manual annotation

📄 Realistic Documents

Actual browser rendering with real fonts, spacing, and layout—not synthetic approximations

🏷️ Rich Semantics

Every annotation has entity type, generic type, and RDF semantic markup

🔄 Scalable

Generate thousands of unique documents with automatic ground truth—no manual labeling

📐 Multi-Page Support

Pagination produces realistic multi-page documents with per-page annotations

🌐 Standards-Based

Uses web standards (HTML, CSS, RDFa) with established tooling

Comparison with Alternatives

Approach	GT Accuracy	Doc Realism	Scalability
SDD Pipeline	✓ Perfect	✓ High	✓ Unlimited
Manual Annotation	~95% (human error)	✓ Real documents	Expensive, slow
Template + PIL/Pillow	✓ Perfect	✗ Unrealistic fonts/layout	✓ Fast
GAN Generation	✗ No GT	~Realistic	✓ Fast

Bottom Line: This pipeline uniquely combines the realism of web rendering with the precision of programmatic coordinate extraction, producing datasets suitable for training production-grade document understanding models.