Template & Ground Truth Pipeline

System architecture for synthetic document generation with pixel-perfect annotations

System Overview

The Synthetic Document Dataset (SDD) generates realistic documents with precise ground truth through a multi-stage pipeline. HTML is an intermediate representation—the final outputs are rasterized document images with pixel-accurate bounding box annotations.

This approach ensures documents look authentic while providing exact coordinates for every text element, enabling training of document understanding models with perfect supervision.

[Figure: Document Generation Pipeline]
Stage 1: Template Engine (Jinja2 template + Faker data; generates HTML + RDFa)
Stage 2: Browser, Playwright (Chromium) (execute JS, extract DOM bounding boxes)
Stage 3: PDF Engine, Paged.js + WeasyPrint (paginate, generate PDF)
Stage 4: Rasterization, pdf2image (poppler) (300 DPI, JPEG/PNG)
Stage 5: Ground Truth (bounding box coordinates; JSON, TTL, JSON-LD)
Final Output: image (.jpg / .png) and annotations (.json / .ttl)
Note: HTML is intermediate only, not a final product.
Legend: Input Generation · Processing (Rendering) · Output (GT + Image)

Why This Architecture?

HTML is a means, not an end. We use HTML + CSS as an intermediate representation because a browser renders it with real fonts, spacing, and layout, and can report the exact position of every element.

Pipeline Stages

1 Template Engine

Jinja2 templates define document structure with CSS styling. Faker generates realistic data (names, addresses, amounts). RDFa attributes add semantic markup (DoCO-FD + Schema.org).

Jinja2 Templates

HTML templates with variable substitution for dynamic content generation

Technology: Jinja2

Data Generation

Realistic synthetic data for names, companies, addresses, financials, medical codes

Technology: Faker

Semantic Markup

RDFa annotations linking visual elements to ontological types

Technology: RDFa · DoCO-FD · Schema.org

Output: HTML document with data-label attributes on every annotatable element
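Stage 1 can be sketched roughly as follows. This is a minimal illustration, assuming Jinja2 and Faker are installed; the template, the field names (`invoice_number`, `customer_name`, `total`), and the helpers `fake_invoice_data` and `render_invoice` are hypothetical. The RDFa `property` values are Schema.org terms picked for illustration; DoCO-FD terms are omitted because this document does not list them.

```python
from typing import Mapping

# Hypothetical invoice template: every annotatable element carries a
# data-label attribute, and RDFa attributes attach Schema.org terms.
INVOICE_TEMPLATE = """\
<div vocab="https://schema.org/" typeof="Invoice">
  <span data-label="invoice_number" property="confirmationNumber">{{ invoice_number }}</span>
  <span data-label="customer_name" property="customer">{{ customer_name }}</span>
  <span data-label="total" property="totalPaymentDue">{{ total }}</span>
</div>
"""


def fake_invoice_data(seed: int) -> dict:
    """Generate one synthetic invoice record (field names are assumptions)."""
    from faker import Faker  # third-party, imported lazily: pip install Faker

    fake = Faker()
    fake.seed_instance(seed)  # a fixed seed makes the document reproducible
    return {
        "invoice_number": f"INV-{fake.random_int(min=1, max=999999):06d}",
        "customer_name": fake.name(),
        "total": f"{fake.random_int(min=1000, max=500000) / 100:.2f}",
    }


def render_invoice(data: Mapping[str, str]) -> str:
    """Render the template to HTML, escaping values so they cannot break markup."""
    from jinja2 import Environment  # third-party, imported lazily: pip install Jinja2

    env = Environment(autoescape=True)
    return env.from_string(INVOICE_TEMPLATE).render(**data)
```

Calling `render_invoice(fake_invoice_data(seed=1))` would yield one HTML fragment; varying the seed varies the content while the `data-label` attributes stay fixed, which is what later stages rely on.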

2 Browser Engine (Coordinate Extraction)

Playwright loads HTML in headless Chromium and executes JavaScript to extract bounding box coordinates. This is the critical step that makes pixel-perfect ground truth possible.


Load HTML

Render document in headless browser with all CSS applied


Query Element Positions

JavaScript walks the DOM and calls getBoundingClientRect() on each element that has a data-label attribute


Store Coordinates

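Capture [x1, y1, x2, y2] in CSS pixels relative to the page (getBoundingClientRect() reports viewport-relative values, so the scroll offset is added), and map each box to its entity type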

Output: Coordinate mapping: element_id → [x1, y1, x2, y2, entity_type, text_content]
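The extraction step might look like the sketch below, assuming Playwright's synchronous Python API. The names `EXTRACT_JS`, `to_mapping`, and `extract_boxes` are hypothetical, as are the fallback element ids (`el_0`, `el_1`, ...) and the rule that derives `entity_type` by upper-casing the `data-label` value; the real pipeline may map labels to entity types differently.

```python
from typing import Any

# JavaScript evaluated inside the page: one record per data-label element.
EXTRACT_JS = """
() => Array.from(document.querySelectorAll('[data-label]')).map((el, i) => {
  const r = el.getBoundingClientRect();
  return {
    element_id: el.id || `el_${i}`,
    label: el.getAttribute('data-label'),
    text_content: el.textContent.trim(),
    // getBoundingClientRect() is viewport-relative; adding the scroll
    // offset gives page-relative coordinates.
    box: [r.left + window.scrollX, r.top + window.scrollY,
          r.right + window.scrollX, r.bottom + window.scrollY],
  };
})
"""


def to_mapping(records: list[dict[str, Any]]) -> dict[str, list]:
    """Reshape records into element_id -> [x1, y1, x2, y2, entity_type, text_content]."""
    mapping: dict[str, list] = {}
    for rec in records:
        x1, y1, x2, y2 = rec["box"]
        entity_type = rec["label"].upper()  # assumption: entity type mirrors the label
        mapping[rec["element_id"]] = [x1, y1, x2, y2, entity_type, rec["text_content"]]
    return mapping


def extract_boxes(html: str, viewport_width: int = 794, viewport_height: int = 1123) -> dict[str, list]:
    """Load HTML in headless Chromium and return the coordinate mapping."""
    from playwright.sync_api import sync_playwright  # third-party, imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": viewport_width, "height": viewport_height})
        page.set_content(html, wait_until="load")
        records = page.evaluate(EXTRACT_JS)
        browser.close()
    return to_mapping(records)
```

The default viewport of 794×1123 approximates an A4 page at 96 CSS pixels per inch; that choice is an assumption made here so that one viewport corresponds to one page.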

3 PDF Generation (Pagination)

Paged.js, running in the browser, handles CSS pagination (@page rules, page breaks). WeasyPrint, which does not execute JavaScript, then renders the paginated HTML to PDF. This produces print-ready documents with realistic page layouts.

Paged.js

Client-side pagination library that polyfills CSS Paged Media specification

Technology: JavaScript

WeasyPrint

Converts HTML/CSS to PDF with proper font rendering and layout

Technology: Python

Output: Multi-page PDF document (preserved as intermediate artifact)
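Only the WeasyPrint half of this stage is sketched here, because how the Paged.js output is handed over is not described in this document. The helpers `page_css` and `html_to_pdf`, the A4 page size, and the 20 mm margin are illustrative assumptions.

```python
from pathlib import Path


def page_css(size: str = "A4", margin_mm: int = 20) -> str:
    """Build a CSS @page rule; size and margin are illustrative defaults."""
    return f"@page {{ size: {size}; margin: {margin_mm}mm; }}"


def html_to_pdf(html: str, pdf_path: Path) -> Path:
    """Render an HTML string to a PDF file with WeasyPrint."""
    from weasyprint import CSS, HTML  # third-party, imported lazily

    # The extra stylesheet applies the @page rule on top of the document's own CSS.
    HTML(string=html).write_pdf(str(pdf_path), stylesheets=[CSS(string=page_css())])
    return pdf_path
```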

4 Rasterization (Image Generation)

pdf2image converts PDF pages to high-resolution images by calling poppler-utils. The pipeline's default of 300 DPI (pdf2image's own default is 200) produces publication-quality document images suitable for ML training.

Resolution: 300 DPI (default) produces 2480×3508px for A4 pages
Format: JPEG or PNG
Quality: Suitable for OCR and document understanding models
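The 2480×3508 figure follows from the A4 page size (210 mm × 297 mm) and the 25.4 mm inch. A rough sketch of the arithmetic and of the rasterization call, assuming pdf2image and poppler are installed; `page_pixels`, `rasterize`, and the `{stem}_p{n}` output naming are hypothetical:

```python
from pathlib import Path

MM_PER_INCH = 25.4


def page_pixels(width_mm: float, height_mm: float, dpi: int = 300) -> tuple[int, int]:
    """Nominal pixel size of a page at the given DPI, rounded to whole pixels."""
    return round(width_mm / MM_PER_INCH * dpi), round(height_mm / MM_PER_INCH * dpi)


def rasterize(pdf_path: Path, out_dir: Path, dpi: int = 300, fmt: str = "png") -> list[Path]:
    """Convert each PDF page to an image file and return the written paths."""
    from pdf2image import convert_from_path  # third-party; requires poppler

    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for page_no, image in enumerate(convert_from_path(str(pdf_path), dpi=dpi), start=1):
        path = out_dir / f"{pdf_path.stem}_p{page_no}.{fmt}"
        image.save(path)  # Pillow infers the format from the file extension
        paths.append(path)
    return paths


print(page_pixels(210, 297))  # A4 at 300 DPI -> (2480, 3508)
```

The rasterizer may round fractional pixels differently, so it is safer to read the actual size from each image than to rely on the nominal value.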

Stage 5: Ground Truth Alignment

The key innovation: coordinates extracted from Stage 2 (HTML) are scaled to match Stage 4 (Image) dimensions.

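Because the same pipeline controls both coordinate extraction and rendering, bounding boxes can be mapped directly onto the final image, provided the PDF layout matches the layout measured in the browser. The scaling factor is the ratio of the final image's pixel dimensions to the HTML viewport dimensions. For example, a page laid out in CSS pixels (96 per inch) and rasterized at 300 DPI is scaled by 300/96 = 3.125.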
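The scaling itself is a per-axis multiplication. A minimal sketch, assuming page-relative boxes in CSS pixels and a viewport that corresponds to exactly one page; `scale_box` is a hypothetical helper:

```python
def scale_box(
    box: list[float],
    viewport_size: tuple[float, float],
    image_size: tuple[int, int],
) -> list[int]:
    """Scale [x1, y1, x2, y2] from viewport (CSS px) space to image pixel space."""
    viewport_w, viewport_h = viewport_size
    image_w, image_h = image_size
    sx = image_w / viewport_w  # horizontal scale factor
    sy = image_h / viewport_h  # vertical scale factor
    x1, y1, x2, y2 = box
    return [round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy)]


# A 96 px square viewport rendered to 300 px gives the 300/96 = 3.125 factor.
print(scale_box([8, 16, 24, 32], (96, 96), (300, 300)))  # [25, 50, 75, 100]
```

Computing separate factors per axis keeps the mapping correct even if rounding makes the image's aspect ratio differ slightly from the viewport's.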

Final Output Package

File | Format | Description
{id}.jpg | JPEG/PNG | Rasterized document image (300 DPI)
{id}.json | JSON | Bounding box annotations with entity types
{id}.ttl | Turtle | RDF graph with DoCO-FD + Schema.org semantics
{id}.jsonld | JSON-LD | Linked Data format for web publishing
{id}.pdf (optional) | PDF | Intermediate PDF artifact

Bounding Box Ground Truth

Annotation Structure

Each document element with a data-label attribute becomes an annotation:

{
  "document_id": "INV_000001",
  "template_type": "invoice",
  "width": 2480,
  "height": 3508,
  "annotations": [
    {
      "label": "invoice_number",
      "text_content": "INV-2024-001",
      "box_2d": [180, 245, 420, 290],
      "entity_type": "INVOICE_NUMBER",
      "generic_type": "IDENTIFIER"
    }
  ]
}

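box_2d format: [x1, y1, x2, y2] in image pixels, with the origin (0, 0) at the top-left corner of the image; (x1, y1) is the box's top-left corner and (x2, y2) its bottom-right corner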
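A consumer of these files may want to check that every box respects this format before training. The following is a small sketch of such a check, not part of the pipeline as described; `validate_annotations` is a hypothetical helper that relies only on the fields shown in the structure above.

```python
def validate_annotations(doc: dict) -> list[str]:
    """Return a description of each malformed box; an empty list means all pass."""
    problems = []
    width, height = doc["width"], doc["height"]
    for i, ann in enumerate(doc["annotations"]):
        x1, y1, x2, y2 = ann["box_2d"]
        # A valid box has positive area and lies inside the image bounds.
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            problems.append(
                f"annotation {i} ({ann['label']}): box {ann['box_2d']} "
                f"is degenerate or outside {width}x{height}"
            )
    return problems
```

An annotation file loaded with `json.load` can be passed in directly; a non-empty result usually points to swapped axes or to a scaling factor applied to the wrong dimensions.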

Key Advantages

🎯 Pixel-Perfect GT

Browser-reported element positions guarantee accurate bounding boxes, unlike heuristics or manual annotation

📄 Realistic Documents

Actual browser rendering with real fonts, spacing, and layout—not synthetic approximations

🏷️ Rich Semantics

Every annotation has entity type, generic type, and RDF semantic markup

🔄 Scalable

Generate thousands of unique documents with automatic ground truth—no manual labeling

📐 Multi-Page Support

Pagination produces realistic multi-page documents with per-page annotations

🌐 Standards-Based

Uses web standards (HTML, CSS, RDFa) with established tooling

Comparison with Alternatives

Approach | GT Accuracy | Doc Realism | Scalability
SDD Pipeline | ✓ Perfect | ✓ High | ✓ Unlimited
Manual Annotation | ~95% (human error) | ✓ Real documents | Expensive, slow
Template + PIL/Pillow | ✓ Perfect | ✗ Unrealistic fonts/layout | ✓ Fast
GAN Generation | ✗ No GT | ~Realistic | ✓ Fast

Bottom Line: This pipeline uniquely combines the realism of web rendering with the precision of programmatic coordinate extraction, producing datasets suitable for training production-grade document understanding models.