87 document types categorized by layout variability and variation strategy
Not all documents should be varied equally. Government forms require strict adherence to standards, while business documents benefit from creative diversity. The SDD taxonomy classifies each template by its appropriate variation strategy.
Key Insight: Layout variability correlates with real-world document diversity. Strict standardized forms (IRS, CMS) have low variability in reality, while marketing materials and certificates vary widely.
| Characteristic | Strict / Standardized | Moderate | Creative / Complex |
|---|---|---|---|
| Layout Variability | Low - Fixed positions | Medium - Some flexibility | High - Highly flexible |
| Real-world Parallel | Government forms, IRS | Business documents | Marketing, certificates |
| Variation Strategy | Minimal - fonts only | Structural elements | Full layout redesign |
| Image Morphisms | Skew, rotation, scan/fax effects | Optional morphisms | Minimal - focus on layout |
| Use Case | OCR accuracy testing | Form understanding | Layout robustness |
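As a rough illustration, the comparison table could be carried into a generator as a small lookup structure. The sketch below is hypothetical: the names `Tier`, `VariationPolicy`, `POLICIES`, and `policy_for` are not part of SDD, and only the string values are taken from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    STRICT = "Strict / Standardized"
    MODERATE = "Moderate"
    CREATIVE = "Creative / Complex"


@dataclass(frozen=True)
class VariationPolicy:
    layout_variability: str
    variation_strategy: str
    image_morphisms: str
    use_case: str


# Values copied from the comparison table; the structure itself is an assumption.
POLICIES = {
    Tier.STRICT: VariationPolicy(
        "Low", "Minimal - fonts only",
        "Skew, rotation, scan/fax effects", "OCR accuracy testing"),
    Tier.MODERATE: VariationPolicy(
        "Medium", "Structural elements",
        "Optional morphisms", "Form understanding"),
    Tier.CREATIVE: VariationPolicy(
        "High", "Full layout redesign",
        "Minimal - focus on layout", "Layout robustness"),
}


def policy_for(tier: Tier) -> VariationPolicy:
    """Look up the variation policy for a tier."""
    return POLICIES[tier]
```

A template would then carry only its tier, and the generator would read the allowed variation from `policy_for(tier)`.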
Strategy (Strict / Standardized tier): Maintain strict adherence to official formats. Variation is limited to minor font changes and subtle styling. Image morphisms (skew, rotation, scan/fax simulation) test robustness while preserving layout.
| Template | Category | Variation Strategy | Why Strict? |
|---|---|---|---|
| CMS 1500 | Medical | Strict layout compliance required for insurance claims processing | Federal standard |
| CMS 485 | Medical | Strict layout compliance for home health certification | Medicare requirement |
| Form I-9 | Government | USCIS mandated format - no variation allowed | Legal compliance |
| IRS Form 1040 | Tax | Strict adherence to IRS format with red ink simulation | Tax authority standard |
| Closing Disclosure | Financial | Strict adherence to government mandate (CFPB format) | TRID compliance |
| AIA G702/G703 | Construction | Strict adherence to AIA standard blocks and tables | Industry standard |
| Passport | Identity | ICAO 9303 standard compliance required | International standard |
| Tax Transcript | Tax | IRS official format with minimal variation | Government document |
| SDS / MSDS | Safety | OSHA/GHS standardized 16-section format | Safety compliance |
| DEA Form 222 | Medical | Strict layout for controlled substance tracking | DEA requirement |
Strategy (Creative / Complex tier): Maximum layout variation, including positioning, fonts, colors, and structural reorganization. The same semantic content is presented in dramatically different visual layouts.
| Template | Category | Variation Strategy | Variation Focus |
|---|---|---|---|
| Bank Statement | Financial | Critically high variability. Multiple layouts, transaction table styles, branding placements. | Layout, tables, branding |
| Birth Certificate | Vital Records | Vary by US state and county designs. High value for robust identity verification. | Regional designs, seals |
| Business Card | Business | Maximum variability required. Wild layouts, fonts, and orientations. | Layout, typography |
| Generic Certificate | Academic | Vary decorative borders, fonts, and signature layouts. | Visual design |
| Blueprint | Technical | Vary title block locations and data field arrangements. | Technical layout |
| Commercial Invoice | International | Vary layout significantly; key fields (incoterms, HS codes) move frequently. | Field positioning |
| Credit Card Statement | Financial | High brand-specific variance. Vary summary box locations. | Branding, layout |
| Diploma | Academic | Vary font styles (Old English vs. modern), seals, and orientations. | Visual style |
| Academic Transcript | Academic | Vary table structures, grading scales, and header placements substantially. | Table layouts |
| After Visit Summary | Medical | Vary section ordering (vitals, meds, instructions) and EHR-generated styling. | Section ordering |
Configurable Dataset Size: n = 100,000 documents
The matrix below shows the recommended distribution for balanced dataset construction. Adjust n to fit your training requirements.
| Variation Tier | Templates | % of Dataset | Docs per Template | Layout Variations | Image Morphisms | Total Docs |
|---|---|---|---|---|---|---|
| Strict / Standardized | 21 | 30% | ~1,429 | 1-2 | 5-10 (skew, rotation, scan/fax) | 30,000 |
| Moderate | 43 | 45% | ~1,047 | 3-5 | 2-3 (optional) | 45,000 |
| Creative / Complex | 23 | 25% | ~1,087 | 5-10 | 0-1 (minimal) | 25,000 |
| Total | 87 | 100% | — | — | — | 100,000 |
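The per-tier figures in the matrix follow from n, the tier percentages, and the template counts. A minimal sketch of that arithmetic, assuming the shares and counts shown above (the names `TIERS` and `allocate` are hypothetical):

```python
# (template count, percent of dataset) per tier, taken from the matrix.
TIERS = {
    "Strict / Standardized": (21, 30),
    "Moderate": (43, 45),
    "Creative / Complex": (23, 25),
}


def allocate(n: int) -> dict:
    """Split n documents across tiers and average them over each tier's templates."""
    rows = {}
    for name, (templates, percent) in TIERS.items():
        total = n * percent // 100  # integer math avoids float rounding surprises
        rows[name] = {
            "total_docs": total,
            "docs_per_template": round(total / templates),
        }
    return rows


plan = allocate(100_000)
```

Calling `allocate` with a different n rescales every tier while keeping the 30/45/25 split.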
| Tier | Base Docs | × Layout Variations | × Image Morphisms | = Unique Presentations | Total Annotations |
|---|---|---|---|---|---|
| Strict | 30,000 | × 1.5 | × 7 | 315,000 | ~315,000 |
| Moderate | 45,000 | × 4 | × 2 | 360,000 | ~360,000 |
| Creative | 25,000 | × 7 | × 1 | 175,000 | ~175,000 |
| Total | 100,000 | — | — | 850,000 | ~850,000 |
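The presentation counts are a straight product of base documents, layout variations, and image morphisms. The sketch below reproduces that arithmetic; the multipliers are the representative values used in the table, and the names `MULTIPLIERS` and `unique_presentations` are hypothetical.

```python
# (base docs, layout variations, image morphisms) per tier, from the table.
MULTIPLIERS = {
    "Strict": (30_000, 1.5, 7),
    "Moderate": (45_000, 4, 2),
    "Creative": (25_000, 7, 1),
}


def unique_presentations(base: int, layouts: float, morphisms: float) -> int:
    """Base documents multiplied out by layout variations and image morphisms."""
    return round(base * layouts * morphisms)


totals = {tier: unique_presentations(*values) for tier, values in MULTIPLIERS.items()}
grand_total = sum(totals.values())
```

Changing a single multiplier, for example raising the Strict tier's morphism count, shows how quickly the annotation volume grows relative to the base document count.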
Note: Each document generates pixel-perfect bounding box annotations for all text entities. Image morphisms (skew ±5°, rotation ±3°, scan noise, fax compression) preserve annotation accuracy through coordinate transformation.
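One way the coordinate transformation could work is to apply the same affine matrix to each bounding box that is applied to the image. The sketch below is an assumption about the approach, not SDD's actual implementation: it composes a horizontal shear (skew) with a rotation about a chosen center and returns the axis-aligned box enclosing the transformed corners. The function names and the sample page center are hypothetical, and the visual direction of rotation depends on whether the y axis points up or down in the image coordinate system.

```python
import math


def affine_matrix(rotation_deg=0.0, skew_deg=0.0, center=(0.0, 0.0)):
    """Return a 2x3 affine matrix: horizontal shear, then rotation, about center."""
    theta = math.radians(rotation_deg)
    shear = math.tan(math.radians(skew_deg))
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Linear part is R @ S with R = [[cos, -sin], [sin, cos]], S = [[1, shear], [0, 1]].
    a, b = cos_t, cos_t * shear - sin_t
    c, d = sin_t, sin_t * shear + cos_t
    cx, cy = center
    # Translation chosen so that the center maps to itself.
    tx = cx - (a * cx + b * cy)
    ty = cy - (c * cx + d * cy)
    return (a, b, tx), (c, d, ty)


def apply(matrix, point):
    """Apply a 2x3 affine matrix to an (x, y) point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def transform_bbox(bbox, matrix):
    """Transform (x_min, y_min, x_max, y_max) and return the enclosing axis-aligned box."""
    x0, y0, x1, y1 = bbox
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    moved = [apply(matrix, p) for p in corners]
    xs = [p[0] for p in moved]
    ys = [p[1] for p in moved]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical page center and text box, using the upper limits quoted in the note.
page_center = (425.0, 550.0)
matrix = affine_matrix(rotation_deg=3.0, skew_deg=5.0, center=page_center)
new_box = transform_bbox((100.0, 200.0, 300.0, 220.0), matrix)
```

The enclosing axis-aligned box is slightly larger than the rotated text. If tighter annotations are needed, the four transformed corners can be stored as a quadrilateral instead. Scan noise and fax compression do not move pixels, so they would leave the coordinates unchanged.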
Low variability forms are ideal for:
High variability templates test:
Combined dataset enables:
Implementation Strategy: Variation is controlled at the template level through:
Complete list of all 87 templates with their variability classification:
Strict / Standardized: Government • Standards • Compliance
Moderate: Business • Professional • Industry
Creative / Complex: Creative • Marketing • Diverse