PDF → Page Images → Preprocessing (denoising, deskewing) → Layout Analysis
Image → Text Block Detection → Block Classification → Block Hierarchy
Block → OCR → Language Detection → Format Analysis → Metadata
Block Structure + Text + Metadata → HTML Template → Final HTML
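
The four stages above can be read as a chain of functions, each taking the state produced by the previous one. The sketch below only illustrates that chaining; the stage names (`preprocess`, `detect_blocks`, `recognize_text`, `render_html`) and the dictionary-based state are hypothetical placeholders, not an existing API, and each stage body is a stub standing in for real image processing or OCR.

```python
from typing import Any, Callable

# A stage takes the accumulated state and returns an updated copy.
State = dict[str, Any]
Stage = Callable[[State], State]


def run_pipeline(state: State, stages: list[Stage]) -> State:
    """Apply each stage in order, feeding its output to the next one."""
    for stage in stages:
        state = stage(state)
    return state


# Hypothetical stub stages; real ones would do denoising, layout analysis, OCR.
def preprocess(state: State) -> State:
    return {**state, "preprocessed": True}


def detect_blocks(state: State) -> State:
    return {**state, "blocks": [{"id": "A", "type": "content"}]}


def recognize_text(state: State) -> State:
    blocks = [
        {**block, "content": "recognized text", "language": "en"}
        for block in state["blocks"]
    ]
    return {**state, "blocks": blocks}


def render_html(state: State) -> State:
    body = "".join(
        f'<div id="{block["id"]}">{block["content"]}</div>'
        for block in state["blocks"]
    )
    return {**state, "html": body}


result = run_pipeline(
    {"pages": []},
    [preprocess, detect_blocks, recognize_text, render_html],
)
```

Keeping every stage behind the same signature makes it easy to swap one implementation (for example a different OCR engine) without touching the rest of the chain.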
{
  "document": {
    "type": "invoice|form|letter|other",
    "language": "pl|en|de",
    "layout": "4-block|6-block|custom",
    "confidence": 0.95
  },
  "blocks": [
    {
      "id": "A",
      "type": "header|content|table|footer",
      "position": {"x": 0, "y": 0, "width": 300, "height": 200},
      "content": "recognized text",
      "language": "en",
      "confidence": 0.92,
      "formatting": {
        "bold": [0, 10],
        "tables": [],
        "lists": []
      }
    }
  ]
}
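
As a rough illustration of the last pipeline step, the sketch below turns one block from a structure like the one above into HTML. It rests on assumptions the structure itself does not state: `formatting.bold` is treated as a single `[start, end)` character range into `content`, and blocks are ordered top to bottom, then left to right, by `position`. The function names `render_block` and `render_document` are hypothetical.

```python
import html


def render_block(block: dict) -> str:
    """Render one block as a <div>, applying the assumed bold range."""
    text = block["content"]
    bold = block.get("formatting", {}).get("bold", [])
    if len(bold) == 2:
        # Assumption: "bold" is one [start, end) character range.
        start, end = bold
        inner = (
            html.escape(text[:start])
            + "<b>" + html.escape(text[start:end]) + "</b>"
            + html.escape(text[end:])
        )
    else:
        inner = html.escape(text)
    return (
        f'<div class="{html.escape(block["type"])}" '
        f'id="block-{html.escape(block["id"])}" '
        f'lang="{html.escape(block["language"])}">{inner}</div>'
    )


def render_document(result: dict) -> str:
    """Render all blocks in reading order (assumed: by y, then x)."""
    ordered = sorted(
        result["blocks"],
        key=lambda b: (b["position"]["y"], b["position"]["x"]),
    )
    return "\n".join(render_block(block) for block in ordered)


sample = {
    "document": {"type": "letter", "language": "en",
                 "layout": "custom", "confidence": 0.95},
    "blocks": [
        {
            "id": "A",
            "type": "content",
            "position": {"x": 0, "y": 0, "width": 300, "height": 200},
            "content": "recognized text",
            "language": "en",
            "confidence": 0.92,
            "formatting": {"bold": [0, 10], "tables": [], "lists": []},
        }
    ],
}

page_html = render_document(sample)
```

Escaping the recognized text with `html.escape` matters here because OCR output can contain characters such as `<` or `&` that would otherwise corrupt the generated page.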