       _   _ _____ __  __ _
__   _| | | |_   _|  \/  | |
\ \ / / |_| | | | | |\/| | |
 \ V /|  _  | | | | |  | | |___
  \_/ |_| |_| |_| |_|  |_|_____|
Visual HTML Generator - Convert PDFs to structured HTML with OCR
A modular system for converting PDF documents to structured HTML with advanced OCR and layout analysis capabilities.
# Install system dependencies
sudo apt-get update
sudo apt-get install -y \
tesseract-ocr \
tesseract-ocr-pol \
tesseract-ocr-eng \
tesseract-ocr-deu \
poppler-utils \
python3-pip \
python3-venv
graph TD
A[PDF Input] --> B[PDF Processor]
B --> C[Layout Analyzer]
C --> D[OCR Engine]
D --> E[HTML Generator]
E --> F[Structured HTML Output]
G[Configuration] --> B
G --> C
G --> D
G --> E
H[Plugins] -->|Extend| B
H -->|Customize| C
H -->|Enhance| D
H -->|Theme| E
+----------------+     +-----------------+     +---------------+
|                |     |                 |     |               |
|   PDF Input    |---->|  PDF Processor  |---->|  Page Images  |
|                |     |                 |     |               |
+----------------+     +-----------------+     +-------.-------+
                                                       |
                                                       v
+----------------+     +-----------------+     +-------+-------+
|                |     |                 |     |               |
|  HTML Output   |<----| HTML Generator  |<----|  OCR Results  |
|                |     |                 |     |               |
+----------------+     +-----------------+     +-------.-------+
                                                       ^
                                                       |
+----------------+     +-----------------+     +-------+-------+
|                |     |                 |     |               |
| Configuration  |---->| Layout Analyzer |---->|  Page Layout  |
|                |     |                 |     |               |
+----------------+     +-----------------+     +---------------+
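The data flow in the diagrams can be sketched as a chain of functions that all receive the same configuration. This is only an illustration of the architecture: `Config`, the four stage functions, and `run_pipeline` are hypothetical stand-ins with stubbed behavior, not the real vhtml classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Config:
    """Hypothetical shared configuration handed to every stage."""
    languages: List[str] = field(default_factory=lambda: ["eng"])
    output_format: str = "html"


def pdf_processor(pdf_path: str, cfg: Config) -> List[str]:
    # Stub: pretend every PDF renders to two page images.
    return [f"{pdf_path}#page{i}" for i in (1, 2)]


def layout_analyzer(pages: List[str], cfg: Config) -> List[Dict]:
    # Stub: report a single full-page text block for each page.
    return [{"page": page, "blocks": ["body"]} for page in pages]


def ocr_engine(layouts: List[Dict], cfg: Config) -> List[Dict]:
    # Stub: attach placeholder text tagged with the configured languages.
    lang = "+".join(cfg.languages)
    return [{**item, "text": f"text of {item['page']} ({lang})"} for item in layouts]


def html_generator(results: List[Dict], cfg: Config) -> str:
    # Wrap each page's text in its own <section>.
    body = "\n".join(f"<section><p>{r['text']}</p></section>" for r in results)
    return f"<html><body>\n{body}\n</body></html>"


def run_pipeline(pdf_path: str, cfg: Config) -> str:
    """PDF Input -> PDF Processor -> Layout Analyzer -> OCR Engine -> HTML Generator."""
    pages = pdf_processor(pdf_path, cfg)
    layouts = layout_analyzer(pages, cfg)
    results = ocr_engine(layouts, cfg)
    return html_generator(results, cfg)


print(run_pipeline("document.pdf", Config(languages=["pol", "eng"])))
```

Keeping each stage a plain function of its input plus the configuration is one way to make the plugin hooks in the first diagram possible: a plugin could replace a single stage without touching the others.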
# 1. Clone the repository
git clone https://github.com/fin-officer/vhtml.git
cd vhtml
# 2. Install Python dependencies
poetry install
# 3. Install system dependencies (if not already installed)
make install-deps
# 4. Verify installation
make validate
# Build the Docker image
docker build -t vhtml .
# Run the container
docker run -v "$(pwd)/invoices:/app/invoices" -v "$(pwd)/output:/app/output" vhtml \
    python -m vhtml.main /app/invoices/sample.pdf -o /app/output
To verify that all dependencies are correctly installed:
# Run validation script
make validate
# Or directly
python scripts/validate_installation.py
# Expected output:
# ✓ Python version: 3.8+
# ✓ Tesseract found: v5.0.0
# ✓ Poppler utils installed
# ✓ All Python dependencies satisfied
# ✓ Test document processed successfully
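For readers who want to check the system tools by hand, the first three checks can be approximated with the standard library alone. This is a minimal sketch and not the contents of `scripts/validate_installation.py`; the function names are hypothetical, and it assumes the Poppler check can be done by looking for `pdftoppm`, one of the binaries shipped in poppler-utils.

```python
import shutil
import sys
from typing import Dict, List, Sequence, Tuple


def check_environment(
    min_python: Tuple[int, int] = (3, 8),
    tools: Sequence[str] = ("tesseract", "pdftoppm"),
) -> Dict[str, bool]:
    """Return a mapping of check name to pass/fail."""
    checks = {"python": sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        checks[tool] = shutil.which(tool) is not None
    return checks


def format_report(checks: Dict[str, bool]) -> List[str]:
    """Render one line per check, marked with a tick or a cross."""
    return [f"{'✓' if ok else '✗'} {name}" for name, ok in checks.items()]


print("\n".join(format_report(check_environment())))
```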
# Process a single PDF file
poetry run python -m vhtml.main /path/to/document.pdf -o output_directory
# Process a directory of PDF files (batch mode)
poetry run python -m vhtml.main /path/to/pdf_directory -b -o output_directory
# Process and open in browser
poetry run python -m vhtml.main /path/to/document.pdf -v
# Specify output format (html/mhtml)
poetry run python -m vhtml.main document.pdf --format mhtml
# Use specific OCR language
poetry run python -m vhtml.main document.pdf --lang pol+eng
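The flags used above map naturally onto an `argparse` parser. The sketch below mirrors only the options shown in these commands; the destination names, defaults, and help texts are assumptions, and the real `vhtml.main` may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors the flags shown in the usage examples.
    parser = argparse.ArgumentParser(prog="vhtml")
    parser.add_argument("input", help="PDF file, or a directory when -b is given")
    parser.add_argument("-o", dest="output", default="output",
                        help="output directory (default is an assumption)")
    parser.add_argument("-b", dest="batch", action="store_true",
                        help="process every PDF in the input directory")
    parser.add_argument("-v", dest="view", action="store_true",
                        help="open the result in a browser")
    parser.add_argument("--format", choices=["html", "mhtml"], default="html")
    parser.add_argument("--lang", default="eng",
                        help="Tesseract language codes joined with '+'")
    return parser


args = build_parser().parse_args(["document.pdf", "--lang", "pol+eng", "--format", "mhtml"])
languages = args.lang.split("+")  # "pol+eng" becomes ["pol", "eng"]
print(args.input, args.format, languages)
```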
from vhtml import DocumentAnalyzer
# Initialize with custom settings
analyzer = DocumentAnalyzer(
    languages=['pol', 'eng'],  # OCR languages
    output_format='html',      # 'html' or 'mhtml'
    debug_mode=False           # Enable debug output
)
# Process a single document
result = analyzer.process("document.pdf", "output_dir")
print(f"Generated: {result.output_path}")
print(f"Metadata: {result.metadata}")
# Batch processing
results = analyzer.process_batch("input_dir", "output_dir")
for result in results:
    print(f"Processed: {result.input_path} -> {result.output_path}")
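When a batch contains a damaged or unreadable PDF, it can be useful to keep going and report the failures at the end. The README does not say how `process_batch` handles errors, so the helper below is a hypothetical wrapper that takes any per-file callable (for example `analyzer.process`) and collects failures instead of stopping.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def process_directory(
    input_dir: str,
    output_dir: str,
    process_one: Callable[[str, str], object],
) -> Tuple[List[str], Dict[str, str]]:
    """Run process_one on every *.pdf in input_dir and record failures."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        try:
            process_one(str(pdf), output_dir)
            succeeded.append(pdf.name)
        except Exception as exc:  # keep the batch running after one bad file
            failed[pdf.name] = str(exc)
    return succeeded, failed


# Usage, assuming an analyzer created as shown above:
# ok, bad = process_directory("input_dir", "output_dir", analyzer.process)
```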
from vhtml import PDFProcessor, OCREngine
# Load and preprocess PDF
processor = PDFProcessor()
pages = processor.process("document.pdf")
# Perform OCR
ocr = OCREngine(languages=['eng'])
for page_num, page_image in enumerate(pages):
    text = ocr.extract_text(page_image)
    print(f"Page {page_num + 1}:\n{text}\n{'='*50}")
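For comparison, the same two steps can be written directly against the `pdf2image` and `pytesseract` packages, which wrap Poppler and Tesseract. Whether vhtml uses these packages internally is an assumption; the README only lists Tesseract and poppler-utils as system dependencies. The helper names and the 300 DPI default are hypothetical.

```python
from typing import Iterable, List


def tesseract_lang(languages: Iterable[str]) -> str:
    """Join language codes the way Tesseract expects, e.g. 'pol+eng'."""
    return "+".join(languages)


def ocr_pdf(pdf_path: str, languages: Iterable[str] = ("eng",), dpi: int = 300) -> List[str]:
    """Rasterize each page with Poppler, then run Tesseract on it."""
    # Imported lazily so tesseract_lang works without these packages installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    lang = tesseract_lang(languages)
    return [pytesseract.image_to_string(page, lang=lang) for page in pages]


# Usage (requires pdf2image, pytesseract, Poppler, and Tesseract):
# for number, text in enumerate(ocr_pdf("document.pdf", ["pol", "eng"]), start=1):
#     print(f"Page {number}:\n{text}")
```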
graph LR
A[Clone Repository] --> B[Install Dependencies]
B --> C[Run Tests]
C --> D[Make Changes]
D --> E[Run Linters]
E --> F[Update Tests]
F --> G[Commit Changes]
G --> H[Create Pull Request]
# Run tests
make test
# Format code
make format
# Run linters
make lint
# Generate documentation
make docs
# Build package
make build
git checkout -b feature/amazing-feature
git commit -m 'Add some amazing feature'
git push origin feature/amazing-feature
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Generate a standalone HTML file with all images, JS, and JSON embedded:
poetry run python examples/pdf2html.py
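One common way to make an HTML file standalone is to inline each image as a base64 `data:` URI. The sketch below shows that technique in isolation; it is not taken from `examples/pdf2html.py`, and `to_data_uri` and `standalone_page` are hypothetical names.

```python
import base64
import html
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw bytes as a data: URI, guessing the MIME type from the name."""
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def standalone_page(title: str, image_bytes: bytes, image_name: str) -> str:
    """Build a single HTML document with the image embedded inline."""
    src = to_data_uri(image_bytes, image_name)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset=\"utf-8\"><title>{html.escape(title)}</title></head>\n"
        f"<body><img src=\"{src}\" alt=\"{html.escape(image_name)}\"></body></html>\n"
    )


print(to_data_uri(b"abc", "page1.png"))  # data:image/png;base64,YWJj
```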
Generate a fully self-contained MHTML file for browser archiving:
poetry run python examples/pdf2mhtml.py
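MHTML is a MIME `multipart/related` message in which every resource carries a `Content-Location` header, so Python's standard `email` package can assemble one. The following is a rough sketch of that container format, not the code in `examples/pdf2mhtml.py`; `build_mhtml`, the `base_url` value, and the assumption that all images are PNG are illustrative.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import Dict


def build_mhtml(html_text: str, images: Dict[str, bytes],
                base_url: str = "file:///page/") -> bytes:
    """Pack an HTML page and its PNG images into one multipart/related message."""
    msg = MIMEMultipart("related", type="text/html")
    msg["Subject"] = "vhtml export"

    # The root part is the HTML document itself.
    root = MIMEText(html_text, "html", "utf-8")
    root["Content-Location"] = base_url + "index.html"
    msg.attach(root)

    # Each image is addressed by the same URL the HTML uses in its src attribute.
    for name, data in images.items():
        part = MIMEImage(data, _subtype="png")
        part["Content-Location"] = base_url + name
        msg.attach(part)
    return msg.as_bytes()


page = '<html><body><img src="file:///page/p1.png"></body></html>'
archive = build_mhtml(page, {"p1.png": b"\x89PNG placeholder bytes"})
print(len(archive), "bytes")
```

For the archive to render, the `src` attributes in the HTML must match the `Content-Location` values of the attached parts.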
See examples/html.py and examples/mhtml.py for usage patterns and batch processing.
vhtml/
├── vhtml/
│ ├── core/
│ │ ├── pdf_processor.py
│ │ ├── layout_analyzer.py
│ │ ├── ocr_engine.py
│ │ └── html_generator.py
│ └── main.py
├── scripts/
│ ├── validate_installation.py
│ └── test_integration.py
├── docs/
│ ├── ARCHITECTURE.md
│ ├── IMPLEMENTATION.md
│ └── PROJECT_STRUCTURE.md
├── Makefile
├── pyproject.toml
└── README.md
# Setup development environment
make setup
# Run tests
make test
# Format code
make format
# Lint code
make lint
# Build package
make build
For more detailed information, see the documentation files in docs/: ARCHITECTURE.md, IMPLEMENTATION.md, and PROJECT_STRUCTURE.md.