Installation Guide | Examples | Configuration | CLI | API
Enterprise-grade document processing with advanced OCR for invoices, receipts, and financial documents
InvOCR is a powerful document processing system that automates the extraction and conversion of financial documents. It supports multiple input formats (PDF, images) and output formats (JSON, XML, HTML, PDF) with multi-language OCR capabilities.
| Type | Description | Key Features |
|---|---|---|
| Invoices | Commercial invoices | Line items, totals, tax details |
| Receipts | Retail receipts | Merchant info, items, totals |
| Bills | Utility bills | Account info, payment details |
| Bank Statements | Account statements | Transactions, balances |
| Custom | Any document | Configurable templates |
invutil - contains the most generic functions, which have the fewest dependencies: git@github.com:fin-officer/invutil.git
valider - validation mechanisms with clearly defined interfaces: git@github.com:fin-officer/valider.git
dextra - requires Utils and OCR to be extracted first: git@github.com:fin-officer/dextra.git
dotect - depends on some of the Utils components: git@github.com:fin-officer/dotect.git
# Convert PDF to JSON
poetry run invocr convert invoice.pdf invoice.json
poetry run invocr convert ./2024.11/attachments/invoice-25417.pdf ./2024.11/attachments/invoice-25417.json
# Process image with specific languages
poetry run invocr img2json receipt.jpg --languages en,pl,de
# Start the API server (use --port 8001 if port 8000 is already in use)
poetry run invocr serve --port 8001
# Run batch processing
poetry run invocr batch ./2024.11/attachments/ ./2024.11/attachments/ --format json
# Convert a single PDF to JSON with specialized extraction
poetry run invocr pdf2json path/to/input.pdf --output path/to/output.json
# Process all PDFs in a directory
poetry run invocr batch ./2024.09/attachments/ ./2024.09/attachments/ --format json
poetry run invocr batch ./2024.10/attachments/ ./2024.10/attachments/ --format json
poetry run invocr batch ./2024.11/attachments/ ./2024.11/attachments/ --format json
# Process with complete workflow (OCR, detection, extraction, validation)
poetry run invocr workflow ./2024.11/attachments/ --output-dir ./2024.11/attachments/
# Available options:
# --input-dir: Directory containing PDF files (default: 2024.09/attachments)
# --output-dir: Directory to save JSON files (default: 2024.09/json)
# --log-level: Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
# View extracted text from a PDF for debugging
poetry run python debug_pdf.py path/to/document.pdf
# Full PDF to HTML conversion pipeline (one step)
invocr pipeline --input invoice.pdf --output ./output/invoice.html --start-format pdf --end-format html
# Step-by-step PDF to HTML conversion
invocr pdf2img --input invoice.pdf --output ./temp/invoice.png
invocr img2json --input ./temp/invoice.png --output ./temp/invoice.json
invocr json2xml --input ./temp/invoice.json --output ./temp/invoice.xml
invocr pipeline --input ./temp/invoice.xml --output ./output/invoice.html --start-format xml --end-format html
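The step-by-step conversion above can also be driven from a script. The following is a minimal sketch, assuming the `invocr` subcommands and the `--input`/`--output` flags behave as in the commands above. The helper names `build_pipeline_commands` and `run_pipeline`, and the way temporary file names are derived, are illustrative and not part of InvOCR.

```python
from pathlib import Path
import subprocess


def build_pipeline_commands(pdf, temp_dir, html_out):
    """Return argv lists for the PDF -> PNG -> JSON -> XML -> HTML steps.

    Hypothetical helper: it only mirrors the CLI calls listed above.
    """
    pdf, temp_dir = Path(pdf), Path(temp_dir)
    png = temp_dir / f"{pdf.stem}.png"
    json_file = temp_dir / f"{pdf.stem}.json"
    xml_file = temp_dir / f"{pdf.stem}.xml"
    return [
        ["invocr", "pdf2img", "--input", str(pdf), "--output", str(png)],
        ["invocr", "img2json", "--input", str(png), "--output", str(json_file)],
        ["invocr", "json2xml", "--input", str(json_file), "--output", str(xml_file)],
        ["invocr", "pipeline", "--input", str(xml_file), "--output", str(html_out),
         "--start-format", "xml", "--end-format", "html"],
    ]


def run_pipeline(pdf, temp_dir, html_out):
    """Run each step in order; check=True stops at the first failing command."""
    Path(temp_dir).mkdir(parents=True, exist_ok=True)
    for argv in build_pipeline_commands(pdf, temp_dir, html_out):
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the planned steps before anything is run.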
For batch processing, the following directory structure is recommended:
./
├── 2024.09/
│   ├── attachments/    # Put your PDF files here
│   └── json/           # JSON output will be saved here
├── 2024.10/
│   ├── attachments/
│   └── json/
└── ...
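To make the layout concrete, here is a small sketch of how a script might pair each PDF under `attachments/` with its target file under `json/`. The function names `json_path_for` and `plan_batch` are hypothetical, and the sketch assumes one directory per month directly below the root, as in the tree above.

```python
from pathlib import Path


def json_path_for(pdf_path):
    """Map <month>/attachments/<name>.pdf to <month>/json/<name>.json."""
    pdf_path = Path(pdf_path)
    if pdf_path.parent.name != "attachments":
        raise ValueError(f"expected an 'attachments' directory: {pdf_path}")
    month_dir = pdf_path.parent.parent
    return month_dir / "json" / f"{pdf_path.stem}.json"


def plan_batch(root):
    """Yield (pdf, json) path pairs for every PDF in <root>/<month>/attachments/."""
    for pdf in sorted(Path(root).glob("*/attachments/*.pdf")):
        yield pdf, json_path_for(pdf)
```

For example, `json_path_for("2024.09/attachments/invoice-25417.pdf")` yields `2024.09/json/invoice-25417.json`.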
import time
import requests
# 1. Upload a PDF file (the with-block closes the file after the upload)
with open("invoice.pdf", "rb") as pdf_file:
    upload_response = requests.post(
        "http://localhost:8001/api/v1/upload",
        files={"file": pdf_file},
    )
file_id = upload_response.json()["file_id"]
# 2. Start the PDF to HTML conversion pipeline
convert_response = requests.post(
    "http://localhost:8001/api/v1/convert/pipeline",
    json={
        "file_id": file_id,
        "start_format": "pdf",
        "end_format": "html",
        "options": {
            "languages": ["en", "pl"],
            "output_type": "file",
        },
    },
)
task_id = convert_response.json()["task_id"]
# 3. Check conversion status
result_file_id = None
while True:
    status_response = requests.get(f"http://localhost:8001/api/v1/tasks/{task_id}")
    task = status_response.json()
    if task["status"] == "completed":
        result_file_id = task["result"]["file_id"]
        break
    elif task["status"] == "failed":
        print("Conversion failed:", task["error"])
        break
    time.sleep(1)  # Wait before checking again
# 4. Download the converted HTML file (only if the conversion succeeded)
if result_file_id is not None:
    download_response = requests.get(f"http://localhost:8001/api/v1/files/{result_file_id}")
    with open("output.html", "wb") as f:
        f.write(download_response.content)
    print("Conversion complete! HTML file saved as output.html")
# 1. Upload a PDF file
curl -X POST "http://localhost:8001/api/v1/upload" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@invoice.pdf"
# 2. Start the conversion pipeline (replace YOUR_FILE_ID)
curl -X POST "http://localhost:8001/api/v1/convert/pipeline" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d '{
"file_id": "YOUR_FILE_ID",
"start_format": "pdf",
"end_format": "html",
"options": {
"languages": ["en", "pl"],
"output_type": "file"
}
}'
# 3. Check task status (replace YOUR_TASK_ID)
curl -X GET "http://localhost:8001/api/v1/tasks/YOUR_TASK_ID" \
-H "accept: application/json"
# 4. Download the result (replace YOUR_RESULT_FILE_ID)
curl -X GET "http://localhost:8001/api/v1/files/YOUR_RESULT_FILE_ID" \
-H "accept: application/json" \
-o output.html
invocr/
├── invocr/                     # Main package
│   ├── core/                   # Core processing modules
│   │   ├── ocr.py              # OCR engine (Tesseract + EasyOCR)
│   │   ├── converter.py        # Universal format converter
│   │   ├── extractor.py        # Data extraction logic
│   │   └── validator.py        # Data validation
│   │
│   ├── formats/                # Format-specific handlers
│   │   ├── pdf.py              # PDF operations
│   │   ├── image.py            # Image processing
│   │   ├── json_handler.py     # JSON operations
│   │   ├── xml_handler.py      # EU XML format
│   │   └── html_handler.py     # HTML generation
│   │
│   ├── api/                    # REST API
│   │   ├── main.py             # FastAPI application
│   │   ├── routes.py           # API endpoints
│   │   └── models.py           # Pydantic models
│   │
│   ├── cli/                    # Command line interface
│   │   └── commands.py         # CLI commands
│   │
│   └── utils/                  # Utilities
│       ├── config.py           # Configuration
│       ├── logger.py           # Logging setup
│       └── helpers.py          # Helper functions
│
├── tests/                      # Test suite
├── scripts/                    # Installation scripts
├── docs/                       # Documentation
├── Dockerfile                  # Docker configuration
├── docker-compose.yml          # Docker Compose
├── pyproject.toml              # Poetry configuration
└── README.md                   # This file
git clone repo && cd invocr
./scripts/install.sh
poetry run invocr serve
docker-compose up
docker-compose -f docker-compose.prod.yml up
kubectl apply -f kubernetes/
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Web Client    │    │   Mobile App    │    │   CLI Client    │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
                   ┌─────────────▼───────────────┐
                   │         Nginx Proxy         │
                   │    (Load Balancer + SSL)    │
                   └─────────────┬───────────────┘
                                 │
                   ┌─────────────▼───────────────┐
                   │      InvOCR API Server      │
                   │     (FastAPI + Uvicorn)     │
                   └─────────────┬───────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        │                        │                        │
┌───────▼───────┐    ┌───────────▼──────────┐    ┌────────▼────────┐
│  OCR Engine   │    │  Format Converters   │    │   Validators    │
│ (Tesseract +  │    │  (PDF/IMG/JSON/XML/  │    │  (Data Quality  │
│   EasyOCR)    │    │        HTML)         │    │   + Metrics)    │
└───────────────┘    └──────────────────────┘    └─────────────────┘
        │                        │                        │
        └────────────────────────┼────────────────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        │                        │                        │
┌───────▼───────┐    ┌───────────▼──────────┐    ┌────────▼────────┐
│  PostgreSQL   │    │     Redis Cache      │    │  File Storage   │
│  (Metadata +  │    │  (Jobs + Sessions)   │    │ (Temp + Output) │
│  Analytics)   │    │                      │    │                 │
└───────────────┘    └──────────────────────┘    └─────────────────┘
InvOCR is now a fully functional, enterprise-grade invoice processing system with:
🎯 33 artifacts - all system components
🎯 50+ files - complete project structure
🎯 All conversions - PDF→IMG→JSON→XML→HTML→PDF
🎯 Multilingual OCR - 6 languages with auto-detection
🎯 3 interfaces - CLI, REST API, Docker
🎯 EU XML compliance - UBL 2.1 standard
🎯 Production deployment - K8s, Docker, CI/CD
🎯 Enterprise security - Monitoring, alerts, compliance
🎯 Developer tools - VS Code, testing, debugging
🎯 Documentation - Complete README, API docs, examples
# Clone repository
git clone https://github.com/fin-officer/invocr.git
cd invocr
# Build and start services
docker-compose up -d --build
# Access the API at http://localhost:8000
# View API docs at http://localhost:8000/docs
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-pol tesseract-ocr-deu \
tesseract-ocr-fra tesseract-ocr-spa tesseract-ocr-ita \
poppler-utils libpango-1.0-0 libharfbuzz0b python3-dev build-essential
curl -sSL https://install.python-poetry.org | python3 -
# Run all tests
poetry run pytest
# Run tests with coverage
poetry run pytest --cov=invocr --cov-report=html
# Run linters
poetry run flake8 invocr/
poetry run mypy invocr/
# Format code
poetry run black invocr/ tests/
poetry run isort invocr/ tests/
# Build package
poetry build
# Publish to PyPI (requires credentials)
poetry publish
For detailed documentation, see:
We welcome contributions! Please see our Contributing Guidelines for details.
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
For support, please open an issue in the issue tracker.
poetry install
cp .env.example .env
### Option 3: Docker
```bash
# Using Docker Compose (easiest)
docker-compose up
# Or build manually
docker build -t invocr .
docker run -p 8000:8000 invocr
```
# Convert PDF to JSON
poetry run python pdf2json.py --input invoice.pdf --output invoice.json
# Process image with specific languages
poetry run python process_pdfs.py --input receipt.jpg --output receipt.json --languages en,pl,de
# Convert invoice PDF to JSON with output directory
poetry run python pdf_to_json.py --input ./invoices/invoice.pdf --output-dir ./output/
# PDF to images
poetry run python pdf_to_json.py --extract-images --input document.pdf --output-dir ./images/
# Image to JSON (OCR)
poetry run python process_pdfs.py --input scan.png --output data.json --doc-type invoice
# Debug invoice extraction
poetry run invocr debug invoice.pdf
# View OCR text from a document
poetry run invocr ocr-text invoice.pdf
# Batch processing
poetry run invocr batch ./input_files/ ./output/ --format json
# View OCR text extracted from PDF
poetry run invocr ocr-text document.pdf
# Test invoice extraction
poetry run invocr validate --input-file path/to/invoice.json
# Debug receipt extraction
poetry run invocr debug --doc-type receipt path/to/receipt.pdf
InvOCR features a modular extraction system that provides better accuracy, maintainability, and extensibility:

- `formats/pdf/extractors/base_extractor.py`
- `PDFInvoiceExtractor`: General PDF invoice processor
- `AdobeInvoiceExtractor`: Specialized for Adobe JSON invoices with OCR verification
- `patterns.py`: Centralized regex patterns for all data elements
- `date_utils.py`: Date parsing and extraction utilities
- `numeric_utils.py`: Number and currency utilities
- `item_utils.py`: Line item extraction utilities
- `totals_utils.py`: Invoice totals extraction utilities

The system implements a decision tree approach for document classification:
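As an illustration of what a decision tree for document classification could look like, here is a minimal sketch. The keywords, the branch order, and the function name `classify_document` are assumptions made for this example and are not taken from InvOCR's code; the returned labels follow the document types listed earlier in this README.

```python
def classify_document(text):
    """Classify OCR text with a small hand-written decision tree.

    The keywords and branch order are illustrative assumptions only.
    """
    lowered = text.lower()
    if "invoice" in lowered:
        # Second-level branch: pick the specialized extractor for Adobe invoices.
        return "adobe_invoice" if "adobe" in lowered else "invoice"
    if "receipt" in lowered:
        return "receipt"
    if "statement" in lowered and "balance" in lowered:
        return "bank_statement"
    if "account" in lowered and "due" in lowered:
        return "bill"
    # Nothing matched: fall back to a configurable template.
    return "custom"
```

Each branch narrows the choice of extractor, so adding a document type means adding one test at the appropriate depth of the tree.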
# Example: Extract data from a PDF invoice
from invocr.formats.pdf.extractors.pdf_invoice_extractor import PDFInvoiceExtractor
# Create an extractor
extractor = PDFInvoiceExtractor()
# Extract data from text (`text` must already hold the invoice text, e.g. the OCR output)
invoice_data = extractor.extract(text)
# Access extracted data
print(f"Invoice Number: {invoice_data['invoice_number']}")
print(f"Issue Date: {invoice_data['issue_date']}")
print(f"Total Amount: {invoice_data['total_amount']} {invoice_data['currency']}")
# Start server
invocr serve
# Convert file
curl -X POST "http://localhost:8000/convert" \
-F "file=@invoice.pdf" \
-F "target_format=json" \
-F "languages=en,pl"
# Check job status
curl "http://localhost:8000/status/{job_id}"
# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json
from invocr import create_converter
# Create converter instance
converter = create_converter(languages=['en', 'pl', 'de'])
# Convert PDF to JSON
result = converter.pdf_to_json('invoice.pdf')
print(result)
# Convert image to JSON with OCR
data = converter.image_to_json('scan.png', document_type='invoice')
# Convert JSON to EU XML
xml_content = converter.json_to_xml(data, format='eu_invoice')
# Full conversion pipeline
result = converter.convert('input.pdf', 'output.json', 'auto', 'json')
When running the API server, visit http://localhost:8000/docs to view the API docs.

- `POST /convert` - Convert single file
- `POST /convert/pdf2img` - PDF to images
- `POST /convert/img2json` - Image OCR to JSON
- `POST /batch/convert` - Batch processing
- `GET /status/{job_id}` - Job status
- `GET /download/{job_id}` - Download result
- `GET /health` - Health check
- `GET /info` - System information

Key configuration options in `.env`:
# OCR Settings
DEFAULT_OCR_ENGINE=auto # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Supported languages
OCR_CONFIDENCE_THRESHOLD=0.3 # Minimum confidence
# Processing
MAX_FILE_SIZE=52428800 # 50MB limit
PARALLEL_WORKERS=4 # Concurrent processing
MAX_PAGES_PER_PDF=10 # Page limit
# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
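The sketch below shows one way an application could read these options with typed defaults. It is an assumption about usage, not InvOCR's actual `config.py`: `load_settings` is a hypothetical function, and it expects the values to be present in the environment already, without the trailing comments shown above.

```python
import os


def load_settings(env=None):
    """Read the options listed above from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    languages = env.get("DEFAULT_LANGUAGES", "en,pl,de,fr,es")
    return {
        "ocr_engine": env.get("DEFAULT_OCR_ENGINE", "auto"),
        "languages": [code.strip() for code in languages.split(",") if code.strip()],
        "confidence_threshold": float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.3")),
        # 50 MB expressed in bytes: 50 * 1024 * 1024 = 52428800
        "max_file_size": int(env.get("MAX_FILE_SIZE", str(50 * 1024 * 1024))),
        "parallel_workers": int(env.get("PARALLEL_WORKERS", "4")),
        "max_pages_per_pdf": int(env.get("MAX_PAGES_PER_PDF", "10")),
    }
```

Accepting a plain mapping instead of always reading `os.environ` makes the function easy to call with test values.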
| Code | Language | Tesseract | EasyOCR |
|---|---|---|---|
| en | English | ✅ | ✅ |
| pl | Polish | ✅ | ✅ |
| de | German | ✅ | ✅ |
| fr | French | ✅ | ✅ |
| es | Spanish | ✅ | ✅ |
| it | Italian | ✅ | ✅ |
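Tesseract identifies languages by three-letter codes (the same ones that appear in package names such as `tesseract-ocr-pol`), and several languages are combined with `+` in its `-l` option. The sketch below maps the two-letter codes from the table to that form. The names `TESSERACT_CODES` and `tesseract_lang_arg` are illustrative, and how InvOCR performs this mapping internally is not shown in this README.

```python
# Two-letter codes from the table above mapped to Tesseract's language codes.
TESSERACT_CODES = {
    "en": "eng",
    "pl": "pol",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
}


def tesseract_lang_arg(languages):
    """Build the value for Tesseract's -l option, e.g. 'eng+pol'."""
    unknown = [code for code in languages if code not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    return "+".join(TESSERACT_CODES[code] for code in languages)
```

With this helper, `--languages en,pl,de` would correspond to `-l eng+pol+deu` on the Tesseract command line.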
# Run all tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=invocr
# Run specific test file
poetry run pytest tests/test_ocr.py
# Run API tests
poetry run pytest tests/test_api.py
# docker-compose.prod.yml
version: '3.8'
services:
  invocr:
    image: invocr:latest
    ports:
      - "80:8000"
    environment:
      - ENVIRONMENT=production
      - WORKERS=4
    volumes:
      - ./data:/app/data
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invocr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: invocr
  template:
    metadata:
      labels:
        app: invocr
    spec:
      containers:
        - name: invocr
          image: invocr:latest
          ports:
            - containerPort: 8000
1. Create a feature branch (`git checkout -b feature/amazing-feature`)
2. Run the tests (`poetry run pytest`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push the branch (`git push origin feature/amazing-feature`)

# Install development dependencies
poetry install --with dev
# Install pre-commit hooks
poetry run pre-commit install
# Run linting
poetry run black invocr/
poetry run isort invocr/
poetry run flake8 invocr/
# Run type checking
poetry run mypy invocr/
| Operation | Time | Memory |
|---|---|---|
| PDF → JSON (1 page) | ~2-3s | ~50MB |
| Image OCR → JSON | ~1-2s | ~30MB |
| JSON → XML | ~0.1s | ~10MB |
| JSON → HTML | ~0.2s | ~15MB |
| HTML → PDF | ~1-2s | ~40MB |
- Use `--parallel` for batch processing
- Set `IMAGE_ENHANCEMENT=false` for faster OCR
- Use the `tesseract` engine for better performance
- Limit `MAX_PAGES_PER_PDF` for large documents

OCR not working:
# Check Tesseract installation
tesseract --version
# Install missing languages
sudo apt install tesseract-ocr-pol
WeasyPrint errors:
# Install system dependencies
sudo apt install libpango-1.0-0 libharfbuzz0b
Import errors:
# Reinstall dependencies
poetry install --force
Permission errors:
# Fix file permissions
chmod -R 755 uploads/ output/
This project is licensed under the Apache License - see the LICENSE file for details.
Made with ❤️ for the open source community
⭐ Star this repository if you find it useful!