invocr

InvOCR Documentation

InvOCR - Intelligent Invoice Processing

🔍 Enterprise-grade document processing with advanced OCR for invoices, receipts, and financial documents

Python 3.9+ FastAPI Docker License Code style: black

InvOCR is a powerful document processing system that automates the extraction and conversion of financial documents. It supports multiple input formats (PDF, images) and output formats (JSON, XML, HTML, PDF) with multi-language OCR capabilities.

🚀 Key Features

📄 Document Processing Pipeline

🔍 Advanced OCR Capabilities

🛠️ Technical Highlights

📋 Supported Document Types

| Type | Description | Key Features |
|------|-------------|--------------|
| Invoices | Commercial invoices | Line items, totals, tax details |
| Receipts | Retail receipts | Merchant info, items, totals |
| Bills | Utility bills | Account info, payment details |
| Bank Statements | Account statements | Transactions, balances |
| Custom | Any document | Configurable templates |

invutil - contains the most generic functions, which have the fewest dependencies git@github.com:fin-officer/invutil.git

valider - the validation mechanisms have clearly defined interfaces git@github.com:fin-officer/valider.git

dextra - requires Utils and OCR to be extracted first git@github.com:fin-officer/dextra.git

dotect - depends on some of the Utils components git@github.com:fin-officer/dotect.git

📚 Documentation

🛠️ Basic Usage

Using the CLI

# Convert PDF to JSON
poetry run invocr convert invoice.pdf invoice.json


poetry run invocr convert ./2024.11/attachments/invoice-25417.pdf ./2024.11/attachments/invoice-25417.json

# Process image with specific languages
poetry run invocr img2json receipt.jpg --languages en,pl,de

# Start the API server (use --port 8001 if port 8000 is already in use)
poetry run invocr serve --port 8001

# Run batch processing
poetry run invocr batch ./2024.11/attachments/ ./2024.11/attachments/ --format json

Additional CLI Commands

1. Process PDF to JSON with Specialized Extraction

# Convert a single PDF to JSON with specialized extraction
poetry run invocr pdf2json path/to/input.pdf --output path/to/output.json

2. Batch Process Multiple PDFs

# Process all PDFs in a directory
poetry run invocr batch ./2024.09/attachments/ ./2024.09/attachments/ --format json
poetry run invocr batch ./2024.10/attachments/ ./2024.10/attachments/ --format json
poetry run invocr batch ./2024.11/attachments/ ./2024.11/attachments/ --format json

# Process with complete workflow (OCR, detection, extraction, validation)
poetry run invocr workflow ./2024.11/attachments/ --output-dir ./2024.11/attachments/

# Available options:
# --input-dir: Directory containing PDF files (default: 2024.09/attachments)
# --output-dir: Directory to save JSON files (default: 2024.09/json)
# --log-level: Set logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)

3. Debug PDF Extraction

# View extracted text from a PDF for debugging
poetry run python debug_pdf.py path/to/document.pdf

Advanced Usage

# Full PDF to HTML conversion pipeline (one step)
invocr pipeline --input invoice.pdf --output ./output/invoice.html --start-format pdf --end-format html

# Step-by-step PDF to HTML conversion
invocr pdf2img --input invoice.pdf --output ./temp/invoice.png
invocr img2json --input ./temp/invoice.png --output ./temp/invoice.json
invocr json2xml --input ./temp/invoice.json --output ./temp/invoice.xml
invocr pipeline --input ./temp/invoice.xml --output ./output/invoice.html --start-format xml --end-format html
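
The step-by-step conversion above can also be scripted. The following is a minimal sketch and not part of InvOCR: `build_steps` and `run_steps` are hypothetical helpers that only assemble the four commands shown above as argument lists and hand them to `subprocess.run`. It assumes the `invocr` executable is on `PATH` and that the temp and output directories already exist.

```python
import subprocess
from pathlib import Path


def build_steps(pdf: str, temp_dir: str, out_dir: str) -> list[list[str]]:
    """Assemble the four step-by-step commands as argv lists (hypothetical helper)."""
    stem = Path(pdf).stem
    png = str(Path(temp_dir) / f"{stem}.png")
    json_path = str(Path(temp_dir) / f"{stem}.json")
    xml = str(Path(temp_dir) / f"{stem}.xml")
    html = str(Path(out_dir) / f"{stem}.html")
    return [
        ["invocr", "pdf2img", "--input", pdf, "--output", png],
        ["invocr", "img2json", "--input", png, "--output", json_path],
        ["invocr", "json2xml", "--input", json_path, "--output", xml],
        ["invocr", "pipeline", "--input", xml, "--output", html,
         "--start-format", "xml", "--end-format", "html"],
    ]


def run_steps(steps: list[list[str]]) -> None:
    """Run each command in order; check=True stops at the first failure."""
    for argv in steps:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them.
    for argv in build_steps("invoice.pdf", "temp", "output"):
        print(" ".join(argv))
```

Keeping command construction separate from execution makes it possible to inspect or log the commands before anything is run.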

Directory Structure

For batch processing, the following directory structure is recommended:

./
├── 2024.09/
│   ├── attachments/    # Put your PDF files here
│   └── json/          # JSON output will be saved here
├── 2024.10/
│   ├── attachments/
│   └── json/
└── ...
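
To illustrate how this layout maps inputs to outputs, here is a small sketch using only the standard library. `plan_batch` is a hypothetical helper, not an InvOCR function: it pairs every PDF under `<month>/attachments/` with the JSON path it would get under `<month>/json/`.

```python
from pathlib import Path


def plan_batch(root: str) -> list[tuple[Path, Path]]:
    """Pair each <month>/attachments/*.pdf with <month>/json/<name>.json."""
    pairs = []
    for pdf in sorted(Path(root).glob("*/attachments/*.pdf")):
        month_dir = pdf.parent.parent
        pairs.append((pdf, month_dir / "json" / f"{pdf.stem}.json"))
    return pairs


if __name__ == "__main__":
    for src, dst in plan_batch("."):
        print(f"{src} -> {dst}")
```

The function only computes paths; it does not create the `json/` directories or invoke the converter.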

Using the API

import time

import requests

BASE_URL = "http://localhost:8001/api/v1"

# 1. Upload a PDF file (the context manager closes the file handle)
with open("invoice.pdf", "rb") as pdf:
    upload_response = requests.post(f"{BASE_URL}/upload", files={"file": pdf})
upload_response.raise_for_status()
file_id = upload_response.json()["file_id"]

# 2. Start the PDF to HTML conversion pipeline
convert_response = requests.post(
    f"{BASE_URL}/convert/pipeline",
    json={
        "file_id": file_id,
        "start_format": "pdf",
        "end_format": "html",
        "options": {
            "languages": ["en", "pl"],
            "output_type": "file"
        }
    }
)
convert_response.raise_for_status()
task_id = convert_response.json()["task_id"]

# 3. Poll the task until it completes or fails
while True:
    task = requests.get(f"{BASE_URL}/tasks/{task_id}").json()
    if task["status"] == "completed":
        result_file_id = task["result"]["file_id"]
        break
    if task["status"] == "failed":
        # Stop here: there is no result file to download
        raise SystemExit(f"Conversion failed: {task['error']}")
    time.sleep(1)  # Wait before checking again

# 4. Download the converted HTML file
download_response = requests.get(f"{BASE_URL}/files/{result_file_id}")
download_response.raise_for_status()
with open("output.html", "wb") as f:
    f.write(download_response.content)

print("Conversion complete! HTML file saved as output.html")
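
The polling loop in step 3 never gives up if a task stays unfinished. A bounded variant might look like the sketch below. `wait_for_task` is a hypothetical helper, not part of InvOCR; it takes any callable that returns the task payload, so it makes no assumptions about the API beyond the `status`, `result`, and `error` fields used above.

```python
import time
from typing import Callable


def wait_for_task(
    fetch_status: Callable[[], dict],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> dict:
    """Poll fetch_status() until the task completes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status()
        if payload["status"] == "completed":
            return payload
        if payload["status"] == "failed":
            raise RuntimeError(f"Conversion failed: {payload.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task did not finish within {timeout} s")
        time.sleep(interval)
```

With the example above it could be called as `wait_for_task(lambda: requests.get(f"http://localhost:8001/api/v1/tasks/{task_id}").json())`, and the returned payload then carries `["result"]["file_id"]`.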

Using cURL

# 1. Upload a PDF file
curl -X POST "http://localhost:8001/api/v1/upload" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@invoice.pdf"

# 2. Start the conversion pipeline (replace YOUR_FILE_ID)
curl -X POST "http://localhost:8001/api/v1/convert/pipeline" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
        "file_id": "YOUR_FILE_ID",
        "start_format": "pdf",
        "end_format": "html",
        "options": {
          "languages": ["en", "pl"],
          "output_type": "file"
        }
      }'

# 3. Check task status (replace YOUR_TASK_ID)
curl -X GET "http://localhost:8001/api/v1/tasks/YOUR_TASK_ID" \
  -H "accept: application/json"

# 4. Download the result (replace YOUR_RESULT_FILE_ID)
curl -X GET "http://localhost:8001/api/v1/files/YOUR_RESULT_FILE_ID" \
  -H "accept: application/json" \
  -o output.html

🏗️ Project Structure

invocr/
├── 📁 invocr/                 # Main package
│   ├── 📁 core/               # Core processing modules
│   │   ├── ocr.py            # OCR engine (Tesseract + EasyOCR)
│   │   ├── converter.py      # Universal format converter
│   │   ├── extractor.py      # Data extraction logic
│   │   └── validator.py      # Data validation
│   │
│   ├── 📁 formats/            # Format-specific handlers
│   │   ├── pdf.py           # PDF operations
│   │   ├── image.py         # Image processing
│   │   ├── json_handler.py  # JSON operations
│   │   ├── xml_handler.py   # EU XML format
│   │   └── html_handler.py  # HTML generation
│   │
│   ├── 📁 api/               # REST API
│   │   ├── main.py          # FastAPI application
│   │   ├── routes.py        # API endpoints
│   │   └── models.py        # Pydantic models
│   │
│   ├── 📁 cli/               # Command line interface
│   │   └── commands.py      # CLI commands
│   │
│   └── 📁 utils/             # Utilities
│       ├── config.py        # Configuration
│       ├── logger.py        # Logging setup
│       └── helpers.py       # Helper functions
│
├── 📁 tests/                 # Test suite
├── 📁 scripts/               # Installation scripts
├── 📁 docs/                  # Documentation
├── 🐳 Dockerfile             # Docker configuration
├── 🐳 docker-compose.yml     # Docker Compose
├── 📋 pyproject.toml         # Poetry configuration
└── 📖 README.md              # This file

🏆 COMPLETE InvOCR SYSTEM - FINAL SUMMARY

🔄 Format conversions (100% complete):

🌍 Multilingual support:

📋 Document types:

🔧 Interfaces (3 complete):


🚀 DEPLOYMENT OPTIONS:

1. Local Development:

git clone https://github.com/fin-officer/invocr.git && cd invocr
./scripts/install.sh
poetry run invocr serve

2. Docker (Single Container):

docker-compose up

3. Production (Docker Swarm):

docker-compose -f docker-compose.prod.yml up

4. Kubernetes (Enterprise):

kubectl apply -f kubernetes/

5. Cloud (Auto-scaling):


🏗️ FINAL ARCHITECTURE:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Web Client    │    │   Mobile App    │    │   CLI Client    │
└─────────┬───────┘    └─────────┬───────┘    └─────────┬───────┘
          │                      │                      │
          └──────────────────────┼──────────────────────┘
                                 │
                    ┌────────────▼────────────────┐
                    │       Nginx Proxy           │
                    │   (Load Balancer + SSL)     │
                    └────────────┬────────────────┘
                                 │
                    ┌────────────▼────────────────┐
                    │     InvOCR API Server       │
                    │    (FastAPI + Uvicorn)      │
                    └────────────┬────────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        │                        │                        │
┌───────▼───────┐    ┌───────────▼──────────┐    ┌────────▼────────┐
│  OCR Engine   │    │   Format Converters  │    │   Validators    │
│ (Tesseract +  │    │ (PDF/IMG/JSON/XML/   │    │  (Data Quality  │
│   EasyOCR)    │    │      HTML)           │    │   + Metrics)    │
└───────────────┘    └──────────────────────┘    └─────────────────┘
        │                        │                        │
        └────────────────────────┼────────────────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        │                        │                        │
┌───────▼───────┐    ┌───────────▼──────────┐    ┌────────▼────────┐
│   PostgreSQL  │    │      Redis Cache     │    │   File Storage  │
│  (Metadata +  │    │   (Jobs + Sessions)  │    │ (Temp + Output) │
│   Analytics)  │    │                      │    │                 │
└───────────────┘    └──────────────────────┘    └─────────────────┘

📈 ADVANCED FEATURES:

🔍 Monitoring & Observability:

🔒 Security:

⚡ Performance:

🧪 Quality Assurance:


🎯 READY FOR PRODUCTION USE:

✅ Enterprise Features:

✅ Developer Experience:

✅ Operations:


InvOCR is now a fully functional, enterprise-grade invoice processing system with:

🎯 33 artifacts - all system components
🎯 50+ files - complete project structure
🎯 All conversions - PDF↔IMG↔JSON↔XML↔HTML↔PDF
🎯 Multilingual OCR - 6 languages with auto-detection
🎯 3 interfaces - CLI, REST API, Docker
🎯 EU XML compliance - UBL 2.1 standard
🎯 Production deployment - K8s, Docker, CI/CD
🎯 Enterprise security - Monitoring, alerts, compliance
🎯 Developer tools - VS Code, testing, debugging
🎯 Documentation - Complete README, API docs, examples

🚀 Quick Start

Prerequisites

Installation

# Clone repository
git clone https://github.com/fin-officer/invocr.git
cd invocr

# Build and start services
docker-compose up -d --build

# Access the API at http://localhost:8000
# View API docs at http://localhost:8000/docs

Option 2: Local Installation

  1. Install system dependencies (Ubuntu/Debian):
    sudo apt update
    sudo apt install -y tesseract-ocr tesseract-ocr-pol tesseract-ocr-deu \
     tesseract-ocr-fra tesseract-ocr-spa tesseract-ocr-ita \
     poppler-utils libpango-1.0-0 libharfbuzz0b python3-dev build-essential
    
  2. Install Python dependencies:
    # Install Poetry if not installed
    curl -sSL https://install.python-poetry.org | python3 -

🚀 Development

Running Tests

# Run all tests
poetry run pytest

# Run tests with coverage
poetry run pytest --cov=invocr --cov-report=html

Code Quality

# Run linters
poetry run flake8 invocr/
poetry run mypy invocr/

# Format code
poetry run black invocr/ tests/
poetry run isort invocr/ tests/

Building the Package

# Build package
poetry build

# Publish to PyPI (requires credentials)
poetry publish

📚 Documentation

For detailed documentation, see:

🀝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

📞 Support

For support, please open an issue in the issue tracker.


Made with ❤️ by Tom Sapletta

# Install project dependencies
poetry install

# Set up environment
cp .env.example .env


Option 3: Docker

# Using Docker Compose (easiest)
docker-compose up

# Or build manually
docker build -t invocr .
docker run -p 8000:8000 invocr

📚 Usage Examples

CLI Commands

# Convert PDF to JSON
poetry run python pdf2json.py --input invoice.pdf --output invoice.json

# Process image with specific languages
poetry run python process_pdfs.py --input receipt.jpg --output receipt.json --languages en,pl,de

# Convert invoice PDF to JSON with output directory
poetry run python pdf_to_json.py --input ./invoices/invoice.pdf --output-dir ./output/

# PDF to images
poetry run python pdf_to_json.py --extract-images --input document.pdf --output-dir ./images/

# Image to JSON (OCR)
poetry run python process_pdfs.py --input scan.png --output data.json --doc-type invoice

# Debug invoice extraction
poetry run invocr debug invoice.pdf

# View OCR text from a document
poetry run invocr ocr-text invoice.pdf

# Batch processing
poetry run invocr batch ./input_files/ ./output/ --format json

# Test invoice extraction
poetry run invocr validate --input-file path/to/invoice.json

# Debug receipt extraction
poetry run invocr debug --doc-type receipt path/to/receipt.pdf

🧩 Modular Extraction System

InvOCR features a modular extraction system that provides better accuracy, maintainability, and extensibility:

Key Components

Utility Modules

Multi-Level Detection

The system implements a decision tree approach for document classification:

  1. Document type detection (invoice, receipt, Adobe JSON)
  2. Language detection (en, pl, de, etc.)
  3. Format-specific extractor selection
  4. OCR verification for higher confidence
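
The first three steps of this decision tree can be sketched as follows. This is an illustration only, not InvOCR's detector: the keyword tables, the `Detection` class, the `detect` function, and the extractor naming scheme are all assumptions made for the example, and step 4 (OCR verification) is left out.

```python
from dataclasses import dataclass

# Illustrative keyword tables; a real detector would use richer signals.
DOC_KEYWORDS = {
    "invoice": ("invoice", "faktura", "rechnung"),
    "receipt": ("receipt", "paragon", "quittung"),
}
LANG_KEYWORDS = {
    "en": ("invoice", "receipt", "total"),
    "pl": ("faktura", "paragon", "razem"),
    "de": ("rechnung", "quittung", "gesamt"),
}


@dataclass
class Detection:
    doc_type: str
    language: str
    extractor: str


def _best_match(text: str, table: dict, default: str) -> str:
    """Return the key whose keywords occur most often, or default if none occur."""
    lowered = text.lower()
    scores = {key: sum(lowered.count(w) for w in words) for key, words in table.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def detect(text: str) -> Detection:
    doc_type = _best_match(text, DOC_KEYWORDS, "unknown")  # step 1: document type
    language = _best_match(text, LANG_KEYWORDS, "en")      # step 2: language
    extractor = f"{doc_type}_{language}_extractor"         # step 3: made-up naming scheme
    return Detection(doc_type, language, extractor)
```

Each level narrows the choice made by the next one, which is the point of the decision-tree approach: the extractor is selected only after the document type and language are known.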

Using the Extraction System

# Example: Extract data from a PDF invoice
from invocr.formats.pdf.extractors.pdf_invoice_extractor import PDFInvoiceExtractor

# Text previously extracted from the PDF (for example via OCR);
# invoice.txt is a placeholder file name
with open("invoice.txt", encoding="utf-8") as f:
    text = f.read()

# Create an extractor
extractor = PDFInvoiceExtractor()

# Extract data from text
invoice_data = extractor.extract(text)

# Access extracted data
print(f"Invoice Number: {invoice_data['invoice_number']}")
print(f"Issue Date: {invoice_data['issue_date']}")
print(f"Total Amount: {invoice_data['total_amount']} {invoice_data['currency']}")

REST API

# Start server
invocr serve

# Convert file
curl -X POST "http://localhost:8000/convert" \
  -F "file=@invoice.pdf" \
  -F "target_format=json" \
  -F "languages=en,pl"

# Check job status
curl "http://localhost:8000/status/{job_id}"

# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json

Python API

from invocr import create_converter

# Create converter instance
converter = create_converter(languages=['en', 'pl', 'de'])

# Convert PDF to JSON
result = converter.pdf_to_json('invoice.pdf')
print(result)

# Convert image to JSON with OCR
data = converter.image_to_json('scan.png', document_type='invoice')

# Convert JSON to EU XML
xml_content = converter.json_to_xml(data, format='eu_invoice')

# Full conversion pipeline
result = converter.convert('input.pdf', 'output.json', 'auto', 'json')

🌐 API Documentation

When running the API server, visit:

Key Endpoints

🔧 Configuration

Environment Variables

Key configuration options in .env:

# OCR Settings
DEFAULT_OCR_ENGINE=auto          # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Supported languages
OCR_CONFIDENCE_THRESHOLD=0.3     # Minimum confidence

# Processing
MAX_FILE_SIZE=52428800          # 50MB limit
PARALLEL_WORKERS=4              # Concurrent processing
MAX_PAGES_PER_PDF=10           # Page limit

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
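
As a rough illustration of how such settings could be consumed, the sketch below reads the OCR and processing variables with the standard library. `Settings` and `load_settings` are hypothetical names, not InvOCR's own configuration code, and the sketch assumes the variables are already present in the process environment (for example exported by the shell or loaded by a tool such as python-dotenv).

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass
class Settings:
    ocr_engine: str = "auto"
    languages: list = field(default_factory=lambda: ["en", "pl", "de", "fr", "es"])
    confidence_threshold: float = 0.3
    max_file_size: int = 50 * 1024 * 1024  # 52428800 bytes (50 MB)
    parallel_workers: int = 4
    max_pages_per_pdf: int = 10


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables from env, using the listed values as defaults."""
    env = os.environ if env is None else env
    d = Settings()
    langs = env.get("DEFAULT_LANGUAGES", ",".join(d.languages))
    return Settings(
        ocr_engine=env.get("DEFAULT_OCR_ENGINE", d.ocr_engine),
        languages=[code.strip() for code in langs.split(",") if code.strip()],
        confidence_threshold=float(env.get("OCR_CONFIDENCE_THRESHOLD", d.confidence_threshold)),
        max_file_size=int(env.get("MAX_FILE_SIZE", d.max_file_size)),
        parallel_workers=int(env.get("PARALLEL_WORKERS", d.parallel_workers)),
        max_pages_per_pdf=int(env.get("MAX_PAGES_PER_PDF", d.max_pages_per_pdf)),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"PARALLEL_WORKERS": "8"})`.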

Supported Languages

| Code | Language | Tesseract | EasyOCR |
|------|----------|-----------|---------|
| en | English | ✅ | ✅ |
| pl | Polish | ✅ | ✅ |
| de | German | ✅ | ✅ |
| fr | French | ✅ | ✅ |
| es | Spanish | ✅ | ✅ |
| it | Italian | ✅ | ✅ |
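
Tesseract itself identifies language packs by three-letter names (the same ones that appear in the `tesseract-ocr-pol`, `tesseract-ocr-deu`, and similar package names) and accepts several of them joined with `+` in its `-l` option. The sketch below shows one way to translate the two-letter codes from the table; `TESSERACT_CODES` and `tesseract_lang_arg` are hypothetical names, and how InvOCR performs this mapping internally is not documented here.

```python
# Two-letter codes from the table mapped to Tesseract language pack names.
TESSERACT_CODES = {
    "en": "eng",
    "pl": "pol",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
}


def tesseract_lang_arg(codes: list[str]) -> str:
    """Return a value for Tesseract's -l option, e.g. 'eng+pol'."""
    unknown = [c for c in codes if c not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"Unsupported language code(s): {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[c] for c in codes)
```

For `--languages en,pl,de` this yields `eng+pol+deu`.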

📊 Supported Formats

Input Formats

Output Formats

🧪 Testing

# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=invocr

# Run specific test file
poetry run pytest tests/test_ocr.py

# Run API tests
poetry run pytest tests/test_api.py

🚀 Deployment

Production with Docker

# docker-compose.prod.yml
version: '3.8'
services:
  invocr:
    image: invocr:latest
    ports:
      - "80:8000"
    environment:
      - ENVIRONMENT=production
      - WORKERS=4
    volumes:
      - ./data:/app/data

Kubernetes

# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invocr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: invocr
  template:
    metadata:
      labels:
        app: invocr
    spec:
      containers:
      - name: invocr
        image: invocr:latest
        ports:
        - containerPort: 8000

🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Make changes
  4. Add tests
  5. Run tests (poetry run pytest)
  6. Commit changes (git commit -m 'Add amazing feature')
  7. Push to branch (git push origin feature/amazing-feature)
  8. Open Pull Request

Development Setup

# Install development dependencies
poetry install --with dev

# Install pre-commit hooks
poetry run pre-commit install

# Run linting
poetry run black invocr/
poetry run isort invocr/
poetry run flake8 invocr/

# Run type checking
poetry run mypy invocr/

📈 Performance

Benchmarks

| Operation | Time | Memory |
|-----------|------|--------|
| PDF → JSON (1 page) | ~2-3s | ~50MB |
| Image OCR → JSON | ~1-2s | ~30MB |
| JSON → XML | ~0.1s | ~10MB |
| JSON → HTML | ~0.2s | ~15MB |
| HTML → PDF | ~1-2s | ~40MB |

Optimization Tips

🔒 Security

📋 Requirements

System Requirements

Dependencies

🐛 Troubleshooting

Common Issues

OCR not working:

# Check Tesseract installation
tesseract --version

# Install missing languages
sudo apt install tesseract-ocr-pol
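
The same check can be made from Python, which is handy in a startup or health-check script. This is a minimal sketch using only the standard library; `installed_tesseract_languages` is a hypothetical helper, not an InvOCR function. It relies on Tesseract's `--list-langs` option.

```python
import shutil
import subprocess


def installed_tesseract_languages(binary: str = "tesseract") -> list[str]:
    """Return the language packs reported by `tesseract --list-langs`.

    Returns an empty list if the binary is not found on PATH.
    """
    if shutil.which(binary) is None:
        return []
    result = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=False
    )
    # The first line is a header; depending on the Tesseract version the
    # list is written to stdout or stderr.
    lines = (result.stdout or result.stderr).splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]


if __name__ == "__main__":
    langs = installed_tesseract_languages()
    print("Installed language packs:", ", ".join(langs) or "none (is Tesseract installed?)")
```

If `pol` is missing from the result, installing `tesseract-ocr-pol` as shown above should add it.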

WeasyPrint errors:

# Install system dependencies
sudo apt install libpango-1.0-0 libharfbuzz0b

Import errors:

# Reinstall dependencies (recreate the virtual environment, then install)
poetry env remove --all
poetry install

Permission errors:

# Fix file permissions
chmod -R 755 uploads/ output/

📞 Support

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

🙏 Acknowledgments


Made with ❤️ for the open source community

⭐ Star this repository if you find it useful!