InvOCR Examples

This document provides practical examples of how to use InvOCR for various document processing tasks.

Quick Start
Basic Usage
PDF Processing
OCR Processing
Batch Processing
API Usage
Advanced Examples
Validation
Troubleshooting

Quick Start

Installation

# Clone the repository
git clone https://github.com/fin-officer/invocr.git
cd invocr

# Install dependencies
poetry install

# Install Tesseract OCR (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-pol tesseract-ocr-eng

Basic Usage

from invocr.core import PDFProcessor

# Initialize the processor
processor = PDFProcessor()

# Process a single PDF
success, error = processor.process_pdf("invoice.pdf", "output")
if success:
    print("PDF processed successfully!")
else:
    print(f"Error: {error}")

PDF Processing

Extract Text from PDF

from invocr.core.pdf_processor import PDFProcessor

processor = PDFProcessor()
text, error = processor.extract_text("document.pdf")
if text:
    print(text[:500])  # Print first 500 characters

Validate PDF

from invocr.utils.validation import is_valid_pdf, is_valid_pdf_simple

# Simple check (fast)
if is_valid_pdf_simple("document.pdf"):
    print("File appears to be a valid PDF")

# Detailed validation
is_valid, error = is_valid_pdf("document.pdf", min_size=1024)  # 1KB minimum
if not is_valid:
    print(f"Invalid PDF: {error}")

OCR Processing

Basic OCR

from invocr.core.ocr import create_ocr_engine

# Initialize OCR engine with supported languages
ocr = create_ocr_engine(["en", "pl", "de"])

# Extract text from image
result = ocr.extract_text("receipt.jpg")
print(result["text"])

Batch Processing

Process Directory of PDFs

from invocr.core import PDFProcessor

processor = PDFProcessor()
results = processor.process_directory("invoices/", "output/")

print(f"Processed: {results['succeeded']} files")
print(f"Failed: {results['failed']} files")

API Usage

Start the API Server

uvicorn invocr.api.main:app --reload --host 0.0.0.0 --port 8000

Upload and Process File

curl -X 'POST' \
  'http://localhost:8000/api/v1/process' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@invoice.pdf;type=application/pdf'

Advanced Examples

Custom PDF Processing Pipeline

from pathlib import Path
from invocr.core import PDFProcessor
from invocr.core.ocr import create_ocr_engine
from invocr.core.extractor import create_extractor

def custom_pipeline(pdf_path, output_dir):
    # Initialize components
    pdf_processor = PDFProcessor()
    ocr_engine = create_ocr_engine(["en", "pl"])
    extractor = create_extractor()
    
    # Process PDF
    text, error = pdf_processor.extract_text(pdf_path)
    if error:
        return False, f"Text extraction failed: {error}"
    
    # Extract structured data
    data = extractor.extract_invoice_data(text)
    
    # Save results
    output_path = Path(output_dir) / f"{Path(pdf_path).stem}.json"
    with open(output_path, 'w') as f:
        json.dump(data, f, indent=2)
    
    return True, str(output_path)

Validation

See detailed validation examples in validation_examples.md

Troubleshooting

Common Issues

Missing Dependencies

# Install system dependencies
sudo apt-get install tesseract-ocr tesseract-ocr-pol

PDF Text Extraction Fails
- Ensure the PDF is not scanned (try selecting text in your PDF viewer)
- For scanned PDFs, use the OCR functionality
Performance Issues
- Process large documents in batches
- Use is_valid_pdf_simple() for quick validation before full processing

For more help, please open an issue.

This site is open source. Improve this page.

invocr