OCR PDF with Python: Extract Text from Scanned Documents via API
Developer · June 4, 2026 · 11 min read

Extract text from scanned PDFs using Python OCR API and pytesseract. Compare accuracy, handle multi-language documents, and build production OCR pipelines with error handling.

AllPDFMagic Team

Optical Character Recognition (OCR) converts scanned PDF images into machine-readable text. Building OCR into a Python application requires either a local OCR library (Tesseract) or a cloud OCR API. This guide covers both approaches with working code, then compares accuracy and tradeoffs.

When You Need OCR

  • Scanned contracts and agreements — paper documents digitised by scanner
  • Bank statements received by mail — scanned and emailed as PDFs
  • Historical document digitisation — archiving physical records
  • Invoice processing — suppliers who send photographed or faxed invoices
  • Form processing — handwritten or printed forms scanned after completion

Option 1: AllPDFMagic OCR API (Cloud)

The AllPDFMagic OCR API handles multi-page PDFs, multiple languages, tables, and mixed layouts without any local setup.

Basic OCR request:

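import requests

API_KEY = "your_api_key"  # Replace with your key; prefer an environment variable in production

def ocr_pdf(pdf_path: str, language: str = "eng") -> dict:
    """Upload a PDF to the OCR endpoint and return the parsed JSON response."""
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://api.allpdfmagic.com/v1/ai/ocr",
            headers={"X-API-Key": API_KEY},
            files={"file": f},
            data={"language": language},
            timeout=120,  # Illustrative value; without a timeout, requests can wait indefinitely
        )
    response.raise_for_status()  # Raise on 4xx/5xx instead of parsing an error body
    return response.json()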

result = ocr_pdf("scanned_contract.pdf")
print(result["text"])          # Full document text
print(result["page_count"])    # Number of pages processed
print(result["confidence"])    # Average confidence score (0–1)

# Access page-by-page results
for page in result["pages"]:
    print(f"Page {page['page_number']}: {page['text'][:200]}")

Response structure:

{
  "text": "Full document text...",
  "page_count": 5,
  "confidence": 0.94,
  "language_detected": "en",
  "pages": [
    {
      "page_number": 1,
      "text": "Page 1 text...",
      "confidence": 0.96,
      "words": [{"text": "Invoice", "confidence": 0.99, "bbox": [10, 20, 80, 35]}]
    }
  ]
}
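The per-page and per-word confidence fields in the response above can drive a simple manual-review step. The following is a minimal sketch, not part of the API: the helper name `find_low_confidence` and the 0.85 threshold are assumptions chosen for illustration, and it relies only on the fields shown in the sample response.

```python
def find_low_confidence(result: dict, threshold: float = 0.85) -> dict:
    """Return page numbers and words whose confidence is below the threshold.

    Assumes the response shape shown above. The threshold is an arbitrary
    starting point, not a documented recommendation.
    """
    flagged = {"pages": [], "words": []}
    for page in result.get("pages", []):
        if page.get("confidence", 1.0) < threshold:
            flagged["pages"].append(page["page_number"])
        for word in page.get("words", []):
            if word.get("confidence", 1.0) < threshold:
                flagged["words"].append((page["page_number"], word["text"]))
    return flagged


# Hand-written sample in the same shape as the response structure above
sample = {
    "pages": [
        {"page_number": 1, "confidence": 0.96,
         "words": [{"text": "Invoice", "confidence": 0.99, "bbox": [10, 20, 80, 35]}]},
        {"page_number": 2, "confidence": 0.71,
         "words": [{"text": "T0tal", "confidence": 0.52, "bbox": [12, 40, 60, 55]}]},
    ]
}
print(find_low_confidence(sample))
```

Pages or words returned here are candidates for a human check or for re-running OCR after preprocessing.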

Option 2: pytesseract (Local, Free)

For fully offline OCR without API costs:

# Install: pip install pytesseract pdf2image
# Also install: tesseract-ocr (system package), poppler (for pdf2image)
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf_local(pdf_path: str, language: str = "eng") -> str:
    # Convert PDF pages to images (one PIL image per page)
    images = convert_from_path(pdf_path, dpi=300)

    parts = []
    for i, image in enumerate(images, start=1):
        text = pytesseract.image_to_string(image, lang=language)
        parts.append(f"--- Page {i} ---\n{text}\n")

    return "".join(parts)

text = ocr_pdf_local("scanned.pdf")

Limitations of pytesseract:

  • Requires system-level Tesseract installation (complex on Windows)
  • Lower accuracy on complex layouts, tables, and non-standard fonts
  • No built-in table extraction
  • Often slower per page than a cloud API, depending on local hardware and page resolution

Accuracy Comparison

Document Type          AllPDFMagic API    pytesseract (300 DPI)
Clean typed text       97–99%             95–98%
Printed forms          95–97%             90–95%
Handwriting (clear)    85–90%             70–80%
Tables                 92–95%             75–85%
Multi-column layouts   90–94%             70–80%
Low-quality scans      80–88%             60–75%

Improving OCR Accuracy

Before running OCR, pre-process the page images to improve results. For poor-quality scans, convert to grayscale, raise the contrast, sharpen, and threshold to black and white:

from PIL import Image, ImageFilter, ImageEnhance

def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    # Convert to grayscale
    image = image.convert("L")
    # Increase contrast
    enhancer = ImageEnhance.Contrast(image)
    image = enhancer.enhance(2.0)
    # Sharpen
    image = image.filter(ImageFilter.SHARPEN)
    # Threshold to black and white
    image = image.point(lambda x: 0 if x < 128 else 255, "1")
    return image

Multi-Language OCR

# AllPDFMagic API supports 50+ languages
result = ocr_pdf("hindi_document.pdf", language="hin")

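# pytesseract with multiple languages (each language pack must be installed;
# image is a PIL image, e.g. one page returned by convert_from_path)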
text = pytesseract.image_to_string(image, lang="eng+hin+mar")

Supported languages in AllPDFMagic API include: English, Hindi, Tamil, Telugu, Marathi, Gujarati, Bengali, Arabic, French, German, Spanish, Portuguese, Chinese (Simplified/Traditional), Japanese, Korean.
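For the pytesseract route, the `lang` argument is a `+`-joined string of language codes, and OCR fails if any of them is not installed. The sketch below shows one way to build that string defensively. The helper name `build_lang_arg` is hypothetical, and the installed list is hard-coded here for illustration.

```python
def build_lang_arg(requested: list[str], installed: list[str]) -> str:
    """Join Tesseract language codes with '+', rejecting any that are not installed."""
    missing = [code for code in requested if code not in installed]
    if missing:
        raise ValueError(f"Missing Tesseract language data: {', '.join(missing)}")
    return "+".join(requested)


# Hard-coded for illustration. Recent pytesseract versions can report the
# real list via pytesseract.get_languages().
installed = ["eng", "hin", "mar", "osd"]
print(build_lang_arg(["eng", "hin", "mar"], installed))
```

Failing early with a clear message is usually easier to debug than a Tesseract error raised partway through a batch.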

Building a Production OCR Pipeline

import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)

def process_scanned_pdfs(input_dir: str, output_dir: str):
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)  # Also create missing parent directories

    pdf_files = list(input_path.glob("*.pdf"))
    logging.info(f"Processing {len(pdf_files)} PDFs")

    results = []
    for pdf_file in pdf_files:
        try:
            result = ocr_pdf(str(pdf_file))

            # Save extracted text
            text_file = output_path / (pdf_file.stem + ".txt")
            text_file.write_text(result["text"], encoding="utf-8")  # Explicit encoding for non-Latin scripts

            results.append({
                "file": pdf_file.name,
                "pages": result["page_count"],
                "confidence": result["confidence"],
                "status": "success"
            })
            logging.info(f"Processed {pdf_file.name}: {result['confidence']:.0%} confidence")

        except Exception as e:
            results.append({"file": pdf_file.name, "status": "error", "error": str(e)})
            logging.error(f"Failed {pdf_file.name}: {e}")

    return results

results = process_scanned_pdfs("scanned_invoices/", "extracted_text/")
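The pipeline above records a failure and moves on, so a transient network error costs a whole file. One common addition is a retry with exponential backoff around the API call. This is a minimal sketch under stated assumptions: `with_retries` is a hypothetical helper, the attempt count and delays are illustrative, and it retries on every exception, which you may want to narrow to timeouts and 5xx responses.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call(), retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # Delay doubles after each failure


# Demonstration with a stand-in function that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []  # Record the requested delays instead of actually sleeping
print(with_retries(flaky, sleep=delays.append))
print(delays)
```

Inside the pipeline loop, the call would become `result = with_retries(lambda: ocr_pdf(str(pdf_file)))`, leaving the existing `except` block to handle files that still fail after the final attempt.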

Frequently Asked Questions

How do I know if my PDF needs OCR? Try extracting text with pdfplumber or PyMuPDF. If you get empty strings or garbage characters, the PDF is image-based and needs OCR. See the detection function in our Extract Text from PDF with Python guide.
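The check described above can be sketched as follows. This is not the detection function from the linked guide. The names `needs_ocr` and `extract_page_texts`, the 20-character cutoff, and the "more than half the pages" rule are all assumptions for illustration, and the heuristic catches only empty text layers, not garbage characters.

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: treat the PDF as image-based if most pages yield almost no text."""
    if not page_texts:
        return True
    sparse = sum(1 for t in page_texts if len((t or "").strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5


def extract_page_texts(pdf_path: str) -> list[str]:
    """Read the embedded text layer of each page with pdfplumber (pip install pdfplumber)."""
    import pdfplumber  # Imported lazily so needs_ocr() can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


# needs_ocr() works on any list of per-page strings:
print(needs_ocr(["", "  ", ""]))
print(needs_ocr(["Invoice number 1042, issued to Acme Ltd.", ""]))
```

With a real file, the two pieces combine as `needs_ocr(extract_page_texts("document.pdf"))`.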

What DPI should I use when converting a PDF to images for Tesseract? 300 DPI is the standard recommendation for OCR. Higher DPI (400–600) can improve accuracy on small text but significantly increases processing time and memory usage. 150 DPI is the minimum for acceptable results.
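The cost of a higher DPI is easy to estimate with a little arithmetic: pixel count grows with the square of the DPI, so doubling the DPI quadruples the pixels per page. The sketch below works this out for an A4 page, assuming the page is rendered as an uncompressed 8-bit RGB image. The function names are illustrative.

```python
def page_pixels(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (25.4 mm per inch)."""
    return round(width_mm / 25.4 * dpi), round(height_mm / 25.4 * dpi)


def rgb_bytes(width_px: int, height_px: int) -> int:
    """Uncompressed size of an 8-bit RGB image: 3 bytes per pixel."""
    return width_px * height_px * 3


for dpi in (150, 300, 600):
    w, h = page_pixels(210, 297, dpi)  # A4 is 210 mm x 297 mm
    print(f"{dpi} DPI: {w} x {h} px, ~{rgb_bytes(w, h) / 2**20:.0f} MiB uncompressed RGB")
```

At 300 DPI an A4 page comes out at 2480 x 3508 pixels, roughly 25 MiB in memory, which matters when converting long documents in one call.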

Can I OCR a PDF and make it searchable (without changing its appearance)? Yes. This is called creating a "searchable PDF". AllPDFMagic OCR has a searchable PDF option in which the original scan appearance is preserved and an invisible text layer is added on top. This is the standard approach for archiving scanned documents (PDF/A compliant). Note that PDF/A is an archival standard in its own right, not another name for a searchable PDF.

Should I use the OCR API or pytesseract? Use the API when you need higher accuracy (especially for tables and multi-column layouts), want no local setup, are processing in a cloud environment, or need multi-language support. Use pytesseract locally when processing must be offline, data privacy prevents cloud uploads, or you are doing high-volume processing with custom preprocessing.

Does OCR work for Indian languages? Yes. The AllPDFMagic OCR API supports Hindi, Tamil, Telugu, Marathi, Gujarati, Bengali, and other Indian languages. pytesseract also supports these when the corresponding Tesseract language pack is installed.

Tags: ocr pdf python, python ocr api, scanned pdf text extraction, pytesseract pdf, pdf ocr api, extract text scanned pdf python, ocr api integration
