Developer · June 4, 2026 · 11 min read

OCR PDF with Python: Extract Text from Scanned Documents via API

Extract text from scanned PDFs in Python using an OCR API or pytesseract. Compare accuracy, handle multi-language documents, and build a production OCR pipeline with error handling.

AllPDFMagic Team

Optical Character Recognition (OCR) converts the page images inside a scanned PDF into machine-readable text. Building OCR into a Python application requires either a local OCR engine (Tesseract, called through the pytesseract wrapper) or a cloud OCR API. This guide covers both approaches with working code, then compares accuracy and tradeoffs.

When You Need OCR

Scanned contracts and agreements — paper documents digitised by scanner
Bank statements received by mail — scanned and emailed as PDFs
Historical document digitisation — archiving physical records
Invoice processing — suppliers who send photographed or faxed invoices
Form processing — handwritten or printed forms scanned after completion

Option 1: AllPDFMagic OCR API (Recommended)

The AllPDFMagic OCR API handles multi-page PDFs, multiple languages, tables, and mixed layouts without any local setup.

Basic OCR request:

import requests

API_KEY = "your_api_key"

def ocr_pdf(pdf_path: str, language: str = "eng") -> dict:
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://api.allpdfmagic.com/v1/ai/ocr",
            headers={"X-API-Key": API_KEY},
            files={"file": f},
            data={"language": language},
            timeout=120,  # seconds; without a timeout a stalled request can hang indefinitely
        )
    response.raise_for_status()
    return response.json()

result = ocr_pdf("scanned_contract.pdf")
print(result["text"])          # Full document text
print(result["page_count"])    # Number of pages processed
print(result["confidence"])    # Average confidence score (0–1)

# Access page-by-page results
for page in result["pages"]:
    print(f"Page {page['page_number']}: {page['text'][:200]}")

Response structure:

{
  "text": "Full document text...",
  "page_count": 5,
  "confidence": 0.94,
  "language_detected": "en",
  "pages": [
    {
      "page_number": 1,
      "text": "Page 1 text...",
      "confidence": 0.96,
      "words": [{"text": "Invoice", "confidence": 0.99, "bbox": [10, 20, 80, 35]}]
    }
  ]
}
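The word-level entries in this response make it possible to route doubtful words to manual review instead of trusting the whole page. The following is a minimal sketch that assumes the response shape shown above; the helper name `low_confidence_words` and the 0.80 cut-off are illustrative choices, not part of the API.

```python
from typing import Any


def low_confidence_words(result: dict[str, Any], threshold: float = 0.80) -> list[dict[str, Any]]:
    """Collect words whose confidence is below `threshold`, tagged with their page number."""
    flagged = []
    for page in result.get("pages", []):
        for word in page.get("words", []):
            if word["confidence"] < threshold:
                flagged.append({
                    "page_number": page["page_number"],
                    "text": word["text"],
                    "confidence": word["confidence"],
                    "bbox": word["bbox"],
                })
    return flagged


# Hand-made sample in the shape of the response structure (values are invented)
sample = {
    "pages": [
        {
            "page_number": 1,
            "words": [
                {"text": "Invoice", "confidence": 0.99, "bbox": [10, 20, 80, 35]},
                {"text": "T0tal", "confidence": 0.62, "bbox": [10, 60, 55, 75]},
            ],
        }
    ]
}

for word in low_confidence_words(sample):
    print(word)
```

A stricter threshold suits fields such as amounts and account numbers, where a single wrong character matters more than in running prose.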

Option 2: pytesseract (Local, Free)

For fully offline OCR without API costs:

# Install: pip install pytesseract pdf2image
# Also install: tesseract-ocr (system package), poppler (for pdf2image)
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf_local(pdf_path: str, language: str = "eng") -> str:
    # Render each PDF page to a PIL image
    images = convert_from_path(pdf_path, dpi=300)

    # OCR page by page, labelling each so page boundaries survive in the output
    pages = []
    for i, image in enumerate(images, start=1):
        text = pytesseract.image_to_string(image, lang=language)
        pages.append(f"--- Page {i} ---\n{text}\n")

    return "".join(pages)

text = ocr_pdf_local("scanned.pdf")

Limitations of pytesseract:

Requires system-level Tesseract installation (complex on Windows)
Lower accuracy on complex layouts, tables, and non-standard fonts
No built-in table extraction
Often slower per page than a cloud API, depending on local hardware
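Unlike the API response, `image_to_string` returns only text, with no confidence score. pytesseract also provides `image_to_data`, which reports a confidence per word on a 0 to 100 scale. The sketch below averages those values; the helper name `mean_word_confidence` is hypothetical, and the sample dict is hand-made so the snippet runs without Tesseract installed.

```python
def mean_word_confidence(data: dict) -> float:
    """Average Tesseract word confidence (0-100), ignoring rows that are not words.

    `data` is expected to have the shape returned by
    pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).
    """
    confidences = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # depending on the version, values may arrive as strings
        if conf >= 0 and text.strip():  # -1 marks layout rows (blocks, lines), not words
            confidences.append(conf)
    return sum(confidences) / len(confidences) if confidences else 0.0


# Hand-made sample with the same keys as the real output
sample = {"text": ["", "Invoice", "Total", " "], "conf": ["-1", "96", 88.0, "-1"]}
print(mean_word_confidence(sample))
```

Dividing the result by 100 puts it on the same 0 to 1 scale as the API's `confidence` field, although the two engines' scores are not necessarily comparable.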

Accuracy Comparison

Document Type	AllPDFMagic API	pytesseract (300 DPI)
Clean typed text	97–99%	95–98%
Printed forms	95–97%	90–95%
Handwriting (clear)	85–90%	70–80%
Tables	92–95%	75–85%
Multi-column layouts	90–94%	70–80%
Low-quality scans	80–88%	60–75%
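To check figures like these against your own documents, one common approach is to transcribe a few pages by hand and compute character accuracy as 1 minus the character error rate (edit distance divided by the length of the ground truth). This is a sketch of that general metric, not necessarily the method behind the table above; the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """1 - character error rate, clamped to the range 0..1."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    error_rate = levenshtein(ocr_text, ground_truth) / len(ground_truth)
    return max(0.0, 1.0 - error_rate)


# Two misread characters in an 18-character line
print(f"{character_accuracy('Inv0ice total: 1O0', 'Invoice total: 100'):.1%}")
```

Results depend heavily on the sample: a handful of clean pages will overstate accuracy for a mixed archive, so it helps to test each document type separately.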

Improving OCR Accuracy

Before running OCR locally, pre-process each page image to improve results:

For poor-quality scans:

from PIL import Image, ImageFilter, ImageEnhance

def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    # Convert to grayscale
    image = image.convert("L")
    # Increase contrast
    enhancer = ImageEnhance.Contrast(image)
    image = enhancer.enhance(2.0)
    # Sharpen
    image = image.filter(ImageFilter.SHARPEN)
    # Threshold to black and white
    image = image.point(lambda x: 0 if x < 128 else 255, "1")
    return image
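A fixed cut-off of 128 can erase faint text or leave grey backgrounds speckled when a scan is unusually dark or light. One alternative is Otsu's method, which picks the threshold from the image's own histogram by maximising the separation between dark and light pixels. The sketch below is a plain-Python version that works on the 256-entry list Pillow's `Image.histogram()` returns for a grayscale image; the function name is illustrative.

```python
def otsu_threshold(histogram: list[int]) -> int:
    """Return the grey level t that maximises between-class variance.

    Pixels with value < t count as dark (text) and the rest as background,
    matching the `x < 128` test used in the fixed-threshold version.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_dark = 0
    sum_dark = 0
    best_t, best_variance = 128, -1.0  # fall back to 128 if no split is possible

    for t in range(1, len(histogram)):
        # Move grey level t-1 into the dark class, so dark = levels 0..t-1
        weight_dark += histogram[t - 1]
        sum_dark += (t - 1) * histogram[t - 1]
        weight_light = total - weight_dark
        if weight_dark == 0 or weight_light == 0:
            continue
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


# Synthetic histogram: dark ink around level 50, paper around level 200
sample_hist = [0] * 256
sample_hist[50] = 100
sample_hist[200] = 100
print(otsu_threshold(sample_hist))

# With Pillow, assuming `gray` is a mode "L" image:
#   t = otsu_threshold(gray.histogram())
#   bw = gray.point(lambda x: 0 if x < t else 255, "1")
```

Otsu's method assumes a roughly two-peaked histogram, so pages with strong shadows or uneven lighting may still need a locally adaptive threshold.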

Multi-Language OCR

# AllPDFMagic API supports 50+ languages
result = ocr_pdf("hindi_document.pdf", language="hin")

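# pytesseract with multiple languages (each language pack must be installed)
# `image` is a PIL image of one page, e.g. an item from convert_from_path()
text = pytesseract.image_to_string(image, lang="eng+hin+mar")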

Supported languages in AllPDFMagic API include: English, Hindi, Tamil, Telugu, Marathi, Gujarati, Bengali, Arabic, French, German, Spanish, Portuguese, Chinese (Simplified/Traditional), Japanese, Korean.
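Tesseract identifies languages by short codes joined with `+`, as in the `eng+hin+mar` example. The lookup below is a small sketch that maps the language names listed above to their Tesseract codes; the helper name `tesseract_lang` is hypothetical, and whether the AllPDFMagic API accepts the same codes for every language is an assumption based only on the `eng` and `hin` examples in this guide.

```python
# Tesseract traineddata codes for the languages named above
TESSERACT_CODES = {
    "english": "eng", "hindi": "hin", "tamil": "tam", "telugu": "tel",
    "marathi": "mar", "gujarati": "guj", "bengali": "ben", "arabic": "ara",
    "french": "fra", "german": "deu", "spanish": "spa", "portuguese": "por",
    "chinese (simplified)": "chi_sim", "chinese (traditional)": "chi_tra",
    "japanese": "jpn", "korean": "kor",
}


def tesseract_lang(*names: str) -> str:
    """Build a `lang` argument such as 'eng+hin' from human-readable names."""
    unknown = [name for name in names if name.lower() not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"No Tesseract code known for: {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[name.lower()] for name in names)


print(tesseract_lang("English", "Hindi", "Marathi"))  # eng+hin+mar
```

Listing only the languages a document actually contains tends to work better than listing many, since every extra language widens the set of characters Tesseract has to choose between.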

Building a Production OCR Pipeline

import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)

def process_scanned_pdfs(input_dir: str, output_dir: str):
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    pdf_files = list(input_path.glob("*.pdf"))
    logging.info(f"Processing {len(pdf_files)} PDFs")

    results = []
    for pdf_file in pdf_files:
        try:
            result = ocr_pdf(str(pdf_file))

            # Save extracted text
            text_file = output_path / (pdf_file.stem + ".txt")
            text_file.write_text(result["text"], encoding="utf-8")

            results.append({
                "file": pdf_file.name,
                "pages": result["page_count"],
                "confidence": result["confidence"],
                "status": "success"
            })
            logging.info(f"Processed {pdf_file.name}: {result['confidence']:.0%} confidence")

        except Exception as e:
            results.append({"file": pdf_file.name, "status": "error", "error": str(e)})
            logging.error(f"Failed {pdf_file.name}: {e}")

    return results

results = process_scanned_pdfs("scanned_invoices/", "extracted_text/")
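The pipeline above records a failure and moves on, which means a brief network problem permanently fails that file. A common refinement is to retry with exponential backoff before giving up. The helper below is a generic sketch; `with_retries` and its parameters are illustrative names, and the `sleep` argument exists only so the waiting can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(); on failure wait base_delay * 2**n seconds and try again.

    The last failure is re-raised once `attempts` calls have been made.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


# Demonstration with a function that fails twice, then succeeds
calls = {"count": 0}

def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

print(with_retries(flaky, base_delay=0.01))

# In the pipeline, the OCR call could become:
#   result = with_retries(lambda: ocr_pdf(str(pdf_file)))
```

Retrying on every exception is deliberately crude. In practice it is usually better to retry only connection errors, timeouts and 5xx responses, since a 4xx response such as an invalid API key will fail the same way every time.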

Frequently Asked Questions

How do I know if my PDF needs OCR? Try extracting text with pdfplumber or PyMuPDF. If you get empty strings or garbage characters, the PDF is image-based and needs OCR. See the detection function in our Extract Text from PDF with Python guide.
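As a rough illustration of that check (not the detection function from the linked guide), the sketch below decides page by page whether extracted text looks usable. It takes plain strings, so it works with whichever extractor you use; the function names and both thresholds are illustrative assumptions.

```python
from typing import Optional, Sequence


def looks_like_real_text(
    text: Optional[str], min_chars: int = 20, min_clean_ratio: float = 0.9
) -> bool:
    """True if `text` is long enough and mostly made of ordinary characters."""
    stripped = (text or "").strip()
    if len(stripped) < min_chars:
        return False  # empty or near-empty page: probably an image
    clean = sum(
        1 for ch in stripped
        if (ch.isprintable() or ch.isspace()) and ch != "\ufffd"  # U+FFFD = decoding failure
    )
    return clean / len(stripped) >= min_clean_ratio


def pages_needing_ocr(page_texts: Sequence[Optional[str]]) -> list[int]:
    """Return 1-based page numbers whose extracted text is missing or garbled."""
    return [
        number for number, text in enumerate(page_texts, start=1)
        if not looks_like_real_text(text)
    ]


# Page texts as they might come from page.get_text() (PyMuPDF)
# or page.extract_text() (pdfplumber), which can return None or ""
pages = [
    "This agreement is made between the parties named below.",
    "",
    None,
    "\ufffd" * 40,
]
print(pages_needing_ocr(pages))
```

Checking per page rather than per file matters for mixed documents, where a native PDF has a few scanned pages appended.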

What DPI should I use when converting PDF to images for Tesseract? 300 DPI is the standard recommendation for OCR. Higher DPI (400–600) can improve accuracy on small text but significantly increases processing time and memory usage. 150 DPI is the minimum for acceptable results.
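The cost of higher DPI follows from simple arithmetic: pixel count grows with the square of the resolution. The sketch below estimates the in-memory size of one rendered A4 page, assuming uncompressed 8-bit RGB; the function name is illustrative.

```python
def raster_size(
    width_in: float, height_in: float, dpi: int, channels: int = 3
) -> tuple[int, int, float]:
    """Pixel width, pixel height and uncompressed size in MiB (8 bits per channel)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    size_mib = width_px * height_px * channels / (1024 ** 2)
    return width_px, height_px, size_mib


A4_INCHES = (8.27, 11.69)

for dpi in (150, 300, 600):
    width_px, height_px, size_mib = raster_size(*A4_INCHES, dpi)
    print(f"{dpi} DPI: {width_px} x {height_px} px, about {size_mib:.0f} MiB as RGB")
```

Doubling the DPI quadruples the memory per page, so for long documents it can help to convert and OCR pages in batches rather than holding every image at once.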

Can I OCR a PDF and make it searchable (without changing its appearance)? Yes. The result is called a "searchable PDF": the original scan appearance is preserved and an invisible text layer is added on top. AllPDFMagic OCR offers this as an output option. It is the standard approach for archiving scanned documents, and the result is often saved in the PDF/A archival format.

Related guides:

Best PDF APIs for Developers 2026 — full API comparison
Extract Text from PDF with Python — native PDF text extraction
PDF Compression API Guide — compress PDFs via API

Frequently Asked Questions

Should I use the API or pytesseract? Use the API when you need higher accuracy (especially for tables and multi-column layouts), you want no local setup, you're processing in a cloud environment, or you need multi-language support. Use pytesseract locally when processing must be offline, data privacy prevents cloud uploads, or you're doing high-volume processing with custom preprocessing.

Does OCR work for Indian languages? Yes. AllPDFMagic OCR API supports Hindi, Tamil, Telugu, Marathi, Gujarati, Bengali, and other Indian languages. pytesseract also supports these with the correct language pack installed.

Tags: ocr pdf python, python ocr api, scanned pdf text extraction, pytesseract pdf, pdf ocr api, extract text scanned pdf python, ocr api integration
