OCR PDF with Python: Extract Text from Scanned Documents via API
Extract text from scanned PDFs using Python OCR API and pytesseract. Compare accuracy, handle multi-language documents, and build production OCR pipelines with error handling.
Optical Character Recognition (OCR) converts scanned PDF images into machine-readable text. Building OCR into a Python application requires either a local OCR engine (Tesseract, used here through the pytesseract wrapper) or a cloud OCR API. This guide covers both approaches with working code, then compares accuracy and tradeoffs.
When You Need OCR
- Scanned contracts and agreements — paper documents digitised by scanner
- Bank statements received by mail — scanned and emailed as PDFs
- Historical document digitisation — archiving physical records
- Invoice processing — suppliers who send photographed or faxed invoices
- Form processing — handwritten or printed forms scanned after completion
Option 1: AllPDFMagic OCR API (Recommended)
The AllPDFMagic OCR API handles multi-page PDFs, multiple languages, tables, and mixed layouts without any local setup.
Basic OCR request:
import requests

API_KEY = "your_api_key"

def ocr_pdf(pdf_path: str, language: str = "eng") -> dict:
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://api.allpdfmagic.com/v1/ai/ocr",
            headers={"X-API-Key": API_KEY},
            files={"file": f},
            data={"language": language}
        )
    response.raise_for_status()
    return response.json()

result = ocr_pdf("scanned_contract.pdf")
print(result["text"])        # Full document text
print(result["page_count"])  # Number of pages processed
print(result["confidence"])  # Average confidence score (0–1)

# Access page-by-page results
for page in result["pages"]:
    print(f"Page {page['page_number']}: {page['text'][:200]}")
Response structure:
{
  "text": "Full document text...",
  "page_count": 5,
  "confidence": 0.94,
  "language_detected": "en",
  "pages": [
    {
      "page_number": 1,
      "text": "Page 1 text...",
      "confidence": 0.96,
      "words": [{"text": "Invoice", "confidence": 0.99, "bbox": [10, 20, 80, 35]}]
    }
  ]
}
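The per-word `confidence` values in this response can feed a manual-review step. The following is a minimal sketch, assuming the response has exactly the shape shown above. The helper name `low_confidence_words` and the 0.80 cutoff are illustrative choices, not part of the API, and the sample data is hand-written rather than real API output.

```python
def low_confidence_words(result, cutoff=0.80):
    """Return words whose confidence is below `cutoff`, tagged with their page number."""
    flagged = []
    for page in result.get("pages", []):
        for word in page.get("words", []):
            if word["confidence"] < cutoff:
                flagged.append({
                    "page": page["page_number"],
                    "text": word["text"],
                    "confidence": word["confidence"],
                    "bbox": word.get("bbox"),
                })
    return flagged


# Hand-written sample in the documented response shape (not real API output)
sample = {
    "pages": [
        {
            "page_number": 1,
            "words": [
                {"text": "Invoice", "confidence": 0.99, "bbox": [10, 20, 80, 35]},
                {"text": "1O42", "confidence": 0.61, "bbox": [90, 20, 130, 35]},
            ],
        }
    ]
}

for item in low_confidence_words(sample):
    print(f"Page {item['page']}: review '{item['text']}' ({item['confidence']:.0%})")
```

Keeping the bounding box with each flagged word makes it possible to highlight the region on the scan for a human reviewer.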
Option 2: pytesseract (Local, Free)
For fully offline OCR without API costs:
import pytesseract
from pdf2image import convert_from_path
from PIL import Image

def ocr_pdf_local(pdf_path: str, language: str = "eng") -> str:
    # Convert PDF pages to images
    images = convert_from_path(pdf_path, dpi=300)
    full_text = ""
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image, lang=language)
        full_text += f"--- Page {i+1} ---\n{text}\n"
    return full_text

# Install: pip install pytesseract pdf2image
# Also install: tesseract-ocr (system package), poppler (for pdf2image)
text = ocr_pdf_local("scanned.pdf")
Limitations of pytesseract:
- Requires system-level Tesseract installation (complex on Windows)
- Lower accuracy on complex layouts, tables, and non-standard fonts
- No built-in table extraction
- Slower per page than cloud APIs
Accuracy Comparison
| Document Type | AllPDFMagic API | pytesseract (300 DPI) |
|---|---|---|
| Clean typed text | 97–99% | 95–98% |
| Printed forms | 95–97% | 90–95% |
| Handwriting (clear) | 85–90% | 70–80% |
| Tables | 92–95% | 75–85% |
| Multi-column layouts | 90–94% | 70–80% |
| Low-quality scans | 80–88% | 60–75% |
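The table does not state how accuracy was measured, so it is worth checking both options against your own documents. One common metric is character-level accuracy: one minus the edit distance between a hand-verified reference transcript and the OCR output, divided by the reference length. The sketch below uses only the standard library; the function names are illustrative, and this is not necessarily the metric behind the figures above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Character accuracy in [0, 1] relative to a trusted reference transcript."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


# Two typical OCR confusions: capital I read as l, zero read as O
print(f"{char_accuracy('Invoice 1042', 'lnvoice 1O42'):.1%}")
```

Running this over a few dozen representative pages per document type gives a comparison that reflects your scans rather than a generic benchmark.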
Improving OCR Accuracy
Before running OCR, pre-process the page images (after converting the PDF to images) to improve results:
For poor-quality scans:
from PIL import Image, ImageFilter, ImageEnhance

def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    # Convert to grayscale
    image = image.convert("L")
    # Increase contrast
    enhancer = ImageEnhance.Contrast(image)
    image = enhancer.enhance(2.0)
    # Sharpen
    image = image.filter(ImageFilter.SHARPEN)
    # Threshold to black and white
    image = image.point(lambda x: 0 if x < 128 else 255, "1")
    return image
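The fixed cutoff of 128 can misclassify scans that are unusually dark or faded. One common alternative, which the function above does not use, is Otsu's method: it picks the threshold that best separates the dark and light pixel populations in the histogram. The sketch below is pure Python and assumes a grayscale histogram with one count per gray level, such as the 256-entry list that Pillow's `image.histogram()` returns for a mode "L" image. The name `otsu_threshold` is illustrative.

```python
def otsu_threshold(histogram):
    """Return the gray level t that best separates dark (<= t) from light (> t) pixels."""
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram is empty")
    sum_all = sum(level * count for level, count in enumerate(histogram))

    weight_dark = 0  # number of pixels at or below the candidate threshold
    sum_dark = 0     # sum of the gray levels of those pixels
    best_t, best_variance = 0, -1.0

    for t, count in enumerate(histogram):
        weight_dark += count
        sum_dark += t * count
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue  # no dark pixels yet
        if weight_light == 0:
            break     # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor); Otsu maximises this
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


# Synthetic faded scan: "ink" clustered near 90, "paper" near 160
hist = [0] * 256
hist[85], hist[90], hist[95] = 200, 400, 200
hist[155], hist[160], hist[165] = 1000, 3000, 1000
print(otsu_threshold(hist))
```

With Pillow, `t = otsu_threshold(image.histogram())` followed by `image.point(lambda x: 0 if x <= t else 255, "1")` would stand in for the fixed-threshold line, provided the image is already in mode "L".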
Multi-Language OCR
# AllPDFMagic API supports 50+ languages
result = ocr_pdf("hindi_document.pdf", language="hin")
# pytesseract with multiple languages
text = pytesseract.image_to_string(image, lang="eng+hin+mar")
Supported languages in AllPDFMagic API include: English, Hindi, Tamil, Telugu, Marathi, Gujarati, Bengali, Arabic, French, German, Spanish, Portuguese, Chinese (Simplified/Traditional), Japanese, Korean.
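For pytesseract, the `lang` argument is a `+`-joined list of language codes, and each code needs its language pack installed. The sketch below builds that argument and fails early if a pack is missing. The helper `build_lang_arg` is hypothetical; it takes the installed codes as a plain list so it can be exercised without Tesseract.

```python
def build_lang_arg(requested, installed):
    """Join language codes with '+' for Tesseract, rejecting codes with no installed pack."""
    missing = [code for code in requested if code not in installed]
    if missing:
        raise ValueError(
            "Tesseract language packs not installed: " + ", ".join(missing)
        )
    # dict.fromkeys drops duplicates while keeping the caller's order
    return "+".join(dict.fromkeys(requested))


# Example with a hard-coded list of installed packs
print(build_lang_arg(["eng", "hin", "mar"], ["eng", "hin", "mar", "osd"]))
```

In recent pytesseract releases the installed list can be obtained with `pytesseract.get_languages()`; on older versions, running `tesseract --list-langs` on the command line gives the same information.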
Building a Production OCR Pipeline
from pathlib import Path
import logging

logging.basicConfig(level=logging.INFO)

def process_scanned_pdfs(input_dir: str, output_dir: str):
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    # parents=True also creates missing parent directories
    output_path.mkdir(parents=True, exist_ok=True)
    pdf_files = list(input_path.glob("*.pdf"))
    logging.info(f"Processing {len(pdf_files)} PDFs")
    results = []
    for pdf_file in pdf_files:
        try:
            result = ocr_pdf(str(pdf_file))
            # Save extracted text (UTF-8 so non-Latin scripts are written correctly)
            text_file = output_path / (pdf_file.stem + ".txt")
            text_file.write_text(result["text"], encoding="utf-8")
            results.append({
                "file": pdf_file.name,
                "pages": result["page_count"],
                "confidence": result["confidence"],
                "status": "success"
            })
            logging.info(f"Processed {pdf_file.name}: {result['confidence']:.0%} confidence")
        except Exception as e:
            # Record the failure and continue with the next file
            results.append({"file": pdf_file.name, "status": "error", "error": str(e)})
            logging.error(f"Failed {pdf_file.name}: {e}")
    return results

results = process_scanned_pdfs("scanned_invoices/", "extracted_text/")
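The pipeline records a failure on the first exception, but transient problems such as timeouts or rate limits often succeed on a second attempt. The sketch below is a generic retry helper with exponential backoff. The name `with_retries` and its defaults are assumptions for illustration. Retrying on every exception is a simplification; in practice you would likely retry only network-level errors such as `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout`.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2*base_delay, ...
    The last exception is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in function that fails twice before succeeding
state = {"calls": 0}

def flaky_ocr():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("simulated transient failure")
    return {"text": "ok"}

print(with_retries(flaky_ocr, base_delay=0.01))
```

Inside the pipeline loop, the call would become `result = with_retries(lambda: ocr_pdf(str(pdf_file)))`, leaving the existing `try`/`except` to record files that still fail after the final attempt.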
Frequently Asked Questions
How do I know if my PDF needs OCR? Try extracting text with pdfplumber or PyMuPDF. If you get empty strings or garbage characters, the PDF is image-based and needs OCR. See the detection function in our Extract Text from PDF with Python guide.
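The check described in the answer above can be sketched as a heuristic over the text already extracted from each page. This is not the detection function from the linked guide; the name `needs_ocr` and both thresholds are assumptions to tune for your documents. With pdfplumber, `page_texts` could be built as `[page.extract_text() for page in pdf.pages]` inside a `pdfplumber.open(path)` block.

```python
def needs_ocr(page_texts, min_chars_per_page=20, min_clean_ratio=0.9):
    """Guess whether a PDF is image-based from its extracted page texts.

    Returns True when there is almost no text, or when too many characters
    are unprintable or are the Unicode replacement character.
    """
    texts = [t or "" for t in page_texts]  # some extractors return None for empty pages
    if not texts:
        return True
    # Remove all whitespace so blank lines and padding do not count as content
    stripped = "".join("".join(texts).split())
    if len(stripped) < min_chars_per_page * len(texts):
        return True
    clean = sum(1 for ch in stripped if ch.isprintable() and ch != "\ufffd")
    return clean / len(stripped) < min_clean_ratio


print(needs_ocr(["", None]))
print(needs_ocr(["This agreement is made between the two parties named below."]))
```

A mixed document, with some native pages and some scanned ones, may pass this whole-document check, so applying it page by page is a reasonable refinement.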
What DPI should I use when converting PDF to images for Tesseract? 300 DPI is the standard recommendation for OCR. Higher DPI (400–600) can improve accuracy on small text but significantly increases processing time and memory usage. 150 DPI is the minimum for acceptable results.
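The cost of higher DPI follows from simple arithmetic: pixel count grows with the square of the DPI. The sketch below estimates the uncompressed size of one rendered page, assuming a US Letter page (8.5 x 11 inches) and 3 bytes per pixel for an RGB image. The function names are illustrative, and actual memory use in a given library may differ.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_image_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Approximate uncompressed size of the rendered page in bytes."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * bytes_per_pixel


for dpi in (150, 300, 600):
    width_px, height_px = page_pixels(8.5, 11, dpi)
    megabytes = raw_image_bytes(8.5, 11, dpi) / 1_000_000
    print(f"{dpi} DPI: {width_px} x {height_px} px, about {megabytes:.1f} MB uncompressed")
```

Doubling the DPI quadruples the pixel count, so a long document rendered at 600 DPI can exhaust memory if all pages are held at once; converting and processing pages in small batches avoids that.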
Can I OCR a PDF and make it searchable (without changing its appearance)? Yes. The result is called a "searchable PDF": the original scan appearance is preserved and an invisible text layer is added on top. AllPDFMagic OCR offers a searchable PDF output option. This is the standard approach for archiving scanned documents, often combined with the PDF/A archival format.
Related guides:
- Best PDF APIs for Developers 2026 — full API comparison
- Extract Text from PDF with Python — native PDF text extraction
- PDF Compression API Guide — compress PDFs via API
Frequently Asked Questions
When should I use the OCR API and when should I use pytesseract? Use the API when you need higher accuracy (especially for tables and multi-column layouts), want no local setup, are processing in a cloud environment, or need multi-language support. Use pytesseract locally when processing must be offline, data privacy rules prevent cloud uploads, or you are doing high-volume processing with custom preprocessing.
Can I OCR documents in Indian languages? Yes. AllPDFMagic OCR API supports Hindi, Tamil, Telugu, Marathi, Gujarati, Bengali, and other Indian languages. pytesseract also supports these when the corresponding Tesseract language pack is installed.