Developer | June 4, 2026 | 10 min read

How to Extract Text from PDF with Python in 2026

Complete Python guide to PDF text extraction: pdfplumber, PyMuPDF, pypdf, and OCR API for scanned PDFs. Working code examples for tables, metadata, and batch processing.

AllPDFMagic Team

Extracting text from PDFs in Python is one of the most common document processing tasks in data science, automation, and backend engineering. The approach you use depends on the PDF type: native text PDFs are straightforward; scanned PDFs require OCR; and complex layouts with columns, tables, or mixed content require specialised libraries.

This guide covers each of these scenarios with working code examples.

Library Comparison

Library	Best For	Install	Handles Scanned PDFs
pdfplumber	Tables, structured extraction	`pip install pdfplumber`	No
PyMuPDF (fitz)	Fast text extraction, metadata	`pip install pymupdf`	No
pypdf	Simple text extraction, page ops	`pip install pypdf`	No
pdfminer.six	Low-level text with positions	`pip install pdfminer.six`	No
AllPDFMagic API	All PDFs including scanned	API call	Yes (OCR built-in)
AWS Textract	Enterprise OCR	`pip install boto3`	Yes

Method 1: pdfplumber (Recommended for Tables and Structured Data)

import pdfplumber

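# Extract all text
with pdfplumber.open("document.pdf") as pdf:
    # extract_text() may return None or "" for a page with no text layer,
    # so fall back to an empty string before joining
    pages = [page.extract_text() or "" for page in pdf.pages]
full_text = "\n".join(pages)
print(full_text[:500])  # Preview first 500 chars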

# Extract tables from a specific page
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

pdfplumber is the best choice for extracting structured data like tables, financial figures, and forms from native text PDFs.
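The rows returned by extract_tables() are plain Python lists, with None standing in for empty cells, so they can be written out with the standard csv module. The following is a minimal sketch under that assumption; the helper name table_to_csv is illustrative and not part of pdfplumber.

```python
import csv
import io
from typing import Optional


def table_to_csv(table: list[list[Optional[str]]]) -> str:
    """Serialise one extracted table (a list of rows) to CSV text.

    Empty cells, which arrive as None, are written as empty strings.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hand-written sample in the same shape as one extract_tables() entry
sample = [["Item", "Qty"], ["Widget", "3"], ["Gadget", None]]
print(table_to_csv(sample))
```

In the table loop above, each `table` could be passed to this helper and the result written to its own file.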

Method 2: PyMuPDF (Fastest for Large Files)

import fitz  # PyMuPDF

# The context manager closes the document even if extraction raises
with fitz.open("document.pdf") as doc:
    for page_num, page in enumerate(doc, start=1):
        text = page.get_text()
        print(f"Page {page_num}:\n{text}\n{'=' * 50}")

PyMuPDF is typically the fastest option for pure text extraction (roughly 3–5× faster than pdfplumber, though the exact gap depends on the document) and handles large documents (500+ pages) efficiently.

Method 3: pypdf (Lightweight, Simple)

from pypdf import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    print(page.extract_text())

pypdf is the lightest dependency if you only need basic text extraction and don't need table support.

Handling Scanned PDFs: OCR via API

None of the above libraries can read scanned PDFs: with no text layer present, they return empty strings or garbled output.

For scanned PDFs, use the AllPDFMagic OCR API:

import requests

API_KEY = "your_api_key"

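def ocr_pdf(pdf_path: str) -> str:
    """Upload a PDF to the OCR endpoint and return the extracted text."""
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://api.allpdfmagic.com/v1/ai/ocr",
            headers={"X-API-Key": API_KEY},
            files={"file": f},
            timeout=120,  # OCR on long documents can be slow; adjust as needed
        )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["text"]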

# Extract text from a scanned PDF
text = ocr_pdf("scanned_contract.pdf")
print(text[:1000])

The OCR API handles multi-language text, tables, and mixed layouts. It returns structured JSON with page-by-page text and confidence scores.

Detecting PDF Type (Text vs Scanned)

Before choosing your extraction method, detect whether the PDF has a text layer:

import fitz

def has_text_layer(pdf_path: str, threshold: int = 10) -> bool:
    doc = fitz.open(pdf_path)
    total_text = ""
    for page in doc:
        total_text += page.get_text()
    doc.close()
    # If very little text extracted, likely a scanned PDF
    return len(total_text.strip()) > threshold

if has_text_layer("document.pdf"):
    # Use pdfplumber or PyMuPDF
    pass
else:
    # Use OCR API
    pass
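For a folder of mixed files, the same check can pick the extraction method per file. The sketch below is one possible shape rather than a definitive implementation: extract_folder is a hypothetical helper, and the detector and the two extractors are passed in as callables so that any of the functions from this guide can be plugged in.

```python
from pathlib import Path
from typing import Callable


def extract_folder(
    folder: str,
    has_text_layer: Callable[[str], bool],
    extract_native: Callable[[str], str],
    extract_ocr: Callable[[str], str],
) -> dict[str, str]:
    """Extract text from every *.pdf in `folder`, choosing a method per file."""
    results: dict[str, str] = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        path = str(pdf_path)
        try:
            extractor = extract_native if has_text_layer(path) else extract_ocr
            results[pdf_path.name] = extractor(path)
        except Exception as exc:
            # Keep going if one file is corrupt or an API call fails
            print(f"Skipping {pdf_path.name}: {exc}")
    return results
```

A call might look like extract_folder("invoices", has_text_layer, my_native_extractor, ocr_pdf), where "invoices" and my_native_extractor are placeholders for your own folder and a wrapper around pdfplumber or PyMuPDF.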

Extracting Text from Specific Pages

import pdfplumber

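def extract_pages(pdf_path: str, page_numbers: list[int]) -> str:
    """Return the text of the given pages; page_numbers are 0-based indices."""
    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for n in page_numbers:
            # extract_text() may return None or "" for a page with no text layer
            text += (pdf.pages[n].extract_text() or "") + "\n"
    return text

# Extract pages 2, 5, and 10 (passed as 0-based indices 1, 4, and 9)
content = extract_pages("report.pdf", [1, 4, 9])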
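Because pdf.pages is an ordinary Python list, an off-by-one mistake fails quietly: index 0 is page 1, and a negative index counts from the end instead of raising an error. A small converter from human page numbers to indices can guard against that. The helper below is a hypothetical addition, shown as a sketch.

```python
def to_page_indices(page_numbers: list[int], page_count: int) -> list[int]:
    """Convert 1-based page numbers to 0-based indices, rejecting out-of-range values."""
    indices = []
    for number in page_numbers:
        if not 1 <= number <= page_count:
            raise ValueError(f"Page {number} is outside 1..{page_count}")
        indices.append(number - 1)
    return indices


print(to_page_indices([2, 5, 10], page_count=12))  # [1, 4, 9]
```

With pdfplumber, page_count would be len(pdf.pages).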

Extracting Metadata

import fitz

doc = fitz.open("document.pdf")
metadata = doc.metadata
print(f"Author: {metadata['author']}")
print(f"Title: {metadata['title']}")
print(f"Creation date: {metadata['creationDate']}")
print(f"Pages: {doc.page_count}")
doc.close()
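PyMuPDF reports creationDate as a raw PDF date string such as D:20260604101500+02'00' rather than a datetime object. The sketch below converts that layout with the standard library; parse_pdf_date is an illustrative name, and the function returns None for anything it cannot parse instead of guessing.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYYMMDDHHmmSS followed by an optional Z or a +HH'mm' / -HH'mm' offset
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value: Optional[str]) -> Optional[datetime]:
    """Parse a PDF date string; return None if it does not follow the layout."""
    if not value:
        return None
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    try:
        tzinfo = None
        if sign in ("Z", "z"):
            tzinfo = timezone.utc
        elif sign in ("+", "-"):
            offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
            tzinfo = timezone(-offset if sign == "-" else offset)
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:  # e.g. month 13 or an impossible offset in a malformed file
        return None
```

Applied to the snippet above, parse_pdf_date(metadata["creationDate"]) would give a timezone-aware datetime when the file records an offset.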

Frequently Asked Questions

Why does page.extract_text() return an empty string? The PDF is likely scanned (image-based) with no text layer; near-zero output on a non-blank PDF is the telltale sign. Use OCR via the AllPDFMagic API to extract text from scanned documents.

How do I handle PDFs with multiple columns? pdfplumber's extract_text() reads left-to-right, top-to-bottom by default, which merges columns incorrectly. Use page.extract_words(), sort by x-position to identify the column boundaries, then group the words by column before joining them into text.
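The grouping step can be sketched in plain Python. The function below assumes equal-width columns, which is a simplification, and expects a list of dicts with "text", "x0", and "top" keys, the shape that page.extract_words() returns. The name words_to_column_text and the line_tolerance parameter are illustrative.

```python
def words_to_column_text(
    words: list[dict],
    page_width: float,
    n_columns: int = 2,
    line_tolerance: float = 3.0,
) -> str:
    """Rebuild reading order for a page laid out in equal-width columns."""
    column_width = page_width / n_columns
    columns: list[list[dict]] = [[] for _ in range(n_columns)]
    for word in words:
        # Assign each word to a column by its left edge, clamped to a valid index
        index = int(word["x0"] // column_width)
        columns[max(0, min(index, n_columns - 1))].append(word)

    parts = []
    for column in columns:
        column.sort(key=lambda w: w["top"])
        lines: list[list[dict]] = []
        for word in column:
            # Words whose tops are within the tolerance share a line
            if lines and abs(word["top"] - lines[-1][0]["top"]) <= line_tolerance:
                lines[-1].append(word)
            else:
                lines.append([word])
        text_lines = [
            " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
            for line in lines
        ]
        parts.append("\n".join(text_lines))
    # Columns are emitted left to right, separated by a blank line
    return "\n\n".join(part for part in parts if part)
```

With pdfplumber, a call would look like words_to_column_text(page.extract_words(), page.width). Layouts with uneven columns need measured boundaries instead of an equal split.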

Which library handles large PDFs (1000+ pages) without running out of memory? PyMuPDF (fitz) has the most efficient memory management for large PDFs. Process pages in chunks and del page objects after extraction to free memory.
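One way to keep memory flat is to never hold the whole document's text at once: yield one page at a time and write it out immediately. The sketch below assumes PyMuPDF is installed for iter_page_text; both function names and the form-feed page separator are choices made for this example.

```python
from typing import Iterable, Iterator


def iter_page_text(pdf_path: str) -> Iterator[str]:
    """Yield the text of one page at a time using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so write_pages has no dependency on it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            yield page.get_text()


def write_pages(page_texts: Iterable[str], out_path: str) -> int:
    """Write each page's text to `out_path` as it arrives; return the page count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for count, text in enumerate(page_texts, start=1):
            out.write(text)
            out.write("\f")  # form feed marks the end of each page
    return count
```

A call such as write_pages(iter_page_text("large.pdf"), "large.txt") would then process the file page by page; both file names are placeholders.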

Which library is fastest for text extraction? PyMuPDF (fitz) is the fastest for pure text extraction and handles 500+ page PDFs efficiently. pdfplumber is more accurate for table extraction but slower on large documents.

Can I extract text from password-protected PDFs? For owner-restricted PDFs (editing or copying restricted, but no password needed to open), pypdf and pdfplumber can usually extract text without a password. For PDFs protected with a user (open) password, provide the password: PdfReader("doc.pdf", password="secret").
