How to Use a PDF API in Python: Compress, Convert, and Extract Data in Minutes
Step-by-step guide to using the AllPDFMagic REST API from Python. Compress PDFs, extract text, convert Word/Excel files, and pull structured invoice data, each in well under 50 lines of code.
Working with PDFs in Python has traditionally meant wrestling with PyPDF2, reportlab, or pdfminer, libraries that can struggle with complex or malformed PDFs. A PDF API moves that work to a server: you send an HTTP request with a file and get a processed file back. There are no PDF libraries to install, fewer edge cases to handle yourself, and no C libraries to compile.
This guide walks through using the AllPDFMagic REST API from Python to compress PDFs, extract text, convert Office files, and pull structured data from invoices, with each task taking well under 50 lines of code.
Prerequisites
You need Python 3.8+ and the requests library:
pip install requests
Get a free API key at allpdfmagic.com/dashboard/api — 500 calls/month, no credit card required.
Basic setup
import requests
API_KEY = "apm_live_your_key_here"
BASE_URL = "https://www.allpdfmagic.com/api/v1"
session = requests.Session()
session.headers["Authorization"] = f"Bearer {API_KEY}"
Using a Session means the auth header is included on every request without repeating yourself.
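Hard-coding the key is fine for a quick test, but a key committed to version control is easy to leak. The following is a minimal sketch of reading it from the environment instead. The variable name `ALLPDFMAGIC_API_KEY` and the helpers `load_api_key` and `auth_headers` are assumptions made for illustration; they are not defined by the API.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name; pick whatever fits your deployment.
ENV_VAR = "ALLPDFMAGIC_API_KEY"

def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, failing loudly if it is unset."""
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set the {ENV_VAR} environment variable first")
    return key

def auth_headers(api_key: str) -> dict:
    """Build the same Bearer header that the setup above puts on the session."""
    return {"Authorization": f"Bearer {api_key}"}
```

With these in place, the setup line could become `session.headers.update(auth_headers(load_api_key()))`.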
Compress a PDF
import os

def compress_pdf(input_path: str, output_path: str, level: str = "medium") -> int:
    """Compress a PDF. Level: low | medium | high. Returns the number of bytes saved."""
    with open(input_path, "rb") as f:
        response = session.post(
            f"{BASE_URL}/pdf/compress",
            files={"file": (input_path, f, "application/pdf")},
            data={"level": level},
        )
    response.raise_for_status()
    with open(output_path, "wb") as out:
        out.write(response.content)
    # Compare sizes on disk: original minus compressed
    original = os.path.getsize(input_path)
    compressed = os.path.getsize(output_path)
    return original - compressed

saved_bytes = compress_pdf("report.pdf", "report_compressed.pdf", level="high")
print(f"Saved {saved_bytes / 1024:.1f} KB")
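A byte count is hard to judge without the original size next to it, so a percentage is often more useful when comparing compression levels. The helper below is a small sketch of that arithmetic; `savings_percent` is a hypothetical name, and the sizes in the example are made up.

```python
def savings_percent(original_bytes: int, compressed_bytes: int) -> float:
    """Percentage of the original size removed by compression."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return (original_bytes - compressed_bytes) / original_bytes * 100

# Hypothetical sizes: a 2,000,000-byte file compressed to 500,000 bytes
print(f"{savings_percent(2_000_000, 500_000):.1f}% smaller")  # prints "75.0% smaller"
```

A negative result means the output is larger than the input, which can happen with files that are already well compressed.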
Extract all text from a PDF
For PDFs with a text layer (not scanned), the /convert/pdf-to-txt endpoint is instant and free:
def extract_text(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        response = session.post(
            f"{BASE_URL}/convert/pdf-to-txt",
            files={"file": (pdf_path, f, "application/pdf")},
        )
    response.raise_for_status()
    return response.text

text = extract_text("contract.pdf")
print(text[:500])  # First 500 chars
For scanned PDFs without a text layer, use OCR instead:
def ocr_pdf(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        response = session.post(
            f"{BASE_URL}/ai/ocr",
            files={"file": (pdf_path, f, "application/pdf")},
        )
    response.raise_for_status()
    data = response.json()
    return data.get("text", "")
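When you do not know in advance whether a PDF has a text layer, one option is to try plain extraction first and use OCR only if it returns almost nothing. The sketch below assumes the text endpoint returns an empty or near-empty body for scanned files rather than an error, which this guide does not confirm. `extract_with_fallback` and its `min_chars` threshold are hypothetical.

```python
from typing import Callable

def extract_with_fallback(
    pdf_path: str,
    extract: Callable[[str], str],
    ocr: Callable[[str], str],
    min_chars: int = 20,
) -> str:
    """Try the text-layer extractor first; fall back to OCR if it finds too little text."""
    text = extract(pdf_path)
    if len(text.strip()) >= min_chars:
        return text
    # Too little text: assume the PDF is scanned and use the OCR callable
    return ocr(pdf_path)
```

With the functions defined above, a call would look like `extract_with_fallback("scan.pdf", extract_text, ocr_pdf)`. Passing the two extractors in as arguments keeps the fallback logic testable without network access.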
Convert Word/Excel/PPT to PDF
import os
from pathlib import Path

def to_pdf(input_path: str, output_path: str) -> None:
    """Convert Word, Excel, or PPT file to PDF."""
    ext = Path(input_path).suffix.lower()
    endpoint_map = {
        ".docx": "word-to-pdf", ".doc": "word-to-pdf",
        ".xlsx": "excel-to-pdf", ".xls": "excel-to-pdf",
        ".pptx": "ppt-to-pdf", ".ppt": "ppt-to-pdf",
    }
    endpoint = endpoint_map.get(ext)
    if not endpoint:
        raise ValueError(f"Unsupported extension: {ext}")
    mime_map = {
        ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
        # Legacy binary Office formats
        ".doc": "application/msword",
        ".xls": "application/vnd.ms-excel",
        ".ppt": "application/vnd.ms-powerpoint",
    }
    # Fall back to a generic binary type for anything not listed
    mime = mime_map.get(ext, "application/octet-stream")
    with open(input_path, "rb") as f:
        response = session.post(
            f"{BASE_URL}/convert/{endpoint}",
            files={"file": (input_path, f, mime)},
        )
    response.raise_for_status()
    with open(output_path, "wb") as out:
        out.write(response.content)

to_pdf("quarterly_report.docx", "quarterly_report.pdf")
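To convert a whole folder, it helps to work out the input and output paths before making any API calls. The following sketch does only that planning step; `plan_conversions` is a hypothetical helper, and the extension set simply mirrors the `endpoint_map` in the function above.

```python
from pathlib import Path
from typing import List, Tuple

# Extensions taken from the endpoint_map in to_pdf
SUPPORTED = {".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt"}

def plan_conversions(folder: str, out_dir: str) -> List[Tuple[Path, Path]]:
    """Pair every supported Office file in folder with its target .pdf path."""
    out = Path(out_dir)
    pairs = []
    for src in sorted(Path(folder).iterdir()):
        if src.is_file() and src.suffix.lower() in SUPPORTED:
            pairs.append((src, out / f"{src.stem}.pdf"))
    return pairs
```

Each pair can then be passed along as `to_pdf(str(src), str(dst))`. Note that two inputs with the same stem, such as `report.docx` and `report.xlsx`, would map to the same output name, so a real script would need to handle that collision.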
Extract structured invoice data with AI
This is where PDF APIs earn their keep. Instead of writing regex patterns for every invoice format, you send the file to the AI endpoint, which handles most standard layouts:
import json

def extract_invoice(pdf_path: str) -> dict:
    """Extract structured data from an invoice PDF using AI."""
    with open(pdf_path, "rb") as f:
        response = session.post(
            f"{BASE_URL}/ai/extract-invoice",
            files={"file": (pdf_path, f, "application/pdf")},
        )
    response.raise_for_status()
    return response.json()

invoice = extract_invoice("invoice_vendor_abc.pdf")
print(json.dumps(invoice, indent=2))
# Example output (line items condensed for readability):
# {
#   "vendor": "ABC Supplies Pvt Ltd",
#   "invoice_number": "INV-2024-0891",
#   "date": "2024-11-15",
#   "total": 48150.0,
#   "currency": "INR",
#   "line_items": [
#     {"description": "Office chairs x10", "quantity": 10, "unit_price": 4500, "amount": 45000},
#     {"description": "GST 7%", "quantity": 1, "unit_price": 3150, "amount": 3150}
#   ]
# }
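AI extraction can misread a figure, so a cheap sanity check before the data reaches your accounting system is worthwhile. The sketch below assumes the field names shown in the example output (`vendor`, `invoice_number`, `date`, `total`, `line_items`, `amount`) and that tax appears as its own line item, as it does there. `check_invoice` is a hypothetical helper and the sample values are made up.

```python
from typing import List

def check_invoice(invoice: dict, tolerance: float = 0.01) -> List[str]:
    """Return a list of problems found; an empty list means the invoice looks consistent."""
    problems = []
    for field in ("vendor", "invoice_number", "date", "total"):
        if invoice.get(field) in (None, ""):
            problems.append(f"missing field: {field}")
    items = invoice.get("line_items") or []
    total = invoice.get("total")
    if items and isinstance(total, (int, float)):
        # Line item amounts (including tax lines) should add up to the total
        line_sum = sum(item.get("amount") or 0 for item in items)
        if abs(line_sum - total) > tolerance:
            problems.append(f"line items sum to {line_sum}, but total is {total}")
    return problems

sample = {
    "vendor": "Example Vendor",
    "invoice_number": "INV-1",
    "date": "2024-01-01",
    "total": 110.0,
    "line_items": [
        {"description": "Widget", "amount": 100.0},
        {"description": "Tax", "amount": 10.0},
    ],
}
print(check_invoice(sample))  # prints []
```

Invoices that return a non-empty list are good candidates for manual review.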
Processing a batch of invoices
import glob
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List  # list[dict] would need Python 3.9+

def process_invoice_batch(folder: str) -> List[dict]:
    """Process all invoice PDFs in a folder concurrently."""
    paths = glob.glob(f"{folder}/*.pdf")
    results = []
    errors = []
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(extract_invoice, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                data = future.result()
                data["_source_file"] = path
                results.append(data)
            except Exception as e:
                errors.append({"file": path, "error": str(e)})
    if errors:
        print(f"Failed: {len(errors)} files")
        for err in errors:
            print(f"  {err['file']}: {err['error']}")
    return results

invoices = process_invoice_batch("./vendor_invoices")
# "or 0" guards against invoices where the total is missing or null
total = sum(inv.get("total") or 0 for inv in invoices)
print(f"Processed {len(invoices)} invoices. Total: ₹{total:,.2f}")
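Batch results are usually headed for a spreadsheet or an import job, so writing them to CSV is a natural next step. The following is a minimal sketch using only the standard library. `write_invoice_csv` is a hypothetical helper, and the column list assumes the field names from the example output plus the `_source_file` key added by the batch function.

```python
import csv
from typing import Iterable

FIELDS = ["_source_file", "vendor", "invoice_number", "date", "total", "currency"]

def write_invoice_csv(invoices: Iterable[dict], csv_path: str) -> int:
    """Write one row per invoice and return the number of rows written."""
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops keys outside FIELDS, such as line_items
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for inv in invoices:
            writer.writerow(inv)  # missing keys are written as empty cells
            rows += 1
    return rows
```

Line items are dropped here to keep one row per invoice; writing them to a second file keyed by invoice number would be one way to keep them.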
Check your quota
It is worth checking your remaining quota before starting a large batch:
def check_quota() -> dict:
    response = session.get(f"{BASE_URL}/usage/check")
    response.raise_for_status()
    return response.json()

quota = check_quota()
print(f"{quota['remaining']} API calls remaining this month")
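The quota response can also gate a batch before it starts. The sketch below assumes one API call per file and relies only on the `remaining` key used above. `fits_in_quota` and its `reserve` parameter are hypothetical.

```python
def fits_in_quota(quota: dict, planned_calls: int, reserve: int = 0) -> bool:
    """True if planned_calls can run while leaving `reserve` calls unused."""
    remaining = int(quota.get("remaining", 0))
    return planned_calls + reserve <= remaining

# Hypothetical numbers: 120 files to process, 500 calls left
print(fits_in_quota({"remaining": 500}, planned_calls=120))  # prints True
```

In a real script, `planned_calls` would be the number of PDFs found in the folder, and the batch would be skipped or split when the function returns `False`.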
Error handling
import requests

def safe_compress(input_path: str, output_path: str) -> bool:
    try:
        with open(input_path, "rb") as f:
            response = session.post(
                f"{BASE_URL}/pdf/compress",
                files={"file": (input_path, f, "application/pdf")},
                data={"level": "medium"},
                timeout=60,
            )
        if response.status_code == 429:
            print("Monthly quota reached. Upgrade at allpdfmagic.com/pricing")
            return False
        if response.status_code == 422:
            error = response.json().get("error", "Validation error")
            print(f"Invalid request: {error}")
            return False
        # Any other 4xx/5xx status raises requests.HTTPError
        response.raise_for_status()
        with open(output_path, "wb") as out:
            out.write(response.content)
        return True
    except requests.Timeout:
        print(f"Request timed out for {input_path}")
        return False
    except requests.HTTPError as e:
        print(f"HTTP error: {e}")
        return False
    except requests.RequestException as e:
        # Connection failures and other transport-level errors
        print(f"Request failed: {e}")
        return False
Key status codes
| Status | Meaning |
|---|---|
| 200 | Success — response body is the processed file or JSON |
| 401 | Invalid or missing API key |
| 422 | Validation error — check the error field in response body |
| 429 | Monthly quota exceeded — upgrade or wait for reset |
| 500 | Backend error — retry after a few seconds |
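The table suggests retrying after a 500, and exponential backoff is a common way to do that. The sketch below is generic rather than specific to this API: `ServerError` is a hypothetical exception that your own request wrapper would raise on status 500, and the delay values are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class ServerError(Exception):
    """Hypothetical marker exception: raise it when a response has status 500."""

def with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on ServerError wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ServerError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Only the 500 case is retried here. A 401, 422, or 429 will not succeed on a second attempt with the same request, so those are better reported immediately.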
What's next
- API Reference — all 50+ endpoints with parameters
- SDK Guide — the official Python SDK that wraps all of this
- Dashboard — get your API key and check usage
Frequently Asked Questions
What Python libraries do I need?
Just the requests library (pip install requests). The API is a standard HTTP REST API, so no special PDF libraries are needed. Everything runs server-side.
Can the API extract text from scanned PDFs?
Yes. Use /ai/ocr for scanned PDFs without a text layer. For PDFs with embedded text, /convert/pdf-to-txt is instant and free.
How many API calls does the free tier include?
The Starter tier gives 500 API calls per month at no cost. Each file operation (compress, convert, extract) counts as one call. No credit card required.
How accurate is the AI invoice extraction?
The AI extraction is highly accurate for standard invoice formats. It handles vendor names, amounts, line items, GST numbers, and dates. Edge cases like handwritten invoices or non-standard layouts may need manual review.