PDF Reader

i am working on a solution that requires me to read govt-issue pdfs. the pdfs have the standard govt-issue quirks: unstructured text, items mixed up, layered content, non-standard bullet points, etc. some data has to be extracted from them. a standard py script reader does a good job but fails on the unstructured lines. deepseek ocr needs py3.12 it seems, and i'm on py3.10. a venv could be an option, yes. moondream or something else, glm ocr, what do i do?

2 Likes

Hmm, been down this exact rabbit hole — messy government PDFs are a different breed. the good news: you don’t need DeepSeek OCR, you don’t need Py 3.12, and you definitely don’t need a venv nightmare. here’s the full breakdown of what actually works on Python 3.10, tested and benchmarked, foolproof from zero.


the short version: no single tool handles everything, but the right combo gets you 90%+ accuracy. start cheap and fast, escalate only when pages break.


🧠 Why Government PDFs Are Broken (the 30-second version)

regular PDFs store text in neat layers — government PDFs don’t give a damn. here’s what makes them special:

  • no consistent structure — headers aren’t tagged as headers, bullets aren’t real bullets, they’re just dashes someone typed
  • tables with invisible borders — the data LOOKS like a table but there’s zero grid lines underneath. just text floating in space
  • mixed layouts on the same page — half the page is two-column, the other half is one-column, and there’s a stamp in the corner
  • scanned copies of copies — someone printed a digital PDF, stamped it, scanned it back in at 150 DPI, and now you’re supposed to read it

think of it like this: a normal PDF is a well-organized filing cabinet. a government PDF is what you get when someone dumps the filing cabinet on the floor and photographs it.

standard PyPDF2 or pdfplumber can read the text that’s already digital — but the moment the layout goes sideways (literally or figuratively), they just vomit garbled text.

🏆 The Tier List — What to Install First (Py 3.10, all work)
| Tool | What It Does | Tables | Scanned PDFs | GPU Needed | Install |
|---|---|---|---|---|---|
| pymupdf4llm | Fastest PDF→Markdown. AI layout analysis. Handles multi-column, strips headers/footers | Good (3 detection modes) | Auto-OCR via Tesseract | ❌ CPU only | pip install pymupdf4llm pymupdf-layout |
| PP-StructureV3 | #1 on benchmarks. Detects seals/stamps, fixes rotated scans, chart→table conversion | Best (0.159 edit distance) | 100+ languages | ✅ 8GB VRAM (has CPU mode) | pip install paddleocr paddlepaddle-gpu |
| Docling (IBM) | Vision-transformer table extraction. 97.9% accuracy on complex tables with merged cells | Best for merged cells | Tesseract/EasyOCR/RapidOCR | ❌ CPU works | pip install docling |
| MinerU | Heavy-duty. Best reading order for multi-column. Strips headers/footers. 84-language OCR | Great (HTML output) | PaddleOCR built-in | ✅ 8-25GB VRAM (CPU fallback exists) | Docker recommended |
| Marker | Surya OCR + deep layout. Great for European + Devanagari languages | Good (HTML for complex) | 90+ languages | ✅ helps, not required | pip install marker-pdf |
| Moondream 2 | Tiny vision model. Fallback for pages where everything else fails. 2.5GB VRAM | Improving, not great yet | VLM-based (sees the page as an image) | ✅ 2.5GB | pip install moondream or via Ollama |
| Camelot | Specifically beats everything on government tender tables. Dead simple API | Best for govt tables with lines | ❌ digital only | ❌ CPU only | pip install camelot-py[cv] |
| olmOCR 2 | Allen AI’s fine-tuned Qwen2.5-VL-7B. Cheapest at scale ($190/million pages). Beats Marker + MinerU | Good | Document anchoring reduces hallucinations | ✅ 7B model | Docker |

start here → pymupdf4llm handles 80% of government PDFs. escalate to PP-StructureV3 or Docling for the tables that break. use Moondream/olmOCR as a last resort for truly cursed pages.

⚡ Step 1 — pymupdf4llm (your daily driver, 2 lines of code)

think of pymupdf4llm as a really smart photocopier — it reads the PDF natively (not as a picture), figures out the reading order, and spits out clean Markdown. and it just got an AI layout model that was literally trained on government tenders and legal documents.

install it (works on Py 3.10, no GPU, takes 10 seconds):

# this installs pymupdf4llm + the AI layout module
pip install pymupdf4llm pymupdf-layout

basic extraction — the two lines that handle most documents:

# activates the AI layout model (trained on 580K+ pages including govt docs)
import pymupdf.layout
import pymupdf4llm

# converts your PDF to clean Markdown with tables preserved
md = pymupdf4llm.to_markdown("government_doc.pdf", header=False, footer=False)

# save it
import pathlib
pathlib.Path("output.md").write_text(md, encoding="utf-8")

the header=False, footer=False part tells it to strip out repeating page headers and footers (that “Page 3 of 47 | Ministry of Whatever” garbage on every page).

when tables don’t extract right, the secret is the table_strategy parameter. think of it as telling the tool HOW to find tables:

| Strategy | When to Use |
|---|---|
| "lines_strict" (default) | Tables with visible borders/gridlines |
| "lines" | Tables where colored cell backgrounds act as separators |
| "text" | Borderless tables — detects columns from text position alone. This is the one you need for most govt PDFs |
| None | Skip table detection entirely (speed mode) |

# for those annoying borderless government tables
md = pymupdf4llm.to_markdown("govt_form.pdf", table_strategy="text")

the nuclear option — when you know exactly where the table is but the tool can’t find it, you draw the grid lines yourself:

import pymupdf

# open the PDF and grab page 1
doc = pymupdf.open("nightmare_form.pdf")
page = doc[0]

# define the table area (left, top, right, bottom in points)
# and manually add column dividers
tabs = page.find_tables(
    strategy="text",
    clip=pymupdf.Rect(50, 100, 550, 400),
    add_lines=[
        ((50, 100), (550, 100)),   # top border
        ((50, 400), (550, 400)),   # bottom border
        ((50, 100), (50, 400)),    # left border
        ((200, 100), (200, 400)),  # column divider
        ((350, 100), (350, 400)),  # another column divider
    ],
    min_words_vertical=2
)

# export each table to a pandas DataFrame
for tab in tabs:
    df = tab.to_pandas()
    print(df)

one gotcha: when you import pymupdf.layout (the AI model), it takes over table detection automatically. the table_strategy parameter only works WITHOUT the layout import. so pick one approach: AI layout mode OR manual table_strategy. not both.

📊 Step 2 — When Tables Break: Docling or PP-StructureV3

pymupdf4llm uses heuristics to find tables — think of it as “looking for lines and spacing.” when government PDFs have merged cells, invisible borders, or data scattered across weird positions, heuristics fail.

Docling’s TableFormer is fundamentally different — it’s a vision model that literally LOOKS at the table as a picture and figures out the structure. like showing a photo of a table to someone and asking “what are the rows and columns?” instead of trying to measure pixel gaps.

the result: 97.9% accuracy on complex tables with merged cells, versus pymupdf4llm straight-up not having merged cell support at all.

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.datamodel.base_models import InputFormat

# set to ACCURATE mode for maximum table precision (slower but worth it)
pipeline_options = PdfPipelineOptions(do_table_structure=True)
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# converts the PDF and preserves complex table structures
result = converter.convert("complex_government_tables.pdf")

# get the Markdown output
md = result.document.export_to_markdown()
print(md)

OR — PP-StructureV3 if you want the absolute benchmark king. this thing is #1 on OmniDocBench (the biggest document parsing benchmark, CVPR 2025). it detects 20 different layout elements including seals, stamps, charts, and multi-column text. it even auto-fixes rotated scans and bent/photographed documents:

from paddleocr import PPStructureV3

# the pipeline handles orientation, warping, tables, charts — everything
pipeline = PPStructureV3(
    use_doc_orientation_classify=True,  # auto-fixes rotated pages
    use_doc_unwarping=True,             # fixes bent/photographed docs
    use_chart_recognition=True,         # converts charts to data tables
    device="gpu"                        # use "cpu" if no GPU
)

# process the PDF and save as Markdown
output = pipeline.predict("./government_report.pdf")
for res in output:
    res.save_to_markdown(save_path="output")

PP-StructureV3 on Py 3.10: confirmed working with PaddleOCR 3.3.0 + PaddlePaddle 3.2.0. needs ~8GB VRAM on GPU, or runs on CPU (just slower).

quick cheat sheet:

| Problem | Use This |
|---|---|
| Tables with visible borders | pymupdf4llm table_strategy="lines_strict" |
| Borderless tables | pymupdf4llm table_strategy="text" |
| Merged cells / complex structure | Docling (TableFormer ACCURATE) |
| Government tenders specifically | Camelot (camelot.read_pdf("file.pdf", flavor="lattice")) |
| Everything at once (seals, stamps, charts, rotation) | PP-StructureV3 |

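since Camelot only appears inline in the cheat sheet, here’s a slightly fuller sketch. camelot.read_pdf and the .df accessor are real Camelot API; the file name, page range, and the extract_tender_tables helper are placeholders of mine:

```python
def extract_tender_tables(pdf_path: str, pages: str = "1-end"):
    """Pull every ruled table from a digital (non-scanned) PDF as DataFrames."""
    import camelot  # pip install camelot-py[cv]; also needs Ghostscript installed

    # "lattice" follows visible grid lines; switch to flavor="stream"
    # for tables that have no ruling lines at all
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")
    return [t.df for t in tables]

# usage (placeholder file name):
# for df in extract_tender_tables("tender_notice.pdf"):
#     print(df.head())
```

the lazy import keeps the helper importable even on machines without Camelot installed.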
🔧 Step 3 — Scanned PDFs (the preprocessing step nobody talks about)

if your government PDFs are scanned images (not digital text), you need to preprocess BEFORE any extraction tool touches them. this single step can push OCR accuracy from 60% to 95%.

think of it like cleaning a dirty window before trying to read what’s on the other side.

the fastest way — OCRmyPDF handles everything in one command:

# install it
pip install ocrmypdf

# the magic one-liner: deskews, cleans, fixes rotation, upscales to 300 DPI
ocrmypdf --deskew --clean --clean-final --rotate-pages --remove-background \
    --oversample 300 -l eng --jobs 4 input_scan.pdf output_searchable.pdf

what each flag does:

  • --deskew — straightens pages that were scanned crooked
  • --clean — removes scanner noise and artifacts
  • --rotate-pages — auto-detects and fixes upside-down pages
  • --remove-background — strips colored/dirty backgrounds
  • --oversample 300 — upscales to 300 DPI (minimum for good OCR)
  • --jobs 4 — uses 4 CPU cores (adjust to your machine)

after this, feed the cleaned PDF to pymupdf4llm or PP-StructureV3 and watch the accuracy jump.
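if you’d rather stay in Python than shell out, OCRmyPDF also ships a Python API whose keyword arguments mirror the CLI flags — the kwargs below follow that convention, but double-check them against the OCRmyPDF docs for your version (the clean_and_ocr wrapper is my own name):

```python
def clean_and_ocr(input_pdf: str, output_pdf: str) -> None:
    """Same cleanup as the CLI one-liner above, driven from Python."""
    import ocrmypdf  # pip install ocrmypdf

    ocrmypdf.ocr(
        input_pdf,
        output_pdf,
        deskew=True,             # --deskew
        clean=True,              # --clean
        clean_final=True,        # --clean-final
        rotate_pages=True,       # --rotate-pages
        remove_background=True,  # --remove-background
        oversample=300,          # --oversample 300
        language="eng",          # -l eng
        jobs=4,                  # --jobs 4
    )

# usage (placeholder file names):
# clean_and_ocr("input_scan.pdf", "output_searchable.pdf")
```

handy when you want the preprocessing step inside the same script that does the extraction.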

for really degraded documents (old photocopies, faded text, uneven lighting), add Python preprocessing with OpenCV:

import cv2

# read the page image in grayscale
img = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)

# step 1: enhance contrast for faded text
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)

# step 2: adaptive thresholding (handles uneven scanner lighting)
binary = cv2.adaptiveThreshold(
    enhanced, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 25, 10
)

# step 3: denoise (removes scanner specks)
clean = cv2.fastNlMeansDenoising(binary, h=10)

cv2.imwrite("preprocessed_page.png", clean)

🤖 Step 4 — Vision Model Fallback (for cursed pages)

some pages are beyond saving with text extraction — weird overlapping layers, hand-stamped content over printed text, forms where the filled-in data overlaps the template. for these, you need a model that LOOKS at the page like a human does.

Moondream 2 is the lightest option. 1.9B parameters, runs on 2.5GB VRAM (or CPU via Ollama), and now supports structured JSON output. think of it as showing a photo to a tiny AI and asking “what does this say?”

via Ollama (easiest path):

# install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh

# pull the Moondream model (~1.7GB download)
ollama pull moondream

then, from Python:

import ollama

# convert your PDF page to an image first
# (use pdf2image: pip install pdf2image)
from pdf2image import convert_from_path
images = convert_from_path("cursed_document.pdf", dpi=300)

# save page 1 as image
images[0].save("page1.png", "PNG")

# ask Moondream to read it
response = ollama.chat(
    model="moondream",
    messages=[{
        "role": "user",
        "content": "Transcribe all text from this document in natural reading order. Preserve table structures.",
        "images": ["page1.png"]
    }]
)
print(response["message"]["content"])

for structured form data, ask it to return JSON:

response = ollama.chat(
    model="moondream",
    messages=[{
        "role": "user",
        "content": "Extract all form fields and their filled values from this document. Return as JSON with field names as keys.",
        "images": ["page1.png"]
    }]
)
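small models don’t always return clean JSON — they often wrap it in prose or markdown fences. a defensive parse helper (hypothetical, not part of the ollama client) saves you from crashes:

```python
import json
import re

def parse_model_json(raw: str):
    """Best-effort: pull the first JSON object out of a model reply."""
    # strip markdown code fences if the model added them
    raw = re.sub(r"```(?:json)?", "", raw)
    # find the outermost {...} span and try to decode it
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None

print(parse_model_json('Here you go:\n```json\n{"applicant": "A. Kumar"}\n```'))
```

feed it response["message"]["content"] and fall back to re-prompting when it returns None.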

for production/large batches, olmOCR 2 (Allen AI) is better — it’s specifically trained on documents, reduces hallucinations with a technique called “document anchoring,” and costs only $190 per MILLION pages to run. fully open-source:

# requires GPU with 16GB+ VRAM
docker run --gpus all -v $(pwd):/workspace alleninstituteforai/olmocr:latest-with-model \
  -c "olmocr /workspace/output --markdown --pdfs /workspace/sample.pdf"

skip GOT-OCR 2.0 for this use case — it’s great at plain text but straight-up fails on forms and tables (rated “Bad” on 4 of 7 form tests by F22 Labs).

🧩 The Hybrid Pipeline (the real answer for production)

in practice, you don’t pick one tool — you build a pipeline that starts fast/cheap and escalates only when needed. think of it like a hospital triage: most patients just need a bandaid, some need an x-ray, very few need surgery.

PDF comes in
    ↓
pymupdf4llm extracts text (0.12 seconds/page)
    ↓
Quality check: did we get actual readable text?
  - Character count > 50? ✅
  - Less than 10% garbled characters? ✅
  - Words have normal length (3-15 chars)? ✅
    ↓
YES → done, use the output
    ↓
NO → page is probably scanned or broken
    ↓
Preprocess with OCRmyPDF → re-extract
    ↓
Still broken? → send page image to Moondream/olmOCR
    ↓
Tables look wrong? → re-extract tables with Docling or Camelot
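the quality check in that triage step can be sketched as a small heuristic — this is a hypothetical helper of mine, not from any library, using the three thresholds from the diagram:

```python
import re

def looks_extracted(text: str) -> bool:
    """Heuristic: did the fast text extraction produce usable output?"""
    # rule 1: enough characters to be a real page of text
    if len(text.strip()) <= 50:
        return False
    # rule 2: less than 10% garbled characters (replacement chars, control bytes)
    garbled = len(re.findall(r"[\ufffd\x00-\x08\x0b\x0c\x0e-\x1f]", text))
    if garbled / len(text) >= 0.10:
        return False
    # rule 3: average word length in a sane range (3-15 chars)
    words = text.split()
    avg = sum(len(w) for w in words) / len(words)
    return 3 <= avg <= 15

print(looks_extracted("Tender No. 42: supply of office furniture to the Ministry, due 2024-05-01."))
print(looks_extracted("\ufffd\ufffd\ufffd"))  # garbled scan residue
```

pages that fail the check get routed down the OCR branch; tune the thresholds on your own documents.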

there’s a GitHub repo that implements exactly this pattern: nihithta/pdf-extraction-pipeline — it auto-detects page types (digital text vs scanned vs image-heavy) and routes each page to the right extraction method.

💡 About DeepSeek OCR, Moondream, and GLM OCR — Your Specific Questions

DeepSeek OCR + Py 3.12 issue: yeah, some newer VLM tools need 3.12+. but you don’t need DeepSeek for this. pymupdf4llm + Marker + PP-StructureV3 all work perfectly on Py 3.10. if you specifically WANT DeepSeek later, a venv with Py 3.12 is the clean path — don’t pollute your main environment.

Moondream: solid choice for a vision-model fallback, not as a primary extractor. it’s tiny (1.9B params), runs anywhere, and the OCR quality improved significantly in the Jan-Apr 2025 updates. use it for the 5-10% of pages where text extraction completely fails. prompt it with “Transcribe the text in natural reading order” for best results.

GLM OCR: if you mean CogVLM/GLM-4V — these are big vision-language models. GLM-OCR (0.9B) actually just hit 94.62 on OmniDocBench which is insane for its size. but for your use case it’s overkill. the problem isn’t OCR accuracy on clean pages — it’s handling the structural chaos. that’s where pymupdf4llm’s layout model and Docling’s TableFormer solve the actual pain point.

💰 The Part Nobody Mentions — Making Money With This

once you can reliably extract structured data from messy PDFs, you’re sitting on a skill that companies will pay serious money for:

freelancing/consulting — government contractors, law firms, and accounting firms all drown in PDF hell. extracting data from regulatory filings, tender documents, and compliance reports is a $50-150/hour gig on Upwork. most firms still do this manually with interns.

build a SaaS tool — a simple web app that takes a government PDF and returns structured JSON/CSV. charge per document or per page. the market for this is massive in India (GST invoices, ITR forms, property documents) and basically every country with a bureaucracy.

data entry automation — any business that processes government forms (visa applications, permits, tax documents, insurance claims) needs this. you’re replacing manual data entry. one Python script can do what 5 data entry operators do.

RAG pipeline for government docs — feed extracted data into a vector database, build a chatbot that answers questions about government regulations/forms. legal tech companies pay $200K+ salaries for this exact skill.

the extraction pipeline IS the product — most people get stuck at “how do I read this PDF” and give up.

🔗 Quick Reference — All the Links
| Tool | Link | Best For | Cost |
|---|---|---|---|
| pymupdf4llm | GitHub | Daily driver, fast extraction | Free |
| pymupdf-layout | PyPI | AI layout model addon | Free |
| PP-StructureV3 | GitHub | Benchmark king, tables/seals | Free |
| Docling | GitHub | Complex table extraction | Free |
| MinerU | GitHub | Heavy-duty, best reading order | Free |
| Marker | GitHub | Multilingual OCR | Free |
| Moondream 2 | GitHub | Lightweight vision fallback | Free |
| olmOCR 2 | GitHub | Cheapest at scale | Free / $190 per 1M pages |
| Camelot | GitHub | Government tender tables | Free |
| OCRmyPDF | GitHub | Scan preprocessing | Free |
| pdf-extraction-pipeline | GitHub | Hybrid pipeline template | Free |
| OpenBharatOCR | GitHub | Indian govt docs (Aadhaar, PAN, etc.) | Free |
| OCRFlux-3B | GitHub | Cross-page table merging | Free |
| Nanonets-OCR2 | HuggingFace | Checkboxes, watermarks, forms | Free |
| Mistral OCR 3 | API | Best cloud API option | $2/1K pages |

the actual recommendation for your situation: install pymupdf4llm + pymupdf-layout right now. two pip commands, two lines of Python, Py 3.10, no GPU. that’ll handle most of your government PDFs immediately. when specific pages break (and they will), come back to this reply and escalate to the right tool for that specific failure mode. don’t try to set up the entire pipeline before you even know which PDFs give you trouble — start simple, fix the exceptions as they come.

5 Likes

If you can provide me with, for example, a government PDF that is outdated in terms of information importance, I am ready to take on the software side of this task and post the results here on the forum.

1 Like

I seriously didn’t expect anyone to reply after so many days. This is by far the best advice. Let me try what is suggested. Thank you a zillion :slight_smile:

1 Like

I understand that your problem is not reading PDF documents but extracting information from PDF files.
A cumbersome solution has been proposed, and I don’t think you have the software development experience to use those Python packages from github.com! This can be done very easily on a PC with Windows and Microsoft Word, by transforming the PDF file into an editable TXT or DOCX format.

But first, let’s do a little theory about the PDF format!

The PDF format (Portable Document Format), developed by Adobe in 1993, is an international standard (ISO 32000) designed to present documents independently of the software, hardware or operating system used.

Here are the main types and classifications of PDF documents:

1. Classification by content and interaction:
a. Standard (Native) PDFs: created directly from document-editing applications such as Microsoft Word or Adobe InDesign. These allow selecting, copying and searching text.
b. Image-only PDFs: scanned documents that function as images. They do not allow editing or searching text unless OCR (Optical Character Recognition) is applied to them.
c. Searchable PDFs: scanned documents processed to convert the images into selectable, searchable text.
d. AcroForms PDFs: files that contain fillable fields (text, check boxes, radio buttons) and data-submission buttons.

2. Classification by ISO standard (purpose of use):
a. PDF/A (Archive): designed for long-term archiving of electronic documents, ensuring the document will look identical in the future.
b. PDF/X (Print): the standard for the printing industry, ensuring the file includes all the fonts, images and color settings necessary for correct printing.
c. PDF/E (Engineering): used for technical documents, engineering drawings and maps.
d. PDF/VT (Variable and Transactional): used for printing variable data, such as invoices or personalized bank statements.
e. PDF/UA (Universal Accessibility): the accessibility standard, ensuring documents can be used by people with disabilities (e.g. with screen readers).

3. Classification by destination:
a. PDF for Screen (Optimized): compressed, low-resolution files, ideal for quick sharing via email or the web.
b. PDF for Print (High Quality): high-resolution files (usually 300 DPI or higher) used for physical documents.

PDF allows text, fonts, images and 2D (even 3D) vector graphics in a single file, which is what makes it the most popular format for official documents, contracts and invoices.

PDF CONVERSION

All PDF files can be edited with Word, except:
- type 1.b (image-only): OCR must be applied first.
- types 2.c, 2.d, 2.e: edit with Adobe Acrobat or third-party software.

How to edit a PDF with Word:
- Right-click on the PDF file
- Open With → Word
- Save As DOCX

Word even converts tabular PDFs!
Moreover, after editing, the file can be saved back to PDF. This is the easiest way to edit a PDF file for free, without paying for Adobe Acrobat.
Once you have the converted PDF, it is very easy to build an application in Python, or to extract whatever information you want from the converted file!

For situations where OCR must be used, or for other edge cases, there are “portable” third-party applications that solve the problem:
PDF-XChange Editor Plus,
PDF Arranger,
PDF Extra Ultimate,
Master PDF Editor - It also has an OCR mode,
SepPDF,
Atlantis Word Processor,
EasyScreenOCR,
Icecream PDF Editor Pro,
Infix PDF Editor Pro,
OfficeSuite Premium,
PDFMatePro,
PDF Shaper Professional,
ABBYY FineReader - It also has an OCR mode, at https://portableapps.com/node/58553
AlterPDF Pro,
ByteScout PDF Multitool,
CleverPDF,
PDF Extra Premium,
PDF Redactor Pro,
Renee PDF Aide - It also has an OCR mode

Portable programs can be found at
https://www.portablefreeware.com/
https://portableapps.com/

2 Likes

the first thing i did with the pdf when i got it was open it with word and copy the text i wanted. the screw-up is that the pdf itself is made up of images (scanned text, at varying resolutions) mixed with copyable text. i can extract about 70% of the pure text but the rest is unusable. the online glm-ocr in this works far better; it extracts much more. i am on win 11, 32gb ram and an rtx 3060 with 12 gb vram. the offline glm-ocr through ollama also works, and extracts better than the online version, but it’s slow as %$#@.
i am very thankful to the community here for helping out. it did help me understand how to go about extraction. the govt pdfs are notorious: they are mixed-language, sometimes tables are portrait, sometimes wide, and suddenly mid-table it changes. the garbage is huge. i had 36 pdfs and can safely work with only 16, because the mixed-language ones won’t extract (english plus indic, where even glm-ocr fails). my choices are seriously limited.
thanks to all the people here :slight_smile: is there a solution for extracting Indic text from PDFs?

I’ve recently developed an OCR tool, but I’m not sure if it’s suitable for official PDF files. If I had an example document, I could identify what problems occur and create a program that fills those gaps. For now, here’s my first program: 🔍 Translate Anything On Your Screen — Games, PDFs, Locked Sites, 109 Languages

AI_Vids, what do you mean by “index text in PDF”?
Send me a document here and explain to me, on that document, what your problem is!
THE ONLY APP THAT FULLY EDITS AND EXTRACTS PDF DOCUMENTS IS ADOBE ACROBAT!!
BUT THERE ARE ALSO APPS THAT GET TO ABOUT 95% OF ADOBE ACROBAT:
ABBYY FineReader,
Master PDF Editor,
Renee PDF Aide.

I can’t send a PDF here, and I’m not allowed to DM you :frowning:

sir, i mean Indian-language (Indic) text, like Hindi, Tamil, Malayalam, etc. if i can manage to send you a DM, i will send you a sample PDF.

You can send me documents through [https://www.transfernow.net/en]

google drive link: https://drive.google.com/drive/folders/1mmxJwUqMDXGwY-2C2Tjuggmx88-QGFrd?usp=drive_link

I have converted the submitted files to DOCX format.
You can get them from here:

The archive is valid for 6 days.

1 Like

Thanks, this is really appreciated. Thanks again. :folded_hands: