Top 4 Python pdf-parsing Projects

PyPDF2

Rank: 1 | Mentions: 31 | Stars: 9,363 | Activity: 9.6 | Language: Python

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03
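
Splitting a PDF largely comes down to deciding which page indices go into each output file. The following is a minimal sketch of that planning step in plain Python, independent of any PDF library. `chunk_page_ranges` is a hypothetical helper introduced here for illustration; it is not part of PyPDF2.

```python
from __future__ import annotations


def chunk_page_ranges(total_pages: int, pages_per_file: int) -> list[range]:
    """Return consecutive zero-based page ranges, one per output file."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    # Step through the document in blocks; the last block may be shorter.
    return [
        range(start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]


if __name__ == "__main__":
    for part, pages in enumerate(chunk_page_ranges(10, 4), start=1):
        print(f"part {part}: page indices {list(pages)}")
```

With a PDF library, each range would then be copied into its own output file, for example by reading pages from a reader object and adding them to a writer object. The exact calls depend on the library and version in use.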
pdfplumber

Rank: 2 | Mentions: 31 | Stars: 8,209 | Activity: 8.0 | Language: Python

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Project mention: PDF Extraction: Retrieving Text and Tables together using Python🐍 | dev.to | 2024-09-22

Extracting both text and tables can be challenging when working with PDF files due to their complex structure. However, the “pdfplumber” library offers a powerful solution. This article explores an effective method for combining text and table extraction from PDFs using pdfplumber. Special thanks to Karl Genockey a.k.a. cmdlineuser and other contributors for their brilliant approach discussed here.
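
One way to picture combined text and table extraction is as a merge by vertical position: drop text lines that fall inside a table's bounding box, then interleave the remaining lines with the tables from top to bottom. The sketch below shows only that merge step in plain Python and is not necessarily the approach taken in the article. The input dictionaries and the helper names `inside` and `merge_text_and_tables` are assumptions made for illustration; the bounding boxes follow the `(x0, top, x1, bottom)` convention, with `top` measured from the upper edge of the page.

```python
from __future__ import annotations

from typing import Any

# (x0, top, x1, bottom), with top measured from the page's upper edge.
BBox = tuple


def inside(line: dict[str, Any], bbox: BBox) -> bool:
    """True if the line's vertical midpoint and left edge fall inside bbox."""
    x0, top, x1, bottom = bbox
    mid = (line["top"] + line["bottom"]) / 2
    return top <= mid <= bottom and x0 <= line["x0"] <= x1


def merge_text_and_tables(
    lines: list[dict[str, Any]], tables: list[dict[str, Any]]
) -> list[tuple[str, Any]]:
    """Interleave text lines and tables in top-to-bottom reading order.

    lines:  dicts with "text", "x0", "top", "bottom" (hypothetical shape)
    tables: dicts with "bbox" and "rows" (hypothetical shape)
    """
    items = []
    for line in lines:
        if any(inside(line, table["bbox"]) for table in tables):
            continue  # this text is already represented by a table
        items.append((line["top"], "text", line["text"]))
    for table in tables:
        items.append((table["bbox"][1], "table", table["rows"]))
    # Sort on the vertical position only; ties keep their insertion order.
    items.sort(key=lambda item: item[0])
    return [(kind, content) for _, kind, content in items]
```

In practice the line and table records would come from whatever extraction library is in use; the merge itself only needs positions and content.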
py-pdf-parser

Rank: 3 | Mentions: 2 | Stars: 412 | Activity: 0.0 | Language: Python

A Python tool to help extract information from structured PDFs.
pdf-to-markdown

Rank: 4 | Mentions: 2 | Stars: 92 | Activity: 7.6 | Language: Python

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

Project mention: Parse Markdown content from PDF documents | news.ycombinator.com | 2024-12-03
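
To illustrate what "structured Markdown" can mean for tables, here is a minimal sketch that renders already-extracted rows as a Markdown pipe table. It is not code from pdf-to-markdown; `rows_to_markdown` is a hypothetical helper, and it assumes the first row is the header and that empty cells may arrive as `None`.

```python
from __future__ import annotations


def rows_to_markdown(rows: list[list[str | None]]) -> str:
    """Render rows as a Markdown pipe table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: list[str | None]) -> str:
        # Escape pipes, flatten line breaks, and pad short rows.
        cells = [
            (cell or "").replace("|", r"\|").replace("\n", " ").strip()
            for cell in row
        ]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *(fmt(row) for row in body)])


if __name__ == "__main__":
    print(rows_to_markdown([["Region", "Sales"], ["North", "10"], ["South", None]]))
```

Keeping tables as pipe tables rather than flattened text preserves the row and column structure for downstream retrieval or NLP steps.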

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python pdf-parsing related posts

PDF Extraction: Retrieving Text and Tables together using Python🐍

1 project | dev.to | 22 Sep 2024
Running OCR against PDFs and images directly in the browser

7 projects | news.ycombinator.com | 30 Mar 2024
Parsing dates with PDFminer

1 project | /r/learnpython | 4 Jul 2023
How to Extract Data from Tables in a Public Record PDF

2 projects | /r/Journalism | 26 Jun 2023
Code to extract text from pdf to excel

2 projects | /r/Python | 2 Jun 2023
I need to parse unstructured tables from a pdf into a json, what can I do

1 project | /r/computervision | 21 May 2023
Advanced PDF to Excel with documents and example code

2 projects | /r/learnpython | 1 May 2023

Index

What are some of the best open-source pdf-parsing projects in Python? This list will help you:

#	Project	Stars
1	PyPDF2	9,363
2	pdfplumber	8,209
3	py-pdf-parser	412
4	pdf-to-markdown	92

Did you know that Python is the 2nd most popular programming language based on number of references?
