Top 23 Python PDF Projects

paperless-ngx

212 17,416 9.9 Python

A community-supported supercharged version of paperless: scan, index and archive all your physical documents

Project mention: I accidentally built a meme search engine | news.ycombinator.com | 2024-04-13

I steered a friend towards Paperless (and away from an LLM solution) as a way of searching/accessing GBs of architectural PDFs recently - so far, it’s apparently working well for them.
https://github.com/paperless-ngx/paperless-ngx

OCRmyPDF

77 12,291 9.5 Python

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Project mention: TextSnatcher: Copy text from images, for the Linux Desktop | news.ycombinator.com | 2024-03-14

Try https://github.com/ocrmypdf/OCRmyPDF - it uses Tesseract behind the scenes and it is absolutely brilliant.

h2ogpt

29 10,801 10.0 Python

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

Project mention: Multi AI Agent Systems Using OpenAI's New GPT-4o Model | news.ycombinator.com | 2024-05-17

PyPDF2

30 7,585 9.5 Python

A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Project mention: Yara scanning PDF files | /r/computerforensics | 2023-06-01

WeasyPrint

43 6,698 9.5 Python

The awesome document factory

Project mention: Launch HN: Onedoc (YC W24) – A better way to create PDFs | news.ycombinator.com | 2024-03-11

Is there a reason you didn't consider something like Weasyprint?
https://weasyprint.org
I've gone through a number of systems to convert CVs, business cards, and other docs and it hasn't let me down yet.

pdfplumber

29 5,706 8.2 Python

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Project mention: Running OCR against PDFs and images directly in the browser | news.ycombinator.com | 2024-03-30

pdfminer.six

14 5,514 6.8 Python

Community maintained fork of pdfminer - we fathom PDF

Project mention: Code to extract text from pdf to excel | /r/Python | 2023-06-02

I love to use PDFMiner and PDFQuery for this https://github.com/pdfminer/pdfminer.six https://towardsdatascience.com/scrape-data-from-pdf-files-using-python-and-pdfquery-d033721c3b28

PDFMiner

6 5,179 0.0 Python

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PyMuPDF

5 4,249 9.8 Python

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Project mention: FLaNK Stack for 04 December 2023 | dev.to | 2023-12-04

camelot

1 3,553 0.0 Python

Camelot: PDF Table Extraction for Humans (by atlanhq)

Project mention: How do you parse tables in PDF with langchain? Especially, the context which is few lines above and below the table. | /r/LangChain | 2023-06-23

borb

66 3,314 5.3 Python

borb is a library for reading, creating and manipulating PDF files in Python.

pdfarranger

93 3,077 8.9 Python

Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

Project mention: Pdftool.org: modify pdfs offline in the browser | news.ycombinator.com | 2023-08-13

On Linux I like to use:
https://github.com/pdfarranger/pdfarranger and https://gitlab.com/scarpetta/pdfmixtool for such tasks.

malicious-pdf

13 2,710 2.5 Python

💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh

Camelot

10 2,712 6.6 Python

A Python library to extract tabular data from PDFs

Project mention: Show HN: How do you OCR on a Mac using the CLI or just Python for free | news.ycombinator.com | 2024-01-02

I had good repeated success extracting tables from PDFs using Camelot (Python, https://github.com/camelot-dev/camelot)

Papermerge

33 2,363 5.9 Python

Open Source Document Management System for Digital Archives (Scanned Documents)

xhtml2pdf

2 2,188 7.5 Python

A library for converting HTML into PDFs using ReportLab

pdftabextract

0 2,152 0.0 Python

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

tabula-py

4 2,073 7.3 Python

Simple wrapper of tabula-java: extract tables from PDFs into pandas DataFrames

pikepdf

4 2,038 9.5 Python

A Python library for reading and writing PDF, powered by QPDF

pdf2image

1 1,480 5.1 Python

A Python module that wraps the pdftoppm utility to convert PDF pages to PIL Image objects
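
To illustrate what "wrapping the pdftoppm utility" can look like, here is a minimal standard-library sketch. It is not pdf2image's actual implementation: the helper names `build_pdftoppm_command` and `render_pages` are hypothetical, and the sketch assumes Poppler's `pdftoppm` binary is installed and on the PATH. It writes image files to disk instead of returning PIL objects.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_prefix, dpi=150, fmt="png"):
    """Return the argument list for a pdftoppm call (hypothetical helper)."""
    if fmt not in ("png", "jpeg"):
        raise ValueError(f"unsupported format: {fmt}")
    # -r sets the resolution in DPI; -png / -jpeg select the output format.
    return ["pdftoppm", "-r", str(dpi), f"-{fmt}", str(pdf_path), str(out_prefix)]


def render_pages(pdf_path, out_dir, dpi=150, fmt="png"):
    """Render every page to an image file and return the sorted output paths."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; install Poppler first")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # pdftoppm appends "-<page number>.<ext>" to the prefix it is given.
    cmd = build_pdftoppm_command(pdf_path, out_dir / "page", dpi, fmt)
    subprocess.run(cmd, check=True)
    return sorted(out_dir.glob("page-*"))
```

Keeping the command construction separate from the subprocess call makes the argument list easy to inspect or test without Poppler installed.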
arxiv.py

2 995 5.4 Python

Python wrapper for the arXiv API

Project mention: Cannot export query results from the advanced search | /r/arxiv | 2023-10-18

The arXiv API is largely broken. Have a look at this: https://github.com/lukasschwab/arxiv.py/issues/129.
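
For context, the arXiv API that such a wrapper talks to is a plain HTTP endpoint that returns an Atom feed. The sketch below shows, under that assumption, how a query URL can be built and how entry titles can be pulled out of a response using only the standard library. The function names are hypothetical and this is not how arxiv.py itself is written; no network request is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_query_url(search_query, start=0, max_results=10):
    """Build a query URL using the API's search_query/start/max_results parameters."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_entry_titles(atom_xml):
    """Return the title of each <entry> in an Atom feed, with whitespace collapsed."""
    root = ET.fromstring(atom_xml)
    titles = []
    for entry in root.findall("atom:entry", ATOM_NS):
        raw = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        titles.append(" ".join(raw.split()))
    return titles
```

A caller would fetch the URL from `build_arxiv_query_url("all:electron")` with any HTTP client and pass the response body to `parse_entry_titles`.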

fpdf2

23 966 9.2 Python

Simple PDF generation for Python

pdfium-lib

2 885 7.0 Python

PDFium - Project to compile PDFium library to multiple platforms.

Project mention: MuPDF WASM Viewer Demo | news.ycombinator.com | 2024-04-20

I am letting people know of permissive alternatives (https://github.com/paulocoutinhox/pdfium-lib) and their usage in web components (PDFium.wasm + PDF.js).
My comment also serves as a promise to open-source my components under the same permissive license.
I don't want people exposed to unnecessary stress.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python PDF related posts

Web Extraction with Vision-LLMs Done the Right Way: Structured Data From Any URL with GPT-4o
1 project | dev.to | 22 May 2024

Stirling PDF: Self-hosted, web-based PDF manipulation tool
4 projects | news.ycombinator.com | 2 May 2024

Show HN: I just open sourced my document/website extractor for Vision-LLMs
2 projects | news.ycombinator.com | 2 Apr 2024

Running OCR against PDFs and images directly in the browser
7 projects | news.ycombinator.com | 30 Mar 2024

🔍Underrated Open Source Projects You Should Know About 🧠
9 projects | dev.to | 20 Mar 2024

Launch HN: Onedoc (YC W24) – A better way to create PDFs
11 projects | news.ycombinator.com | 11 Mar 2024

LlamaCloud and LlamaParse
9 projects | news.ycombinator.com | 20 Feb 2024

Index

What are some of the best open-source PDF projects in Python? This list will help you:

#	Project	Stars
1	paperless-ngx	17,416
2	OCRmyPDF	12,291
3	h2ogpt	10,801
4	PyPDF2	7,585
5	WeasyPrint	6,698
6	pdfplumber	5,706
7	pdfminer.six	5,514
8	PDFMiner	5,179
9	PyMuPDF	4,249
10	camelot	3,553
11	borb	3,314
12	pdfarranger	3,077
13	malicious-pdf	2,710
14	Camelot	2,712
15	Papermerge	2,363
16	xhtml2pdf	2,188
17	pdftabextract	2,152
18	tabula-py	2,073
19	pikepdf	2,038
20	pdf2image	1,480
21	arxiv.py	995
22	fpdf2	966
23	pdfium-lib	885
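
The list is stated to be ordered by GitHub stars, which can be checked mechanically. The snippet below is a minimal sketch: the `INDEX` list is transcribed from the table above and `out_of_order` is a hypothetical helper name. Star counts are snapshots, so a small inversion between near-tied projects is plausible.

```python
# (project, stars) rows transcribed from the index table, in listed order.
INDEX = [
    ("paperless-ngx", 17416), ("OCRmyPDF", 12291), ("h2ogpt", 10801),
    ("PyPDF2", 7585), ("WeasyPrint", 6698), ("pdfplumber", 5706),
    ("pdfminer.six", 5514), ("PDFMiner", 5179), ("PyMuPDF", 4249),
    ("camelot", 3553), ("borb", 3314), ("pdfarranger", 3077),
    ("malicious-pdf", 2710), ("Camelot", 2712), ("Papermerge", 2363),
    ("xhtml2pdf", 2188), ("pdftabextract", 2152), ("tabula-py", 2073),
    ("pikepdf", 2038), ("pdf2image", 1480), ("arxiv.py", 995),
    ("fpdf2", 966), ("pdfium-lib", 885),
]


def out_of_order(rows):
    """Return adjacent pairs where the later row has more stars than the earlier one."""
    return [(a, b) for a, b in zip(rows, rows[1:]) if b[1] > a[1]]


print(out_of_order(INDEX))
# → [(('malicious-pdf', 2710), ('Camelot', 2712))]
```

Only rows 13 and 14 break the descending order, and by two stars.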
