Top 23 Python PDF Projects

paperless-ngx

212 16,576 9.9 Python

A community-supported supercharged version of paperless: scan, index and archive all your physical documents

Project mention: I accidentally built a meme search engine | news.ycombinator.com | 2024-04-13

I steered a friend towards Paperless (and away from an LLM solution) as a way of searching/accessing GBs of architectural PDFs recently - so far, it’s apparently working well for them.
https://github.com/paperless-ngx/paperless-ngx
OCRmyPDF

77 11,866 9.6 Python

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

Project mention: TextSnatcher: Copy text from images, for the Linux Desktop | news.ycombinator.com | 2024-03-14

Try https://github.com/ocrmypdf/OCRmyPDF - it uses Tesseract behind the scenes and it is absolutely brilliant.
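
As a rough illustration of adding a searchable text layer, here is a minimal sketch using OCRmyPDF's Python entry point `ocrmypdf.ocr`. The helper names and the `_ocr` file-naming convention are assumptions made for this example, and the call needs both the `ocrmypdf` package and a Tesseract install.

```python
from pathlib import Path


def ocr_output_path(src):
    """Map scan.pdf -> scan_ocr.pdf (the naming convention is our own assumption)."""
    p = Path(src)
    return p.with_name(f"{p.stem}_ocr{p.suffix}")


def add_text_layer(src, language="eng"):
    """Write a copy of `src` with an OCR text layer and return the new path."""
    # Imported lazily so the helper above works without the third-party package.
    import ocrmypdf  # pip install ocrmypdf; Tesseract must be installed separately

    dst = ocr_output_path(src)
    # deskew=True straightens skewed scans before recognition.
    ocrmypdf.ocr(src, dst, language=[language], deskew=True)
    return dst
```

Calling `add_text_layer("scan.pdf")` should leave the original untouched and produce `scan_ocr.pdf` next to it.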
PyPDF2

30 7,311 9.5 Python

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Project mention: Yara scanning PDF files | /r/computerforensics | 2023-06-01
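
To make "splitting" concrete, the sketch below cuts a PDF into fixed-size chunks with PyPDF2's `PdfReader` and `PdfWriter`. The function names, the chunking scheme and the output file pattern are illustrative assumptions, not part of the library.

```python
def chunk_ranges(page_count, chunk_size):
    """Return (start, stop) page index pairs covering page_count pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(src, chunk_size, out_pattern="part_{:03d}.pdf"):
    """Write each chunk of `src` to its own file and return the file names."""
    # Imported lazily so chunk_ranges stays usable without the package.
    from PyPDF2 import PdfReader, PdfWriter  # pip install PyPDF2

    reader = PdfReader(src)
    outputs = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for index, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        out_path = out_pattern.format(index)
        with open(out_path, "wb") as fh:
            writer.write(fh)
        outputs.append(out_path)
    return outputs
```

Merging is the same loop in reverse: add the pages of several readers to a single writer before calling `write`.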
WeasyPrint

43 6,610 9.4 Python

The awesome document factory

Project mention: Launch HN: Onedoc (YC W24) – A better way to create PDFs | news.ycombinator.com | 2024-03-11

Is there a reason you didn't consider something like Weasyprint?
https://weasyprint.org
I've gone through a number of systems to convert CVs, business cards, and other docs, and it hasn't let me down yet.
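
For a sense of the workflow the commenter describes, here is a minimal sketch that renders an HTML string to PDF with WeasyPrint's `HTML(...).write_pdf(...)`. The business-card layout and both function names are hypothetical and exist only for this example.

```python
from html import escape


def business_card_html(name, role, email):
    """Build a tiny HTML document; the layout and fields are illustrative only."""
    return (
        "<html><body style='font-family: sans-serif'>"
        f"<h1>{escape(name)}</h1>"
        f"<p>{escape(role)}</p>"
        f"<p>{escape(email)}</p>"
        "</body></html>"
    )


def render_pdf(html, target):
    """Render an HTML string to a PDF file at `target`."""
    # Imported lazily so the HTML helper works without the package.
    from weasyprint import HTML  # pip install weasyprint

    HTML(string=html).write_pdf(target)
```

Because the input is ordinary HTML and CSS, the same template can be previewed in a browser before it is rendered to PDF.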
pdfplumber

29 5,468 8.4 Python

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

Project mention: Running OCR against PDFs and images directly in the browser | news.ycombinator.com | 2024-03-30
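
A minimal sketch of text and table extraction with pdfplumber, using its `open`, `extract_text` and `extract_tables` calls. The `clean_table` helper and the choice to read only the first page are assumptions made for this example.

```python
def clean_table(rows):
    """Strip whitespace and turn empty (None) cells into empty strings."""
    return [[(cell or "").strip() for cell in row] for row in rows]


def first_page_contents(path):
    """Return (text, tables) for the first page of the PDF at `path`."""
    # Imported lazily so clean_table works without the package.
    import pdfplumber  # pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[0]
        text = page.extract_text()
        tables = [clean_table(table) for table in page.extract_tables()]
    return text, tables
```

Each table comes back as a list of rows, so it can be passed straight to `csv.writer` or a DataFrame constructor.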
pdfminer.six

14 5,404 7.1 Python

Community maintained fork of pdfminer - we fathom PDF

Project mention: Code to extract text from pdf to excel | /r/Python | 2023-06-02

I love to use PDFMiner and PDFQuery for this: https://github.com/pdfminer/pdfminer.six and https://towardsdatascience.com/scrape-data-from-pdf-files-using-python-and-pdfquery-d033721c3b28
PDFMiner

6 5,179 0.0 Python

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
PyMuPDF

5 3,969 9.8 Python

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Project mention: FLaNK Stack for 04 December 2023 | dev.to | 2023-12-04
camelot

1 3,553 0.0 Python

Camelot: PDF Table Extraction for Humans (by atlanhq)

Project mention: How do you parse tables in PDF with langchain? Especially, the context which is few lines above and below the table. | /r/LangChain | 2023-06-23
borb

66 3,277 5.2 Python

borb is a library for reading, creating and manipulating PDF files in python.

Project mention: Caffè Italia * 30/04/23 | /r/italy | 2023-04-30
pdfarranger

93 2,970 9.0 Python

Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

Project mention: Pdftool.org: modify pdfs offline in the browser | news.ycombinator.com | 2023-08-13

On Linux I like to use:
https://github.com/pdfarranger/pdfarranger and https://gitlab.com/scarpetta/pdfmixtool for such tasks.
Camelot

10 2,626 6.9 Python

A Python library to extract tabular data from PDFs

Project mention: Show HN: How do you OCR on a Mac using the CLI or just Python for free | news.ycombinator.com | 2024-01-02

I had good repeated success extracting tables from PDFs using Camelot (Python, https://github.com/camelot-dev/camelot)
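
As a sketch of the table extraction mentioned above, the snippet below calls `camelot.read_pdf` and returns each table as a pandas DataFrame through its `.df` attribute. The helper names are hypothetical; Camelot itself expects `pages` as a comma-separated string.

```python
def pages_spec(page_numbers):
    """Turn page numbers into the comma-separated string Camelot expects."""
    return ",".join(str(n) for n in sorted(set(page_numbers)))


def extract_tables(path, page_numbers):
    """Return one pandas DataFrame per table found on the given pages."""
    # Imported lazily so pages_spec works without the package.
    import camelot  # pip install camelot-py

    tables = camelot.read_pdf(path, pages=pages_spec(page_numbers))
    return [table.df for table in tables]
```

Results depend heavily on how the PDF was produced, so it is worth inspecting the first DataFrame before processing a whole batch.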
malicious-pdf

13 2,585 0.0 Python

💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh

Project mention: Securing PDF Generators Against SSRF Vulnerabilities | /r/netsec | 2023-05-30

Wrote a tool two years ago that does some of the PDF-tests. But more could be added: https://github.com/jonaslejon/malicious-pdf
Papermerge

33 2,317 6.1 Python

Open Source Document Management System for Digital Archives (Scanned Documents)
xhtml2pdf

2 2,166 7.5 Python

A library for converting HTML into PDFs using ReportLab
pdftabextract

0 2,152 0.0 Python

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
tabula-py

4 2,047 7.2 Python

Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
pikepdf

4 2,014 9.5 Python

A Python library for reading and writing PDF, powered by QPDF
pdf2image

1 1,448 5.9 Python

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
arxiv.py

2 965 6.7 Python

Python wrapper for the arXiv API

Project mention: Cannot export query results from the advanced search | /r/arxiv | 2023-10-18

The arXiv API is largely broken. Have a look at this: https://github.com/lukasschwab/arxiv.py/issues/129
fpdf2

23 930 9.3 Python

Simple PDF generation for Python

Project mention: Pylint strict base configuration | dev.to | 2023-05-03

I recently used this approach on fpdf2: PR #780 Hardening Pylint config
pdftotext

1 807 0.0 Python

Simple PDF text extraction

Project mention: Is it possible (and how) to extract highlighted text from a PDF (via API)? | /r/CSEducation | 2023-12-08

I’ve used https://github.com/jalan/pdftotext for automated pdf text parsing before, but I wasn’t doing anything with highlighting and I’m not sure what the formatting for that looks like or if it would be simple or even present in the output of this library. You could give it a test run on one of your pdfs and see though. Best of luck; pdf text extraction is a nightmare, especially if they don’t all come from the exact same source and generation system, since they’re not really text documents.
pypandoc

5 796 6.6 Python

Thin wrapper for "pandoc" (MIT)

Project mention: Web Scraping in Python – The Complete Guide | news.ycombinator.com | 2024-02-20

I recently used [0] Playwright for Python and [1] pypandoc to build a scraper that fetches a webpage and turns the content into sane markdown so that it can be passed into an AI coding chat [2].
They are both very gentle dependencies to add to a project. Both packages contain built-in or scriptable methods to install their underlying platform-specific binary dependencies. This means you don't need to ask end users to use some complex, platform-specific package manager to install Playwright and pandoc.
Playwright lets you scrape pages that rely on JS. Pandoc is great at turning HTML into sensible markdown. Below is an excerpt of the OpenAI pricing docs [3] that have been scraped to markdown [4] in this manner.
[0] https://playwright.dev/python/docs/intro
[1] https://github.com/JessicaTegner/pypandoc
[2] https://github.com/paul-gauthier/aider
[3] https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turb...
[4] https://gist.githubusercontent.com/paul-gauthier/95a1434a28d...
```
  ## GPT-4 and GPT-4 Turbo
```
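
The fetch-then-convert pipeline described in that comment might look roughly like the sketch below. It is not the commenter's code: the function names are hypothetical, and it assumes Playwright's sync API with Chromium installed and a pandoc binary available to pypandoc.

```python
def fetch_rendered_html(url):
    """Load a page in headless Chromium and return the HTML after JS has run."""
    # Imported lazily; requires `pip install playwright` and `playwright install`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html


def html_to_markdown(html):
    """Convert an HTML string to markdown via pandoc."""
    # Imported lazily; requires `pip install pypandoc` and a pandoc binary.
    import pypandoc

    return pypandoc.convert_text(html, "markdown", format="html")


def scrape_to_markdown(url):
    """Fetch a JS-rendered page and return its content as markdown."""
    return html_to_markdown(fetch_rendered_html(url))
```

Keeping the fetch and convert steps separate makes it easy to swap in a plain HTTP client for pages that do not need JavaScript.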

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-13.

Python PDF related posts

Show HN: I just open sourced my document/website extractor for Vision-LLMs
2 projects | news.ycombinator.com | 2 Apr 2024
Running OCR against PDFs and images directly in the browser
7 projects | news.ycombinator.com | 30 Mar 2024
🔍Underrated Open Source Projects You Should Know About 🧠
9 projects | dev.to | 20 Mar 2024
Launch HN: Onedoc (YC W24) – A better way to create PDFs
11 projects | news.ycombinator.com | 11 Mar 2024
LlamaCloud and LlamaParse
9 projects | news.ycombinator.com | 20 Feb 2024
Show HN: A new open-source library to design PDF using React
2 projects | news.ycombinator.com | 17 Feb 2024
1.5M PDFs in 25 Minutes
2 projects | news.ycombinator.com | 15 Feb 2024

Index

What are some of the best open-source PDF projects in Python? This list will help you:

#	Project	Stars
1	paperless-ngx	16,576
2	OCRmyPDF	11,866
3	PyPDF2	7,311
4	WeasyPrint	6,610
5	pdfplumber	5,468
6	pdfminer.six	5,404
7	PDFMiner	5,179
8	PyMuPDF	3,969
9	camelot	3,553
10	borb	3,277
11	pdfarranger	2,970
12	Camelot	2,626
13	malicious-pdf	2,585
14	Papermerge	2,317
15	xhtml2pdf	2,166
16	pdftabextract	2,152
17	tabula-py	2,047
18	pikepdf	2,014
19	pdf2image	1,448
20	arxiv.py	965
21	fpdf2	930
22	pdftotext	807
23	pypandoc	796