Top 23 Python PDF Projects
- paperless-ngx
  A community-supported supercharged version of paperless: scan, index and archive all your physical documents
- h2ogpt
  Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/
- PyPDF2
  A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
- pdfplumber
  Plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables.
- PyMuPDF
  PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
- pdfarranger
  Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.
- malicious-pdf
  💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh
- pdftabextract
  A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
I steered a friend towards Paperless (and away from an LLM solution) as a way of searching/accessing GBs of architectural PDFs recently - so far, it’s apparently working well for them.
https://github.com/paperless-ngx/paperless-ngx
Project mention: TextSnatcher: Copy text from images, for the Linux Desktop | news.ycombinator.com | 2024-03-14
Try https://github.com/ocrmypdf/OCRmyPDF - it uses Tesseract behind the scenes and it is absolutely brilliant.
Project mention: Multi AI Agent Systems Using OpenAI's New GPT-4o Model | news.ycombinator.com | 2024-05-17
Project mention: Launch HN: Onedoc (YC W24) – A better way to create PDFs | news.ycombinator.com | 2024-03-11
Is there a reason you didn't consider something like Weasyprint?
https://weasyprint.org
I've gone through a number of systems to convert CVs, business cards, and other docs, and it hasn't let me down yet.
Project mention: Running OCR against PDFs and images directly in the browser | news.ycombinator.com | 2024-03-30
I love to use PDFMiner and PDFQuery for this: https://github.com/pdfminer/pdfminer.six and https://towardsdatascience.com/scrape-data-from-pdf-files-using-python-and-pdfquery-d033721c3b28
Project mention: How do you parse tables in PDF with langchain? Especially, the context which is few lines above and below the table. | /r/LangChain | 2023-06-23
Project mention: Pdftool.org: modify pdfs offline in the browser | news.ycombinator.com | 2023-08-13
On Linux I like to use https://github.com/pdfarranger/pdfarranger and https://gitlab.com/scarpetta/pdfmixtool for such tasks.
Project mention: Show HN: How do you OCR on a Mac using the CLI or just Python for free | news.ycombinator.com | 2024-01-02
I had good repeated success extracting tables from PDFs using Camelot (Python, https://github.com/camelot-dev/camelot).
The arXiv API is largely broken. Have a look at this: https://github.com/lukasschwab/arxiv.py/issues/129.
I am letting people know of permissive alternatives (https://github.com/paulocoutinhox/pdfium-lib) and their usage in web components (PDFium.wasm + PDF.js).
My comment also serves as a promise to open-source my components under the same permissive license.
I don't want people exposed to unnecessary stress.
Python PDF related posts
- Web Extraction with Vision-LLMs Done the Right Way: Structured Data From Any URL with GPT-4o
- Stirling PDF: Self-hosted, web-based PDF manipulation tool
- Show HN: I just open sourced my document/website extractor for Vision-LLMs
- Running OCR against PDFs and images directly in the browser
- 🔍Underrated Open Source Projects You Should Know About 🧠
- Launch HN: Onedoc (YC W24) – A better way to create PDFs
- LlamaCloud and LlamaParse
-
A note from our sponsor - InfluxDB
www.influxdata.com | 30 May 2024
Index
What are some of the best open-source PDF projects in Python? This list will help you:
Rank | Project | Stars |
---|---|---|
1 | paperless-ngx | 17,416 |
2 | OCRmyPDF | 12,291 |
3 | h2ogpt | 10,801 |
4 | PyPDF2 | 7,585 |
5 | WeasyPrint | 6,698 |
6 | pdfplumber | 5,706 |
7 | pdfminer.six | 5,514 |
8 | PDFMiner | 5,179 |
9 | PyMuPDF | 4,249 |
10 | camelot | 3,553 |
11 | borb | 3,314 |
12 | pdfarranger | 3,077 |
13 | malicious-pdf | 2,710 |
14 | Camelot | 2,712 |
15 | Papermerge | 2,363 |
16 | xhtml2pdf | 2,188 |
17 | pdftabextract | 2,152 |
18 | tabula-py | 2,073 |
19 | pikepdf | 2,038 |
20 | pdf2image | 1,480 |
21 | arxiv.py | 995 |
22 | fpdf2 | 966 |
23 | pdfium-lib | 885 |