Top 23 Python PDF Projects
-
paperless-ngx
A community-supported supercharged version of paperless: scan, index and archive all your physical documents
I steered a friend towards Paperless (and away from an LLM solution) as a way of searching/accessing GBs of architectural PDFs recently - so far, it’s apparently working well for them.
-
OCRmyPDF
Project mention: TextSnatcher: Copy text from images, for the Linux Desktop | news.ycombinator.com | 2024-03-14
Try https://github.com/ocrmypdf/OCRmyPDF - it uses Tesseract behind the scenes and it is absolutely brilliant.
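For anyone who wants to script the tool recommended above, here is a minimal sketch that drives the `ocrmypdf` command line from Python. The helper names (`build_ocrmypdf_command`, `ocr_pdf`) and the file names are illustrative assumptions, not part of OCRmyPDF; only the `-l` and `--skip-text` options are the tool's own.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng", skip_text=True):
    """Build the argument list for the ocrmypdf command-line tool (hypothetical helper)."""
    cmd = ["ocrmypdf", "-l", language]
    if skip_text:
        # --skip-text leaves pages that already contain text untouched.
        cmd.append("--skip-text")
    cmd += [str(src), str(dst)]
    return cmd


def ocr_pdf(src, dst, **options):
    """Run OCRmyPDF on src and write a searchable PDF to dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)
    return Path(dst)
```

Calling `ocr_pdf("scan.pdf", "scan_ocr.pdf", language="eng")` would then produce a searchable copy, provided OCRmyPDF and Tesseract are installed.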
-
PyPDF2
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
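As a rough illustration of the splitting described above, the sketch below cuts a PDF into fixed-size chunks. It assumes a PyPDF2 release that provides `PdfReader`, `PdfWriter`, and `add_page` (the 2.x/3.x naming); `chunk_ranges`, `split_pdf`, and the output file naming are hypothetical choices made for this example.

```python
from pathlib import Path


def chunk_ranges(total_pages, pages_per_file):
    """Return (start, stop) page-index pairs covering total_pages in order."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, total_pages))
        for start in range(0, total_pages, pages_per_file)
    ]


def split_pdf(src, out_dir, pages_per_file=10):
    """Write src out as several smaller PDFs and return their paths."""
    # Imported lazily so the pure helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(str(src))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for start, stop in chunk_ranges(len(reader.pages), pages_per_file):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. report_1-10.pdf.
        target = out_dir / f"{Path(src).stem}_{start + 1}-{stop}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        outputs.append(target)
    return outputs
```

Merging is the same loop in reverse: add the pages of several readers to one `PdfWriter` and write once.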
-
WeasyPrint
Project mention: Launch HN: Onedoc (YC W24) – A better way to create PDFs | news.ycombinator.com | 2024-03-11
Is there a reason you didn't consider something like Weasyprint?
I've gone through a number of systems to convert CVs, business cards, and other docs and it hasn't let me down yet.
-
pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
Project mention: Running OCR against PDFs and images directly in the browser | news.ycombinator.com | 2024-03-30
I love to use PDFMiner and PDFQuery for this https://github.com/pdfminer/pdfminer.six https://towardsdatascience.com/scrape-data-from-pdf-files-using-python-and-pdfquery-d033721c3b28
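To make the pdfplumber description above concrete, here is a small sketch that pulls text and tables page by page. It relies on `pdfplumber.open`, `page.extract_text()`, and `page.extract_tables()`; the helper names and the assumption that each table's first row is a header are illustrative and will not hold for every document.

```python
def rows_to_records(table):
    """Turn a table (list of rows, first row taken as the header) into dicts."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def extract_text_and_tables(path):
    """Collect the text and tables of every page in a PDF."""
    # Imported lazily so rows_to_records can be used without pdfplumber.
    import pdfplumber

    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            pages.append(
                {
                    "number": page.page_number,
                    # extract_text() may return nothing for image-only pages.
                    "text": page.extract_text() or "",
                    "tables": [rows_to_records(t) for t in page.extract_tables()],
                }
            )
    return pages
```

Scanned documents carry no text layer, so they would need an OCR pass before this returns anything useful.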
-
-
PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
-
Project mention: How do you parse tables in PDF with langchain? Especially, the context which is few lines above and below the table. | /r/LangChain | 2023-06-23
-
-
pdfarranger
Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.
Project mention: Pdftool.org: modify pdfs offline in the browser | news.ycombinator.com | 2023-08-13
On Linux I like to use https://github.com/pdfarranger/pdfarranger and https://gitlab.com/scarpetta/pdfmixtool for such tasks.
-
Camelot
Project mention: Show HN: How do you OCR on a Mac using the CLI or just Python for free | news.ycombinator.com | 2024-01-02
I had good repeated success extracting tables from PDFs using Camelot (Python, https://github.com/camelot-dev/camelot)
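A hedged sketch of what table extraction with Camelot can look like: `camelot.read_pdf`, the `flavor` argument, `.df`, and `.parsing_report` are Camelot's own, while `filter_by_accuracy`, `extract_tables`, and the 90% threshold are assumptions made for this example.

```python
def filter_by_accuracy(tables, min_accuracy=90.0):
    """Keep only tables whose parsing report meets the accuracy threshold."""
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]


def extract_tables(path, pages="1", flavor="lattice", min_accuracy=90.0):
    """Return pandas DataFrames for the well-parsed tables in a PDF."""
    # Imported lazily so the filter above can be tried without Camelot.
    import camelot

    # "lattice" targets tables drawn with ruling lines; "stream" is the
    # alternative for tables separated only by whitespace.
    tables = camelot.read_pdf(str(path), pages=pages, flavor=flavor)
    return [t.df for t in filter_by_accuracy(tables, min_accuracy)]
```

Camelot works on text-based PDFs, so scanned pages would need OCR first.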
-
malicious-pdf
💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh
Wrote a tool two years ago that does some of the PDF-tests. But more could be added: https://github.com/jonaslejon/malicious-pdf
-
-
-
pdftabextract
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
-
-
-
-
arxiv.py
The arXiv API is largely broken. Have a look at this: https://github.com/lukasschwab/arxiv.py/issues/129.
-
fpdf2
I recently used this approach on fpdf2: PR #780 Hardening Pylint config
-
pdftotext
Project mention: Is it possible (and how) to extract highlighted text from a PDF (via API)? | /r/CSEducation | 2023-12-08
I’ve used https://github.com/jalan/pdftotext for automated pdf text parsing before, but I wasn’t doing anything with highlighting and I’m not sure what the formatting for that looks like or if it would be simple or even present in the output of this library. You could give it a test run on one of your pdfs and see though. Best of luck; pdf text extraction is a nightmare, especially if they don’t all come from the exact same source and generation system, since they’re not really text documents.
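The "test run" suggested above could look roughly like this. `pdftotext.PDF` is the library's entry point and yields one string per page; `read_pages` and `find_pages` are hypothetical helpers. The sketch only shows the plain text that comes out, so it does not settle whether highlight information survives.

```python
def find_pages(pages, needle):
    """Return 1-based numbers of the pages whose text contains needle."""
    needle = needle.lower()
    return [
        number
        for number, text in enumerate(pages, start=1)
        if needle in text.lower()
    ]


def read_pages(path):
    """Extract the text of each page of a PDF as a list of strings."""
    # Imported lazily so find_pages can be tried without pdftotext installed.
    import pdftotext

    with open(path, "rb") as handle:
        return list(pdftotext.PDF(handle))
```

Printing `read_pages("sample.pdf")[0]` for a page with known highlights is a quick way to see what the output actually contains.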
-
pypandoc
I recently used [0] Playwright for Python and [1] pypandoc to build a scraper that fetches a webpage and turns the content into sane markdown so that it can be passed into an AI coding chat [2].
They are both very gentle dependencies to add to a project. Both packages contain built-in or scriptable methods to install their underlying platform-specific binary dependencies. This means you don't need to ask end users to use some complex, platform-specific package manager to install playwright and pandoc.
Playwright lets you scrape pages that rely on JS. Pandoc is great at turning HTML into sensible markdown. Below is an excerpt of the openai pricing docs [3] that have been scraped to markdown [4] in this manner.
[0] https://playwright.dev/python/docs/intro
[1] https://github.com/JessicaTegner/pypandoc
[2] https://github.com/paul-gauthier/aider
[3] https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turb...
[4] https://gist.githubusercontent.com/paul-gauthier/95a1434a28d...
## GPT-4 and GPT-4 Turbo
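The scraper described in the comment above is not shown, so the following is only a sketch of the general approach, not the commenter's code. It uses Playwright's sync API to render the page and `pypandoc.convert_text` to turn the HTML into markdown; the function names and the injectable `fetch`/`convert` parameters are assumptions made to keep the pieces easy to swap and test.

```python
def fetch_rendered_html(url):
    """Load url in headless Chromium and return the HTML after JS has run."""
    # Imported lazily; browsers are installed with `playwright install`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()


def html_to_markdown(html):
    """Convert an HTML string to markdown using pandoc."""
    # Imported lazily; pypandoc.download_pandoc() can fetch the pandoc binary.
    import pypandoc

    return pypandoc.convert_text(html, "markdown", format="html")


def scrape_to_markdown(url, fetch=fetch_rendered_html, convert=html_to_markdown):
    """Fetch a page and return its content as markdown."""
    return convert(fetch(url))
```

Passing stand-in functions for `fetch` and `convert` makes it possible to exercise the pipeline without a browser or pandoc installed.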
Python PDF related posts
- Show HN: I just open sourced my document/website extractor for Vision-LLMs
- Running OCR against PDFs and images directly in the browser
- 🔍Underrated Open Source Projects You Should Know About 🧠
- Launch HN: Onedoc (YC W24) – A better way to create PDFs
- LlamaCloud and LlamaParse
- Show HN: A new open-source library to design PDF using React
- 1.5M PDFs in 25 Minutes
Index
What are some of the best open-source PDF projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | paperless-ngx | 16,576 |
2 | OCRmyPDF | 11,866 |
3 | PyPDF2 | 7,311 |
4 | WeasyPrint | 6,610 |
5 | pdfplumber | 5,468 |
6 | pdfminer.six | 5,404 |
7 | PDFMiner | 5,179 |
8 | PyMuPDF | 3,969 |
9 | camelot | 3,553 |
10 | borb | 3,277 |
11 | pdfarranger | 2,970 |
12 | Camelot | 2,626 |
13 | malicious-pdf | 2,585 |
14 | Papermerge | 2,317 |
15 | xhtml2pdf | 2,166 |
16 | pdftabextract | 2,152 |
17 | tabula-py | 2,047 |
18 | pikepdf | 2,014 |
19 | pdf2image | 1,448 |
20 | arxiv.py | 965 |
21 | fpdf2 | 930 |
22 | pdftotext | 807 |
23 | pypandoc | 796 |