Python PDF

Open-source Python projects categorized as PDF

Top 23 Python PDF Projects

  • paperless-ngx

    A community-supported supercharged version of paperless: scan, index and archive all your physical documents

  • Project mention: I accidentally built a meme search engine | news.ycombinator.com | 2024-04-13

    I steered a friend towards Paperless (and away from an LLM solution) as a way of searching/accessing GBs of architectural PDFs recently - so far, it’s apparently working well for them.

    https://github.com/paperless-ngx/paperless-ngx

  • OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

  • Project mention: TextSnatcher: Copy text from images, for the Linux Desktop | news.ycombinator.com | 2024-03-14

    Try https://github.com/ocrmypdf/OCRmyPDF - it uses Tesseract behind the scenes and it is absolutely brilliant.

  • PyPDF2

    A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

  • Project mention: Yara scanning PDF files | /r/computerforensics | 2023-06-01

  • WeasyPrint

    The awesome document factory

  • Project mention: Launch HN: Onedoc (YC W24) – A better way to create PDFs | news.ycombinator.com | 2024-03-11

    Is there a reason you didn't consider something like Weasyprint?

    https://weasyprint.org

    I've gone through a number of systems to convert CVs, business cards, and other docs and it hasn't let me down yet.

  • pdfplumber

    Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

  • Project mention: Running OCR against PDFs and images directly in the browser | news.ycombinator.com | 2024-03-30

  • pdfminer.six

    Community maintained fork of pdfminer - we fathom PDF

  • Project mention: Code to extract text from pdf to excel | /r/Python | 2023-06-02

    I love to use PDFMiner and PDFQuery for this https://github.com/pdfminer/pdfminer.six https://towardsdatascience.com/scrape-data-from-pdf-files-using-python-and-pdfquery-d033721c3b28

  • PDFMiner

    Python PDF Parser (Not actively maintained). Check out pdfminer.six.

  • PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

  • Project mention: FLaNK Stack for 04 December 2023 | dev.to | 2023-12-04

  • camelot

    Camelot: PDF Table Extraction for Humans (by atlanhq)

  • Project mention: How do you parse tables in PDF with langchain? Especially, the context which is few lines above and below the table. | /r/LangChain | 2023-06-23

  • borb

    borb is a library for reading, creating and manipulating PDF files in Python.

  • Project mention: Caffè Italia * 30/04/23 | /r/italy | 2023-04-30

  • pdfarranger

    Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

  • Project mention: Pdftool.org: modify pdfs offline in the browser | news.ycombinator.com | 2023-08-13

    On Linux I like to use:

    https://github.com/pdfarranger/pdfarranger and https://gitlab.com/scarpetta/pdfmixtool for such tasks.

  • Camelot

    A Python library to extract tabular data from PDFs

  • Project mention: Show HN: How do you OCR on a Mac using the CLI or just Python for free | news.ycombinator.com | 2024-01-02

    I had good repeated success extracting tables from PDFs using Camelot (Python, https://github.com/camelot-dev/camelot)

  • malicious-pdf

    💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh

  • Project mention: Securing PDF Generators Against SSRF Vulnerabilities | /r/netsec | 2023-05-30

    Wrote a tool two years ago that does some of the PDF-tests. But more could be added: https://github.com/jonaslejon/malicious-pdf

  • Papermerge

    Open Source Document Management System for Digital Archives (Scanned Documents)

  • xhtml2pdf

    A library for converting HTML into PDFs using ReportLab

  • pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • tabula-py

    Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

  • pikepdf

    A Python library for reading and writing PDF, powered by QPDF

  • pdf2image

    A Python module that wraps the pdftoppm utility to convert PDF to PIL Image object

  • arxiv.py

    Python wrapper for the arXiv API

  • Project mention: Cannot export query results from the advanced search | /r/arxiv | 2023-10-18

    the arxiv api is largely broken. Have a look at this: https://github.com/lukasschwab/arxiv.py/issues/129.

  • fpdf2

    Simple PDF generation for Python

  • Project mention: Pylint strict base configuration | dev.to | 2023-05-03

    I recently used this approach on fpdf2: PR #780 Hardening Pylint config

  • pdfium-lib

    PDFium - Project to compile PDFium library to multiple platforms.

  • Project mention: MuPDF WASM Viewer Demo | news.ycombinator.com | 2024-04-20

    I am letting people know of permissive alternatives (https://github.com/paulocoutinhox/pdfium-lib) and their usage in web components (PDFium.wasm + PDF.js).

    My comment also serves as a promise to open-source my components under the same permissive license.

    I don't want people exposed to unnecessary stress.

  • pdftotext

    Simple PDF text extraction

  • Project mention: Is it possible (and how) to extract highlighted text from a PDF (via API)? | /r/CSEducation | 2023-12-08

    I’ve used https://github.com/jalan/pdftotext for automated pdf text parsing before, but I wasn’t doing anything with highlighting and I’m not sure what the formatting for that looks like or if it would be simple or even present in the output of this library. You could give it a test run on one of your pdfs and see though. Best of luck; pdf text extraction is a nightmare, especially if they don’t all come from the exact same source and generation system, since they’re not really text documents.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source PDF projects in Python? This list will help you:

#   Project         Stars
1   paperless-ngx  16,576
2   OCRmyPDF       11,936
3   PyPDF2          7,359
4   WeasyPrint      6,635
5   pdfplumber      5,527
6   pdfminer.six    5,430
7   PDFMiner        5,179
8   PyMuPDF         4,002
9   camelot         3,553
10  borb            3,283
11  pdfarranger     2,990
12  Camelot         2,631
13  malicious-pdf   2,585
14  Papermerge      2,325
15  xhtml2pdf       2,169
16  pdftabextract   2,152
17  tabula-py       2,049
18  pikepdf         2,019
19  pdf2image       1,448
20  arxiv.py          971
21  fpdf2             930
22  pdfium-lib        870
23  pdftotext         814
