Python PDF

Open-source Python projects categorized as PDF

Top 23 Python PDF Projects

  • paperless-ngx

    A community-supported supercharged version of paperless: scan, index and archive all your physical documents

  • Project mention: I accidentally built a meme search engine | news.ycombinator.com | 2024-04-13

    I steered a friend towards Paperless (and away from an LLM solution) as a way of searching/accessing GBs of architectural PDFs recently - so far, it’s apparently working well for them.

    https://github.com/paperless-ngx/paperless-ngx

  • OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

  • Project mention: TextSnatcher: Copy text from images, for the Linux Desktop | news.ycombinator.com | 2024-03-14

    Try https://github.com/ocrmypdf/OCRmyPDF - it uses Tesseract behind the scenes and it's absolutely brilliant.

  • h2ogpt

    Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/

  • Project mention: Multi AI Agent Systems Using OpenAI's New GPT-4o Model | news.ycombinator.com | 2024-05-17
  • PyPDF2

    A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

  • Project mention: Yara scanning PDF files | /r/computerforensics | 2023-06-01
  • WeasyPrint

    The awesome document factory

  • Project mention: Launch HN: Onedoc (YC W24) – A better way to create PDFs | news.ycombinator.com | 2024-03-11

    Is there a reason you didn't consider something like Weasyprint?

    https://weasyprint.org

    I've gone through a number of systems to convert CVs, business cards, and other docs and it hasn't let me down yet.

  • pdfplumber

    Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

  • Project mention: Running OCR against PDFs and images directly in the browser | news.ycombinator.com | 2024-03-30
  • pdfminer.six

    Community maintained fork of pdfminer - we fathom PDF

  • Project mention: Code to extract text from pdf to excel | /r/Python | 2023-06-02

    I love to use PDFMiner and PDFQuery for this https://github.com/pdfminer/pdfminer.six https://towardsdatascience.com/scrape-data-from-pdf-files-using-python-and-pdfquery-d033721c3b28

  • PDFMiner

    Python PDF Parser (Not actively maintained). Check out pdfminer.six.

  • PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

  • Project mention: FLaNK Stack for 04 December 2023 | dev.to | 2023-12-04
  • camelot

    Camelot: PDF Table Extraction for Humans (by atlanhq)

  • Project mention: How do you parse tables in PDF with langchain? Especially, the context which is few lines above and below the table. | /r/LangChain | 2023-06-23
  • borb

    borb is a library for reading, creating and manipulating PDF files in python.

  • pdfarranger

    Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

  • Project mention: Pdftool.org: modify pdfs offline in the browser | news.ycombinator.com | 2023-08-13

    On Linux I like to use:

    https://github.com/pdfarranger/pdfarranger and https://gitlab.com/scarpetta/pdfmixtool for such tasks.

  • malicious-pdf

    💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh

  • Camelot

    A Python library to extract tabular data from PDFs

  • Project mention: Show HN: How do you OCR on a Mac using the CLI or just Python for free | news.ycombinator.com | 2024-01-02

    I had good repeated success extracting tables from PDFs using Camelot (Python, https://github.com/camelot-dev/camelot)

  • Papermerge

    Open Source Document Management System for Digital Archives (Scanned Documents)

  • xhtml2pdf

    A library for converting HTML into PDFs using ReportLab

  • pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • tabula-py

    Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

  • pikepdf

    A Python library for reading and writing PDF, powered by QPDF

  • pdf2image

    A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

  • arxiv.py

    Python wrapper for the arXiv API

  • Project mention: Cannot export query results from the advanced search | /r/arxiv | 2023-10-18

    the arxiv api is largely broken. Have a look at this: https://github.com/lukasschwab/arxiv.py/issues/129.

  • fpdf2

    Simple PDF generation for Python

  • pdfium-lib

    PDFium - Project to compile PDFium library to multiple platforms.

  • Project mention: MuPDF WASM Viewer Demo | news.ycombinator.com | 2024-04-20

    I am letting people know of permissive alternatives (https://github.com/paulocoutinhox/pdfium-lib) and their usage in web components (PDFium.wasm + PDF.js).

    My comment also serves as a promise to open-source my components under the same permissive license.

    I don't want people exposed to unnecessary stress.

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python PDF related posts

  • Web Extraction with Vision-LLMs Done the Right Way: Structured Data From Any URL with GPT-4o

    1 project | dev.to | 22 May 2024
  • Stirling PDF: Self-hosted, web-based PDF manipulation tool

    4 projects | news.ycombinator.com | 2 May 2024
  • Show HN: I just open sourced my document/website extractor for Vision-LLMs

    2 projects | news.ycombinator.com | 2 Apr 2024
  • Running OCR against PDFs and images directly in the browser

    7 projects | news.ycombinator.com | 30 Mar 2024
  • 🔍Underrated Open Source Projects You Should Know About 🧠

    9 projects | dev.to | 20 Mar 2024
  • Launch HN: Onedoc (YC W24) – A better way to create PDFs

    11 projects | news.ycombinator.com | 11 Mar 2024
  • LlamaCloud and LlamaParse

    9 projects | news.ycombinator.com | 20 Feb 2024

Index

What are some of the best open-source PDF projects in Python? This list will help you:

#   Project         Stars
1   paperless-ngx   17,416
2   OCRmyPDF        12,291
3   h2ogpt          10,801
4   PyPDF2          7,585
5   WeasyPrint      6,698
6   pdfplumber      5,706
7   pdfminer.six    5,514
8   PDFMiner        5,179
9   PyMuPDF         4,249
10  camelot         3,553
11  borb            3,314
12  pdfarranger     3,077
13  malicious-pdf   2,710
14  Camelot         2,712
15  Papermerge      2,363
16  xhtml2pdf       2,188
17  pdftabextract   2,152
18  tabula-py       2,073
19  pikepdf         2,038
20  pdf2image       1,480
21  arxiv.py        995
22  fpdf2           966
23  pdfium-lib      885
