Python PDF

Open-source Python projects categorized as PDF

Top 23 Python PDF Projects

  • OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

    Project mention: Recovering redacted information from pixelated videos | news.ycombinator.com | 2022-01-25

    Not off the shelf but here are some tools. I have no experience with them.

    Wolf binarization - I think it makes the text more clear before OCR.

    https://github.com/chriswolfvision/local_adaptive_binarizati...

    This thing OCRs the pdf using Tesseract OCR

    https://github.com/ocrmypdf/OCRmyPDF/

    Two other pdf tools

    https://github.com/qpdf/qpdf

    https://github.com/pikepdf/pikepdf

  • WeasyPrint

    The awesome document factory

    Project mention: QuestPDF 2021.10 - a new version of the open-source, MIT-licensed, C# library for generating PDF documents with fluent API, now with extended text capabilities. Please help me make it popular :) | reddit.com/r/csharp | 2021-10-06

    I’d recommend Weasyprint (.net core wrapper) instead of wkhtmltopdf. It supports CSS Paged Media which is pretty much required for everything but the simplest of HTML2PDF conversions.

  • PDFMiner

    Python PDF Parser (Not actively maintained). Check out pdfminer.six.

    Project mention: Desktop File Search, How does it work! python or C# library libraries? | reddit.com/r/learnpython | 2021-09-12

  • PyPDF2

    A utility to read and write PDFs with Python

    Project mention: How do I get python to read .pdf's in a directory? | reddit.com/r/learnpython | 2022-01-18

    I've had some success reading/writing PDFs using this library. Not sure if it will help you, but it's worth a shot.

  • pdfminer.six

    Community maintained fork of pdfminer - we fathom PDF

    Project mention: Completely crazy tables when transforming table from PDF file to CSV | reddit.com/r/Python | 2022-01-27

    Having said that, I've had pretty decent luck with pdfminer.six for various extractions. Sometimes a PDF converts to decently structured HTML as well, so you might look into dumping the PDF to HTML and then parsing it with an HTML library like lxml, or with html.parser from the Python standard library. This assumes the structured HTML is actually usable; you can check by dumping the PDF to an .html document and opening it in a web browser.
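
    As a rough illustration of the parsing half of that suggestion, the sketch below uses only html.parser from the standard library to pull text fragments out of HTML. It assumes the PDF has already been dumped to HTML by some other tool; SAMPLE_HTML is a made-up stand-in for such a dump, and TextCollector and extract_text_fragments are hypothetical names, not part of pdfminer.six or any other library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text fragments, skipping <style> and <script> content."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip_depth = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only runs and anything inside style/script.
        if text and self._skip_depth == 0:
            self.fragments.append(text)


def extract_text_fragments(html):
    """Return the non-empty text fragments of an HTML string, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.fragments


# Hypothetical stand-in for HTML dumped from a PDF; real dumps are usually
# messier, with absolutely positioned <div>/<span> elements.
SAMPLE_HTML = """
<html><head><style>span { font-size: 9px; }</style></head>
<body>
<div><span>Item</span><span>Qty</span></div>
<div><span>Widget</span><span>3</span></div>
</body></html>
"""

fragments = extract_text_fragments(SAMPLE_HTML)
print(fragments)
```

    Whether the fragments can be regrouped into rows and columns depends entirely on how the dump is structured, which is why checking it in a browser first is worthwhile.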

  • borb

    borb is a library for reading, creating and manipulating PDF files in python.

    Project mention: borb, the pure Python PDF library | reddit.com/r/Python | 2022-01-14

    Get borb from source on GitHub, or download it from PyPI.

  • pdfplumber

    Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

    Project mention: How to search a PDF file for text matching keywords | reddit.com/r/learnpython | 2022-01-14

  • pdftabextract

    A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.

  • tabula-py

    Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame

    Project mention: Completely crazy tables when transforming table from PDF file to CSV | reddit.com/r/Python | 2022-01-27

  • Papermerge

    Open Source Document Management System for Digital Archives (Scanned Documents)

    Project mention: Paperless-ng on virtual Linux on Windows? | reddit.com/r/linux | 2021-12-02

    There also are https://docspell.org/ and https://www.papermerge.com/ if it has to be a server or multi user.

  • Camelot

    A Python library to extract tabular data from PDFs

    Project mention: Need help with indexing pdf tables in python | reddit.com/r/programminghelp | 2021-12-12

  • malicious-pdf

    Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh

    Project mention: Malicious PDF Generator: Used for penetration testing and/or #redteaming 🐱‍🏍 | reddit.com/r/u_esgeeks | 2021-09-04

  • pdfarranger

    Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

    Project mention: How to copy/paste or export columned text with proper formatting? | reddit.com/r/pdf | 2022-01-17

  • pdf2image

    A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

    Project mention: Is pdf2image Secure? | reddit.com/r/learnpython | 2021-04-05

    There's no official code audit or anything like that for pdf2image. However, the GitHub repository for pdf2image has over 700 stars, indicating significant community activity. Chances are good somebody would notice if the author did something fishy.

  • pikepdf

    A Python library for reading and writing PDF, powered by qpdf

    Project mention: Recovering redacted information from pixelated videos | news.ycombinator.com | 2022-01-25

  • CairoSVG

    Convert your vector images

  • arxiv.py

    Python wrapper for the arXiv API

    Project mention: [D] Best Literature Review Practices | reddit.com/r/MachineLearning | 2021-04-15

    For your reference, here's an arXiv API crawler: https://github.com/lukasschwab/arxiv.py

  • rst2pdf

    Use a text editor. Make a PDF.

    Project mention: Rst2pdf/rst2pdf: Use a text editor. Make a PDF | news.ycombinator.com | 2021-11-20

  • latex-online

    Online latex compiler. You give it a link, it gives you PDF

    Project mention: Compile latex from private Github Repository. | reddit.com/r/github | 2021-06-11

    LaTeX Online uses a script that you can download and run offline; see the docs here: https://github.com/aslushnikov/latex-online#installation

  • fpdf2

    Simple PDF generation for Python

    Project mention: New Release for fpdf2: vector drawing API & SVG embedding | reddit.com/r/pythonnews | 2022-01-23

  • Mayan EDMS

    Free Open Source Document Management System (mirror, no pull request or issues)

    Project mention: Hosted SAAS Basic ElasticSearch or SOLR Options? | reddit.com/r/devops | 2022-01-19

    Are you sure you don't need a document management system like Mayan EDMS?

  • stapler

    A small utility making use of the pypdf library to provide a (somewhat) lighter alternative to pdftk

    Project mention: Papermerge (almost) 2.0 is out! | reddit.com/r/selfhosted | 2021-02-25

    In the UI you can cut pages from one document and paste them into another document. Afterwards you can sort/reorder the pages. Up until version 2.0, Papermerge used pdftk for the "cut" and "paste" operations. Because of pdftk's licensing (plus its dependency on Java), it was replaced by stapler, which is a pure-Python equivalent of pdftk. Stapler is BSD licensed.

  • Zydra

    Project mention: Password protected PDF | reddit.com/r/hacking | 2021-10-06

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-27.

Index

What are some of the best open-source PDF projects in Python? This list will help you:

 #  Project          Stars
 1  OCRmyPDF         5,835
 2  WeasyPrint       4,748
 3  PDFMiner         4,726
 4  PyPDF2           3,994
 5  pdfminer.six     3,345
 6  borb             2,492
 7  pdfplumber       2,418
 8  pdftabextract    1,968
 9  tabula-py        1,545
10  Papermerge       1,461
11  Camelot          1,395
12  malicious-pdf    1,193
13  pdfarranger      1,180
14  pdf2image          915
15  pikepdf            894
16  CairoSVG           538
17  arxiv.py           498
18  rst2pdf            454
19  latex-online       406
20  fpdf2              332
21  Mayan EDMS         309
22  stapler            254
23  Zydra              251