Python PDF

Open-source Python projects categorized as PDF

Top 23 Python PDF Projects

  1. MinerU

    A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

    Project mention: Gemini beats everyone on new OCR benchmark | news.ycombinator.com | 2025-02-14

    The systems they tested are mostly used as part of a larger system. A fairer comparison would be to use something like MinerU [1] and a proper benchmark like the OHR Bench and Reducto's table bench. This paper is really bad...

    [1]: https://github.com/opendatalab/MinerU

  2. paperless-ngx

    A community-supported supercharged version of paperless: scan, index and archive all your physical documents

    Project mention: Paperless-ngx: scan, index and archive all your physical documents | news.ycombinator.com | 2024-09-30

  3. docling

    Get your documents ready for gen AI

    Project mention: Show HN: Kreuzberg – Modern async Python library for document text extraction | news.ycombinator.com | 2025-02-15

    Very cool, we've been using https://github.com/DS4SD/docling in our project, but will give this a try :)

  4. OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

    Project mention: Z-Library Helps Students to Overcome Academic Poverty, Study Finds | news.ycombinator.com | 2024-11-21

    and the good old sci-hub.

    For paper management, Zotero + the https://github.com/ethanwillis/zotero-scihub plugin makes browsing Google Scholar very efficient.

    Also Calibre full-text search with OCR-ed PDFs:

    https://github.com/ocrmypdf/OCRmyPDF

    makes learning a concept/finding test exercises even easier.

    Soon a local LLM to "RAG retrieval on my library" might be the next step.

  5. h2ogpt

    Private chat with local GPT with documents, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

    Project mention: Major Technologies Worth Learning in 2025 for Data Professionals | dev.to | 2024-12-07

    Artificial Intelligence (AI) is becoming a ubiquitous, and dare I say, indispensable part of data workflows. Tools like ChatGPT have made it easier to review data and write reports. But diving even deeper, tools like DataRobot, H2O.ai, and Google’s AutoML are also simplifying machine learning pipelines and automating repetitive tasks, enabling professionals to focus on high-value activities like model optimization and data storytelling. Mastering these tools will not only boost productivity but also ensure you remain competitive in an AI-first world.

  6. PyPDF2

    A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

  7. WeasyPrint

    The awesome document factory

    Project mention: Converting Plotly charts into images in parallel | dev.to | 2025-01-02

    To create our PDF reports, we use a combination of WeasyPrint, Jinja, and Plotly charts. To render a report as a PDF, we first have to render all graphs as images.

  8. pdfplumber

    Plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables.

    Project mention: PDF Extraction: Retrieving Text and Tables together using Python🐍 | dev.to | 2024-09-22

    Extracting both text and tables can be challenging when working with PDF files due to their complex structure. However, the “pdfplumber” library offers a powerful solution. This article explores an effective method for combining text and table extraction from PDFs using pdfplumber. Special thanks to Karl Genockey a.k.a. cmdlineuser and other contributors for their brilliant approach discussed here.

  9. PyMuPDF

    PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

    Project mention: Ask HN: What are you using to parse PDFs for RAG? | news.ycombinator.com | 2024-07-30

  10. pdfminer.six

    Community maintained fork of pdfminer - we fathom PDF

  11. MegaParse

    File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.

    Project mention: AI and All Data Weekly for 09 Dec 2024 | dev.to | 2024-12-09

    ❄️ Apache Polaris + Iceberg Quickstart ⚡️ How to extract tables from pdfs 🚀 Microsoft 1bit LLM BitNet 🐿️ Verifying Kafka Transactions Entry 2 🐿️ FLUSS: Streaming Storage 🐿️ Fluss -> Flow for Flink Real Time Analytics 🌐 TableFlow - iceberg / kafka ❄️ Snowflake Cortex AI + Slack 🐿️❄️ Door dash flink, kafka, snowflake 🧠 Prompt Stack -- all in one 🔌 SpaCY Layout for PDF 📱 Responsible AI Pathways 📼 Megaparse documents python 🔌 Time Series LLM ❄️ Generate Synthetic Data in Snowflake 🐿️ LLMs and GenAI - When to use them 🐿️ Flink Observability with Prometheus 📡 New SQL GUI 🍫 TDD for GenAI 🕵️ 🎁 Open Source Agent Framework for Production 💻 Cedit command line editor 🏭 ServiceNow AgentLab 🎤 Snowflake Lessons Learned in Replication 🎄 Privastead 🔌 Backup Icloud with nodejs on linux 🔌 Backup Google with nodejs on linux 🎄 HuggingFace macos chat source code 🎁 Ollama working with structured output 🎁 dspy ai how to 🔌 Piazza updater 🔌 Building a financial report with langgraph ColPali Notebook with QWEN 2 VL

  12. pdfarranger

    Small python-gtk application that helps the user merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.

  13. camelot

    Camelot: PDF Table Extraction for Humans (by atlanhq)

  14. borb

    borb is a library for reading, creating and manipulating PDF files in Python.

  15. Camelot

    A Python library to extract tabular data from PDFs

  16. malicious-pdf

    💀 Generate a bunch of malicious PDF files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh

  17. Papermerge

    Open Source Document Management System for Digital Archives (Scanned Documents)

  18. xhtml2pdf

    A library for converting HTML into PDFs using ReportLab

  19. pikepdf

    A Python library for reading and writing PDF, powered by QPDF

  20. tabula-py

    Simple wrapper of tabula-java: extract tables from PDFs into pandas DataFrames

  21. pdftabextract

    A set of tools for extracting tables from PDF files, helping to do data mining on (OCR-processed) scanned documents.

    Project mention: Liberate tabular data from scanned documents | news.ycombinator.com | 2024-12-18

  22. text-extract-api

    Document (PDF, Word, PPTX ...) extraction and parsing API using state-of-the-art OCR and Ollama-supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown

    Project mention: PDF Extract API Using Ollama with Anonymization and PII Removal | news.ycombinator.com | 2025-01-07

  23. pdf2image

    A Python module that wraps the pdftoppm utility to convert PDFs to PIL Image objects

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python PDF related posts

  • Show HN: Kreuzberg – Modern async Python library for document text extraction

    6 projects | news.ycombinator.com | 15 Feb 2025
  • Announcing Kreuzberg v2.0: A Lightweight, Modern Python Text Extraction library

    1 project | dev.to | 15 Feb 2025
  • Show HN: HTML visualization of a PDF file's internal structure

    5 projects | news.ycombinator.com | 10 Feb 2025
  • ROX “Building an AI-Powered Document Retrieval System with Docling and Granite 3.1”

    3 projects | dev.to | 20 Jan 2025
  • Docling: Get your documents ready for gen AI

    1 project | news.ycombinator.com | 16 Jan 2025
  • PDF Extract API Using Ollama with Anonymization and PII Removal

    1 project | news.ycombinator.com | 7 Jan 2025
  • Converting Plotly charts into images in parallel

    2 projects | dev.to | 2 Jan 2025

Index

What are some of the best open-source PDF projects in Python? This list will help you:

Rank  Project            Stars
1     MinerU             25,747
2     paperless-ngx      25,008
3     docling            20,231
4     OCRmyPDF           18,023
5     h2ogpt             11,658
6     PyPDF2             8,712
7     WeasyPrint         7,452
8     pdfplumber         7,223
9     PyMuPDF            6,413
10    pdfminer.six       6,206
11    MegaParse          5,275
12    pdfarranger        3,829
13    camelot            3,679
14    borb               3,445
15    Camelot            3,151
16    malicious-pdf      2,932
17    Papermerge         2,597
18    xhtml2pdf          2,279
19    pikepdf            2,261
20    tabula-py          2,228
21    pdftabextract      2,224
22    text-extract-api   2,150
23    pdf2image          1,701
