Top 23 Python PDF Projects
-
MinerU
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON formats.
The systems they tested are mostly used as part of a larger system. A fairer comparison would be to use something like MinerU [1] and a proper benchmark like the OHR Bench and Reducto's table bench. This paper is really bad...
[1]: https://github.com/opendatalab/MinerU
-
paperless-ngx
A community-supported supercharged version of paperless: scan, index and archive all your physical documents
Project mention: Paperless-ngx: scan, index and archive all your physical documents | news.ycombinator.com | 2024-09-30
-
Project mention: Show HN: Kreuzberg – Modern async Python library for document text extraction | news.ycombinator.com | 2025-02-15
Very cool, we've been using https://github.com/DS4SD/docling in our project, but will give this a try :)
-
Project mention: Z-Library Helps Students to Overcome Academic Poverty, Study Finds | news.ycombinator.com | 2024-11-21
and the good old sci-hub.
For paper management, Zotero + the https://github.com/ethanwillis/zotero-scihub plugin makes browsing Google Scholar very efficient.
Also Calibre full-text search with OCR-ed PDFs:
https://github.com/ocrmypdf/OCRmyPDF
makes learning a concept/finding test exercises even easier.
Soon a local LLM to "RAG retrieval on my library" might be the next step.
-
h2ogpt
Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
Project mention: Major Technologies Worth Learning in 2025 for Data Professionals | dev.to | 2024-12-07
Artificial Intelligence (AI) is becoming a ubiquitous, and dare I say, indispensable part of data workflows. Tools like ChatGPT have made it easier to review data and write reports. But diving even deeper, tools like DataRobot, H2O.ai, and Google’s AutoML are also simplifying machine learning pipelines and automating repetitive tasks, enabling professionals to focus on high-value activities like model optimization and data storytelling. Mastering these tools will not only boost productivity but also ensure you remain competitive in an AI-first world.
-
PyPDF2
A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
-
To create our PDF reports, we use a combination of WeasyPrint, Jinja, and Plotly charts. To render a report as a PDF, we first have to render all graphs as images.
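As a rough illustration of the "render all graphs as images first" step described above, the sketch below renders charts concurrently and inlines the results into an HTML page. It uses only the standard library. The names `render_charts`, `build_report_html`, and `fake_render` are hypothetical, and the f-string markup merely stands in for a Jinja template. It is not the quoted author's implementation.

```python
import base64
from concurrent.futures import ThreadPoolExecutor
from html import escape
from typing import Callable, Dict


def render_charts(
    charts: Dict[str, object],
    render_one: Callable[[object], bytes],
    max_workers: int = 4,
) -> Dict[str, bytes]:
    """Render every chart to image bytes concurrently, keyed by chart name."""
    names = list(charts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so names and images stay aligned.
        images = list(pool.map(render_one, (charts[name] for name in names)))
    return dict(zip(names, images))


def build_report_html(title: str, images: Dict[str, bytes]) -> str:
    """Inline each image as a base64 data URI so the HTML is self-contained."""
    parts = [f"<h1>{escape(title)}</h1>"]
    for name, png in images.items():
        encoded = base64.b64encode(png).decode("ascii")
        parts.append(
            f'<figure><img alt="{escape(name)}" '
            f'src="data:image/png;base64,{encoded}">'
            f"<figcaption>{escape(name)}</figcaption></figure>"
        )
    return "<html><body>" + "".join(parts) + "</body></html>"


def fake_render(chart: object) -> bytes:
    # Hypothetical stand-in for a real chart renderer; returns placeholder
    # bytes rather than an actual PNG so the sketch runs without dependencies.
    return f"PNG:{chart}".encode("utf-8")


images = render_charts({"sales": [1, 2, 3], "costs": [4, 5]}, fake_render)
html = build_report_html("Monthly report", images)
```

In a real pipeline, `render_one` could wrap Plotly's `fig.to_image(format="png")`, and the resulting HTML could be handed to WeasyPrint via `HTML(string=html).write_pdf("report.pdf")`. Whether threads or processes parallelize better depends on the renderer, so treat the executor choice here as an assumption.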
-
pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
Project mention: PDF Extraction: Retrieving Text and Tables together using Python🐍 | dev.to | 2024-09-22
Extracting both text and tables can be challenging when working with PDF files due to their complex structure. However, the “pdfplumber” library offers a powerful solution. This article explores an effective method for combining text and table extraction from PDFs using pdfplumber. Special thanks to Karl Genockey a.k.a. cmdlineuser and other contributors for their brilliant approach discussed here.
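A minimal sketch of pulling text and tables from the same pages with pdfplumber might look like the following. It relies on the `pdfplumber.open`, `Page.extract_text`, and `Page.extract_tables` calls. The helper names (`extract_page`, `extract_pdf`, `table_to_records`) and the `FakePage` stub are hypothetical, and this is a simpler approach than the one the linked article describes: it does not interleave text and tables in reading order.

```python
from typing import Any, Dict, List


def extract_page(page: Any) -> Dict[str, Any]:
    """Collect text and tables from one page-like object.

    `page` only needs `extract_text()` and `extract_tables()`, the two
    pdfplumber Page methods this sketch relies on.
    """
    text = page.extract_text() or ""      # guard against None on empty pages
    tables = page.extract_tables() or []  # list of tables; each is a list of rows
    return {"text": text, "tables": tables}


def extract_pdf(path: str) -> List[Dict[str, Any]]:
    """Run extract_page over every page of the PDF at `path`."""
    import pdfplumber  # imported lazily so the helpers above stay dependency-free

    with pdfplumber.open(path) as pdf:
        return [extract_page(page) for page in pdf.pages]


def table_to_records(table: List[List[Any]]) -> List[Dict[Any, Any]]:
    """Treat the first row as a header and map each later row onto it."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


class FakePage:
    # Hypothetical stand-in for a pdfplumber page so the sketch runs without a PDF.
    def extract_text(self) -> str:
        return "Quarterly totals"

    def extract_tables(self) -> List[List[List[str]]]:
        return [[["region", "total"], ["north", "10"], ["south", "12"]]]


result = extract_page(FakePage())
records = table_to_records(result["tables"][0])
```

With a real file, `extract_pdf("report.pdf")` (a placeholder path) would return one dictionary per page. Note that the page text generally still contains the characters inside table cells, so a production version would need to filter those out or merge the two outputs.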
-
PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
Project mention: Ask HN: What are you using to parse PDFs for RAG? | news.ycombinator.com | 2024-07-30
-
MegaParse
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
❄️ Apache Polaris + Iceberg Quickstart ⚡️ How to extract tables from pdfs 🚀 Microsoft 1bit LLM BitNet 🐿️ Verifying Kafka Transactions Entry 2 🐿️ FLUSS: Streaming Storage 🐿️ Fluss -> Flow for Flink Real Time Analytics 🌐 TableFlow - iceberg / kafka ❄️ Snowflake Cortex AI + Slack 🐿️❄️ Door dash flink, kafka, snowflake 🧠 Prompt Stack -- all in one 🔌 SpaCY Layout for PDF 📱 Responsible AI Pathways 📼 Megaparse documents python 🔌 Time Series LLM ❄️ Generate Synthetic Data in Snowflake 🐿️ LLMs and GenAI - When to use them 🐿️ Flink Observability with Prometheus 📡 New SQL GUI 🍫 TDD for GenAI 🕵️ 🎁 Open Source Agent Framework for Production 💻 Cedit command line editor 🏭 ServiceNow AgentLab 🎤 Snowflake Lessons Learned in Replication 🎄 Privastead 🔌 Backup Icloud with nodejs on linux 🔌 Backup Google with nodejs on linux 🎄 HuggingFace macos chat source code 🎁 Ollama working with structured output 🎁 dspy ai how to 🔌 Piazza updater 🔌 Building a financial report with langgraph ColPali Notebook with QWEN 2 VL
-
pdfarranger
Small python-gtk application, which helps the user to merge or split PDF documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface.
-
malicious-pdf
💀 Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator or Interact.sh
-
pdftabextract
A set of tools for extracting tables from PDF files, helping with data mining on (OCR-processed) scanned documents.
-
text-extract-api
Document (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown
Project mention: PDF Extract API Using Ollama with Anonymization and PII Removal | news.ycombinator.com | 2025-01-07
Python PDF related posts
-
Show HN: Kreuzberg – Modern async Python library for document text extraction
-
Announcing Kreuzberg v2.0: A Lightweight, Modern Python Text Extraction library
-
Show HN: HTML visualization of a PDF file's internal structure
-
ROX “Building an AI-Powered Document Retrieval System with Docling and Granite 3.1”
-
Docling: Get your documents ready for gen AI
-
PDF Extract API Using Ollama with Anonymization and PII Removal
-
Converting Plotly charts into images in parallel
-
Index
What are some of the best open-source PDF projects in Python? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | MinerU | 25,747 |
| 2 | paperless-ngx | 25,008 |
| 3 | docling | 20,231 |
| 4 | OCRmyPDF | 18,023 |
| 5 | h2ogpt | 11,658 |
| 6 | PyPDF2 | 8,712 |
| 7 | WeasyPrint | 7,452 |
| 8 | pdfplumber | 7,223 |
| 9 | PyMuPDF | 6,413 |
| 10 | pdfminer.six | 6,206 |
| 11 | MegaParse | 5,275 |
| 12 | pdfarranger | 3,829 |
| 13 | camelot | 3,679 |
| 14 | borb | 3,445 |
| 15 | Camelot | 3,151 |
| 16 | malicious-pdf | 2,932 |
| 17 | Papermerge | 2,597 |
| 18 | xhtml2pdf | 2,279 |
| 19 | pikepdf | 2,261 |
| 20 | tabula-py | 2,228 |
| 21 | pdftabextract | 2,224 |
| 22 | text-extract-api | 2,150 |
| 23 | pdf2image | 1,701 |