Top 5 Python pdf-parser Projects

MinerU

1 5 42,765 9.9 Python

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF into Markdown and JSON formats.

Project mention: Gemini beats everyone on new OCR benchmark | news.ycombinator.com | 2025-02-14

The systems they tested are mostly used as part of a larger system. A fairer comparison would be to use something like MinerU [1] and a proper benchmark like OHR Bench and Reducto's table bench. This paper is really bad...
[1]: https://github.com/opendatalab/MinerU
PyPDF2

2 31 9,363 9.6 Python

A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.

Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03
docstrange

3 3 458 8.9 Python

Extract and convert data from any document (images, PDFs, Word documents, PowerPoint files, or URLs) into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

Project mention: Show HN: A Python library for document parsing with a new 7B parameter model | news.ycombinator.com | 2025-08-31
Lexoid

4 3 78 9.2 Python

Multimodal document parser for high quality data understanding and extraction

Project mention: Keeping multimodal parsing free for all | news.ycombinator.com | 2025-04-02

One step at a time, we are keeping multimodal document understanding, parsing, and information extraction free under a permissive license.
If you like the idea, or if something is missing or something we can do better, give us a nudge.
https://github.com/oidlabs-com/Lexoid
Initial benchmarks:
poste-italiane-parser

5 1 50 4.8 Python

A Python tool to parse PDF statements from Poste Italiane (Postepay, BancoPosta) and extract data as structured JSON.

Project mention: (Python) Poste Italiane document parser | news.ycombinator.com | 2025-07-21

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python pdf-parser discussion

Python pdf-parser related posts

(Python) Poste Italiane document parser

3 projects | news.ycombinator.com | 21 Jul 2025
Keeping multimodal parsing free for all

1 project | news.ycombinator.com | 2 Apr 2025
Show HN: Lexoid – A Library for LLM-Based and Non-LLM-Based Document Parsing

2 projects | news.ycombinator.com | 11 Jan 2025
Yara scanning PDF files

1 project | /r/computerforensics | 1 Jun 2023
I need help install PyPDF2 library on my computer

1 project | /r/learnpython | 12 May 2023

Index

What are some of the best open-source pdf-parser projects in Python? This list will help you:

#	Project	Stars
1	MinerU	42,765
2	PyPDF2	9,363
3	docstrange	458
4	Lexoid	78
5	poste-italiane-parser	50
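
The ordering rule used for this list (projects sorted by GitHub stars, highest first) can be sketched in a few lines of Python. This is only an illustration: the star counts are copied from the table above, while the `projects` list layout and the `rank_by_stars` helper are hypothetical names introduced here, not part of any of the listed libraries.

```python
# Star counts copied from the index table above.
# The (name, stars) tuple layout is an illustrative assumption.
projects = [
    ("Lexoid", 78),
    ("PyPDF2", 9_363),
    ("poste-italiane-parser", 50),
    ("MinerU", 42_765),
    ("docstrange", 458),
]


def rank_by_stars(items):
    """Return (rank, name, stars) tuples, highest star count first."""
    ordered = sorted(items, key=lambda item: item[1], reverse=True)
    return [
        (rank, name, stars)
        for rank, (name, stars) in enumerate(ordered, start=1)
    ]


# Print the ranking in the same tab-separated shape as the table.
for rank, name, stars in rank_by_stars(projects):
    print(f"{rank}\t{name}\t{stars:,}")
```

Sorting on the star count alone means ties would keep their input order, since Python's `sorted` is stable; the list does not say how ties are actually broken.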

Did you know that Python is the 2nd most popular programming language based on number of references?
