Top 5 Python pdf-parser Projects
-
MinerU
A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.
The systems they tested are mostly used as part of a larger system. A fairer comparison would use something like MinerU [1] and proper benchmarks such as OHR Bench and Reducto's table bench. This paper is really bad...
[1]: https://github.com/opendatalab/MinerU
-
PyPDF2
A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.
-
docstrange
Extract and convert data from any document, image, PDF, Word doc, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
Project mention: Show HN: A Python library for document parsing with a new 7B parameter model | news.ycombinator.com | 2025-08-31
-
One step at a time, we are keeping multimodal document understanding, parsing, and information extraction free under a permissive license.
If you like the idea, or if something is missing or something we can do better, give us a nudge.
https://github.com/oidlabs-com/Lexoid
Initial benchmarks:
-
poste-italiane-parser
A Python tool to parse PDF statements from Poste Italiane (Postepay, BancoPosta) and extract data as structured JSON.
Python pdf-parser related posts
-
(Python) Poste Italiane document parser
-
Keeping multimodal parsing free for all
-
Show HN: Lexoid – A Library for LLM-Based and Non-LLM-Based Document Parsing
-
Yara scanning PDF files
-
I need help install PyPDF2 library on my computer
Index
What are some of the best open-source pdf-parser projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | MinerU | 42,765 |
2 | PyPDF2 | 9,363 |
3 | docstrange | 458 |
4 | Lexoid | 78 |
5 | poste-italiane-parser | 50 |