Python pdf-parser

Open-source Python projects categorized as pdf-parser

Top 5 Python pdf-parser Projects

  1. MinerU

    A one-stop, open-source, high-quality data extraction tool for converting PDF to Markdown and JSON.

    Project mention: Gemini beats everyone on new OCR benchmark | news.ycombinator.com | 2025-02-14

    The system they tested are mostly used as a part of a larger system. A more fair comparison would be to use something like MinerU [1] and proper benchmark like the OHR Bench and Reductos table bench. This paper is really bad...

    [1]: https://github.com/opendatalab/MinerU

  2. PyPDF2

    A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

    Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03

  3. docstrange

    Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

    Project mention: Show HN: A Python library for document parsing with a new 7B parameter model | news.ycombinator.com | 2025-08-31

  4. Lexoid

    Multimodal document parser for high quality data understanding and extraction

    Project mention: Keeping multimodal parsing free for all | news.ycombinator.com | 2025-04-02

    One step at a time, we are keeping multimodal document understanding, parsing, and information extraction free under a permissive license.

    If you like the idea, or if something is missing or something we can do better, give us a nudge.

    https://github.com/oidlabs-com/Lexoid

    Initial benchmarks:

  5. poste-italiane-parser

    A Python tool to parse PDF statements from Poste Italiane (Postepay, BancoPosta) and extract data as structured JSON.

    Project mention: (Python) Poste Italiane document parser | news.ycombinator.com | 2025-07-21

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source pdf-parser projects in Python? This list will help you:

#  Project                 Stars
1  MinerU                 42,765
2  PyPDF2                  9,363
3  docstrange                458
4  Lexoid                     78
5  poste-italiane-parser      50
