Python pdf-parsing

Open-source Python projects categorized as pdf-parsing

Top 4 Python pdf-parsing Projects

  1. PyPDF2

    A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

    Project mention: Using Docling’s OCR features with RapidOCR | dev.to | 2025-04-03
  2. pdfplumber

    Plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables.

    Project mention: PDF Extraction: Retrieving Text and Tables together using Python🐍 | dev.to | 2024-09-22

    Extracting both text and tables can be challenging when working with PDF files due to their complex structure. However, the “pdfplumber” library offers a powerful solution. This article explores an effective method for combining text and table extraction from PDFs using pdfplumber. Special thanks to Karl Genockey a.k.a. cmdlineuser and other contributors for their brilliant approach discussed here.

  3. py-pdf-parser

    A Python tool to help extract information from structured PDFs.

  4. pdf-to-markdown

    Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

    Project mention: Parse Markdown content from PDF documents | news.ycombinator.com | 2024-12-03
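
The PyPDF2 entry above mentions splitting and merging pages. A minimal sketch of both operations follows, assuming PyPDF2 2.x or later (where `PdfReader`, `PdfWriter`, and `add_page` are available). The helper names `chunk_ranges`, `split_pdf`, and `merge_pdfs`, and the output file naming, are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def chunk_ranges(n_pages: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) page-index ranges that cover n_pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, n_pages))
            for start in range(0, n_pages, chunk_size)]


def split_pdf(src_path: str, chunk_size: int, out_prefix: str = "part") -> List[str]:
    """Write each chunk of pages to its own file and return the file names."""
    # Imported lazily so chunk_ranges() works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    out_paths = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for i, (start, stop) in enumerate(ranges, 1):
        writer = PdfWriter()
        for page_index in range(start, stop):
            writer.add_page(reader.pages[page_index])
        out_path = f"{out_prefix}_{i}.pdf"  # hypothetical naming scheme
        with open(out_path, "wb") as fh:
            writer.write(fh)
        out_paths.append(out_path)
    return out_paths


def merge_pdfs(src_paths: List[str], out_path: str) -> None:
    """Concatenate the pages of several PDFs into one output file."""
    from PyPDF2 import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in src_paths:
        for page in PdfReader(path).pages:
            writer.add_page(page)
    with open(out_path, "wb") as fh:
        writer.write(fh)
```

For example, `split_pdf("report.pdf", 10)` would write `part_1.pdf`, `part_2.pdf`, and so on, ten pages each; `report.pdf` is a placeholder file name.
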
NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
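
The pdfplumber entry above refers to retrieving text and tables together. One possible way to do that is sketched below; it is not necessarily the approach from the mentioned article. It assumes a pdfplumber version that provides `Page.find_tables()` and `Page.extract_text_lines()`. The helpers `inside`, `interleave`, and `extract_page` are hypothetical names. The idea is to drop text lines that fall within a table's bounding box, then order the remaining lines and the tables by vertical position.

```python
from typing import Any, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def inside(line: Dict[str, Any], bbox: BBox) -> bool:
    """True when a text line's vertical midpoint lies in bbox and it overlaps it horizontally."""
    x0, top, x1, bottom = bbox
    mid_y = (line["top"] + line["bottom"]) / 2
    return top <= mid_y <= bottom and line["x0"] < x1 and line["x1"] > x0


def interleave(lines: List[Dict[str, Any]],
               tables: List[Tuple[BBox, List[List[Any]]]]) -> List[Tuple[str, Any]]:
    """Merge text lines and tables into top-to-bottom reading order.

    lines:  dicts with "text", "x0", "x1", "top", "bottom" keys.
    tables: (bbox, rows) pairs.
    """
    items = [(bbox[1], "table", rows) for bbox, rows in tables]
    for line in lines:
        # Skip text that is already represented by a table's cells.
        if not any(inside(line, bbox) for bbox, _ in tables):
            items.append((line["top"], "text", line["text"]))
    items.sort(key=lambda item: item[0])
    return [(kind, payload) for _, kind, payload in items]


def extract_page(pdf_path: str, page_number: int = 0) -> List[Tuple[str, Any]]:
    """Return ("text", str) and ("table", rows) items for one page."""
    # Imported lazily so interleave() works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        tables = [(table.bbox, table.extract()) for table in page.find_tables()]
        lines = page.extract_text_lines()
    return interleave(lines, tables)
```

Sorting by the `top` coordinate alone suits single-column pages; multi-column layouts would need an additional horizontal ordering step.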

Python pdf-parsing related posts

  • PDF Extraction: Retrieving Text and Tables together using Python🐍

    1 project | dev.to | 22 Sep 2024
  • Running OCR against PDFs and images directly in the browser

    7 projects | news.ycombinator.com | 30 Mar 2024
  • Parsing dates with PDFminer

    1 project | /r/learnpython | 4 Jul 2023
  • How to Extract Data from Tables in a Public Record PDF

    2 projects | /r/Journalism | 26 Jun 2023
  • Code to extract text from pdf to excel

    2 projects | /r/Python | 2 Jun 2023
  • I need to parse unstructured tables from a pdf into a json, what can I do

    1 project | /r/computervision | 21 May 2023
  • Advanced PDF to Excel with documents and example code

    2 projects | /r/learnpython | 1 May 2023

Index

What are some of the best open-source pdf-parsing projects in Python? This list will help you:

#  Project          Stars
1  PyPDF2           9,363
2  pdfplumber       8,209
3  py-pdf-parser      412
4  pdf-to-markdown     92

