How to Extract Data from Tables in a Public Record PDF

This page summarizes the projects mentioned and recommended in the original post on /r/Journalism

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • pdfplumber

    Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

  • I recently published a story that was based on some data analysis I did of a report I obtained from the Department of Behavioral Health and Developmental Services in VA. I wanted to share a quick walkthrough of how I extracted the data from tables in a PDF using a Python module called PDFplumber. I also uploaded a video to Youtube if you prefer that.

  • tdo

  • That's it! I wrote some additional code to pull the values from the table and clean the data before creating the final visualization.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts