Table Extraction and Processing from PDFs - A Tutorial

This page summarizes the projects mentioned and recommended in the original post on dev.to.

  1. llmwhisperer-table-extraction

    Extract tables from PDFs using LLMWhisperer and extract structured information from those tables using Langchain

    The source code for the project can be found on GitHub. To run the extraction script you need two API keys: one for LLMWhisperer and one for the OpenAI API. Read the GitHub project’s README to understand the OS and other dependency requirements. You can sign up for LLMWhisperer, get an API key, and process up to 100 pages per day free of charge.
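Because the script depends on two keys, it can help to check for both before starting a run. The following is a minimal sketch, not code from the project: the environment variable names and the `missing_api_keys` helper are assumptions made for illustration, so check the project's README for the names it actually reads.

```python
import os

# Assumed variable names for illustration; the project's README is authoritative.
REQUIRED_KEYS = ("LLMWHISPERER_API_KEY", "OPENAI_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_api_keys({"OPENAI_API_KEY": "sk-placeholder"})
print(missing)  # ['LLMWHISPERER_API_KEY']
```

Calling `missing_api_keys()` with no argument checks the real process environment, which is the form you would likely use at the top of a script.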

  3. pydantic

    Data validation using Python type hints

    Use Pydantic to declare your data model. LangChain's Pydantic output parser lets you specify an arbitrary Pydantic model and query LLMs for outputs that conform to that schema.
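To make the idea of declaring a data model concrete, here is a small sketch that validates JSON text of the kind an LLM might return for an extracted table. The `LineItem` and `ExtractedTable` models, their fields, the `parse_table` helper, and the sample JSON are all hypothetical and not taken from the original post. It uses only Pydantic itself, without LangChain, and assumes the `pydantic` package is installed.

```python
import json
from typing import List

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    """One row of a hypothetical extracted table."""
    description: str
    amount: float


class ExtractedTable(BaseModel):
    """The whole table: a title plus its rows."""
    title: str
    rows: List[LineItem]


def parse_table(llm_output: str) -> ExtractedTable:
    """Parse JSON text and validate it against the schema."""
    return ExtractedTable(**json.loads(llm_output))


# Hypothetical LLM output; note that the amount arrives as a string.
sample = '{"title": "Invoice", "rows": [{"description": "Widget", "amount": "12.50"}]}'
table = parse_table(sample)
print(table.rows[0].amount)  # 12.5, coerced from the string "12.50"

try:
    parse_table('{"title": "Invoice", "rows": [{"description": "Widget", "amount": "n/a"}]}')
except ValidationError:
    print("rejected: amount is not a number")
```

The benefit of a declared schema is that malformed model output fails loudly at the validation step instead of flowing silently into downstream processing.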

  4. unstract

    No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

    With an API deployment, you expose an API to which you send a PDF or an image and get back structured data in JSON format. With an ETL deployment, you put files into Google Drive, an Amazon S3 bucket, or one of a variety of other sources, and the platform runs the extractions and automatically stores the extracted data in a database or a warehouse such as Snowflake. Unstract is open-source software, available at https://github.com/Zipstack/unstract.



Related posts

  • A Practical Guide on Structuring LLM Outputs with Pydantic

    2 projects | dev.to | 12 Jun 2025
  • Advanced Pydantic: Generic Models, Custom Types, and Performance Tricks

    1 project | dev.to | 5 May 2025
  • Checkbox Extraction from PDFs - A Tutorial

    3 projects | dev.to | 16 Jul 2024
  • JSON extra uses orjson instead of ujson

    4 projects | news.ycombinator.com | 5 Jun 2024
  • Advanced RAG with guided generation

    2 projects | dev.to | 18 Apr 2024
