table-extraction

Open-source projects categorized as table-extraction
Languages: Python, R, C#

Top 8 table-extraction Open-Source Projects

  • pdfplumber

    Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

  • Project mention: Running OCR against PDFs and images directly in the browser | news.ycombinator.com | 2024-03-30
  • PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

  • Project mention: FLaNK Stack for 04 December 2023 | dev.to | 2023-12-04
  • table-transformer

    Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

  • Project mention: Data extraction from pdf | /r/LocalLLaMA | 2023-12-11

    Saw this last time but never played with it https://github.com/microsoft/table-transformer

  • img2table

    img2table is a table identification and extraction Python library for PDFs and images, based on OpenCV image processing.

  • ExtractTable-py

    Python library to extract tabular data from images and scanned PDFs

  • docxtractr

    :scissors: Extract Tables from Microsoft Word Documents with R

  • tabula-sharp

    Extract tables from PDF files (port of tabula-java)

  • Project mention: What is the best library for processing table data contained within a PDF? | /r/dotnet | 2023-06-23

    This looks promising: https://github.com/BobLd/tabula-sharp

  • parsee-pdf-reader

    Parsee's PDF reader, specialized on the extraction of tables with numeric values and the accurate extraction and preservation of text-paragraphs. Full support for scans and images.

  • Project mention: Parsee.ai – a framework to easily extract complex structured data with LLMs | news.ycombinator.com | 2024-03-31

    Yes, another LLM framework. This one is specialized on extracting structured data from various document types (mainly PDFs, images and HTML files).

    Comes with a new (separate) PDF extraction library that is focused on the extraction of numeric tables (tables with numbers, so especially for the financial domain): https://github.com/parsee-ai/parsee-pdf-reader

    Helps to easily set up a dataset to evaluate the performance of various LLMs on data extraction tasks, e.g. extracting revenue figures from financial reports: https://github.com/parsee-ai/parsee-datasets/tree/main/datas...

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
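
As a rough illustration of what the top-ranked project on this list is used for, here is a minimal sketch of pulling tables out of a PDF with pdfplumber. It relies on pdfplumber's `open()` and `Page.extract_tables()` calls; the helper names `clean_table` and `extract_tables` are hypothetical and exist only for this example, and the default table-detection settings are assumed to be good enough for the document at hand.

```python
from typing import List, Optional

Row = List[Optional[str]]


def clean_table(rows: List[Row]) -> List[List[str]]:
    """Replace empty (None) cells with "" and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in rows]


def extract_tables(pdf_path: str) -> List[List[List[str]]]:
    """Return every table found in the PDF as a list of cleaned rows.

    Hypothetical helper for illustration; not part of pdfplumber itself.
    """
    # Imported here so clean_table() stays usable without the dependency.
    import pdfplumber  # third-party: pip install pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() yields one list of rows per detected table;
            # cells with no text come back as None.
            for raw in page.extract_tables():
                tables.append(clean_table(raw))
    return tables
```

Calling `extract_tables("report.pdf")` (the file name is a placeholder) would return one list of rows per detected table. Scanned or image-only PDFs generally need an OCR-based tool such as img2table or ExtractTable-py instead.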

table-extraction related posts

  • Data extraction from pdf

    1 project | /r/LocalLLaMA | 11 Dec 2023
  • Parsing dates with PDFminer

    1 project | /r/learnpython | 4 Jul 2023
  • [P] OCR + Table Extraction Advice

    1 project | /r/MachineLearning | 28 Jun 2023
  • How to Extract Data from Tables in a Public Record PDF

    2 projects | /r/Journalism | 26 Jun 2023
  • How do you parse tables in PDF with langchain? Especially, the context which is few lines above and below the table.

    4 projects | /r/LangChain | 23 Jun 2023
  • [D] Unimpressive improvement in training speed after upgrading from GTX 980 Ti to RTX 4090

    2 projects | /r/MachineLearning | 7 Jun 2023
  • Code to extract text from pdf to excel

    2 projects | /r/Python | 2 Jun 2023

Index

What are some of the best open-source table-extraction projects? This list will help you:

Rank  Project             Stars
1     pdfplumber          5,668
2     PyMuPDF             4,205
3     table-transformer   1,869
4     img2table             387
5     ExtractTable-py       241
6     docxtractr            170
7     tabula-sharp          139
8     parsee-pdf-reader      23
