Top 8 table-extraction Open-Source Projects
-
pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
-
PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
-
table-transformer
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
-
img2table
img2table is a table identification and extraction Python library for PDFs and images, based on OpenCV image processing.
-
parsee-pdf-reader
Parsee's PDF reader, specialized in extracting tables with numeric values and in accurately extracting and preserving text paragraphs. Full support for scans and images.
Project mention: Running OCR against PDFs and images directly in the browser | news.ycombinator.com | 2024-03-30
Saw this last time but never played with it https://github.com/microsoft/table-transformer
Project mention: What is the best library for processing table data contained within a PDF? | /r/dotnet | 2023-06-23
This looks promising: https://github.com/BobLd/tabula-sharp
Project mention: Parsee.ai – a framework to easily extract complex structured data with LLMs | news.ycombinator.com | 2024-03-31
Yes, another LLM framework. This one is specialized on extracting structured data from various document types (mainly PDFs, images and HTML files).
Comes with a new (separate) PDF extraction library that is focused on the extraction of numeric tables (tables with numbers, so especially for the financial domain): https://github.com/parsee-ai/parsee-pdf-reader
Helps to easily set up a dataset to evaluate the performance of various LLMs on data extraction tasks, e.g. extracting revenue figures from financial reports: https://github.com/parsee-ai/parsee-datasets/tree/main/datas...
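As a rough illustration of what "easily extract text and tables" can look like with pdfplumber from the list above, here is a minimal sketch. The helper names `rows_to_records` and `extract_first_page_tables` are hypothetical and introduced only for this example; the sketch also assumes that the first row of each detected table is a header row, which will not hold for every PDF.

```python
from typing import Dict, List, Optional

# pdfplumber returns each table as a list of rows; cells may be None.
Row = List[Optional[str]]


def rows_to_records(table: List[Row]) -> List[Dict[str, Optional[str]]]:
    """Map each data row onto the header row (assumed to be the first row)."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [dict(zip(header, row)) for row in table[1:]]


def extract_first_page_tables(pdf_path: str) -> List[List[Dict[str, Optional[str]]]]:
    """Return every table found on the first page as a list of records."""
    # Imported lazily so rows_to_records works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]  # raises IndexError if the PDF has no pages
        # extract_tables() yields one list of rows per detected table.
        return [rows_to_records(table) for table in page.extract_tables()]
```

Calling `extract_first_page_tables("report.pdf")`, where `report.pdf` is a placeholder path, would return one list of dictionaries per table that pdfplumber detects. Detection quality depends heavily on how the PDF draws its ruling lines, so the default settings may need tuning for a given document.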
table-extraction related posts
-
Data extraction from pdf
-
Parsing dates with PDFminer
-
[P] OCR + Table Extraction Advice
-
How to Extract Data from Tables in a Public Record PDF
-
How do you parse tables in PDF with langchain? Especially, the context which is few lines above and below the table.
-
[D] Unimpressive improvement in training speed after upgrading from GTX 980 Ti to RTX 4090
-
Code to extract text from pdf to excel
Index
What are some of the best open-source table-extraction projects? This list will help you:
| # | Project | Stars |
|---|---|---|
| 1 | pdfplumber | 5,668 |
| 2 | PyMuPDF | 4,205 |
| 3 | table-transformer | 1,869 |
| 4 | img2table | 387 |
| 5 | ExtractTable-py | 241 |
| 6 | docxtractr | 170 |
| 7 | tabula-sharp | 139 |
| 8 | parsee-pdf-reader | 23 |