[D] Getting super-level table extraction


CascadeTabNet

1 1,397 0.0 Python

This repository contains the code and implementation details of the CascadeTabNet paper "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents"

Recently, I've been researching how to extract tables from image documents. I first tried PDFs, but data extraction libraries like Camelot are inconsistent. I then found a deep learning model called CascadeTabNet: its detection results are okay, but cell recognition is poor. I also found Multi-Type-TD-TSR for table extraction. It uses image processing techniques to find the grids and performs well on structured, bordered tables, but it breaks down when cells are not properly aligned. Even when extraction succeeds, the post-processing step of aggregating multi-line cells is not obvious.
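The multi-line cell aggregation mentioned in the post can be approached with a simple heuristic. The sketch below is one possible take, not something from the original post or from either library: it assumes the extractor returns rows as equal-length lists of strings, and that a row whose key column (by default the first) is empty continues the row above it. The function name `merge_multiline_rows` and the sample data are hypothetical.

```python
def merge_multiline_rows(rows, key_col=0):
    """Merge continuation rows into the row above them.

    Assumption: a row is a continuation when its key column is empty.
    This fits many tables with an ID or index column, but not all.
    """
    merged = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        is_continuation = bool(merged) and not cells[key_col]
        if is_continuation:
            previous = merged[-1]
            # Append each non-empty fragment to the matching cell above.
            for i, text in enumerate(cells):
                if text:
                    previous[i] = f"{previous[i]} {text}".strip()
        else:
            merged.append(cells)
    return merged


# Hypothetical extractor output: the second row is a wrapped line of row 1.
rows = [
    ["1", "Widget, large", "4.50"],
    ["", "(blue finish)", ""],
    ["2", "Gasket", "0.75"],
]
print(merge_multiline_rows(rows))
```

Tables without a reliable key column would need a different signal, such as the vertical gap between text lines.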

Multi-Type-TD-TSR

4 236 0.0 Jupyter Notebook

Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition:
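As a rough illustration of what finding table grids with image processing can look like, here is a minimal projection-profile sketch. It is not the Multi-Type-TD-TSR pipeline; it assumes a binarized image given as a list of rows of 0/1 values (1 = ink) and ruling lines that span the full table. The names `find_rulings` and `cell_boxes` and the `min_fill` threshold are hypothetical.

```python
def find_rulings(binary, axis, min_fill=0.9):
    """Return indices of rows (axis=0) or columns (axis=1) that are mostly ink."""
    height, width = len(binary), len(binary[0])
    if axis == 0:
        lines, length = binary, width
    else:
        # Transpose so that each "line" is one image column.
        lines = [[binary[y][x] for y in range(height)] for x in range(width)]
        length = height
    return [i for i, line in enumerate(lines) if sum(line) >= min_fill * length]


def cell_boxes(binary, min_fill=0.9):
    """Return cell interiors as (left, top, right, bottom), inclusive pixels."""
    ys = find_rulings(binary, 0, min_fill)
    xs = find_rulings(binary, 1, min_fill)
    boxes = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            # Skip adjacent indices, which belong to one thick ruling.
            if bottom - top > 1 and right - left > 1:
                boxes.append((left + 1, top + 1, right - 1, bottom - 1))
    return boxes


# Synthetic 5x7 image: rulings on rows 0, 2, 4 and columns 0, 3, 6.
H, W = 5, 7
image = [[1 if y in (0, 2, 4) or x in (0, 3, 6) else 0 for x in range(W)]
         for y in range(H)]
print(cell_boxes(image))
```

Because this depends on straight, complete rulings, it only suits bordered tables and would miss borderless or skewed ones.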


donut

19 5,343 3.6 Python

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

This starts with an image transformer, but you could tokenize PDF syntax instead: https://github.com/clovaai/donut

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

Data extraction from pdf
1 project | /r/LocalLLaMA | 11 Dec 2023

[P] OCR + Table Extraction Advice
1 project | /r/MachineLearning | 28 Jun 2023

[D] Unimpressive improvement in training speed after upgrading from GTX 980 Ti to RTX 4090
2 projects | /r/MachineLearning | 7 Jun 2023

Microsoft TableTransformer
1 project | /r/hypeurls | 26 Apr 2023

DeepSeek-V2 integrated, RAGFlow v0.5.0 is released
1 project | news.ycombinator.com | 7 May 2024


This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning.
Tags: table-structure-recognition, table-detection, table-detection-using-deep-learning, Image processing, table-recognition
Post date: 23 Aug 2022
