-
CascadeTabNet
This repository contains the code and implementation details of the CascadeTabNet paper "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents"
-
Multi-Type-TD-TSR
Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition:
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
donut
Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
Recently, I've been researching extracting tables from image documents. First I tried with pdfs, however, the data extraction libraries like camelot are inconsistent. I found a deep learning model called CascadeTabNet. The detection results are okay but cell recognition is poor. I even found Multi-Type-TD-TSR for table extraction. It uses image processing techniques to find the grids. It performs well on structured and bordered tables. However, it messes up if the cell is not properly aligned. Even if extraction is successful, aggregation of multi-line cells, i.e post-processing, is not very obvious.
Recently, I've been researching extracting tables from image documents. First I tried with pdfs, however, the data extraction libraries like camelot are inconsistent. I found a deep learning model called CascadeTabNet. The detection results are okay but cell recognition is poor. I even found Multi-Type-TD-TSR for table extraction. It uses image processing techniques to find the grids. It performs well on structured and bordered tables. However, it messes up if the cell is not properly aligned. Even if extraction is successful, aggregation of multi-line cells, i.e post-processing, is not very obvious.
this starts with an image transformer but you could tokenize PDF syntax instead: https://github.com/clovaai/donut