[D] Getting super-level table extraction

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • CascadeTabNet

    This repository contains the code and implementation details of the CascadeTabNet paper "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents"

  • Recently, I've been researching how to extract tables from image documents. I first tried PDFs, but data extraction libraries like camelot are inconsistent. I then found a deep learning model called CascadeTabNet: its detection results are okay, but cell recognition is poor. I also found Multi-Type-TD-TSR, which uses image processing techniques to find the grids. It performs well on structured, bordered tables but fails when cells are not properly aligned. Even when extraction succeeds, aggregating multi-line cells (i.e., post-processing) is not straightforward.

  • Multi-Type-TD-TSR

    Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition

  • donut

    Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

  • This approach starts with an image transformer, but you could tokenize PDF syntax instead: https://github.com/clovaai/donut
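
The post quoted above notes that aggregating multi-line cells after extraction is not obvious. As a rough illustration only, the sketch below shows one simple post-processing heuristic. It is not taken from CascadeTabNet, Multi-Type-TD-TSR, camelot, or donut. It assumes the extractor has already produced a list of rows (each a list of cell strings) and that a row with an empty key column continues the row above it. The function name `merge_multiline_rows` and the `key_col` parameter are hypothetical.

```python
from typing import List


def merge_multiline_rows(rows: List[List[str]], key_col: int = 0) -> List[List[str]]:
    """Fold continuation rows into the preceding row.

    Assumed heuristic (not from any of the projects above): a row whose
    key column is empty is treated as the wrapped remainder of the
    previous row's cells.
    """
    merged: List[List[str]] = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        key = cells[key_col] if key_col < len(cells) else ""
        if merged and not key:
            # Continuation row: append each non-empty cell to the cell above it.
            prev = merged[-1]
            for i, text in enumerate(cells):
                if text and i < len(prev):
                    prev[i] = f"{prev[i]} {text}".strip()
        else:
            # A row with a key value starts a new logical record.
            merged.append(cells)
    return merged


# Example: the description of item A1 wraps onto a second physical row.
rows = [
    ["Item", "Description", "Price"],
    ["A1", "Stainless steel", "10.00"],
    ["", "bolt, 5 mm", ""],
    ["B2", "Washer", "0.50"],
]
result = merge_multiline_rows(rows)
```

This heuristic only works when one column is reliably filled on the first line of every record. Tables where the key column itself wraps, or where cells are misaligned as described in the post, would need a different rule, for example one based on cell bounding boxes.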

Related posts

  • Data extraction from pdf

    1 project | /r/LocalLLaMA | 11 Dec 2023
  • [P] OCR + Table Extraction Advice

    1 project | /r/MachineLearning | 28 Jun 2023
  • [D] Unimpressive improvement in training speed after upgrading from GTX 980 Ti to RTX 4090

    2 projects | /r/MachineLearning | 7 Jun 2023
  • Microsoft TableTransformer

    1 project | /r/hypeurls | 26 Apr 2023
  • DeepSeek-V2 integrated, RAGFlow v0.5.0 is released

    1 project | news.ycombinator.com | 7 May 2024