Top 23 tabular-data Open-Source Projects

react-virtualized

40 25,936 1.6 JavaScript

React components for efficiently rendering large lists and tabular data

Project mention: The Secret Weapon of Top Developers: 7 React JS Libraries You Can't Afford to Ignore | dev.to | 2024-02-21

React Virtualized improves rendering performance for large lists and tabular data. React apps perform better overall when the number of requests and DOM elements is kept small. Many comparable tools exist, but React Virtualized stands out for its breadth of features and active maintenance.
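
The idea of limiting DOM elements can be illustrated with a minimal sketch of the windowing arithmetic behind a virtualized list. This is not react-virtualized's API: the function name `visible_range`, its parameters, the fixed row height, and the `overscan` margin are all assumptions made for illustration.

```python
from math import ceil


def visible_range(scroll_top, viewport_height, row_height, row_count, overscan=2):
    """Return (start, stop) row indices to render; stop is exclusive.

    Hypothetical helper: assumes every row has the same pixel height.
    """
    if row_count <= 0 or row_height <= 0:
        return (0, 0)
    # First row that intersects the top edge of the viewport.
    first = int(scroll_top // row_height)
    # One past the last row that intersects the bottom edge.
    last = ceil((scroll_top + viewport_height) / row_height)
    # Render a few extra rows on each side so scrolling does not flash blanks.
    stop = min(row_count, last + overscan)
    start = min(max(0, first - overscan), stop)
    return (start, stop)


# A 300px viewport over 1,000 rows of 30px each, scrolled 3,000px down.
start, stop = visible_range(3000, 300, 30, 1000)
rows_to_render = stop - start
```

Under these assumptions only a handful of rows is rendered at any scroll position, no matter how many rows the full list contains.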

miller

63 8,553 9.1 Go

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

Project mention: Qsv: Efficient CSV CLI Toolkit | news.ycombinator.com | 2023-12-22
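
To make "name-indexed data" concrete, here is a rough sketch in Python's standard `csv` module of addressing fields by header name rather than by column position. It does not use Miller or reproduce its commands; the helper `select_sorted` is hypothetical, and the sample rows simply reuse three star counts from this list.

```python
import csv
import io

# Sample rows for illustration, reusing star counts from this list.
SAMPLE = """name,lang,stars
miller,Go,8553
visidata,Python,7409
vaex,Python,8173
"""


def select_sorted(text, keep, sort_by):
    """Keep only the named columns and sort rows by one named numeric column."""
    # DictReader keys each row by the header line, so fields are looked up
    # by name and the code keeps working if the column order changes.
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda r: int(r[sort_by]), reverse=True)
    return [{k: r[k] for k in keep} for r in rows]


result = select_sorted(SAMPLE, keep=["name", "stars"], sort_by="stars")
```

Because every lookup goes through a field name, the same call works on a file whose columns are reordered, which is the property the Miller description highlights.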

vaex

7 8,173 6.0 Python

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
visidata

36 7,409 9.8 Python

A terminal spreadsheet multitool for discovering and arranging data

Project mention: Fx – Terminal JSON Viewer | news.ycombinator.com | 2023-09-19

[4] "Is it possible to "flatten" structured data (like JSON?)": https://github.com/saulpw/visidata/discussions/1605

autogluon

8 7,091 9.6 Python

AutoGluon: Fast and Accurate ML in 3 Lines of Code

Project mention: pip install remyxai - easiest way to create custom vision models | /r/computervision | 2023-04-25

This seems not very convincing. There are other popular frameworks that provide AutoML with existing datasets (eg https://github.com/autogluon/autogluon)

FLAML

9 3,671 8.3 Jupyter Notebook

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.

Project mention: AutoGen: Enabling Next-Gen GPT-X Applications | news.ycombinator.com | 2023-08-22

I really like the simplicity of this framework, and they hit on a lot of common problems found in other agent-based frameworks. Most intrigued by the RAG improvements.
Seems like Microsoft was frustrated with the pace of movement in this space and the shitty results of agents (which admittedly kept my interest turned away from agents for the last few months). I'm interested again because it makes practical sense, and from looking at the example notebooks, seems fairly easy to integrate into existing applications.
Maybe this is the 'low code' approach that might actually work, and bridge together engineering and non-engineering resources.
This example was what caught my eye: https://github.com/microsoft/FLAML/blob/main/notebook/autoge...

tad

3 3,013 7.6 TypeScript

A desktop application for viewing and analyzing tabular data

Project mention: Show HN: Open-source, browser-local data exploration using DuckDB-WASM and PRQL | news.ycombinator.com | 2024-03-15

Very impressive project and vision! Love the demo!
I am also ex-GS and worked on what I am fairly sure is the table display tool you're describing. I tried to carry the essential aspects of that work (multi-level pivots, with drill-down to the leaf level, and all interactive events and analytics supported by db queries) to Tad (https://www.tadviewer.com/, https://github.com/antonycourtney/tad), another open source project powered by DuckDb.
An embeddable version of Tad, powered by DuckDb WASM, is used as the results viewer in the MotherDuck Web UI (https://app.motherduck.com/).
If you're interested in embedding Tad in Pretzel, or leveraging pieces of it in your work, or collaborating on other aspects of DuckDb WASM powered UIs, please get in touch!

tabnet

8 2,476 4.8 Python

PyTorch implementation of the TabNet paper: https://arxiv.org/pdf/1908.07442.pdf
Alpaca-CoT

1 2,463 9.1 Jupyter Notebook

We unified the interfaces of instruction-tuning data (e.g., CoT data), multiple LLMs and parameter-efficient methods (e.g., LoRA, p-tuning) together for easy use. We welcome open-source enthusiasts to initiate any meaningful PR on this repo and integrate as many LLM-related technologies as possible. We have built a fine-tuning platform that makes it easy for researchers to get started with and use large models, and we welcome open-source enthusiasts to submit any meaningful PR!
Auto-PyTorch

4 2,274 0.0 Python

Automatic architecture search and hyperparameter optimization for PyTorch

Project mention: [Project] AMLTK: A framework for building your own AutoML (AutoSklearn authors) | /r/MachineLearning | 2023-12-09

We took some of the lessons learned while building AutoSklearn and AutoPytorch (the good, the bad and the ugly) and made a library to enable the next generation of open-source AutoML tools, allowing them to be research-able but also efficient and scalable. We have some future plans and ongoing work with this, and we'd like to gather any feedback the community might have!

sketch

20 2,194 4.4 Python

AI code-writing assistant that understands data content

Project mention: Ask HN: What have you built with LLMs? | news.ycombinator.com | 2024-02-05

We've made a lot of data tooling things based on LLMs, and are in the process of rebranding and launching our main product.
1. sketch (in notebook, ai for pandas) https://github.com/approximatelabs/sketch
2. datadm (open source, "chat with data", with support for open source LLMs) https://github.com/approximatelabs/datadm
3. Our main product: julyp. https://julyp.com/ (currently under very active rebrand and cleanup) -- but a "chat with data" style app, with a lot of specialized features. I'm also streaming me using it (and sometimes building it) every weekday on twitch to solve misc data problems (https://www.twitch.tv/bluecoconut)
For your next question, about the stack and deploy:

alibi-detect

9 2,082 7.6 Python

Algorithms for outlier, adversarial and drift detection

Project mention: Exploring Open-Source Alternatives to Landing AI for Robust MLOps | dev.to | 2023-12-13

Numerous tools exist for detecting anomalies in time series data, but Alibi Detect stood out to me, particularly for its capabilities and its compatibility with both TensorFlow and PyTorch backends.

tidy-viewer

28 2,020 4.3 Rust

📺(tv) Tidy Viewer is a cross-platform CLI csv pretty printer that uses column styling to maximize viewer enjoyment.

Project mention: Csvlens: Command line CSV file viewer. Like less but made for CSV | news.ycombinator.com | 2024-01-06

DataFrames.jl

9 1,690 7.0 Julia

In-memory tabular data in Julia
tsv-utils

10 1,396 0.0 D

eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.

Project mention: Frawk: An efficient Awk-like programming language. (2021) | news.ycombinator.com | 2024-04-21

If you need just csv/tsv parsing, you can also take a look at https://github.com/eBay/tsv-utils

DataProfiler

61 1,357 6.3 Python

What's in your data? Extract schema, statistics and entities from datasets

Project mention: LongRoPE: Extending LLM Context Window Beyond 2M Tokens | news.ycombinator.com | 2024-02-22

It's been possible to skip tokenization for a long time, my team and I did it here - https://github.com/capitalone/DataProfiler
For what it's worth, we actually were working with LSTMs with nearly a billion params back in 2016-2017 area. Transformers made it far more effective to train and execute, but ultimately LSTMs are able to achieve similar results, though slow & require more training data.

pytorch-widedeep

7 1,234 8.5 Python

A flexible package for multimodal-deep-learning to combine tabular data with text and images using Wide and Deep models in Pytorch
ktrain

2 1,210 8.4 Jupyter Notebook

ktrain is a Python library that makes deep learning and AI more accessible and easier to apply
CTGAN

2 1,136 7.9 Python

Conditional GAN for generating synthetic tabular data.

Project mention: Ctgan: Generating synthetic data in Python using GANs | news.ycombinator.com | 2024-02-05

Transformers4Rec

4 1,025 5.3 Python

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
virtua

2 933 9.8 TypeScript

A zero-config, fast and small (~3kB) virtual list (and grid) component for React, Vue and Solid.

Project mention: Show HN: Virtua – zero-config virtualization components for React | news.ycombinator.com | 2023-07-17

rows

1 860 5.1 Python

A common, beautiful interface to tabular data, no matter the format
tab-transformer-pytorch

1 698 4.5 Python

Implementation of TabTransformer, attention network for tabular data, in Pytorch

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

tabular-data related posts

Frawk: An efficient Awk-like programming language. (2021)
4 projects | news.ycombinator.com | 21 Apr 2024
Ctgan: Generating synthetic data in Python using GANs
1 project | news.ycombinator.com | 5 Feb 2024
[Project] AMLTK: A framework for building your own AutoML (AutoSklearn authors)
2 projects | /r/MachineLearning | 9 Dec 2023
What is the best library for processing table data contained within a PDF?
2 projects | /r/dotnet | 23 Jun 2023
Ask HN: What's a good library/command line tool to extract tables from PDFs?
2 projects | news.ycombinator.com | 10 Jun 2023
Building a database to search Excel files
1 project | /r/Database | 15 Apr 2023
Julia's latency: Past, present and future
1 project | news.ycombinator.com | 1 Apr 2023

Index

What are some of the best open-source tabular-data projects? This list will help you:

Rank	Project	Stars
1	react-virtualized	25,936
2	miller	8,553
3	vaex	8,173
4	visidata	7,409
5	autogluon	7,091
6	FLAML	3,671
7	tad	3,013
8	tabnet	2,476
9	Alpaca-CoT	2,463
10	Auto-PyTorch	2,274
11	sketch	2,194
12	alibi-detect	2,082
13	tidy-viewer	2,020
14	DataFrames.jl	1,690
15	tsv-utils	1,396
16	DataProfiler	1,357
17	pytorch-widedeep	1,234
18	ktrain	1,210
19	CTGAN	1,136
20	Transformers4Rec	1,025
21	virtua	933
22	rows	860
23	tab-transformer-pytorch	698