Top 18 data-wrangling Open-Source Projects
-
OpenRefine
OpenRefine is a free, open source power tool for working with messy data and improving it
-
dasel
Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.
-
cracking-the-data-science-interview
A Collection of Cheatsheets, Books, Questions, and Portfolio For DS/ML Interview Prep
-
zui
Zui is a powerful desktop application for exploring and working with data. The official front-end to the Zed lake.
-
Optimus
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
prose
Microsoft Program Synthesis using Examples SDK is a framework of technologies for the automatic generation of programs from input-output examples. This repo includes samples and sample data for the Microsoft Program Synthesis using Example SDK. (by microsoft)
-
desbordante-core
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It can also run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
-
prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
-
R-Fundamentals
D-Lab's 4-part, 8-hour introduction to R Fundamentals. Learn how to create variables and functions, manipulate data frames, make visualizations, use control flow structures, and more, using R in RStudio.
-
mongorefine
Experimental headless data wrangling / refining tool over MongoDB, inspired by OpenRefine
-
image-dataset-prepper
A desktop file explorer with keyboard shortcuts to quickly prep images when collecting datasets for training image classification models
Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07
"OpenRefine is a powerful free, open source tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data." https://openrefine.org/
Project mention: Can someone recommend some website for data science interview preparation | /r/datascience | 2023-06-02
Thanks for the detailed feedback @snidane!
As maintainer of qsv, here's my reply:
- Given qsv's rapid release cycle (173 releases over three years), the auto-update check is essential at the moment. Once we reach 1.0, I'll turn it off. For now, given your feedback, I've only made it check 10% of the time.
- Pivot is in the backlog and I'll be sure to add unpivot when I implement it. (https://github.com/jqnatividad/qsv/issues/799)
- I'll add a dedicated summing command with the group by (-by) and window by (-over) capability (https://github.com/jqnatividad/qsv/issues/1514). Do note that `stats` has basic sum as @ezequiel-garzon pointed out.
- With the `enum` command, qsv can achieve what you proposed with `laminate`. E.g. `qsv enum --new-column newcol --constant newconstant mydata.csv --output laminated-data.csv`
- With the `cat rowskey` command, qsv can already concatenate files with mismatched headers.
- Other file formats: qsv supports parquet, csv, tsv, excel, ods, datapackage, sqlite and more (see https://github.com/jqnatividad/qsv/tree/master#file-formats). Fixed-format is not supported yet, but it is quite interesting, and I have added it to the backlog (https://github.com/jqnatividad/qsv/issues/1515).
- As to "enable embedding outputs of commands", qsv is composable by design, so you can use standard stdin/stdout redirection/piping techniques to have it work with other CLI tools like jq, awk, etc.
Finally, I just released v0.120.0, which already incorporates the less aggressive self-update check. https://github.com/jqnatividad/qsv/releases/tag/0.120.0
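For readers unfamiliar with the "laminate" idea discussed in the comment above (adding a column that holds the same constant value in every row), here is a minimal Python sketch of the effect. It is not qsv's implementation and makes no claim about qsv's exact output; the function name `add_constant_column` and the sample data are hypothetical, chosen only for illustration.

```python
import csv
import io


def add_constant_column(csv_text: str, new_column: str, constant: str) -> str:
    """Return csv_text with one extra column whose every value is `constant`.

    Hypothetical helper for illustration only; it assumes the first row
    of the input is a header row.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, row in enumerate(reader):
        # The header row gets the new column name; data rows get the constant.
        writer.writerow(row + [new_column if i == 0 else constant])
    return out.getvalue()


# Made-up sample data, not taken from the thread.
sample = "id,name\n1,alpha\n2,beta\n"
laminated = add_constant_column(sample, "newcol", "newconstant")
print(laminated)
```

Because the function reads and returns plain text, it could also sit in a stdin/stdout pipeline, in the same spirit as the composability point made in the comment.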
Project mention: Are there any Python libraries for Data Cleansing ? | /r/dataengineering | 2023-12-08
data-wrangling related posts
- Joining CSV Data Without SQL: An IP Geolocation Use Case
- Qsv: Ultra-fast CSV data-wrangling toolkit
- Qsv: CSVs sliced, diced and analyzed (fork of xsv)
- Resources to Practice Technical Skills
- How manipulate this CSV in Python?
- Which spreadsheet program do you guys use? (even if it's not emacs related)
- R-Fundamentals: NEW Data - star count:112.0
Index
What are some of the best open-source data-wrangling projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | OpenRefine | 10,465 |
2 | dasel | 4,864 |
3 | cracking-the-data-science-interview | 3,174 |
4 | Data-science-best-resources | 2,755 |
5 | qsv | 2,228 |
6 | zui | 1,733 |
7 | Optimus | 1,446 |
8 | skrub | 1,009 |
9 | prose | 612 |
10 | desbordante-core | 321 |
11 | sqawk | 308 |
12 | qsacnpj | 302 |
13 | prosto | 89 |
14 | pipda | 35 |
15 | R-Fundamentals | 25 |
16 | mongorefine | 2 |
17 | image-dataset-prepper | 1 |
18 | 8-week-sqlchallenge | 0 |