|about 2 months ago||3 days ago|
|Mozilla Public License 2.0||BSD 3-clause "New" or "Revised" License|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
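As a rough illustration of how such a relative score could be computed, here is a minimal sketch. It is not the site's actual formula: the exponential decay, the 30-day half-life, the percentile-to-scale mapping, and the function names `weighted_commit_score` and `activity_index` are all assumptions made for this example.

```python
from bisect import bisect_left


def weighted_commit_score(commit_ages_days, half_life_days=30.0):
    """Sum per-commit weights so that recent commits count more.

    Assumption: a commit loses half its weight every `half_life_days`.
    """
    return sum(0.5 ** (age / half_life_days) for age in commit_ages_days)


def activity_index(score, all_scores):
    """Map a raw score to a 0-10 scale by percentile rank (assumed mapping)."""
    ranked = sorted(all_scores)
    below = bisect_left(ranked, score)  # projects with a strictly lower score
    return round(10.0 * below / len(ranked), 1)


# Hypothetical commit ages (in days) for three made-up projects.
projects = {
    "busy": [1, 2, 3, 5, 8],
    "steady": [10, 40, 70],
    "dormant": [300, 400],
}
scores = {name: weighted_commit_score(ages) for name, ages in projects.items()}
indices = {
    name: activity_index(score, list(scores.values()))
    for name, score in scores.items()
}
```

Under these assumptions, the highest-scoring project out of ten tracked projects has nine projects below it and gets an index of 9.0, which matches the "top 10%" reading above.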
Will Rust-based data frame library Polars dethrone Pandas? We evaluate on 1M+ Stack Overflow questions
9 projects | reddit.com/r/rust | 25 May 2022
We didn't conduct our own benchmarks for this post, but in this comparison from ~1 year ago, Polars emerged as the fastest https://h2oai.github.io/db-benchmark/
Amazon Redshift Re-Invented
1 project | news.ycombinator.com | 24 May 2022
DuckDB is in constant development but doesn't yet have a native cross-version export/import feature (since its developers claim DuckDB hasn't reached the maturity to stabilise its on-disk format just yet).
I also keep an eye on https://h2oai.github.io/db-benchmark/. Pola.rs and DataFusion sound the most exciting.
It also remains to be seen how Databricks' delta.io develops (it might come in handy for much, much larger datasets).
C++ for data analysis
2 projects | reddit.com/r/cpp | 18 May 2022
Fast Lane to Learning R
4 projects | news.ycombinator.com | 14 May 2022
I strongly recommend data.table for R. Tidyverse is an improvement on base R, no question. data.table has less intuitive syntax and can be harder to learn, but it is lightning fast and memory efficient. If you're working with more than 1M rows, you should be using data.table.
Here are some benchmarks: https://h2oai.github.io/db-benchmark/
Friendlier SQL with DuckDB
8 projects | news.ycombinator.com | 12 May 2022
Hi, good to hear that you guys care about testing. One thing apart from the GitHub issues that led me to believe it might not be super stable yet was the benchmark results on https://h2oai.github.io/db-benchmark/, which make it look like it couldn't handle the 50GB case due to an out-of-memory error. I see that the benchmark and the versions used are about a year old, so maybe things have changed a lot since then. Can you chime in regarding the current story of running bigger DBs, like 1TB, on a machine with just 32GB or so of RAM? Especially regarding data mutations and DDL queries. Thanks!
I used a new dataframe library (polars) to wrangle 300M prices and discover some of the most expensive hospitals in America. Code/notebook in article
2 projects | reddit.com/r/Python | 9 May 2022
Per these benchmarks it appears Polars is an order of magnitude more performant and it's lazy and Rust is just kinda sexy.
Benchmarking for loops vs apply and others
2 projects | reddit.com/r/rstats | 1 May 2022
This is a much more comprehensive set of benchmarks: https://h2oai.github.io/db-benchmark/
Why is R viewed badly over here? Also, as a college student, should I prioritize Python instead?
1 project | reddit.com/r/datascience | 29 Apr 2022
It's not like pandas is faster than tidyverse on all the benchmarks either, and data.table is faster than both. https://h2oai.github.io/db-benchmark/
Resources for data cleaning
2 projects | reddit.com/r/rstats | 13 Apr 2022
Language isn't really important here; what's important is tooling, and R definitely has the tooling. I would look at this benchmark reference for database-like operations, and you'll see that data.table (a very fast and memory-efficient R package) consistently ranks as one of the fastest tools out there that can also support a wide range of memory loads.
The fastest tool for querying large JSON files is written in Python (benchmark)
16 projects | news.ycombinator.com | 12 Apr 2022
Forecasting three months ahead.
2 projects | reddit.com/r/datascience | 7 Apr 2022
I Need Your Help: Convincing Reasons for Python over C# for ML Pipeline?
1 project | reddit.com/r/Python | 31 Jan 2022
Time series -> https://github.com/alan-turing-institute/sktime have a look and have fun :)
Good python time series libraries?
1 project | reddit.com/r/algotrading | 13 Dec 2021
Scikit-Learn Version 1.0
11 projects | news.ycombinator.com | 14 Sep 2021
Sktime: Machine Learning for Time Series
1 project | news.ycombinator.com | 4 Sep 2021
It provides specialized time series algorithms and scikit-learn compatible tools to build, tune and validate time series models for multiple learning problems.
sktime is built by an active open-source community, working together during regular meetings, workshops and sprints. For new contributors, we provide mentoring sessions and tutorials.
If you are interested in contributing or just want to chat about the project, feel free to submit a PR or reach out to us. We welcome all kinds of contributions: code, API design, testing, documentation, outreach, mentoring and more.
Darts: Non-Facebook alternative for timeseries forecasting
8 projects | news.ycombinator.com | 12 Aug 2021
What are some alternatives?
darts - A python library for easy manipulation and forecasting of time series.
tslearn - A machine learning toolkit dedicated to time-series data
Prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
scikit-hts - Hierarchical Time Series Forecasting with a familiar API
arrow-datafusion - Apache Arrow DataFusion SQL Query Engine
Kats - Kats, a kit to analyze time series data, a lightweight, easy-to-use, generalizable, and extendable framework to perform time series analysis, from understanding the key statistics and characteristics, detecting change points and anomalies, to forecasting future trends.
polars - Fast multi-threaded DataFrame library in Rust | Python | Node.js
sysidentpy - A Python Package For System Identification Using NARMAX Models
greykite - A flexible, intuitive and fast forecasting library
databend - A modern Elasticity and Performance cloud data warehouse, activate your object storage for real-time analytics.
DataFrame - C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
InceptionTime - InceptionTime: Finding AlexNet for Time Series Classification