|about 2 months ago||6 days ago|
|Mozilla Public License 2.0||BSD 3-clause "New" or "Revised" License|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
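As a rough illustration of how a recency-weighted activity number like this could be computed, here is a minimal sketch. The actual formula is not given on this page, so everything here is an assumption: the exponential decay, the 30-day half-life, and the function names `weighted_commit_score` and `activity_rating` are all hypothetical.

```python
from bisect import bisect_left


def weighted_commit_score(commit_ages_days, half_life_days=30.0):
    """Sum per-commit weights; a commit's weight halves every half_life_days.

    The decay shape and the half-life are assumptions, not the site's formula.
    """
    return sum(0.5 ** (age / half_life_days) for age in commit_ages_days)


def activity_rating(score, all_scores):
    """Map a score to a 0-10 scale by the share of tracked projects scoring lower."""
    ranked = sorted(all_scores)
    below = bisect_left(ranked, score)  # number of projects with a strictly lower score
    return round(10.0 * below / len(ranked), 1)
```

Under these assumptions, a project that outscores 90% of the tracked projects gets a rating of 9.0, which matches the "top 10%" reading above.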
Friendlier SQL with DuckDB
8 projects | news.ycombinator.com | 12 May 2022
Hi, good to hear that you guys care about testing. One thing apart from the GitHub issues that led me to believe it might not be super stable yet was the benchmark results on https://h2oai.github.io/db-benchmark/, which make it look like it couldn't handle the 50GB case due to an out-of-memory error. I see that the benchmark and the versions used are about a year old, so maybe things have changed a lot since then. Can you chime in on the current story for running bigger DBs, like 1TB, on a machine with just 32GB or so of RAM? Especially regarding data mutations and DDL queries. Thanks!
I used a new dataframe library (polars) to wrangle 300M prices and discover some of the most expensive hospitals in America. Code/notebook in article
2 projects | reddit.com/r/Python | 9 May 2022
Per these benchmarks it appears Polars is an order of magnitude more performant and it's lazy and Rust is just kinda sexy.
Benchmarking for loops vs apply and others
2 projects | reddit.com/r/rstats | 1 May 2022
This is a much more comprehensive set of benchmarks: https://h2oai.github.io/db-benchmark/
Why is R viewed badly over here? Also, as a college student, should I prioritize Python instead?
1 project | reddit.com/r/datascience | 29 Apr 2022
It's not like pandas is faster than tidyverse on all the benchmarks either, and data.table is faster than both. https://h2oai.github.io/db-benchmark/
Resources for data cleaning
2 projects | reddit.com/r/rstats | 13 Apr 2022
Language isn't really important here; what's important is tooling, and R definitely has the tooling. I would look at this benchmark reference for database-like operations, and you'll see that data.table (a very fast and memory-efficient R package) consistently ranks as one of the fastest tools out there that can also support a wide range of memory loads.
The fastest tool for querying large JSON files is written in Python (benchmark)
16 projects | news.ycombinator.com | 12 Apr 2022
Polars 0.20.0 release
2 projects | reddit.com/r/rust | 14 Mar 2022
How Easy It Is to Re-use Old Pandas Code in Spark 3.2
1 project | reddit.com/r/programming | 4 Feb 2022
It seems to me that the Spark model is much more sensible in terms of performance. In Spark, individual tasks are ultimately compiled into optimized Java code. As I understand it, Dask runs a separate Python process for each data subset. Because of this architecture, Dask is unlikely to ever reach Spark's performance. By the way, both systems build and optimize the operation graph. This is confirmed by benchmarks: https://h2oai.github.io/db-benchmark/
Count frequency of breed by animal ID (details in comments)
1 project | reddit.com/r/rstats | 3 Feb 2022
You can try to compare dplyr and data.table in this benchmark of different data sizes and operations: https://h2oai.github.io/db-benchmark/
Massive R analysis of Data Science Language and Job Trends 2022
2 projects | reddit.com/r/rstats | 29 Jan 2022
I use both R and Python, and agree with 1, 3 and 4, but for most data wrangling, data.table in R beats most of the competition, including pandas in Python: https://h2oai.github.io/db-benchmark/
DataFrame: NEW Data - star count:1452.0
1 project | reddit.com/r/algoprojects | 15 May 2022
Tooling makes all the difference in the world
1 project | reddit.com/r/options | 14 Apr 2022
This (https://github.com/hosseinmoein/DataFrame) is a tool for data analysis for large data sets and environments where things need to be done fast. It includes many trading indicators + ML data analysis + statistical analysis + more
3 projects | reddit.com/r/cpp | 30 Mar 2022
Is there a test/method in Time Series analysis on how to check if a series is holding a trend or not?
1 project | reddit.com/r/algotrading | 9 Mar 2022
DataFrame: NEW Data - star count:1386.0
1 project | reddit.com/r/algoprojects | 1 Mar 2022
1 project | reddit.com/r/algoprojects | 28 Feb 2022
1 project | reddit.com/r/algoprojects | 27 Feb 2022
DataFrame: NEW Data - star count:1374.0
1 project | reddit.com/r/algoprojects | 26 Feb 2022
1 project | reddit.com/r/algoprojects | 25 Feb 2022
1 project | reddit.com/r/algoprojects | 24 Feb 2022
What are some alternatives?
arrow-datafusion - Apache Arrow DataFusion and Ballista query engines
polars - Fast multi-threaded DataFrame library in Rust | Python | Node.js
databend - A modern Elasticity and Performance cloud data warehouse, activate your object storage for real-time analytics.
datatable - A Python package for manipulating 2-dimensional tabular data structures
sktime - A unified framework for machine learning with time series
disk.frame - Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
DataFramesMeta.jl - Metaprogramming tools for DataFrames
faiss - A library for efficient similarity search and clustering of dense vectors.
zhetapi - A C++ ML and numerical analysis API, with an accompanying scripting language.
julia - The Julia Programming Language