Computers are fast, but you don't know it
7 projects | news.ycombinator.com | 16 Jun 2022
4 projects | reddit.com/r/Python | 15 Jun 2022
You can also compare benchmarks: https://h2oai.github.io/db-benchmark
[Project] BFLOAT16 on ALL hardware (>= 2009), up to 2000x faster ML algos, 50% less RAM usage for all old/new hardware - Hyperlearn Reborn.
2 projects | reddit.com/r/MachineLearning | 2 Jun 2022
Yes, please do! polars/tidypolars is the best performer, often beating even R's data.table.
Best tool for larger than memory data
2 projects | reddit.com/r/dataengineering | 27 May 2022
Will Rust-based data frame library Polars dethrone Pandas? We evaluate on 1M+ Stack Overflow questions
10 projects | reddit.com/r/rust | 25 May 2022
We didn't conduct our own benchmarks for this post, but in this comparison from ~1 year ago, Polars emerged as the fastest https://h2oai.github.io/db-benchmark/
Amazon Redshift Re-Invented
1 project | news.ycombinator.com | 24 May 2022
DuckDB is in constant development but doesn't yet have a native cross-version export/import feature (its developers say DuckDB hasn't matured enough to stabilise its on-disk format just yet).
I also keep an eye on https://h2oai.github.io/db-benchmark/ Pola.rs and DataFusion sound the most exciting.
It also remains to be seen how Databricks' delta.io develops (it might come in handy for much, much larger datasets).
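On the export/import point above, the workaround commonly suggested is to dump the database to a version-independent format such as Parquet before upgrading. A minimal sketch using the duckdb R client; the paths are hypothetical:

```r
library(DBI)
library(duckdb)

# Open the existing database file (path is hypothetical)
con <- dbConnect(duckdb(), dbdir = "mydata.duckdb")

# EXPORT DATABASE writes every table out as Parquet plus the schema,
# so the data survives an upgrade that changes the on-disk format
dbExecute(con, "EXPORT DATABASE 'mydata_backup' (FORMAT PARQUET)")
dbDisconnect(con, shutdown = TRUE)

# After upgrading, a fresh database can re-ingest the dump:
# dbExecute(new_con, "IMPORT DATABASE 'mydata_backup'")
```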
C++ for data analysis
2 projects | reddit.com/r/cpp | 18 May 2022
Fast Lane to Learning R
4 projects | news.ycombinator.com | 14 May 2022
I strongly recommend R's data.table. Tidyverse is an improvement on base R, no question. data.table has less intuitive syntax and can be harder to learn, but it is lightning fast and memory-efficient. If you're working with more than 1M rows, you should be using data.table.
Here are some benchmarks: https://h2oai.github.io/db-benchmark/
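For anyone who hasn't used it, a minimal sketch of the kind of grouped aggregation that benchmark measures, on simulated data:

```r
library(data.table)

# Simulated table: 10M rows, 10k groups
dt <- data.table(
  id  = sample(1e4L, 1e7, replace = TRUE),
  val = rnorm(1e7)
)

# Grouped aggregation: data.table's multi-threaded, by-reference
# internals keep this fast and memory-efficient at this scale
res <- dt[, .(mean_val = mean(val), n = .N), by = id]
```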
Friendlier SQL with DuckDB
8 projects | news.ycombinator.com | 12 May 2022
Hi, good to hear that you care about testing. One thing, apart from the GitHub issues, that led me to believe it might not be super stable yet was the benchmark results on https://h2oai.github.io/db-benchmark/, which make it look like it couldn't handle the 50GB case due to an out-of-memory error. I see that the benchmark and the versions used are about a year old, so maybe things have changed a lot since then. Can you chime in on the current story of running bigger DBs, like 1TB, on a machine with just 32GB or so of RAM? Especially regarding data mutations and DDL queries. Thanks!
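For context on that question: DuckDB can stream aggregations over Parquet files without materialising them in memory, so read-heavy queries over data much larger than RAM are feasible when the per-group state stays small. A sketch with the duckdb R client; the file layout and column names are hypothetical:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# read_parquet() scans the files lazily; an aggregation over hundreds
# of GB can run on a 32GB machine as long as the intermediate state
# (one row per group here) fits in memory
res <- dbGetQuery(con, "
  SELECT passenger_count, count(*) AS n, avg(fare_amount) AS avg_fare
  FROM read_parquet('trips/*.parquet')
  GROUP BY passenger_count
")
dbDisconnect(con, shutdown = TRUE)
```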
I used a new dataframe library (polars) to wrangle 300M prices and discover some of the most expensive hospitals in America. Code/notebook in article
2 projects | reddit.com/r/Python | 9 May 2022
Per these benchmarks, it appears Polars is an order of magnitude more performant, and it's lazy, and Rust is just kinda sexy.
Do you code from memory? Or do you reference things?
1 project | reddit.com/r/rstats | 31 Mar 2022
Say hello to disk.frame.
How can I read in only two columns from a massive 10+ GB tab file?
1 project | reddit.com/r/rstats | 20 Jan 2022
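One answer that avoids loading the whole file, assuming it has a header row: data.table::fread's select argument parses only the requested columns. File and column names below are hypothetical:

```r
library(data.table)

# select= makes fread() skip parsing all but the named columns, so
# the 10+ GB file is scanned once but never fully held in RAM
dt <- fread(
  "big_file.tsv",                 # hypothetical path
  sep = "\t",
  select = c("col_a", "col_b")    # hypothetical column names
)
```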
Data cleaning/ analysis 100-200 million rows of data. Is this doable in R, or is there another program I should try instead?
2 projects | reddit.com/r/rstats | 12 Oct 2021
It depends on your hardware, but it should not be a problem. You might look into disk.frame (https://diskframe.com) or similar packages.
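A minimal sketch of the disk.frame workflow being suggested, following the package's documented CSV-to-chunks pattern; paths and column names are hypothetical:

```r
library(disk.frame)
library(dplyr)

# Spin up background workers for chunk-wise processing
setup_disk.frame(workers = 4)

# Convert the CSV to an on-disk, chunked disk.frame once; it can be
# reopened later with disk.frame("sales.df") without re-importing
df <- csv_to_disk.frame("sales.csv", outdir = "sales.df")

# dplyr verbs run chunk by chunk; collect() brings only the small
# aggregated result into RAM
res <- df %>%
  group_by(region) %>%
  summarise(total = sum(amount)) %>%
  collect()
```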
Is it possible to have my environment objects and work with them on my local drive instead of RAM?
1 project | reddit.com/r/Rlanguage | 3 Jul 2021
If that doesn't work, the disk.frame package might help. It is new-ish and not widely used, but it does seem to work with data on disk rather than in memory.
We Test PCIe 4.0 Storage: The AnandTech 2021 SSD Benchmark Suite
1 project | news.ycombinator.com | 2 Feb 2021
> The speeds were just stunning to say the least at 15GB/s.
That is amazing. That is around DDR4-1866 speed, and not far from DDR4-2666 (2666 MT/s × 8 bytes ≈ 21.3 GB/s per channel). At those speeds I would happily work with dataframes sitting on disk rather than in memory [1, 2]. Did you benchmark RAID 0 with fewer than four disks?
What are some alternatives?
arrow-datafusion - Apache Arrow DataFusion SQL Query Engine
polars - Fast multi-threaded DataFrame library in Rust | Python | Node.js
databend - A modern cloud data warehouse focused on elasticity and performance; activates your object storage for real-time analytics. Cloud at https://app.databend.com/
sktime - A unified framework for machine learning with time series
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
DataFramesMeta.jl - Metaprogramming tools for DataFrames
arrow2 - Unofficial transmute-free Rust library to work with the Arrow format
csvs-to-sqlite - Convert CSV files into a SQLite database
julia - The Julia Programming Language
DataFrame - C++ DataFrame for statistical, financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
Preql - An interpreted relational query language that compiles to SQL.
PackageCompiler.jl - Compile your Julia Package