Notes on the SQLite DuckDB Paper
4 projects | news.ycombinator.com | 1 Sep 2022
Friendlier SQL with DuckDB
8 projects | news.ycombinator.com | 12 May 2022
Interesting thought! I have not tried this yet so I only have a guess as an answer. Could you export the data as SQL statements and then run those statements on DuckDB? That may be easier to set up, but may take longer to run...
DuckDB also has the ability to read Postgres data directly, and there is a Postgres FDW that can read from DuckDB!
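The "export the data as SQL statements" idea above can be sketched with Python's standard-library `sqlite3` module, whose `iterdump()` emits exactly such a statement stream. The replay-into-DuckDB step is shown only as a hedged comment, since it assumes the `duckdb` package is installed and that the dumped SQL is dialect-compatible (DuckDB accepts most of it, but this is not guaranteed for every schema):

```python
import sqlite3

# Build a tiny example database (stands in for the source system).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE rides (id INTEGER, fare REAL)")
src.executemany("INSERT INTO rides VALUES (?, ?)", [(1, 7.5), (2, 12.0)])
src.commit()

# Export the data as plain SQL statements, as suggested above.
dump = list(src.iterdump())  # BEGIN; CREATE TABLE ...; INSERT INTO ...; COMMIT;
script = "\n".join(dump)

# Hypothetical replay step (assumes the duckdb package is available):
# import duckdb
# con = duckdb.connect()
# for stmt in dump:
#     con.execute(stmt)
```

For Postgres specifically, the direct-read route mentioned above (DuckDB's Postgres support) avoids the dump/replay round trip entirely and is likely faster.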
Benchmarking Pandas, CuDF, Modin, Apache Arrow and Spark on a Billion Taxi Rides dataset
2 projects | reddit.com/r/Python | 21 Sep 2022
And more benchmarks: https://h2oai.github.io/db-benchmark/. If you are looking for performant dataframes, idiomatic polars typically tops the benchmarks.
Hiring an R coder to improve efficiency of code?
3 projects | reddit.com/r/rstats | 14 Sep 2022
base-R is not particularly fast. Use data.table and its fast assignment/grouping/aggregation.
Does anyone else feel in a tricky spot about their use of R?
3 projects | reddit.com/r/rstats | 7 Sep 2022
Performance and capacity (e.g. RAM and speed), from the stats coder's perspective, depend not on the language but on the packages. As /u/Farther_father mentioned, tidytable is identical to dplyr from a coding perspective, but its efficiency and capacity are far better. So what you said about R's design or S4, Python, Julia, etc. reflects a misunderstanding of what is going on in the back-end, especially because Julia, though reputed to be performant, is in fact the worst of the three (pandas runs out of memory while polars/tidypolars does not; dplyr runs out of memory while data.table/tidytable does not -- same language, different packages, different performance).
yeah, `data.table` is a huge help for me as well. Most of what I do (my largest data sets are ~3-5 GB) is much faster than when I tried it with pandas/Python, which agrees well with https://h2oai.github.io/db-benchmark/
Don't waste your time on Julia
2 projects | reddit.com/r/rstats | 14 Aug 2022
Certainly not compared to Python's polars and R's data.table: https://h2oai.github.io/db-benchmark (where DataFrames.jl is the Julia entry, alongside DuckDB, cuDF, and ClickHouse). Both even have tidy interfaces, tidypolars and tidytable. So Julia needs to catch up with Python and R on two layers.
Polars vs Pandas - A look at performance
2 projects | reddit.com/r/rust | 14 Jul 2022
I analyzed 1835 hospital price lists so you didn't have to
4 projects | news.ycombinator.com | 6 Jul 2022
Python for Data Analysis, 3rd Edition – The Open Access Version Online
3 projects | news.ycombinator.com | 2 Jul 2022
I think that neither "outlier toolchain" nor "some percentage increase" is fair. This benchmark shows a significant speedup while lowering memory needs. You still need to reach for dask/spark for truly big data, where your tasks require a cluster of beefy machines.
If you use an r5d.24xlarge-like instance, you can skip spark/dask for most workflows, as 768 GB of RAM is plenty. On top of that, polars will efficiently use the 96 available cores when computing your joins, group-bys, etc.
Also, polars is getting more and more popular.
The computers are fast, but you don't know it
7 projects | news.ycombinator.com | 16 Jun 2022
4 projects | reddit.com/r/Python | 15 Jun 2022
You can also compare benchmarks: https://h2oai.github.io/db-benchmark
What are some alternatives?
arrow-datafusion - Apache Arrow DataFusion SQL Query Engine
polars - Fast multi-threaded DataFrame library in Rust | Python | Node.js
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
databend - A modern, elastic, high-performance cloud data warehouse that activates your object storage for real-time analytics. Databend Serverless at https://app.databend.com/
DataFramesMeta.jl - Metaprogramming tools for DataFrames
sktime - A unified framework for machine learning with time series
disk.frame - Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
arrow2 - Transmute-free Rust library to work with the Arrow format
DataFrame - C++ DataFrame for statistical, financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
julia - The Julia Programming Language
csvs-to-sqlite - Convert CSV files into a SQLite database
datatable - A Python package for manipulating 2-dimensional tabular data structures