db-benchmark
scientific-visualization-book
Metric | db-benchmark | scientific-visualization-book
---|---|---
Mentions | 91 | 17
Stars | 320 | 9,995
Growth | 2.2% | -
Activity | 0.0 | 3.6
Last commit | 9 months ago | 2 months ago
Language | R | Python
License | Mozilla Public License 2.0 | GNU General Public License v3.0 or later
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
db-benchmark
-
Polars
Real-world performance is complicated since data science covers a lot of use cases.
If you're just reading a small CSV to do analysis on it, then there will be no human-perceptible difference between Polars and Pandas. If you're reading a larger CSV with 100k rows, there still won't be much of a perceptible difference.
Per this (old) benchmark, there are differences once you get into 500MB+ territory: https://h2oai.github.io/db-benchmark/
-
DuckDB performance improvements with the latest release
I do think it was important for duckdb to put out a new version of the results as the earlier version of that benchmark [1] went dormant with a very old version of duckdb with very bad performance, especially against polars.
-
Show HN: SimSIMD vs. SciPy: How AVX-512 and SVE make SIMD cleaner and ML faster
https://news.ycombinator.com/item?id=33270638 :
> Apache Ballista and Polars do Apache Arrow and SIMD.
> The Polars homepage links to the "Database-like ops benchmark" of {Polars, data.table, DataFrames.jl, ClickHouse, cuDF, spark, (py)datatable, dplyr, pandas, dask, Arrow, DuckDB, Modin,} but not yet PostgresML? https://h2oai.github.io/db-benchmark/ *
LLM -> Vector database: https://en.wikipedia.org/wiki/Vector_database
/? inurl:awesome site:github.com "vector database"
-
Pandas vs. Julia – cheat sheet and comparison
I agree with your conclusion but want to add that switching from Julia may not make sense either.
According to these benchmarks: https://h2oai.github.io/db-benchmark/, DF.jl is the fastest library for some things, data.table for others, polars for others. Which is fastest depends on the query and whether it takes advantage of the features/properties of each.
For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.
-
Any faster Python alternatives?
Same. Numba does wonders for me in most scenarios. Yesterday I discovered pola-rs, and it looks like I will add it to the stack. Its API is similar to pandas. Have a look at the benchmarks of cuDF, spark, dask, pandas compared to it: Benchmarks
-
Pandas v2.0 Released
If interested in benchmarks comparing different dataframe implementations, here is one:
-
Python "programmers" when I show them how much faster their naive code runs when translated to C++ (this is a joke, I love python)
Bad examples. Both numpy and pandas are notoriously un-optimized packages, losing handily to pretty much all their competitors (R, Julia, kdb+, vaex, polars). See https://h2oai.github.io/db-benchmark/ for a partial comparison.
-
Best alternative to Pandas 2023?
And what's your rating scale? Objectively, pandas loses in performance against everything relevant. It has a wonky syntax that requires using lambda all over the place or to retype your df name at least twice for many operations.
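A minimal sketch of the two patterns that comment seems to be describing, assuming pandas is installed. The toy data and the column names `a`, `b`, and `c` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Boolean-mask filtering: the frame's name appears twice in one expression.
filtered = df[df["a"] > 1]

# Method chaining: a lambda is needed to refer to the intermediate frame,
# because the new column "c" does not exist on the original `df`.
chained = (
    df.assign(c=lambda d: d["a"] + d["b"])
      .loc[lambda d: d["c"] > 11]
)

print(filtered)
print(chained)
```

Whether this counts as "wonky" is a matter of taste; the sketch only shows what the repetition and the lambdas look like in practice.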
-
Tutorial on Intro to Rust Programming
There has been an upward trend in open-source tools written in Rust with interfaces to Python, e.g. pydantic (moved to Rust in the recent release) and polars, which is very fast as indicated in the H2Oai benchmarks.
-
How do I work with GIGANTIC csv files (20-100 gigabytes)?
scientific-visualization-book
-
Book or web book recommendation request: a data visualization cookbook using Python for scientists.
-
What's New in Matplotlib 3.6.0
I had the same problem until I found this tutorial:
https://github.com/rougier/matplotlib-tutorial
If you want something deeper, the same person has written a book:
-
Ask HN: What is the best book on data visualization in 2021?
For python this open access book is excellent: https://github.com/rougier/scientific-visualization-book
-
Open access book on scientific visualization using Python and matplotlib
This book is distributed under a non commercial license (CC-BY-NC-SA 4.0) [0].
While I understand that language is evolving and that only under a "strict definition" of open access does it mean removing barriers to copying and reuse [1], it seems pretty duplicitous to say it's "open" while putting it under a non-commercial license.
[0] https://github.com/rougier/scientific-visualization-book/blo...
[1] https://en.wikipedia.org/w/index.php?title=Open_access&oldid...
-
Scikit-Learn Version 1.0
Speaking of what's possible in matplotlib, I am very much looking forward to reading this book: https://github.com/rougier/scientific-visualization-book
What are some alternatives?
polars - Dataframes powered by a multithreaded, vectorized query engine, written in Rust
arrow-datafusion - Apache Arrow DataFusion SQL Query Engine
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
databend - Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
DataFramesMeta.jl - Metaprogramming tools for DataFrames
sktime - A unified framework for machine learning with time series
arrow2 - Transmute-free Rust library to work with the Arrow format
disk.frame - Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
datatable - A Python package for manipulating 2-dimensional tabular data structures
DataFrame - C++ DataFrame for statistical, Financial, and ML analysis -- in modern C++ using native types and contiguous memory storage
vaex - Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
PackageCompiler.jl - Compile your Julia Package