Big Data Is Dead

blog

10 1,926 6.7 JavaScript

Some notes on things I find interesting and important. (by frankmcsherry)

This reminds me of a great blog post by Frank McSherry (Materialize, timely dataflow, etc.) talking about how using the right tools on a laptop could beat out a bunch of these JVM distributed querying tools because... data locality, basically.
https://github.com/frankmcsherry/blog/blob/master/posts/2015...

memray

27 12,545 9.0 Python

Memray is a memory profiler for Python

This is an excellent summary, but it omits part of the problem (perhaps because the author has an obvious, and often quite good, solution, namely DuckDB).
The implicit problem is that even if the dataset fits in memory, the software processing that data often uses more RAM than the machine has. It's _really easy_ to use way too much memory with e.g. Pandas. And there are three ways to approach this:
* As mentioned in the article, throw more money at the problem with cloud VMs. This gets expensive at scale, and can be a pain, and (unless you pursue the next two solutions) is in some sense a workaround.
* Better data processing tools: Use a tool smart enough to apply efficient query planning and streaming algorithms to limit memory usage. There's DuckDB, obviously, and Polars; here's a writeup I did showing how Polars uses much less memory than Pandas for the same query: https://pythonspeed.com/articles/polars-memory-pandas/
* Better visibility/observability: Make it easier to actually see where memory usage is coming from, so that the problems can be fixed. It's often very difficult to get good visibility here, partially because the tooling for performance and memory is often biased towards web apps, which have different requirements than data processing. In particular, the bottleneck is _peak_ memory, which requires a particular kind of memory profiling.
In the Python world, relevant memory profilers are pretty new. The most popular open source one at this point is Memray (https://bloomberg.github.io/memray/), but I also maintain Fil (https://pythonspeed.com/fil/). Both can give you visibility into sources of memory usage that was previously painfully difficult to get. On the commercial side, I'm working on https://sciagraph.com, which does memory and also performance profiling for Python data processing applications, and is designed to support running in development but also in production.
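To make the peak-memory point concrete, here is a minimal sketch. It uses Python's standard-library tracemalloc module as a stand-in, not Memray, Fil, or Sciagraph, and it only sees allocations made through Python's allocator. The names peak_bytes, total_materialized, total_streaming, and N are hypothetical, chosen for illustration. It compares a computation that builds all of its data in memory with one that streams values one at a time; both return the same answer, but the peak differs sharply.

```python
import tracemalloc


def peak_bytes(func):
    """Run func() and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        result = func()
        # get_traced_memory() returns (current, peak); peak is what matters here,
        # because memory freed before the call returns no longer shows in current.
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def total_materialized(n):
    # Hypothetical "load everything first" style: the full list exists at once.
    values = [i * 2 for i in range(n)]
    return sum(values)


def total_streaming(n):
    # Hypothetical streaming style: the generator yields one value at a time.
    return sum(i * 2 for i in range(n))


N = 200_000  # illustrative size, not taken from the discussion above
eager_total, eager_peak = peak_bytes(lambda: total_materialized(N))
lazy_total, lazy_peak = peak_bytes(lambda: total_streaming(N))

print(f"materialized: peak ~{eager_peak / 1e6:.2f} MB")
print(f"streaming:    peak ~{lazy_peak / 1e6:.2f} MB")
```

A dedicated profiler adds what this sketch lacks: attribution of the peak to specific lines of code, and coverage of allocations made by native libraries.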

ClickHouse

208 34,153 10.0 C++

ClickHouse® is a free analytics DBMS for big data

One great reason to use DuckDB was when ClickHouse took up too much memory on Parquet files.
https://github.com/ClickHouse/ClickHouse/issues/45741#issuec... helps with that though.
Also, clickhouse-local exists https://clickhouse.com/blog/extracting-converting-querying-l... as a thing.
But, yes, when I think of DuckDB...I think embedded use cases...I'm also not a power user.
I also think of this very much as a 'horses for courses' or 'different strokes, different folks' sort of scenario. There is, naturally, overlap because 'analytical data.' But also, there is naturally overlap with R and this giant scary mess of data-munging Perl code I maintain for a side project.
The DuckDB team, the MotherDuck team, the ClickHouse team...we all want your experience interacting with data to be amazing. In some scenarios, ClickHouse is better. In some scenarios, DuckDB. I'm biased (as I work for ClickHouse in DevRel), but I <3 ClickHouse.
Try both. Pick the one that is best for you. Then...you know...tell the other(s) why so that we all can get better at what we do.

