Ask HN: Does (or why does) anyone use MapReduce anymore?

beam

30 7,508 10.0 Java

Apache Beam is a unified programming model for Batch and Streaming data processing.

The "Streaming Systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts to scale by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).
As for the framework called MapReduce, it isn't used much anymore, but its descendant Apache Beam (https://beam.apache.org) very much is. Nowadays people often use "map reduce" as shorthand for whatever batch processing system they're building on top of.
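For readers who have only met "map reduce" as shorthand, here is a minimal single-process sketch of the three phases the name refers to: map, shuffle (group by key), and reduce. It is plain Python, not the MapReduce framework or Beam, and every function name in it (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) is made up for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[str], mapper: Callable[[str], Iterable[Pair]]) -> Iterator[Pair]:
    """Apply the mapper to every input record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group all values by key so each key can be reduced independently."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]], reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line: str) -> Iterator[Pair]:
    """Hypothetical mapper: emit (word, 1) for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(key: str, values: List[int]) -> int:
    """Hypothetical reducer: sum the counts for one word."""
    return sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases into the classic word-count job."""
    pairs = map_phase(lines, word_mapper)
    return reduce_phase(shuffle_phase(pairs), count_reducer)
```

Calling `word_count(["a b a", "B c"])` counts words case-insensitively across both lines. In a real batch system the same three phases run across many machines, and the shuffle becomes a network transfer rather than a dictionary.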

s4

5 29 3.2 Go

super simple storage service + data local compute + shuffle (by nathants)

the idea of map reduce remains a good one.
there are a number of interesting innovations in streaming systems that followed, mostly around reducing latency, reducing batch size, and alternate failure/retry strategies.
even hadoop could be hard to debug when hitting a performance ceiling for challenging workloads. the streaming systems took this even further, spark being notorious for "fiddle with the knobs and pray the next job doesn’t fail after a few hours, again".
i played around with the thinnest possible map reduce stack a while back[1][2]. i wanted to understand the performance ceiling for different workloads without all the impenetrable layers of data bureaucracy. turns out modern network and cpu are really fast when you stop adding random software layers like lasagna.
i think the future of data, for serious workloads, is gonna be bespoke. the primitives are just too good now.
1. https://github.com/nathants/s4
2. https://github.com/nathants/bsv
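To make the "data local compute + shuffle" idea concrete, here is a rough sketch of a hash-partitioned shuffle in plain Python. It is not taken from s4 or bsv and does not claim to mirror how they work. The names (`partition_for`, `shuffle`, `reduce_partition`) are hypothetical, and the "workers" are simulated as in-memory lists where a real system would have separate machines.

```python
import zlib
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, int]


def partition_for(key: str, num_workers: int) -> int:
    """Pick the worker that owns a key.

    crc32 is used instead of the built-in hash() because Python randomizes
    str hashes per process, and every worker must agree on the owner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_workers


def shuffle(local_outputs: List[List[Pair]], num_workers: int) -> List[List[Pair]]:
    """Route every (key, value) pair from each worker's local map output
    to the inbox of the worker that owns that key."""
    inboxes: List[List[Pair]] = [[] for _ in range(num_workers)]
    for pairs in local_outputs:
        for key, value in pairs:
            inboxes[partition_for(key, num_workers)].append((key, value))
    return inboxes


def reduce_partition(pairs: List[Pair]) -> Dict[str, int]:
    """Sum the values per key inside one worker's inbox."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

Because a key always hashes to the same worker, each inbox can be reduced on its own with no coordination, and the per-worker results never overlap. That is the property that lets the reduce step run in parallel.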

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
