Ask HN: Does (or why does) anyone use MapReduce anymore?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • beam

    Apache Beam is a unified programming model for Batch and Streaming data processing.

  • The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).

    As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.

  • s4

    super simple storage service + data local compute + shuffle (by nathants)

  • the idea of map reduce remains a good one.

    there are a number of interesting innovations in streaming systems that followed, mostly around reducing latency, reducing batch size, and alternate failure/retry strategies.

    even hadoop could be hard to debug when hitting a performance ceiling on challenging workloads. the streaming systems took this even further, spark being notorious for "fiddle with the knobs and pray the next job doesn’t fail after a few hours, again."

    i played around with the thinnest possible map reduce stack a while back[1][2]. i wanted to understand the performance ceiling for different workloads without all the impenetrable layers of data bureaucracy. turns out modern networks and cpus are really fast when you stop stacking random software layers like lasagna.

    i think the future of data, for serious workloads, is gonna be bespoke. the primitives are just too good now.

    1. https://github.com/nathants/s4

    2. https://github.com/nathants/bsv
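For readers who have not seen the pattern spelled out, the "idea of map reduce" discussed above can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce phases. It is not the API of s4, bsv, Hadoop, or Beam, and every name in it (`map_reduce`, `word_mapper`, `count_reducer`, and so on) is made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(records, mapper):
    """Apply mapper to each record and flatten the emitted (key, value) pairs."""
    return chain.from_iterable(mapper(record) for record in records)


def shuffle_phase(pairs):
    """Group values by key. A real system would do this across machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def map_reduce(records, mapper, reducer):
    """Run the three phases in order, in one process."""
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


# Hypothetical word-count job used only to exercise the sketch.
def word_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(key, values):
    return sum(values)


if __name__ == "__main__":
    lines = ["map reduce map", "reduce shuffle"]
    print(map_reduce(lines, word_mapper, count_reducer))
    # prints {'map': 2, 'reduce': 2, 'shuffle': 1}
```

In a distributed setting the shuffle is the expensive step, since it moves every value for a key to the machine that reduces it, which is presumably why the s4 description above calls out "data local compute + shuffle" explicitly.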
