dumbo VS Apache Spark

Compare dumbo vs Apache Spark and see their differences.

dumbo

Python module that allows one to easily write and run Hadoop programs. (by klbostee)

Apache Spark

Apache Spark - A unified analytics engine for large-scale data processing (by apache)
                     dumbo                 Apache Spark
Mentions             0                     66
Stars                1,047                 33,939
Growth               -                     1.3%
Activity             0.0                   10.0
Latest commit        over 4 years ago      5 days ago
Language             Python                Scala
License              Apache License 2.0    Apache License 2.0
  • Mentions - the total number of mentions we've tracked, plus the number of user-suggested alternatives.
  • Stars - the number of stars a project has on GitHub.
  • Growth - month-over-month growth in stars.
  • Activity - a relative measure of how actively a project is being developed; recent commits carry more weight than older ones. For example, an activity of 9.0 indicates that a project is among the top 10% of the most actively developed projects we track.

dumbo

Posts with mentions or reviews of dumbo. We have used some of these posts to build our list of alternatives and similar projects.

We haven't tracked posts mentioning dumbo yet.
Tracking mentions began in Dec 2020.

Apache Spark

Posts with mentions or reviews of Apache Spark. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-09-26.
  • A peek into Location Data Science at Ola
    6 projects | dev.to | 26 Sep 2022
    This requires distributed computation tools such as Spark, Hadoop, Flink, and Kafka. For occasional experimentation, Pandas, GeoPandas, and Dask are some of the commonly used tools.
  • System Design: Uber
    4 projects | dev.to | 21 Sep 2022
    Recording analytics and metrics is one of our extended requirements. We can capture the data from different services and run analytics on the data using Apache Spark which is an open-source unified analytics engine for large-scale data processing. Additionally, we can store critical metadata in the views table to increase data points within our data.
  • System Design: Twitter
    5 projects | dev.to | 21 Sep 2022
    Recording analytics and metrics is one of our extended requirements. As we will be using Apache Kafka to publish all sorts of events, we can process these events and run analytics on the data using Apache Spark which is an open-source unified analytics engine for large-scale data processing.
  • How the world caught up with Apache Cassandra
    4 projects | dev.to | 15 Sep 2022
    Cassandra survived its adolescent years by retaining its position as the database that scales more reliably than anything else, with a continual pursuit of operational simplicity at scale. It demonstrated its value even further by integrating with a broader data infrastructure stack of open source components, including the analytics engine Apache Spark, stream-processing platform Apache Kafka, and others.
  • Why we don’t use Spark
    2 projects | dev.to | 7 Sep 2022
    Most people working in big data know Spark (if you don't, check out their website) as the standard tool to Extract, Transform & Load (ETL) their heaps of data. Spark, the successor of Hadoop & MapReduce, works a lot like Pandas, a data science package where you run operators over collections of data. These operators then return new data collections, which allows the chaining of operators in a functional way while keeping scalability in mind.
  • Tracking Aircraft in Real-Time With Open Source
    17 projects | dev.to | 1 Sep 2022
    Apache Spark
  • Best Open source no-code ELT tool for startup
    5 projects | reddit.com/r/dataengineering | 29 Aug 2022
    For my ETL/data warehouse/analytics needs, I've been very happy with Apache Airflow combined with Apache Spark.
  • Spark vs Flink vs ksqlDB for stream processing
    3 projects | dev.to | 17 Aug 2022
    Apache Spark® is a multi-language framework designed for executing data engineering, data science, and machine learning computation on single-node machines or clusters.
  • System Design: The complete course
    31 projects | dev.to | 16 Aug 2022
    Recording analytics and metrics is one of our extended requirements. We can capture the data from different services and run analytics on the data using Apache Spark which is an open-source unified analytics engine for large-scale data processing. Additionally, we can store critical metadata in the views table to increase data points within our data.
  • Efficiently processing large amounts of data that needs to be grouped by multiple parameters
    2 projects | reddit.com/r/javahelp | 20 Jul 2022
    Naturally it's difficult to suggest tools, but something like https://spark.apache.org or https://jet-start.sh ?
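The operator-chaining style described in the "Why we don't use Spark" excerpt above can be sketched without a cluster. This is a minimal, hypothetical stand-in: the `LocalRDD` class below is not part of Spark, it merely mimics the shape of Spark's RDD API over a plain Python list, so you can see how each operator returns a new collection and how operators chain functionally.

```python
from functools import reduce as _reduce


class LocalRDD:
    """Toy stand-in for a Spark RDD: each operator returns a new collection."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Apply fn to every element, yielding a new collection.
        return LocalRDD(fn(x) for x in self.data)

    def filter(self, pred):
        # Keep only elements for which pred is true.
        return LocalRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        # Fold the collection down to a single value.
        return _reduce(fn, self.data)

    def collect(self):
        # Materialize the collection as a plain list.
        return self.data


# Chaining operators functionally, as one would with a real RDD:
rdd = LocalRDD(range(10))
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
total = rdd.map(lambda x: x + 1).reduce(lambda a, b: a + b)
print(evens_squared)  # [0, 4, 16, 36, 64]
print(total)          # 55
```

In real Spark the same chain would run lazily and be distributed across a cluster; the point here is only the functional shape, where transformations return new collections rather than mutating in place.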

What are some alternatives?

When comparing dumbo and Apache Spark you can also consider the following projects:

Trino - Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration

Scalding - A Scala API for Cascading

mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services

luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

Weka

Smile - Statistical Machine Intelligence & Learning Engine

Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Apache Calcite - Apache Calcite

Scio - A Scala API for Apache Beam and Google Cloud Dataflow.

Apache Flink - Apache Flink