dumbo
streamparse
Metric | dumbo | streamparse
---|---|---
Mentions | - | 1
Stars | 1,034 | 1,490
Growth | - | 0.1%
Activity | 0.0 | 0.0
Last commit | over 6 years ago | 8 months ago
Language | Python | Python
License | Apache License 2.0 | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
dumbo
We haven't tracked posts mentioning dumbo yet.
Tracking mentions began in Dec 2020.
streamparse
Apache Heron: A realtime, distributed, fault-tolerant stream processing engine
Wonder why this is getting posted today in particular?
The quick summary here is that this was a clean-house rewrite of Apache Storm done by an internal team at Twitter. As an open source project history refresher, Apache Storm was originally built by a startup called Backtype, and the project was led by Nathan Marz, the technical founder of Backtype. Then, Backtype was acquired by Twitter, and Storm became a major component for large-scale stream processing (of tweets, tweet analytics, and other things) at Twitter.
I wrote a summary of the "interesting bits" of Apache Storm here:
However, at a certain point, Nathan Marz left Twitter, and a different group of engineers tried to rethink Storm inside Twitter. There was also a lot of work going on around Apache Mesos at the time. Heron is kind of a merger of their "rethinking" of Storm while also making it possible to manage Storm-like Heron clusters using Mesos.
But, I don't think Heron really took off. Meanwhile, Storm got very, very stable in the 1.x series, and then had a clean-house rewrite from Clojure to Java in the 2.x series. The last stable/major Storm release was in 2020.
Storm provides a stream processing programming API, a multi-lang wire protocol, and a cluster management approach. But certain cluster computing problems can probably be better solved at the infrastructure layer today. That said, it's still a very powerful system; on my team, we process 75K+ events per second across hundreds of vCPU cores and thousands of Python processes by combining Storm and Kafka with our open source project, streamparse.
https://github.com/Parsely/streamparse
(Also, I'd be remiss if I didn't mention -- if you're interested in stream processing and distributed computing, we are hiring Python Data Engineers to work on a stack involving Storm, Spark, Kafka, Cassandra, etc.) -- https://www.parse.ly/careers/python_data_engineer
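The multi-lang wire protocol mentioned in the comment above is what lets Python processes take part in a Storm topology: the worker and the subprocess exchange JSON messages over stdin/stdout, each terminated by a line containing only `end`. The following is a minimal sketch of that framing using only the standard library. The helper names (`write_message`, `read_message`, `split_sentence`) are hypothetical and are not streamparse's API; streamparse handles this layer for you, and the real protocol also includes a handshake and heartbeats that are omitted here.

```python
import io
import json

# Storm's multi-lang protocol ends every JSON message with a line holding "end".
END = "end"


def write_message(stream, message):
    """Serialize one message as JSON followed by the "end" terminator line."""
    stream.write(json.dumps(message) + "\n" + END + "\n")


def read_message(stream):
    """Read lines until the terminator and decode them as one JSON message.

    Returns None if the stream closes before a complete message arrives.
    """
    lines = []
    while True:
        line = stream.readline()
        if not line:  # EOF
            return None
        line = line.rstrip("\n")
        if line == END:
            return json.loads("\n".join(lines))
        lines.append(line)


def split_sentence(tup):
    """Hypothetical bolt logic: emit one tuple per word, then ack the input.

    Each emit is anchored to the incoming tuple id so Storm can track it.
    """
    words = tup["tuple"][0].split()
    commands = [
        {"command": "emit", "anchors": [tup["id"]], "tuple": [word]}
        for word in words
    ]
    commands.append({"command": "ack", "id": tup["id"]})
    return commands


if __name__ == "__main__":
    # Simulate the Storm side with an in-memory stream (illustrative values).
    incoming = io.StringIO()
    write_message(
        incoming,
        {"id": "42", "comp": 1, "stream": "default", "task": 3,
         "tuple": ["hello storm"]},
    )
    incoming.seek(0)

    outgoing = io.StringIO()
    for command in split_sentence(read_message(incoming)):
        write_message(outgoing, command)
    print(outgoing.getvalue())
```

In a real topology the streams would be `sys.stdin` and `sys.stdout` rather than `io.StringIO`, and a library such as streamparse would wrap this loop behind its own bolt and spout classes.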
What are some alternatives?
mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services
Apache Spark - A unified analytics engine for large-scale data processing
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
dpark - Python clone of Spark, a MapReduce-like framework in Python
data-science-ipython-notebooks - Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Botflix - 🎥 Stream your favorite movie from the terminal!
thanos - Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.