streamparse
dpark
| streamparse | dpark
---|---|---
Mentions | 1 | -
Stars | 1,490 | 2,691
Growth | 0.1% | -0.1%
Activity | 0.0 | 0.0
Last commit | 8 months ago | over 3 years ago
Language | Python | Python
License | Apache License 2.0 | BSD 3-clause "New" or "Revised" License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
streamparse
Apache Heron: A realtime, distributed, fault-tolerant stream processing engine
Wonder why this is getting posted today in particular?
The quick summary here is that this was a clean-house rewrite of Apache Storm done by an internal team at Twitter. As an open source project history refresher, Apache Storm was originally built by a startup called Backtype, and the project was led by Nathan Marz, the technical founder of Backtype. Then, Backtype was acquired by Twitter, and Storm became a major component for large-scale stream processing (of tweets, tweet analytics, and other things) at Twitter.
I wrote a summary of the "interesting bits" of Apache Storm here:
https://blog.parse.ly/storm/
However, at a certain point, Nathan Marz left Twitter, and a different group of engineers tried to rethink Storm inside Twitter. There was also a lot of work going on around Apache Mesos at the time. Heron is kind of a merger of the two: their "rethinking" of Storm, plus the ability to manage Storm-like Heron clusters using Mesos.
But, I don't think Heron really took off. Meanwhile, Storm got very, very stable in the 1.x series, and then had a clean-house rewrite from Clojure to Java in the 2.x series. The last stable/major Storm release was in 2020.
Storm provides a stream processing programming API, a multi-lang wire protocol, and a cluster management approach. But certain cluster computing problems can probably be better solved at the infrastructure layer today. That said, it's still a very powerful system; on my team, we process 75K+ events per second across hundreds of vCPU cores and thousands of Python processes by combining Storm and Kafka with our open source project, streamparse.
https://github.com/Parsely/streamparse
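To make the "multi-lang wire protocol" mentioned above a bit more concrete, here is a minimal sketch of its message framing in plain Python: each message is a JSON object followed by a line containing only `end`. This is a simplified illustration, not streamparse's or Storm's actual implementation. The helper names (`read_message`, `send_message`, `split_sentence_bolt`) are made up for this example, and the initial handshake, heartbeats, and task-id replies that a real component has to handle are left out.

```python
import io
import json


def read_message(stream):
    """Read one JSON message terminated by a line containing only 'end'."""
    lines = []
    while True:
        line = stream.readline()
        if not line:  # EOF before a complete message arrived
            return None
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)


def send_message(stream, msg):
    """Write one JSON message followed by the 'end' delimiter line."""
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()


def split_sentence_bolt(inp, out):
    """Toy bolt (hypothetical): emit one tuple per word, then ack the input."""
    while (msg := read_message(inp)) is not None:
        sentence = msg["tuple"][0]
        for word in sentence.split():
            send_message(
                out,
                {"command": "emit", "anchors": [msg["id"]], "tuple": [word]},
            )
        send_message(out, {"command": "ack", "id": msg["id"]})


# In-memory streams stand in for the stdin/stdout pipes of a real worker.
inp = io.StringIO()
send_message(
    inp,
    {"id": "t1", "comp": "1", "stream": "default", "task": 9,
     "tuple": ["hello storm"]},
)
inp.seek(0)

out = io.StringIO()
split_sentence_bolt(inp, out)
out.seek(0)

replies = []
while (reply := read_message(out)) is not None:
    replies.append(reply)

for reply in replies:
    print(reply)
```

In practice a library like streamparse hides this framing behind bolt and spout classes, so application code only deals with tuples.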
(Also, I'd be remiss if I didn't mention -- if you're interested in stream processing and distributed computing, we are hiring Python Data Engineers to work on a stack involving Storm, Spark, Kafka, Cassandra, etc.) -- https://www.parse.ly/careers/python_data_engineer
dpark
We haven't tracked posts mentioning dpark yet.
Tracking mentions began in Dec 2020.
What are some alternatives?
Apache Spark - A unified analytics engine for large-scale data processing
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services
dumbo - Python module that allows one to easily write and run Hadoop programs.
Botflix - 🎥 Stream your favorite movie from the terminal!
tdigest - t-Digest data structure in Python. Useful for percentiles and quantiles, including distributed environments like PySpark
thanos - Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
data-science-ipython-notebooks - Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.