beam

Apache Beam is a unified programming model for Batch and Streaming data processing (by Apache)

Beam Alternatives

Similar projects and alternatives to beam

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a better beam alternative or higher similarity.


Reviews and mentions

Posts with mentions or reviews of beam. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2021-12-10.
  • Jinja2 not formatting my text correctly. Any advice?
    11 projects | reddit.com/r/learnpython | 10 Dec 2021
    ListItem(name='Apache Beam', website='https://beam.apache.org/', category='Batch Processing', short_description='Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream processing'),
  • The Data Engineer Roadmap 🗺
    11 projects | dev.to | 19 Oct 2021
    Apache Beam
  • Frameworks of the Future?
    9 projects | dev.to | 21 Jul 2021
    I asked a similar question in a different community, and the closest they came up with was the niche Apache Beam and the obligatory vague hand-waving about no-code systems. So, maybe DEV seeming to skew younger and more deliberately technical might get a better view of things? Is anybody using a "Framework of the Future" that we should know about?
  • Best library for CSV to XML or JSON.
    2 projects | reddit.com/r/javahelp | 1 Jul 2021
    Apache Beam may be what you're looking for. It will work with both Python and Java. It's used by GCP in the Cloud Dataflow service as a sort of streaming ETL tool. It occupies a similar niche to Spark, but is a little easier to use IMO.
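    As a hedged illustration of that suggestion (not code from the thread), here is a minimal CSV-to-JSON-lines pipeline with the Beam Python SDK; the file names and header columns are placeholder assumptions:

    ```python
    # Hedged sketch: CSV lines -> JSON lines with the Beam Python SDK.
    # File names and the header tuple are placeholders, not from the thread.
    import csv
    import io
    import json

    import apache_beam as beam

    def parse_csv_line(line, header=("id", "name", "price")):
        # csv.reader copes with quoted fields that a naive line.split(",")
        # breaks on (rows with embedded newlines would need a real CSV source).
        row = next(csv.reader(io.StringIO(line)))
        return dict(zip(header, row))

    with beam.Pipeline() as p:  # DirectRunner by default: runs locally
        (
            p
            | "ReadCsv" >> beam.io.ReadFromText("products.csv", skip_header_lines=1)
            | "Parse" >> beam.Map(parse_csv_line)
            | "ToJson" >> beam.Map(json.dumps)
            | "WriteJsonl" >> beam.io.WriteToText("products", file_name_suffix=".jsonl")
        )
    ```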
  • How to guarantee exactly once with Beam (on Flink) for side effects
    1 project | dev.to | 16 May 2021
    Now that we understand how exactly-once state consistency works, you might wonder about side effects, such as sending an email or writing to a database. That is a valid concern: Flink's recovery mechanism is not sufficient to provide end-to-end exactly-once guarantees even though the application state is exactly-once consistent. For example, if messages x and y from above contain the info and action to send an email, then during failure recovery these messages will be processed more than once, resulting in duplicate emails being sent. Therefore other techniques must be used, such as idempotent writes or transactional writes. To avoid the duplicate-email issue, we could save the key of each message that has been processed to a key-value store and check against it, but since stream processing means unbounded messages, it is tricky to manage such a key-value checker yourself at high throughput or over an unbounded time window. Here I will explain one approach to guaranteeing end-to-end exactly once, such as sending an email only once, with Apache Beam.
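    The post goes on to explain its own approach; as a hedged illustration of the general idea, here is a minimal sketch that deduplicates a side effect with Beam's per-key state. send_email and the (message_id, payload) element shape are hypothetical placeholders, not the article's code:

    ```python
    # Hedged sketch: fire a side effect at most once per key using Beam's
    # per-key state, so replays during failure recovery skip already-sent keys.
    import apache_beam as beam
    from apache_beam.coders import BooleanCoder
    from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

    def send_email(payload):
        print(f"sending email: {payload}")  # stand-in for the real side effect

    class SendEmailOnce(beam.DoFn):
        # One boolean state cell per key, checkpointed with the pipeline state.
        SEEN = ReadModifyWriteStateSpec("seen", BooleanCoder())

        def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
            message_id, payload = element  # input must be keyed by message_id
            if not seen.read():
                seen.write(True)
                send_email(payload)  # hypothetical side effect, once per key
            yield element
    ```

    Note that the state write and the send are still not atomic, so a crash between them can re-send; truly end-to-end exactly-once delivery needs the idempotent or transactional writes the post mentions.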
  • Apache Beam in Production
    1 project | reddit.com/r/dataengineering | 22 Apr 2021
    No, Beam comes with many runners and can run in many environments, including locally. The Python SDK is the easiest to get running, and here is the quickstart guide. It runs this simple wordcount script.
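    For concreteness, a hedged sketch of invoking the bundled wordcount example on the local DirectRunner; the paths are placeholders:

    ```python
    # Runs Beam's bundled wordcount example locally; no cloud account needed.
    from apache_beam.examples import wordcount

    wordcount.run(argv=[
        "--input=/tmp/input.txt",   # placeholder: any local text file
        "--output=/tmp/counts",     # placeholder: output path prefix
        "--runner=DirectRunner",    # local execution engine
    ])
    ```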
  • Ecosystem: Haskell vs JVM (Eta, Frege)
    4 projects | reddit.com/r/haskell | 30 Mar 2021
    Dataflow is Google's implementation of a runner for Apache Beam jobs in Google Cloud. Right now, Python and Java are pretty much the only two options supported for writing Beam jobs that run on Dataflow.
  • How likely a company uses streaming analytical/etl/ml or event based stream processing rather than using batch?
    1 project | reddit.com/r/dataengineering | 21 Mar 2021
  • Tricky Dataflow ep.2 : Import documents from MongoDB views
    1 project | dev.to | 17 Feb 2021
    As mentioned, MongoDbIO has a method to handle aggregations: withQueryFn. However, this method actually has a little bug in the current version (2.27) when the pipeline has multiple steps:
  • GCP Outpaces Azure, AWS in the 2021 Cloud Report
    4 projects | news.ycombinator.com | 17 Jan 2021
    Disclosure: I work on Google Cloud (and with the Dataflow folks on occasion).

    Sorry if you're getting mixed messages. Dataflow is here to stay. Google, Spotify, Twitter, and many other large customers heavily depend on it. Twitter moved their entire ad revenue pipeline to it [1] last year.

    A quick perusal, though, of https://github.com/apache/beam/commits/master shows decent Googler activity. Can you highlight where you were looking for "no visible contributions"? (Maybe we do a bad job of being visible?)

    [1] https://cloud.google.com/blog/products/data-analytics/modern...

  • Apache Beam for Search: Getting Started by Hacking Time
    1 project | news.ycombinator.com | 8 Jan 2021
    I'm not even sure the core Beam engineers understand it all! Look at how Kafka offset acks are handled now:

    https://github.com/apache/beam/blob/master/sdks/java/io/kafk...

  • Getting started with Apache Beam for distributed data processing
    5 projects | dev.to | 5 Dec 2020
    Let's start with the pipeline creation. The pipeline object is created with options specifying, for example, the configuration for the execution engine. The pipeline is used as a context manager with beam.Pipeline(options=options) as p so that it can be executed at __exit__.
    The bit-shift operator is overridden by defining __rrshift__ for PTransform to allow naming transforms. In the "Split" transform, each line is split into words. This collection of collections is flattened into a single collection with beam.FlatMap. The "PairWithOne" transform maps every word to a tuple (word, 1); the first item is the key and the second is the value. The key-value pairs are then fed to the "GroupAndSum" transform, where all values are summed up by key. This is parallelized word count!
    The "Format" transform maps every (word, count) pair with the format_result function. The output is written to output_file with the WriteToText transform.
    The pipeline definition is located in main.py. The file is almost identical to wordcount_minimal.py from the Beam examples. We'll go through the pipeline code soon, but if you're in a hurry, here are the details on how to execute the pipeline locally with the local execution engine (DirectRunner), without a Google Cloud account:
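    Putting those excerpts together, here is a minimal sketch of the pipeline they describe, assuming placeholder input/output paths; it mirrors Beam's wordcount_minimal.py but is not a verbatim copy:

    ```python
    import re

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(runner="DirectRunner")  # local execution engine

    # The pipeline runs at __exit__ of the context manager, as described above.
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("input.txt")  # placeholder path
            # "Split": one line in, many words out; FlatMap flattens the lists.
            | "Split" >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
            # "PairWithOne": map every word to a (word, 1) key-value pair.
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            # "GroupAndSum": sum the 1s per key, in parallel.
            | "GroupAndSum" >> beam.CombinePerKey(sum)
            # "Format": render each (word, count) pair as a text line.
            | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
            | "Write" >> beam.io.WriteToText("counts")  # placeholder prefix
        )
    ```

    The string on the left of each >> is the transform's name: str does not implement the operator for a PTransform operand, so Python falls back to PTransform.__rrshift__, exactly as the excerpt explains.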

Stats

Basic beam repo stats
Mentions: 15
Stars: 5,238
Activity: 10.0
Last Commit: 1 day ago

apache/beam is an open source project licensed under Apache License 2.0, which is an OSI approved license.
