|9 months ago||1 day ago|
|BSD 3-clause "New" or "Revised" License||Apache License 2.0|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Ecosystem: Haskell vs JVM (Eta, Frege)
If you want to interact with Java libraries, but don't need the Haskell program itself to run on the JVM, inline-java is probably your best bet at the moment.
Beginners Guide to Caching Inside an Apache Beam Dataflow Streaming Pipeline Using Python
1 project | dev.to | 9 Mar 2022
will do the job, but due to a bug in versions prior to this commit, the tag parameter is ignored. The cached object is reloaded even if you provide the same identifier, which renders the whole mechanism useless: our transformation hits our attached resources every time.
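The intended behavior of a tag-keyed cache can be sketched in plain Python. This is only an illustration of the idea in the snippet above, not Beam's code: `TaggedCache`, `acquire`, and `load_resource` are hypothetical names chosen for the example.

```python
class TaggedCache:
    """Hypothetical cache that reloads its object only when the tag changes."""

    def __init__(self):
        self._obj = None
        self._tag = None
        self._loaded = False

    def acquire(self, constructor, tag=None):
        # Rebuild the object on first use or when a different tag is given.
        if not self._loaded or tag != self._tag:
            self._obj = constructor()
            self._tag = tag
            self._loaded = True
        return self._obj


load_calls = []

def load_resource():
    # Stand-in for an expensive lookup against an attached resource.
    load_calls.append(1)
    return {"load_number": len(load_calls)}


cache = TaggedCache()
first = cache.acquire(load_resource, tag="v1")
second = cache.acquire(load_resource, tag="v1")  # same tag: served from cache
third = cache.acquire(load_resource, tag="v2")   # new tag: reloaded
```

If the tag were ignored, as in the bug described, every `acquire` call would behave like the `"v2"` case and call the constructor again.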
GCP to AWS
1 project | reddit.com/r/dataengineering | 29 Jan 2022
Apache Beam is a framework in which we can implement batch and streaming data processing pipelines independent of the underlying engine, e.g. Spark, Flink, Dataflow, etc.
A trick to have arbitrary infix operators in Python
5 projects | news.ycombinator.com | 25 Jan 2022
Apache Beam works this way. Code sample: https://github.com/apache/beam/blob/master/examples/multi-la...
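As a rough sketch of the operator-overloading idea behind this kind of syntax, the plain-Python example below uses the reflected operators `__ror__` and `__rrshift__` so that a labelled step can be applied with `|`. The `Step` class is a hypothetical stand-in, not a Beam class, and the example does not reproduce the linked code sample.

```python
class Step:
    """Hypothetical pipeline step; illustrates reflected operators only."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label

    def __rrshift__(self, label):
        # Runs for `"label" >> step`, because str does not define `>>`.
        return Step(self.fn, label)

    def __ror__(self, values):
        # Runs for `values | step`, because list does not define `|`.
        return [self.fn(v) for v in values]


# `>>` binds tighter than `|`, so the label is attached first.
labelled = "Double" >> Step(lambda x: x * 2)
doubled = [1, 2, 3] | "Double" >> Step(lambda x: x * 2)
```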
Jinja2 not formatting my text correctly. Any advice?
11 projects | reddit.com/r/learnpython | 10 Dec 2021
ListItem(name='Apache Beam', website='https://beam.apache.org/', category='Batch Processing', short_description='Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream processing'),
The Data Engineer Roadmap 🗺
11 projects | dev.to | 19 Oct 2021
Frameworks of the Future?
9 projects | dev.to | 21 Jul 2021
I asked a similar question in a different community, and the closest they came up with was the niche Apache Beam and the obligatory vague hand-waving about no-code systems. So, maybe DEV seeming to skew younger and more deliberately technical might get a better view of things? Is anybody using a "Framework of the Future" that we should know about?
Best library for CSV to XML or JSON.
2 projects | reddit.com/r/javahelp | 1 Jul 2021
Apache Beam may be what you're looking for. It will work with both Python and Java. It's used by GCP in the Cloud Dataflow service as a sort of streaming ETL tool. It occupies a similar niche to Spark, but is a little easier to use IMO.
How to guarantee exactly once with Beam(on Flink) for side effects
1 project | dev.to | 16 May 2021
Now that we understand how exactly-once state consistency works, you might ask: what about side effects, such as sending out an email or writing to a database? That is a valid concern, because Flink's recovery mechanism is not sufficient to provide end-to-end exactly-once guarantees even though the application state is exactly-once consistent. For example, if messages x and y from above contain the information and action to send out an email, then during failure recovery these messages will be processed more than once, which results in duplicate emails being sent. Therefore other techniques must be used, such as idempotent writes or transactional writes. To avoid the duplicate email issue, we could save the keys of the messages that have been processed to a key-value data store and check each incoming key against it. However, since stream processing means unbounded messages, it is tricky to manage this key-value checker yourself with large throughput or an unbounded time window. Here I will explain one approach to guarantee end-to-end exactly-once behavior, such as sending an email only once, with Apache Beam.
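The do-it-yourself key check described above can be sketched in plain Python. This is a minimal illustration, not the Beam approach the article goes on to explain: `ProcessedKeyStore` is a hypothetical in-memory stand-in for an external key-value store, and `send_email` is simulated by appending to a list.

```python
from typing import Callable, Dict, List, Tuple


class ProcessedKeyStore:
    """Hypothetical in-memory stand-in for an external key-value store."""

    def __init__(self) -> None:
        self._seen: Dict[str, bool] = {}

    def mark_if_new(self, key: str) -> bool:
        # Return True only the first time a key is recorded.
        if key in self._seen:
            return False
        self._seen[key] = True
        return True


def process(messages: List[Tuple[str, str]],
            store: ProcessedKeyStore,
            send_email: Callable[[str], None]) -> None:
    for key, body in messages:
        # Skip the side effect for keys that were already handled.
        if store.mark_if_new(key):
            send_email(body)


sent: List[str] = []
store = ProcessedKeyStore()
batch = [("x", "email for x"), ("y", "email for y")]

process(batch, store, sent.append)
process(batch, store, sent.append)  # replay, as after failure recovery
```

The replayed batch sends nothing, because both keys are already recorded. The sketch never expires keys, which is exactly the difficulty noted above for unbounded streams, and it marks a key before sending, so a crash between the two steps would lose that email rather than duplicate it.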
Apache Beam in Production
1 project | reddit.com/r/dataengineering | 22 Apr 2021
No, Beam comes with many runners and can run in many environments, including locally. The Python SDK is the easiest to get running and here is the quickstart guide. It runs this simple wordcount script.
Ecosystem: Haskell vs JVM (Eta, Frege)
Dataflow is Google's implementation of a runner for Apache Beam jobs in Google Cloud. Right now, Python and Java are pretty much the only two options supported for writing Beam jobs that run on Dataflow.
What are some alternatives?
Apache Hadoop - Apache Hadoop
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
Scio - A Scala API for Apache Beam and Google Cloud Dataflow.
Apache Spark - Apache Spark - A unified analytics engine for large-scale data processing
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Apache Hive - Apache Hive
shuffle - Shuffle tool used by UHC (Utrecht Haskell Compiler)
Apache HBase - Apache HBase
castle - A tool to manage shared cabal-install sandboxes.
dhall-check - Check all the .dh files in a directory against a given type signature
bumper - Haskell tool to automatically bump package versions transitively.