ELT, Data Pipeline

This page summarizes the projects mentioned and recommended in the original post on dev.to

  • kafkat

    KafkaT-ool

  • A few tips before I close this blog post: always provision enough memory on the Kafka server, as it is very memory intensive and tends to shut down every time it hits its heap size limit, without indicating anything in the logs. For monitoring your cluster, use Kafka Manager; it does the job very well. Also keep KafkaT on the server, which saves you from running the cumbersome built-in Kafka commands.

  • fluent-plugin-kafka

    Kafka input and output plugin for Fluentd

  • Before we moved to using the data from Kafka, we kept it running as our secondary data source, imported into BigQuery, so we could compare record counts. The counts coming from Kafka looked more complete, indicating that we were losing some data in our S3-based pipeline. As we added more and more datasets to Kafka, we eventually hit a road bump: a memory issue caused the server to fall out of the quorum, which was initially hard to identify from the errors on the Tron server. Issue 97 was opened on GitHub in the Fluentd Kafka plugin repository to discuss it. The problem was traced to a limitation in the plugin, bound to its use of the ruby-kafka gem, and a fix (PR294) was contributed to keep the issue from recurring. The fix is now part of fluent-plugin-kafka as well, which will benefit many others. Happy to say that the patch works well in production :). (A configuration sketch for the plugin follows this list.)

  • secor

    Secor is a service implementing Kafka log persistence

  • Once we had our producer working for Kafka, it was time for a consumer to start pulling data and pushing it to GCS. After some research on GitHub, we found Secor from Pinterest to be a viable option for our use. Though it is a great piece of software, it didn't map ideally to our design, so we submitted a few pull requests to the secor project to make the necessary changes, both for our use and for the greater good of the open-source community. These ranged from updating the docs on how to set it up (PR268, PR271, PR277) to adding a flexible upload directory structure with hourly support (PR275), support for a partitioned parser with no offset folder (PR279), and a flexible delimited file reader/writer option (PR291) for better control over file structure. The diagram below shows our current ELT pipeline running in production. (A small sketch of the hourly layout follows this list.)

  • Scio

    A Scala API for Apache Beam and Google Cloud Dataflow.

  • To counter the problem mentioned above, we decided to move our data to a Pub/Sub-based streaming model, where we would continue to push data as it arrives. As fluentd is the primary tool used on all our servers to gather data, rather than replacing it we leveraged its plugin architecture and used a plugin to stream data into a sink of our choosing. Initially our inclination was towards Google Pub/Sub and Google Dataflow, as our data scientists/engineers use BigQuery extensively and keeping the data in the same cloud made sense. The inspiration for using these tools came from Spotify's Event Delivery – The Road to the Cloud. We did the setup on one of our staging servers with Google Pub/Sub and Dataflow. Neither really worked out for us: the Pub/Sub model requires a subscriber to exist for the topic a publisher streams messages to, otherwise the messages are not stored, and on top of that there was no way to see which messages were arriving. The weirdest thing we encountered during this was that the topic would be orphaned, losing its subscribers, when working with Dataflow. We might have managed to live with Pub/Sub; the wall in our path was Dataflow. We started off using Scio from Spotify to work with Dataflow, but there is a considerable lack of documentation for it, and we found the community to be very reserved on GitHub, something quite evident in the world of Scala, which came up with a Code of Conduct for its user base to follow. What we needed from Dataflow was support for batch writes to GCS; after trying our hand at Dataflow with no success, Google's staff on Stack Overflow were quite responsive, and their response confirmed that this was not available with Dataflow, and that streaming data into BigQuery, Datastore or Bigtable as a datastore was the option to use. We didn't do that because we wanted to avoid the high streaming cost of storing data in these services, as the majority of our data team's jobs are based on batched hourly data. The initial proposal for the updated pipeline is shown below. (A minimal Scio sketch follows this list.)
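
As a reference for the fluentd-to-Kafka step described above, below is a minimal sketch of an output section using fluent-plugin-kafka. The tag pattern, broker addresses, and topic name are assumed values for illustration, not ones taken from the original post.

    <match app.**>
      @type kafka2
      # Assumed values; replace with your own brokers and topic.
      brokers kafka-1:9092,kafka-2:9092,kafka-3:9092
      default_topic app_events
      required_acks -1
      compression_codec gzip
      <format>
        @type json
      </format>
      <buffer topic>
        flush_interval 10s
      </buffer>
    </match>

With a buffered output like this, fluentd batches events per topic and retries failed flushes rather than dropping records.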
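
The hourly upload layout added to Secor in PR275 can be pictured with a small, purely hypothetical helper that maps a record timestamp to a date/hour-partitioned object prefix. This is only an illustration of the idea; the path pattern and names are assumptions, not Secor's actual code.

    import java.time.Instant
    import java.time.ZoneOffset
    import java.time.format.DateTimeFormatter

    // Hypothetical sketch: derive an hourly-partitioned upload prefix from a
    // record timestamp, in the spirit of a date/hour output layout.
    object HourlyPartition {
      private val day  = DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)
      private val hour = DateTimeFormatter.ofPattern("HH").withZone(ZoneOffset.UTC)

      // e.g. "app_events/dt=2017-01-21/hr=12" (topic and pattern are assumed)
      def uploadPrefix(topic: String, epochMillis: Long): String = {
        val ts = Instant.ofEpochMilli(epochMillis)
        s"$topic/dt=${day.format(ts)}/hr=${hour.format(ts)}"
      }
    }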
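
For the Scio attempt mentioned above, here is a minimal sketch of what a Scio pipeline looks like: a small batch job that reads text, counts words, and writes the results back out. The input and output paths are assumptions, and this only shows the shape of the API, not the pipeline from the post.

    import com.spotify.scio._

    // Minimal Scio sketch: read text, count words, write results.
    // The input/output arguments are assumed, e.g.
    //   --input=gs://my-bucket/logs/*.txt --output=gs://my-bucket/counts
    object WordCount {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.textFile(args("input"))
          .flatMap(_.split("\\W+").filter(_.nonEmpty))
          .countByValue
          .map { case (word, count) => s"$word\t$count" }
          .saveAsTextFile(args("output"))
        sc.run()
      }
    }

The same job can target Dataflow by passing the usual Beam options on the command line, for example --runner=DataflowRunner along with a project and region.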


Related posts

  • Are there any openly available data engineering projects using Scala and Spark which follow industry conventions like proper folder/package structures and object oriented division of classes/concerns? Most examples I’ve seen have everything in one file without proper separation of concerns.

    1 project | /r/dataengineering | 24 Jan 2023
  • Running multi-broker Kafka using docker

    2 projects | /r/apachekafka | 27 Sep 2022
  • For the DE's that choose Java over Python in new projects, why?

    1 project | /r/dataengineering | 2 Jun 2022
  • what popular companies uses Scala?

    3 projects | /r/scala | 14 May 2022
  • Scala or Python

    1 project | /r/dataengineering | 19 Apr 2022