Go data-engineering

Open-source Go projects categorized as data-engineering

Top 17 Go data-engineering Projects

  • argo

    Workflow Engine for Kubernetes

  • Project mention: StackStorm – IFTTT for Ops | news.ycombinator.com | 2023-11-05

    Like Argo Workflows?

    https://github.com/argoproj/argo-workflows

  • Benthos

    Fancy stream processing made operationally mundane

  • Project mention: Ask HN: Who is hiring? (December 2023) | news.ycombinator.com | 2023-12-01
  • cloudquery

    The open source high performance ELT framework powered by Apache Arrow

  • Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06

    Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.

  • lakeFS

    lakeFS - Data version control for your data lake | Git for data

  • Project mention: A Step-by-Step Guide to Implementing Data Version Control | dev.to | 2023-09-04

    # Download the LakeFS binary
    wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs
    # Make the binary executable
    chmod +x lakefs
    # Initialize LakeFS with S3 as the storage backend
    ./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key

  • memphis

    Memphis.dev is a highly scalable and effortless data streaming platform

  • Project mention: Memphis | /r/devopspro | 2023-05-11
  • incubator-devlake

    Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.

  • bacalhau

    Compute over Data framework for public, transparent, and optionally verifiable computation

  • Project mention: Deno Cron | news.ycombinator.com | 2023-11-29

    This is really interesting - we’ve tried really hard to solve some of these with Bacalhau[1] - a much simpler distributed compute platform. Would love your feedback!

    [1] https://github.com/bacalhau-project/bacalhau

    Disclosure: I co-founded Bacalhau

  • conduit

    Conduit streams data between data stores. Kafka Connect replacement. No JVM required. (by ConduitIO)

  • Project mention: Pulling CDC data from Postgres | /r/dataengineering | 2023-04-30

    I'd like to mention Conduit + its Postgres connector. The Pg connector comes built-in, so all that is needed is a single Conduit binary to get started. It relies on WAL, but the connector creates the replication slot itself (if needed).

  • substation

    Substation is a security analytics and data pipeline toolkit for the cloud (AWS) and more.

  • Project mention: Simple Change Data Capture (CDC) using AWS DynamoDB | /r/dataengineering | 2023-05-20

    Hi everyone, I thought the community might be interested in a recent update to Substation that makes setting up change data capture (CDC) on AWS DynamoDB easy. Features include:

  • Dataplane

    Dataplane is a data platform that makes it easy to construct a data mesh with automated data pipelines and workflows.

  • dud

    A lightweight CLI tool for versioning data alongside source code and building data pipelines.

  • Project mention: Ask HN: How do your ML teams version datasets and models? | news.ycombinator.com | 2023-09-28

    I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).

    In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django.

    [0]: https://github.com/kevin-hanselman/dud

  • beneath

    Beneath is a serverless real-time data platform ⚡️

  • rtdl

    rtdl makes it easy to build and maintain a real-time data lake (by realtimedatalake)

  • pippin

    Go library to create and manage data pipelines on your machine

  • Project mention: Go concurrency simplified. Part 4: Post office as a data pipeline | dev.to | 2023-12-21

    take a look at the concurrent code written by other devs out there: for example, feel free to check the internals of my library Pippin, but I bet there are many better projects out there to learn from - Google/Bing/DuckDuckGo/Kagi and ChatGPT can help to find the right one

  • amplify

    Bacalhau Amplify: automatic enrichment, enhancement, and explanation of your data (by bacalhau-project)

  • Project mention: Jupyter Lab Extension to run your GPU-heavy stuff (for free for now) on somebody's else server without blocking yours | /r/datascience | 2023-09-22

    When using Jupyter Lab and running GPU-heavy notebooks are you annoyed that your computer is not usable for anything else? I made an extension which allows you to run complex AI inference, training,... remotely on decentralized servers [see bacalhau.org]. This allows you to work on multiple GPU-heavy notebooks in parallel. For now Bacalhau is free, so this is a really cool way to run GPU stuff.

  • Shift

    Shift is a high performance better alternative to Airbyte, Singer, Meltano (by piyushsingariya)

  • Project mention: Alternative to Airbyte, Singer and Meltano | /r/dataengineering | 2023-08-11

    As a side hobby, I started working on this personal project: https://github.com/piyushsingariya/Kaku

  • csv2opensearch

    Import CSV files into OpenSearch or Elasticsearch

  • Project mention: Create a search engine with PostgreSQL: Postgres vs Elasticsearch | dev.to | 2023-07-31

    I was curious to know at roughly what amount of data Postgres slows down compared to Elasticsearch. On the movies dataset (34K rows) that we used in part 1, all queries were reasonably fast (<300 ms). So for the testing here, I chose a larger data set: a recipes dataset from Kaggle, containing 2.3M recipes. The commands to load the CSV file in PostgreSQL can be found in this gist. For Elasticsearch, I've loaded the same CSV file using this tool.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source data-engineering projects in Go? This list will help you:

#    Project              Stars
1    argo                 14,282
2    Benthos               7,559
3    cloudquery            5,581
4    lakeFS                4,058
5    memphis               3,145
6    incubator-devlake     2,424
7    bacalhau                602
8    conduit                 342
9    substation              275
10   Dataplane               183
11   dud                     166
12   beneath                  78
13   rtdl                     43
14   pippin                   14
15   amplify                  10
16   Shift                     8
17   csv2opensearch            6
