Top 17 Go data-engineering Projects

argo

43 14,282 9.8 Go

Workflow Engine for Kubernetes

Project mention: StackStorm – IFTTT for Ops | news.ycombinator.com | 2023-11-05

Like Argo Workflows?
https://github.com/argoproj/argo-workflows

Benthos

76 7,559 9.6 Go

Fancy stream processing made operationally mundane

Project mention: Ask HN: Who is hiring? (December 2023) | news.ycombinator.com | 2023-12-01

cloudquery

102 5,581 10.0 Go

The open source high performance ELT framework powered by Apache Arrow

Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06

Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.

lakeFS

48 4,058 9.8 Go

lakeFS - Data version control for your data lake | Git for data

Project mention: A Step-by-Step Guide to Implementing Data Version Control | dev.to | 2023-09-04

# Download the LakeFS binary
wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs
# Make the binary executable
chmod +x lakefs
# Initialize LakeFS with S3 as the storage backend
./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key

memphis

52 3,145 9.9 Go

Memphis.dev is a highly scalable and effortless data streaming platform

Project mention: Memphis | /r/devopspro | 2023-05-11

incubator-devlake

10 2,424 9.9 Go

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.

bacalhau

12 602 9.8 Go

Compute over Data framework for public, transparent, and optionally verifiable computation

Project mention: Deno Cron | news.ycombinator.com | 2023-11-29

This is really interesting - we’ve tried really hard to solve some of these with Bacalhau[1] - a much simpler distributed compute platform. Would love your feedback!
[1] https://github.com/bacalhau-project/bacalhau
Disclosure: I co-founded Bacalhau

conduit

7 342 9.4 Go

Conduit streams data between data stores. Kafka Connect replacement. No JVM required. (by ConduitIO)

Project mention: Pulling CDC data from Postgres | /r/dataengineering | 2023-04-30

I'd like to mention Conduit + its Postgres connector. The Pg connector comes built-in, so all that is needed is a single Conduit binary to get started. It relies on WAL, but the connector creates the replication slot itself (if needed).

substation

10 275 7.3 Go

Substation is a security analytics and data pipeline toolkit for the cloud (AWS) and more.

Project mention: Simple Change Data Capture (CDC) using AWS DynamoDB | /r/dataengineering | 2023-05-20

Hi everyone, I thought the community might be interested in a recent update to Substation that makes setting up change data capture (CDC) on AWS DynamoDB easy. Features include:

Dataplane

1 183 8.3 Go

Dataplane is a data platform that makes it easy to construct a data mesh with automated data pipelines and workflows.

dud

14 166 6.3 Go

A lightweight CLI tool for versioning data alongside source code and building data pipelines.

Project mention: Ask HN: How do your ML teams version datasets and models? | news.ycombinator.com | 2023-09-28

I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).
In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django.
[0]: https://github.com/kevin-hanselman/dud

beneath

2 78 0.0 Go

Beneath is a serverless real-time data platform ⚡️

rtdl

2 43 0.0 Go

rtdl makes it easy to build and maintain a real-time data lake (by realtimedatalake)

pippin

3 14 6.7 Go

Go library to create and manage data pipelines on your machine

Project mention: Go concurrency simplified. Part 4: Post office as a data pipeline | dev.to | 2023-12-21

take a look at the concurrent code written by other devs out there: for example, feel free to check the internals of my library Pippin, but I bet there are many better projects out there to learn from - Google/Bing/DuckDuckGo/Kagi and ChatGPT can help to find the right one

amplify

3 10 7.5 Go

Bacalhau Amplify: automatic enrichment, enhancement, and explanation of your data (by bacalhau-project)

Project mention: Jupyter Lab Extension to run your GPU-heavy stuff (for free for now) on somebody's else server without blocking yours | /r/datascience | 2023-09-22

When using Jupyter Lab and running GPU-heavy notebooks, are you annoyed that your computer is not usable for anything else? I made an extension that allows you to run complex AI inference, training,... remotely on decentralized servers [see bacalhau.org]. This allows you to work on multiple GPU-heavy notebooks in parallel. For now Bacalhau is free, so this is a really cool way to run GPU stuff.

Shift

1 8 8.1 Go

Shift is a high performance better alternative to Airbyte, Singer, Meltano (by piyushsingariya)

Project mention: Alternative to Airbyte, Singer and Meltano | /r/dataengineering | 2023-08-11

As a side hobby, I started working on this personal project: https://github.com/piyushsingariya/Kaku

csv2opensearch

1 6 4.4 Go

Import CSV files into OpenSearch or Elasticsearch

Project mention: Create a search engine with PostgreSQL: Postgres vs Elasticsearch | dev.to | 2023-07-31

I was curious to know at roughly what amount of data Postgres slows down compared to Elasticsearch. On the movies dataset (34K rows) that we used in part 1, all queries were reasonably fast (<300 ms). So for the testing here, I chose a larger data set: a recipes dataset from Kaggle, containing 2.3M recipes. The commands to load the CSV file in PostgreSQL can be found in this gist. For Elasticsearch, I've loaded the same CSV file using this tool.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Go data-engineering related posts

Go concurrency simplified. Part 1: Channels and goroutines
2 projects | dev.to | 8 Dec 2023
Migrate mongodb Datawarehouse to snowflake
1 project | /r/snowflake | 4 Dec 2023
Jupyter Lab Extension to run your GPU-heavy stuff (for free for now) on somebody's else server without blocking yours
2 projects | /r/datascience | 22 Sep 2023
Preventing replication slot overflow on Postgres DB (AWS RDS)
1 project | news.ycombinator.com | 11 Sep 2023
Preventing WAL Growth on Postgres DB Running on AWS RDS
1 project | news.ycombinator.com | 10 Sep 2023
A Step-by-Step Guide to Implementing Data Version Control
1 project | dev.to | 4 Sep 2023
Launch HN: Artie (YC S23) – Real time data replication to data warehouses
4 projects | news.ycombinator.com | 24 Jul 2023

Index

What are some of the best open-source data-engineering projects in Go? This list will help you:

	Project	Stars
1	argo	14,282
2	Benthos	7,559
3	cloudquery	5,581
4	lakeFS	4,058
5	memphis	3,145
6	incubator-devlake	2,424
7	bacalhau	602
8	conduit	342
9	substation	275
10	Dataplane	183
11	dud	166
12	beneath	78
13	rtdl	43
14	pippin	14
15	amplify	10
16	Shift	8
17	csv2opensearch	6