Top 17 Go data-engineering Projects
-
incubator-devlake
Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
-
conduit
Conduit streams data between data stores. Kafka Connect replacement. No JVM required. (by ConduitIO)
-
substation
Substation is a security analytics and data pipeline toolkit for the cloud (AWS) and more.
-
Dataplane
Dataplane is a data platform that makes it easy to construct a data mesh with automated data pipelines and workflows.
-
amplify
Bacalhau Amplify: automatic enrichment, enhancement, and explanation of your data (by bacalhau-project)
-
Shift
Shift is a high-performance alternative to Airbyte, Singer, and Meltano (by piyushsingariya)
Like Argo Workflows?
https://github.com/argoproj/argo-workflows
Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06
Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.
# Download the LakeFS binary
wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs
# Make the binary executable
chmod +x lakefs
# Initialize LakeFS with S3 as the storage backend
./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key
This is really interesting - we’ve tried really hard to solve some of these with Bacalhau[1] - a much simpler distributed compute platform. Would love your feedback!
[1] https://github.com/bacalhau-project/bacalhau
Disclosure: I co-founded Bacalhau
I'd like to mention Conduit + its Postgres connector. The Pg connector comes built-in, so all that is needed is a single Conduit binary to get started. It relies on WAL, but the connector creates the replication slot itself (if needed).
Project mention: Simple Change Data Capture (CDC) using AWS DynamoDB | /r/dataengineering | 2023-05-20
Hi everyone, I thought the community might be interested in a recent update to Substation that makes setting up change data capture (CDC) on AWS DynamoDB easy. Features include:
Project mention: Ask HN: How do your ML teams version datasets and models? | news.ycombinator.com | 2023-09-28
I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).
In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django.
[0]: https://github.com/kevin-hanselman/dud
Project mention: Go concurrency simplified. Part 4: Post office as a data pipeline | dev.to | 2023-12-21
Take a look at the concurrent code written by other devs out there: for example, feel free to check the internals of my library Pippin, but I bet there are many better projects out there to learn from. Google/Bing/DuckDuckGo/Kagi and ChatGPT can help you find the right one.
Project mention: Jupyter Lab Extension to run your GPU-heavy stuff (for free for now) on somebody's else server without blocking yours | /r/datascience | 2023-09-22
When using Jupyter Lab and running GPU-heavy notebooks, are you annoyed that your computer is not usable for anything else? I made an extension that lets you run complex AI inference, training, ... remotely on decentralized servers [see bacalhau.org]. This allows you to work on multiple GPU-heavy notebooks in parallel. For now Bacalhau is free, so this is a really cool way to run GPU stuff.
As a side hobby, I started working on this personal project: https://github.com/piyushsingariya/Kaku
Project mention: Create a search engine with PostgreSQL: Postgres vs Elasticsearch | dev.to | 2023-07-31
I was curious to know at roughly what amount of data Postgres slows down compared to Elasticsearch. On the movies dataset (34K rows) that we used in part 1, all queries were reasonably fast (<300 ms). So for the testing here, I chose a larger data set: a recipes dataset from Kaggle, containing 2.3M recipes. The commands to load the CSV file in PostgreSQL can be found in this gist. For Elasticsearch, I've loaded the same CSV file using this tool.
Go data-engineering related posts
- Go concurrency simplified. Part 1: Channels and goroutines
- Migrate mongodb Datawarehouse to snowflake
- Jupyter Lab Extension to run your GPU-heavy stuff (for free for now) on somebody's else server without blocking yours
- Preventing replication slot overflow on Postgres DB (AWS RDS)
- Preventing WAL Growth on Postgres DB Running on AWS RDS
- A Step-by-Step Guide to Implementing Data Version Control
- Launch HN: Artie (YC S23) – Real time data replication to data warehouses
-
A note from our sponsor - WorkOS
workos.com | 25 Apr 2024
Index
What are some of the best open-source data-engineering projects in Go? This list will help you:
# | Project | Stars |
---|---|---|
1 | argo | 14,282 |
2 | Benthos | 7,559 |
3 | cloudquery | 5,581 |
4 | lakeFS | 4,058 |
5 | memphis | 3,145 |
6 | incubator-devlake | 2,424 |
7 | bacalhau | 602 |
8 | conduit | 342 |
9 | substation | 275 |
10 | Dataplane | 183 |
11 | dud | 166 |
12 | beneath | 78 |
13 | rtdl | 43 |
14 | pippin | 14 |
15 | amplify | 10 |
16 | Shift | 8 |
17 | csv2opensearch | 6 |