Go data-engineering

Open-source Go projects categorized as data-engineering

Top 18 Go data-engineering Projects

data-engineering
  • argo

    Workflow Engine for Kubernetes

    Project mention: StackStorm – IFTTT for Ops | news.ycombinator.com | 2023-11-05

    Like Argo Workflows?

    https://github.com/argoproj/argo-workflows

  • connect

    Fancy stream processing made operationally mundane (by redpanda-data)

    Project mention: Bento, the open source fork of the project formerly known as Benthos | news.ycombinator.com | 2024-05-31

    It feels pretty uncharitable for Redpanda to enforce their terms when they haven't done anything of value with it yet. They made a bold claim that you'll have to pay them to use these features, but you certainly don't as they're still available under MIT licensing.

    One does not simply buy Open Source Software.

    Until Redpanda actually makes any code changes, the ~three now-proprietary plugins are still available as Open Source Software: just browse to the commit before they slapped their license at the top.

    These are all MIT and bit-for-bit identical to the now-proprietary plugins:

    - Splunk HEC: https://github.com/redpanda-data/connect/blob/e653dc3f8a6eee...

    - Snowflake: https://github.com/redpanda-data/connect/blob/e653dc3f8a6eee...

    - Kafka topic logger: https://github.com/redpanda-data/connect/blob/e653dc3f8a6eee...

  • cloudquery

    The open source high performance ELT framework powered by Apache Arrow

    Project mention: We might want to regularly keep track of how important each server is | news.ycombinator.com | 2024-02-06

    Check out CloudQuery - https://github.com/cloudquery/cloudquery for an easy cloud asset inventory.

  • lakeFS

    lakeFS - Data version control for your data lake | Git for data

    Project mention: A Step-by-Step Guide to Implementing Data Version Control | dev.to | 2023-09-04

    # Download the LakeFS binary
    wget https://github.com/treeverse/lakeFS/releases/latest/download/lakefs
    # Make the binary executable
    chmod +x lakefs
    # Initialize LakeFS with S3 as the storage backend
    ./lakefs init --backend s3 --s3-gateway-endpoint --s3-region --s3-force-path-style --s3-access-key --s3-secret-key

  • memphis

    Memphis.dev is a highly scalable and effortless data streaming platform

  • incubator-devlake

    Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.

    Project mention: Engineering Metrics Are Overrated | dev.to | 2024-07-03

    Well, that's the hard part, yet it can be easy. You could build your own integrations, crunch the data, and present it, leveraging open source; however, I would guess this is not your core business. Therefore, I'd argue that purchasing a SaaS platform off the shelf is the easy way. You'll just need to evaluate the market and ensure it fits your needs, your toolchain, etc. There is some open source out there (Apache DevLake) in this space; I'd encourage you to take a look and see if it suits your needs, wants, and desires.

  • bacalhau

    Compute over Data framework for public, transparent, and optionally verifiable computation

    Project mention: Deno Cron | news.ycombinator.com | 2023-11-29

    This is really interesting - we’ve tried really hard to solve some of these with Bacalhau[1] - a much simpler distributed compute platform. Would love your feedback!

    [1] https://github.com/bacalhau-project/bacalhau

    Disclosure: I co-founded Bacalhau

  • conduit

    Conduit streams data between data stores. Kafka Connect replacement. No JVM required. (by ConduitIO)

  • substation

    Substation is a toolkit for routing, normalizing, and enriching security event and audit logs.

  • Dataplane

    Dataplane is a data platform that makes it easy to construct a data mesh with automated data pipelines and workflows.

  • dud

    A lightweight CLI tool for versioning data alongside source code and building data pipelines.

    Project mention: Ask HN: How do your ML teams version datasets and models? | news.ycombinator.com | 2023-09-28

    I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).

    In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django.

    [0]: https://github.com/kevin-hanselman/dud

  • beneath

    Beneath is a serverless real-time data platform ⚡️

  • rtdl

    rtdl makes it easy to build and maintain a real-time data lake (by realtimedatalake)

  • Gear5

    high performance better alternative to Airbyte, Singer, Meltano

    Project mention: Alternative to Airbyte, Singer and Meltano | /r/dataengineering | 2023-08-11

    As a side hobby, I started working on this personal project: https://github.com/piyushsingariya/Kaku

  • pippin

    Go library to create and manage data pipelines on your machine

    Project mention: Go concurrency simplified. Part 4: Post office as a data pipeline | dev.to | 2023-12-21

    take a look at the concurrent code written by other devs out there: for example, feel free to check the internals of my library Pippin, but I bet there are many better projects out there to learn from - Google/Bing/DuckDuckGo/Kagi and ChatGPT can help to find the right one

  • amplify

    Bacalhau Amplify: automatic enrichment, enhancement, and explanation of your data (by bacalhau-project)

    Project mention: Jupyter Lab Extension to run your GPU-heavy stuff (for free for now) on somebody's else server without blocking yours | /r/datascience | 2023-09-22

    When using Jupyter Lab and running GPU-heavy notebooks are you annoyed that your computer is not usable for anything else? I made an extension which allows you to run complex AI inference, training,... remotely on decentralized servers [see bacalhau.org]. This allows you to work on multiple GPU-heavy notebooks in parallel. For now Bacalhau is free, so this is a really cool way to run GPU stuff.

  • csv2opensearch

    Import CSV files into OpenSearch or Elasticsearch

    Project mention: Create a search engine with PostgreSQL: Postgres vs Elasticsearch | dev.to | 2023-07-31

    I was curious to know at roughly what amount of data Postgres slows down compared to Elasticsearch. On the movies dataset (34K rows) that we used in part 1, all queries were reasonably fast (<300 ms). So for the testing here, I chose a larger data set: a recipes dataset from Kaggle, containing 2.3M recipes. The commands to load the CSV file in PostgreSQL can be found in this gist. For Elasticsearch, I've loaded the same CSV file using this tool.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Go data-engineering related posts

  • Engineering Metrics Are Overrated

    1 project | dev.to | 3 Jul 2024
  • Go concurrency simplified. Part 1: Channels and goroutines

    2 projects | dev.to | 8 Dec 2023
  • Migrate mongodb Datawarehouse to snowflake

    1 project | /r/snowflake | 4 Dec 2023
  • Jupyter Lab Extension to run your GPU-heavy stuff (for free for now) on somebody's else server without blocking yours

    2 projects | /r/datascience | 22 Sep 2023
  • Preventing replication slot overflow on Postgres DB (AWS RDS)

    1 project | news.ycombinator.com | 11 Sep 2023
  • Preventing WAL Growth on Postgres DB Running on AWS RDS

    1 project | news.ycombinator.com | 10 Sep 2023
  • A Step-by-Step Guide to Implementing Data Version Control

    1 project | dev.to | 4 Sep 2023

Index

What are some of the best open-source data-engineering projects in Go? This list will help you:

Rank Project Stars
1 argo 14,575
2 connect 7,999
3 cloudquery 5,709
4 lakeFS 4,208
5 memphis 3,192
6 incubator-devlake 2,514
7 bacalhau 643
8 conduit 365
9 substation 298
10 Dataplane 202
11 dud 169
12 beneath 81
13 rtdl 44
14 Gear5 16
15 pippin 14
16 blink 14
17 amplify 11
18 csv2opensearch 7

Did you know that Go is the 4th most popular programming language based on number of mentions?