How-to-Guide: Contributing to Open Source

This page summarizes the projects mentioned and recommended in the original post on reddit.com/r/dataengineering

  • awesome-for-beginners

    A list of awesome beginners-friendly projects.

    Here is a list of open source projects that are said to be awesome for beginners.

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • missing-semester

    The Missing Semester of Your CS Education 📚

    If you’re still new to development in general and not that comfortable with development tools (an IDE, the terminal, etc.), check out The Missing Semester of Your CS Education. It covers the practical side of coding that isn’t taught in university courses. You can work through it alongside your first contributions.

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • dbt-core

    dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

    dbt Core

  • Apache Parquet

    Apache Parquet

  • Apache Avro

    Apache Avro is a data serialization system.

  • sqlfluff

    A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

    SQLFluff

  • Apache Arrow

    Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

  • Apache Cassandra

    Mirror of Apache Cassandra

  • Apache Hadoop

    Apache Hadoop

  • Apache Kafka

    Mirror of Apache Kafka

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    Delta Lake

  • pinot

    Apache Pinot - A realtime distributed OLAP datastore

  • nifi

    Apache NiFi

  • hudi

    Upserts, Deletes And Incremental Processing on Big Data.

    Apache Hudi

  • versatile-data-kit

    Build, run and manage your data pipelines with Python or SQL on any cloud

  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

    Although Trino (formerly PrestoSQL) is on the awesome-for-beginners list, it’s also a really good data engineering project: it is a distributed query engine that connects to most of the projects listed above. Depending on where you work in the project, you can gain depth of knowledge on the query engine, breadth across all the connectors, or a mix of both.

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

    As our project grows, we've seen first-hand how difficult it is for others to contribute to open source: setting up the development environment, understanding the codebase, drafting a PR, and so on. We've learned a lot from helping others contribute successfully to our project, so we shared our thoughts in a blog post. Don't hesitate to reach out if you need help! We're happy to help you contribute to any of our projects or to any other.
