How-to-Guide: Contributing to Open Source

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  • awesome-for-beginners

    A list of awesome beginners-friendly projects.

    Here is a list of open source projects that are said to be awesome for beginners.

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • missing-semester

    The Missing Semester of Your CS Education 📚

    If you’re still new to development in general and not that comfortable with development tools (an IDE, the terminal, etc.), check out The Missing Semester of Your CS Education. It covers the practical side of coding that university courses don’t teach. Work through it as you go.

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • dbt-core

    dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

    dbt Core

  • Apache Parquet

    Apache Parquet

  • Apache Avro

    Apache Avro is a data serialization system.

  • sqlfluff

    A modular SQL linter and auto-formatter with support for multiple dialects and templated code.


  • Apache Arrow

    Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

  • Apache Cassandra

    Mirror of Apache Cassandra

  • Apache Hadoop

    Apache Hadoop

  • Apache Kafka

    Mirror of Apache Kafka

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    Delta Lake

  • pinot

    Apache Pinot - A realtime distributed OLAP datastore

  • nifi

    Apache NiFi

  • hudi

    Upserts, Deletes And Incremental Processing on Big Data.

    Apache Hudi

  • versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL.

    Although Trino (formerly Presto) is in the awesome-for-beginners list, it’s also a really good data engineering project: it is a distributed query engine that connects to most of the projects listed above. Depending on where you work in the project, you can gain depth of knowledge on the query engine, breadth across all the connectors, or a mix of both.

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

    As our project grows, we've seen first-hand how difficult it is for others to contribute to open source: setting up the development environment, understanding the codebase, drafting a PR, and so on. We've learned a lot from helping others successfully contribute to our project, so we shared our thoughts in a blog post. Don't hesitate to reach out if you need help! Happy to help you contribute to any of our projects or any other!

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
