How-to-Guide: Contributing to Open Source

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  • awesome-for-beginners

    A list of awesome beginner-friendly projects.

  • Here is a list of open source projects that are said to be awesome for beginners.

  • Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

  • Apache Spark

  • missing-semester

    The Missing Semester of Your CS Education 📚

  • If you’re still new to development in general and not yet comfortable with development tools (an IDE, the terminal, etc.), check out The Missing Semester of Your CS Education. It covers the practical sides of coding that aren’t taught in university courses. You can learn this along the way.

  • Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  • Apache Airflow

  • dbt-core

    dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

  • dbt Core

  • Apache Parquet

    Apache Parquet

  • Apache Parquet

  • Apache Avro

    Apache Avro is a data serialization system.

  • Apache Avro

  • sqlfluff

    A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

  • SQLFluff

  • Apache Arrow

    Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

  • Apache Arrow

  • Apache Cassandra

    Mirror of Apache Cassandra

  • Apache Cassandra

  • Apache Hadoop

    Apache Hadoop

  • Apache Hadoop

  • Apache Kafka

    Mirror of Apache Kafka

  • Apache Kafka

  • delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

  • Delta Lake

  • pinot

    Apache Pinot - A realtime distributed OLAP datastore

  • Apache Pinot

  • nifi

    Apache NiFi

  • Apache NiFi

  • hudi

    Upserts, Deletes And Incremental Processing on Big Data.

  • Apache Hudi

  • versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

  • Trino

    Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

  • Although Trino (formerly PrestoSQL) is in the awesome-for-beginners list, it’s also a really good DE project, as it is a distributed query engine that connects to most of the projects listed above. So, depending on where you work in this project, you can gain depth of knowledge on the query engine, breadth across all the connectors, or go hybrid.

  • ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

  • As our project grows, we've seen first-hand how difficult it is for others to contribute to open source: setting up the development environment, understanding the codebase, drafting a PR, and so on. We've learned a lot from helping others successfully contribute to our project, so we share our thoughts in a blog post. Don't hesitate to reach out if you need help! We're happy to help you contribute to any of our projects or any other.

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
