How-to-Guide: Contributing to Open Source

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  1. awesome-for-beginners

    A list of awesome beginners-friendly projects.

    Here is a list of open source projects that are said to be awesome for beginners.

  3. Apache Spark

    Apache Spark - A unified analytics engine for large-scale data processing

    Apache Spark

  4. missing-semester

    The Missing Semester of Your CS Education 📚

    If you’re still new to development in general and not yet comfortable with development tools (an IDE, the terminal, etc.), check out The Missing Semester of Your CS Education. It covers the practical side of coding that university courses don’t teach. Learn it along the way.

  5. Airflow

    Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

    Apache Airflow

  6. dbt-core

    dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

    dbt Core

  7. Apache Parquet

    Apache Parquet Java

    Apache Parquet

  8. Apache Avro

    Apache Avro is a data serialization system.

    Apache Avro

  10. sqlfluff

    A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

    SQLFluff

  11. Apache Arrow

    Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

    Apache Arrow

  12. Apache Cassandra

    Apache Cassandra®

    Apache Cassandra

  13. Apache Hadoop

    Apache Hadoop

  14. Apache Kafka

    Mirror of Apache Kafka

    Apache Kafka

  15. delta

    An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

    Delta Lake

  16. pinot

    Apache Pinot - A realtime distributed OLAP datastore

    Apache Pinot

  17. nifi

    Apache NiFi

  18. hudi

    Upserts, Deletes And Incremental Processing on Big Data.

    Apache Hudi

  19. versatile-data-kit

    One framework to develop, deploy and operate data workflows with Python and SQL.

  20. Trino

    Official repository of Trino, the distributed SQL query engine for big data, former

    Although Trino (formerly Presto) is on the awesome-for-beginners list, it’s also a really good DE project, as it is a distributed query engine that connects to most of the projects listed above. So depending on where you work in this project, you can gain depth of knowledge on the query engine or breadth across all the connectors, or go hybrid.

  21. ploomber

    The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

    As our project grows, we've seen first-hand how difficult it is for others to contribute to open source: setting up the development environment, understanding the codebase, drafting a PR, and so on. We've learned a lot from helping others successfully contribute to our project, so we share our thoughts here in a blog post. Don't hesitate to reach out if you need help! We're happy to help you contribute to any of our projects or any other!

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
