How-to-Guide: Contributing to Open Source

awesome-for-beginners

106 63,687 0.0

A list of awesome beginners-friendly projects.

Here is a list of open source projects that are said to be awesome for beginners.

Apache Spark

101 38,378 10.0 Scala

Apache Spark - A unified analytics engine for large-scale data processing

Apache Spark

missing-semester

375 4,694 6.8 CSS

The Missing Semester of Your CS Education 📚

If you’re still new to development in general and not yet comfortable with development tools (an IDE, the terminal, etc.), check out The Missing Semester of Your CS Education. It covers the practical side of coding that isn’t taught in university courses. Learn it along the way.

Airflow

169 34,485 10.0 Python

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Apache Airflow

dbt-core

86 8,881 9.7 Python

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

dbt Core

Apache Parquet

4 2,407 9.2 Java

Apache Parquet

Apache Parquet

Apache Avro

22 2,764 9.7 Java

Apache Avro is a data serialization system.

Apache Avro

sqlfluff

35 7,199 9.6 Python

A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

SQLFluff

Apache Arrow

75 13,523 10.0 C++

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Apache Arrow

Apache Cassandra

35 8,510 9.9 Java

Mirror of Apache Cassandra

Apache Cassandra

Apache Hadoop

26 14,316 9.9 Java

Apache Hadoop

Apache Hadoop

Apache Kafka

26 27,335 9.9 Java

Mirror of Apache Kafka

Apache Kafka

delta

69 6,897 9.8 Scala

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)

Delta Lake

pinot

15 5,119 9.9 Java

Apache Pinot - A realtime distributed OLAP datastore

Apache Pinot

nifi

35 4,381 9.9 Java

Apache NiFi

Apache NiFi

hudi

20 5,066 9.9 Java

Upserts, Deletes And Incremental Processing on Big Data.

Apache Hudi

versatile-data-kit

52 410 9.7 Python

One framework to develop, deploy and operate data workflows with Python and SQL.

Trino

44 9,552 10.0 Java

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Although Trino (formerly PrestoSQL) is on the awesome-for-beginners list, it’s also a really good data engineering (DE) project, since it is a distributed query engine that connects to most of the projects listed above. Depending on where you work in the project, you can gain depth of knowledge in the query engine, breadth across all the connectors, or a hybrid of both.

ploomber

121 3,374 7.4 Python

The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

As our project grows, we've seen first-hand how difficult it is for others to contribute to open source: setting up the development environment, understanding the codebase, drafting a PR, and so on. We've learned a lot from helping others successfully contribute to our project, so we share our thoughts in a blog post. Don't hesitate to reach out if you need help! We're happy to help you contribute to any of our projects or any other.

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering
Tags: Java, Big Data, Python, HacktoberFest, Analytics
Post date: 11 Jun 2022
