Open source contributions for a Data Engineer?

quinn

Mentions: 9 | Stars: 576 | Activity: 9.2 | Language: Python

pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)

I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

chispa

Mentions: 12 | Stars: 508 | Activity: 6.7 | Language: Python

PySpark test helper methods with beautiful error messages

I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

spark-daria

Mentions: 4 | Stars: 742 | Activity: 0.0 | Language: Scala

Essential Spark extensions and helper methods ✨😲

I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

spark-fast-tests

Mentions: 6 | Stars: 418 | Activity: 0.0 | Language: Scala

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

soda-sql

Mentions: 25 | Stars: 50 | Activity: 8.2 | Language: Python

Discontinued. Data profiling, testing, and monitoring for SQL-accessible data.

If you are interested in using/learning Python, SQL and data warehouse skills, take a look at https://github.com/sodadata/soda-sql

ballista

Mentions: 20 | Stars: 2,238 | Activity: 9.3 | Language: Rust

Discontinued. Distributed compute platform implemented in Rust, and powered by Apache Arrow.

His newer project, Ballista, was also donated to Apache Arrow. I hope to get the Rust skills to collaborate with him on open source work someday too. He's also doing really cool work on spark-rapids FYI.

spark-rapids

Mentions: 3 | Stars: 720 | Activity: 9.8 | Language: Scala

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

His newer project, Ballista, was also donated to Apache Arrow. I hope to get the Rust skills to collaborate with him on open source work someday too. He's also doing really cool work on spark-rapids FYI.

airbyte

Mentions: 139 | Stars: 13,923 | Activity: 10.0 | Language: Python

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

Airbyte and Singer/Meltano if you want to learn more about ingestion pipelines. Airbyte and Meltano teams are very welcoming. SQLfluff a shiny SQL linter. Beautiful project with awesome maintainers.

meltano

Mentions: 10 | Stars: - | Activity: -

Airbyte and Singer/Meltano if you want to learn more about ingestion pipelines. Airbyte and Meltano teams are very welcoming. SQLfluff a shiny SQL linter. Beautiful project with awesome maintainers.

sqlfluff

Mentions: 35 | Stars: 7,199 | Activity: 9.6 | Language: Python

A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

Airbyte and Singer/Meltano if you want to learn more about ingestion pipelines. Airbyte and Meltano teams are very welcoming. SQLfluff a shiny SQL linter. Beautiful project with awesome maintainers.

DataGristle

Mentions: 5 | Stars: 137 | Activity: 0.0 | Language: Python

Tough and flexible tools for data analysis, transformation, validation and movement.

DataGristle by u/kenfar who influenced many of us in this sub.

Metabase

Mentions: 67 | Stars: 36,417 | Activity: 10.0 | Language: Clojure

The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum:

If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.

superset

Mentions: 137 | Stars: 58,737 | Activity: 9.9 | Language: TypeScript

Apache Superset is a Data Visualization and Data Exploration Platform

If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.

streamlit

Mentions: 254 | Stars: 31,506 | Activity: 9.8 | Language: Python

Streamlit — A faster way to build and share data apps.

If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.

Skytrax-Data-Warehouse

Mentions: 1 | Stars: 131 | Activity: 0.0 | Language: Python

Discontinued. A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualizations such as analytical dashboards.

Always open to accept contributions to my project (Skytrax Data Warehouse). If you are into data stuff support my work at youtube as well (One Developer Pirate), I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective that is expected to release in next week.

Prefect

Mentions: 19 | Stars: 14,586 | Activity: 10.0 | Language: Python

The easiest way to build, run, and monitor data pipelines at scale.

Prefect! Specifically the Task Library: https://github.com/PrefectHQ/prefect

dagster

Mentions: 46 | Stars: 10,173 | Activity: 10.0 | Language: Python

An orchestration platform for the development, production, and observation of data assets.

It's a near crime that Dagster hasn't been mentioned already.

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering
Tags: Python, Data Science, Data Analysis, Data Visualization, data-engineering
Post date: 16 Apr 2021
