Open source contributions for a Data Engineer?

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  • quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉 (by MrPowers)

  • I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

  • chispa

    PySpark test helper methods with beautiful error messages

  • I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

  • spark-daria

    Essential Spark extensions and helper methods ✨😲

  • I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

  • spark-fast-tests

    Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

  • I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

  • soda-sql

    Discontinued. Data profiling, testing, and monitoring for SQL accessible data.

  • If you are interested in using/learning Python, SQL and data warehouse skills, take a look at https://github.com/sodadata/soda-sql

  • ballista

    Discontinued. Distributed compute platform implemented in Rust, and powered by Apache Arrow.

  • His newer project, Ballista, was also donated to Apache Arrow. I hope to get the Rust skills to collaborate with him on open source work someday too. He's also doing really cool work on spark-rapids FYI.

  • spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

  • His newer project, Ballista, was also donated to Apache Arrow. I hope to get the Rust skills to collaborate with him on open source work someday too. He's also doing really cool work on spark-rapids FYI.

  • airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

  • Airbyte and Singer/Meltano if you want to learn more about ingestion pipelines. The Airbyte and Meltano teams are very welcoming. SQLFluff is a shiny SQL linter, a beautiful project with awesome maintainers.

  • meltano

  • Airbyte and Singer/Meltano if you want to learn more about ingestion pipelines. The Airbyte and Meltano teams are very welcoming. SQLFluff is a shiny SQL linter, a beautiful project with awesome maintainers.

  • sqlfluff

    A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

  • Airbyte and Singer/Meltano if you want to learn more about ingestion pipelines. The Airbyte and Meltano teams are very welcoming. SQLFluff is a shiny SQL linter, a beautiful project with awesome maintainers.

  • DataGristle

    Tough and flexible tools for data analysis, transformation, validation and movement.

  • DataGristle by u/kenfar, who influenced many of us in this sub.

  • Metabase

    The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum:

  • If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.

  • superset

    Apache Superset is a Data Visualization and Data Exploration Platform

  • If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.

  • streamlit

    Streamlit — A faster way to build and share data apps.

  • If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.

  • Skytrax-Data-Warehouse

    Discontinued. A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for cloud data warehouse and Metabase to serve the needs of data visualizations such as analytical dashboards.

  • Always open to accepting contributions to my project (Skytrax Data Warehouse). If you are into data stuff, support my work on YouTube as well (One Developer Pirate); I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective that is expected to be released next week.

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

  • Prefect! Specifically the Task Library: https://github.com/PrefectHQ/prefect

  • dagster

    An orchestration platform for the development, production, and observation of data assets.

  • It's a near crime that Dagster hasn't been mentioned already.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
