Open source contributions for a Data Engineer?

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  1. quinn

    pyspark methods to enhance developer productivity 📣 👯 🎉 (by mrpowers-io)

    I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

  3. chispa

    PySpark test helper methods with beautiful error messages

    I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

  4. spark-daria

    Essential Spark extensions and helper methods ✨😲

    I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

  5. spark-fast-tests

    Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

    I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

  6. soda-sql

    Discontinued. Data profiling, testing, and monitoring for SQL accessible data.

    If you are interested in using/learning Python, SQL and data warehouse skills, take a look at https://github.com/sodadata/soda-sql

  7. ballista

    Discontinued. Distributed compute platform implemented in Rust, and powered by Apache Arrow.

    His newer project, Ballista, was also donated to Apache Arrow. I hope to get the Rust skills to collaborate with him on open source work someday too. He's also doing really cool work on spark-rapids FYI.

  8. spark-rapids

    Spark RAPIDS plugin - accelerate Apache Spark with GPUs

    His newer project, Ballista, was also donated to Apache Arrow. I hope to get the Rust skills to collaborate with him on open source work someday too. He's also doing really cool work on spark-rapids FYI.

  10. airbyte

    The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

    Airbyte and Singer/Meltano if you want to learn more about ingestion pipelines. The Airbyte and Meltano teams are very welcoming. SQLFluff is a shiny SQL linter: a beautiful project with awesome maintainers.

  11. meltano

    Airbyte and Singer/Meltano if you want to learn more about ingestion pipelines. The Airbyte and Meltano teams are very welcoming. SQLFluff is a shiny SQL linter: a beautiful project with awesome maintainers.

  12. sqlfluff

    A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

    Airbyte and Singer/Meltano if you want to learn more about ingestion pipelines. The Airbyte and Meltano teams are very welcoming. SQLFluff is a shiny SQL linter: a beautiful project with awesome maintainers.

  13. DataGristle

    Tough and flexible tools for data analysis, transformation, validation and movement.

    DataGristle by u/kenfar who influenced many of us in this sub.

  14. Metabase

    The easy-to-use open source Business Intelligence and Embedded Analytics tool that lets everyone work with data 📊

    If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.

  15. superset

    Apache Superset is a Data Visualization and Data Exploration Platform

    If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.

  16. streamlit

    Streamlit — A faster way to build and share data apps.

    If you want to work more on the visualization side maybe Metabase, Superset and Streamlit.

  17. Skytrax-Data-Warehouse

    Discontinued. A full data warehouse infrastructure with ETL pipelines running inside Docker on Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase to serve data visualization needs such as analytical dashboards.

    I'm always open to contributions to my project (Skytrax Data Warehouse). If you are into data stuff, you can also support my work on YouTube (One Developer Pirate), where I mostly make data-oriented videos. These days I'm making a SQL course from a data analysis perspective, which is expected to release next week.

  18. Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Prefect! Specifically the Task Library: https://github.com/PrefectHQ/prefect

  19. dagster

    An orchestration platform for the development, production, and observation of data assets.

    It's a near crime that Dagster hasn't been mentioned already.

