quinn
soda-sql
DISCONTINUED
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
quinn
-
Invitation to collaborate on open source PySpark projects
quinn is a library with PySpark helper functions. I need to work through all the open issues / PRs and bump all versions. I should do another release. This library gets around 600,000 monthly downloads.
-
PySpark now provides a native Pandas API
Pandas syntax is far inferior to regular PySpark in my opinion. Goes to show how much data analysts value a syntax that they're already familiar with. Pandas syntax makes it harder to reason about queries, abstract DataFrame transformations, etc. I've authored some popular PySpark libraries like quinn and chispa and am not excited to add Pandas syntax support, haha.
-
Is Spark - The Definitive Guide outdated?
They spent a lot of effort improving the catalyst engine under the hood too and making it easier to extend and improve it in the future. Making it easy to add your own native code to Spark itself. Shameless plug of a blog post I wrote on this subject which basically reiterates what Matthew Powers, author of Spark Daria and quinn, wrote here.
-
Ask HN: What are some tools / libraries you built yourself?
I built daria (https://github.com/MrPowers/spark-daria) to make it easier to write Spark and spark-fast-tests (https://github.com/MrPowers/spark-fast-tests) to provide a good testing workflow.
quinn (https://github.com/MrPowers/quinn) and chispa (https://github.com/MrPowers/chispa) are the PySpark equivalents.
Built bebe (https://github.com/MrPowers/bebe) to expose the Spark Catalyst expressions that aren't exposed to the Scala / Python APIs.
Also built spark-sbt.g8 to create a Spark project with a single command: https://github.com/MrPowers/spark-sbt.g8
-
Open source contributions for a Data Engineer?
I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.
soda-sql
-
Data Quality - Great Expectations for Data Engineers
I might be a bit biased, but that was my opinion even before I started contributing to Soda SQL.
You can always give Soda a try, more info on soda.io and https://github.com/sodadata/soda-sql. We've put a lot of focus on making it lightweight and easy to use. Disclaimer: I'm one of the founders :).
-
dbt vs R/Python for transformation
Testing and production monitoring of data is still underrated in many teams. In building and operating software systems this has become the norm. In data, there is still a lot of room for improvement. The mentioned tools are insufficient for a thorough testing and monitoring setup. That is why we created Soda, with Soda SQL as our open source tool for testing data in and out of the pipeline: https://github.com/sodadata/soda-sql
-
How do you test your pipelines?
You can also use soda-sql to do checks on your warehouses separately. Both Soda SQL and Soda Spark are OSS/Apache licensed.
- How heavily do you use Great Expectations?
-
What are some exciting new tools/libraries in 2021?
soda-sql: a really cool library to automate data quality checks on SQL tables
-
Data Testing Tools, Pytest vs Great Expectations vs Soda vs Deequ
Certainly! It's not requested that much, but please add an issue on GitHub. I would love to add at least experimental support.
-
Open source contributions for a Data Engineer?
If you are interested in using/learning Python, SQL and data warehouse skills, take a look at https://github.com/sodadata/soda-sql
-
Anyone aware of any Data Validation Framework with custom SQL capability
Soda-sql looks promising. It has some out of the box tests and you can also provide custom SQL: https://github.com/sodadata/soda-sql
What are some alternatives?
deequ - Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
pandera - A light-weight, flexible, and expressive statistical data testing library
sqlfluff - A modular SQL linter and auto-formatter with support for multiple dialects and templated code.
trino_data_mesh - Proof of concept on how to gain insights with Trino across different databases from a distributed data mesh
dbt-sessionization - Using DBT for Creating Session Abstractions on RudderStack - an open-source, warehouse-first customer data pipeline and Segment alternative.
re_data - fix data issues before your users & CEO would discover them
spark-fast-tests - Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
Prefect - The easiest way to build, run, and monitor data pipelines at scale.
dagster - An orchestration platform for the development, production, and observation of data assets.
spark-daria - Essential Spark extensions and helper methods
airflow-notebook - This repository is no longer maintained.
chispa - PySpark test helper methods with beautiful error messages