spark-fast-tests
soda-sql (DISCONTINUED)
| | spark-fast-tests | soda-sql |
|---|---|---|
| Mentions | 5 | 25 |
| Stars | 377 | 50 |
| Growth | - | - |
| Activity | 4.1 | 8.2 |
| Latest commit | 9 months ago | 3 months ago |
| Language | Scala | Python |
| License | MIT License | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
spark-fast-tests
-
Well-designed Scala/Spark project
https://github.com/MrPowers/spark-fast-tests https://github.com/97arushisharma/Scala_Practice/tree/master/BigData_Analysis_with_Scala_and_Spark/wikipedia
-
Unit & integration testing in Databricks
If the majority of your stuff is not UDF-based, there is an open-source solution to run assertion tests against full data frames called spark-fast-tests. The idea here is similar in that you have a test notebook that calls your actual notebook against a staged input, reads the output, and compares it to a prefabricated expected output. This does take a bit of setup and trial and error, but it's the closest I've been able to get to proper automated regression testing in Databricks.
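spark-fast-tests itself is Scala-only; the same compare-actual-output-to-a-prefabricated-expected-output workflow is available in PySpark through chispa, its Python counterpart mentioned below. A minimal sketch, assuming a local SparkSession and a hypothetical `add_greeting` transformation standing in for the notebook logic under test:

```python
# Minimal sketch of the "run the transformation on a staged input and compare it to a
# prefabricated expected output" workflow, using chispa (the PySpark counterpart of
# spark-fast-tests). `add_greeting` is a hypothetical stand-in for the notebook logic.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from chispa import assert_df_equality

spark = SparkSession.builder.master("local[*]").appName("regression-test").getOrCreate()

def add_greeting(df):
    # the transformation under test
    return df.withColumn("greeting", F.concat(F.lit("hello, "), F.col("name")))

# staged input
input_df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# prefabricated expected output
expected_df = spark.createDataFrame(
    [("alice", "hello, alice"), ("bob", "hello, bob")],
    ["name", "greeting"],
)

# raises with a readable row-by-row diff if actual and expected differ
assert_df_equality(add_greeting(input_df), expected_df)
```

The readable diff on mismatch is what makes this pattern practical for regression testing whole DataFrames rather than individual UDFs.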
-
Show dataengineering: beavis, a library for unit testing Pandas/Dask code
I am the author of spark-fast-tests and chispa, libraries for unit testing Scala Spark / PySpark code.
-
Ask HN: What are some tools / libraries you built yourself?
I built daria (https://github.com/MrPowers/spark-daria) to make it easier to write Spark code and spark-fast-tests (https://github.com/MrPowers/spark-fast-tests) to provide a good testing workflow.
quinn (https://github.com/MrPowers/quinn) and chispa (https://github.com/MrPowers/chispa) are the PySpark equivalents.
Built bebe (https://github.com/MrPowers/bebe) to expose the Spark Catalyst expressions that aren't exposed to the Scala / Python APIs.
Also built spark-sbt.g8 to create a Spark project with a single command: https://github.com/MrPowers/spark-sbt.g8
-
Open source contributions for a Data Engineer?
I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.
soda-sql
-
Data Quality - Great Expectations for Data Engineers
I might be a bit biased, but that was my opinion even before I started contributing to Soda SQL.
You can always give Soda a try, more info on soda.io and https://github.com/sodadata/soda-sql. We've put a lot of focus on making it lightweight and easy to use. Disclaimer: I'm one of the founders :).
-
dbt vs R/Python for transformation
Testing and production monitoring of data is still underrated in many teams. In building and operating software systems, this has become the norm; in data, there is still a lot of room for improvement. The mentioned tools are insufficient for a thorough testing and monitoring setup. That is why we created Soda, with Soda SQL as our open source tool for testing data in and out of the pipeline: https://github.com/sodadata/soda-sql
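To make "testing data in and out of the pipeline" concrete: Soda SQL reads per-table scan YAML files and evaluates them against a configured warehouse via the `soda scan` CLI. A minimal sketch, assuming soda-sql is installed, a `warehouse.yml` already points at the target database, and a hypothetical `orders` table; it also assumes a failing test surfaces as a non-zero exit code, which the pipeline step turns into a hard stop:

```python
# Minimal sketch of a Soda SQL check run as a pipeline step.
# Assumes soda-sql is installed and warehouse.yml already points at the warehouse;
# the `orders` table and its checks are hypothetical.
import subprocess
from pathlib import Path

scan_yml = """\
table_name: orders
metrics:
  - row_count
  - missing_count
  - missing_percentage
tests:
  - row_count > 0
columns:
  order_id:
    tests:
      - missing_count == 0
"""

Path("tables").mkdir(exist_ok=True)
Path("tables/orders.yml").write_text(scan_yml)

# `soda scan` evaluates the metrics and tests against the warehouse; treating a
# non-zero exit code as failed tests lets the pipeline stop before bad data moves on.
result = subprocess.run(["soda", "scan", "warehouse.yml", "tables/orders.yml"])
if result.returncode != 0:
    raise SystemExit("Soda SQL tests failed; stopping the pipeline")
```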
-
How do you test your pipelines?
You can also use soda-sql to do checks on your warehouses separately. Both Soda SQL and Soda Spark are OSS/Apache licensed.
-
How heavily do you use Great Expectations?
-
What are some exciting new tools/libraries in 2021?
soda-sql is a really cool library to automate data quality checks on SQL tables
-
Data Testing Tools, Pytest vs Great Expectations vs Soda vs Deequ
Certainly! It's not requested that much 😊 but please add an issue on GitHub. I would love to add at least experimental support.
-
Open source contributions for a Data Engineer?
If you are interested in using/learning Python, SQL and data warehouse skills, take a look at https://github.com/sodadata/soda-sql
-
Anyone aware of any Data Validation Framework with custom SQL capability
Soda-sql looks promising. It has some out-of-the-box tests and you can also provide custom SQL: https://github.com/sodadata/soda-sql
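For the custom SQL part, the scan YAML also accepts `sql_metrics` blocks whose query results can be tested alongside the built-in metrics. A minimal sketch with a hypothetical table, query, and threshold, written the same way as the scan file above:

```python
# Sketch of a Soda SQL scan file that combines a built-in test with a
# custom SQL metric (sql_metrics). Table, query, and threshold are hypothetical.
from pathlib import Path

custom_sql_scan_yml = """\
table_name: customer_transactions
metrics:
  - row_count
tests:
  - row_count > 0
sql_metrics:
  - sql: |
      SELECT sum(amount) as total_amount
      FROM customer_transactions
    tests:
      - total_amount > 5000
"""

Path("tables").mkdir(exist_ok=True)
Path("tables/customer_transactions.yml").write_text(custom_sql_scan_yml)
# Then run: soda scan warehouse.yml tables/customer_transactions.yml
```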
What are some alternatives?
deequ - Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Prefect - The easiest way to build, run, and monitor data pipelines at scale.
pandera - A light-weight, flexible, and expressive statistical data testing library
sqlfluff - A modular SQL linter and auto-formatter with support for multiple dialects and templated code.
chispa - PySpark test helper methods with beautiful error messages
trino_data_mesh - Proof of concept on how to gain insights with Trino across different databases from a distributed data mesh
dbt-sessionization - Using DBT for Creating Session Abstractions on RudderStack - an open-source, warehouse-first customer data pipeline and Segment alternative.
re_data - fix data issues before your users & CEO would discover them 😊
dagster - An orchestration platform for the development, production, and observation of data assets.
airflow-notebook - This repository is no longer maintained.
spark-daria - Essential Spark extensions and helper methods ✨😲