soda-sql vs spark-fast-tests

soda-sql	Project	spark-fast-tests
25	Mentions	6
50	Stars	418
-	Growth	-
8.2	Activity	0.0
over 1 year ago	Latest Commit	12 days ago
Python	Language	Scala
Apache License 2.0	License	MIT License

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

soda-sql

Posts with mentions or reviews of soda-sql. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-03-18.

Data Quality - Great Expectations for Data Engineers
2 projects | /r/dataengineering | 18 Mar 2022

I might be a bit biased, but that was my opinion even before I started contributing to Soda SQL.
dbt vs R/Python for transformation
2 projects | /r/dataengineering | 25 Feb 2022
SodaCL - preview of a new "data reliability as code" language
1 project | /r/dataengineering | 13 Feb 2022

I'm one of the developers of the Open Source soda-sql data quality monitoring library, and over the past year we got some incredible feedback from our users, and based on that we started working on a new DSL for data reliability as code we are calling Soda CL.
How do you test your pipelines?
3 projects | /r/dataengineering | 23 Jan 2022

You can also use soda-sql to do checks on your warehouses separately. Both Soda SQL and Soda Spark are OSS/Apache licensed.
Being constantly shut down by more senior team members when I mention adding some QA in our work
1 project | /r/dataengineering | 10 Jan 2022

As many have said, there might be a business side of things to deliver. Somebody above promised delivery with tight deadlines. Trust me, I am not a fan, but this is how the world works and it sucks. I would say in your free time, explore tools like Great Expectations (https://greatexpectations.io/) or soda-sql (https://github.com/sodadata/soda-sql), which are modern ways of testing, as part of your learning curve.
Soda
1 project | /r/devopspro | 10 Dec 2021
How heavily do you use Great Expectations?
2 projects | /r/dataengineering | 23 Sep 2021
What are some exciting new tools/libraries in 2021?
2 projects | /r/datascience | 20 Jun 2021

soda-sql: a really cool library to automate data quality checks on SQL tables
How do I incorporate testing after the fact?
1 project | /r/dataengineering | 18 May 2021

Look at SodaSQL. It's more enterprise focused than Great Expectations and you can pipe results to a database for downstream actions and analysis.
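
To make the idea of running checks on a SQL table and piping the results to a database concrete, here is a minimal sketch using only Python's standard-library sqlite3 module. It does not use the soda-sql API; the `run_checks` helper, the `check_results` table, and the `orders` example table are all hypothetical names chosen for illustration.

```python
import sqlite3


def run_checks(conn, table, checks):
    """Run each named SQL check and store one pass/fail row per check."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS check_results "
        "(table_name TEXT, check_name TEXT, value REAL, passed INTEGER)"
    )
    results = []
    for name, (sql, predicate) in checks.items():
        # Each check is a query returning a single number plus a pass rule.
        value = conn.execute(sql.format(table=table)).fetchone()[0]
        passed = bool(predicate(value))
        conn.execute(
            "INSERT INTO check_results VALUES (?, ?, ?, ?)",
            (table, name, value, int(passed)),
        )
        results.append((name, value, passed))
    conn.commit()
    return results


# Hypothetical example data: one order has a missing amount.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, None), (3, 5.5)]
)

checks = {
    "row_count_positive": ("SELECT COUNT(*) FROM {table}", lambda v: v > 0),
    "no_missing_amount": (
        "SELECT COUNT(*) FROM {table} WHERE amount IS NULL",
        lambda v: v == 0,
    ),
}
results = run_checks(conn, "orders", checks)

# Downstream consumers can query the stored results like any other table.
failed = [
    row[0]
    for row in conn.execute(
        "SELECT check_name FROM check_results WHERE passed = 0"
    )
]
```

Because the outcomes land in an ordinary table, alerting or reporting jobs can select the failed checks without rerunning them.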
Data Testing Tools, Pytest vs Great Expectations vs Soda vs Deequ
2 projects | /r/dataengineering | 17 May 2021

Certainly! It’s not requested that much 😊 but please add an issue on GitHub. I would love to add at least experimental support.

spark-fast-tests

Posts with mentions or reviews of spark-fast-tests. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-04-13.

Lakehouse architecture in Azure Synapse without Databricks?
2 projects | /r/dataengineering | 13 Apr 2023

I was a Databricks user for 5 years and spent 95% of my time developing Spark code in IDEs. See the spark-daria and spark-fast-tests projects as Scala examples. I developed internal libraries with all the business logic. The Databricks notebooks would consist of a few lines of code that would invoke a function in the proprietary Spark codebase. The proprietary Spark codebase would depend on the OSS libraries I developed in parallel.
Well designed scala/spark project
4 projects | /r/scala | 15 Oct 2022

https://github.com/MrPowers/spark-fast-tests https://github.com/97arushisharma/Scala_Practice/tree/master/BigData_Analysis_with_Scala_and_Spark/wikipedia
Unit & integration testing in Databricks
3 projects | /r/dataengineering | 30 Apr 2022

If the majority of your stuff is not UDF-based, there is an open-source solution to run assertion tests against full data frames called spark-fast-tests. The idea here is similar in that you have a test notebook that calls your actual notebook against a staged input, reads the output, and compares it to a prefabricated expected output. This does take a bit of setup and trial and error, but it’s the closest I’ve been able to get to proper automated regression testing in Databricks.
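
The workflow described above (run the transformation on a staged input, then compare the output with a prepared expected output) can be sketched in plain Python. This is only an illustration of the pattern, not the spark-fast-tests API: rows are modelled as dictionaries instead of Spark data frames, and `assert_rows_equal`, `normalize`, and `add_total` are hypothetical names.

```python
def normalize(rows, ignore_order):
    """Turn each row dict into a tuple of sorted (column, value) pairs."""
    canon = [tuple(sorted(row.items())) for row in rows]
    # Sort rows by their repr so row order does not affect the comparison.
    return sorted(canon, key=repr) if ignore_order else canon


def assert_rows_equal(actual, expected, ignore_order=True):
    """Raise AssertionError listing the differing rows if the sets differ."""
    a = normalize(actual, ignore_order)
    e = normalize(expected, ignore_order)
    if a != e:
        missing = [row for row in e if row not in a]
        unexpected = [row for row in a if row not in e]
        raise AssertionError(
            f"rows differ; missing: {missing}; unexpected: {unexpected}"
        )


def add_total(rows):
    """Hypothetical transformation under test: total = price * qty."""
    return [{**row, "total": row["price"] * row["qty"]} for row in rows]


# Staged input and the prepared expected output (illustrative data).
staged_input = [{"price": 2.0, "qty": 3}, {"price": 1.5, "qty": 2}]
expected = [
    {"price": 1.5, "qty": 2, "total": 3.0},
    {"price": 2.0, "qty": 3, "total": 6.0},
]

# Passes: same rows, different order, and order is ignored by default.
assert_rows_equal(add_total(staged_input), expected)
```

Ignoring row order by default is a design choice for this sketch, since distributed engines generally do not guarantee output order unless the job sorts explicitly.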
Show dataengineering: beavis, a library for unit testing Pandas/Dask code
3 projects | /r/dataengineering | 9 Aug 2021

I am the author of spark-fast-tests and chispa, libraries for unit testing Scala Spark / PySpark code.
Ask HN: What are some tools / libraries you built yourself?
264 projects | news.ycombinator.com | 16 May 2021

I built daria (https://github.com/MrPowers/spark-daria) to make it easier to write Spark and spark-fast-tests (https://github.com/MrPowers/spark-fast-tests) to provide a good testing workflow.
quinn (https://github.com/MrPowers/quinn) and chispa (https://github.com/MrPowers/chispa) are the PySpark equivalents.
Built bebe (https://github.com/MrPowers/bebe) to expose the Spark Catalyst expressions that aren't exposed to the Scala / Python APIs.
Also built spark-sbt.g8 to create a Spark project with a single command: https://github.com/MrPowers/spark-sbt.g8
Open source contributions for a Data Engineer?
17 projects | /r/dataengineering | 16 Apr 2021

I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

What are some alternatives?

When comparing soda-sql and spark-fast-tests you can also consider the following projects:

deequ - Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Prefect - The easiest way to build, run, and monitor data pipelines at scale.

pandera - A light-weight, flexible, and expressive statistical data testing library

chispa - PySpark test helper methods with beautiful error messages

sqlfluff - A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

airbyte - The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

dbt-sessionization - Using DBT for Creating Session Abstractions on RudderStack - an open-source, warehouse-first customer data pipeline and Segment alternative.

re_data - Fix data issues before your users & CEO would discover them 😊

spark-daria - Essential Spark extensions and helper methods ✨😲

trino_data_mesh - Proof of concept on how to gain insights with Trino across different databases from a distributed data mesh

dagster - An orchestration platform for the development, production, and observation of data assets.

