soda-sql
pandera
Metric | soda-sql | pandera
---|---|---
Mentions | 25 | 7
Stars | 50 | 2,994
Growth | - | 4.8%
Activity | 8.2 | 8.9
Last commit | over 1 year ago | 6 days ago
Language | Python | Python
License | Apache License 2.0 | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
soda-sql
- Data Quality - Great Expectations for Data Engineers
  I might be a bit biased, but that was my opinion even before I started contributing to Soda SQL.
- dbt vs R/Python for transformation
- SodaCL - preview of a new "data reliability as code" language
  I'm one of the developers of the open source soda-sql data quality monitoring library. Over the past year we got some incredible feedback from our users, and based on that we started working on a new DSL for data reliability as code that we are calling SodaCL.
- How do you test your pipelines?
  You can also use soda-sql to do checks on your warehouses separately. Both Soda SQL and Soda Spark are OSS/Apache licensed.
- Being constantly shut down by more senior team members when I mention adding some QA in our work
  As many have said, there might be a business side of things to deliver. Somebody above promised delivery with tight deadlines. Trust me, I am not a fan, but this is how the world works, and it sucks. I would say, in your free time, explore tools like Great Expectations (https://greatexpectations.io/) or https://github.com/sodadata/soda-sql, which are modern ways of testing, as part of your learning curve.
- Soda
- How heavily do you use Great Expectations?
- What are some exciting new tools/libraries in 2021?
  soda-sql: a really cool library to automate data quality checks on SQL tables.
- How do I incorporate testing after the fact?
  Look at Soda SQL. It's more enterprise focused than Great Expectations, and you can pipe results to a database for downstream actions and analysis.
- Data Testing Tools, Pytest vs Great Expectations vs Soda vs Deequ
  Certainly! It’s not requested that much 😊 but please add an issue on GitHub. I would love to add at least experimental support.
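
Several of the comments above describe automating data quality checks on SQL tables and piping the results to a database for downstream use. The following is a minimal, library-free sketch of that idea using Python's built-in `sqlite3` module. It does not use soda-sql's own API or configuration format; the `orders` table, the check names, and the `run_checks` helper are all hypothetical and exist only for illustration.

```python
import sqlite3


def run_checks(conn, table, checks):
    """Run each named check; a check passes when its query returns 0 violations.

    Assumption: `table` is a trusted identifier (it is formatted into the SQL).
    """
    results = {}
    for name, sql in checks.items():
        (violations,) = conn.execute(sql.format(table=table)).fetchone()
        results[name] = violations == 0
    return results


# Hypothetical sample table with one NULL amount and one duplicated id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "US"), (2, 25.5, "DE"), (3, None, "US"), (3, 7.0, "FR")],
)

# Each check is a query that counts violating rows.
checks = {
    "amount_not_null": "SELECT COUNT(*) FROM {table} WHERE amount IS NULL",
    "id_unique": "SELECT COUNT(*) - COUNT(DISTINCT id) FROM {table}",
    "amount_non_negative": "SELECT COUNT(*) FROM {table} WHERE amount < 0",
}

results = run_checks(conn, "orders", checks)

# Store the outcomes in a table so downstream jobs can query them.
conn.execute("CREATE TABLE check_results (check_name TEXT, passed INTEGER)")
conn.executemany(
    "INSERT INTO check_results VALUES (?, ?)",
    [(name, int(passed)) for name, passed in results.items()],
)
failed = [
    row[0]
    for row in conn.execute(
        "SELECT check_name FROM check_results WHERE passed = 0 ORDER BY check_name"
    )
]
print("failed checks:", failed)
```

Expressing every check as "count the violating rows" keeps the runner trivial and makes it easy to add new checks as plain SQL strings.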
pandera
- Unit testing functions that input/output dataframes?
  I use Pandera, so I just need to define the expected input/output schemas (i.e. column names, types, and constraints on them). Pandera then automatically generates fake data for the unit tests and validates the result: https://github.com/unionai-oss/pandera
- Great Expectations is annoyingly cumbersome
  Please DM me! Or we can discuss in this issue, which I just created: https://github.com/unionai-oss/pandera/issues/1042
- Data validation for dashboards
  In my opinion, for simple data validation tasks, the best solution is always Pandera.
- Show HN: Pandera 0.8.0 – validate pandas, dask, modin, and koalas dataframes
  * adds support for mypy static type-linting if you need that extra type safety
  Repo: https://github.com/pandera-dev/pandera
- Pandera 0.8.0: Schema Validation for Pandas, Dask, Modin, and Koalas DataFrames. Oh, and also out-of-the-box Pydantic and Mypy support :)
  Repo: https://github.com/pandera-dev/pandera
- How heavily do you use Great Expectations?
  pandera
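
The first comment above describes declaring an expected schema (column names, types, and constraints) and validating a function's output against it. The sketch below illustrates that general pattern with the standard library only, using lists of dicts in place of dataframes. It is not Pandera's API: the `Column` class, the `validate` function, `OUTPUT_SCHEMA`, and `add_share` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Column:
    """Hypothetical column spec: an expected type plus an optional value check."""
    dtype: type
    check: Optional[Callable[[Any], bool]] = None


def validate(rows, schema):
    """Return a list of error strings; an empty list means the rows match."""
    errors = []
    for i, row in enumerate(rows):
        for name in sorted(schema.keys() - row.keys()):
            errors.append(f"row {i}: missing column '{name}'")
        for name, col in schema.items():
            if name not in row:
                continue
            value = row[name]
            if not isinstance(value, col.dtype):
                errors.append(f"row {i}: '{name}' is not {col.dtype.__name__}")
            elif col.check is not None and not col.check(value):
                errors.append(f"row {i}: '{name}' failed its check")
    return errors


# Expected shape of the function's output.
OUTPUT_SCHEMA = {
    "user_id": Column(int, lambda v: v > 0),
    "share": Column(float, lambda v: 0.0 <= v <= 1.0),
}


def add_share(rows):
    """Function under test: compute each user's share of total spend."""
    total = sum(r["spend"] for r in rows)
    return [{"user_id": r["user_id"], "share": r["spend"] / total} for r in rows]


good = add_share([{"user_id": 1, "spend": 30}, {"user_id": 2, "spend": 70}])
bad = [{"user_id": 0, "share": 1.5}, {"user_id": 3}]

good_errors = validate(good, OUTPUT_SCHEMA)
bad_errors = validate(bad, OUTPUT_SCHEMA)
print(good_errors)
print(bad_errors)
```

In a unit test, asserting that `validate(...)` returns an empty list replaces a pile of per-column assertions with a single schema declaration.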
What are some alternatives?
deequ - Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Schematics - Python Data Structures for Humans™.
sqlfluff - A modular SQL linter and auto-formatter with support for multiple dialects and templated code.
jsonschema - An implementation of the JSON Schema specification for Python
dbt-sessionization - Using DBT for Creating Session Abstractions on RudderStack - an open-source, warehouse-first customer data pipeline and Segment alternative.
pointblank - Data quality assessment and metadata reporting for data frames and database tables
re_data - re_data - fix data issues before your users & CEO would discover them 😊
swifter - A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner
trino_data_mesh - Proof of concept on how to gain insights with Trino across different databases from a distributed data mesh
dbt-expectations - Port(ish) of Great Expectations to dbt test macros
spark-fast-tests - Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
sweetviz - Visualize and compare datasets, target values and associations, with one line of code.