As mentioned in other comments, dbt tests are one way to go about it, usually hooked up to Airflow or some other scheduler. There’s also an open source package being actively built out for monitoring data quality and validating some of the parameters you described - https://github.com/monosidev/monosi
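For reference, dbt generic tests are declared in a model's `schema.yml`; something like the sketch below (the `orders` model and its columns are hypothetical placeholders), which dbt then compiles into SQL assertions on each run:

```yaml
# models/schema.yml -- model and column names are placeholders
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```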
-
soda-spark
Soda Spark is a PySpark library that helps you test the data in your Spark DataFrames.
Since you already have Spark set up, it might be easier to build DataFrames by loading data from the different tables and validate them in one go. You can give soda-spark a try (disclosure: I'm one of the developers): you declare your checks in YAML and run the validations as Spark jobs.
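A scan definition for soda-spark could look roughly like this sketch (table and column names are hypothetical; the exact set of supported metrics is in the Soda docs), which you would then execute against a DataFrame from a Spark job:

```yaml
# Hypothetical soda-spark scan definition for a DataFrame
# holding an "orders" table.
table_name: orders
metrics:
  - row_count
  - missing_count
  - missing_percentage
tests:
  - row_count > 0
columns:
  order_id:
    tests:
      - missing_percentage == 0
  amount:
    valid_min: 0
    tests:
      - invalid_count == 0
```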
-
You can also use soda-sql to run checks against your warehouses separately. Both Soda SQL and Soda Spark are open source (Apache licensed).
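With soda-sql, the warehouse connection lives in its own YAML file; a sketch for a Postgres warehouse might look like this (names and credentials are placeholders):

```yaml
# warehouse.yml -- hypothetical connection settings for a Postgres warehouse
name: analytics_warehouse
connection:
  type: postgres
  host: localhost
  port: '5432'
  username: env_var(POSTGRES_USERNAME)
  password: env_var(POSTGRES_PASSWORD)
  database: analytics
  schema: public
```

You then point the CLI at this file plus a per-table scan YAML (e.g. `soda scan warehouse.yml tables/orders.yml`) from your scheduler of choice.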