How do you test your pipelines?

This page summarizes the projects mentioned and recommended in the original post on /r/dataengineering.

  • monosi

    Open source data observability platform

    As mentioned in other comments, dbt tests are one way to go about it, usually hooked up to Airflow or another scheduler. There’s also an open-source package under active development for monitoring data quality and validating some of the parameters you described: https://github.com/monosidev/monosi

  • soda-spark

    Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

    Since you already have Spark set up, perhaps it would be easier to build DataFrames by loading data from the different tables and validate them in one go? You can give soda-spark a try (disclosure: I'm one of the developers). It lets you specify your checks declaratively in YAML and run the validations in Spark jobs.

  • soda-sql

    Discontinued. Data profiling, testing, and monitoring for SQL-accessible data.

    You can also use soda-sql to do checks on your warehouses separately. Both Soda SQL and Soda Spark are OSS/Apache licensed.
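
The declarative style described in the soda-spark comment above (checks written as data, then evaluated against loaded rows in one pass) can be illustrated with a small library-free sketch. This is not the soda-spark or soda-sql API: the check types, the `CHECKS` structure, and the `run_checks` and `failed_checks` helpers are all hypothetical names invented for illustration, and a plain Python list of dicts stands in for a Spark DataFrame and a YAML file.

```python
# Hypothetical, library-free sketch of declarative data checks.
# NOT the soda-spark or soda-sql API; every name here is illustrative.

# Declarative check definitions (a stand-in for a YAML file).
CHECKS = [
    {"name": "row_count > 0", "type": "min_rows", "min": 1},
    {"name": "id is never null", "type": "not_null", "column": "id"},
    {"name": "amount >= 0", "type": "min_value", "column": "amount", "min": 0},
]


def run_checks(rows, checks):
    """Evaluate each check against a list of dict rows; return {name: passed}."""
    results = {}
    for check in checks:
        kind = check["type"]
        if kind == "min_rows":
            passed = len(rows) >= check["min"]
        elif kind == "not_null":
            passed = all(row.get(check["column"]) is not None for row in rows)
        elif kind == "min_value":
            # A null value also fails a minimum-value check.
            passed = all(
                row.get(check["column"]) is not None
                and row[check["column"]] >= check["min"]
                for row in rows
            )
        else:
            raise ValueError(f"unknown check type: {kind}")
        results[check["name"]] = passed
    return results


def failed_checks(rows, checks):
    """Return the names of checks that did not pass, in definition order."""
    return [name for name, ok in run_checks(rows, checks).items() if not ok]
```

In a scheduled pipeline, a task could call `failed_checks` on each loaded table and fail the run when the returned list is non-empty, which mirrors how a failing test would typically stop downstream steps.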
