Ask HN: How do you test SQL?

This page summarizes the projects mentioned and recommended in the original post.

  • pg-mem

    An in memory postgres DB instance for your unit tests

  • I was wondering the other day how to classify tests that use a test double like pg-mem, which isn't a mock but isn't the Dockerized test DB either:

  • hash-db

    Experimental distributed pseudo-multi-model key-value database (it uses Python dictionaries) imitating DynamoDB querying, with join-only SQL support, distributed joins, simple Cypher graph support, and document storage

  • From an SQL database implementation perspective: in my toy barebones Python SQL database, which barely supports inner joins, I tested by running the same query with two joins against PostgreSQL and checking that it produced the same results.

    I ought to produce unit tests proving that the tuples from each join operation produce the correct dataset.

    From a user's perspective, I guess you could write some tooling that loads example data into a database and runs the join incrementally, with each part of the join statement added in turn.
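
A rough sketch of that incremental-join idea, using Python's sqlite3 standard library (the tables and expected counts are made-up examples, not from the post):

```python
import sqlite3

# Build a throwaway in-memory database with example data.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE items     (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO customers VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders    VALUES (10, 1), (11, 1);
    INSERT INTO items     VALUES (100, 10, 'x'), (101, 11, 'y');
""")

# Run the join incrementally: add one join clause at a time and
# check the row count at each step against what we expect.
base = "SELECT COUNT(*) FROM customers c"
steps = [
    ("", 2),                                        # no joins: all customers
    (" JOIN orders o ON o.customer_id = c.id", 2),  # bob has no orders
    (" JOIN orders o ON o.customer_id = c.id"
     " JOIN items i ON i.order_id = o.id", 2),      # one item per order
]
for clause, expected in steps:
    (count,) = db.execute(base + clause).fetchone()
    assert count == expected, (clause, count)
```

Each step's expected count is worked out by hand from the fixture data, which is the point: a surprising count at step N localizes the problem to the Nth join clause.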

  • datajudge

    Assessing whether data from a database complies with reference information.


    We've also written a blog post trying to illustrate a use case:

  • dbt-unit-testing

    This dbt package contains macros to support unit testing that can be (re)used across dbt projects.

  • We use this and take an example-based testing approach for any non-trivial tables:

  • testcontainers-dotnet

    A library to support tests with throwaway instances of Docker containers for all compatible .NET Standard versions.

  • .NET shop using SQL Server here, but I think something similar to what we do can apply to any stack. We use Testcontainers [1] to spin up a container with the SQL Server engine running on it, then use FluentMigrator [2] to provision tables and test data to run xUnit integration tests against. This has worked remarkably well.


  • graphjin

    GraphJin - Build Node.js / Go APIs in 5 minutes, not weeks

  • In GraphJin, an automatic GraphQL-to-SQL compiler, we use the gnomock library: it starts up a database instance (Docker), then creates the schema and test data, and finally our code connects to it and runs a series of tests. We run these across MySQL, Postgres, and a few other DBs; gnomock supports a wide range of them. Right now we don't take down the DB for every test, only between test runs, but it's fast enough that we could. This whole thing runs off a simple `go test -v .` command, and we run it on every commit using a git hook.

  • Flyway

    Flyway by Redgate • Database Migrations Made Easy.

  • sqlfluff

    A modular SQL linter and auto-formatter with support for multiple dialects and templated code.

  • This linter can really enforce some best practices

    A list of best practices:

  • sqlx

    🧰 The Rust SQL Toolkit. An async, pure Rust SQL crate featuring compile-time checked queries without a DSL. Supports PostgreSQL, MySQL, and SQLite. (by launchbadge)

  • Part of the solution could be using tooling that can check at compile time that your SQL is valid, like

  • data-diff

    Discontinued Compare tables within or across databases

  • I did data engineering for 6 years and am building a company to automate SQL validation for dbt users.

    First, by “testing SQL pipelines”, I assume you mean testing changes to SQL code as part of the development workflow? (vs. monitoring pipelines in production for failures / anomalies).

    If so:

    1 – assertions. dbt comes with a solid built-in testing framework [1] for expressing assertions such as “this column should have values in the list [A,B,C]”, as well as checking referential integrity, uniqueness, nulls, etc. There are more advanced packages built on top of dbt tests [2]. The problem with assertion testing in general, though, is that for a moderately complex data pipeline it’s infeasible to achieve test coverage for most possible failure scenarios.

    2 – data diff: for every change to SQL, know exactly how the code change affects the output data by comparing the data in dev/staging (built off the dev branch) with the data in production (built off the main branch). We built an open-source tool for that, and we are adding an integration with dbt soon which will make diffing as part of the dbt development workflow one command away [2]

    We make money by selling a Cloud solution for teams that integrates data diff into GitHub/GitLab CI and automatically diffs every pull request to tell you how a change to SQL affects the target table you changed, downstream tables, and dependent BI tools (video demo: [3])

    I’ve also written about why reliable change management is so important for data engineering and what the key best practices to implement are [4]
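
To make the two approaches concrete, here is a minimal sketch in plain Python with sqlite3 (the table and column names are invented; dbt and data-diff express the same ideas declaratively). A dbt-style test is a query for "bad rows" that must return nothing, and a data diff compares the same table built from two branches:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders_prod (id INTEGER, status TEXT);
    CREATE TABLE orders_dev  (id INTEGER, status TEXT);
    INSERT INTO orders_prod VALUES (1, 'paid'), (2, 'refunded'), (3, 'paid');
    INSERT INTO orders_dev  VALUES (1, 'paid'), (2, 'paid');  -- change under review
""")

# 1 - assertion testing, dbt-style: a test passes when the
# "bad rows" query comes back empty.
def assert_empty(sql):
    bad = db.execute(sql).fetchall()
    assert not bad, bad

assert_empty("SELECT * FROM orders_dev WHERE id IS NULL")                  # not_null
assert_empty("SELECT id FROM orders_dev GROUP BY id HAVING COUNT(*) > 1")  # unique
assert_empty("SELECT * FROM orders_dev"
             " WHERE status NOT IN ('paid', 'refunded')")                  # accepted_values

# 2 - data diff: rows that differ between the prod and dev builds.
removed = set(db.execute("SELECT * FROM orders_prod EXCEPT SELECT * FROM orders_dev"))
added   = set(db.execute("SELECT * FROM orders_dev  EXCEPT SELECT * FROM orders_prod"))
assert removed == {(2, 'refunded'), (3, 'paid')}
assert added == {(2, 'paid')}
```

The diff half illustrates the coverage argument above: no assertion flagged the change, yet the diff shows that row 2's status flipped and row 3 disappeared.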


  • fugue

    A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

  • pg_temp

    create a temporary, disposable, userland pg database

  • Here's a nifty Python module for testing against Postgres that a friend wrote. It creates a temporary database on the fly for testing and tears it down at the end. You can then populate the database and run all kinds of integration tests against it.
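
pg_temp itself targets PostgreSQL; a rough sketch of the same create-populate-test-teardown pattern, shown here with sqlite so the snippet stays self-contained:

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_db():
    """Create a throwaway on-disk database, hand a connection to the
    test, and delete the file afterwards (the pattern pg_temp applies
    to PostgreSQL, illustrated with sqlite)."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    try:
        yield conn
    finally:
        conn.close()
        os.remove(path)  # teardown: nothing survives the test

# Usage: populate the temporary database, then test against it.
with temp_db() as db:
    db.execute("CREATE TABLE t (x INTEGER)")
    db.execute("INSERT INTO t VALUES (42)")
    (x,) = db.execute("SELECT x FROM t").fetchone()
    assert x == 42
```

Because every test gets a fresh database and the teardown is in a `finally` block, tests stay isolated even when one of them fails midway.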

  • integresql

    IntegreSQL manages isolated PostgreSQL databases for your integration tests.

  • Happy to hear that! When it comes to testing services that depend on PostgreSQL, this is still my preferred solution.

    disclaimer: author

  • squire

  • On a related note (though it does touch upon testing), mitchellh open-sourced Squire [1] recently (today?). Note, though, that it's PostgreSQL-specific.


  • SS-Unit

    A 100% T-SQL based unit testing framework for SQL Server

  • Back in the mid-noughties I decided to see if I could write SQL in a test-first manner (i.e. TDD). This resulted in me writing a 100% T-SQL based unit testing framework for SQL Server [1] which we then used for the SQL back-end at an investment bank.

    On the back of that professional use I wrote a blog post [2] explaining why you might choose to go down this route, as it wasn't the way databases were developed back then (SQL wasn't developed in the same way as the other front-end and back-end code).

    A few years later I gave a short, videoed 20-minute talk [3] to show what writing SQL using TDD looked like for me. It's hard to show all the kinds of tests we wrote in practice at the bank, but the talk is intended to show how rapid the feedback loop can be using a standard DB query tool and two code windows: production code and tests.

    Be kind, it was a long time ago and I'm sure the state of the art has improved a lot in the intervening years :o).

    Chris Oldwood


    [1] SQL Server Unit:

    [2] You Write Your SQL Unit Tests in SQL?:

    [3] Test-Driven SQL:
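
The test-first loop described above can be sketched as follows (sqlite standing in for SQL Server, and the schema and view are invented for illustration): write the expectation first, then the SQL object under test, then assert.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Arrange: schema and fixture data the object under test will read.
db.executescript("""
    CREATE TABLE trades (id INTEGER, qty INTEGER, price REAL);
    INSERT INTO trades VALUES (1, 10, 2.5), (2, -4, 3.0);
""")

# Test first: the expected output, written before the view exists.
expected = [(1, 25.0), (2, -12.0)]

# "Production" code under test: a view computing each trade's value.
db.execute("CREATE VIEW trade_value AS"
           " SELECT id, qty * price AS value FROM trades")

# Assert: compare the view's output with the expectation.
actual = db.execute("SELECT id, value FROM trade_value ORDER BY id").fetchall()
assert actual == expected, actual
```

Running the assertion before creating the view fails (the view doesn't exist), then passes once the view is written, which is the red-green rhythm the talk demonstrates.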

  • spark-style-guide

    Spark style guide

  • Spark makes it easy to wrap SQL in functions that are easy to test. I am the author of the popular Scala Spark (spark-fast-tests) and PySpark (chispa) testing libraries. Some additional tips that can speed up Spark tests by 70-90%:

    * reuse the same Spark session throughout the test suite

    * Set shuffle partitions to 2 (instead of the default, which is 200)

    * Use dependency injection to avoid disk I/O in the test suite

    * Use fast DataFrame equality when possible. assertSmallDataFrameEquality is 4x faster than assertLargeDataFrameEquality.

    * Use column equality to test column functions. Don't compare DataFrames unless you're testing custom DataFrame transformations. See the spark-style-guide for definitions of these terms:

    Spark is an underrated tool for testing SQL. Spark makes it really easy to abstract SQL into unit testable chunks. Configuring your tests properly takes some knowledge, but you can make the tests run relatively quickly.

  • bytebase

    The GitHub/GitLab for database DevOps. World's most advanced database DevOps and CI/CD for Developer, DBA and Platform Engineering teams.

  • We built a GUI-based solution, Bytebase. You may also take a look.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts