🧱 Databricks CLI eXtensions (aka dbx) is a CLI tool for development and advanced Databricks workflow management.
Hey, Databricks person here. Check out DBX for a template on how to do unit and integration tests: https://github.com/databrickslabs/dbx
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
If the majority of your stuff is not UDF-based, there is an open-source solution for running assertion tests against full data frames called spark-fast-tests. The idea here is similar: you have an integration-test notebook that calls your actual notebook against a staged input, reads the output, and compares it to a prefabricated expected output. This does take a bit of setup and trial and error, but it's the closest I've been able to get to proper automated regression testing in Databricks.
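The compare step described above can be sketched in plain Python. This is only an illustration, not the spark-fast-tests API: the helper names `diff_rows` and `assert_rows_equal` are hypothetical, and the rows are assumed to be plain dicts with hashable values, such as you might get in a notebook from `[r.asDict() for r in df.collect()]` on a small output DataFrame.

```python
from collections import Counter


def _freeze(row):
    """Turn a row dict into a hashable key that ignores column order."""
    return tuple(sorted(row.items()))


def diff_rows(actual, expected):
    """Return (missing, unexpected) rows, ignoring row order but counting duplicates."""
    actual_counts = Counter(_freeze(r) for r in actual)
    expected_counts = Counter(_freeze(r) for r in expected)
    # Counter subtraction keeps only the rows with a positive surplus.
    missing = [dict(k) for k in (expected_counts - actual_counts).elements()]
    unexpected = [dict(k) for k in (actual_counts - expected_counts).elements()]
    return missing, unexpected


def assert_rows_equal(actual, expected):
    """Fail with a readable message when the output differs from the expected rows."""
    missing, unexpected = diff_rows(actual, expected)
    if missing or unexpected:
        raise AssertionError(
            f"missing rows: {missing}; unexpected rows: {unexpected}"
        )


# Hypothetical staged data: what the notebook produced vs. what was expected.
actual_rows = [{"id": 2, "total": 5.0}, {"id": 1, "total": 3.0}]
expected_rows = [{"id": 1, "total": 3.0}, {"id": 2, "total": 5.0}]
assert_rows_equal(actual_rows, expected_rows)  # passes: same rows, different order
```

Collecting rows to the driver only makes sense for small staged datasets; for anything larger, a DataFrame-level comparison such as the one spark-fast-tests provides is likely the better fit.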
Demo of using Nutter to test Databricks notebooks in a CI/CD pipeline [Moved to: https://github.com/alexott/databricks-nutter-repos-demo]
You can also use the approach described here https://github.com/alexott/databricks-nutter-projects-demo
how/where do you define your databricks jobs, tasks and workflows?
1 project | reddit.com/r/dataengineering | 1 Nov 2022
My top 5 learnings from driving an OSS project
1 project | dev.to | 3 Apr 2022
Anyone use Pyspark notebook in production ?
1 project | reddit.com/r/dataengineering | 19 Dec 2021
What are the worst data engineering practices you have seen?
1 project | reddit.com/r/dataengineering | 14 Dec 2022
Snowpark equivalent on Databricks?
2 projects | reddit.com/r/dataengineering | 7 Nov 2022