Data diffs: Algorithms for explaining what changed in a dataset (2022)

macrobase-diff

1 6 10.0 Python

Minimal implementation of Macrobase Diff

Some years ago, when I was digging into DIFF and MacroBase (the one from the Bailis lab), I made a simple reproduction of the DIFF algorithm: https://github.com/PiotrZakrzewski/macrobase-diff
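
To give a feel for what a DIFF-style explanation computes, here is a minimal sketch in plain Python. It assumes one common formulation of support and risk ratio and only scores single attribute values. The function name `explain_diff`, the thresholds, and the sample rows are hypothetical and are not taken from the macrobase-diff repository.

```python
from collections import Counter
from math import inf


def explain_diff(test_rows, control_rows, min_support=0.2, min_risk_ratio=2.0):
    """Score single attribute values by how strongly they separate test from control.

    Rows are dicts of attribute -> value. Returns (attribute, value, support,
    risk_ratio) tuples that pass both thresholds, highest risk ratio first.
    """
    test_counts = Counter((k, v) for row in test_rows for k, v in row.items())
    control_counts = Counter((k, v) for row in control_rows for k, v in row.items())
    n_test, n_control = len(test_rows), len(control_rows)

    results = []
    for (attr, value), a_test in test_counts.items():
        a_control = control_counts.get((attr, value), 0)
        # Support: share of test rows that carry this attribute value.
        support = a_test / n_test
        # Rows that do NOT carry this attribute value.
        b_test = n_test - a_test
        b_control = n_control - a_control
        # Share of rows in the test set, with and without the value.
        exposed_rate = a_test / (a_test + a_control)
        unexposed_total = b_test + b_control
        unexposed_rate = b_test / unexposed_total if unexposed_total else 0.0
        risk_ratio = inf if unexposed_rate == 0 else exposed_rate / unexposed_rate
        if support >= min_support and risk_ratio >= min_risk_ratio:
            results.append((attr, value, support, risk_ratio))
    return sorted(results, key=lambda r: -r[3])


# Hypothetical data: "test" could be failed requests, "control" successful ones.
failed = [
    {"os": "android", "region": "eu"},
    {"os": "android", "region": "us"},
    {"os": "android", "region": "eu"},
    {"os": "ios", "region": "us"},
]
ok = [
    {"os": "android", "region": "eu"},
    {"os": "ios", "region": "us"},
    {"os": "ios", "region": "eu"},
    {"os": "ios", "region": "us"},
]
explanations = explain_diff(failed, ok)
```

In this toy data only `os = android` passes both thresholds. The real DIFF operator also explores combinations of attributes, which this sketch leaves out.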

ExplainDaV

1 10 10.0 Python

Of interest might be Explain-Da-V[0] which will be presented at VLDB this year.
[0] https://github.com/shraga89/ExplainDaV

spark-extension

1 168 8.3 Scala

A library that provides useful extensions to Apache Spark and PySpark.

We're doing an environment migration and I've been using the Spark diff extension to reconcile data. It's amazing: we've discovered bugs in the data logic so quickly.
Here is the extension if anyone is interested: https://github.com/G-Research/spark-extension/blob/master/DI...
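
The reconciliation idea in the comment above (compare two versions of a table row by row on a key) can be sketched in plain Python. This is an illustration only and does not use Spark or the spark-extension API; `diff_rows`, the status labels, and the sample data are hypothetical.

```python
def diff_rows(left, right, key):
    """Compare two lists of row dicts on a key column.

    Returns a list of (key_value, status, changed_columns) sorted by key, where
    status is "unchanged", "changed", "deleted" (only in left) or "inserted"
    (only in right).
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    result = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        l_row, r_row = left_by_key.get(k), right_by_key.get(k)
        if r_row is None:
            result.append((k, "deleted", []))
        elif l_row is None:
            result.append((k, "inserted", []))
        else:
            # Columns whose values differ between the two versions of the row.
            changed = sorted(
                c for c in l_row.keys() | r_row.keys() if l_row.get(c) != r_row.get(c)
            )
            result.append((k, "changed" if changed else "unchanged", changed))
    return result


# Hypothetical tables from the old and the new environment.
old_env = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
new_env = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 4, "amount": 40}]
report = diff_rows(old_env, new_env, key="id")
```

Filtering the report for anything other than "unchanged" gives the rows worth investigating after a migration.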

Apache Calcite

28 4,363 9.0 Java

Apache Calcite

> Make diff work on more than just SQLite.
Another way of doing this, which I've been wanting to try for a while, is to implement the DIFF operator in Apache Calcite[0]. With Calcite, DIFF could be implemented as rewrite rules that generate the appropriate SQL to execute directly against the database, or the DIFF operator could be implemented outside the database (which the original paper shows is more efficient).
[0] https://calcite.apache.org/

recidiffist

1 15 10.0 Clojure

Diffs for structured data

At Latacora, we use a giant pile of Clojure (almost everything is, with specific measured exceptions). As a side effect, we have a lot of data. Not necessarily a lot in the sense of "big S3 bill", but definitely a lot in the sense of "you might not expect this to be in a machine-readable format".
Things like: what Lambdas existed in a customer AWS account 6 months ago in us-east-2 that had access to a specific SQS queue (because we learned later that one of the consumers of that queue would actually consume Python pickles if you asked nicely, and hence get you RCE).
As a side effect, we do a lot of data diffing: just mostly on more vanilla Clojure structures rather than data sets in the Datasette/CSV/... sense.
For example: https://github.com/latacora/recidiffist (which we also have wired up to Terraform + S3, so if you write some files to S3, you can get the structured diffs right next to it for free). It's one of those things that's incredibly simple and works ridiculously well. Well, if you do it consistently anyway.
Also https://github.com/lambdaisland/deep-diff2 for when we're more interested in presenting it to humans.
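
For readers who don't use Clojure, the kind of structured diff described above can be sketched as a recursive walk over nested dictionaries. This is a rough Python illustration, not the output format of recidiffist or deep-diff2; `deep_diff` and the sample snapshots are hypothetical.

```python
def deep_diff(old, new, path=()):
    """Recursively diff two nested dict structures.

    Returns a list of (path, kind, old_value, new_value) tuples, where kind is
    "added", "removed" or "changed". Non-dict values (including lists) are
    compared as a whole.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        changes = []
        for k in sorted(old.keys() | new.keys(), key=str):
            if k not in new:
                changes.append((path + (k,), "removed", old[k], None))
            elif k not in old:
                changes.append((path + (k,), "added", None, new[k]))
            else:
                changes.extend(deep_diff(old[k], new[k], path + (k,)))
        return changes
    if old != new:
        return [(path, "changed", old, new)]
    return []


# Hypothetical snapshots of a cloud resource at two points in time.
before = {
    "lambda": {"name": "consumer", "role": {"sqs": ["read"]}},
    "region": "us-east-2",
}
after = {
    "lambda": {"name": "consumer", "role": {"sqs": ["read", "write"]}, "timeout": 30},
    "region": "us-east-2",
}
changes = deep_diff(before, after)
```

Keeping the full path to each change is what makes questions like "when did this role gain access to that queue" answerable later.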

deep-diff2

1 289 6.0 Clojure

Deep diff Clojure data structures and pretty print the result

lakeFS

48 4,066 9.8 Go

lakeFS - Data version control for your data lake | Git for data

Might want to check out lakeFS: https://github.com/treeverse/lakeFS
(full disclosure: I'm one of the creators)

handy_sql_queries

2 1 4.5

If you are looking for an easy way to compare two tables in SQL, whether every single row and every single column are the same, you can use the following technique:
https://github.com/gregw2hn/handy_sql_queries/blob/main/sql_...
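
One widely used way to check whether two tables hold the same rows is a pair of `EXCEPT` queries, one in each direction. The sketch below runs that against an in-memory SQLite database. It is not necessarily the technique in the linked file; `compare_tables` and the table names are hypothetical, and it assumes both tables have the same column layout.

```python
import sqlite3


def compare_tables(conn, table_a, table_b):
    """Return (rows only in table_a, rows only in table_b).

    Table names are interpolated into the SQL, so they must be trusted
    identifiers. EXCEPT works on distinct rows, so differing duplicate counts
    are not detected.
    """
    only_a = conn.execute(
        f"SELECT * FROM {table_a} EXCEPT SELECT * FROM {table_b}"
    ).fetchall()
    only_b = conn.execute(
        f"SELECT * FROM {table_b} EXCEPT SELECT * FROM {table_a}"
    ).fetchall()
    return sorted(only_a), sorted(only_b)


# Hypothetical example tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_old (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE t_new (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t_old VALUES (?, ?)", [(1, "x"), (2, "y")])
conn.executemany("INSERT INTO t_new VALUES (?, ?)", [(1, "x"), (2, "z")])

only_old, only_new = compare_tables(conn, "t_old", "t_new")
```

If both lists come back empty, the two tables contain the same set of distinct rows.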

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
