Data diffs: Algorithms for explaining what changed in a dataset (2022)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • macrobase-diff

    Minimal implementation of Macrobase Diff

  • Some years ago, when I was digging into DIFF and MacroBase (the one from Bailis's lab), I made a simple reproduction of the DIFF algorithm: https://github.com/PiotrZakrzewski/macrobase-diff

  • ExplainDaV

  • Of interest might be Explain-Da-V[0] which will be presented at VLDB this year.

    [0] https://github.com/shraga89/ExplainDaV

  • spark-extension

    A library that provides useful extensions to Apache Spark and PySpark.

  • We're doing an env migration and I've been using the Spark diff extension to reconcile data. It's amazing: we've discovered bugs in the data logic so quickly.

    Here is the extension if anyone is interested: https://github.com/G-Research/spark-extension/blob/master/DI...
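
To make the reconciliation idea concrete, here is a minimal plain-Python sketch of a keyed row diff. It does not use spark-extension or reproduce its API; the function name `diff_rows`, the sample rows, and the one-letter labels are assumptions made for illustration only.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Row = Dict[str, object]
DiffEntry = Tuple[str, Hashable, Optional[Row], Optional[Row]]


def diff_rows(left: List[Row], right: List[Row], key: str) -> List[DiffEntry]:
    """Compare two row lists by a key column (hypothetical helper).

    Labels used by this sketch:
      N = present in both and identical, C = present in both but changed,
      D = only in `left` (deleted),      I = only in `right` (inserted).
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    result: List[DiffEntry] = []
    # Walk the union of keys so rows missing on either side are reported.
    for k in sorted(set(left_by_key) | set(right_by_key)):
        old_row = left_by_key.get(k)
        new_row = right_by_key.get(k)
        if old_row is None:
            result.append(("I", k, None, new_row))
        elif new_row is None:
            result.append(("D", k, old_row, None))
        elif old_row == new_row:
            result.append(("N", k, old_row, new_row))
        else:
            result.append(("C", k, old_row, new_row))
    return result


# Made-up sample data standing in for the old and new environments.
old_env = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
new_env = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]

labels = [(label, k) for label, k, _, _ in diff_rows(old_env, new_env, "id")]
print(labels)
```

Filtering the result to everything except `N` gives the rows worth investigating after a migration. On real tables the same join-by-key comparison would run inside the engine rather than in Python dictionaries.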

  • Apache Calcite

    Apache Calcite

  • > Make diff work on more than just SQLite.

    Another way of doing this that I've been wanting to do for a while is to implement the DIFF operator in Apache Calcite[0]. Using Calcite, DIFF could be implemented as rewrite rules that generate the appropriate SQL to execute directly against the database, or the DIFF operator could be implemented outside of the database (which the original paper shows is more efficient).

    [0] https://calcite.apache.org/

  • recidiffist

    Diffs for structured data

  • At Latacora, we use a giant pile of Clojure (almost everything, with specific measured exceptions, is). As a side effect, we have a lot of data. Not necessarily a lot in the sense of "big S3 bill", but definitely a lot in the sense of "you might not expect this to be in a machine-readable format".

    Things like: what Lambdas existed in a customer AWS account 6 months ago in us-east-2 that had access to a specific SQS queue (because we learned later that one of the consumers of that queue would actually consume Python pickles if you asked nicely, and hence get you RCE).

    As a side effect, we do a lot of data diffing: just mostly on more vanilla Clojure structures rather than data sets in the Datasette/CSV/... sense.

    For example: https://github.com/latacora/recidiffist (which we also have wired up to Terraform + S3, so if you write some files to S3, you can get the structured diffs right next to it for free). It's one of those things that's incredibly simple and works ridiculously well. Well, if you do it consistently anyway.

    Also https://github.com/lambdaisland/deep-diff2 for when we're more interested in presenting it to humans.
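
As a rough illustration of diffing nested structures rather than tabular data sets, here is a small Python sketch. It is not how recidiffist or deep-diff2 work internally and does not mirror their output format; the function `structural_diff` and the sample snapshots are hypothetical.

```python
def structural_diff(old, new, path=()):
    """Yield (path, kind, old_value, new_value) for each difference.

    Dicts are compared key by key, recursively. Any other pair of values
    (including lists) is treated as a leaf and compared with ==.
    """
    if isinstance(old, dict) and isinstance(new, dict):
        # Sort by repr so mixed key types still give a stable order.
        for k in sorted(set(old) | set(new), key=repr):
            if k not in new:
                yield (path + (k,), "removed", old[k], None)
            elif k not in old:
                yield (path + (k,), "added", None, new[k])
            else:
                yield from structural_diff(old[k], new[k], path + (k,))
    elif old != new:
        yield (path, "changed", old, new)


# Made-up snapshots of an account inventory at two points in time.
before = {
    "lambdas": {
        "ingest": {"region": "us-east-2", "queues": ["jobs"]},
        "report": {"region": "us-east-2"},
    }
}
after = {
    "lambdas": {
        "ingest": {"region": "us-east-2", "queues": ["jobs", "events"]},
        "cleanup": {"region": "us-east-2"},
    }
}

changes = list(structural_diff(before, after))
for change in changes:
    print(change)
```

Keeping the path alongside each change is what makes the output useful for questions like "which function gained access to which queue, and when".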

  • deep-diff2

    Deep diff Clojure data structures and pretty print the result

  • lakeFS

    lakeFS - Data version control for your data lake | Git for data

  • Might want to check out lakeFS: https://github.com/treeverse/lakeFS

    (full disclosure: I'm one of the creators)

  • If you are looking for an easy way to check in SQL whether two tables match in every row and every column, you can use the following technique:

    https://github.com/gregw2hn/handy_sql_queries/blob/main/sql_...
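
One common way to do a full-table comparison like this is sketched below with Python's built-in `sqlite3` module. It may differ from the linked query. The table names `table_a` and `table_b`, their columns, and the sample rows are assumptions for illustration. The idea is to stack both tables with `UNION ALL`, group by every column, and keep only the groups whose row counts differ between the two sources.

```python
import sqlite3

# Group on every column; a group survives only if the two tables
# contain that exact row a different number of times.
COMPARE_SQL = """
SELECT id, name, amount,
       SUM(CASE WHEN src = 'a' THEN 1 ELSE 0 END) AS count_in_a,
       SUM(CASE WHEN src = 'b' THEN 1 ELSE 0 END) AS count_in_b
FROM (
    SELECT 'a' AS src, id, name, amount FROM table_a
    UNION ALL
    SELECT 'b' AS src, id, name, amount FROM table_b
)
GROUP BY id, name, amount
HAVING SUM(CASE WHEN src = 'a' THEN 1 ELSE 0 END)
    <> SUM(CASE WHEN src = 'b' THEN 1 ELSE 0 END)
ORDER BY id, name, amount
"""


def mismatched_rows(conn: sqlite3.Connection) -> list:
    """Return rows that do not appear equally often in both tables."""
    return conn.execute(COMPARE_SQL).fetchall()


# Hypothetical demo data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER, name TEXT, amount REAL)")
conn.execute("CREATE TABLE table_b (id INTEGER, name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)],
)
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 25.0), (3, "carol", 30.0), (3, "carol", 30.0)],
)

rows = mismatched_rows(conn)
for row in rows:
    print(row)
```

An empty result means the tables hold the same rows with the same multiplicities. Unlike a pair of `EXCEPT` queries, the count-based form also catches duplicated rows, and `GROUP BY` places NULLs in the same group, so NULL-valued columns do not produce false mismatches.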
