data-diff

Compare tables within or across databases (by datafold)

Data-diff Alternatives

Similar projects and alternatives to data-diff

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a better data-diff alternative or higher similarity.

data-diff reviews and mentions

Posts with mentions or reviews of data-diff. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-07-26.
  • How to Check 2 SQL Tables Are the Same
    8 projects | news.ycombinator.com | 26 Jul 2023
    If the issue happens a lot, there is also: https://github.com/datafold/data-diff

    That is a nice tool to do it cross database as well.

    I think it's based on a checksum method.
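
    To illustrate the checksum idea mentioned in the comment above, here is a minimal sketch using only the Python standard library. It is not data-diff's implementation. The `table_checksum` and `make_db` helpers and the `users` table are hypothetical names made up for this example, and it assumes both tables live in SQLite so that row values serialize identically.

```python
import hashlib
import sqlite3


def table_checksum(conn, table, key):
    """Return (row_count, md5 hex digest) over all rows, ordered by key."""
    digest = hashlib.md5()
    count = 0
    # Table and column names are interpolated directly: pass trusted names only.
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def make_db(rows):
    """Build an in-memory database holding a small hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn


source = make_db([(1, "ann"), (2, "bob")])
replica_ok = make_db([(2, "bob"), (1, "ann")])    # same rows, different insert order
replica_bad = make_db([(1, "ann"), (2, "bobby")])  # one value differs

print(table_checksum(source, "users", "id") == table_checksum(replica_ok, "users", "id"))
print(table_checksum(source, "users", "id") == table_checksum(replica_bad, "users", "id"))
```

    Ordering by the key makes the digest independent of insertion order. A cross-database comparison would also need to normalize types and formatting before hashing, which this sketch does not attempt.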

  • Oops, I wrote yet another SQLAlchemy alternative (looking for contributors!)
    4 projects | /r/pythoncoding | 8 May 2023
    First, let me introduce myself. My name is Erez. You may know some of the Python libraries I wrote in the past: Lark, Preql and Data-diff.
  • Looking for Unit Testing framework in Database Migration Process
    3 projects | /r/dataengineering | 23 Mar 2023
    https://github.com/datafold/data-diff might be worth a look
  • Ask HN: How do you test SQL?
    18 projects | news.ycombinator.com | 31 Jan 2023
    I did data engineering for 6 years and am building a company to automate SQL validation for dbt users.

    First, by “testing SQL pipelines”, I assume you mean testing changes to SQL code as part of the development workflow? (vs. monitoring pipelines in production for failures / anomalies).

    If so:

    1 – assertions. dbt comes with a solid built-in testing framework [1] for expressing assertions such as “this column should have values in the list [A,B,C]” as well as checking referential integrity, uniqueness, nulls, etc. There are more advanced packages on top of dbt tests [2]. The problem with assertion testing in general, though, is that for a moderately complex data pipeline, it’s infeasible to achieve test coverage that would cover most possible failure scenarios.

    2 – data diff: for every change to SQL, know exactly how the code change affects the output data by comparing the data in dev/staging (built off the dev branch code) with the data in production (built off the main branch). We built an open-source tool for that: https://github.com/datafold/data-diff, and we are adding an integration with dbt soon which will make diffing, as part of the dbt development workflow, one command away [2]

    We make money by selling a Cloud solution for teams that integrates data diff into GitHub/GitLab CI and automatically diffs every pull request to tell you how a change to SQL affects the target table you changed, downstream tables and dependent BI tools (video demo: [3])

    I’ve also written about why reliable change management is so important for data engineering and which key best practices to implement [4]

    [1] https://docs.getdbt.com/docs/build/tests
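
    The dev-versus-production comparison described in point 2 can be sketched as a row-level diff keyed on a primary key. This is an illustrative example under stated assumptions, not the data-diff tool's API: `diff_by_key`, `orders_prod`, and `orders_dev` are hypothetical names, and both tables are assumed to sit in one SQLite database and to be small enough to load into memory.

```python
import sqlite3


def diff_by_key(conn, prod_table, dev_table, key="id"):
    """Return rows added, removed, and changed in dev relative to prod."""
    def load(table):
        # Table names are interpolated directly: pass trusted names only.
        cur = conn.execute(f"SELECT * FROM {table}")
        columns = [c[0] for c in cur.description]
        key_index = columns.index(key)
        return {row[key_index]: row for row in cur}

    prod, dev = load(prod_table), load(dev_table)
    added = [dev[k] for k in sorted(dev.keys() - prod.keys())]
    removed = [prod[k] for k in sorted(prod.keys() - dev.keys())]
    changed = [
        (prod[k], dev[k])
        for k in sorted(prod.keys() & dev.keys())
        if prod[k] != dev[k]
    ]
    return added, removed, changed


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_prod (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE orders_dev  (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO orders_prod VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO orders_dev  VALUES (1, 10.0), (2, 25.0), (4, 40.0);
""")

added, removed, changed = diff_by_key(conn, "orders_prod", "orders_dev")
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

    Unlike an assertion, this kind of diff needs no up-front list of expectations: any row that the code change added, dropped, or altered shows up in one of the three lists.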

  • data-diff VS cuallee - a user suggested alternative
    2 projects | 30 Nov 2022
  • Show HN: Open-source infra for building embedded data pipelines
    2 projects | news.ycombinator.com | 1 Sep 2022
    Looks useful! Do you have a way to validate that the data was copied correctly and entirely? If not, you might want to consider integrating data-diff for that - https://github.com/datafold/data-diff
  • Show HN: Data Diff – compare tables of any size across databases
    2 projects | news.ycombinator.com | 22 Jun 2022
    Gleb, Alex, Erez and Simon here – we are building an open-source tool for comparing data within and across databases at any scale. The repo is at https://github.com/datafold/data-diff, and our home page is https://datafold.com/.

    As a company, Datafold builds tools for data engineers to automate the most tedious and error-prone tasks falling through the cracks of the modern data stack, such as data testing and lineage. We launched two years ago with a tool for regression-testing changes to ETL code https://news.ycombinator.com/item?id=24071955. It compares the produced data before and after the code change and shows the impact on values, aggregate metrics, and downstream data applications.

    While working with many customers on improving their data engineering experience, we kept hearing that they needed to diff their data across databases to validate data replication between systems.

    There were 3 main use cases for such replication:

    * To perform analytics on transactional data in an OLAP engine (e.g. PostgreSQL > Snowflake)

    2 projects | news.ycombinator.com | 22 Jun 2022
    There are plans to support pretty much every database. The reason it’s not supported currently is that its MD5 hashing is too slow, so we need to do something different for it, e.g. just sum for types that support it. It’s similar for databases we plan to support that don’t support MD5 at all, for example ElasticSearch.

    See https://github.com/datafold/data-diff/issues/51
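
    As a rough illustration of the "just sum" fallback mentioned above, the sketch below computes a row count plus per-column sums inside the database instead of hashing rows. It is only an assumption about what such a fallback could look like, not what data-diff does: `sum_fingerprint`, `make_table`, and the `payments` table are hypothetical names for this example.

```python
import sqlite3


def sum_fingerprint(conn, table, numeric_columns):
    """Return (row count, sum of each numeric column), computed by the database."""
    # Table and column names are interpolated directly: pass trusted names only.
    sums = ", ".join(f"COALESCE(SUM({c}), 0)" for c in numeric_columns)
    return conn.execute(f"SELECT COUNT(*), {sums} FROM {table}").fetchone()


def make_table(rows):
    """Build an in-memory database holding a small hypothetical payments table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, cents INTEGER)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    return conn


a = make_table([(1, 100), (2, 250)])
b = make_table([(1, 100), (2, 250)])
c = make_table([(1, 100), (2, 251)])   # one value differs
d = make_table([(1, 250), (2, 100)])   # values swapped between rows

cols = ["id", "cents"]
print(sum_fingerprint(a, "payments", cols) == sum_fingerprint(b, "payments", cols))
print(sum_fingerprint(a, "payments", cols) == sum_fingerprint(c, "payments", cols))
print(sum_fingerprint(a, "payments", cols) == sum_fingerprint(d, "payments", cols))
```

    The last comparison shows the trade-off: tables `a` and `d` differ, yet their sums match, so a sum is cheaper than a hash but misses some differences.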

  • So you're using dbt tests—what's next in data quality?
    2 projects | /r/dataengineering | 16 Jun 2022
  • How are you guys validating your data?
    2 projects | /r/dataengineering | 9 Jun 2022

Stats

Basic data-diff repo stats
20
2,824
9.6
17 days ago