Dolt Is Git for Data

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • go-mysql-server

    A MySQL-compatible relational database with a storage agnostic query engine. Implemented in pure Go.

  • a very cool project they also maintain is a MySQL server framework for arbitrary backends (in Go): https://github.com/dolthub/go-mysql-server

    You can define a "virtual" table (schema, how to retrieve rows/columns) and then a MySQL client can connect and execute arbitrary queries on your table (which could just be an API or other source)

  • dolt

    Dolt – Git for Data

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • sirix

    SirixDB is an an embeddable, bitemporal, append-only database system and event store, storing immutable lightweight snapshots. It keeps the full history of each resource. Every commit stores a space-efficient snapshot through structural sharing. It is log-structured and never overwrites data. SirixDB uses a novel page-level versioning approach.

  • Basically my research project[1] I'm working on in my spare time is all about versioning and efficiently storing small sized revisions of the data as well as allowing sophisticated time travel queries for audits and analysis.

    Of course all secondary user-defined, typed indexes are also versioned.

    Basically the technical idea is to map a huge tree of index tries (with revisions as indexed leave pages at the top-level and a document index as well as secondary indexes on the second level) to an append-only file. To reduce write amplification and to reduce the size of each snapshot data pages are first compressed and second versioned through a sliding snapshot algorithm. Thus, Sirix does not simply do a copy on write per page. Instead it writes nodes, which have been changed in the current revision plus nodes which fall out of the sliding window (therefore it needs a fast random-read drive).

    [1] https://github.com/sirixdb/sirix

  • lakeFS

    lakeFS - Data version control for your data lake | Git for data

  • Also in the same vein, check out https://lakefs.io/

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts