The "missing" graph datatype already exists. It was invented in the '70s

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • morel

    Standard ML interpreter, with relational extensions, implemented in Java

  • > don't have to have two separate worlds for queries and other logic

    That's definitely the dream. Another point along that spectrum (from the author of Apache Calcite): https://github.com/hydromatic/morel

  • CoinBLAS

    Bitcoin blockchain graph analysis with the GraphBLAS.

  • When you consider that a graph and a matrix are isomorphic, doing vector matrix multiplication takes a vector with a set value, say row 4, and multiplies it by a matrix where row 4 has values present that represent edges to the nodes that are adjacent to it (ie "adjacency" matrix). The result is a vector with the next "step" in a BFS across the graph, do that in a loop and you step across the whole graph.

    A cool result of this is, for example, taking an adjacency matrix and squaring it is the "Friend of a Friend" graph. It takes every node/row and multiplies it by itself, returning a matrix that are adjacent to the adjacencies of each node, ie, the friends (adjacencies of the adjacencies) of friends (adjacencies) of the nodes.

    Deeper traversal are just higher nodes, a matrix cubed are the friends of the friends of the friends.

    A picture is worth a thousand words, see figure 7 of this paper:

    https://arxiv.org/pdf/1606.05790.pdf

    Also check out figure 8, this shows how incidence matrices can work to represent hyper and multi graphs. An pair of incidence matrices reprsent two graphs, one from nodes to edges and the other from edges to nodes, these are n by m and m by n. When you multiply them, you get a square adjacency matrix that "projects" the incidence into an adjacency. This can be used to collapse hypergraphs into simple graphs that use different semirings to combine the multiple edges.

    For some pretty pictures of this kind of stuff, check out CoinBLAS (note I am not a crypto-bro, it was just a very handy extremely large multi-graph that I could easily download in chunks to play with):

    https://github.com/Graphegon/CoinBLAS/

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • snakelog

    A Datalog Framework for Python

  • You can without too much work transpile datalog to SQL. SQL does have such strong support that it is useful https://github.com/philzook58/snakelog or perhaps just do it manually https://www.philipzucker.com/tiny-sqlite-datalog/

  • pygraphistry

    PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer

  • If you enjoy this kind of thinking, we recently released GFQL for dataframe-native graph querying & compute

    Imagine Neo4j Cypher, except no need for a database -- just import it -- and automatically vectorizes for significantly faster CPU+GPU performance. This is fundamentally similar to the kinds of implementations a datalog approach enables. (And indeed one of the alternative interfaces we were considering!)

    We've run it on 100M+ edge graphs on some of the cheapest GPUs you can get, and are getting ready for the next rev with aggregate compute: https://github.com/graphistry/pygraphistry/blob/master/demos...

  • Graphony

  • > Abstractions should be seen as models. They are always wrong, but they are sometimes useful. (And sometimes not.)

    George Box was very specifically talking about statistical models when he coined that aphorism. Matrices are linear algebra and graphs are graph theory, I find it hard to think they are not correct and useful models.

    > A forward traversal >v of node v enters from the left, reads the sequence stored in the node, and exits from the right. A reverse traversal I'm not an expert in this field but I'm guessing you're talking about De Bruijn graphs, which can be very elegantly modeled with incidence matrices, here's an example of one using the GraphBLAS that downloads data from BioPython, loads it into incidence matrices and graphs it. This is just a simple example, SuiteSparse can handle many billions of edges:

    https://github.com/Graphegon/Graphony?tab=readme-ov-file#exa...

    Traversing bidirectionally is quite easy, the upper triangle of a matrix are the directed outgoing edges, and the lower triangle are the incoming. This style of "push/pull" optimization is common in the GraphBLAS.

    > In a good graph representation, you can do this by maintaining a small state that does not grow significantly with the length of the context or the number of underlying paths.

    Again if I understand you correctly, in the GraphBLAS this is accomplished by using accumulators and masks. During traversal data can be accumulated, with a stock operator or one you define, into a vector or matrix, and that object can be used to efficiently mask subsequent computations to avoid unnecessary work or determine when you've reached a termination condition.

    > Matrices don't feel like a good abstraction for graphs like this.

    Mathematically, graphs and matrices are isomorphic. Regardless of algorithm or storage format like edge lists, tuples or CSR, every graph is a matrix, and vice versa. And if you have a matrix, you have linear algebra to operate on it.

    Some people don't like Linear Algebra as a graph abstraction, so I guess for them it is "not good", but on the other hand, it's Linear Algebra and Graph Theory, whose roots date back to the 2nd century BC, forward through great minds like Descartes and Euler, permeating every kind of math, science, physics and engineering discipline humans have ever created. That's a strong argument for its goodness.

    Now it is entirely possible, likely even, that the current SuiteSparse implementation doesn't have exactly the tool needed or maybe not the precise best storage format, but these missing pieces do not invalidate the underlying mathematical foundation that it's based on.

  • gbbs

    GBBS: Graph Based Benchmark Suite

  • Joins are great for OLTP workloads but are horrible for all but the simplest graph algorithms. A datalog-based system makes for a nice story, but it would not survive benchmarking.

    Pick a classic algorithm, say triangle counting, implement it in Datalog, compare against GBBS [1] and come back here to report results.

    [1] https://github.com/ParAlg/gbbs

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts