Ask HN: What is new in Algorithms / Data Structures these days?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • WorkOS - The modern identity platform for B2B SaaS
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • SaaSHub - Software Alternatives and Reviews
  • peritext

    A CRDT for asynchronous rich-text collaboration, where authors can work independently and then merge their changes.

  • Yes - The BFT problem only matters when you have Byzantine actors. But I think users deserve and expect the system to be reasonably well behaved and predictable in all situations. Anything publically writable, for example, needs BFT resilience. Or any video game.

    As for the prosemirror problem, I assume you’re talking about weird merges from users putting markdown in a text crdt? You’re totally right - this is a problem. Text CRDTs treat documents as a simple sequence of characters. And that confuses a lot of structured formats. For example, if two users concurrently bold the same word, the system should see that users agree that it should be bolded. But if that “bold” intent is translated into “insert double asterisks here and here”, you end up with 4 asterisks before and after the text, and that confused markdown parsers. The problem is that a text crdt doesn’t understand markdown.

    JSON editing has similar problems. I’ve heard of plenty of people over the years putting json text into a text crdt, only to find that when concurrent edits happen, the json grows parse errors. Eg if two users concurrently insert “a” and “b” into an empty list. The result is [“a””b”] which can’t be parsed.

    The answer to both of these problems is to use CRDTs which understand the shape of your data structure. Eg, use a json OT/crdt system for json data (like sharedb or automerge). Likewise, if the user is editing rich text in prosemirror then you want a rich text crdt like peritext. Rich text CRDTs add the concept of annotations - so if two users bold overlapping regions of text, the crdt understands that the result should be that the entire region is bolded. And that can be translated back to markdown if you want.

    The ink & switch people did a great write up of how this sort of crdt works here: https://www.inkandswitch.com/peritext/

  • egg

    egg is a flexible, high-performance e-graph library (by egraphs-good)

  • In compilers, there's been a recent uptick in industry research and adoption into using equivalency classes and graphs (egraphs) for doing optimization passes in a way that preserves information and solves the phase ordering problem. Egg[0] was one of the big libraries that came out of it, but can only handle term rewriting without guards, and so can't express y/x*x -> y because it's unsound when x=0, or optimization passes that are across control flow points and thus some of the eclasses are only valid in some locations. The Cranelift developers adapted it into a construction they call aegraphs[1] that handles this, and are migrating Cranelift to use these passes for all optimizations, which is (hopefully) faster and achieves more optimization opportunitie; Chris Fallin is presenting about their aegraph work at PLDI this year.

    0: https://egraphs-good.github.io/

    1: https://github.com/cfallin/rfcs/blob/main/accepted/cranelift...

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
  • yjs

    Shared data types for building collaborative software

  • automerge

    A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.

  • rfcs

    RFC process for Bytecode Alliance projects (by cfallin)

  • In compilers, there's been a recent uptick in industry research and adoption into using equivalency classes and graphs (egraphs) for doing optimization passes in a way that preserves information and solves the phase ordering problem. Egg[0] was one of the big libraries that came out of it, but can only handle term rewriting without guards, and so can't express y/x*x -> y because it's unsound when x=0, or optimization passes that are across control flow points and thus some of the eclasses are only valid in some locations. The Cranelift developers adapted it into a construction they call aegraphs[1] that handles this, and are migrating Cranelift to use these passes for all optimizations, which is (hopefully) faster and achieves more optimization opportunitie; Chris Fallin is presenting about their aegraph work at PLDI this year.

    0: https://egraphs-good.github.io/

    1: https://github.com/cfallin/rfcs/blob/main/accepted/cranelift...

  • clingo

    🤔 A grounder and solver for logic programs.

  • Answer Set Programming is an incredibly powerful tool to declaratively solve combinatorial problems. Clingo is one of the best open source implementations in my opinion: https://github.com/potassco/clingo

  • egglog

    egraphs + datalog!

  • The recent work on relational, datalog-inspired egraphs in PLDI this year ("Unifying Datalog and Equality Saturation") is actually interesting because it can solve cases like the y/x*x -> y identity example, by the power of an interval analysis on x (among other things.) Sort of like adding a postulate but instead it's by adding relations between terms in the graph.

    https://github.com/egraphs-good/egglog

    https://arxiv.org/pdf/2304.04332.pdf

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • flix

    The Flix Programming Language

  • You might be interested in Flix which has first-class Datalog program values:

    https://flix.dev/

    https://doc.flix.dev/fixpoints.html

    (I am one of the developers of Flix)

  • highfleet-ship-opt

    A c/c++ module and python extensions for automatic optimization of Highfleet ship modules. Try it live at https://hfopt.jodavaho.io

  • I used a MILP solver to optimize my ship loadouts in Highfleet. It's rugged-looking, but works great. https://hfopt.jodavaho.io

  • terminusdb

    TerminusDB is a distributed database with a collaboration model

  • How about some succinct data structures and delta encoding for modern databases [1]. Succinct data structures are a family of data structures which are close in size to the information theoretic minimum representation (while still being queryable).

    [1] https://github.com/terminusdb/terminusdb/blob/dev/docs/white...

  • ann-benchmarks

    Benchmarks of approximate nearest neighbor libraries in Python

  • Nobody has mentioned Approximate Nearest Neighbor search, which IMO, are fundamental data structures advancements.

    Basically given a set of million ~1000 valued vectors, and a query ~1000 valued vector, find the closest array in the original set. This is "nearest neighbor" search and there have been increasingly more and more sophisticated approaches:

    http://ann-benchmarks.com

  • libclc

    Cache Line Container - C11

  • This is something I planned (2015) on sharing at some point but then years flew by and here we are .. :}

    It is a cacheline sized 'container' (CLC) of machine-word length records, with one record used to store the record order and remaining bits for metadata. So you can project different kinds of container semantics, such as FIFO or LRU -- any ordered set semantics -- on this meta record. Using arrays of CLCs you can create e.g. a segmented LRU, where the overhead per item is substantially less than a conventional pointer-based datastructure, and, is naturally suited for concurrent operations (for example by assigning a range to distinct worker threads), and ops require a few or couple of lines to be touched. The LRU (or whatever) semantics in aggregate will be probabilistic, as the LRU order is deterministic per unit container only. It is very fast :)

    https://github.com/alphazero/libclc/blob/master/include/libc...

    https://github.com/alphazero/libclc/blob/master/include/libc...

    As for the non-deterministic aspects: Since container semantics e.g. LRU order is only applied at unit level, the overall cache is ~LRU. We can strictly quantify the 'ordering error' by observing the age of items in FIFO mode as they are removed: for a deterministic container the age of the item is equal to the total capacity of the queue, for a segmented (array) composed of FIFO queues, the age will have a effectively gaussian distribution around the capacity (number of units x unit capacity). But since these containers can use as few as 9 bits per entry vs 24 or more bytes for pointer based solutions (which use linked-lists), for the same allocation of memory, the capacity of the array of CLCs will be much greater, so, the distribution tail of 'short-lived' items will actually be longer lived than items in a strict queue for the same exact memory. Additional techniques, such as n-array hashing, and low order 'clock' bits at container level, can tighten this distribution significantly (i.e. ~LRU -> LRU) via better loading factors.

  • ezno

    A JavaScript compiler and TypeScript checker written in Rust with a focus on static analysis and runtime performance

  • > I'm curious if there are any practical reasons we don't see them implemented in more languages.

    I believe it's because they're not exactly easy to implement and the resulting extensive type checking might also affect compiler performance.

    By the way, another great example of refinement types (in JavaScript) is this one: https://kaleidawave.github.io/posts/introducing-ezno/

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts