Ask HN: What are some 'cool' but obscure data structures you know about?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Stream - Scalable APIs for Chat, Feeds, Moderation, & Video.
Stream helps developers build engaging apps that scale to millions with performant and flexible Chat, Feeds, Moderation, and Video APIs and SDKs powered by a global edge network and enterprise-grade infrastructure.
getstream.io
featured
InfluxDB – Built for High-Performance Time Series Workloads
InfluxDB 3 OSS is now GA. Transform, enrich, and act on time series data directly in the database. Automate critical tasks and eliminate the need to move data externally. Download now.
www.influxdata.com
featured
  1. minisketch

    Minisketch: an optimized library for BCH-based set reconciliation

    Here is one not on the list so far:

    Set Sketches. They allow you compute the difference between two sets (for example to see if data has been replicated between two nodes) WITHOUT transmitting all the keys in one set to another.

    Say you have two sets of the numbers [1, ..., 1million] all 32 bit integers, and you know one set is missing 2 random numbers. Set sketches allow you to send a "set checksum" that is only 64 BITS which allows the other party to compute those missing numbers. A naive algorithm would require 1MB of data be transferred to calculate the same thing.

    *(in particular pin sketch https://github.com/sipa/minisketch).

  2. Stream

    Stream - Scalable APIs for Chat, Feeds, Moderation, & Video. Stream helps developers build engaging apps that scale to millions with performant and flexible Chat, Feeds, Moderation, and Video APIs and SDKs powered by a global edge network and enterprise-grade infrastructure.

    Stream logo
  3. t-digest

    A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means

    On sketches, there is a genre of structure for estimating histogram-like statistics (median, 99th centile, etc) in fixed space, which i really like. Two examples:

    t-digest https://github.com/tdunning/t-digest

    DDSketch https://github.com/DataDog/sketches-java

  4. pyroscope

    Discontinued Continuous Profiling Platform. Debug performance issues down to a single line of code [Moved to: https://github.com/grafana/pyroscope] (by pyroscope-io)

    Tries (or prefix trees).

    We use them a lot at Pyroscope for compressing strings that have common prefixes. They are also used in databases (e.g indexes in Mongo) or file formats (e.g debug symbols in macOS/iOS Mach-O format are compressed using tries).

    We have an article with some animations that go into details about tries in case anyone's interested [0].

    [0] https://github.com/pyroscope-io/pyroscope/blob/main/docs/sto...

  5. TablaM

    The practical relational programing language for data-oriented applications

    I made https://github.com/mamcx/tree-flat as flattened stored tree in pre-order that allows for very fast iterations even for childs/parent queries. Is based on APL, so not that novel.

    I also like a lot the relational model, is not that much represented so I making a language on top of it: https://tablam.org.

  6. tree-flat

    TreeFlat is the simplest way to build & traverse a pre-order Tree in Rust

    I made https://github.com/mamcx/tree-flat as flattened stored tree in pre-order that allows for very fast iterations even for childs/parent queries. Is based on APL, so not that novel.

    I also like a lot the relational model, is not that much represented so I making a language on top of it: https://tablam.org.

  7. ann-benchmarks

    Benchmarks of approximate nearest neighbor libraries in Python

    HNSW, or Hierarchical Navigable Small World is a graph data structure for approximate nearest neighbor search of vectors.

    https://arxiv.org/abs/1603.09320

    The problem space of ANN is one of those really deep holes you can go down. It’s a game of balancing time and space, and it’s got plenty of fascinating algorithms and datastructures.

    Check out http://ann-benchmarks.com/ for a comparison. HNSW is not “the best” but it’s easy to understand and is quite effective.

  8. entt

    Gaming meets modern C++ - a fast and reliable entity component system (ECS) and much more

    Sparse sets.

    They're often used in Entity Component System architectures since you have O(1) add/remove/find, & O(n) iteration. Iterations are also packed, which is a bonus for cache efficiency.

    Can be difficult to implement well, but the concept is simple and a neat example of a useful specialty data structure. Take a look at https://github.com/skypjack/entt

  9. InfluxDB

    InfluxDB – Built for High-Performance Time Series Workloads. InfluxDB 3 OSS is now GA. Transform, enrich, and act on time series data directly in the database. Automate critical tasks and eliminate the need to move data externally. Download now.

    InfluxDB logo
  10. RVS_Generic_Swift_Toolbox

    A Collection Of Various Swift Tools, Like Extensions and Utilities

    Ole Begemann and Chris Eidhoff wrote Advanced Swift, and, in there, described a really efficient FIFO queue.

    I implemented a variant ofit, in my Generic Swift Toolbox Package[0]. It’s lightning fast.

    [0] https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...

  11. Ole Begemann and Chris Eidhoff wrote Advanced Swift, and, in there, described a really efficient FIFO queue.

    I implemented a variant ofit, in my Generic Swift Toolbox Package[0]. It’s lightning fast.

    [0] https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...

  12. CPython

    The Python programming language

    Arrays aren't efficient when you want to add/remove to the head. Deque in python exists so there is a data structure with constant time pop/push to the head. And it is in fact implemented as a doubly linked list, with various optimizations: https://github.com/python/cpython/blob/v3.8.1/Modules/_colle....

  13. PSI

    Private Set Intersection Cardinality protocol based on ECDH and Bloom Filters (by OpenMined)

    I came here to say Golomb compressed sets except now I see it's part of the question!

    They are used by default in the OpenMined implementation of Private Set Intersection[1] - a multi-party computation technique.

    [1] https://github.com/OpenMined/PSI/blob/master/private_set_int...

  14. data-algos

    An knowledge graph for data structures and algorithms in markdown format

    I am trying to compile a list of data structures (and algorithms and other stuff) into a kind of knowledge base: https://github.com/Dobatymo/data-algos/blob/master/data-stru.... There are few cool and obscure data structures already. Please feel free to help extend it!

  15. RoaringBitmap

    A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others

  16. fractional_cascading

    Fractional Cascading in Rust

    Fractional Cascading. A simple and very cool way to speed up searching for the same key in multiple lists. Instead of K binary searches taking time Klog(N), you can do it in log(N) time without using asymptomatically more space.

    https://en.m.wikipedia.org/wiki/Fractional_cascading

    I wrote a simple demo in Rust a while back to help myself learn the language.

    https://github.com/mgraczyk/fractional_cascading

    I also think pairing heaps are neat.

  17. clojure

    The Clojure programming language

    Huet's zipper. https://en.wikipedia.org/wiki/Zipper_(data_structure).

    One zipper value represents a regular tree of data, but from the perspective of the "current position". There are traversal operations that take a zipper value and return the same tree from the perspective of an element above/beside/below the current position, and there are operations that return the same structure but with the current element changed, or items deleted or inserted.

    Huet's paper is an easy read if you've been exposed to OCaml or similar languages: https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced... . His "returned glove" metaphor is what made it click for me.

    Clojure includes an implementation of Huet's zippers https://github.com/clojure/clojure/blob/master/src/clj/cloju... that is <300 lines of code. It's very, very clever and broadly useful for dealing with XML or other nested data structures.

  18. gring

    Golang circular linked list with array backend

    Not super obscure, but I remember that one specific time when circular-linked-list made a lot of sense to use, well I wanted to use it so I used it.

    I had a bunch of API keys with poor expiry documentation and implementation, so to find out if a key expired it had to be used. I put it in a main "keys.pop loop" and all methods below tried to use the key. If HTTP response was (some another obscure HTTP response code like) 505, I simply called `continue;` in the loop to jump to another, without caring at all where I was before.

    https://github.com/atedja/gring

  19. spread

    SPREAD (by odipar)

    I've created an Btree version of SeqHash that also allows splitting SeqHashes in O(log(n)) called SplitHash.

    See: https://github.com/odipar/spread/blob/master/src/main/scala/...

  20. 500lines

    500 Lines or Less (by kragen)

  21. tries-T9-Prediction

    Its artificial intelligence algorithm of T9 mobile

    Even though a trie is pretty standard and expected (to be known) these days it was my first deep dive into more exotic data structures after an interview question about implementing T9 that stumped me many years ago.

    https://github.com/Azeem112/tries-T9-Prediction

  22. asami

    A graph store for Clojure and ClojureScript

  23. ctrie-java

    Java implementation of a concurrent trie

    Concurrent tries with non-blocking snapshots [0]

    Say that you have a dataset that needs to be ordered, easily searchable, but is also updated quite frequently. Fast accesses are a pain if you decide to use traditional read-write locks.

    Ctries are entirely lock-free, thus there is no waiting for your read operations when an update is happening, i.e. you run lookups on snapshots while updates happen.

    They are also a lot of fun to implement, especially if you aren't familiar with lock-free algorithms! I did learn a lot doing it myself [1]

    [0] http://aleksandar-prokopec.com/resources/docs/ctries-snapsho...

    [1] https://github.com/mabeledo/ctrie-java

  24. hegg

    Fast equality saturation in Haskell

    Equality graphs (e-graphs) for theorem proving and equality saturation and other equality-related things.

    They're awesome data structures that efficiently maintain a congruence relation over many expressions

    > At a high level, e-graphs extend union-find to compactly represent equivalence classes of expressions while maintaining a key invariant: the equivalence relation is closed under congruence.

    e.g. If I were to represent "f(x)" and "f(y)" in the e-graph, and then said "x == y" (merged "x" and "y" in the e-graph), then the e-graph, by congruence, would be able to tell me that "f(x) == f(y)"

    e.g. If I were to represent "a(2/2)", in the e-graph, then say "2/2 == 1", and "x1 == x", by congruence the e-graph would know "a*(2/2) == a" !

    The most recent description of e-graphs with an added insight on implementation is https://arxiv.org/pdf/2004.03082.pdf to the best of my knowledge.

    P.S: I'm currently implementing them in Haskell https://github.com/alt-romes/hegg

  25. AspNetCoreDiagnosticScenarios

    This repository has examples of broken patterns in ASP.NET Core applications

    https://github.com/davidfowl/AspNetCoreDiagnosticScenarios/b...

  26. plurid-data-structures-typescript

    Utility Data Structures Implemented in TypeScript

    Somewhat along these lines, I have formed a concept forcedly called "differentially composable string", or "deposed string", or more precise "poor man's git".

    The intended use case is to obtain a compact representation of all the historic text entered into an input field (notes, comments, maybe long-form): all the stages of the text, where a stage is a tuple [add/remove, start_index, text/end_index]. Once you get the stages from the deposed string as JSON, you could transform them however you want then load them into a new deposed string.

    You can read more on GitHub: https://github.com/plurid/plurid-data-structures-typescript#... or play around on my note-taking app implementing deposed strings and more: https://denote.plurid.com

  27. marisa-trie

    MARISA: Matching Algorithm with Recursively Implemented StorAge (by s-yata)

    My favorite one that I've had the chance to use professionally is the Marisa trie[0].

    Context is a data scientist had written a service that essentially was just lookups on a trie. He'd left and the service was starting to have memory problems so I dug in and swapped the implementation. Iirc swapping the trie implementation changed memory usage from 8gb to 100mb and sped everything up as well.

    The actual data structure is equivalent to a trie, but cannot be modified once it's been built (I think it may be the same as a LOUDS trie, I don't remember the specifics)

    [0] https://github.com/s-yata/marisa-trie

  28. stutter

    Implement a Lisp, in C, from scratch, no libs (by mkirchner)

    This. Roughly a year ago I got interested in efficient immutability for my write-from-scratch-in-C Lisp [0] and started to write a HAMT implementation in C [1], along with a (somewhat hacky, you have been warned) benchmarking suite [2].

    The docs are only 70% done (in particular the "putting it all together" part is missing) but it has been a really interesting and enlightening journey so far and can only recommend embarking on this path to everyone.

    [0]: https://github.com/mkirchner/stutter

  29. hamt

    A hash array-mapped trie implementation in C

  30. hamt-bench

    libhamt benchmarking code repo

  31. swift

    the multiparty transport protocol (aka "TCP with swarming" or "BitTorrent at the transport layer") (by gritzko)

  32. sketches-java

    DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees.

    On sketches, there is a genre of structure for estimating histogram-like statistics (median, 99th centile, etc) in fixed space, which i really like. Two examples:

    t-digest https://github.com/tdunning/t-digest

    DDSketch https://github.com/DataDog/sketches-java

  33. apv-median

    C library for streaming median approximation

    An alternative streaming method due to Arandjelović, Pham, Venkatesh based on maximum entropy: https://ieeexplore.ieee.org/document/6971097 and implementation in C: https://github.com/dressipi/apv-median

  34. pybktree

    Python BK-tree data structure to allow fast querying of "close" matches

    The BK-Tree, which allows fast querying of "close" matches, such as Hamming distance (number of bits different). http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...

    I wrote a Python library implementing them a number of years ago: https://github.com/benhoyt/pybktree

  35. dictomaton

    Finite state dictionaries in Java

    Also related: Levenshtein automata - automata for words that match every word within a given Levenshtein distance. The intersection of a Levenshtein automaton of a word and a DAWG gives you an automaton of all words within the given edit distance.

    I haven't done any Java in years, but I made a Java package in 2013 that supports: DAWGs, Levenshtein automata and perfect hash automata:

    https://github.com/danieldk/dictomaton

  36. Folly

    An open-source C++ library developed and used at Facebook.

    https://github.com/facebook/folly/blob/main/folly/docs/Packe...

    Is pretty rad.

    The parent mentioned Bloom Filters, HyperLogLog-type stuff is on that family tree and also very interesting and very useful.

  37. critbit

    Critbit trees in C

    > Good use-case: routing. Say you have a list of 1 million IPs that are [deny listed].

    Apparently, bloom filters make for lousy IP membership checks, read: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

    CritBit Trie [0] and possibly Allotment Routing Table (ART) are better suited for IPs.

    [0] https://github.com/agl/critbit

    [1] https://web.archive.org/web/20210720162224/https://www.harig...

  38. ego

    EGraphs in OCaml (by verse-lab)

    Shameless plug, I also maintain an OCaml implementation of egraphs (named ego) at https://github.com/verse-lab/ego

    While the most popular implementation at the moment seems to be egg in Rust, I find that OCaml serves as a much more ergonomic environment for quickly prototyping out uses of egraphs in practice. As a bonus, ego also shares the same logical interface as egg itself, so once you've finalised your designs, you shouldn't have much trouble porting them to egg if you need the performance gains.

  39. pvfmm

    A parallel kernel-independent FMM library for particle and volume potentials

  40. mode

    Python AsyncIO Services

    This is clever, otherwise you have to coordinate your cache setter and your async function call between potentially many concurrent calls. There are ways to do this coordination though, one implementation I've borrowed in practice in python is modes stampede decorator: https://github.com/ask/mode/blob/master/mode/utils/futures.p...

  41. multiversion-concurrency-control

    Discontinued Implementation of multiversion concurrency control, Raft, Left Right concurrency Hashmaps and a multi consumer multi producer Ringbuffer, concurrent and parallel load-balanced loops, parallel actors implementation in Main.java, Actor2.java and a parallel interpreter

    So the self.N is the product number of the nested loops, you can increment this by 1 each time to get each product of the nested loops.

    We can load balance the loops and schedule the N to not starve the outer loops.

    So rather than that sequence, we can generate 000, 111, 222, 010, 230, 013 and so on.

    Load balanced loops means you can progress on multiple items simultaneously, concurrently. I combined concurrent loops with parallel loops with multithreading. I plan it to be a pattern to create incredibly reactive frontends and backends with low latency to processing. Many frontends lag when encrypting, compressing or resizing images, it doesn't need to be that slow. IntelliJ doesn't need to be slow with concurrent loops and multithreading.

    See https://github.com/samsquire/multiversion-concurrency-contro...

    Loops are rarely first class citizens in most languages I have used and I plan to change that.

    I think I can combine your inspiration of gray codes with this idea.

    I think memory layout and data structure and algorithm can be 3 separate decisions. I am yet to see any developers talk of this. Most of the time the CPU is idling waiting for memory, disk or network. We can arrange and iterate data to be efficient.

  42. So the self.N is the product number of the nested loops, you can increment this by 1 each time to get each product of the nested loops.

    We can load balance the loops and schedule the N to not starve the outer loops.

    So rather than that sequence, we can generate 000, 111, 222, 010, 230, 013 and so on.

    Load balanced loops means you can progress on multiple items simultaneously, concurrently. I combined concurrent loops with parallel loops with multithreading. I plan it to be a pattern to create incredibly reactive frontends and backends with low latency to processing. Many frontends lag when encrypting, compressing or resizing images, it doesn't need to be that slow. IntelliJ doesn't need to be slow with concurrent loops and multithreading.

    See https://github.com/samsquire/multiversion-concurrency-contro...

    Loops are rarely first class citizens in most languages I have used and I plan to change that.

    I think I can combine your inspiration of gray codes with this idea.

    I think memory layout and data structure and algorithm can be 3 separate decisions. I am yet to see any developers talk of this. Most of the time the CPU is idling waiting for memory, disk or network. We can arrange and iterate data to be efficient.

  43. Caffeine

    A high performance caching library for Java

    Yeah, saves you from thundering herd problems

    https://github.com/ben-manes/caffeine cache does something similar by caching the future that gets returned on look-ups if the value is still being computed

  44. til

    Today I Learned. Reference for things I already tried. (by bschaeffer)

    I wrote a version of this with Elixir: https://github.com/bschaeffer/til/tree/master/elixir/gen_ser...

    Didn't know what to call it but PromiseMaps is nice name for it.

  45. hierarch_old

    Discontinued Hierarch: A new, blazingly fast, in-memory proof-of-concept data structure and indexing algorithm for querying on dynamic attribute/tag/type-based hierarchical data.

  46. Multiplexer

    Asynchronous caching and multiplexing layer for Swift client apps

    I wrote a Swift client library for doing exactly that, though I called the trick "multiplexing" for lack of a better term. The library uses callbacks, no promises/futures (yet). It also provides caching of the results with TTL. Shameless self-promo [1]

    [1] https://github.com/crontab/Multiplexer

  47. ideas4

    Discontinued An Additional 100 Ideas for Computing https://samsquire.github.io/ideas4/

  48. preemptible-thread

    Discontinued How to preempt threads in user space

  49. FusionCache

    FusionCache is an easy to use, fast and robust hybrid cache with advanced resiliency features.

    If you are in the .NET space I suggest you to take a look at FusionCache. It has cache stampede protection built in, plus some other nice features like a fail-safe mechanism and soft/hard timeouts https://github.com/jodydonetti/ZiggyCreatures.FusionCache

  50. sdsl-lite

    Succinct Data Structure Library 2.0

    Succinct Data Structures [0] [1]. It encompass many different underlying data structure types but the overarching idea is that you want small data size while still keeping "big O" run time.

    In other words, data structures that effectively reach a 'practical' entropy lower bound while still keeping asymptotic run time.

    [0] https://en.wikipedia.org/wiki/Succinct_data_structure

    [1] https://github.com/simongog/sdsl-lite

  51. BuRR

    Bumped Ribbon Retrieval and Approximate Membership Query (by lorenzhs)

    If you enjoyed XOR filters, you might also like ribbon filters, something that I had the pleasure of working on last year. They share the basic idea of using a system of linear equations, but instead of considering 3 random positions per key, the positions to probe are narrowly concentrated along a ribbon with a typical width of 64. This makes them far more cache-efficient to construct and query.

    By purposefully overloading the data structure by a few per cent and bumping those items that cannot be inserted as a result of this overloading to the next layer (making this a recursive data structure), we can achieve almost arbitrarily small space overheads: <1% is no problem for the fast configurations, and <0.1% overhead with around 50% extra runtime cost. This compares to around 10% for XOR filters and ≥ 44% for Bloom filters.

    In fact, I'm going to present them at a conference on Monday - the paper is already out: https://drops.dagstuhl.de/opus/volltexte/2022/16538/pdf/LIPI... and the implementation is at https://github.com/lorenzhs/BuRR/. I hope this isn't too much self-promotion for HN, but I'm super hyped about this :)

  52. proposal-function-memo

    A TC39 proposal for function memoization in the JavaScript language.

    https://github.com/tc39/proposal-function-memo

    This might be relevant to this exact pattern

      const getThing = (async function (arg1, arg2) {

  53. us

    An alternative interface to Sia

    It might be easier to think about it as a stack, rather than a tree. Each element of the stack represents a subtree -- a perfect binary tree. If you ever have two subtrees of height k, you merge them together into one subtree of height k+1. Your stack might already have another subtree of height k+1; if so, you repeat the process, until there's at most one subtree of each height.

    This process is isomorphic to binary addition. Worked example: let's start with a single leaf, i.e. a subtree of height 0. Then we "add" another leaf; since we now have a pair of two equally-sized leaves, we merge them into one subtree of height 1. Then we add a third leaf; now this one doesn't have a sibling to merge with, so we just keep it. Now our "stack" contains two subtrees: one of height 1, and one of height 0.

    Now the isomorphism: we start with the binary integer 1, i.e. a single bit at index 0. We add another 1 to it, and the 1s "merge" into a single 1 bit at index 1. Then we add another 1, resulting in two 1 bits at different indices: 11. If we add one more bit, we'll get 100; likewise, if we add another leaf to our BNT, we'll get a single subtree of height 2. Thus, the binary representation of the number of leaves "encodes" the structure of the BNT.

    This isomorphism allows you to do some neat tricks, like calculating the size of a Merkle proof in 3 asm instructions. There's some code here if that helps: https://github.com/lukechampine/us/blob/master/merkle/stack....

    You could also check out section 5.1 of the BLAKE3 paper: https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...

  54. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

    SaaSHub logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • Sourcetrail: Free and Open-Source Interactive Source Explorer

    1 project | news.ycombinator.com | 7 Aug 2024
  • How do you use a python model in a java code

    1 project | /r/java | 18 May 2023
  • How to quickly learn/understand the system architecture of any given application?

    1 project | /r/devops | 7 Mar 2023
  • Tools to understand a new code base

    1 project | /r/C_Programming | 24 Dec 2022
  • Best way to combine Python and Java?

    10 projects | /r/java | 29 Oct 2022

Did you know that C++ is
the 7th most popular programming language
based on number of references?