PGM-index VS robin-map

Compare PGM-index vs robin-map and see what their differences are.

PGM-index

🏅 State-of-the-art learned data structure that enables fast lookup, predecessor, range searches and updates in arrays of billions of items using orders of magnitude less space than traditional indexes (by gvinciguerra)
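A minimal usage sketch, adapted from the example in the PGM-index README; the random data set and the epsilon value of 128 are just for illustration (epsilon is the template parameter that trades index size for query time):

    #include <algorithm>
    #include <cstdlib>
    #include <iostream>
    #include <vector>
    #include "pgm/pgm_index.hpp"

    int main() {
        // The PGM-index is built over sorted data.
        std::vector<int> data(1000000);
        std::generate(data.begin(), data.end(), std::rand);
        data.push_back(42);
        std::sort(data.begin(), data.end());

        // Build the index; epsilon bounds the width of the range the index
        // returns for a query, trading space for query time.
        constexpr int epsilon = 128;
        pgm::PGMIndex<int, epsilon> index(data);

        // Query: the index predicts a position with [lo, hi) bounds, and a
        // final binary search inside those bounds locates the key.
        int q = 42;
        auto range = index.search(q);
        auto it = std::lower_bound(data.begin() + range.lo, data.begin() + range.hi, q);
        std::cout << (it != data.end() && *it == q ? "found" : "not found") << '\n';
    }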

robin-map

C++ implementation of a fast hash map and hash set using robin hood hashing (by Tessil)
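A minimal usage sketch of tsl::robin_map; the interface stays close to std::unordered_map (one notable difference: modifying a mapped value through an iterator goes through it.value() rather than it->second):

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <tsl/robin_map.h>

    int main() {
        tsl::robin_map<std::string, std::int64_t> map;

        // Insertion and lookup mirror std::unordered_map.
        map["key1"] = 1;
        map.insert({"key2", 2});

        if (auto it = map.find("key1"); it != map.end()) {
            std::cout << it->first << " -> " << it->second << '\n';
        }

        // Robin hood hashing tolerates higher load factors than plain linear
        // probing; the maximum load factor is tunable via max_load_factor().
        map.max_load_factor(0.9f);

        for (const auto& kv : map) {
            std::cout << kv.first << " = " << kv.second << '\n';
        }
    }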
                 PGM-index           robin-map
Mentions         6                   10
Stars            751                 1,165
Growth           -                   -
Activity         2.8                 5.4
Latest commit    1 day ago           6 days ago
Language         C++                 C++
License          Apache License 2.0  MIT License
The number of mentions indicates the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

PGM-index

Posts with mentions or reviews of PGM-index. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-04-26.

robin-map

Posts with mentions or reviews of robin-map. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-11-10.
  • Factor is faster than Zig
    11 projects | news.ycombinator.com | 10 Nov 2023
    “In my example the table stores the hash codes themselves instead of the keys (because the hash function is invertible).”

    Oh, I see, right. If determining the home bucket is trivial, then the back-shifting method is great. The issue is just that it’s not as much of a general-purpose solution as it may initially seem. (A minimal sketch of Robin Hood insertion and backward-shift deletion appears after this list of posts.)

    “With a different algorithm (Robin Hood or bidirectional linear probing), the load factor can be kept well over 90% with good performance, as the benchmarks in the same repo demonstrate.”

    I’ve seen the 90% claim made several times in the literature on Robin Hood hash tables. In my experience, the claim is a bit exaggerated, although I suppose it depends on what our idea of “good performance” is. See these benchmarks, which again go up to a maximum load factor of 0.95 (although Boost and Absl forcibly grow/rehash at 0.85-0.9):

    https://strong-starlight-4ea0ed.netlify.app/

    Tsl, Martinus, and CC are all Robin Hood tables (https://github.com/Tessil/robin-map, https://github.com/martinus/robin-hood-hashing, and https://github.com/JacksonAllan/CC, respectively). Absl and Boost are the well-known SIMD-based hash tables. Khash (https://github.com/attractivechaos/klib/blob/master/khash.h) is, I think, an ordinary open-addressing table using quadratic probing. Fastmap is a new, yet-to-be-published design that is fundamentally similar to bytell (https://www.youtube.com/watch?v=M2fKMP47slQ) but also incorporates some aspects of the aforementioned SIMD maps (it caches a 4-bit fragment of the hash code to avoid most key comparisons).

    As you can see, all the Robin Hood maps spike upwards dramatically as the load factor gets high, becoming as much as 5-6 times slower at 0.95 vs 0.5 in one of the benchmarks (uint64_t key, 256-bit struct value: Total time to erase 1000 existing elements with N elements in map). Only the SIMD maps (with Boost being the better performer) and Fastmap appear mostly immune to load factor in all benchmarks, although the SIMD maps do - I believe - use tombstones for deletion.

    I’ve only read briefly about bi-directional linear probing – never experimented with it.

  • If this isn't the perfect data structure, why?
    3 projects | /r/C_Programming | 22 Oct 2023
    From your other comments, it seems like your knowledge of hash tables might be limited to closed-addressing/separate-chaining hash tables. The current frontrunners in high-performance, memory-efficient hash table design all use some form of open addressing, largely to avoid pointer chasing and limit cache misses. In this regard, you want to check out SSE-powered hash tables (such as Abseil, Boost, and Folly/F14), Robin Hood hash tables (such as Martinus and Tessil), or Skarupke (I've recently had a lot of success with a similar design that I will publish here soon and that is destined to replace my own Robin Hood hash tables). Also check out existing research/benchmarks here and here. But be a little bit wary of any benchmarks you look at or perform, because there are a lot of factors that influence the results (e.g. benchmarking hash tables at a maximum load factor of 0.5 will produce wildly different results compared to benchmarking them at a load factor of 0.95, just as benchmarking them with integer key-value pairs will produce different results compared to benchmarking them with 256-byte key-value pairs). And you need to familiarize yourself with open addressing and different probing strategies (e.g. linear, quadratic) first.
  • Convenient Containers v1.0.3: Better compile speed, faster maps and sets
    4 projects | /r/C_Programming | 3 May 2023
    The main advantage of the latest version is that it reduces build time by about 53% (GCC 12.1), based on the comprehensive test suite found in unit_tests.c. This improvement is significant because compile time was previously a drawback of this library, with maps and sets—in particular—compiling slower than their C++ template-based counterparts. I achieved it by refactoring the library to do less work inside API macros and, in particular, use fewer _Generic statements, which seem to be a compile-speed bottleneck. A nice side effect of the refactor is that the library can now more easily be extended with the planned dynamic strings and ordered maps and sets. The other major improvement concerns the performance of maps and sets. Here are some interactive benchmarks[1] comparing CC’s maps to two popular implementations of Robin Hood hash maps in C++ (as well as std::unordered_map as a baseline). They show that CC maps perform roughly on par with those implementations.
  • Fibonacci Hashing: An Optimization the World Forgot (Better Than Integer Modulo)
    1 project | news.ycombinator.com | 30 Apr 2023
    Skimming the code, it seems to use a bitwise AND with a mask if the table size is a power of two, which has the same caveats as described for Dinkumware (it's fast, but if your hash is poor it can have awful results), and modulo otherwise, which has all the issues the article describes. This is encapsulated in a single file [1], so it looks like it'd be easy to improve. (A small sketch of the Fibonacci hashing reduction the article advocates appears after this list of posts.)

    [1] https://github.com/Tessil/robin-map/blob/master/include/tsl/...

  • Inside boost::unordered_flat_map
    11 projects | /r/cpp | 18 Nov 2022
  • boost::unordered map is a new king of data structures
    10 projects | /r/cpp | 30 Jun 2022
    Unordered hash map shootout
    CMAP = https://github.com/tylov/STC
    KMAP = https://github.com/attractivechaos/klib
    PMAP = https://github.com/greg7mdp/parallel-hashmap
    FMAP = https://github.com/skarupke/flat_hash_map
    RMAP = https://github.com/martinus/robin-hood-hashing
    HMAP = https://github.com/Tessil/hopscotch-map
    TMAP = https://github.com/Tessil/robin-map
    UMAP = std::unordered_map
    Usage: shootout [n-million=40 key-bits=25]
    Random keys are in range [0, 2^25). Seed = 1656617916:

    T1: Insert/update random keys:
    KMAP: time: 1.949, size: 15064129, buckets: 33554432, sum: 165525449561381
    CMAP: time: 1.649, size: 15064129, buckets: 22145833, sum: 165525449561381
    PMAP: time: 2.434, size: 15064129, buckets: 33554431, sum: 165525449561381
    FMAP: time: 2.112, size: 15064129, buckets: 33554432, sum: 165525449561381
    RMAP: time: 1.708, size: 15064129, buckets: 33554431, sum: 165525449561381
    HMAP: time: 2.054, size: 15064129, buckets: 33554432, sum: 165525449561381
    TMAP: time: 1.645, size: 15064129, buckets: 33554432, sum: 165525449561381
    UMAP: time: 6.313, size: 15064129, buckets: 31160981, sum: 165525449561381

    T2: Insert sequential keys, then remove them in same order:
    KMAP: time: 1.173, size: 0, buckets: 33554432, erased 20000000
    CMAP: time: 1.651, size: 0, buckets: 33218751, erased 20000000
    PMAP: time: 3.840, size: 0, buckets: 33554431, erased 20000000
    FMAP: time: 1.722, size: 0, buckets: 33554432, erased 20000000
    RMAP: time: 2.359, size: 0, buckets: 33554431, erased 20000000
    HMAP: time: 0.849, size: 0, buckets: 33554432, erased 20000000
    TMAP: time: 0.660, size: 0, buckets: 33554432, erased 20000000
    UMAP: time: 2.138, size: 0, buckets: 31160981, erased 20000000

    T3: Remove random keys:
    KMAP: time: 1.973, size: 0, buckets: 33554432, erased 23367671
    CMAP: time: 2.020, size: 0, buckets: 33218751, erased 23367671
    PMAP: time: 2.940, size: 0, buckets: 33554431, erased 23367671
    FMAP: time: 1.147, size: 0, buckets: 33554432, erased 23367671
    RMAP: time: 1.941, size: 0, buckets: 33554431, erased 23367671
    HMAP: time: 1.135, size: 0, buckets: 33554432, erased 23367671
    TMAP: time: 1.064, size: 0, buckets: 33554432, erased 23367671
    UMAP: time: 5.632, size: 0, buckets: 31160981, erased 23367671

    T4: Iterate random keys:
    KMAP: time: 0.748, size: 23367671, buckets: 33554432, repeats: 8, sum: 4465059465719680
    CMAP: time: 0.627, size: 23367671, buckets: 33218751, repeats: 8, sum: 4465059465719680
    PMAP: time: 0.680, size: 23367671, buckets: 33554431, repeats: 8, sum: 4465059465719680
    FMAP: time: 0.735, size: 23367671, buckets: 33554432, repeats: 8, sum: 4465059465719680
    RMAP: time: 0.464, size: 23367671, buckets: 33554431, repeats: 8, sum: 4465059465719680
    HMAP: time: 0.719, size: 23367671, buckets: 33554432, repeats: 8, sum: 4465059465719680
    TMAP: time: 0.662, size: 23367671, buckets: 33554432, repeats: 8, sum: 4465059465719680
    UMAP: time: 6.168, size: 23367671, buckets: 31160981, repeats: 8, sum: 4465059465719680

    T5: Lookup random keys:
    KMAP: time: 0.943, size: 23367671, buckets: 33554432, lookups: 34235332, found: 29040438
    CMAP: time: 0.863, size: 23367671, buckets: 33218751, lookups: 34235332, found: 29040438
    PMAP: time: 1.635, size: 23367671, buckets: 33554431, lookups: 34235332, found: 29040438
    FMAP: time: 0.969, size: 23367671, buckets: 33554432, lookups: 34235332, found: 29040438
    RMAP: time: 1.705, size: 23367671, buckets: 33554431, lookups: 34235332, found: 29040438
    HMAP: time: 0.712, size: 23367671, buckets: 33554432, lookups: 34235332, found: 29040438
    TMAP: time: 0.584, size: 23367671, buckets: 33554432, lookups: 34235332, found: 29040438
    UMAP: time: 1.974, size: 23367671, buckets: 31160981, lookups: 34235332, found: 29040438
  • Unordered_map alternatives
    2 projects | /r/cpp_questions | 24 Feb 2022
    Good general-purpose candidates to consider:
    - Abseil flat_hash_map
    - Tessil's robin-map
  • A brief and incomplete guide for selecting the appropriate container from inside/outside the C++ standard library, based on performance characteristics, functionality and benchmark results
    1 project | /r/cpp | 11 Dec 2021
    a = yes, b = no
    0. Is all you're doing just inserting to the back of the container and iterating?
    0a. Do you know the largest possible maximum capacity you will ever have for this container, and is the lowest possible maximum capacity not too far away from that?
    0aa. Use an array.
    0ab. Use a vector.
    0b. Can you change your data layout or your processing strategy so that back insertion and iterating would be all you're doing?
    0ba. Goto 0a.
    0bb. Goto 1.
    1. Is the use of the container stack-like, queue-like or ring-like?
    1a. If stack-like, use plf::stack; if queue-like, use plf::queue (both are faster than the std:: equivalent adaptors, have stable pointers to elements and are configurable in terms of memory block sizes). If ring-like, use "ring_span lite".
    1b. If not, goto 2.
    2. Does each element need to be accessible via an identifier, i.e. a key? I.e., is the data associative?
    2a. If so, is the number of elements small and the type sizeof not large?
    2aa. If so, is the value of an element also the key?
    2aaa. If so, just make an array or vector of elements, and sequentially scan to look up elements. Benchmark vs absl:: sets below.
    2aab. If not, make a vector of key/element structs, and do sequential scans of the vector to find the element based on the key. Benchmark vs absl:: maps below.
    2ab. If not, do the elements need to have an order?
    2aba. If so, is the value of the element also the key?
    2abaa. If so, can multiple keys have the same value?
    2abaaa. If so, use absl::btree_multiset.
    2abaab. If not, use absl::btree_set.
    2abab. If not, can multiple keys have the same value?
    2ababa. If so, use absl::btree_multimap.
    2ababb. If not, use absl::btree_map.
    2abb. If no order needed, is the value of the element also the key?
    2abba. If so, can multiple keys have the same value?
    2abbaa. If so, use std::unordered_multiset or absl::btree_multiset.
    2abbab. If not, is pointer stability to elements necessary?
    2abbaba. If so, use absl::node_hash_set.
    2abbabb. If not, use absl::flat_hash_set.
    2abbb. If not, can multiple keys have the same value?
    2abbba. If so, use std::unordered_multimap or absl::btree_multimap.
    2abbbb. If not, is on-the-fly insertion and erasure common in your use case, as opposed to mostly lookups?
    2abbbba. If so, use robin-map.
    2abbbbb. If not, is pointer stability to elements necessary?
    2abbbbba. If so, use absl::flat_hash_map > . Use absl::node_hash_map if pointer stability to keys is also necessary.
    2abbbbbb. If not, use absl::flat_hash_map.
    2b. If not, goto 3.
    Note: if iteration over the associative container is frequent rather than rare, try the std:: equivalents to the absl:: containers or tsl::sparse_map. Also take a look at this page of benchmark conclusions for more definitive comparisons across more use-cases and C++ hash map implementations.
    3. Are stable pointers/iterators/references to elements which remain valid after non-back insertion/erasure required, and/or is there a need to sort non-movable/copyable elements?
    3a. If so, is the order of elements important and/or is there a need to sort non-movable/copyable elements?
    3aa. If so, will this container often be accessed and modified by multiple threads simultaneously?
    3aaa. If so, use forward_list (for its lowered side-effects when erasing and inserting).
    3aab. If not, do you require range-based splicing between two or more containers (as opposed to splicing of entire containers)?
    3aaba. If so, use std::list.
    3aabb. If not, use plf::list.
    3ab. If not, use hive.
    3b. If not, goto 4.
    4. Is the order of elements important?
    4a. If so, are you almost entirely inserting/erasing to/from the back of the container?
    4aa. If so, use vector, with reserve() if the maximum capacity is known in advance (or array).
    4ab. If not, are you mostly inserting/erasing to/from the front of the container?
    4aba. If so, use deque.
    4abb. If not, is insertion/erasure to/from the middle of the container frequent when compared to iteration or back erasure/insertion?
    4abba. If so, is it mostly erasures rather than insertions, and can the processing of multiple erasures be delayed until a later point in processing, e.g. the end of a frame in a video game?
    4abbaa. If so, try the vector erase_if pairing approach listed at the bottom of this guide, and benchmark against plf::list to see which one performs best. Use deque with the erase_if pairing if the number of elements is very large.
    4abbab. If not, goto 3aa.
    4abbb. If not, are elements large or is there a very large number of elements?
    4abbba. If so, benchmark vector against plf::list, or if there is a very large number of elements benchmark deque against plf::list.
    4abbbb. If not, do you often need to insert/erase to/from the front of the container?
    4abbbba. If so, use deque.
    4abbbbb. If not, use vector, or array if the number of elements is known in advance.
    4b. If not, goto 5.
    5. Is non-back erasure frequent compared to iteration?
    5a. If so, is the non-back erasure always at the front of the container?
    5aa. If always at the front, use deque.
    5ab. If not, is the type large, non-trivially copyable/movable or non-copyable/movable?
    5aba. If so, use hive.
    5abb. If not, is the number of elements very large?
    5abba. If so, use a deque with a swap-and-pop approach (to save memory vs vector - assumes standard deque implementation of fixed block sizes) - swap the element you wish to erase with the back element, and then pop_back() to erase. Benchmark vs hive.
    5abbb. If not, use a vector with a swap-and-pop approach and benchmark vs hive.
    5b. If not, goto 6.
    6. Can non-back erasures be delayed until a later point in processing, e.g. the end of a video game frame?
    6a. If so, is the type large or is the number of elements large?
    6aa. If so, use hive.
    6ab. If not, is consistent latency more important than lower average latency?
    6aba. If so, use hive.
    6abb. If not, try the erase_if pairing approach listed below with vector, or with deque if the number of elements is large. Benchmark this approach against hive to see which performs best.
    6b. If not, use hive.
    Vector erase_if pairing approach: Try pairing the type with a boolean, in a vector, then marking this boolean for erasure during processing, and then use erase_if with the boolean to remove multiple elements at once at the designated later point in processing. Alternatively, if there is a condition in the element itself which identifies it as needing to be erased, try using this directly with erase_if and skip the boolean pairing. If you know the total number of elements in advance, use array instead of vector, or reserve() with vector.
    (A minimal sketch of this erase_if pairing approach appears after this list of posts.)
  • When using data structures such as trees and linked lists is it common to build your own or use a library?
    1 project | /r/Cplusplus | 11 Sep 2021
    Need a faster hashmap? Use robin-map. Need memory efficiency? There are a few good sparse maps out there. Need a btree or efficient red-black tree? Google has a good btree implementation. The STL will probably implement map and set using an RB-tree.
  • hash_map
    1 project | /r/Cplusplus | 22 Jun 2021
    See robin-map for a Robin-Hood hashmap implementation.
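To make the back-shifting discussion in the “Factor is faster than Zig” post above concrete, here is a deliberately small sketch of a Robin Hood table. It is not code from robin-map; it is a hypothetical fixed-capacity set of uint64_t keys that only illustrates the two ideas the post touches on: Robin Hood insertion (a resident closer to its home bucket yields its slot to a probing element that is further from home) and tombstone-free backward-shift deletion.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <utility>
    #include <vector>

    // Toy fixed-capacity Robin Hood hash set over uint64_t keys, linear probing.
    // Illustrative only: real implementations such as tsl::robin_map also handle
    // growth/rehashing, store values, and keep richer per-slot metadata.
    // Assumes the table is never filled completely.
    class ToyRobinHoodSet {
        struct Slot {
            std::uint64_t key = 0;
            int dist = -1;  // probe distance from the home bucket; -1 means empty
        };
        std::vector<Slot> slots_;

        std::size_t home(std::uint64_t key) const {
            return std::hash<std::uint64_t>{}(key) % slots_.size();
        }

    public:
        explicit ToyRobinHoodSet(std::size_t capacity) : slots_(capacity) {}

        void insert(std::uint64_t key) {
            std::size_t i = home(key);
            int dist = 0;
            while (true) {
                if (slots_[i].dist < 0) {            // empty slot: claim it
                    slots_[i] = Slot{key, dist};
                    return;
                }
                if (slots_[i].key == key) return;    // already present
                if (slots_[i].dist < dist) {         // Robin Hood: the resident is
                    std::swap(key, slots_[i].key);   // "richer", so it yields its slot
                    std::swap(dist, slots_[i].dist); // and keeps probing in our place
                }
                i = (i + 1) % slots_.size();
                ++dist;
            }
        }

        bool contains(std::uint64_t key) const {
            std::size_t i = home(key);
            while (slots_[i].dist >= 0) {            // probe runs end at an empty slot
                if (slots_[i].key == key) return true;
                i = (i + 1) % slots_.size();
            }
            return false;
        }

        // Backward-shift deletion: rather than leaving a tombstone, pull each
        // following entry one slot back until an empty slot or an entry already
        // in its home bucket (dist == 0) is reached. Probe chains stay
        // contiguous, so lookups never have to skip over dead slots.
        void erase(std::uint64_t key) {
            std::size_t i = home(key);
            while (slots_[i].dist >= 0 && slots_[i].key != key)
                i = (i + 1) % slots_.size();
            if (slots_[i].dist < 0) return;          // key not present

            std::size_t next = (i + 1) % slots_.size();
            while (slots_[next].dist > 0) {
                slots_[i] = slots_[next];
                --slots_[i].dist;                    // now one step closer to home
                i = next;
                next = (next + 1) % slots_.size();
            }
            slots_[i] = Slot{};                      // the final hole becomes empty
        }
    };

Because Robin Hood insertion keeps each probe run ordered by home bucket, the backward shift can safely stop at the first empty slot or at the first entry that is already in its home bucket; with plain (non-Robin-Hood) linear probing the stopping rule has to compare home positions more carefully.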
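The Fibonacci hashing article discussed in the post above replaces the final hash-to-bucket reduction (modulo, or a bitwise AND with a mask) with a multiplicative step. A minimal sketch of that reduction follows; it is illustrative only and not taken from robin-map:

    #include <cstddef>
    #include <cstdint>

    // Map a 64-bit hash to one of 2^bucket_bits buckets (bucket_bits in [1, 63]).
    // The constant is 2^64 divided by the golden ratio; the multiplication mixes
    // all input bits into the high bits, so even a weak hash (e.g. the identity
    // hash many libraries use for integers) spreads across buckets.
    inline std::size_t fibonacci_bucket(std::uint64_t hash, unsigned bucket_bits) {
        constexpr std::uint64_t golden = 11400714819323198485ull;  // 0x9E3779B97F4A7C15
        return static_cast<std::size_t>((hash * golden) >> (64 - bucket_bits));
    }

    // Conventional alternatives the article compares it against:
    //   hash % bucket_count        (modulo: slower, still weak against patterned keys)
    //   hash & (bucket_count - 1)  (mask: fast, but exposes poor low bits directly)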
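The “vector erase_if pairing” approach described at the end of the container-selection guide above fits in a few lines. The Entity type and its dead flag here are hypothetical, purely for illustration; std::erase_if for std::vector requires C++20 (earlier standards can use the classic erase(remove_if(...)) idiom):

    #include <cstdint>
    #include <vector>

    // Hypothetical element type: pair the data with a boolean erasure flag.
    struct Entity {
        std::uint64_t id = 0;
        bool dead = false;  // marked during processing, erased later in one pass
    };

    // At the designated later point (e.g. the end of a video-game frame),
    // remove every flagged element at once instead of erasing one by one.
    void end_of_frame_cleanup(std::vector<Entity>& entities) {
        std::erase_if(entities, [](const Entity& e) { return e.dead; });
    }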

What are some alternatives?

When comparing PGM-index and robin-map you can also consider the following projects:

ALEX - A library for building an in-memory, Adaptive Learned indEX

robin-hood-hashing - Fast & memory efficient hashtable based on robin hood hashing for C++11/14/17/20

manticoresearch - Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon

Hopscotch map - C++ implementation of a fast hash map and hash set using hopscotch hashing

sdsl-lite - Succinct Data Structure Library 3.0

ordered-map - C++ hash map and hash set which preserve the order of insertion

SOSD - A Benchmark for Learned Indexes

unordered_dense - A fast & densely stored hashmap and hashset based on robin-hood backward shift deletion

RadixSpline - A Single-Pass Learned Index

flat_hash_map - A very fast hashtable

bolt - 10x faster matrix and vector operations

emhash - Fast and memory efficient c++ flat hash map/set