PGM-index
bolt
Our great sponsors
PGM-index | bolt | |
---|---|---|
6 | 6 | |
751 | 2,463 | |
- | - | |
2.8 | 0.0 | |
5 days ago | over 1 year ago | |
C++ | C++ | |
Apache License 2.0 | Mozilla Public License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
PGM-index
-
Self-indexing RDBMS? Could AI help?
PGM Index
- Piecewise Geometric Model Index
-
Manticore Search 5
Manticore Columnar Library uses Piecewise Geometric Model index, which exploits a learned mapping between the indexed keys and their location in memory. The succinctness of this mapping, coupled with a peculiar recursive construction algorithm, makes the PGM-index a data structure that dominates traditional indexes by orders of magnitude in space while still offering the best query and update time performance.
-
PGM Indexes: Learned indexes that match B-tree performance with 83x less space
Yep, I'm working on a multidimensional version that I hope to upload to the main repo (https://github.com/gvinciguerra/PGM-index) in a few weeks.
bolt
-
Show HN: Want something better than k-means? Try BanditPAM
> frown on that sort of dataset
That example was definitely contrived and designed to strongly illustrate the point. I'll counter slightly that non-peaky topologies aren't uncommon, but they're unlikely to look anything that would push KMedoids to a pathological state rather than just a slightly worse state ("worse" assuming that KMeans is the right choice for a given problem).
> worth pointing out .. data reference
Totally agreed. I hope my answer didn't come across as too negative. It's good work, and everyone else was talking about the positives, so I just didn't want to waste too much time echoing again that while getting the other points across.
> bolt reference
https://github.com/dblalock/bolt
They say as much in their paper, but they aren't the first vector quantization library by any stretch. Their contributions are, roughly:
1. If you're careful selecting the right binning strategy then you can cancel out a meaningful amount of discretization error.
2. If you do that, you can afford to choose parameters that fit everything nicely into AVX2 machine words, turning 100s of branching instructions into 1-4 instructions.
3. Doing some real-world tests to show that (1-2) matter.
Last I checked their code wasn't very effective for the places I wanted to apply it, but the paper is pretty solid. I'd replace it with a faster KMeans approximation less likely to crash on big data (maybe even initializing with KMedoids :) ), and if the thing you're quantizing is trainable with some sort of gradient update step then you should do a few optimization passes in the discretized form as well.
- Bolt: Faster matrix and vector operations that run on compressed data
- 10x faster matrix and vector operations
-
[R] Multiplying Matrices Without Multiplying
Code: https://github.com/dblalock/bolt
What are some alternatives?
ALEX - A library for building an in-memory, Adaptive Learned indEX
composer - Supercharge Your Model Training
manticoresearch - Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
halutmatmul - Hashed Lookup Table based Matrix Multiplication (halutmatmul) - Stella Nera accelerator
robin-map - C++ implementation of a fast hash map and hash set using robin hood hashing
draco - Draco is a library for compressing and decompressing 3D geometric meshes and point clouds. It is intended to improve the storage and transmission of 3D graphics.
sdsl-lite - Succinct Data Structure Library 3.0
LightGBM - A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
SOSD - A Benchmark for Learned Indexes
heavydb - HeavyDB (formerly OmniSciDB)
RadixSpline - A Single-Pass Learned Index
Snappy - A fast compressor/decompressor