A Compressed Indexable Bitset

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • Folly

    An open-source C++ library developed and used at Facebook.

  • The EF core algorithm implemented in folly [3] may be a bit faster, and implementing partitioning on top of that is relatively easy.

    It would definitely compress much better than roaring bitmaps. In terms of performance, it depends on the access patterns. If very sparse (large jumps) PEF would likely be faster, if dense (visit a large fraction of the bitmap) it'd be slower.

    It is possible to squeeze a bit more compression out of PEF by introducing a chunk type for Elias-Fano of the chunk complement (for very dense chunks), but you lose the operation of skipping to a given position, which is however not needed in inverted indexes (you only need to skip past a given id, and that can be supported efficiently). That is not mentioned in the paper because at the time I thought the skip-to-position operation was a non-negotiable.

    [1] https://github.com/ot/ds2i/

    [2] https://github.com/pisa-engine/pisa

    [3] https://github.com/facebook/folly/blob/main/folly/experiment...

  • ds2i

    A library of inverted index data structures

  • The EF core algorithm implemented in folly [3] may be a bit faster, and implementing partitioning on top of that is relatively easy.

    It would definitely compress much better than roaring bitmaps. In terms of performance, it depends on the access patterns. If very sparse (large jumps) PEF would likely be faster, if dense (visit a large fraction of the bitmap) it'd be slower.

    It is possible to squeeze a bit more compression out of PEF by introducing a chunk type for Elias-Fano of the chunk complement (for very dense chunks), but you lose the operation of skipping to a given position, which is however not needed in inverted indexes (you only need to skip past a given id, and that can be supported efficiently). That is not mentioned in the paper because at the time I thought the skip-to-position operation was a non-negotiable.

    [1] https://github.com/ot/ds2i/

    [2] https://github.com/pisa-engine/pisa

    [3] https://github.com/facebook/folly/blob/main/folly/experiment...

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • pisa

    PISA: Performant Indexes and Search for Academia

  • The EF core algorithm implemented in folly [3] may be a bit faster, and implementing partitioning on top of that is relatively easy.

    It would definitely compress much better than roaring bitmaps. In terms of performance, it depends on the access patterns. If very sparse (large jumps) PEF would likely be faster, if dense (visit a large fraction of the bitmap) it'd be slower.

    It is possible to squeeze a bit more compression out of PEF by introducing a chunk type for Elias-Fano of the chunk complement (for very dense chunks), but you lose the operation of skipping to a given position, which is however not needed in inverted indexes (you only need to skip past a given id, and that can be supported efficiently). That is not mentioned in the paper because at the time I thought the skip-to-position operation was a non-negotiable.

    [1] https://github.com/ot/ds2i/

    [2] https://github.com/pisa-engine/pisa

    [3] https://github.com/facebook/folly/blob/main/folly/experiment...

  • efg

    GPU based Compressed Graph Traversal

  • Btw, core EF is quite efficient on the decoding side even on GPUs. I wanted to do PEF, but that seemed a bit more involved and didn't have the time to do it. Here's a GPU implementation for graph problems if anyone is interested: https://github.com/pgera/efg

  • tantivy

    Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

  • The roaring bitmap variant is used only for the optional index (1 docid => 0 or 1 value) in the columnar storage (DocValues), not for the inverted index. Since this is used for aggregation, some queries may be a full scan.

    The inverted index in tantivy uses bitpacked values of 128 elements with a skip index on top.

    > I didn't follow the rest of your comment, select is what EF is good at, every other data structure needs a lot more scanning once you land on the right chunk. With BMI2 you can also use the PDEP instruction to accelerate the final select on a 64-bit block

    The select for the sparse codec is a [simple array index access](https://github.com/quickwit-oss/tantivy/blob/main/columnar/s...), that is hard to beat. Compression is not good near the 5k threshold though.

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts