Show HN: B-field, a probabilistic key-value data structure (`rust-bfield`)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
www.influxdata.com
featured
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com
featured
  • rust-bfield

    B-field implementation in Rust

  • > So it’s for cases where you have any key but associated with one of only (preferably few) discrete values

    We use it for a case with ~million unique values, but it's certainly more space efficient for cases where you have tens, hundreds, or thousands of values. The "Space Requirements" section has a few examples: https://github.com/onecodex/rust-bfield?tab=readme-ov-file#s... (e.g., you can store a key-value pair with 32 distinct values in ~27 bits of space at a 0.1% false positive rate).

    > all the docs say “designed for in-memory lookups”

    We use mmap for persistence as our use case is largely a build-once, read many times one. As a practical matter, the data structure involves lots of random access, so is better suited to in-memory use from a speed POV.

    > fyi, you use temp::temp_file() but never actually use the result, instead using the hard-coded /tmp path

    Thank you, have opened an issue and we'll fix it!

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • BuRR

    Bumped Ribbon Retrieval and Approximate Membership Query (by lorenzhs)

  • Very interesting and I'll have to read more to understand how it fully works, but _initially_ the space requirements doesn't seem too impressive? Am I missing something here? Maybe the solution here is more flexible?

    One alternative approach for many of these problems is to start with a perfect minimal hash function which hashes your key into a unique number [0, N) and then have a packed array of size N where each element is of a fixed byte size. To look up the value you first use the hash function to get an index, and then you look up in the packed array. There's also no error rate here: This is exact.

    PTHash (https://arxiv.org/abs/2104.10402) needs roughly ~4 bits per element to create a perfect minimal hash function.

    > Store 1 billion web URLs and assign each of them one of a small number of categories values (n=8) in 2.22Gb (params include ν=8, κ=1, =0.1%; 19 bits per element)

    Assuming that "n=8" here means "8 bits" we need 1GB (8 bits * billion) to represent all of the values, and then 500 MB for the hash function (4 bits * billion).

    I also don't quite understand what "2.22Gb" here refers to. 19 bits * billion = 2.357 SI-giga bytes = 19 SI-giga bits = 2.212 gibi bytes.

    > Store 1 billion DNA or RNA k-mers ("ACGTA...") and associate them with any of the ~500k bacterial IDs current described by NCBI in 6.93Gb (ν=62, κ=4, =0.1%; 59 bits per element)

    "~500k bacterial ID" can be represented with 19 bits. 1 billion of these take ~2.3GB, and then we have the additional 500MB for the perfect hash function.

    Another data structure which is even more fine-tuned for this problem space is Bumped Ribbon Retrieval (https://arxiv.org/abs/2109.01892) where they have <1% overhead over just storing the plain bit values.

  • rust-lapper

    Rust implementation of a fast, easy, interval tree library nim-lapper

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • Show HN: Open-Source Distributed Data Framework for LLM Applications

    1 project | news.ycombinator.com | 23 May 2024
  • Show HN: A phone number to text with questions about current events

    2 projects | news.ycombinator.com | 10 May 2024
  • How I got my first Rust job by doing open-source

    3 projects | dev.to | 30 Apr 2024
  • RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation

    1 project | news.ycombinator.com | 30 Apr 2024
  • FastLLM by Qdrant – lightweight LLM tailored For RAG

    1 project | news.ycombinator.com | 1 Apr 2024