Word-Aligned Bloom Filters

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

InfluxDB - Power Real-Time Data Analytics at Scale
Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
www.influxdata.com
featured
SaaSHub - Software Alternatives and Reviews
SaaSHub helps you find the best software and product alternatives
www.saashub.com
featured
  • bitcoinbook

    Mastering Bitcoin 3rd Edition - Programming the Open Blockchain

  • They finally clicked for me after reading about how they're used by the bitcoin network: https://github.com/bitcoinbook/bitcoinbook/blob/develop/ch08...

  • Apache Hive

    Apache Hive

  • > whether this would really work out in most workloads

    > just because it keeps the cache-lines hotter and less likely to be evicted.

    Okay, so keeping cache for a bloom filter problem is real - but the real force evicting memory out of the cache line is the next row-group you read + all the other stuff you have to do when you implement this in a database product.

    So the two things I work with, Apache Hive and Apache Impala switched to a blocked bloom filter at different points in time.

    Hive BloomKFilter - https://github.com/apache/hive/blob/master/storage-api/src/j...

    Impala/Kudu one - https://github.com/apache/impala/blob/master/be/src/kudu/uti...

    The C++ one also has an AVX specialization, while the Java one relies on the JVM to do it (not always) - https://github.com/apache/impala/blob/master/be/src/kudu/uti...

    We ran a lot of trivial benchmarks and several benchmarks where the shuffle-join (not sort-merge, this is just a partitioned hash join) generates a bloom filter (a semijoin) before sending rows out and the 1-cache line version won out when the bloom filter went slightly over the 1 Million + 5% rate [1].

    The regular bloom filter went from (38ns -> 108ns for 1k -> 1m items), while the BloomK stuck at (27ns) despite making room for a million times more items in the bloom. The bloom-1 (which is the 64bit version) underperformed on accuracy (was ~2x faster at 16ns per op, but worse at filtering out items).

    [1] - https://github.com/prasanthj/bloomfilter/tree/master/benchma...

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • Apache Impala

    Apache Impala

  • > whether this would really work out in most workloads

    > just because it keeps the cache-lines hotter and less likely to be evicted.

    Okay, so keeping cache for a bloom filter problem is real - but the real force evicting memory out of the cache line is the next row-group you read + all the other stuff you have to do when you implement this in a database product.

    So the two things I work with, Apache Hive and Apache Impala switched to a blocked bloom filter at different points in time.

    Hive BloomKFilter - https://github.com/apache/hive/blob/master/storage-api/src/j...

    Impala/Kudu one - https://github.com/apache/impala/blob/master/be/src/kudu/uti...

    The C++ one also has an AVX specialization, while the Java one relies on the JVM to do it (not always) - https://github.com/apache/impala/blob/master/be/src/kudu/uti...

    We ran a lot of trivial benchmarks and several benchmarks where the shuffle-join (not sort-merge, this is just a partitioned hash join) generates a bloom filter (a semijoin) before sending rows out and the 1-cache line version won out when the bloom filter went slightly over the 1 Million + 5% rate [1].

    The regular bloom filter went from (38ns -> 108ns for 1k -> 1m items), while the BloomK stuck at (27ns) despite making room for a million times more items in the bloom. The bloom-1 (which is the 64bit version) underperformed on accuracy (was ~2x faster at 16ns per op, but worse at filtering out items).

    [1] - https://github.com/prasanthj/bloomfilter/tree/master/benchma...

  • bloomfilter

    BloomFilter implementation in Java that uses Murmur3 for fast hashing (by prasanthj)

  • > whether this would really work out in most workloads

    > just because it keeps the cache-lines hotter and less likely to be evicted.

    Okay, so keeping cache for a bloom filter problem is real - but the real force evicting memory out of the cache line is the next row-group you read + all the other stuff you have to do when you implement this in a database product.

    So the two things I work with, Apache Hive and Apache Impala switched to a blocked bloom filter at different points in time.

    Hive BloomKFilter - https://github.com/apache/hive/blob/master/storage-api/src/j...

    Impala/Kudu one - https://github.com/apache/impala/blob/master/be/src/kudu/uti...

    The C++ one also has an AVX specialization, while the Java one relies on the JVM to do it (not always) - https://github.com/apache/impala/blob/master/be/src/kudu/uti...

    We ran a lot of trivial benchmarks and several benchmarks where the shuffle-join (not sort-merge, this is just a partitioned hash join) generates a bloom filter (a semijoin) before sending rows out and the 1-cache line version won out when the bloom filter went slightly over the 1 Million + 5% rate [1].

    The regular bloom filter went from (38ns -> 108ns for 1k -> 1m items), while the BloomK stuck at (27ns) despite making room for a million times more items in the bloom. The bloom-1 (which is the 64bit version) underperformed on accuracy (was ~2x faster at 16ns per op, but worse at filtering out items).

    [1] - https://github.com/prasanthj/bloomfilter/tree/master/benchma...

  • simdjson

    Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks

  • Is this the project? https://github.com/simdjson/simdjson

    If so, Ive been following it for a couple years, but I put it out of my mind recently after moving to AMD. I could sware it was an intel only project, but a quick scan of the that git suggests I'm wrong. So either I'm totally missremembering, or AMD support was added later.

    Anyway, I cant wait to try that out again. I wonder why most projects don't just use this as their default json parser now?

  • SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

    SaaSHub logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts

  • The state of Java Object Serialization libraries in Q2 2023

    5 projects | /r/java | 7 Apr 2023
  • Why is json logging the “standard”?

    2 projects | /r/golang | 10 Mar 2023
  • Removies

    5 projects | dev.to | 28 Aug 2022
  • JOINs in SQL

    1 project | news.ycombinator.com | 5 May 2024
  • Unconventional Aggregation

    1 project | news.ycombinator.com | 29 Apr 2024