Word-Aligned Bloom Filters

InfluxDB - Power Real-Time Data Analytics at Scale

Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

www.influxdata.com

featured

SaaSHub - Software Alternatives and Reviews

SaaSHub helps you find the best software and product alternatives

www.saashub.com

featured

bitcoinbook

347 22,608 9.9 HTML

Mastering Bitcoin 3rd Edition - Programming the Open Blockchain

They finally clicked for me after reading about how they're used by the bitcoin network: https://github.com/bitcoinbook/bitcoinbook/blob/develop/ch08...

Apache Hive

14 5,335 9.6 Java

Apache Hive

> whether this would really work out in most workloads
> just because it keeps the cache-lines hotter and less likely to be evicted.
Okay, so keeping cache for a bloom filter problem is real - but the real force evicting memory out of the cache line is the next row-group you read + all the other stuff you have to do when you implement this in a database product.
So the two things I work with, Apache Hive and Apache Impala switched to a blocked bloom filter at different points in time.
Hive BloomKFilter - https://github.com/apache/hive/blob/master/storage-api/src/j...
Impala/Kudu one - https://github.com/apache/impala/blob/master/be/src/kudu/uti...
The C++ one also has an AVX specialization, while the Java one relies on the JVM to do it (not always) - https://github.com/apache/impala/blob/master/be/src/kudu/uti...
We ran a lot of trivial benchmarks and several benchmarks where the shuffle-join (not sort-merge, this is just a partitioned hash join) generates a bloom filter (a semijoin) before sending rows out and the 1-cache line version won out when the bloom filter went slightly over the 1 Million + 5% rate [1].
The regular bloom filter went from (38ns -> 108ns for 1k -> 1m items), while the BloomK stuck at (27ns) despite making room for a million times more items in the bloom. The bloom-1 (which is the 64bit version) underperformed on accuracy (was ~2x faster at 16ns per op, but worse at filtering out items).
[1] - https://github.com/prasanthj/bloomfilter/tree/master/benchma...

InfluxDB

www.influxdata.com featured

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
Apache Impala

1 1,083 9.7 C++

Apache Impala

> whether this would really work out in most workloads
> just because it keeps the cache-lines hotter and less likely to be evicted.
Okay, so keeping cache for a bloom filter problem is real - but the real force evicting memory out of the cache line is the next row-group you read + all the other stuff you have to do when you implement this in a database product.
So the two things I work with, Apache Hive and Apache Impala switched to a blocked bloom filter at different points in time.
Hive BloomKFilter - https://github.com/apache/hive/blob/master/storage-api/src/j...
Impala/Kudu one - https://github.com/apache/impala/blob/master/be/src/kudu/uti...
The C++ one also has an AVX specialization, while the Java one relies on the JVM to do it (not always) - https://github.com/apache/impala/blob/master/be/src/kudu/uti...
We ran a lot of trivial benchmarks and several benchmarks where the shuffle-join (not sort-merge, this is just a partitioned hash join) generates a bloom filter (a semijoin) before sending rows out and the 1-cache line version won out when the bloom filter went slightly over the 1 Million + 5% rate [1].
The regular bloom filter went from (38ns -> 108ns for 1k -> 1m items), while the BloomK stuck at (27ns) despite making room for a million times more items in the bloom. The bloom-1 (which is the 64bit version) underperformed on accuracy (was ~2x faster at 16ns per op, but worse at filtering out items).
[1] - https://github.com/prasanthj/bloomfilter/tree/master/benchma...

bloomfilter

1 11 0.0 Java

BloomFilter implementation in Java that uses Murmur3 for fast hashing (by prasanthj)

> whether this would really work out in most workloads
> just because it keeps the cache-lines hotter and less likely to be evicted.
Okay, so keeping cache for a bloom filter problem is real - but the real force evicting memory out of the cache line is the next row-group you read + all the other stuff you have to do when you implement this in a database product.
So the two things I work with, Apache Hive and Apache Impala switched to a blocked bloom filter at different points in time.
Hive BloomKFilter - https://github.com/apache/hive/blob/master/storage-api/src/j...
Impala/Kudu one - https://github.com/apache/impala/blob/master/be/src/kudu/uti...
The C++ one also has an AVX specialization, while the Java one relies on the JVM to do it (not always) - https://github.com/apache/impala/blob/master/be/src/kudu/uti...
We ran a lot of trivial benchmarks and several benchmarks where the shuffle-join (not sort-merge, this is just a partitioned hash join) generates a bloom filter (a semijoin) before sending rows out and the 1-cache line version won out when the bloom filter went slightly over the 1 Million + 5% rate [1].
The regular bloom filter went from (38ns -> 108ns for 1k -> 1m items), while the BloomK stuck at (27ns) despite making room for a million times more items in the bloom. The bloom-1 (which is the 64bit version) underperformed on accuracy (was ~2x faster at 16ns per op, but worse at filtering out items).
[1] - https://github.com/prasanthj/bloomfilter/tree/master/benchma...

simdjson

65 18,409 9.2 C++

Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks

Is this the project? https://github.com/simdjson/simdjson
If so, Ive been following it for a couple years, but I put it out of my mind recently after moving to AMD. I could sware it was an intel only project, but a quick scan of the that git suggests I'm wrong. So either I'm totally missremembering, or AMD support was added later.
Anyway, I cant wait to try that out again. I wonder why most projects don't just use this as their default json parser now?

SaaSHub

www.saashub.com featured

SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

The state of Java Object Serialization libraries in Q2 2023

5 projects | /r/java | 7 Apr 2023
Why is json logging the “standard”?

2 projects | /r/golang | 10 Mar 2023
Removies

5 projects | dev.to | 28 Aug 2022
JOINs in SQL

1 project | news.ycombinator.com | 5 May 2024
Unconventional Aggregation

1 project | news.ycombinator.com | 29 Apr 2024

Word-Aligned Bloom Filters

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Database JSON impala Java json-parser
Post date: 3 Oct 2021

bitcoinbook

Apache Hive

InfluxDB

Apache Impala

bloomfilter

simdjson

SaaSHub

Related posts

The state of Java Object Serialization libraries in Q2 2023

Why is json logging the “standard”?

Removies

JOINs in SQL

Unconventional Aggregation

Word-Aligned Bloom Filters

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com Database JSON impala Java json-parser Post date: 3 Oct 2021

bitcoinbook

Apache Hive

InfluxDB

Apache Impala

bloomfilter

simdjson

SaaSHub

Related posts

The state of Java Object Serialization libraries in Q2 2023

Why is json logging the “standard”?

Removies

JOINs in SQL

Unconventional Aggregation

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Database JSON impala Java json-parser
Post date: 3 Oct 2021