| | simdutf | heed |
---|---|---|
Mentions | 11 | 17 |
Stars | 960 | 476 |
Growth | 4.8% | 11.6% |
Activity | 9.1 | 8.9 |
Last commit | 3 days ago | 1 day ago |
Language | C++ | Rust |
License | Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
simdutf
- Glibc Buffer Overflow in Iconv
- Vectorizing Unicode conversions on real RISC-V hardware
The project was mostly inspired by simdutf [0] which has been around for a couple of years already, and I don't think iconv has any of its vectorized implementations for other architectures.
[0] https://github.com/simdutf/simdutf
- Cray-1 performance vs. modern CPUs
I'm actually doing something quite similar in my in-progress Unicode conversion routines.
For UTF-8 validation there is a clever algorithm that uses three 4-bit lookups to detect UTF-8 errors: https://github.com/simdutf/simdutf/blob/master/src/icelake/i...
An aside on LMUL, if you haven't encountered it yet: RVV allows you to group vector registers when setting the vector configuration with vsetvl, so that vector instructions operate on multiple vector registers at once. That is, with LMUL=1 you have v0,v1,...,v31. With LMUL=2 you effectively have v0,v2,...,v30, where each vector register is twice as large; with LMUL=4, v0,v4,...,v28; with LMUL=8, v0,v8,...,v24.
In my code, I happen to read the data with LMUL=2. The trivial implementation would just call vrgather.vv with LMUL=2, but since we only need a 128-bit lookup table, LMUL=1 is enough to store it (V requires a minimum VLEN of 128 bits).
So I do six LMUL=1 vrgather.vv's instead of three LMUL=2 vrgather.vv's, because no lane crossing is required and this runs faster in hardware (see [0] for a relevant microbenchmark).
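The three 4-bit lookups mentioned in this excerpt can be illustrated with a scalar sketch. It follows the lookup-table scheme published by Keiser and Lemire, which, as far as I understand, is what simdutf's validators build on. It is not the simdutf or RVV code: the function name `is_valid_utf8`, the table names, and the zero padding of the tail are assumptions made for illustration. A vectorized version would do each table lookup with a byte shuffle (or `vrgather.vv`) over a whole register instead of one byte at a time.

```python
# One bit per error class (bit assignment as in the published scheme).
TOO_SHORT = 1 << 0       # lead byte followed by a non-continuation byte
TOO_LONG = 1 << 1        # ASCII byte followed by a continuation byte
OVERLONG_3 = 1 << 2      # E0 followed by 80..9F
TOO_LARGE = 1 << 3       # F4..FF followed by 90..BF
SURROGATE = 1 << 4       # ED followed by A0..BF
OVERLONG_2 = 1 << 5      # C0 or C1 followed by a continuation byte
TOO_LARGE_1000 = 1 << 6  # F5..FF followed by 80..8F
OVERLONG_4 = 1 << 6      # F0 followed by 80..8F (shares bit 6 on purpose)
TWO_CONTS = 1 << 7       # two continuation bytes in a row
CARRY = TOO_SHORT | TOO_LONG | TWO_CONTS

# Lookup 1: indexed by the high nibble of the previous byte.
PREV_HIGH = (
    [TOO_LONG] * 8
    + [TWO_CONTS] * 4
    + [TOO_SHORT | OVERLONG_2,
       TOO_SHORT,
       TOO_SHORT | OVERLONG_3 | SURROGATE,
       TOO_SHORT | TOO_LARGE | TOO_LARGE_1000 | OVERLONG_4]
)

# Lookup 2: indexed by the low nibble of the previous byte.
PREV_LOW = (
    [CARRY | OVERLONG_3 | OVERLONG_2 | OVERLONG_4,
     CARRY | OVERLONG_2,
     CARRY,
     CARRY,
     CARRY | TOO_LARGE]
    + [CARRY | TOO_LARGE | TOO_LARGE_1000] * 8           # nibbles 5..C
    + [CARRY | TOO_LARGE | TOO_LARGE_1000 | SURROGATE]   # nibble D
    + [CARRY | TOO_LARGE | TOO_LARGE_1000] * 2           # nibbles E, F
)

# Lookup 3: indexed by the high nibble of the current byte.
_CONT = TOO_LONG | OVERLONG_2 | TWO_CONTS
CUR_HIGH = (
    [TOO_SHORT] * 8
    + [_CONT | OVERLONG_3 | TOO_LARGE_1000 | OVERLONG_4,  # 8
       _CONT | OVERLONG_3 | TOO_LARGE,                     # 9
       _CONT | SURROGATE | TOO_LARGE,                      # A
       _CONT | SURROGATE | TOO_LARGE]                      # B
    + [TOO_SHORT] * 4
)


def is_valid_utf8(data: bytes) -> bool:
    """Scalar model of the nibble-lookup validator (illustrative only)."""
    # Assumption of this sketch: three zero bytes of padding make a
    # truncated multi-byte sequence at the end show up as an error.
    padded = bytes(data) + b"\x00\x00\x00"
    prev1 = prev2 = prev3 = 0
    error = 0
    for cur in padded:
        # AND of the three lookups: a bit survives only if all three
        # nibbles agree that this byte pair belongs to that error class.
        special = (PREV_HIGH[prev1 >> 4]
                   & PREV_LOW[prev1 & 0x0F]
                   & CUR_HIGH[cur >> 4])
        # The 3rd byte of a 3/4-byte sequence and the 4th byte of a 4-byte
        # sequence must be continuations; XOR cancels the legitimate
        # TWO_CONTS bit and flags a missing one.
        must_be_cont = 0x80 if (prev2 >= 0xE0 or prev3 >= 0xF0) else 0
        error |= special ^ must_be_cont
        prev3, prev2, prev1 = prev2, prev1, cur
    return error == 0
```

For instance, `is_valid_utf8("héllo €".encode())` is accepted, while the overlong pair `b"\xc0\x80"` and the encoded surrogate `b"\xed\xa0\x80"` are rejected. Each of the three tables has 16 one-byte entries, which is why a single 128-bit register is enough to hold one of them.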
- What C++ library do you wish existed but hasn’t been created yet?
UTF-8 normalization, stemming, case-insensitive comparison. For Rust there is https://github.com/unicode-rs as an example. What are the options for C++?
1. Transcode to UTF-16 (https://github.com/simdutf/simdutf) and use ICU: slow.
2. Boost.Text, https://github.com/tzlaine/text: also slow (because the author doesn't care or couldn't care). We made a lot of patches to make our library faster than Lucene, but this part is still slower than ICU for UTF-16 (and ICU for UTF-16 is also very slow...).
- [Preprint] Transcoding Unicode Characters with AVX-512 Instructions
You can find the corresponding assembly code in this repository. The main branch only contains implementations based on C++ with intrinsics.
- What's everyone working on this week (10/2023)?
The next big thing is making it LSP-compatible. All language servers must implement UTF-16-based character offsets, which is kind of unfortunate considering that files are much more likely to be stored in UTF-8 (I think?). I don't want to do the UTF-8 -> UTF-16 transcoding, so instead I'll use the excellent simdutf library to count how many code units a UTF-8 string would take if it were transcoded into UTF-16, which is much faster than actual transcoding. So this is what I'm going to do this week: rewriting parsers to produce UTF-16 offsets, plus some final benchmarking. After that is done, I'll consider the "research" part of this project completed and will start writing an actual Markdown parser.
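The counting trick in this excerpt can be sketched in scalar form, assuming the input is already valid UTF-8. Every byte that is not a continuation byte starts a code point and contributes one UTF-16 code unit, and every 4-byte lead byte (0xF0 and above) contributes one more, because code points above U+FFFF become a surrogate pair. The function name below is hypothetical and this is not simdutf's implementation, which does the same kind of counting with SIMD instructions.

```python
def utf16_units_from_utf8(data: bytes) -> int:
    """Count the UTF-16 code units needed for valid UTF-8 input.

    Sketch only: the result is meaningless if `data` is not valid UTF-8.
    """
    units = 0
    for b in data:
        if (b & 0xC0) != 0x80:
            # Not a continuation byte, so this byte starts a code point.
            units += 1
        if b >= 0xF0:
            # 4-byte sequence: the code point needs a surrogate pair.
            units += 1
    return units
```

No code point is ever decoded, so the loop is a pair of comparisons per byte, which is the reason counting is cheaper than transcoding. An LSP-style column offset would apply the same count to the UTF-8 prefix of the line up to the position of interest.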
- Why would a language not natively support SIMD?
You can find the assembly code here: https://github.com/simdutf/simdutf/tree/clausecker. The corresponding C++ code is in the main branch.
- High speed Unicode routines using SIMD
- text-2.0-rc1 with UTF8 underlying representation is available for testing!
Or via an ultrafast simdutf.
- Simdutf: Unicode validation and transcoding at billions of characters per second
heed
- What's everyone working on this week (10/2023)?
At Meilisearch we are currently trying to add better error handling to heed v0.20, our LMDB key-value store wrapper. Unfortunately, when there are a lot of generics it can become harder to play with…
- We’re the Meilisearch team! To celebrate v1.0 of our open-source search engine, Ask us Anything!
There are issues and pull requests, but I advise you to look at the milli folder in the Meilisearch repository; it’s where all the logic is done. We extensively use RoaringBitmaps, heed (the LMDB wrapper), and grenad when indexing.
- Release of an alpha version to perfect the heed library: the most maintained Rust LMDB wrapper
I’ll continue to introduce new features and new safety guards until v0.20.0. Can you tell me more about your project? Or is it private?
- Are there any embedded databases that have multiple-process support?
LMDB supports multiple readers and one writer at the same time; the library itself enforces this. Note that LMDB is a key-value store. You can use the heed library, which is the most maintained Rust wrapper.
- Key/Value Store Recommendations
Note that heed ensures that you use transactions, databases, and environments in the right way. I have also put much more work into this in the important update that I am working on!
- What's everyone working on this week (45/2022)?
I am currently working on exposing the new LMDB encryption feature in heed, the safe LMDB wrapper, with the help of the cryptography community.
- Ask for advice from the cryptographic community about heed: the LMDB wrapper
- redb: high performance, embedded, key-value database in pure Rust
Have you considered heed or even sanakirja?
- [Requesting Help] LMDB Databases in Rust
rkv hasn't been updated for a while. I recommend using heed - https://docs.rs/heed
- I need a stable Key-Value database
For wrappers around LMDB, I'd recommend rkv or heed: https://github.com/mozilla/rkv and https://github.com/Kerollmops/heed
What are some alternatives?
simdutf8 - SIMD-accelerated UTF-8 validation for Rust.
sled - the champagne of beta embedded databases
DirectXMath - DirectXMath is an all inline SIMD C++ linear algebra library for use in games and graphics apps
KeyDB - A Multithreaded Fork of Redis
simde - Implementations of SIMD instruction sets for systems which don't natively support them.
lmdb-rs - Rust bindings for LMDB
eve - Expressive Vector Engine - SIMD in C++ Goes Brrrr
rkv - A simple, humane, typed key-value storage solution.
Vc - SIMD Vector Classes for C++
milli - Search engine library for Meilisearch ⚡️
simdjson - Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks
nanodb-specification - Nano ledger database format specification and Python sample