libu8ident
RoaringBitmap
Metric | libu8ident | RoaringBitmap
---|---|---
Mentions | 9 | 24
Stars | 17 | 3,395
Growth | - | 1.0%
Activity | 1.8 | 8.5
Last commit | 11 months ago | 26 days ago
Language | C | Java
License | Apache License 2.0 | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
libu8ident
- Roaring bitmaps are compressed bitmaps, can be 100x faster
-
International domain names: where does HTTPS://meßagefactory.ca lead you?
In programming languages it's much worse. Identifiers can be unidentifiable, and everybody has a different opinion on what "identifiable" means. Even the standard on identifiers, TR39, is buggy and has too many interpretations, leading to a complete disaster. https://github.com/rurban/libu8ident/blob/master/doc/c11.md
In punycode domain names it's quite simple still.
With other names, it's even worse. No one cares. Linkers do not; username handling and filesystem drivers do not. Apple's HFS+ did care a bit at one point, until someone in the higher ranks decided that no one needs Unicode security anymore and switched the new APFS back to unsafe.
-
Using Unicode in a compiler
No, it's definitely not safe to use unrestricted Unicode in a compiler. See https://github.com/rurban/libu8ident/ for identifier rules, and http://www.unicode.org/reports/tr55/ for much worse problems.
- Ask HN: What interesting problems are you working on? (2022 Edition)
- Unicode Utilities: Confusables
-
How can you be fooled by the U+202E trick?
That's why Unicode published the security guidelines and mechanisms to avoid such attacks, back in 2004 already.
The problem is that nobody cared. Browsers invented punycode instead of following TR39, and email did the same. But OK, at least that is something. Java did it, cperl did it, Rust did it.
Everybody else is vulnerable. Esp. most other programming languages, filesystems and login systems. https://github.com/rurban/libu8ident/blob/master/doc/c11.md
- Prevent Trojan Source attacks with GCC 12
-
Unicode Normalization Forms: When ö = ö
I'm maintaining such a library.
coreutils, diff, grep, patch, sed and friends all cannot find Unicode strings; they have no string support. They can only mimic filesystems, finding binary garbage. Strings are something different than pure ASCII or BINARY garbage. Strings have an encoding and are Unicode.
Filesystems are even worse because they need to treat filenames as identifiers, but do not. Nobody cares about TR31, TR39, TR36 and so on.
Here is an overview of the sad state of Unicode unsafeties in programming languages: https://github.com/rurban/libu8ident/blob/master/c11.md
- Why does Windows 10 run faster than Fedora?
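Two of the mentions above, the U+202E trick and the normalization post (ö = ö), can be illustrated with a small Python sketch. It is not how libu8ident works internally. It only uses the standard `unicodedata` module, assumes NFC as the normalization form for comparing identifiers, and uses two hypothetical helper names, `same_identifier` and `has_bidi_controls`, chosen for illustration.

```python
import unicodedata

# Bidirectional control characters: embeddings and overrides (U+202A..U+202E)
# and isolates (U+2066..U+2069). U+202E is the "right-to-left override".
BIDI_CONTROLS = (
    {chr(c) for c in range(0x202A, 0x202F)}
    | {chr(c) for c in range(0x2066, 0x206A)}
)


def same_identifier(a: str, b: str) -> bool:
    """Compare two names after NFC normalization (an assumption of this sketch)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def has_bidi_controls(s: str) -> bool:
    """Return True if the string contains any bidirectional control character."""
    return any(ch in BIDI_CONTROLS for ch in s)


precomposed = "\u00f6"   # "ö" as a single code point
decomposed = "o\u0308"   # "o" followed by a combining diaeresis

print(precomposed == decomposed)                 # raw comparison of code points
print(same_identifier(precomposed, decomposed))  # comparison after normalization
print(has_bidi_controls("file\u202Etxt.exe"))    # name containing U+202E
```

The raw comparison treats the two spellings of "ö" as different strings, while the normalized comparison treats them as the same name. Rejecting or flagging names for which `has_bidi_controls` is true is one simple policy; full identifier security as described in TR31 and TR39 covers considerably more (scripts, confusables, mixed-script detection).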
RoaringBitmap
-
Iterating over Bit Sets Quickly
I was recently reading about Roaring https://roaringbitmap.org/ which is a highly optimized compressed bitset implementation. I recommend reading about it if you are interested in this sort of thing. The talk at https://roaringbitmap.org/talks/ is especially good.
- Roaring Bitmaps
- Roaring bitmaps are compressed bitmaps, can be 100x faster
-
What feature would you like to remove in C++26?
However, I would love compressed (not just packed) bitsets too, which is something different to me. I would make it another class with a similar interface, based on something like Roaring. It doesn't need to be in the standard, but it would be nice if the API were such that one could easily swap implementations.
-
Jaccard Index
As an aside, if you find yourself having to compute them on the fly, know that the Roaring Bitmaps libraries are the way to go [1]. The bitmaps are compressed, and can be streamed directly into SIMD computations (batching XORs and popcnts 256 bits wide!). The Jaccard index is just intersection_len / union_len [2] away.
[1] https://roaringbitmap.org/
[2] https://roaringbitmap.readthedocs.io/en/latest/#roaringbitma...
-
Looking for fast, space-efficient key-lookup
Use a two stage approach, with a bloom/cuckoo filter stored as a https://roaringbitmap.org/ in memory. Then a secondary key/value store on disk (bolt or anything else).
-
BitSet Vs BigInteger
As an aside, if you're dealing with large bit sets, you might also want to evaluate Roaring Bitmaps.
-
Negative Incentives in Academic Research
Sidetracking the conversation a bit: what a coincidence that the author (Lemire) is also represented in today's #1 "Ask HN: What are some cool but obscure data structures you know about?", as he is the main contributor of RoaringBitmap https://github.com/RoaringBitmap/RoaringBitmap and one of the main authors of the data structure.
- Ask HN: What are some 'cool' but obscure data structures you know about?
- Roaring bitmaps: A better compressed bitset
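The Jaccard mention above describes the index as intersection_len / union_len over bitmaps. A minimal sketch of that formula follows. It uses plain Python integers as uncompressed bitsets, which is a stand-in and not a Roaring bitmap, and the helper names `to_bitset`, `popcount` and `jaccard` are hypothetical. Returning 1.0 for two empty sets is a convention chosen here.

```python
def to_bitset(values):
    """Pack non-negative integers into one Python int, one bit per value."""
    bits = 0
    for v in values:
        bits |= 1 << v
    return bits


def popcount(bits):
    """Count the set bits (works on any Python 3 version)."""
    return bin(bits).count("1")


def jaccard(a, b):
    """Jaccard index of two bitsets: |A AND B| / |A OR B|."""
    union_len = popcount(a | b)
    if union_len == 0:
        return 1.0  # both sets empty; convention chosen for this sketch
    intersection_len = popcount(a & b)
    return intersection_len / union_len


a = to_bitset([1, 2, 3, 4])
b = to_bitset([3, 4, 5, 6])
print(jaccard(a, b))  # 2 shared values out of 6 distinct values
```

A compressed bitmap library would replace the integer AND, OR and popcount with its own intersection and union cardinality operations; the formula stays the same.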
What are some alternatives?
Confusables - Simple library for matching a string to another string that is same but has letters that only *look* the same as original string
HyperMinHash-java - Union, intersection, and set cardinality in loglog space
featurebase - A crazy fast analytical database, built on bitmaps. Perfect for ML applications. Learn more at: http://docs.featurebase.com/. Start a Docker instance: https://hub.docker.com/r/featurebasedb/featurebase
lucene - Apache Lucene open-source search software
libredwg - Official mirror of libredwg. With CI hooks and nightly releases. PR's ok
CQEngine - Ultra-fast SQL-like queries on Java collections
safeclib - safec libc extension with all C11 Annex K functions
Primes - Prime Number Projects in C#/C++/Python
nbperf - Improved NetBSD's Perfect Hash Generation Tool v3
Feign - Feign makes writing java http clients easier
reals - A lightweight python3 library for arithmetic with real numbers.
maven-compiler-plugin - Apache Maven Compiler Plugin