dictomaton
Caffeine
| | dictomaton | Caffeine |
|---|---|---|
| Mentions | 2 | 43 |
| Stars | 129 | 15,204 |
| Growth | - | - |
| Activity | 1.8 | 9.7 |
| Last commit | about 2 years ago | 7 days ago |
| Language | Java | Java |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
dictomaton
-
Calculate the difference and intersection of any two regexes
Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, IRL, you wouldn't actually generate the intersection (you can easily do this on the fly), but you get the idea.
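A minimal sketch of the idea above, assuming a tiny hand-built DFA for the regex `a(b|c)*`. The class `FixedLengthEnumerator`, its nested `Dfa` type and all method names are hypothetical and are not taken from dictomaton. The "any n characters" automaton is never built: the remaining length plays the role of its state, so the pair (state, remaining) is a state of the intersection explored on the fly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FixedLengthEnumerator {
    /** A small DFA: transitions.get(state) maps an input character to the next state. */
    static final class Dfa {
        final List<Map<Character, Integer>> transitions = new ArrayList<>();
        final List<Boolean> accepting = new ArrayList<>();

        int addState(boolean isAccepting) {
            transitions.add(new TreeMap<>());
            accepting.add(isAccepting);
            return transitions.size() - 1;
        }

        void addTransition(int from, char symbol, int to) {
            transitions.get(from).put(symbol, to);
        }
    }

    /** Hypothetical example DFA for the regex a(b|c)*; state 0 is the start state. */
    static Dfa exampleDfa() {
        Dfa dfa = new Dfa();
        int start = dfa.addState(false);
        int tail = dfa.addState(true);
        dfa.addTransition(start, 'a', tail);
        dfa.addTransition(tail, 'b', tail);
        dfa.addTransition(tail, 'c', tail);
        return dfa;
    }

    /** Returns every accepted string of exactly the given length, in lexicographic order. */
    static List<String> stringsOfLength(Dfa dfa, int length) {
        List<String> out = new ArrayList<>();
        walk(dfa, 0, length, new StringBuilder(), out);
        return out;
    }

    // (state, remaining) is a state of the product automaton; it accepts when remaining == 0
    // and the DFA state is accepting.
    private static void walk(Dfa dfa, int state, int remaining,
                             StringBuilder prefix, List<String> out) {
        if (remaining == 0) {
            if (dfa.accepting.get(state)) {
                out.add(prefix.toString());
            }
            return;
        }
        for (Map.Entry<Character, Integer> edge : dfa.transitions.get(state).entrySet()) {
            prefix.append(edge.getKey());
            walk(dfa, edge.getValue(), remaining - 1, prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        System.out.println(stringsOfLength(exampleDfa(), 3)); // [abb, abc, acb, acc]
    }
}
```

For a regex automaton with dead-end states, a real implementation would also prune states from which no accepting state is reachable in the remaining number of steps; this sketch simply explores them.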
Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in a deterministic acyclic minimized automaton (nice and compact, a so-called dictionary automaton). Then, to find, say, all words within Levenshtein distance 2 of hacker, you create a Levenshtein automaton for hacker and compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The language of the intersection automaton is then exactly the set of dictionary words within that distance.
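The lookup described above can be sketched as follows. This is not dictomaton's API: the class `FuzzyTrieSearch` and its methods are hypothetical, a plain trie stands in for the minimized dictionary automaton, and the Levenshtein automaton is simulated with a dynamic-programming row rather than compiled ahead of time. The pair (trie node, row) acts as a state of the intersection, which is explored on the fly and pruned as soon as no cell of the row is within the edit budget.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FuzzyTrieSearch {
    /** Trie node standing in for a state of the dictionary automaton. */
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    static Node buildTrie(List<String> words) {
        Node root = new Node();
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isWord = true;
        }
        return root;
    }

    /** Returns all dictionary words within maxDistance edits of query, in lexicographic order. */
    static List<String> search(Node root, String query, int maxDistance) {
        // Row for the empty prefix: reaching query[0..j) takes j insertions.
        int[] firstRow = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) {
            firstRow[j] = j;
        }
        List<String> matches = new ArrayList<>();
        walk(root, firstRow, query, maxDistance, new StringBuilder(), matches);
        return matches;
    }

    // row[j] is the edit distance between the current trie prefix and query[0..j).
    private static void walk(Node node, int[] row, String query, int maxDistance,
                             StringBuilder prefix, List<String> matches) {
        if (node.isWord && row[query.length()] <= maxDistance) {
            matches.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> edge : node.children.entrySet()) {
            char c = edge.getKey();
            int[] next = new int[row.length];
            next[0] = row[0] + 1;
            int best = next[0];
            for (int j = 1; j < row.length; j++) {
                int substitute = row[j - 1] + (query.charAt(j - 1) == c ? 0 : 1);
                next[j] = Math.min(substitute, Math.min(row[j] + 1, next[j - 1] + 1));
                best = Math.min(best, next[j]);
            }
            // The row minimum never decreases along a path, so a row that is entirely
            // over budget means no extension of this prefix can match.
            if (best <= maxDistance) {
                prefix.append(c);
                walk(edge.getValue(), next, query, maxDistance, prefix, matches);
                prefix.setLength(prefix.length() - 1); // backtrack
            }
        }
    }

    public static void main(String[] args) {
        Node trie = buildTrie(List.of("hacker", "hackers", "backer", "hatter", "packet", "zebra"));
        System.out.println(search(trie, "hacker", 2)); // [backer, hacker, hackers, hatter, packet]
    }
}
```

A compiled Levenshtein automaton avoids recomputing rows for every trie edge, which is the main reason to build one when the same query is run against a large dictionary.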
I wrote a Java package a decade ago that implements some of this stuff:
https://github.com/danieldk/dictomaton
-
Ask HN: What are some 'cool' but obscure data structures you know about?
Also related: Levenshtein automata - automata that match every word within a given Levenshtein distance of a given word. The intersection of a Levenshtein automaton of a word and a DAWG gives you an automaton of all words in the DAWG within the given edit distance.
I haven't done any Java in years, but I made a Java package in 2013 that supports: DAWGs, Levenshtein automata and perfect hash automata:
https://github.com/danieldk/dictomaton
Caffeine
-
Otter, Fastest Go in-memory cache based on S3-FIFO algorithm
/u/someplaceguy,
Those LIRS traces, along with many others, are available at this page [1]. I did a cursory review of their traces using both Caffeine's and the authors' simulators to avoid bias or a mistaken implementation. In their target workloads Caffeine was on par or better [2]. I have not seen anything novel in this or their previous works and find their claims to be easily disproven, so I have not implemented this policy in Caffeine's simulator yet.
[1]: https://github.com/ben-manes/caffeine/wiki/Simulator
[2]: https://github.com/1a1a11a/libCacheSim/discussions/20
-
Google/guava: Google core libraries for Java
That, and also when caffeine came out it replaced one of the major uses (caching) of guava.
https://github.com/ben-manes/caffeine
-
GC, hands off my data!
I decided to start with an overview of the open-source options currently available. For on-heap caches the options are numerous: there are the well-known Guava, Ehcache and Caffeine, and many other solutions. However, when I began researching cache mechanisms that can store data outside GC control, I found that very few solutions are left. Out of the popular ones, only Terracotta is supported. It seems that this is a very niche area and we do not have many options to choose from. Among the less-known projects, I came across Chronicle-Map, MapDB and OHC. I chose the last one because it was created as part of the Cassandra project, which I had some experience with, and I was curious about how this component worked.
-
Spring Cache with Caffeine
Visit the official Caffeine Git project and documentation for more information if you are interested in the subject.
-
Helidon Níma is the first Java microservices framework based on virtual threads
Not to distract from your valid points but, when used properly, Caffeine + Reactor can work together really nicely [1].
[1] https://github.com/ben-manes/caffeine/tree/master/examples/c...
-
FIFO-Reinsertion is better than LRU [pdf]
Yes, I think that is my main concern in that often research papers do not disclose the weaknesses of their approaches and the opposing tradeoffs. There is no silver bullet.
The stress workload that I use is to chain corda-large [1], 5x loop [2], corda-large at a cache size of 512 entries and 6M requests. This shifts from a strongly LRU-biased pattern to an MRU one, and then back again. My solution to this was to use hill climbing, sampling the hit rate to adaptively size the admission window (aka your FIFO) and so reconfigure the cache region sizes. You already have similar code in your CACHEUS implementation, which built on that idea to apply it to a multi-agent policy.
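As a rough illustration of hill climbing on the sampled hit rate, here is a minimal sketch. It is not Caffeine's implementation: the class `WindowHillClimber`, the fixed step size and the clamping to [0, 1] are all assumptions made for the example. After each sample period the climber keeps moving the window fraction in the same direction if the hit rate did not drop, and reverses direction if it did.

```java
final class WindowHillClimber {
    private final double step;          // how far to move the window fraction per sample
    private double windowFraction;      // share of the cache given to the admission window
    private double previousHitRate = Double.NaN;
    private int direction = +1;         // +1 grows the window, -1 shrinks it

    WindowHillClimber(double initialFraction, double step) {
        this.windowFraction = initialFraction;
        this.step = step;
    }

    /** Feeds the hit rate of the sample period that just ended; returns the new window fraction. */
    double adapt(double hitRate) {
        if (!Double.isNaN(previousHitRate) && hitRate < previousHitRate) {
            direction = -direction;     // the last move hurt the hit rate: turn around
        }
        previousHitRate = hitRate;
        windowFraction = Math.max(0.0, Math.min(1.0, windowFraction + direction * step));
        return windowFraction;
    }

    public static void main(String[] args) {
        WindowHillClimber climber = new WindowHillClimber(0.25, 0.125);
        for (double hitRate : new double[] {0.40, 0.45, 0.30, 0.35}) {
            System.out.println(climber.adapt(hitRate));
        }
    }
}
```

A production climber would typically also decay the step size as the hit rate stabilizes and restart with a larger step when the workload shifts, so that it converges instead of oscillating around the optimum.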
Caffeine adjusts the frequency comparison for admission slightly to allow ~1% of losing warm candidates to enter the main region. This is to protect against a hash flooding attack (HashDoS) [3]. That isn't intended to improve or correct the policy's decision making, so it should be unrelated to your observations, but it is an important change for real-world usage.
I believe LIRS2 [4] adaptively sizes their LIR region, but I do not recall the details, as it is a complex algorithm. It did very well across different workloads when I tried it out, and the authors were able to make a few performance fixes based on my feedback. Unfortunately I find LIRS algorithms too difficult to maintain in an industry setting: while excellent, the implementation logic is not intuitive, which makes it frustrating to debug.
[1] https://github.com/ben-manes/caffeine/blob/master/simulator/...
-
Guava 32.0 (released today) and the @Beta annotation
A lot of Guava's most popular libraries graduated to the JDK. Also Caffeine is the evolution of our c.g.common.cache library. So you need Guava less than you used to. Hooray!
-
Monitoring Guava Cache Statistics
-
Apache Baremaps: online maps toolkit
Unfortunately, I don't gather statistics on the demonstration server. I believe that the in-memory caffeine cache (https://github.com/ben-manes/caffeine) saved me.
-
Similar probabilistic algorithms like Hyperloglog?
Caffeine is a Java cache that uses a 4-bit count-min sketch to estimate the popularity of an entry over a sample period. This is used by an admission filter (TinyLFU) to determine whether the new arrival is more valuable than the LRU victim. This is combined with hill climbing to optimize how much space is allocated for frequency vs recency. That results in an adaptive eviction policy that is space and time efficient, and achieves very high hit rates.
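The frequency sketch and admission filter described above can be illustrated with a simplified example. This is not Caffeine's code: the class `TinyLfuSketch`, the hash mixing, the seeds and the one-byte-per-counter layout are assumptions chosen for readability (a real implementation would pack the 4-bit counters far more compactly). It shows the three ingredients from the paragraph: counters that saturate at 15, periodic halving so the counts reflect a recent sample period, and an admission check that compares the candidate's estimated frequency against the eviction victim's.

```java
final class TinyLfuSketch {
    private static final int DEPTH = 4;       // independent hash rows
    private static final int MAX_COUNT = 15;  // largest value a 4-bit counter can hold
    private static final int[] SEEDS = {0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F};

    private final byte[][] counters;          // one byte per counter, for clarity only
    private final int mask;
    private final int sampleSize;
    private int additions;

    TinyLfuSketch(int width, int sampleSize) {
        if (width <= 0 || Integer.bitCount(width) != 1) {
            throw new IllegalArgumentException("width must be a positive power of two");
        }
        this.counters = new byte[DEPTH][width];
        this.mask = width - 1;
        this.sampleSize = sampleSize;
    }

    private int index(Object key, int row) {
        int h = key.hashCode() * SEEDS[row];  // assumed mixing function, not Caffeine's
        h ^= h >>> 15;
        return h & mask;
    }

    /** Records one access to key, aging the whole sketch once the sample period is full. */
    void increment(Object key) {
        boolean added = false;
        for (int row = 0; row < DEPTH; row++) {
            int i = index(key, row);
            if (counters[row][i] < MAX_COUNT) {
                counters[row][i]++;
                added = true;
            }
        }
        if (added && ++additions >= sampleSize) {
            age();
        }
    }

    /** Count-min estimate: the smallest counter across rows, which limits collision inflation. */
    int frequency(Object key) {
        int min = MAX_COUNT;
        for (int row = 0; row < DEPTH; row++) {
            min = Math.min(min, counters[row][index(key, row)]);
        }
        return min;
    }

    /** Admission filter: the new arrival replaces the victim only if it looks more popular. */
    boolean admit(Object candidate, Object victim) {
        return frequency(candidate) > frequency(victim);
    }

    // Halve every counter so that old popularity fades over time.
    private void age() {
        for (byte[] row : counters) {
            for (int i = 0; i < row.length; i++) {
                row[i] >>= 1;
            }
        }
        additions /= 2;
    }

    public static void main(String[] args) {
        TinyLfuSketch sketch = new TinyLfuSketch(64, 1000);
        for (int i = 0; i < 5; i++) {
            sketch.increment("hot");
        }
        sketch.increment("cold");
        System.out.println(sketch.admit("cold", "hot"));
    }
}
```

The hill climbing that balances frequency against recency, and the small randomized admission used against hash flooding, are separate mechanisms and are left out of this sketch.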
What are some alternatives?
ann-benchmarks - Benchmarks of approximate nearest neighbor libraries in Python
Ehcache - Ehcache 3.x line
sdsl-lite - Succinct Data Structure Library 2.0
Hazelcast - Hazelcast is a unified real-time data platform combining stream processing with a fast data store, allowing customers to act instantly on data-in-motion for real-time insights.
RVS_Generic_Swift_Toolbox - A Collection Of Various Swift Tools, Like Extensions and Utilities
cache2k - Lightweight, high performance Java caching
multiversion-concurrency-contro
Apache Geode - Apache Geode
minisketch - Minisketch: an optimized library for BCH-based set reconciliation
Guava - Google core libraries for Java
TablaM - The practical relational programing language for data-oriented applications
scaffeine - Thin Scala wrapper for Caffeine (https://github.com/ben-manes/caffeine)