xorfilter
hackernews-button
Metric | xorfilter | hackernews-button |
---|---|---|
Mentions | 1 | 8 |
Stars | 658 | 83 |
Growth | 2.3% | - |
Activity | 5.1 | 2.8 |
Last commit | 4 months ago | 5 months ago |
Language | Go | C |
License | Apache License 2.0 | GNU General Public License v3.0 only |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
xorfilter
- Show HN: Privacy-preserving browser extension linking to HN discussion
> Rather than determining if the current site has been submitted by querying the Firebase/Algolia APIs with every page you visit, the extension contains regularly-updating Bloom filters for all submitted HN stories to preserve user privacy.
Nice!
I built a Pi-hole-esque stub DNS resolver that uses Bloom Filters generated from hostfiles (60 MiB, 5M entries --> 2 MiB with 1% false positives) and it worked like a charm. At some point, I also looked into Xor Filters, which are apparently even lighter and faster, but couldn't find a JavaScript implementation [0].
I, however, stopped using Bloom Filters because their immutability meant rebuilding them over and over again, which was a pain. Inverted Bloom Filters might have been useful since they can be updated in-place [1]. Instead, I went for storing hostnames in a Finite State Automaton [2], which, while not as compact as a Bloom Filter, can be updated in-place, is deterministic, and is faster. Likely not a fit for your use-case, however.
PinSketches [3] might be a fit for accomplishing efficient set reconciliation.
[0] https://github.com/FastFilter/xorfilter#implementations-of-x...
[1] https://www.youtube.com/watch?v=eIs9nJ-JFvA
[2] http://stevehanov.ca/blog/?id=115
[3] https://github.com/sipa/minisketch
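To make the idea in this thread concrete, here is a minimal Bloom filter sketch in Python. It is not the extension's code (which is written in C and compiled to WebAssembly) and not the commenter's resolver. The class name `BloomFilter`, the SHA-256 double-hashing scheme, and the sizing defaults are all assumptions chosen for illustration. It shows why a membership check can run entirely locally: a lookup only reads bits from an array, so no per-page network request is needed.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; names and hashing scheme are assumptions."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest
        # (double hashing); forcing h2 odd avoids a zero step.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    submitted = BloomFilter(capacity=1000, error_rate=0.01)
    submitted.add("example.com/some-story")
    print("example.com/some-story" in submitted)
```

Note that this plain variant supports only insertion and lookup. Entries cannot be removed without rebuilding the filter, which is the update cost the comment above complains about.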
hackernews-button
- GitHub - jstrieb/hackernews-button: Privacy-preserving Firefox extension linking to Hacker News discussions; built with Bloom filters and WebAssembly
- Ask HN: I curate HN stories which didn't reach the front page. Feedback please
It's worth noting that my extension is far from perfect – it turns out that determining whether a specific page has been submitted to Hacker News is far from a trivial problem to solve. In general, this is because multiple URLs can map to the same page.
Direct string comparison of the current URL to previously submitted ones doesn't work because there are many ways for two identical web pages to have different URLs. For example, the URL fragments can differ (the part after the "#" that may or may not be present). Also there can be tracking parameters (often—but not necessarily—prefixed with "utm_"), which don't change anything about the page. But the URL parameters can't be entirely disregarded because sometimes sites, forums in particular, rely on them – consider pages that use an "?id=..." parameter for different pages. Thus some parameters should be removed, but some shouldn't. The same website having different domains (or domains that change over time) further complicates the situation.
My solution was to "canonicalize" URLs by transforming them into a simplified form using some pretty rough heuristics for common sources of noise. The Python code to do that is here: https://github.com/jstrieb/hackernews-button/blob/master/can...
All of this is to say that even though I've used my extension for months and have been quite happy, there will inevitably be false negatives.
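The heuristics described in this comment can be sketched roughly as follows. This is not the code from the linked repository; it is a hypothetical `canonicalize` function built on Python's standard `urllib.parse`. It drops the fragment and any `utm_`-prefixed parameters while keeping other parameters such as `id`. Lowercasing the host, stripping the scheme, port, and a leading `www.`, and trimming a trailing slash are extra assumptions added for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed list of tracking-parameter prefixes; the comment only names "utm_".
TRACKING_PREFIXES = ("utm_",)


def canonicalize(url: str) -> str:
    """Reduce a URL to a simplified form for comparison (illustrative heuristics only)."""
    parts = urlsplit(url)

    # Assumption: scheme, port, and a leading "www." do not change the page.
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]

    # Keep ordinary parameters (e.g. "id"), drop tracking ones.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]

    # Assumption: a trailing slash is noise. The fragment is never copied over.
    path = parts.path.rstrip("/")
    query = urlencode(kept)
    return f"{host}{path}" + (f"?{query}" if query else "")


if __name__ == "__main__":
    a = canonicalize("https://www.Example.com/post/?utm_source=hn&id=42#comments")
    b = canonicalize("http://example.com/post?id=42")
    print(a, a == b)  # example.com/post?id=42 True
```

Because rules like these are heuristics, two URLs for the same page can still canonicalize differently (for instance when a site changes domains), which is the source of the false negatives mentioned above.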
- Privacy-preserving Firefox extension linking to Hacker News discussion; built with Bloom filters and WebAssembly
- Show HN: Privacy-preserving browser extension linking to HN discussion
Thanks for clearing that up. Yes, I'm not that familiar with Bloom filters; they seem like an interesting and useful concept. They could probably (pun intended) be applied to many applications to increase privacy.
I like the workflow file you've made [1]; the comments really help with reading shell code. I'm also amazed you can query 4M entries every day with BigQuery. I thought that might be fairly expensive to do, right? Or is this below a free tier?
[1] https://github.com/jstrieb/hackernews-button/actions/runs/61...
What are some alternatives?
newsit - Chrome Extension for Hacker News and Reddit Links