-
Useful.
Though I'll have to hunt out (or try to knock together) something that we can run locally for checking internal-only/whitelisted hosts (like https://testssl.sh/ for HTTPS config checking).
-
I looked into this for a project a couple of years ago and ended up using mfsbsd instead.
-
but if we assume that it works fine?
AFAIK there's ongoing work for FreeBSD
-
dedupfs
A Python FUSE file system that features transparent deduplication and compression which make it ideal for archiving backups.
Is it possible to use tarsnap's deduplication code on my own server? We're setting up an ML dataset distribution box, and I was hoping to avoid storing e.g. ImageNet three times: as a tarball, untarred (so that nginx can serve each photo individually), and in TFDS format.
https://github.com/xolox/dedupfs was the closest I found, but it has a lot of downsides.
Has anyone made an interface to tarsnap's tarball dedup code? A Python wrapper around the block dedup code would be ideal, but I doubt it exists.
(Sorry for the random question -- I was just hoping for a standalone library along the lines of tarsnap's "filesystem block database" APIs. I thought about emailing this to you instead, but I'm crossing my fingers that some random HN'er might know. I'm sort of surprised that filesystems don't make it effortless. In fact, I delayed posting this for an hour to go research whether ZFS is the actual solution -- apparently "no, not unless you have specific brands of SSDs: https://www.truenas.com/community/resources/my-experiments-i..." which rules out my non-SSD 64TB Hetzner server. But Dropbox solved this problem a decade ago -- isn't there something similar by now?)
-
ssh-audit
SSH server & client security auditing (banner, key exchange, encryption, mac, compression, compatibility, security, etc) (by jtesta)
-
Not foolish! The Tarsnap client code is open source, but the license file prohibits anyone from using the code: https://github.com/Tarsnap/tarsnap/blob/master/COPYING
> Redistribution and use in source and binary forms, without modification,
-
Samba
https://gitlab.com/samba-team/samba is the Official GitLab mirror of https://git.samba.org/samba.git -- Merge requests should be made on GitLab (not on GitHub) (by samba-team)
provided by Tarsnap Backup Inc.
The codebase is a jewel. I love the design, the way it's organized, the coding style, the algorithms, everything.
My process was to skim Colin's thesis: http://www.daemonology.net/papers/thesis.pdf
Along with the rsync thesis: https://www.samba.org/~tridge/phd_thesis.pdf
Then I started making a mental map of tarsnap: How does it build its deduplication index? How does it decide where block boundaries start within a file? Etc.
Eventually I started coding the algorithms in Python, mostly as a way of understanding the code. It's not actually as hard as it sounds, but you have to be rigorous. (It's a C -> Python conversion, after all, so there's not much room for error.)
My process was basically: Copy the C code into a Python file; comment out the code; for each line, write the corresponding Python; try to get something running as quickly as possible.
It worked pretty well, but I eventually lost interest.
Over the years, I've wanted a deduplication library, and 2021 is no exception. Someday I'll just roll up my sleeves and finish porting it.
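For anyone wondering what the two questions above (how the index gets built, and how block boundaries are chosen) look like in practice, here is a minimal sketch of content-defined chunking plus a hash-keyed block store. To be clear, this is not tarsnap's code or its actual algorithm: the polynomial rolling hash, the constants (WINDOW, MASK, MIN_CHUNK, MAX_CHUNK) and the BlockStore class are all made up for illustration. It only shows the general idea of cutting where the content says to cut, so an insertion near the start of a file doesn't shift every later block.

```python
import hashlib

# Illustrative parameters only; real tools use much larger chunks.
WINDOW = 16           # bytes covered by the rolling hash
MASK = (1 << 6) - 1   # cut when the low 6 bits are zero (~1 in 64 positions)
MIN_CHUNK = 16        # kept >= WINDOW so every cut decision sees a full window
MAX_CHUNK = 1024      # forced cut if no boundary shows up
BASE = 257
MOD = (1 << 31) - 1


def chunk_boundaries(data: bytes):
    """Yield (start, end) spans whose cut points depend on nearby content."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Drop the oldest byte so h covers only the last WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


class BlockStore:
    """Hypothetical in-memory dedup index: SHA-256 digest -> chunk bytes."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes):
        """Store data; return (recipe, number of chunks that were new)."""
        recipe = []
        added = 0
        for start, end in chunk_boundaries(data):
            chunk = data[start:end]
            key = hashlib.sha256(chunk).hexdigest()
            if key not in self.blocks:
                self.blocks[key] = chunk
                added += 1
            recipe.append(key)
        return recipe, added

    def get(self, recipe) -> bytes:
        """Rebuild the original bytes from a list of chunk digests."""
        return b"".join(self.blocks[key] for key in recipe)
```

Calling `put` twice with the same bytes adds no new blocks the second time, and calling it with a byte prepended should normally add only the first chunk or two, since the cut points re-align once the rolling window is back on shared content. A real implementation would also need persistent storage and something faster than a pure-Python byte loop.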