-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
I'd recommend ArchiveBox. It takes care of extracting videos and media files using youtube-dl, and it also saves to Archive.org for redundancy.
To dedup videos, after years of searching I found one tool that works well enough to be useful: Video Duplicate Finder by 0x90d. It's open source and very easy to use, either through its GUI or from the command line. It builds a database of screenshots taken at different timepoints in each video and compares them. It works extremely well: it can find duplicates that differ in file size, video quality (bitrate, resolution) and even duration. It's the fastest and most reliable video deduplicator I have ever used; the others feel like toys next to it. Occasionally some videos are not matched properly, so if you want to keep as many videos as possible you do need to check the matches manually. Otherwise, if you don't mind losing a few, you can just select all the duplicates and remove them.
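To give a rough idea of why comparing screenshots at several timepoints can survive changes in bitrate or resolution, here is a minimal sketch of the general idea. It is not Video Duplicate Finder's actual code or algorithm. It assumes the frames have already been extracted and scaled down to the same tiny grayscale grid (that step is not shown), and the names `average_hash`, `frame_difference`, `is_duplicate` and the `threshold` value are all made up for illustration.

```python
from typing import List

# A frame is a small grayscale thumbnail: rows of pixel values in 0..255.
# Assumption: every thumbnail was already scaled to the same grid size.
Frame = List[List[int]]


def average_hash(frame: Frame) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the frame's mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def frame_difference(a: Frame, b: Frame) -> float:
    """Fraction of hash bits that differ between two thumbnails (0.0 to 1.0)."""
    hash_a, hash_b = average_hash(a), average_hash(b)
    if len(hash_a) != len(hash_b):
        raise ValueError("thumbnails must have the same dimensions")
    return sum(x != y for x, y in zip(hash_a, hash_b)) / len(hash_a)


def is_duplicate(video_a: List[Frame], video_b: List[Frame],
                 threshold: float = 0.1) -> bool:
    """Compare thumbnails taken at the same relative timepoints in each video.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    if not video_a or len(video_a) != len(video_b):
        raise ValueError("need the same non-zero number of thumbnails per video")
    diffs = [frame_difference(a, b) for a, b in zip(video_a, video_b)]
    return sum(diffs) / len(diffs) <= threshold
```

Because each pixel is only compared against its own frame's mean brightness, a re-encode that shifts pixel values slightly tends to produce the same bits, while a genuinely different picture flips many of them. A loose threshold catches more duplicates but also produces more false matches, which is why a manual check is still worth doing.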