Is there any good software for deduping (deduplicating) content in WARC files?

This page summarizes the projects mentioned and recommended in the original post on /r/DataHoarder.

  • pywb

    Core Python Web Archiving Toolkit for replay and recording of web archives

  • I have thousands of bookmarks on raindrop.io that I've been wanting to archive for a while. However, I've archived ~150 pages so far with pywb, and they came to 500MB across two WARCs, even with the dedupe option set in my settings file. pywb dedupes while archiving pages; I want software that catches anything it missed and confirms that the WARCs are actually deduped.

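The post asks how to confirm that WARCs are actually deduped. One way to check, sketched below, is to group full `response` records by their `WARC-Payload-Digest` header and report any digest that appears more than once; `revisit` records are skipped because they are already deduplicated pointers. This is a minimal sketch, not a finished tool: `RecordInfo`, `find_duplicate_payloads`, `wasted_bytes`, and `read_warc` are hypothetical names invented for illustration, and `read_warc` assumes the third-party `warcio` library (which pywb itself builds on) is installed. It only detects duplicates; it does not rewrite the WARC.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RecordInfo(NamedTuple):
    """Hypothetical container for the few WARC header fields the check needs."""
    uri: str
    digest: str    # value of the WARC-Payload-Digest header
    rec_type: str  # "response", "revisit", "request", ...
    length: int    # record Content-Length in bytes (includes HTTP headers)


def find_duplicate_payloads(records: Iterable[RecordInfo]):
    """Group full 'response' records by payload digest; keep groups with >1 entry."""
    by_digest = defaultdict(list)
    for rec in records:
        # 'revisit' records are already deduplicated, so only full responses count.
        if rec.rec_type == "response" and rec.digest:
            by_digest[rec.digest].append(rec)
    return {d: recs for d, recs in by_digest.items() if len(recs) > 1}


def wasted_bytes(duplicates) -> int:
    """Rough estimate: every copy after the first could have been a revisit record."""
    return sum(r.length for recs in duplicates.values() for r in recs[1:])


def read_warc(path: str):
    """Yield RecordInfo for each record in a WARC file (assumes `pip install warcio`)."""
    from warcio.archiveiterator import ArchiveIterator  # third-party, imported lazily

    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            headers = record.rec_headers
            yield RecordInfo(
                uri=headers.get_header("WARC-Target-URI") or "",
                digest=headers.get_header("WARC-Payload-Digest") or "",
                rec_type=record.rec_type,
                length=int(headers.get_header("Content-Length") or 0),
            )
```

Calling `find_duplicate_payloads(read_warc("example.warc.gz"))` (the file name is a placeholder) would list every payload stored in full more than once within that file; to compare across several WARCs, chain the generators from each file before grouping.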

Related posts

  • I can't install a Python package, pywb, looks like a problem with brotlipy. What can I do?

    1 project | /r/termux | 5 Oct 2022
  • Ran grab-site now have some warc.gz files etc, the site in question was originally hosted in a mixture of html and javascript, what's the best and easiest way to make this accessible as a user for offline personal use?

    1 project | /r/Archiveteam | 2 Apr 2022
  • How good is ArchiveWeb.page?

    2 projects | /r/Archiveteam | 16 Apr 2021
  • Can anyone familiar with databases of Youtube archives help me? I don't know how to find what I'm looking for. Details in post.

    4 projects | /r/Archivists | 4 Oct 2022
  • Help with WARC files/Extracting a portion of a full crawl

    1 project | /r/Archiveteam | 27 May 2022