Is there a way to archive groups of webpages similarly to how web archive does it?

heritrix3

6 2,700 6.2 Java

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

You can go all the way and use the same tools as the Internet Archive: https://github.com/internetarchive/heritrix3

brozzler

2 630 8.3 Python

brozzler - distributed browser-based web crawler

Actually, the IA uses Brozzler (https://github.com/internetarchive/brozzler) now if I remember correctly.
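For a small group of pages, the core idea behind any archival crawl can be sketched in a few lines: start from seed URLs, follow links, and stay on the seed hosts. The snippet below is only an illustrative sketch using the Python standard library, not how Heritrix or Brozzler are implemented. The names `crawl`, `LinkExtractor`, `fetch`, and `max_pages` are made up for this example, and `fetch` is a caller-supplied function that returns HTML text (or `None` on failure), so the sketch itself does no network access or WARC writing.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl limited to the hosts of the seed URLs.

    `fetch` is a hypothetical callable: URL -> HTML text, or None on failure.
    Returns a dict mapping each captured URL to its HTML.
    """
    allowed_hosts = {urlsplit(seed).netloc for seed in seeds}
    queue = deque(seeds)
    seen = set(seeds)
    captured = {}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlsplit(absolute).netloc in allowed_hosts and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return captured
```

A real archive would also need to store responses in a replayable format such as WARC, respect robots.txt and rate limits, and capture page resources like images and scripts. That is the work the tools mentioned above take on.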


NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

WARC'in the Crawler
1 project | news.ycombinator.com | 21 Dec 2023

Why isn't the internet more fun and weird?
2 projects | news.ycombinator.com | 6 Jul 2022

Heritrix: Internet Archive's extensible, web-scale, archival-quality web crawler
1 project | news.ycombinator.com | 26 Sep 2021

Best Http client for web scraping
1 project | /r/java | 26 Sep 2021

The Internet Is Rotting
3 projects | news.ycombinator.com | 30 Jun 2021

This page summarizes the projects mentioned and recommended in the original post on /r/DataHoarder
Java webcrawling warc heritrix
Post date: 29 Mar 2022
