Webrecorder: Capture interactive websites and replay them at a later time

browsertrix-crawler

13 538 9.0 TypeScript

Run a high-fidelity browser-based crawler in a single Docker container

(Disclaimer: I work at Webrecorder)
Our automated crawler browsertrix-crawler (https://github.com/webrecorder/browsertrix-crawler) uses Puppeteer to run the browsers that we archive in: it loads pages, runs behaviors such as auto-scroll, and then records the request/response traffic. We have custom behaviors for some social media and video sites to make sure that content is captured appropriately. It is a bit of a cat-and-mouse game, as we have to keep updating these behaviors as sites change, but for the most part it works pretty well.
The trickier part is replaying the archived websites, as a certain amount of rewriting has to happen to make sure the HTML and JS work with archived assets rather than the live web. One implementation of this is replayweb.page (https://github.com/webrecorder/replayweb.page), which does all of the rewriting client-side in the browser. This lets you interact with archived websites in WARC or WACZ format as if you were interacting with the original site.
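
To give a rough feel for the kind of rewriting described above, here is a minimal Python sketch that points `href` and `src` attributes at an archive prefix instead of the live web. It is not how replayweb.page does it (that rewriting happens client-side in the browser and also covers JS); the `/replay/` prefix, the `rewrite_html` function, and the regex-based approach are all assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

# Hypothetical replay prefix; real replay systems define their own URL schemes.
REPLAY_PREFIX = "/replay/"

# Matches href="..." or src="..." attribute values (single- or double-quoted).
# A regex is a simplification; a real rewriter would parse the HTML properly.
ATTR_RE = re.compile(r'''\b(href|src)=(["'])(.*?)\2''', re.IGNORECASE)


def rewrite_html(html: str, page_url: str, prefix: str = REPLAY_PREFIX) -> str:
    """Point href/src attributes at archived copies instead of the live web."""

    def repl(match: re.Match) -> str:
        attr, quote, value = match.groups()
        # Leave in-page anchors and non-HTTP schemes untouched.
        if value.startswith(("#", "data:", "javascript:", "mailto:")):
            return match.group(0)
        # Resolve relative URLs against the page that was archived.
        absolute = urljoin(page_url, value)
        return f"{attr}={quote}{prefix}{absolute}{quote}"

    return ATTR_RE.sub(repl, html)


if __name__ == "__main__":
    page = '<img src="/logo.png"><a href="#top">top</a>'
    print(rewrite_html(page, "https://example.com/blog/post"))
```

Resolving each URL against the original page URL before prefixing is what lets relative links keep working once the page is served from somewhere else.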

replayweb.page

24 611 7.6 TypeScript

Serverless replay of web archives directly in the browser


archiveweb.page-site

1 23 5.5 HTML

The ArchiveWeb.page Site

This is actually an issue with their docs that I encountered a few weeks ago when I was first experimenting with this tool. They apparently added a Spanish-language version of the docs, including an associated extra directory tree in the URL, but they failed to set up redirects or even update the existing links in the documentation.
So those two pages are actually located at https://archiveweb.page/en/troubleshooting/errors/ and https://archiveweb.page/en/contact/ respectively.
It looks like their docs site is open source at https://github.com/webrecorder/archiveweb.page-site, so I may try and send a pull request later today to go ahead and correct those links, and possibly also try and deploy some redirects to fix any existing links.
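
The redirect idea could be sketched roughly as follows. This is only an illustration of the mapping, not how the archiveweb.page site is actually built or deployed: the `redirect_target` helper is hypothetical, `en` is taken from the URLs above, and `es` is a guess at the name of the Spanish directory tree.

```python
from typing import Optional

# "en" comes from the corrected URLs above; "es" is an assumed name
# for the Spanish-language tree.
LANG_PREFIXES = ("en", "es")
DEFAULT_LANG = "en"


def redirect_target(path: str) -> Optional[str]:
    """Return the new location for a legacy path, or None if no redirect is needed."""
    trimmed = path.lstrip("/")
    first_segment = trimmed.split("/", 1)[0]
    if first_segment in LANG_PREFIXES:
        return None  # already under a language directory
    return f"/{DEFAULT_LANG}/{trimmed}"


if __name__ == "__main__":
    for legacy in ("/troubleshooting/errors/", "/contact/", "/en/contact/"):
        print(legacy, "->", redirect_target(legacy))
```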

archiveweb.page

7 728 6.3 JavaScript

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

Playwright

379 61,568 9.9 TypeScript

Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

See: https://github.com/microsoft/playwright/issues/6319

readability

51 8,056 6.3 JavaScript

A standalone version of the readability lib

I wonder if Firefox "reader mode as a utility" might be a viable alternative for Pinboard like "content oriented" archiving?
https://github.com/mozilla/readability

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


Related posts

r18 database of metadata
3 projects | /r/DataHoarder | 30 Sep 2022
Ask HN: What is going on at archive.ph?
2 projects | news.ycombinator.com | 27 Sep 2022
"scrape" a javascript object from a website?
1 project | /r/webscraping | 9 Sep 2022
Archiveweb.page – A High-Fidelity Web Archiving Extension for Chromium Browsers
1 project | /r/CKsTechNews | 9 Oct 2021
Archiveweb.page – A High-Fidelity Web Archiving Extension for Chromium Browsers
1 project | news.ycombinator.com | 9 Oct 2021

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
web-archiving Chromium Dev Tools web-archive Extension
Post date: 1 Aug 2023
