Webrecorder: Capture interactive websites and replay them at a later time

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • browsertrix-crawler

    Run a high-fidelity browser-based crawler in a single Docker container

  • (Disclaimer: I work at Webrecorder)

    Our automated crawler browsertrix-crawler (https://github.com/webrecorder/browsertrix-crawler) uses Puppeteer to drive the browsers we archive in: it loads pages, runs behaviors such as auto-scroll, and records the request/response traffic. We have custom behaviors for some social media and video sites to make sure that content is captured appropriately. It is a bit of a cat-and-mouse game, as we have to keep updating these behaviors as sites change, but for the most part it works pretty well.

    The trickier part is replaying the archived websites, as a certain amount of rewriting has to happen to make sure the HTML and JS work with archived assets rather than the live web. One implementation of this is replayweb.page (https://github.com/webrecorder/replayweb.page), which does all of the rewriting client-side in the browser. This lets you interact with archived websites in WARC or WACZ format as if you were interacting with the original site.

  • replayweb.page

    Serverless replay of web archives directly in the browser

  • archiveweb.page-site

    The ArchiveWeb.page Site

  • This is actually an issue with their docs that I encountered a few weeks ago when I was first experimenting with this tool. They apparently added a Spanish-language version of the docs, including an associated extra directory tree in the URL, but they failed to set up redirects or even update the existing links in the documentation.

    So those two pages are actually located at https://archiveweb.page/en/troubleshooting/errors/ and https://archiveweb.page/en/contact/ respectively.

    It looks like their docs site is open source at https://github.com/webrecorder/archiveweb.page-site, so I may send a pull request later today to correct those links, and possibly also set up some redirects so that existing links keep working.

  • archiveweb.page

    A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

  • Playwright

    Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

  • See: https://github.com/microsoft/playwright/issues/6319

  • readability

    A standalone version of the readability lib

  • I wonder if Firefox "reader mode as a utility" might be a viable alternative for Pinboard-like, "content-oriented" archiving?

    https://github.com/mozilla/readability
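
The replay comment quoted under browsertrix-crawler says that archived HTML and JS have to be rewritten so they load archived assets rather than the live web. The sketch below illustrates that idea in its simplest form: it rewrites `href` and `src` attributes so each URL is routed through a replay prefix. It is not how replayweb.page is implemented; the `/replay/` prefix and the names `rewrite_url` and `rewrite_html` are assumptions made for this example, and a real replay system also has to handle JavaScript, CSS, and many other cases that a regular expression cannot.

```python
import re
from urllib.parse import urljoin

# Hypothetical replay prefix; real tools choose their own URL layout.
REPLAY_PREFIX = "/replay/"

# Matches href="..." or src="..." with double-quoted values only (a simplification).
ATTR_RE = re.compile(r'\b(href|src)="([^"]*)"')


def rewrite_url(url: str, page_url: str) -> str:
    """Resolve url against the archived page's URL and route it through the prefix."""
    # Leave fragment-only links and non-network schemes untouched.
    if url.startswith(("#", "data:", "javascript:", "mailto:")):
        return url
    absolute = urljoin(page_url, url)
    return REPLAY_PREFIX + absolute


def rewrite_html(html: str, page_url: str) -> str:
    """Rewrite every href/src attribute so the browser requests archived copies."""
    def replace(match):
        attr, url = match.group(1), match.group(2)
        return f'{attr}="{rewrite_url(url, page_url)}"'
    return ATTR_RE.sub(replace, html)


if __name__ == "__main__":
    page = '<a href="/about">About</a> <img src="https://cdn.example.com/logo.png">'
    print(rewrite_html(page, "https://example.com/index.html"))
```

Relative links are resolved against the URL of the archived page first, so that `/about` and `style.css` end up pointing at the archived copy of the right resource instead of whatever host happens to serve the replay.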
