Future-Proofing Web Scraping via JavaScript Runtime Heap Snapshots

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • devtools

    Replay.io DevTools (by replayio)

  • Not _quite_ what you're describing, but Replay [0], the company I work for, _is_ building a true "time-traveling debugger" for JS. It works by recording the OS-level interactions with the browser process, then re-running those in the cloud. From the user's perspective in our debugging client UI, they can jump to any point in a timeline and do typical step debugging. However, you can also see how many times any line of code ran, and also add print statements to any line that will print out the results from _every time that line got executed_.

    So, no heap analysis per se, but you can definitely inspect the variables and stack from anywhere in the recording.

    Right now our debugging client is just scratching the surface of the info we have available from our backend. We recently put together a couple of small examples [1] that use the Replay backend API to extract data from recordings and do other analysis, like generating code coverage reports and introspecting React's internals to determine whether a given component was mounting or re-rendering.

    Given that capability, we hope to add the ability to do "React component stack" debugging in the not-too-distant future, such as a button that would let you "Step Back to Parent Component". We're also working on adding Redux DevTools integration now (like, I filed an initial PR for this today! [2]), and hope to add integration with other frameworks down the road.

    [0] https://replay.io

    [1] https://github.com/RecordReplay/replay-protocol-examples

    [2] https://github.com/RecordReplay/devtools/pull/6601

  • puppeteer-heap-snapshot

    API and CLI tool to fetch and query Chrome DevTools heap snapshots.

  • That's an exceedingly clever idea, thanks for sharing it!

    Please consider adding an actual license text file to your repo, since (a) I don't think GitHub's licensee looks inside package.json, and (b) I bet most of the "license" properties of package.json files are "yeah, yeah, whatever" rather than an intentional choice: https://github.com/adriancooney/puppeteer-heap-snapshot/blob... I'm not saying that applies to you, but an explicit license file in the repo would make your wishes clearer.
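
As a rough illustration of what a tool like this works with, a heap snapshot can be requested over the Chrome DevTools Protocol and then searched as plain JSON. The sketch below is not puppeteer-heap-snapshot's own implementation or API, and it swaps in a simpler technique (scanning string nodes) for whatever querying that tool does. `takeHeapSnapshot` and `findStrings` are hypothetical helper names, `client` is assumed to be a CDP session (for example from Puppeteer's `page.target().createCDPSession()`), and the snapshot layout (`snapshot.meta.node_fields`, `nodes`, `strings`) is assumed from what current V8 versions emit, which is not a stable contract.

```javascript
// Request a heap snapshot over CDP and reassemble it from its chunks.
// `client` is assumed to expose .on(event, handler) and .send(method, params).
async function takeHeapSnapshot(client) {
  const chunks = [];
  client.on("HeapProfiler.addHeapSnapshotChunk", ({ chunk }) => chunks.push(chunk));
  await client.send("HeapProfiler.enable");
  // Chunks are expected to arrive before this command resolves.
  await client.send("HeapProfiler.takeHeapSnapshot", { reportProgress: false });
  return JSON.parse(chunks.join(""));
}

// Return the values of all string nodes whose text matches `pattern`.
// Assumes the flat-array layout described in the lead-in; avoid the `g` flag
// on `pattern`, since RegExp.prototype.test is stateful with it.
function findStrings(snapshot, pattern) {
  const fields = snapshot.snapshot.meta.node_fields;
  const typeNames = snapshot.snapshot.meta.node_types[0];
  const typeOffset = fields.indexOf("type");
  const nameOffset = fields.indexOf("name");
  const matches = [];
  // Each node occupies `fields.length` consecutive slots in `nodes`.
  for (let i = 0; i < snapshot.nodes.length; i += fields.length) {
    const type = typeNames[snapshot.nodes[i + typeOffset]];
    const name = snapshot.strings[snapshot.nodes[i + nameOffset]];
    if (type === "string" && pattern.test(name)) matches.push(name);
  }
  return matches;
}

// Usage (inside an async function, with a real Puppeteer page):
//   const client = await page.target().createCDPSession();
//   const snapshot = await takeHeapSnapshot(client);
//   console.log(findStrings(snapshot, /^https:\/\//));
```

Because the data is read from memory rather than from the DOM, a scan like this does not depend on the page's markup, but it does depend on the snapshot format staying as assumed.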

  • profiler

    Firefox Profiler — Web app for Firefox performance analysis

  • Well, kind of. For Firefox, there is this profiling tool that you could use (semi-built-in):

    https://github.com/firefox-devtools/profiler

    It lets you save a report in json.gz format.
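
A saved json.gz report can be opened with nothing more than Node's built-in zlib module. This is a minimal sketch: `parseGzippedJson` and `loadProfile` are hypothetical helper names, the file name in the usage comment is a placeholder, and nothing is assumed about the fields inside the profile beyond it being gzip-compressed JSON.

```javascript
const fs = require("node:fs");
const zlib = require("node:zlib");

// Decompress a gzip buffer and parse the result as UTF-8 JSON.
function parseGzippedJson(buffer) {
  return JSON.parse(zlib.gunzipSync(buffer).toString("utf8"));
}

// Read a .json.gz file from disk and return the parsed object.
function loadProfile(path) {
  return parseGzippedJson(fs.readFileSync(path));
}

// Usage (file name is a placeholder):
//   const profile = loadProfile("profile.json.gz");
//   console.log(Object.keys(profile)); // inspect the top-level structure first
```

Listing the top-level keys first is a cheap way to learn the structure before writing anything that depends on specific fields.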

  • Protocol-Examples

    Example apps demonstrating how to use the Replay Protocol API

  • Playwright

    Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

  • I had understood that Playwright actually used the DevTools protocol rather than the WebDriver protocol, as mentioned here:

    https://github.com/microsoft/playwright/issues/4862

    And there's a bit of detail about how they're different here:

    https://stackoverflow.com/q/50939116/142780

    However, that's more of a detail and doesn't really undermine your point about Firefox / Safari being handled differently; it's just that Playwright implemented its own versions of the protocol for those two non-Chromium-based browsers.
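
The practical consequence of that split is that raw DevTools protocol access in Playwright is only available for Chromium. The sketch below guards for that before opening a session. `cdpSessionOrNull` and `evaluateViaCdp` are hypothetical helper names; `browserType` and `page` are assumed to be Playwright objects (for example `chromium` from `require("playwright")` and a page created from it).

```javascript
// Return a CDP session for Chromium, or null for Firefox and WebKit,
// where Playwright does not expose the Chrome DevTools Protocol.
async function cdpSessionOrNull(browserType, page) {
  if (browserType.name() !== "chromium") return null;
  return page.context().newCDPSession(page);
}

// Evaluate an expression through the raw protocol and return its value.
async function evaluateViaCdp(session, expression) {
  const { result } = await session.send("Runtime.evaluate", {
    expression,
    returnByValue: true,
  });
  return result.value;
}

// Usage (inside an async function):
//   const { chromium } = require("playwright");
//   const browser = await chromium.launch();
//   const page = await browser.newPage();
//   const session = await cdpSessionOrNull(chromium, page);
//   if (session) console.log(await evaluateViaCdp(session, "1 + 1"));
//   await browser.close();
```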

  • puppeteer

    Node.js API for Chrome

  • In a similar vein, I have found success using request interception [1] for some websites where the HTML and the API authentication scheme are unstable, but the API responses themselves are stable.

    If you can drive the browser using simple operations like keyboard commands, you can get the underlying data reliably by listening for matching 'response' events and handling the data as it comes in.

    [1] https://github.com/puppeteer/puppeteer/blob/main/docs/api.md...
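
The "listen for matching 'response' events" approach can be sketched as follows. This is an illustration under assumptions, not the commenter's code: `collectJsonResponses` is a hypothetical helper, `page` is assumed to be a Puppeteer page, and the URL pattern in the usage comment is a placeholder. It only observes responses, so it does not need request interception to be switched on.

```javascript
// Collect parsed JSON bodies from responses whose URL matches `urlPattern`.
// `page` is assumed to be a Puppeteer Page; anything with
// .on("response", handler) that passes objects with url(), ok() and json()
// will work. Avoid the `g` flag on `urlPattern` (test() is stateful with it).
function collectJsonResponses(page, urlPattern) {
  const collected = [];
  const pending = [];
  page.on("response", (response) => {
    if (!urlPattern.test(response.url())) return; // not the API we care about
    if (!response.ok()) return; // skip error responses
    pending.push(
      response
        .json()
        .then((body) => collected.push({ url: response.url(), body }))
        .catch(() => {}) // body was not valid JSON; ignore it
    );
  });
  return {
    collected,
    // Resolves once every matching response seen so far has been parsed.
    settle: () => Promise.all(pending).then(() => collected),
  };
}

// Usage (inside an async function, with a real Puppeteer page):
//   const watcher = collectJsonResponses(page, /\/api\/items/);
//   await page.goto("https://example.com/list");
//   await page.keyboard.press("PageDown"); // drive the UI with simple inputs
//   console.log(await watcher.settle());
```

Registering the listener before navigating matters, since responses that arrive before `page.on` is called are not replayed.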

  • fxsnapshot

    Query tool for Firefox heap snapshots.

  • > Firefox does have a memory snapshot feature, but the file it saved is some kind of binary encoded thing without any obvious strings in it

    Those .fxsnapshot files are gzipped binary heaps. There is a 3rd-party decoder for it:

    https://github.com/jimblandy/fxsnapshot

    Given Mozilla's track record with selenium-webdriver, expect this format to change on you two versions from now. YMMV.
