Future-Proofing Web Scraping via JavaScript Runtime Heap Snapshots

devtools

44 650 9.9 TypeScript

Replay.io DevTools (by replayio)

Not _quite_ what you're describing, but Replay [0], the company I work for, _is_ building a true "time-traveling debugger" for JS. It works by recording the OS-level interactions with the browser process, then re-running those in the cloud. From the user's perspective in our debugging client UI, they can jump to any point in a timeline and do typical step debugging. However, you can also see how many times any line of code ran, and also add print statements to any line that will print out the results from _every time that line got executed_.
So, no heap analysis per se, but you can definitely inspect the variables and stack from anywhere in the recording.
Right now our debugging client is just scratching the surface of the info we have available from our backend. We recently put together a couple of small examples [1] that use the Replay backend API to extract data from recordings and do other analysis, like generating code coverage reports and introspecting React's internals to determine whether a given component was mounting or re-rendering.
Given that capability, we hope to add the ability to do "React component stack" debugging in the not-too-distant future, such as a button that would let you "Step Back to Parent Component". We're also working on adding Redux DevTools integration now (like, I filed an initial PR for this today! [2]), and hope to add integration with other frameworks down the road.
[0] https://replay.io
[1] https://github.com/RecordReplay/replay-protocol-examples
[2] https://github.com/RecordReplay/devtools/pull/6601

puppeteer-heap-snapshot

2 1,343 0.0 TypeScript

API and CLI tool to fetch and query Chrome DevTools heap snapshots.

That's an exceedingly clever idea, thanks for sharing it!
Please consider adding an actual license text file to your repo, since (a) I don't think GitHub's licensee looks inside package.json (b) I bet most of the "license" properties of package.json files are "yeah, yeah, whatever" versus an intentional choice: https://github.com/adriancooney/puppeteer-heap-snapshot/blob... I'm not saying that applies to you, but an explicit license file in the repo would make your wishes clearer
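
To make "querying a heap snapshot" a bit more concrete, here is a minimal sketch of pulling node names out of a Chrome `.heapsnapshot` JSON file. It assumes the commonly seen layout (a flat `nodes` integer array whose field order is described by `snapshot.meta.node_fields`, plus a `strings` table). It is not the API of puppeteer-heap-snapshot; the `HeapSnapshot` interface and `nodeNamesOfType` helper are hypothetical names used only for illustration.

```typescript
// Minimal shape of a V8 heap snapshot; only the fields used below are typed.
// Assumption: this mirrors the layout Chrome DevTools writes to .heapsnapshot files.
interface HeapSnapshot {
  snapshot: {
    meta: {
      node_fields: string[];
      node_types: (string | string[])[];
    };
  };
  nodes: number[];
  strings: string[];
}

// Return the names of all nodes of a given type (for example "string" or "object").
// Field offsets are looked up from the snapshot's own metadata, not hard-coded.
function nodeNamesOfType(heap: HeapSnapshot, wantedType: string): string[] {
  const { node_fields, node_types } = heap.snapshot.meta;
  const stride = node_fields.length; // number of integers per node record
  const typeOffset = node_fields.indexOf("type");
  const nameOffset = node_fields.indexOf("name");
  const typeNames = node_types[typeOffset];
  if (typeOffset < 0 || nameOffset < 0 || !Array.isArray(typeNames)) {
    throw new Error("unexpected snapshot layout");
  }
  const names: string[] = [];
  for (let i = 0; i < heap.nodes.length; i += stride) {
    // The type field indexes into typeNames; the name field indexes into strings.
    if (typeNames[heap.nodes[i + typeOffset]] === wantedType) {
      names.push(heap.strings[heap.nodes[i + nameOffset]]);
    }
  }
  return names;
}
```

A scraper built on this idea would typically go further and follow edges from object nodes to their property values, which this sketch leaves out.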

profiler

184 1,094 9.7 JavaScript

Firefox Profiler — Web app for Firefox performance analysis

Well, kind of: for Firefox there is this profiling tool which you could use (semi-built-in),
https://github.com/firefox-devtools/profiler, which lets you save a report in json.gz format.
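
As a rough illustration of working with such a json.gz report outside the browser, the sketch below decompresses a gzipped JSON buffer with Node's built-in `zlib` and parses it. The `parseGzippedJson` helper and the sample payload are made up for the example; nothing here assumes a particular structure inside a real profiler report.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

// Decompress a gzipped JSON buffer and parse it into a plain value.
function parseGzippedJson(buf: Buffer): unknown {
  return JSON.parse(gunzipSync(buf).toString("utf8"));
}

// Stand-in for a saved report (hypothetical content); a real file would be
// loaded with something like fs.readFileSync("report.json.gz").
const sampleReport = gzipSync(Buffer.from(JSON.stringify({ example: [1, 2, 3] })));
const parsedReport = parseGzippedJson(sampleReport) as Record<string, unknown>;
console.log(Object.keys(parsedReport));
```

For very large reports a streaming approach (`zlib.createGunzip()`) may be preferable to loading everything into memory at once.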

Protocol-Examples

2 8 4.6 TypeScript

Example apps demonstrating how to use the Replay Protocol API

Playwright

379 61,568 9.9 TypeScript

Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

I had understood that Playwright actually used the DevTools protocol rather than the WebDriver protocol, as mentioned here:
https://github.com/microsoft/playwright/issues/4862
And there's a bit of detail about how they're different here:
https://stackoverflow.com/q/50939116/142780
However, that's more of a detail and doesn't really undermine your point about Firefox / Safari being handled differently; it's just that Playwright implemented its own versions of the protocol for those two non-Chromium-based browsers.

puppeteer

359 86,704 9.9 TypeScript

Node.js API for Chrome

In a similar vein, I have found success using request interception [1] for some websites where the HTML and API authentication scheme is unstable, but the API responses themselves are stable.
If you can drive the browser using simple operations like keyboard commands, you can get the underlying data reliably by listening for matching 'response' events and handling the data as it comes in.
[1] https://github.com/puppeteer/puppeteer/blob/main/docs/api.md...
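
The 'response'-listening approach described above could look roughly like the following sketch. To keep it self-contained it does not import Puppeteer; `ResponseLike` and `PageLike` are hypothetical structural stand-ins for the `url()`, `ok()`, `json()` and `on('response', ...)` members that Puppeteer's response and page objects expose, and `collectJsonResponses` is an illustrative helper, not part of any library.

```typescript
// Structural stand-ins for the parts of a Puppeteer page and response used here.
// Assumption: they mirror page.on("response", ...) and response.url()/ok()/json().
interface ResponseLike {
  url(): string;
  ok(): boolean;
  json(): Promise<unknown>;
}
interface PageLike {
  on(event: "response", handler: (response: ResponseLike) => void): unknown;
}

// Pass the parsed JSON body of every successful response whose URL matches
// `pattern` to `onData`. Use a pattern without the "g" flag, since RegExp.test
// is stateful for global patterns.
function collectJsonResponses(
  page: PageLike,
  pattern: RegExp,
  onData: (url: string, body: unknown) => void,
): void {
  page.on("response", (response) => {
    if (!response.ok() || !pattern.test(response.url())) return;
    response
      .json()
      .then((body) => onData(response.url(), body))
      .catch(() => {
        // Body was not JSON or is no longer available; skip it.
      });
  });
}
```

In a real script you would register the listener before driving the page (navigation, key presses, clicks), so that no matching responses are missed, and pass a URL pattern for the API endpoint you care about.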

fxsnapshot

1 1 0.0 Rust

Query tool for Firefox heap snapshots.

> Firefox does have a memory snapshot feature, but the file it saved is some kind of binary encoded thing without any obvious strings in it
Those .fxsnapshot files are gzipped binary heaps. There is a 3rd-party decoder for it:
https://github.com/jimblandy/fxsnapshot
Given mozilla's track record with selenium-webdriver, expect this format to change on you two versions from now, YMMV.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.