To avoid this headache in HTTP Archive analyses, we've resorted to custom metrics. These are executed on each page at runtime, and they've been really effective: they let us analyze both the fully rendered page and the static HTML. But one big limitation of custom metrics is that they only work at runtime. So if we want to change the code or analyze an older dataset, we're out of luck.
While looking for a way to implement capo.js in BigQuery to understand how pages in HTTP Archive are ordered, I came across the Cheerio library, which is a jQuery-like interface over an HTML parser.
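To make the idea of querying a blob of HTML after the fact more concrete, here is a minimal sketch that lists the elements inside `<head>` in document order. It is not Cheerio and not capo.js: it swaps in Python's standard-library `html.parser` as a stand-in, and the names `HeadOrderParser`, `head_order`, and the `sample` markup are hypothetical, made up for illustration. It also assumes the page has explicit `<head>` tags, which a more forgiving parser would not require.

```python
from html.parser import HTMLParser


class HeadOrderParser(HTMLParser):
    """Collects the tag names that open inside <head>, in document order."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head:
            # Void elements such as <meta> and <link> are recorded here too.
            self.tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def head_order(html: str) -> list[str]:
    """Return the tags found in <head>, in the order they appear (hypothetical helper)."""
    parser = HeadOrderParser()
    parser.feed(html)
    parser.close()
    return parser.tags


# Made-up sample page, not taken from HTTP Archive.
sample = """<!doctype html>
<html>
<head>
  <title>Example</title>
  <meta charset="utf-8">
  <link rel="stylesheet" href="style.css">
  <script src="app.js"></script>
</head>
<body><p>Hello</p></body>
</html>"""

print(head_order(sample))  # ['title', 'meta', 'link', 'script']
```

Because the function only needs the HTML string, it can be rerun over stored pages whenever the logic changes, which is the property that runtime-only custom metrics lack.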
A longstanding problem in the HTTP Archive dataset has been extracting insights from blobs of HTML in BigQuery. For example, take the source code of example.com:
Related posts
- Creating and deploying web scraper using Apify
- Htmlq: Like Jq, but for HTML
- I have an idea for a project and I wanna know which resources are available for me
- Why is it so much easier for people/clients to update their socials as opposed to their website? What’s the solution?
- Scraping the web for information. Is this the right approach