- webtraversallibrary
The Web Traversal Library (WTL) is a Python library for abstracting web interactions on top of a base execution layer such as Selenium.
1) We've open-sourced a dataset of about 50k labeled product web pages from roughly 8000 distinct e-commerce merchants. The pages are available as MHTML files and as WebTraversalLibrary clones (see next point :) ), along with the corresponding screenshots. Not all of the MHTMLs render correctly, but the ones that do also have screenshots in a corresponding dataset for CV applications. You can find documentation on how to download these datasets (as well as some example code) here. You can read more about the dataset (statistics, biases, labelling procedure, challenges, etc.) and find some initial benchmarks we've run in this pre-print: https://arxiv.org/abs/2111.02168
2) If interacting with the Web is more your thing, you can also check out the WebTraversalLibrary, which lets you script agents that interact with websites through a browser. The library abstracts the browser up to a state/action level, so you don't have to write any code against the low-level browser implementation and can focus on the RL part. You can find quite a few example scripts in the repo.
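To give a rough feel for what "state/action level" means in point 2, here is a minimal toy sketch. It does not use the WebTraversalLibrary API at all: `State`, `ToyBrowser`, `run_episode`, `make_random_policy` and the `SITE` graph are all hypothetical names made up for illustration, and the "browser" is just a hard-coded dictionary of pages and links. The point is only the shape of the loop: observe a state, let a policy pick an action, apply it, repeat.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class State:
    """What the agent observes: the current URL and the links it may click."""
    url: str
    actions: Tuple[str, ...]


class ToyBrowser:
    """Hypothetical stand-in for a browser backend: a fixed graph of pages."""

    def __init__(self, site: Dict[str, List[str]], start: str) -> None:
        self._site = site
        self._url = start

    def observe(self) -> State:
        # The state exposes only what a policy needs, not browser internals.
        return State(self._url, tuple(self._site[self._url]))

    def click(self, target: str) -> None:
        if target not in self._site[self._url]:
            raise ValueError(f"{target!r} is not clickable on {self._url!r}")
        self._url = target


# A policy maps an observed state to the action (link) to take next.
Policy = Callable[[State], str]


def run_episode(browser: ToyBrowser, policy: Policy, max_steps: int) -> List[str]:
    """Observe, choose, act; stop at a dead end or after max_steps clicks."""
    visited = [browser.observe().url]
    for _ in range(max_steps):
        state = browser.observe()
        if not state.actions:
            break  # no clickable links left on this page
        browser.click(policy(state))
        visited.append(browser.observe().url)
    return visited


def make_random_policy(seed: int) -> Policy:
    """Uniformly random policy with its own seeded generator."""
    rng = random.Random(seed)
    return lambda state: rng.choice(state.actions)


# Made-up site graph: each page lists the pages it links to.
SITE = {
    "/home": ["/category", "/about"],
    "/category": ["/product", "/home"],
    "/about": ["/home"],
    "/product": [],
}

if __name__ == "__main__":
    browser = ToyBrowser(SITE, "/home")
    print(run_episode(browser, make_random_policy(seed=0), max_steps=10))
```

In a setup like this, swapping the random policy for a learned one only means replacing the function passed to `run_episode`; the browser side stays untouched. For the library's actual classes and entry points, the example scripts in the repo are the place to look.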