[R] A new dataset and a library that you can use for ML and RL over the Web

This page summarizes the projects mentioned and recommended in the original post on /r/MachineLearning

  • 1) We've open-sourced a dataset of about 50k labeled product web pages from roughly 8,000 distinct e-commerce merchants. The pages are available as MHTML and as WebTraversalLibrary clones (see the next point :) ). Not all of the MHTMLs render correctly, but those that do also have screenshots in a corresponding dataset for CV applications. Documentation on how to download these datasets (along with some example code) is available here. The dataset itself (more statistics, biases, labelling procedure, challenges, etc.) and some initial benchmarks we've run are described in this pre-print: https://arxiv.org/abs/2111.02168

  • webtraversallibrary

    The Web Traversal Library (WTL) is a Python library for abstracting web interactions on top of a base execution layer such as Selenium.

  • 2) If interacting with the Web is more your thing, check out the WebTraversalLibrary, which lets you script agents that interact with the Internet through a browser. It abstracts the browser up to a state/action level, so you don't have to write code against the browser's low-level implementation and can focus on the RL part. The repo contains quite a few example scripts.
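To make the "state/action level" idea concrete, here is a minimal, browser-free sketch in Python. It is not WTL's API: `Action`, `ToySite`, `random_policy` and `run_episode` are hypothetical names invented for illustration, and the "site" is just an in-memory link graph standing in for a real browser.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Action:
    """Hypothetical high-level action, e.g. clicking a link to another page."""
    kind: str
    target: str


class ToySite:
    """In-memory stand-in for a browser: pages plus the links between them."""

    def __init__(self, links: Dict[str, List[str]], start: str) -> None:
        self.links = links
        self.page = start  # the current "state" is simply the page name

    def actions(self) -> List[Action]:
        # Every outgoing link of the current page becomes one click action.
        return [Action("click", target) for target in self.links[self.page]]

    def step(self, action: Action) -> str:
        # Applying an action moves the agent to the linked page.
        self.page = action.target
        return self.page


def random_policy(state: str, actions: List[Action], rng: random.Random) -> Action:
    """Placeholder for the RL part: pick any available action at random."""
    return rng.choice(actions)


def run_episode(env: ToySite, goal: str, max_steps: int, rng: random.Random) -> List[str]:
    """Roll out the policy until the goal page, a dead end, or the step limit."""
    trajectory = [env.page]
    for _ in range(max_steps):
        if env.page == goal:
            break
        available = env.actions()
        if not available:
            break
        env.step(random_policy(env.page, available, rng))
        trajectory.append(env.page)
    return trajectory


# Made-up link graph for the example.
LINKS = {
    "home": ["category", "search"],
    "category": ["product", "home"],
    "search": ["product", "home"],
    "product": [],
}

trajectory = run_episode(ToySite(LINKS, "home"), goal="product", max_steps=20, rng=random.Random(0))
print(" -> ".join(trajectory))
```

In a real setup the environment would be backed by a browser and the random policy replaced by a learned one; the example scripts in the repo show how the library itself does this.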

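Since the dataset pages mentioned in the post are distributed as MHTML, and MHTML is a MIME multipart container, one plausible way to get at the HTML is Python's standard `email` package. The sketch below is only an illustration under that assumption: `extract_html_from_mhtml` is a hypothetical helper, the sample bytes are hand-written rather than taken from the dataset, and the official download documentation and example code remain the authoritative route.

```python
import email
from email import policy


def extract_html_from_mhtml(raw: bytes) -> str:
    """Return the first text/html part of an MHTML (MIME multipart) snapshot."""
    message = email.message_from_bytes(raw, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and decodes the charset.
            return part.get_content()
    raise ValueError("no text/html part found")


# Tiny hand-written MHTML-like sample (illustrative only, not from the dataset).
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Location: https://example.com/product\r\n"
    b"\r\n"
    b"<html><body><h1>Example product</h1></body></html>\r\n"
    b"--BOUNDARY--\r\n"
)

html = extract_html_from_mhtml(SAMPLE)
print(html)
```

For a file on disk, reading it in binary mode and passing the bytes to the same helper should work; images and stylesheets are stored as further MIME parts alongside the HTML.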

Related posts

  • [D] Datasets and Models for Structured Information Extraction on HTML

    3 projects | /r/MachineLearning | 31 May 2022
  • Web Scraping in a professional setting: Selenium vs. BeautifulSoup

    2 projects | /r/Python | 26 Oct 2021
  • What is the most interesting / funniest solution you have seen done with Python & Selenium?

    3 projects | /r/Python | 14 Sep 2021