An AI Scraping Tool Is Overwhelming Websites with Traffic

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • img2dataset

    Easily turn large sets of image URLs into an image dataset. It can download, resize, and package 100M URLs in 20h on one machine. (A short usage sketch appears after the list below.)

  • The established norm is that scrapers must download robots.txt and support its standard features, notably `Crawl-Delay`, which sets a rate limit. This is the established standard by which websites tell scrapers the rules for scraping them.

    This tool scrapes sites, webmasters are reporting actual disruption, and it has no robots.txt support. When people complained (e.g. in https://github.com/rom1504/img2dataset/issues/48), the author's stance was basically "PRs welcome". A third party recently contributed a PR to make it respect robots.txt (https://github.com/rom1504/img2dataset/pull/302), albeit without `Crawl-Delay` support, and it has not been merged yet.

    I have seen the same thing with other recent AI tools (e.g. https://github.com/m1guelpf/browser-agent/issues/2), and I think it's important to defend the robots.txt convention and nip this in the bud. If a bot makes no reasonable effort to respect robots.txt and causes disruption, it's a denial-of-service attack and should be treated as such. No excuses. (A minimal sketch of what robots.txt compliance looks like follows this list.)

  • browser-agent

    A browser AI agent using GPT-4


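For context on the traffic volumes involved, here is a minimal sketch of driving img2dataset from Python, mirroring the usage shown in the project's README. Treat the exact parameter names and defaults as assumptions that may vary between versions; `myimglist.txt` is a hypothetical input file.

```python
# Minimal sketch of an img2dataset run, based on the project's README.
# Parameter names/defaults are assumptions and may differ across versions.
from img2dataset import download

download(
    url_list="myimglist.txt",   # text file with one image URL per line (hypothetical)
    output_folder="images",     # where downloaded/resized images are written
    image_size=256,             # resize target in pixels
    processes_count=4,          # parallel worker processes
    thread_count=32,            # download threads per process
)
```

Note the concurrency: 4 processes with 32 threads each means up to 128 simultaneous requests, and nothing in this invocation spaces out hits to any single host; that is exactly the disruption the comment above describes.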
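To make the robots.txt point concrete, here is a minimal politeness sketch using only the Python standard library. It is not part of img2dataset (which lacked robots.txt support at the time of the post); the helper name, the per-host caching scheme, and the 1-second fallback delay are all illustrative assumptions.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

# Per-host caches (illustrative; these are per-process only, see note below).
_parsers: dict[str, urllib.robotparser.RobotFileParser] = {}
_last_request: dict[str, float] = {}

def polite_can_fetch(url: str, user_agent: str = "example-scraper") -> bool:
    """Return True once it is both allowed and polite to fetch `url`,
    sleeping as needed to honor the host's Crawl-delay."""
    parts = urlparse(url)
    host = f"{parts.scheme}://{parts.netloc}"
    rp = _parsers.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser(f"{host}/robots.txt")
        try:
            rp.read()  # fetch and parse the host's robots.txt
        except OSError:
            # Network failure: the stdlib parser then conservatively
            # reports every URL as not fetchable until a read succeeds.
            pass
        _parsers[host] = rp
    if not rp.can_fetch(user_agent, url):
        return False  # the site disallows this path for this agent
    delay = rp.crawl_delay(user_agent) or 1.0  # fall back to 1s (assumption)
    wait = _last_request.get(host, 0.0) + delay - time.monotonic()
    if wait > 0:
        time.sleep(wait)  # space out requests to this host
    _last_request[host] = time.monotonic()
    return True
```

In a multi-process downloader like the one sketched above, module-level dictionaries are per-process state, so the delay would have to be enforced in shared state or by a single scheduler to genuinely cap per-host request rates.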
