AI-powered content generation has exploded in popularity recently, with bots like ChatGPT and Bard, but the giant amounts of data these bots require come from harvesting the web. What if you don’t want your content feeding the bots? Some respect robots.txt; others honor a new ‘noai’ header tag.
The particular tool mentioned in the Vice article is img2dataset, and right now it doesn't pay attention to the robots.txt file, the normal mechanism you can use to dissuade well-behaved bots from indexing your content. However, it does respect a new HTTP header directive, X-Robots-Tag: noai (and also noindex, an existing and well-known directive that belongs to the robots meta tag and the X-Robots-Tag header rather than to robots.txt itself).
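To make the header mechanism concrete, here is a minimal Python sketch of how a downloader could decide whether to skip a response based on X-Robots-Tag. It is not img2dataset's actual code: the function name `disallows_ai_use` and the default `blocked` tuple are hypothetical, and the sketch assumes the header holds a plain comma-separated list of directives (it ignores the user-agent-scoped form such as `googlebot: noindex`).

```python
def disallows_ai_use(headers, blocked=("noai", "noindex")):
    """Return True if any X-Robots-Tag header carries a blocked directive.

    `headers` is an iterable of (name, value) pairs. The function name and
    the default `blocked` tuple are illustrative assumptions, not part of
    any library.
    """
    for name, value in headers:
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        # Directives are comma-separated; compare them case-insensitively.
        directives = [d.strip().lower() for d in value.split(",")]
        if any(d in blocked for d in directives):
            return True
    return False


if __name__ == "__main__":
    sample = [("Content-Type", "image/png"), ("X-Robots-Tag", "noai")]
    print(disallows_ai_use(sample))
```

With the standard library, the pairs could come from something like `response.headers.items()` on the object returned by `urllib.request.urlopen`. On the publishing side, the equivalent step is configuring your web server to send `X-Robots-Tag: noai` with the responses you want excluded.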