Experimental library for scraping websites using OpenAI's GPT API

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • mlscraper

    🤖 Scrape data from HTML websites automatically by just providing examples

  • Why GPT-based, then? There are already libraries that do this: you give examples, they generate the extraction rules for you, and you get a scraper object that takes any HTML and returns the scraped data.

    Mine: https://github.com/lorey/mlscraper
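The example-driven idea behind such libraries can be sketched in plain Python. This is not mlscraper's actual API — just an illustrative stdlib sketch: "train" on one example value by recording the tag and class of the element that contains it, then reuse that rule on other HTML.

```python
# Illustrative sketch (not mlscraper's actual API): learn a rule
# (tag name + class) from one example value, then apply it elsewhere.
from html.parser import HTMLParser

class RuleFinder(HTMLParser):
    """Record the (tag, class) of the element containing the example text."""
    def __init__(self, example):
        super().__init__()
        self.example = example
        self.stack = []   # (tag, class) of currently open elements
        self.rule = None

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip() == self.example and self.stack:
            self.rule = self.stack[-1]

class RuleApplier(HTMLParser):
    """Extract the text of every element matching the learned rule."""
    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.capture = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        self.capture = (tag, dict(attrs).get("class")) == self.rule

    def handle_data(self, data):
        if self.capture and data.strip():
            self.results.append(data.strip())
            self.capture = False

def train(html, example):
    finder = RuleFinder(example)
    finder.feed(html)
    return finder.rule

def scrape(html, rule):
    applier = RuleApplier(rule)
    applier.feed(html)
    return applier.results
```

Training on `'<div><span class="price">9.99</span></div>'` with example `"9.99"` yields the rule `("span", "price")`, which then extracts every matching price from new pages. Real libraries learn far richer selectors, but the train-then-apply shape is the same.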

  • scrapeghost

    👻 Experimental library for scraping websites using OpenAI's GPT API.

  • Their ToS mentions scraping, but that clause pertains to scraping their frontend rather than using their API; the frontend scraping is what they don't want you to do.

    Also, this library fetches the HTML by itself [0] and ships it as the prompt, with preset system messages as the instruction [1].

    [0] - https://github.com/jamesturk/scrapeghost/blob/main/src/scrap...

    [1] - https://github.com/jamesturk/scrapeghost/blob/main/src/scrap...
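The pattern the comment describes can be sketched as follows. This is not scrapeghost's actual source — a hedged, stdlib-only illustration in which `SYSTEM_PROMPT`, `build_messages`, and `scrape_url` are invented names, and `llm` stands in for whatever chat-completion client you use:

```python
# Hedged sketch of the general pattern (not scrapeghost's code):
# fetch the page yourself, then send the raw HTML as the user message
# alongside fixed "system" instructions.
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a web-scraping assistant. Extract the requested fields "
    "from the HTML and reply with JSON only."
)

def build_messages(html, schema):
    """Assemble a chat-style payload; `schema` names the wanted fields."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": f"Target schema: {json.dumps(schema)}"},
        {"role": "user", "content": html},
    ]

def scrape_url(url, schema, llm):
    """`llm` is any callable taking the messages list, returning JSON text."""
    with urllib.request.urlopen(url) as resp:  # the library fetches HTML itself
        html = resp.read().decode("utf-8", errors="replace")
    return json.loads(llm(build_messages(html, schema)))
```

The key point from the comment is visible in the shape: the HTML goes out as prompt content, while the instructions live in preset system messages.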

  • wtf_wikipedia

    a pretty-committed wikipedia markup parser

  • This may finally be a solution for scraping Wikipedia and turning it into structured data. (Or do we even need structured data in the post-AI age?)

    MediaWiki markup is notorious for being hard to parse:

    * https://github.com/spencermountain/wtf_wikipedia#ok-first- - why it's hard

    * https://techblog.wikimedia.org/2022/04/26/what-it-takes-to-p... - an entire article about parsing page TITLES

    * https://osr.cs.fau.de/wp-content/uploads/2017/09/wikitext-pa... - a paper published about a wikitext parser
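One concrete reason wikitext resists simple parsing is that templates (`{{...}}`) nest, so regular expressions cannot find their boundaries. A tiny Python illustration (the sample wikitext and function are mine, not from any of the linked parsers):

```python
# Why regexes fail on wikitext: {{...}} templates nest arbitrarily.
import re

TEXT = "Born {{birth date|1879|3|14}} in {{nowrap|{{lang|de|Ulm}}}}."

# A naive non-greedy regex stops at the first "}}", truncating nested templates:
naive = re.findall(r"\{\{(.*?)\}\}", TEXT)
# second match is "nowrap|{{lang|de|Ulm" -- the nested template is cut in half

def top_level_templates(text):
    """Return the full source of each top-level {{...}} template,
    tracking brace-pair nesting depth with a character scan."""
    out, depth, start, i = [], 0, 0, 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == "}}" and depth:
            depth -= 1
            if depth == 0:
                out.append(text[start:i + 2])
            i += 2
        else:
            i += 1
    return out
```

And that is only nesting: real wikitext adds pipes with meaning, HTML islands, parser functions, and transclusion, which is why dedicated parsers like wtf_wikipedia exist.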

  • autoscraper

    A Smart, Automatic, Fast and Lightweight Web Scraper for Python

  • wikipedia_ql

    Query language for efficient data extraction from Wikipedia

  • duckling

    Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.

  • For the reasons others have said, I don't see it replacing 'traditional' scraping soon. But I am looking forward to it replacing current methods of extracting data from the scraped content.

    I've been using Duckling [0] for extracting fuzzy dates and times from text. It does a good job, but I needed a custom build with extra rules to turn that into a great job. And that's just for dates, one of the 13 dimensions it supports. Being able to use an AI that handles them with better accuracy will be fantastic.

    Does a specialised model trained to extract times and dates already exist? It's entity tagging, but a specialised form (especially when dealing with historical documents, where you may need both Gregorian and Julian calendars).

    [0] https://github.com/facebook/duckling
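To make the "dimension" idea concrete, here is a minimal Python sketch of the time dimension only — a couple of regex rules normalising common date spellings. This is my own toy, not Duckling's Haskell rule engine, which covers vastly more phrasings ("next Friday", "in two weeks") plus twelve other dimensions:

```python
# Toy sketch of one "dimension" (dates): regex rules that normalise a
# few spellings into datetime.date. Duckling's rule engine goes far beyond this.
import re
from datetime import date

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}

PATTERNS = [
    # "14 March 1879"
    (re.compile(r"\b(\d{1,2})\s+([a-z]+)\s+(\d{4})\b", re.I),
     lambda m: date(int(m.group(3)), MONTHS[m.group(2).lower()], int(m.group(1)))),
    # "March 14, 1879"
    (re.compile(r"\b([a-z]+)\s+(\d{1,2}),\s*(\d{4})\b", re.I),
     lambda m: date(int(m.group(3)), MONTHS[m.group(1).lower()], int(m.group(2)))),
]

def extract_dates(text):
    """Return every date the patterns can resolve, in order of appearance."""
    found = []
    for pattern, build in PATTERNS:
        for m in pattern.finditer(text):
            try:
                found.append((m.start(), build(m)))
            except KeyError:  # matched a word that is not a month name
                pass
    return [d for _, d in sorted(found)]
```

Every extra phrasing means another hand-written rule — which is exactly why an AI model that generalises over phrasings is attractive here.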
