Experimental library for scraping websites using OpenAI's GPT API

Our great sponsors

InfluxDB - Power Real-Time Data Analytics at Scale

WorkOS - The modern identity platform for B2B SaaS

SaaSHub - Software Alternatives and Reviews

Our great sponsors

mlscraper

10 1,219 0.6 Python

🤖 Scrape data from HTML websites automatically by just providing examples

Why GPT-based then? There are libraries that do this: You give examples, they generate the rules for you and give you a scraper object that takes any html and returns the scraped data.
Mine: https://github.com/lorey/mlscraper

scrapeghost

10 1,390 8.2 Python

👻 Experimental library for scraping websites using OpenAI's GPT API.

Their ToS mentions scraping but it pertains to scraping their frontend instead of using their API, which they don't want you to do.
Also - this library requests the HTML by itself [0] and ships it as a prompt but with preset system messages as the instruction [1].
[0] - https://github.com/jamesturk/scrapeghost/blob/main/src/scrap...
[1] - https://github.com/jamesturk/scrapeghost/blob/main/src/scrap...

InfluxDB

www.influxdata.com sponsored

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
wtf_wikipedia

1 742 8.0 JavaScript

a pretty-committed wikipedia markup parser

This may finally be a solution for scraping wikipedia and turning it into structured data. (Or do we even need structured data in the post-AI age?)
Mediawiki is notorious for being hard to parse:
* https://github.com/spencermountain/wtf_wikipedia#ok-first- - why it's hard
* https://techblog.wikimedia.org/2022/04/26/what-it-takes-to-p... - an entire article about parsing page TITLES
* https://osr.cs.fau.de/wp-content/uploads/2017/09/wikitext-pa... - a paper published about a wikitext parser

autoscraper

9 5,937 0.0 Python

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
wikipedia_ql

3 357 0.0 Python

Query language for efficient data extraction from Wikipedia
duckling

13 4,015 0.0 Haskell

Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.

For the reasons others have said I don't see it replacing 'traditional' scraping soon. But I am looking forward to it replacing current methods of extracting data from the scraped content.
I've been using Duckling [0] for extracting fuzzy dates and times from text. It does a good job but I needed a custom build with extra rules to make that into a great job. And that's just for dates, 1 of 13 dimensions supported. Being able to use an AI that handles them with better accuracy will be fantastic.
Does a specialised model trained to extract times and dates already exist? It's entity tagging but a specialised form (especially when dealing with historical documents where you may need Gregorian and Julian calendars).
[0] https://github.com/facebook/duckling

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

What are the best tools for web scraping and analysis of natural language to populate a dataset?
3 projects | /r/datasets | 12 Apr 2023
Could someone recommend me a library for c# like one of these two (they are for python) : mlscraper and autoscraper
2 projects | /r/learnprogramming | 19 Mar 2023
Best python modules for scraping HTML?
1 project | /r/pythontips | 26 Feb 2023
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
1 project | /r/webdev | 2 Dec 2022
Scrapping - How to deal with page changes Ai
1 project | /r/webscraping | 25 Mar 2022

Experimental library for scraping websites using OpenAI's GPT API

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Scraping Wikipedia Scraper Webscraping Machine Learning
Post date: 25 Mar 2023

mlscraper

scrapeghost

InfluxDB

wtf_wikipedia

autoscraper

wikipedia_ql

duckling

Related posts

Experimental library for scraping websites using OpenAI's GPT API

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com Scraping Wikipedia Scraper Webscraping Machine Learning Post date: 25 Mar 2023

mlscraper

scrapeghost

InfluxDB

wtf_wikipedia

autoscraper

wikipedia_ql

duckling

Related posts

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Scraping Wikipedia Scraper Webscraping Machine Learning
Post date: 25 Mar 2023