wtf_wikipedia
scrapeghost
| | wtf_wikipedia | scrapeghost |
|---|---|---|
| Mentions | 1 | 10 |
| Stars | 743 | 1,396 |
| Growth | - | - |
| Activity | 8.0 | 8.2 |
| Last commit | 13 days ago | 5 months ago |
| Language | JavaScript | Python |
| License | MIT License | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
wtf_wikipedia
-
Experimental library for scraping websites using OpenAI's GPT API
This may finally be a solution for scraping Wikipedia and turning it into structured data. (Or do we even need structured data in the post-AI age?)
MediaWiki is notorious for being hard to parse:
* https://github.com/spencermountain/wtf_wikipedia#ok-first- - why it's hard
* https://techblog.wikimedia.org/2022/04/26/what-it-takes-to-p... - an entire article about parsing page TITLES
* https://osr.cs.fau.de/wp-content/uploads/2017/09/wikitext-pa... - a paper published about a wikitext parser
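One concrete reason wikitext resists naive parsing is that templates nest arbitrarily, so a regex cannot find template boundaries. A minimal stdlib-only sketch (an illustration, not wtf_wikipedia's actual implementation) showing the failure and a depth-counting fix:

```python
import re

WIKITEXT = "{{Infobox person|name={{nowrap|Ada Lovelace}}|born=1815}} text"

# A non-greedy regex stops at the FIRST '}}', splitting the nested template.
naive = re.search(r"\{\{.*?\}\}", WIKITEXT).group()
# -> "{{Infobox person|name={{nowrap|Ada Lovelace}}"

def outer_template(text: str) -> str:
    """Return the first top-level {{...}} template by counting brace depth."""
    start = text.index("{{")
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    raise ValueError("unbalanced template braces")

print(outer_template(WIKITEXT))
# -> "{{Infobox person|name={{nowrap|Ada Lovelace}}|born=1815}}"
```

Real wikitext adds parser functions, tag extensions, and transclusion on top of this, which is why full parsers run to thousands of lines.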
scrapeghost
-
Those of you who have developed product features using the GPT-4 API (or failed to do so), how did it go?
Not my project but an ex-colleague has been having some success in this direction: https://jamesturk.github.io/scrapeghost/
-
What are the best tools for web scraping and analysis of natural language to populate a dataset?
Yes, there is something like that available - ScrapeGhost.
- FLaNK Stack Weekly 3 April 2023
- Scraping Websites Using GPT
-
@TwitterDev Announces New Twitter API Tiers
With AI scraping, tools can be far more resilient to minor DOM changes. See https://jamesturk.github.io/scrapeghost/.
-
Experimental library for scraping websites using OpenAI's GPT API
Their ToS mentions scraping, but that pertains to scraping their frontend rather than using their API, which is what they don't want you to do.
Also, this library fetches the HTML itself [0] and sends it in the prompt, with preset system messages as the instruction [1].
[0] - https://github.com/jamesturk/scrapeghost/blob/main/src/scrap...
[1] - https://github.com/jamesturk/scrapeghost/blob/main/src/scrap...
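The flow described above — fetch the page, reduce it to text to shrink the prompt, and hand it to the model alongside a fixed system message and a schema — can be sketched with the stdlib alone. The schema shape and message wording below are illustrative assumptions, not scrapeghost's actual code:

```python
import json
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collapse an HTML document to its visible text to shrink the prompt."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def build_messages(html: str, schema: dict) -> list:
    """Build chat messages: preset system instruction + page text as the user turn."""
    extractor = TextExtractor()
    extractor.feed(html)
    page_text = " ".join(extractor.chunks)
    system = (
        "Extract data from the page text and respond only with JSON "
        f"matching this schema: {json.dumps(schema)}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": page_text},
    ]

html = "<html><body><h1>Ada Lovelace</h1><p>Born 1815</p></body></html>"
messages = build_messages(html, {"name": "string", "born": "number"})
print(messages[1]["content"])  # the reduced page text the model would see
```

Because the schema, not a CSS selector, tells the model what to pull out, a markup change that would break a traditional scraper leaves these messages (and usually the extraction) intact.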
- scrapeghost: web scraping using GPT-4 (experimental)
What are some alternatives?
sdow - Six Degrees of Wikipedia
autoscraper - A Smart, Automatic, Fast and Lightweight Web Scraper for Python
anon - tweet about anonymous Wikipedia edits from particular IP address ranges
tmx-solver - ThreatMetrix (anti-bot/fraud-detection) solver, deobfuscator & data harvester
duckling - Language, engine, and tooling for expressing, testing, and evaluating composable language rules on input strings.
wikipedia_ql - Query language for efficient data extraction from Wikipedia
Bandwhich - Terminal bandwidth utilization tool
bpytop - Linux/OSX/FreeBSD resource monitor
exiftool - ExifTool meta information reader/writer
glances - Glances an Eye on your system. A top/htop alternative for GNU/Linux, BSD, Mac OS and Windows operating systems.