Top 23 web-scraping Open-Source Projects

Scrapy

180 50,896 9.6 Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16

changedetection.io

196 14,870 9.5 Python

The best and simplest free open source web page change detection, website watcher, restock monitor and notification service. Restock Monitor, change detection. Designed for simplicity - Simply monitor which websites had a text change for free. Free Open source web page change detection, Website defacement monitoring, Price change notification

Project mention: Google have removed RSS support from their developer blogs | news.ycombinator.com | 2023-12-11

I use ChangeDetection,
- https://changedetection.io/#features
- https://github.com/dgtlmoon/changedetection.io

InfluxDB

www.influxdata.com sponsored

Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
crawlee

29 12,129 9.8 TypeScript

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Project mention: How to scrape Amazon products | dev.to | 2024-04-01

In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

Douyin_TikTok_Download_API

3 6,780 8.2 Python

🚀「Douyin_TikTok_Download_API」是一个开箱即用的高性能异步抖音、快手、TikTok、Bilibili数据爬取工具，支持API调用，在线批量解析及下载。

Project mention: TikTok video scraper | /r/webscraping | 2023-05-23

At the moment I am working on a web scraper for TikTok. At the moment, I am able to retrieve data about the first 16 videos from a channel. The way I achieved this was to make requests to an unofficial API https://github.com/Evil0ctal/Douyin_TikTok_Download_API. My problem is that the requirements for this project do not allow me to use any package that would extract data from TikTok. I would like to ask you all, how should I go about this task. Already tried getting data from the HTML, but is not sufficient since most of it is not displayed when I use requests.get(URL). Could you please recommend some repositories that could help or some way of extracting the data? Thank you!

awesome-web-scraping

6 6,308 5.1 Makefile

List of libraries, tools and APIs for web scraping and data processing.
autoscraper

9 5,937 0.0 Python

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
rod

20 4,784 7.9 Go

A Devtools driver for web automation and scraping

Project mention: Need help authenticating to Okta programatically. | /r/okta | 2023-07-03

I have tried the following. 1. Login to Okta via browser programatically using go-rod. Which I managed to do so successfully, but I'm failing to load up Slack as it's stuck in the browser loader screen for Slack. 2. I tried to authenticate via Okta RESTful API. So far, I have managed to authenticate using {{domain}}/api/v1/authn, and then subsequently using MFA via the verify endpoint {{domain}}/api/v1/authn/factors/{{factorID}}/verify which returns me a sessionToken. From here, I can successfully create a sessionCookie which have proven quite useless to me. Perhaps I am doing it wrongly.

WorkOS

workos.com sponsored

The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
trafilatura

13 2,778 8.7 Python

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14

The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.

snoop

7 2,683 9.4 Python

Snoop — инструмент разведки на основе открытых данных (OSINT world)

Project mention: Osint update of the Snoop Project tool search for user by nickname | news.ycombinator.com | 2024-01-02

Grab

0 2,354 3.0 Python

Web Scraping Framework
pythoncode-tutorials

1 2,002 8.1 Jupyter Notebook

The Python Code Tutorials
rvest

13 1,470 7.2 R

Simple web scraping for R

Project mention: Collecting Data from News Articles using Web Scraping - Help | /r/rstats | 2023-06-01

You’re looking for the rvest package

core

2 1,319 7.9 PHP

The complete web scraping toolkit for PHP. (by roach-php)
curl_cffi

5 1,330 9.3 Python

Python binding for curl-impersonate via cffi. A http client that can impersonate browser tls/ja3/http2 fingerprints.

Project mention: This Week In Python | dev.to | 2024-03-22

curl_cffi – A http client that can impersonate browser tls/ja3/http2 fingerprints

selectolax

6 967 7.7 Cython

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).

Project mention: GitHub – GSA/code-gov: An informative repo for all Code.gov repos | news.ycombinator.com | 2023-09-09

https://github.com/rushter/selectolax#simple-benchmark )
(Apache Nutch is a Java-based web crawler which supports e.g. CommonCrawl (which backs various foundational LLMs)) https://en.wikipedia.org/wiki/Apache_Nutch#Search_engines_bu... . But extruct extracts more types of metadata and data than Nutch AFAIU: https://github.com/scrapinghub/extruct )
datasette-graphql adds a GraphQL HTTP API to a SQLite database:

botasaurus

5 899 9.2 Python

The All in One Framework to build Awesome Scrapers.

Project mention: This Week In Python | dev.to | 2024-04-05

botasaurus – The All in One Framework to build Awesome Scrapers

till

5 807 1.8 Go

DataHen Till is a companion tool to your existing web scraper that instantly makes it scalable, maintainable, and more unblockable, with minimal code changes on your scraper. Integrates with any scraper in 5 minutes.
spidr

0 792 6.4 Ruby

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use. (by postmodern)
scrapy-fake-useragent

3 681 2.3 Python

Random User-Agent middleware based on fake-useragent
gogoanime-api

2 670 3.4 JavaScript

Anime Streaming, Discovery API made with Cheerio and Express. Uses data from Gogoanime

Project mention: I created an anime website . | /r/developersIndia | 2023-06-29

web-scraping

43 617 0.0 Python

Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, MacroTrends, SHFE and alternative data crawlers on Tomtom, BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist

Project mention: web-scraping: NEW Data - star count:554.0 | /r/algoprojects | 2023-09-25

google-maps-scraper

5 601 7.1 Go

scrape data data from Google Maps. Extracts data such as the name, address, phone number, website URL, rating, reviews number, latitude and longitude, reviews,email and more for each place (by gosom)

Project mention: Show HN: A Google Maps Scraper | news.ycombinator.com | 2023-12-03

google-search-results-python

4 517 4.5 Python

Google Search Results via SERP API pip Python Package

Project mention: Make Direct Async Requests to SerpApi with Python | dev.to | 2023-05-24

In this blog post we'll cover on how to make direct requests to serpapi.com/search.json without using SerpApi's google-search-results Python client.

SaaSHub

www.saashub.com sponsored

SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

NOTE: The open source projects on this list are ordered by number of github stars. The number of mentions indicates repo mentiontions in the last 12 Months or since we started tracking (Dec 2020).

web-scraping related posts

Launching Crawlee Blog: Your Node.js resource hub for web scraping and automation.
1 project | dev.to | 26 Feb 2024
wayback-machine-scraper: NEW Data - star count:380.0
1 project | /r/algoprojects | 10 Dec 2023
Anything like scrapy in other languages?
1 project | /r/webscraping | 10 Dec 2023
Show HN: A Google Maps Scraper
1 project | news.ycombinator.com | 3 Dec 2023
Trafilatura: Python tool to gather text on the Web
3 projects | news.ycombinator.com | 14 Aug 2023
Google Maps Scraper in Golang
1 project | news.ycombinator.com | 27 Jul 2023
Best web scraping framework to learn
1 project | /r/webscraping | 12 Jul 2023
A note from our sponsor - SaaSHub
www.saashub.com | 28 Apr 2024

SaaSHub helps you find the best software and product alternatives Learn more →

Index

What are some of the best open-source web-scraping projects? This list will help you:

	Project	Stars
1	Scrapy	50,896
2	changedetection.io	14,870
3	crawlee	12,129
4	Douyin_TikTok_Download_API	6,780
5	awesome-web-scraping	6,308
6	autoscraper	5,937
7	rod	4,784
8	trafilatura	2,778
9	snoop	2,683
10	Grab	2,354
11	pythoncode-tutorials	2,002
12	rvest	1,470
13	core	1,319
14	curl_cffi	1,330
15	selectolax	967
16	botasaurus	899
17	till	807
18	spidr	792
19	scrapy-fake-useragent	681
20	gogoanime-api	670
21	web-scraping	617
22	google-maps-scraper	601
23	google-search-results-python	517