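Top 23 Scraping Open-Source Projects

Each entry below lists, in order: mentions, GitHub stars, activity score, and primary language (where reported).

Scrapy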

180 50,824 9.6 Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

Project mention: Scrapy: A Fast and Powerful Scraping and Web Crawling Framework | news.ycombinator.com | 2024-02-16

colly

39 22,120 6.0 Go

Elegant Scraper and Crawler Framework for Golang

Project mention: Scraping the full snippet from Google search result | dev.to | 2024-01-01

SerpApi focuses on scraping search results. That's why we need extra help to scrape individual sites. We'll use the GoColly package.

requests-html

14 13,575 0.0 Python

Pythonic HTML Parsing for Humans™

crawlee

29 12,044 9.8 TypeScript

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Project mention: How to scrape Amazon products | dev.to | 2024-04-01

In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

webmagic

1 11,232 7.0 Java

A scalable web crawler framework for Java.

undetected-chromedriver

40 8,018 7.1 Python

Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)

Project mention: ad_clicker premium - Google/Bing Ads Clicker | /r/IMadeThis | 2023-12-08

This command-line tool clicks ads for a certain query on Google/Bing search using undetected_chromedriver package. Supports proxy, running multiple simultaneous browsers, ad targeting/exclusion, and running in loop.

tabula

11 6,511 2.8 CSS

Tabula is a tool for liberating data tables trapped inside PDF files

Project mention: Automatisches Auslesen von PDFs [Automatic extraction of data from PDFs] | /r/de_EDV | 2023-05-16

awesome-web-scraping

6 6,308 5.1 Makefile

List of libraries, tools and APIs for web scraping and data processing.

autoscraper

9 5,937 0.0 Python

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

Ferret

0 5,616 3.1 Go

Declarative web scraping

Mechanize

1 4,354 7.2 Ruby

Mechanize is a Ruby library that makes automated web interaction easy.

Data-science

1 3,950 7.5 Jupyter Notebook

Collection of useful data science topics along with articles, videos, and code (by khuyentran1401)

fake-useragent

1 3,459 9.0 Python

Up-to-date simple useragent faker with real world database

Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE

8 3,048 3.9 Python

Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!

Symfony Panther

15 2,882 6.0 PHP

A browser testing and web crawling library for PHP and Symfony

Project mention: Any good resources on how to do “interaction” tests? | /r/PHP | 2023-05-27

Use some library with a WebDriver API (e.g., Symfony Panther if you want to stick to PHP, or Playwright) to run tests against your WP backend.

trafilatura

13 2,740 8.4 Python

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Project mention: Trafilatura: Python tool to gather text on the Web | news.ycombinator.com | 2023-08-14

The feature list answers that question pretty well: https://github.com/adbar/trafilatura#features
Basically: you could implement all of this on top of BeautifulSoup - polite crawling policies, sitemap and feed parsing, URL de-duplication, parallel processing, download queues, heuristics for extracting just the main article content, metadata extraction, language detection... but it would require writing an enormous amount of extra code.
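
To give a sense of the "extra code" the comment refers to, here is a minimal sketch of just one item from that list, URL de-duplication, written with the Python standard library only. It is not trafilatura's implementation; the function names `normalize_url` and `deduplicate` are hypothetical, and the normalization rules (lowercasing the host, dropping default ports, fragments, and trailing slashes) are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: any user:password part is dropped
    # Keep the port only when it is not the default for the scheme.
    default_ports = {"http": 80, "https": 443}
    if parts.port and parts.port != default_ports.get(scheme):
        host = f"{host}:{parts.port}"
    # Assumption: a trailing slash does not change which page is served.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment, since it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def deduplicate(urls):
    """Yield each distinct page once, in first-seen order."""
    seen = set()
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            yield key
```

A crawler would pass every discovered link through `deduplicate` before queueing it. Each of the other features named in the comment (politeness, sitemap parsing, main-content heuristics) would need comparable code of its own.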

snoop

7 2,683 9.4 Python

Snoop: an open-source intelligence (OSINT) reconnaissance tool (OSINT world)

Project mention: Osint update of the Snoop Project tool search for user by nickname | news.ycombinator.com | 2024-01-02

Geziyor

2 2,475 0.6 Go

Geziyor, blazing fast web crawling & scraping framework for Go. Supports JS rendering.

Project mention: Show HN: I scraped 25M Shopify products to build a search engine | news.ycombinator.com | 2023-12-13

As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.
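
The point about shared formatting can be illustrated with a short sketch: when every store serves product data in the same layout, one parser covers all of them. The example below uses Python rather than Go and does not involve Geziyor. The JSON layout, the field names (`products`, `title`, `variants`, `price`), the sample payloads, and the function `parse_products` are all assumptions made up for illustration.

```python
import json


def parse_products(payload: str, store: str) -> list[dict]:
    """Flatten one store's product JSON into uniform rows (one row per variant)."""
    rows = []
    for product in json.loads(payload).get("products", []):
        for variant in product.get("variants", []):
            rows.append({
                "store": store,
                "title": product.get("title", ""),
                "price": float(variant.get("price", 0)),
            })
    return rows


# Two hypothetical stores serving the same (assumed) layout.
SAMPLE_A = '{"products": [{"title": "Mug", "variants": [{"price": "12.50"}]}]}'
SAMPLE_B = '{"products": [{"title": "Tee", "variants": [{"price": "20.00"}, {"price": "22.00"}]}]}'

# The same function handles both stores because the format is shared.
catalog = parse_products(SAMPLE_A, "store-a") + parse_products(SAMPLE_B, "store-b")
```

Because the data is plain JSON rather than content rendered by JavaScript, no headless browser is needed in this sketch; a site with a different layout would need its own parser.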

Grab

0 2,354 3.0 Python

Web Scraping Framework

awesome-puppeteer

1 2,318 3.0

A curated list of awesome puppeteer resources.

facebook-scraper

13 2,177 3.7 Python

Scrape Facebook public pages without an API key

Project mention: scraping instagram without selenium | /r/webscraping | 2023-06-30

Afaik on Facebook there are no such APIs, only good old HTML parsing, check out this project for example https://github.com/kevinzg/facebook-scraper (most of the parsing code is here https://github.com/kevinzg/facebook-scraper/blob/master/facebook_scraper/extractors.py )

DiDOM

0 2,173 0.6 PHP

Simple and fast HTML and XML parser

headless-chromium-php

10 2,148 7.2 PHP

Instrument headless chrome/chromium instances from PHP

Project mention: HTML to PDF package that supports RTL language? | /r/laravel | 2023-05-11

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scraping related posts

2024-03-01 listening in on the neighborhood
5 projects | news.ycombinator.com | 2 Mar 2024
Scrapy: A Fast and Powerful Scraping and Web Crawling Framework
1 project | news.ycombinator.com | 16 Feb 2024
Connect OpenAI with external APIs with Function calling
2 projects | dev.to | 28 Dec 2023
A command-line utility for taking automated screenshots of websites
1 project | news.ycombinator.com | 15 Dec 2023
Direction Of The Stock Market
1 project | /r/StockMarket | 6 Dec 2023
Show HN: Flyscrape – A standalone and scriptable web scraper in Go
6 projects | news.ycombinator.com | 11 Nov 2023
What are you going to do the day wi-fi/data shuts off?
1 project | /r/preppers | 12 Sep 2023

Index

What are some of the best open-source Scraping projects? This list will help you:

#	Project	Stars
1	Scrapy	50,824
2	colly	22,120
3	requests-html	13,575
4	crawlee	12,044
5	webmagic	11,232
6	undetected-chromedriver	8,018
7	tabula	6,511
8	awesome-web-scraping	6,308
9	autoscraper	5,937
10	Ferret	5,616
11	Mechanize	4,354
12	Data-science	3,950
13	fake-useragent	3,459
14	Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE	3,048
15	Symfony Panther	2,882
16	trafilatura	2,740
17	snoop	2,683
18	Geziyor	2,475
19	Grab	2,354
20	awesome-puppeteer	2,318
21	facebook-scraper	2,177
22	DiDOM	2,173
23	headless-chromium-php	2,148