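Top 23 Scraper Open-Source Projects

Each entry's stats line gives, in order: number of mentions, GitHub stars, activity score, and primary language.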

Huginn

121 41,523 7.2 Ruby

Create agents that monitor and act on your behalf. Your agents are standing by!

Project mention: Create agents that monitor and act on your behalf | news.ycombinator.com | 2024-03-24

cheerio

50 27,780 9.7 TypeScript

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

Project mention: 8 NPM Packages for JavaScript Beginners [2024][+tutorials] | dev.to | 2024-04-02

Cheerio is your ticket to the world of server-side magic, allowing you to manipulate HTML and XML documents with jQuery-like syntax. It’s perfect for web scraping, data extraction, or just making sense of the mess that is web content. With Cheerio, you get to play around with the DOM, use CSS selectors, and basically do all the cool things you'd do in the browser, but server-side.
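
To make the idea of server-side HTML parsing and selection concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. It is not Cheerio and does not reproduce Cheerio's API; the `ClassTextExtractor` class, the `select_text` helper, and the sample markup are all hypothetical names invented for this illustration. It only mimics a selector such as `li.item` by matching a tag name plus a class name.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element matching a tag name and a class name."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside a matching element
        self.buffer = []    # text fragments of the current match
        self.results = []   # finished matches, in document order

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements with the same tag so we close correctly.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.class_name in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def select_text(html, tag, class_name):
    """Return the text of all <tag class="... class_name ..."> elements."""
    parser = ClassTextExtractor(tag, class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample markup, used only for this illustration.
SAMPLE_HTML = """
<ul>
  <li class="item featured">Huginn</li>
  <li class="item">cheerio</li>
  <li class="other">ignored</li>
</ul>
"""

if __name__ == "__main__":
    print(select_text(SAMPLE_HTML, "li", "item"))
```

A real selector engine handles far more (descendant combinators, attributes, pseudo-classes); this sketch covers only the single tag-plus-class case.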

lux

13 25,247 7.4 Go

👾 Fast and simple video download library and CLI tool written in Go

Project mention: Bilibili download stalls at around 30-60% | /r/youtubedl | 2023-05-18

Not a fix, but I tend to use lux when downloading from bilibili. It is faster too.

colly

39 22,165 6.0 Go

Elegant Scraper and Crawler Framework for Golang

Project mention: Scraping the full snippet from Google search result | dev.to | 2024-01-01

SerpApi focuses on scraping search results. That's why we need extra help to scrape individual sites. We'll use GoColly package.

newspaper

13 13,720 0.0 Python

newspaper3k is a library for news, full-text, and article metadata extraction in Python 3.
crawlee

29 12,129 9.8 TypeScript

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Project mention: How to scrape Amazon products | dev.to | 2024-04-01

In this guide, we'll be extracting information from Amazon product pages using the power of TypeScript in combination with the Cheerio and Crawlee libraries. We'll explore how to retrieve and extract detailed product data such as titles, prices, image URLs, and more from Amazon's vast marketplace. We'll also discuss handling potential blocking issues that may arise during the scraping process.

chinese-xinhua

1 10,641 0.0 Python

📙 Chinese Xinhua Dictionary database, including xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters.
Douyin_TikTok_Download_API

3 6,780 8.2 Python

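🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance asynchronous data scraping tool for Douyin, Kuaishou, TikTok, and Bilibili, supporting API calls as well as online batch parsing and downloading.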

Project mention: TikTok video scraper | /r/webscraping | 2023-05-23

At the moment I am working on a web scraper for TikTok. At the moment, I am able to retrieve data about the first 16 videos from a channel. The way I achieved this was to make requests to an unofficial API https://github.com/Evil0ctal/Douyin_TikTok_Download_API. My problem is that the requirements for this project do not allow me to use any package that would extract data from TikTok. I would like to ask you all, how should I go about this task. Already tried getting data from the HTML, but is not sufficient since most of it is not displayed when I use requests.get(URL). Could you please recommend some repositories that could help or some way of extracting the data? Thank you!

awesome-crawler

7 6,078 1.5

A collection of awesome web crawlers and spiders in different languages
autoscraper

9 5,937 0.0 Python

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Ferret

0 5,616 3.1 Go

Declarative web scraping
rod

20 4,750 7.9 Go

A Devtools driver for web automation and scraping

Project mention: Need help authenticating to Okta programatically. | /r/okta | 2023-07-03

I have tried the following. 1. Log in to Okta via a browser programmatically using go-rod, which I managed to do successfully, but I'm failing to load Slack as it's stuck on Slack's browser loader screen. 2. I tried to authenticate via the Okta RESTful API. So far, I have managed to authenticate using {{domain}}/api/v1/authn, and then subsequently using MFA via the verify endpoint {{domain}}/api/v1/authn/factors/{{factorID}}/verify, which returns a sessionToken. From here, I can successfully create a sessionCookie, which has proven quite useless to me. Perhaps I am doing it wrong.

myGPTReader

6 4,375 8.6 Python

A community-driven way to read and chat with AI bots - powered by chatGPT.
node-ytdl-core

15 4,280 2.8 JavaScript

YouTube video downloader in JavaScript.

Project mention: Simple Youtube Downloader in under 50 Javascript lines | dev.to | 2023-11-04

ytdl-core for streaming from Youtube

snscrape

29 4,224 7.3 Python

A social networking service scraper in Python

Project mention: Can someone walk me through this? | /r/learnpython | 2023-11-21

Here's what I'm trying to use: https://github.com/JustAnotherArchivist/snscrape What do I need to open/run any of this? My goal with this is to extract my follower list off Twitter, and I'd very much like to know how to run it on my machine instead of having someone run it for me on theirs. I can't even figure out what I need to open the Readme file.

scrape-it

2 3,978 5.7 JavaScript

🔮 A Node.js scraper for humans.

Project mention: Built a website to help you find... pocket knives! | /r/reactjs | 2023-06-05

https://github.com/IonicaBizau/scrape-it for simple code to target dom elements with classes/ids etc...

browser-fingerprinting

8 3,830 1.0 JavaScript

Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot system 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?

Project mention: A site that tracks the price of a Big Mac in every US McDonald's | news.ycombinator.com | 2024-01-13

Yes, there is a lot written about it. Here is one link I have saved:
https://github.com/niespodd/browser-fingerprinting

Emby.Plugins.JavScraper

1 3,132 0.0 C#

A Japanese movie scraper plugin for Emby/Jellyfin that can fetch film information from certain websites.
Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE

8 3,053 3.9 Python

Do you want to LEARN NEW STUFF for FREE? Don't worry, with the power of web-scraping and automation, this script will find the necessary Udemy coupons & enroll you for PAID UDEMY COURSES, ABSOLUTELY FREE!
Geziyor

2 2,475 0.6 Go

Geziyor, blazing fast web crawling & scraping framework for Go. Supports JS rendering.

Project mention: Show HN: I scraped 25M Shopify products to build a search engine | news.ycombinator.com | 2023-12-13

As someone who has scraped millions of items myself, I had success using Geziyor (https://github.com/geziyor/geziyor) built in Go. Shopify sites are especially easy to scrape because they tend to share the same product data formatting and don't hide it behind JS rendering.
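
The point about shared product data formatting can be illustrated with a short Python sketch. It assumes, as is commonly reported but not stated in the comment above, that a storefront serves its catalog as JSON (for example at a `/products.json` path) with a `products` list whose items carry `title`, `handle`, and `variants` with `title` and `price`. Those field names, the `flatten_products` helper, and the sample payload are assumptions made for illustration; check them against a real response before relying on them. The sketch performs no network requests and does not use Geziyor.

```python
def flatten_products(payload):
    """Turn a products payload into one flat row per product variant.

    The payload shape (products -> variants -> price) is an assumption
    for this sketch, not a documented guarantee.
    """
    rows = []
    for product in payload.get("products", []):
        for variant in product.get("variants", []):
            rows.append({
                "title": product.get("title"),
                "handle": product.get("handle"),
                "variant": variant.get("title"),
                "price": variant.get("price"),
            })
    return rows


# Hypothetical sample payload standing in for a parsed JSON response.
SAMPLE_PAYLOAD = {
    "products": [
        {
            "title": "Pocket Knife",
            "handle": "pocket-knife",
            "variants": [
                {"title": "Small", "price": "19.99"},
                {"title": "Large", "price": "24.99"},
            ],
        },
        {
            "title": "Sharpening Stone",
            "handle": "sharpening-stone",
            "variants": [{"title": "Default Title", "price": "12.00"}],
        },
    ]
}

if __name__ == "__main__":
    for row in flatten_products(SAMPLE_PAYLOAD):
        print(row)
```

Because every store would use the same shape under this assumption, one parsing function like this could be reused across sites, which is what makes such catalogs comparatively easy to collect.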

freeDictionaryAPI

4 2,335 0.0 JavaScript

There was no free Dictionary API on the web when I wanted one for my friend, so I created one.

Project mention: Feedback on new game | /r/wordgames | 2023-05-23

I see that the request to dictionaryapi.dev comes directly from the client. Looking at the github for that project, I see this:

google-play-scraper

4 2,218 5.9 JavaScript

Node.js scraper to get data from Google Play
bulk-downloader-for-reddit

80 2,203 0.0 Python

Downloads and archives content from reddit

Project mention: BDFR skipping Reddit hosted videos | /r/DataHoarder | 2023-10-21


NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Scraper related posts

Tutorial: Extracting structured data from websites using Groq and Firecrawl
1 project | news.ycombinator.com | 22 Apr 2024
Open Sanctions
1 project | news.ycombinator.com | 17 Apr 2024
How to scrape Amazon products
4 projects | dev.to | 1 Apr 2024
Create agents that monitor and act on your behalf
1 project | news.ycombinator.com | 24 Mar 2024
Show HN: Fitter – configurable open-source scraper
1 project | news.ycombinator.com | 14 Jan 2024
Scraping the full snippet from Google search result
3 projects | dev.to | 1 Jan 2024
Help regarding workflow
2 projects | /r/ObsidianMD | 7 Dec 2023

Index

What are some of the best open-source Scraper projects? This list will help you:

#	Project	Stars
1	Huginn	41,523
2	cheerio	27,780
3	lux	25,247
4	colly	22,165
5	newspaper	13,720
6	crawlee	12,129
7	chinese-xinhua	10,641
8	Douyin_TikTok_Download_API	6,780
9	awesome-crawler	6,078
10	autoscraper	5,937
11	Ferret	5,616
12	rod	4,750
13	myGPTReader	4,375
14	node-ytdl-core	4,280
15	snscrape	4,224
16	scrape-it	3,978
17	browser-fingerprinting	3,830
18	Emby.Plugins.JavScraper	3,132
19	Automatic-Udemy-Course-Enroller-GET-PAID-UDEMY-COURSES-for-FREE	3,053
20	Geziyor	2,475
21	freeDictionaryAPI	2,335
22	google-play-scraper	2,218
23	bulk-downloader-for-reddit	2,203