Top 23 Webscraping Open-Source Projects

Huginn

121 41,523 7.2 Ruby

Create agents that monitor and act on your behalf. Your agents are standing by!

Project mention: Create agents that monitor and act on your behalf | news.ycombinator.com | 2024-03-24

ani-cli

37 6,577 7.8 Shell

A cli tool to browse and play anime

Project mention: Rule | /r/196 | 2023-05-18

awesome-web-scraping

6 6,308 5.1 Makefile

List of libraries, tools and APIs for web scraping and data processing.
autoscraper

9 5,937 0.0 Python

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
browser-fingerprinting

8 3,830 1.0 JavaScript

Analysis of bot protection systems with available countermeasures 🚿. How to defeat anti-bot systems 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web?

Project mention: A site that tracks the price of a Big Mac in every US McDonald's | news.ycombinator.com | 2024-01-13

Yes, there is a lot written about it. Here is one link I have saved:
https://github.com/niespodd/browser-fingerprinting

soup

4 2,125 0.0 Go

Web Scraper in Go, similar to BeautifulSoup
webscraping-from-0-to-hero

1 1,453 5.8

The Web Scraping Open Project repository aims to share knowledge and experience about web scraping with Python

Project mention: Web Scraping from 0 to hero – Sharing knowledge about web scraping on GH | news.ycombinator.com | 2023-07-06

scrapeghost

10 1,390 8.2 Python

👻 Experimental library for scraping websites using OpenAI's GPT API.
requests-cache

7 1,254 8.7 Python

Transparent persistent cache for Python requests
CrossLinked

1 1,146 5.6 Python

LinkedIn enumeration tool to extract valid employee names from an organization through search engine scraping
gazpacho

1 730 3.2 Python

🥫 The simple, fast, and modern web scraping library
xidel

18 651 5.6 Pascal

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.

Project mention: Move over jq I found something easier: fx | news.ycombinator.com | 2023-06-06

You could try Xidel[1]. It supports JSON, XML and HTML using XPath/XQuery 3.1
It has some extensions to the standard that are pretty nice (JSONiq, CSS selectors, html “template” matching), but you can limit it to just standard XPath/XQuery if you like.
I recommend getting the nightly v .99 build if you give it a try; the stable .98 version is pretty old, and I’ve had no issues with .99.
1. https://www.videlibri.de/xidel.html

NYTimes-App

3 507 0.0 Kotlin

🗽 A Simple Demonstration of the New York Times App 📱 using Jsoup web crawler with MVVM Architecture 🔥
tarsier

2 486 9.2 Jupyter Notebook

Vision utilities for web interaction agents 👀

Project mention: Control the browser using GPT-4 vision by AgentGPT team | news.ycombinator.com | 2023-11-12

morph

1 463 0.0 Ruby

Take the hassle out of web scraping (by openaustralia)
dude

28 413 9.0 Python

dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators

Project mention: Webscraping beginner here ready to start leveling up to intermediate. Looking for some good webscraping repositories (e.g any of your GitHub repos/projects) that I can use as learning tools, and general recommendations for what to do next | /r/webscraping | 2023-05-08

Please check https://github.com/roniemartinez/dude

mov-cli

2 395 9.5 Python

Watch everything from your terminal.
r-web-scraping-cheat-sheet

1 378 0.0 R

Guide, reference and cheat sheet on web scraping using rvest, httr and RSelenium.
Rcrawler

2 344 0.0 R

An R web crawler and scraper
TikTokBot

1 341 0.0 Python

A TikTok bot that downloads trending TikTok videos and compiles them using FFmpeg
polite

2 322 5.3 R

Be nice on the web
EasyApplyJobsBot

2 317 7.9 Python

A Python bot that automatically applies to Easy Apply jobs on LinkedIn, Glassdoor, etc., based on your preferences. Auto login, auto fill additional questions, apply automatically!

Project mention: Candidates' experience with a senior job opening | /r/brdev | 2023-05-08

zimit

9 231 7.9 Python

Make a ZIM file from any Web site and surf offline!

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Webscraping related posts

Create agents that monitor and act on your behalf
1 project | news.ycombinator.com | 24 Mar 2024
How To Scrape TikTok in 2024
1 project | dev.to | 8 Mar 2024
Direction Of The Stock Market
1 project | /r/StockMarket | 6 Dec 2023
And I thought amazing fics suddenly being deleted was a myth
1 project | /r/AO3 | 18 Nov 2023
Control the browser using GPT-4 vision by AgentGPT team
1 project | news.ycombinator.com | 12 Nov 2023
Show HN: Open-Source Desktop AI Webscraper
1 project | news.ycombinator.com | 15 Oct 2023
ThreatMetrix (anti-bot/fraud-detection) solver, deobfuscator & data harvester
1 project | /r/webscraping | 25 Aug 2023

Index

What are some of the best open-source Webscraping projects? This list will help you:

#	Project	Stars
1	Huginn	41,523
2	ani-cli	6,577
3	awesome-web-scraping	6,308
4	autoscraper	5,937
5	browser-fingerprinting	3,830
6	soup	2,125
7	webscraping-from-0-to-hero	1,453
8	scrapeghost	1,390
9	requests-cache	1,254
10	CrossLinked	1,146
11	gazpacho	730
12	xidel	651
13	NYTimes-App	507
14	tarsier	486
15	morph	463
16	dude	413
17	mov-cli	395
18	r-web-scraping-cheat-sheet	378
19	Rcrawler	344
20	TikTokBot	341
21	polite	322
22	EasyApplyJobsBot	317
23	zimit	231