Ask HN: What are the best tools for web scraping in 2022?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Playwright

    Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

    For simple scraping where the content is fairly static, or when performance is critical, I will use linkedom to process pages.

    https://github.com/WebReflection/linkedom

    When the content is complex or involves clicking, Playwright is probably the best tool for the job.

    https://github.com/microsoft/playwright
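
    As an illustration of the dynamic-content case, here is a minimal sketch using Playwright's Python sync API; the URL and selector are placeholders, not taken from the thread:

        from playwright.sync_api import sync_playwright

        # Launch a headless browser, click through to content that only
        # appears after JavaScript runs, then read the result.
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto("https://example.com")        # placeholder URL
            page.click("text=More information")     # placeholder selector
            print(page.title())
            browser.close()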

  • estela

    estela, an elastic web scraping cluster 🕸

    estela is an elastic web scraping cluster running on Kubernetes. It provides mechanisms to deploy, run and scale web scraping spiders via a REST API and a web interface.

    It is a modern alternative to the few OSS projects available for such needs, like scrapyd and gerapy. estela aims to help web scraping teams and individuals who are considering moving away from proprietary scraping clouds, or who are designing their on-premise scraping architecture, so that they don't needlessly reinvent the wheel and can benefit from built-in scalability and elasticity from the get-go.

    estela was recently published as OSS under the MIT license:

    https://github.com/bitmakerla/estela

    More details about it can be found in the release blog post and the official documentation:

    https://bitmaker.la/blog/2022/06/24/estela-oss-release.html

    https://estela.bitmaker.la/docs/

    estela supports Scrapy spiders for the time being, but additional frameworks/languages are on the roadmap.

    All kinds of feedback and contributions are welcome!

    Disclaimer: I'm part of the development team behind estela :-)

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

  • puppeteer

    Node.js API for Chrome

  • curl-impersonate

    curl-impersonate: A special build of curl that can impersonate Chrome & Firefox

    curl-impersonate[1] is a curl fork that I maintain and which lets you fetch sites while impersonating a browser. Unfortunately, TLS and HTTP fingerprinting of web clients has become extremely common over the past year or so, which means a regular curl request will often get back a JS challenge instead of the real content. curl-impersonate helps with that.

    [1] https://github.com/lwthiker/curl-impersonate
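
    As a sketch, shelling out to one of the wrapper scripts the project ships (names like curl_chrome104 vary by release, so adjust to your build):

        import subprocess

        # Fetch a page while presenting Chrome's TLS/HTTP fingerprint.
        # curl_chrome104 is one of curl-impersonate's bundled wrappers;
        # the exact name depends on the release you installed.
        result = subprocess.run(
            ["curl_chrome104", "-s", "https://example.com"],  # placeholder URL
            capture_output=True, text=True, check=True,
        )
        print(result.stdout[:200])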

  • polite

    Be nice on the web

    The polite package for R is intended to be a friendly way of scraping content while being considerate of the site owner. "The three pillars of a polite session are seeking permission, taking slowly and never asking twice."

    https://github.com/dmi3kno/polite

  • ssscraper

    A crawler/scraper based on golang + colly, configurable via JSON

    For a particular type of scraping, we wrote SSScraper on top of Colly and it works really well:

    https://github.com/gotripod/ssscraper/

  • wistalk

    Wistalk: Analyze Wikipedia User's Activity

    Beautiful Soup gets the job done. I made several apps using it.

    [1] https://github.com/altilunium/wistalk (Scrapes Wikipedia to analyze a user's activity)
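
    For context, the Beautiful Soup workflow behind apps like these is compact; a minimal sketch (URL and selector are placeholders):

        import requests
        from bs4 import BeautifulSoup

        # Fetch a page and extract every link target via a CSS selector.
        html = requests.get("https://en.wikipedia.org/wiki/Web_scraping",
                            timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.select("a[href]"):
            print(a["href"])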

  • psedex

    Psedex

    [2] https://github.com/altilunium/psedex (Scrapes a government website to get a list of all registered online services in Indonesia)

  • makalahIF

    Papers from https://informatika.stei.itb.ac.id/~rinaldi.munir/

  • wi-page

    Rank Wikipedia Article's Contributors by Byte Counts.

    [4] https://github.com/altilunium/wi-page (Scrapes Wikipedia to find the most active contributors to a given article)

  • arachnid

    Web spider (by altilunium)

  • powerpage-web-crawler

    A portable, lightweight web crawler using Powerpage.

    It depends. For a no-code solution, check out [powerpage-web-crawler](https://github.com/casualwriter/powerpage-web-crawler) for crawling blogs/posts.

  • undetected-chromedriver

    Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / Cloudflare IUAM)
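
    Usage mirrors plain Selenium; a minimal sketch (the target URL is a placeholder):

        import undetected_chromedriver as uc

        # undetected-chromedriver patches ChromeDriver so the session is
        # harder for bot-mitigation systems to flag; the resulting driver
        # behaves like a regular Selenium WebDriver.
        driver = uc.Chrome()
        driver.get("https://example.com")  # placeholder URL
        print(driver.title)
        driver.quit()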

  • medusa-crawler

    The Official Medusa Crawler gem

  • linkedom

    A triple-linked lists based DOM implementation.

  • chrome-aws-lambda

    Chromium Binary for AWS Lambda and Google Cloud Functions

  • browserless

    Deploy headless browsers in Docker. Run on our cloud or bring your own. Free for non-commercial uses.

  • cheerio

    The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

    If the content you need is static, I like using Node + cheerio [0], as the selector syntax is quite powerful. If there is some JavaScript execution involved, however, I will fall back to puppeteer.

    [0] - https://cheerio.js.org/

  • pup

    Parsing HTML at the command line

    Unpopular opinion, but Bash/shell scripting. Seriously, it's probably the fastest way to get things done. For fetching, use cURL. Want to extract particular markup? Use pup[1]. Want to process CSV? Use csvkit[2]. Or JSON? Use jq[3]. Want to use a DB? Use psql. Once you get the hang of shell scripting, you can create scrapers by just wiring up these utils.

    The only thing I wish for is better regex support. Bash and most Unix tools don't support PCRE, which can be severely limiting. Plus, sometimes you want to process text as a whole vs. line-by-line.

    I would also recommend Python's sh[4] module if shell scripting isn't your cup of tea. You get the best of both worlds: the ease of Bash utils with a saner syntax.

    [1]: https://github.com/ericchiang/pup
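
    As a sketch of the sh-module approach suggested above, the same curl-pipe-jq wiring from Python (the URL and jq filter are just examples):

        import sh

        # Fetch a JSON API with curl and extract one field with jq,
        # exactly as you would in a shell pipeline.
        body = sh.curl("-s", "https://api.github.com/repos/ericchiang/pup")
        print(sh.jq("-r", ".description", _in=body))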

  • jq

    Discontinued Command-line JSON processor [Moved to: https://github.com/jqlang/jq] (by stedolan)

  • colly

    Elegant Scraper and Crawler Framework for Golang

    I’m not sure about “best” but I’ve been using Colly (written in Go) and it’s been pretty slick. Haven’t run into anything it can’t do.

    http://go-colly.org/

  • bumblebee-Old-and-abbandoned

    OUTDATED!!!!! - Replaced by "The Bumblebee Project" and "Ironhide"

    My main qualms with bash as a scripting language are that its syntax is not only kind of bonkers (no judgement, I know it's an old tool) but also just crazily unsafe. I link to a few high-profile things whenever people ask me why my mantra is "the time to switch your script from bash to python is when you want to delete things".

    >rm -rf /usr /lib/nvidia-current/xorg/xorg

    https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commi...

    >rm -rf "$STEAMROOT/"*

    https://github.com/valvesoftware/steam-for-linux/issues/3671

    It's just too easy to shoot yourself in the foot.

  • steam-for-linux

    Issue tracking for the Steam for Linux beta client

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    If you decide to have your own infrastructure, you can use https://github.com/scrapy/scrapyd.
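
    For reference, a minimal Scrapy spider; the target is Scrapy's own sandbox site, and the class and field names are just illustrative:

        import scrapy

        class QuotesSpider(scrapy.Spider):
            name = "quotes"
            start_urls = ["https://quotes.toscrape.com/"]

            def parse(self, response):
                # Yield one item per quote block on the page.
                for quote in response.css("div.quote"):
                    yield {
                        "text": quote.css("span.text::text").get(),
                        "author": quote.css("small.author::text").get(),
                    }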

  • scrapyd

    A service daemon to run Scrapy spiders

  • scrapy-redis

    Redis-based components for Scrapy.

    With some work, you can use Scrapy for distributed projects that are scraping thousands (millions) of domains. We are using https://github.com/rmax/scrapy-redis.
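
    The switch is mostly configuration; a sketch of the relevant Scrapy settings, assuming a local Redis instance (the URL is a placeholder):

        # settings.py: hand scheduling and dupe-filtering to Redis so
        # multiple spider processes can share a single crawl queue.
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
        SCHEDULER_PERSIST = True  # keep the queue across restarts
        REDIS_URL = "redis://localhost:6379"  # placeholder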

  • crawlee

    Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

    I'm working on a personal project that involves A LOT of scraping, and through several iterations I've gotten some stuff that works quite well. Here's a quick summary of what I've explored (both paid and free):

    * Apify (https://apify.com/) is a great, comprehensive system if you need to get fairly low-level. Everything is hosted there, they've got their own proxy service (or you can roll your own), and their open source framework (https://github.com/apify/crawlee) is excellent.

    * I've also experimented with running both their SDK (crawlee) and Playwright directly on Google Cloud Run, and that also works well and is an order of magnitude less expensive than running directly on their platform.

    * Bright Data (née Luminati) is excellent for cheap data center proxies ($0.65/GB pay as you go), but prices get several orders of magnitude higher if you need anything more thorough than data center proxies.

    * For some direct API crawls that I do, all of the scraping stuff is unnecessary and I just ping the APIs directly.

    * If the site you're scraping uses any sort of anti-bot protection, I've found that ScrapingBee (https://www.scrapingbee.com/) is by far the easiest solution. I spent many, many hours fighting anti-bot protection myself with some combination of Bright Data, Apify and Playwright, and in the end I kinda stopped battling and just decided to let ScrapingBee deal with it for me. I may be lucky in that the sites I'm scraping don't use JS heavily, so the plain vanilla, no-JS ScrapingBee service works almost all of the time for those. Otherwise it can get quite expensive if you need JS rendering, premium proxies, etc. But a big thumbs up to them for making it really easy.

    Always looking for new techniques and tools, so I'll monitor this thread closely.
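
    For what it's worth, the ScrapingBee call mentioned above is a single HTTP request; a sketch per their public API (key and target URL are placeholders):

        import requests

        # One GET against ScrapingBee's API; render_js=false matches the
        # cheaper no-JS tier discussed above.
        resp = requests.get(
            "https://app.scrapingbee.com/api/v1/",
            params={
                "api_key": "YOUR_API_KEY",     # placeholder
                "url": "https://example.com",  # placeholder target
                "render_js": "false",
            },
            timeout=60,
        )
        print(resp.status_code, resp.text[:200])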

  • google-search-results-php

    Google Search Results PHP API via Serp Api

    We've built https://serpapi.com

    We pioneered what you're referring to as "data type specific APIs": APIs that abstract away all proxy issues, CAPTCHA solving, support for various layouts, even scraping-related legal issues, and much more, down to a clean JSON response on every single call. It was a lot of work, but our success rate and response times now rival non-scraping commercial APIs: https://serpapi.com/status

    I think the next battle will still be legal, despite all the wins in favor of scraping public pages and the common-sense understanding that this is the way to go. The EFF has been doing amazing work in this space, and we are proud to be a significant yearly contributor to the EFF.
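
    A sketch of what such a call looks like (parameters per SerpApi's public docs; the key is a placeholder):

        import requests

        # One GET returns Google results already parsed into JSON.
        resp = requests.get(
            "https://serpapi.com/search",
            params={
                "engine": "google",
                "q": "web scraping",
                "api_key": "YOUR_API_KEY",  # placeholder
            },
            timeout=30,
        )
        for result in resp.json().get("organic_results", []):
            print(result.get("title"))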

  • Webscraping Open Project

    Discontinued The web scraping open project repository aims to share knowledge and experiences about web scraping with Python [Moved to: https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero]

    I’m collecting my experience with these tools in this “web scraping open knowledge project” on GitHub (https://github.com/reanalytics-databoutique/webscraping-open...) and on my Substack (http://thewebscraping.club/) for longer free content.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
