Ask HN: What are the best tools for web scraping in 2022?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Playwright

    Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

    For simple scraping where the content is fairly static, or when performance is critical, I will use linkedom to process pages.

    https://github.com/WebReflection/linkedom

    When the content is complex or involves clicking, Playwright is probably the best tool for the job.

    https://github.com/microsoft/playwright
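
    As an illustration of the dynamic-content case, here is a minimal sketch using Playwright's Python sync API; the URL and selector are placeholders, not taken from the thread:

        from playwright.sync_api import sync_playwright

        # Launch a headless browser, click through to content that only
        # appears after JavaScript runs, then read the result.
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto("https://example.com")        # placeholder URL
            page.click("text=More information")     # placeholder selector
            print(page.title())
            browser.close()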

  • estela

    estela, an elastic web scraping cluster 🕸

    estela is an elastic web scraping cluster running on Kubernetes. It provides mechanisms to deploy, run and scale web scraping spiders via a REST API and a web interface.

    It is a modern alternative to the few OSS projects available for such needs, like scrapyd and gerapy. estela aims to help web scraping teams and individuals who are considering moving away from proprietary scraping clouds, or who are designing their on-premise scraping architecture, so that they don't needlessly reinvent the wheel and can benefit from built-in scalability and elasticity from the get-go.

    estela was recently published as OSS under the MIT license:

    https://github.com/bitmakerla/estela

    More details about it can be found in the release blog post and the official documentation:

    https://bitmaker.la/blog/2022/06/24/estela-oss-release.html

    https://estela.bitmaker.la/docs/

    estela supports Scrapy spiders for the time being, but additional frameworks/languages are on the roadmap.

    All kinds of feedback and contributions are welcome!

    Disclaimer: I'm part of the development team behind estela :-)

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

  • puppeteer

    Node.js API for Chrome

  • curl-impersonate

    curl-impersonate: A special build of curl that can impersonate Chrome & Firefox

    curl-impersonate[1] is a curl fork that I maintain and which lets you fetch sites while impersonating a browser. Unfortunately, TLS and HTTP fingerprinting of web clients has become extremely common over the past year or so, which means a regular curl request will often get back a JS challenge instead of the real content. curl-impersonate helps with that.

    [1] https://github.com/lwthiker/curl-impersonate
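
    As a sketch, shelling out to one of the wrapper scripts the project ships (names like curl_chrome104 vary by release, so adjust to your build):

        import subprocess

        # Fetch a page while presenting Chrome's TLS/HTTP fingerprint.
        # curl_chrome104 is one of curl-impersonate's bundled wrappers;
        # the exact name depends on the release you installed.
        result = subprocess.run(
            ["curl_chrome104", "-s", "https://example.com"],  # placeholder URL
            capture_output=True, text=True, check=True,
        )
        print(result.stdout[:200])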

  • polite

    Be nice on the web

    The polite package for R is intended to be a friendly way of scraping content while being considerate of the site owner. "The three pillars of a polite session are seeking permission, taking slowly and never asking twice."

    https://github.com/dmi3kno/polite

  • ssscraper

    A crawler/scraper based on golang + colly, configurable via JSON

    For a particular type of scraping, we wrote SSScraper on top of Colly and it works really well:

    https://github.com/gotripod/ssscraper/

  • wistalk

    Wistalk: Analyze Wikipedia User's Activity

    Beautiful Soup gets the job done. I made several apps using it.

    [1] https://github.com/altilunium/wistalk (Scrapes Wikipedia to analyze a user's activity)
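
    For context, the Beautiful Soup workflow behind apps like these is compact; a minimal sketch (URL and selector are placeholders):

        import requests
        from bs4 import BeautifulSoup

        # Fetch a page and extract every link target via a CSS selector.
        html = requests.get("https://en.wikipedia.org/wiki/Web_scraping",
                            timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.select("a[href]"):
            print(a["href"])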

  • psedex

    Psedex

    [2] https://github.com/altilunium/psedex (Scrapes a government website to get a list of all registered online services in Indonesia)

  • makalahIF

    Papers from https://informatika.stei.itb.ac.id/~rinaldi.munir/

  • wi-page

    Rank Wikipedia Article's Contributors by Byte Counts.

    [4] https://github.com/altilunium/wi-page (Scrapes Wikipedia to find the most active contributors to a given article)

  • arachnid

    Web spider (by altilunium)

  • powerpage-web-crawler

    A portable, lightweight web crawler using Powerpage.

    It depends. For a no-code solution, check out [powerpage-web-crawler](https://github.com/casualwriter/powerpage-web-crawler) for crawling blogs/posts.

  • undetected-chromedriver

    Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / Cloudflare IUAM)
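
    Usage mirrors plain Selenium; a minimal sketch (the target URL is a placeholder):

        import undetected_chromedriver as uc

        # undetected-chromedriver patches ChromeDriver so the session is
        # harder for bot-mitigation systems to flag; the resulting driver
        # behaves like a regular Selenium WebDriver.
        driver = uc.Chrome()
        driver.get("https://example.com")  # placeholder URL
        print(driver.title)
        driver.quit()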

  • medusa-crawler

    The Official Medusa Crawler gem

  • linkedom

    A triple-linked lists based DOM implementation.

  • chrome-aws-lambda

    Chromium Binary for AWS Lambda and Google Cloud Functions

  • browserless

    Deploy headless browsers in Docker. Run on our cloud or bring your own. Free for non-commercial uses.

  • cheerio

    The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

    If the content you need is static, I like using Node + cheerio [0], as the selector syntax is quite powerful. If there is some JavaScript execution involved, however, I will fall back to puppeteer.

    [0] - https://cheerio.js.org/

  • pup

    Parsing HTML at the command line

    Unpopular opinion, but Bash/shell scripting. Seriously, it's probably the fastest way to get things done. For fetching, use cURL. Want to extract particular markup? Use pup[1]. Want to process CSV? Use csvkit[2]. Or JSON? Use jq[3]. Want to use a DB? Use psql. Once you get the hang of shell scripting, you can create scrapers by just wiring up these utils.

    The only thing I wish for is better regex support. Bash and most Unix tools don't support PCRE, which can be severely limiting. Plus, sometimes you want to process text as a whole vs. line-by-line.

    I would also recommend Python's sh[4] module if shell scripting isn't your cup of tea. You get the best of both worlds: the ease of Bash utils with a saner syntax.

    [1]: https://github.com/ericchiang/pup
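
    As a sketch of the sh-module approach suggested above, the same curl-pipe-jq wiring from Python (the URL and jq filter are just examples):

        import sh

        # Fetch a JSON API with curl and extract one field with jq,
        # exactly as you would in a shell pipeline.
        body = sh.curl("-s", "https://api.github.com/repos/ericchiang/pup")
        print(sh.jq("-r", ".description", _in=body))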

  • jq

    Discontinued Command-line JSON processor [Moved to: https://github.com/jqlang/jq] (by stedolan)

  • colly

    Elegant Scraper and Crawler Framework for Golang

    I’m not sure about “best” but I’ve been using Colly (written in Go) and it’s been pretty slick. Haven’t run into anything it can’t do.

    http://go-colly.org/

  • bumblebee-Old-and-abbandoned

    OUTDATED!!!!! - Replaced by "The Bumblebee Project" and "Ironhide"

    My main qualms with bash as a scripting language are that its syntax is not only kind of bonkers (no judgement, I know it's an old tool) but also just crazily unsafe. I link to a few high-profile things whenever people ask me why my mantra is "the time to switch your script from bash to python is when you want to delete things".

    >rm -rf /usr /lib/nvidia-current/xorg/xorg

    https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commi...

    >rm -rf "$STEAMROOT/"*

    https://github.com/valvesoftware/steam-for-linux/issues/3671

    It's just too easy to shoot yourself in the foot.

  • steam-for-linux

    Issue tracking for the Steam for Linux beta client

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    If you decide to have your own infrastructure, you can use https://github.com/scrapy/scrapyd.
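
    For reference, a minimal Scrapy spider; the target is Scrapy's own sandbox site, and the class and field names are just illustrative:

        import scrapy

        class QuotesSpider(scrapy.Spider):
            name = "quotes"
            start_urls = ["https://quotes.toscrape.com/"]

            def parse(self, response):
                # Yield one item per quote block on the page.
                for quote in response.css("div.quote"):
                    yield {
                        "text": quote.css("span.text::text").get(),
                        "author": quote.css("small.author::text").get(),
                    }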

  • scrapyd

    A service daemon to run Scrapy spiders

  • scrapy-redis

    Redis-based components for Scrapy.

    With some work, you can use Scrapy for distributed projects that are scraping thousands (millions) of domains. We are using https://github.com/rmax/scrapy-redis.
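
    The switch is mostly configuration; a sketch of the relevant Scrapy settings, assuming a local Redis instance (the URL is a placeholder):

        # settings.py: hand scheduling and dupe-filtering to Redis so
        # multiple spider processes can share a single crawl queue.
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
        SCHEDULER_PERSIST = True  # keep the queue across restarts
        REDIS_URL = "redis://localhost:6379"  # placeholder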

  • crawlee

    Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

    I'm working on a personal project that involves A LOT of scraping, and through several iterations I've gotten some stuff that works quite well. Here's a quick summary of what I've explored (both paid and free):

    * Apify (https://apify.com/) is a great, comprehensive system if you need to get fairly low-level. Everything is hosted there, they've got their own proxy service (or you can roll your own), and their open source framework (https://github.com/apify/crawlee) is excellent.

    * I've also experimented with running both their SDK (crawlee) and Playwright directly on Google Cloud Run, and that also works well and is an order of magnitude less expensive than running directly on their platform.

    * Bright Data (née Luminati) is excellent for cheap data center proxies ($0.65/GB pay as you go), but prices get several orders of magnitude higher if you need anything more thorough than data center proxies.

    * For some direct API crawls that I do, all of the scraping stuff is unnecessary and I just ping the APIs directly.

    * If the site you're scraping uses any sort of anti-bot protection, I've found that ScrapingBee (https://www.scrapingbee.com/) is by far the easiest solution. I spent many, many hours fighting anti-bot protection myself with some combination of Bright Data, Apify and Playwright, and in the end I kinda stopped battling and just decided to let ScrapingBee deal with it for me. I may be lucky in that the sites I'm scraping don't use JS heavily, so the plain vanilla, no-JS ScrapingBee service works almost all of the time for those. Otherwise it can get quite expensive if you need JS rendering, premium proxies, etc. But a big thumbs up to them for making it really easy.

    Always looking for new techniques and tools, so I'll monitor this thread closely.
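
    For what it's worth, the ScrapingBee call mentioned above is a single HTTP request; a sketch per their public API (key and target URL are placeholders):

        import requests

        # One GET against ScrapingBee's API; render_js=false matches the
        # cheaper no-JS tier discussed above.
        resp = requests.get(
            "https://app.scrapingbee.com/api/v1/",
            params={
                "api_key": "YOUR_API_KEY",     # placeholder
                "url": "https://example.com",  # placeholder target
                "render_js": "false",
            },
            timeout=60,
        )
        print(resp.status_code, resp.text[:200])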

  • google-search-results-php

    Google Search Results PHP API via Serp Api

    We've built https://serpapi.com

    We pioneered what you're referring to as "data type specific APIs": APIs that abstract away all proxy issues, CAPTCHA solving, support for various layouts, even scraping-related legal issues, and much more, down to a clean JSON response on every single call. It was a lot of work, but our success rate and response times now rival non-scraping commercial APIs: https://serpapi.com/status

    I think the next battle will still be legal, despite all the wins in favor of scraping public pages and the common-sense understanding that this is the way to go. The EFF has been doing amazing work in this space, and we are proud to be a significant yearly contributor to the EFF.
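
    A sketch of what such a call looks like (parameters per SerpApi's public docs; the key is a placeholder):

        import requests

        # One GET returns Google results already parsed into JSON.
        resp = requests.get(
            "https://serpapi.com/search",
            params={
                "engine": "google",
                "q": "web scraping",
                "api_key": "YOUR_API_KEY",  # placeholder
            },
            timeout=30,
        )
        for result in resp.json().get("organic_results", []):
            print(result.get("title"))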

  • Webscraping Open Project

    Discontinued The web scraping open project repository aims to share knowledge and experiences about web scraping with Python [Moved to: https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero]

    I’m collecting my experience with these tools in this “web scraping open knowledge project” on GitHub (https://github.com/reanalytics-databoutique/webscraping-open...) and on my Substack (http://thewebscraping.club/) for longer free content.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
