So you want to Scrape like the Big Boys?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Crawling-Infrastructure

    Distributed crawling infrastructure running on top of serverless computation, cloud storage (such as S3) and sophisticated queues.

  • cloudscraper

    A Python module to bypass Cloudflare's anti-bot page.

  • I'm really surprised that the JS challenges helped so much, given that there are open source libraries for bypassing them (e.g. cloudscraper[0]).

    [0]: https://github.com/venomous/cloudscraper

  • Varnish

    The project homepage (by varnishcache)

  • It sucks when this happens, but it's easily avoidable by using a caching frontend of some sort.

    My favorite is Varnish,[0] which I have used with great success for _many_ web sites throughout the years. Even a web site that got 10+ million requests per day ran from a single web server for a long time, a decade-ish ago.

    [0] https://varnish-cache.org/
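
To illustrate the idea behind the comment above, here is a minimal Python sketch of what a caching frontend does: it answers repeated requests from memory and only contacts the origin server when an entry is missing or expired. This is not Varnish and not its configuration language; the `TTLCache` class, the `origin` callable, and the `ttl_seconds` parameter are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Toy stand-in for a caching frontend (hypothetical, not Varnish itself)."""

    def __init__(
        self,
        origin: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.origin = origin      # function that renders a page on the backend
        self.ttl = ttl_seconds    # how long a stored response stays fresh
        self.clock = clock        # injectable clock, handy for testing
        self.store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> str:
        now = self.clock()
        entry = self.store.get(path)
        if entry is not None and entry[0] > now:
            # Fresh copy available: the backend is not touched at all.
            self.hits += 1
            return entry[1]
        # Missing or expired: ask the backend once and remember the answer.
        self.misses += 1
        body = self.origin(path)
        self.store[path] = (now + self.ttl, body)
        return body


if __name__ == "__main__":
    backend_calls = []

    def slow_origin(path: str) -> str:
        backend_calls.append(path)
        return f"<html>{path}</html>"

    cache = TTLCache(slow_origin, ttl_seconds=60)
    for _ in range(1000):
        cache.get("/")
    print("backend calls:", len(backend_calls), "cache hits:", cache.hits)
```

With a sufficiently long time-to-live, a burst of identical requests reaches the backend only once, which is why a single web server behind such a frontend can absorb heavy scraper traffic. A real deployment would also need to handle cache invalidation, per-user content, and response headers, all of which this sketch ignores.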

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts