How to Crawl the Web with Scrapy

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  • rtila-releases

    Rtila [1]

    Created by an indie solo developer on fire, cranking out user-requested features quite quickly... check the releases page [2]

    I have used (or at least trialled) the vast majority of scraping tech and written hundreds of scrapers since my first VB5 scraper controlling IE and dumping to SQL Server in the '90s, before moving on to various PHP and Python libs/frameworks and a handful of Windows apps like uBot and iMacros (both of which were useful to me at some point, but I never use them nowadays).

    A recent release of Rtila allows creating standalone bots you can run using its built-in local Node.js server (which also has its own locally hosted server API that you can program against from any language you like).

    [1] www.rtila.net

    [2] https://github.com/IKAJIAN/rtila-releases/releases

  • puppeteer

    Headless Chrome Node.js API

    Google search isn't a static site; the results are dynamically generated based on what it knows about you (location, browser language, recent searches from that address, recent searches from that account, and so on, with all of the things Google knows from trying to sell ad slots to that device).

    That being said, there isn't anything wrong with using Scrapy for this. If you're more familiar with web browsers than Python, something like https://github.com/puppeteer/puppeteer can also be turned into a quick way to scrape a site by giving you a headless browser controlled by whatever you script in Node.js.

  • parsel

    Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
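
    As a quick illustration, Parsel works standalone, outside of Scrapy. A minimal sketch (the HTML snippet and selectors here are made up for the example):

      from parsel import Selector

      # Inline sample document; in practice this would be a fetched page.
      html = """
      <ul>
        <li class="item"><a href="/p/1">Widget</a><span class="price">$9.99</span></li>
        <li class="item"><a href="/p/2">Gadget</a><span class="price">$19.99</span></li>
      </ul>
      """

      sel = Selector(text=html)

      # CSS selectors; ::text and ::attr() are Parsel/Scrapy extensions.
      names = sel.css("li.item a::text").getall()         # ['Widget', 'Gadget']
      links = sel.css("li.item a::attr(href)").getall()   # ['/p/1', '/p/2']

      # The same extraction via XPath.
      prices = sel.xpath('//li[@class="item"]/span[@class="price"]/text()').getall()

      print(list(zip(names, links, prices)))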

  • google-search-results-php

    Google Search Results PHP API via Serp Api

    I've had excellent luck with SerpAPI. It's $50 a month for 5,000 searches, which has been plenty for my needs at a small SEO/marketing agency.

    http://serpapi.com
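
    The repo above is the PHP client, but the service itself is a plain HTTPS endpoint, so any language works. A minimal Python sketch (the api_key value is a placeholder; the engine/q parameters and the organic_results field follow SerpAPI's documented JSON format):

      import requests

      params = {
          "engine": "google",
          "q": "web scraping framework",
          "api_key": "YOUR_SERPAPI_KEY",  # placeholder; issued per account
      }

      resp = requests.get("https://serpapi.com/search", params=params, timeout=30)
      resp.raise_for_status()
      data = resp.json()

      # Each organic result carries position, title, link, snippet, etc.
      for result in data.get("organic_results", []):
          print(result["position"], result["title"], result["link"])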

  • colly

    Elegant Scraper and Crawler Framework for Golang

    What's your problem with Colly? [0]

    [0] http://go-colly.org/

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    While I agree that Scrapy is a great tool for beginner tutorials and an easy entry into scraping, it's becoming difficult to use in real-world scenarios, because almost all the large players now employ some anti-bot or anti-scraping protection.

    A great example above all is Cloudflare. You simply can't convince Cloudflare you're a human with Scrapy alone. Scrapy has only experimental support for HTTP/2 and does not support proxies over HTTP/2 (https://github.com/scrapy/scrapy/issues/5213). Yet all browsers use HTTP/2 now, which means all normal users use HTTP/2... you get the point.

    What we use now is Got Scraping (https://github.com/apify/got-scraping). It's a special-purpose extension of Got (an HTTP client with 18 million weekly downloads) that masks its HTTP communication as if it were coming from a real browser. Of course, this will not get you as far as Puppeteer or Playwright (headless browsers), but it has improved our scraping tremendously. If you need a full crawling library, see the Apify SDK (https://sdk.apify.com), which uses Got Scraping under the hood.
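
    For context, the experimental HTTP/2 support mentioned above is opt-in through a download-handler setting. A minimal sketch of a spider with it enabled (the handler path is the one documented by Scrapy; it applies to HTTPS only, requires the Twisted http2 extras, and still has no proxy support; quotes.toscrape.com is Scrapy's public demo site):

      import scrapy

      class QuotesSpider(scrapy.Spider):
          name = "quotes"
          start_urls = ["https://quotes.toscrape.com/"]

          # Opt in to the experimental HTTP/2 handler (HTTPS only,
          # no proxy support -- see scrapy/scrapy#5213).
          custom_settings = {
              "DOWNLOAD_HANDLERS": {
                  "https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
              },
          }

          def parse(self, response):
              # Selectors match the demo site's markup.
              for quote in response.css("div.quote"):
                  yield {
                      "text": quote.css("span.text::text").get(),
                      "author": quote.css("small.author::text").get(),
                  }

    Saved as, say, quotes_spider.py, this runs with: scrapy runspider quotes_spider.py -o quotes.json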

  • got-scraping

    HTTP client made for scraping based on got.

    This is the Got Scraping library recommended in the comment quoted under Scrapy above: a special-purpose extension of Got that masks its HTTP communication as if it came from a real browser, with the Apify SDK (https://sdk.apify.com) building a full crawling library on top of it.
