-
I don't know how its developers pulled this off, but it works. Its main feature is that it was developed specifically for web scraping. It also has a higher-level library to work with - botasaurus.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
Let's look at a simple code example. This will work for requests, httpx, and aiohttp with a clean installation and no extensions.
-
The answer lies in the Accept-Encoding header. In the example above, I just copied it from my browser, so it lists all the compression methods my browser supports: "gzip, deflate, br, zstd". The Wayfair backend supports compression with "br", which is Brotli, and uses it as the most efficient method.
-
crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
By the way, adding the brotli dependency was my first contribution to crawlee-python. They use httpx as the base HTTP client.
-
You might expect me to write here an equally elegant solution for httpx, but no. About six months ago, you could do the "dirty trick" and change the basic httpcore parameters that it passes to h2, which are responsible for the HTTP2 handshake. But now, as I'm writing this article, that doesn't work anymore.
-
Yes, that's right. 2023 brought us not only Large Language Models like ChatGPT but also improved Cloudflare protection.
-
You may have also noticed that a new supported data compression format zstd appeared some time ago. I haven't seen any backends that use it yet, but httpx will support decompression in versions above 0.28.0. I already use it to compress server response dumps in my projects; it shows incredible efficiency in asynchronous solutions with aiofiles.
-
One might ask, what about Scrapy? I'll be honest: I don't really keep up with their updates. But I haven't heard about Zyte doing anything to bypass TLS fingerprinting. So out of the box Scrapy will also be blocked, but nothing is stopping you from using curl_cffi in your Scrapy Spider.
-
You may have also noticed that a new supported data compression format zstd appeared some time ago. I haven't seen any backends that use it yet, but httpx will support decompression in versions above 0.28.0. I already use it to compress server response dumps in my projects; it shows incredible efficiency in asynchronous solutions with aiofiles.
-
You might expect me to write here an equally elegant solution for httpx, but no. About six months ago, you could do the "dirty trick" and change the basic httpcore parameters that it passes to h2, which are responsible for the HTTP2 handshake. But now, as I'm writing this article, that doesn't work anymore.
-
My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as Import.io and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when requests and lxml/beautifulsoup were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :)
-
The first way to solve this problem is to use tls-client
-
tls-client
net/http.Client like HTTP Client with options to select specific client TLS Fingerprints to use for requests.
This Python wrapper around the Golang library provides an API similar to requests.
-
curl_cffi
Discontinued Python binding for curl-impersonate via cffi. A http client that can impersonate browser tls/ja3/http2 fingerprints. [Moved to: https://github.com/lexiforest/curl_cffi] (by yifeikong)
The second way to solve this problem is to use curl_cffi
-
As I personally would like to see an HTTP client for Python developed and maintained by a major company. And Rust shows itself very well as a library language for Python. Let's remember at least Ruff and Pydantic v2.
-
Candidate #1 Playwright + playwright-stealth
-
undetected-chromedriver
Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva/ Datadadome / CloudFlare IUAM)
Candidate #2 undetected_chromedriver
-
A new framework for web scraping using browser automation, built on botasaurus-driver. The initial commit was made on May 9, 2023.
-
Middleware libraries are written by the community and are extending their functionality. For example, scrapy-playwright.
-
crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Developed by Apify, it is a Python adaptation of their famous JS framework crawlee, first released on Jul 9, 2019.