Current problems and mistakes of web scraping in Python and tricks to solve them!

This page summarizes the projects mentioned and recommended in the original post on dev.to

  • botasaurus

    The All in One Framework to build Awesome Scrapers.

    I don't know how its developers pulled this off, but it works. Its main feature is that it was developed specifically for web scraping. On top of its driver (botasaurus-driver), it also provides a higher-level library to work with - botasaurus itself.

  • httpx

    A next generation HTTP client for Python. 🦋

    Let's look at a simple code example. This will work for requests, httpx, and aiohttp with a clean installation and no extensions.
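    A minimal sketch of what such a request looks like, assuming the requests library is installed; the header values and the Wayfair URL (the site discussed in the brotli note below) are illustrative:

```python
import requests

# Headers copied from a real browser session; note the Accept-Encoding value.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br, zstd",
}

def fetch(url: str) -> int:
    """Plain GET with browser-like headers; returns the HTTP status code."""
    return requests.get(url, headers=HEADERS).status_code

# Usage (real network call; protected sites often answer 403 despite the
# browser-like headers, because the TLS handshake is not a browser's):
#     print(fetch("https://www.wayfair.com/"))
```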

  • brotli

    Brotli compression format

    The answer lies in the Accept-Encoding header. In the example above, I just copied it from my browser, so it lists all the compression methods my browser supports: "gzip, deflate, br, zstd". The Wayfair backend supports compression with "br", which is Brotli, and uses it as the most efficient method.
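    A defensive sketch of that negotiation (the function name is my own): advertise only the compression formats your environment can actually decode, instead of copying the browser's header wholesale:

```python
def supported_accept_encoding() -> str:
    """Build an Accept-Encoding value from codecs actually importable."""
    # gzip and deflate are always decodable via the stdlib zlib module.
    encodings = ["gzip", "deflate"]
    try:
        import brotli  # noqa: F401 -- optional, `pip install brotli`
        encodings.append("br")
    except ImportError:
        pass
    try:
        import zstandard  # noqa: F401 -- optional, `pip install zstandard`
        encodings.append("zstd")
    except ImportError:
        pass
    return ", ".join(encodings)
```

    If you advertise "br" without the brotli package installed, the server may answer with a Brotli-compressed body your client cannot decode.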

  • crawlee-python

    Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

    By the way, adding the brotli dependency was my first contribution to crawlee-python. They use httpx as the base HTTP client.

  • httpcore

    A minimal HTTP client. ⚙️

    You might expect me to write an equally elegant solution for httpx here, but no. About six months ago you could pull a "dirty trick" and change the low-level httpcore parameters it passes to h2, the ones responsible for the HTTP/2 handshake. But now, as I write this article, that no longer works.

  • flow-pipeline

    A set of tools and examples to run a flow-pipeline (sFlow, NetFlow)

    Yes, that's right. 2023 brought us not only Large Language Models like ChatGPT but also improved Cloudflare protection.

  • zstd

    Zstandard - Fast real-time compression algorithm

    You may also have noticed that support for a new compression format, zstd, appeared some time ago. I haven't seen any backends that use it yet, but httpx will support its decompression in versions above 0.28.0. I already use it to compress server response dumps in my projects; it shows incredible efficiency in asynchronous solutions with aiofiles.
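    A sketch of that dump pattern, assuming the third-party zstandard and aiofiles packages (`pip install zstandard aiofiles`); the function name and file path are my own:

```python
import asyncio

async def dump_response(body: bytes, path: str, level: int = 3) -> int:
    """Compress a response body with zstd and write it without blocking
    the event loop; returns the compressed size in bytes."""
    # Deferred imports so the sketch is readable without the dependencies.
    import aiofiles
    import zstandard

    compressed = zstandard.ZstdCompressor(level=level).compress(body)
    async with aiofiles.open(path, "wb") as fh:
        await fh.write(compressed)
    return len(compressed)

# Usage (writes dump.zst in the current directory):
#     asyncio.run(dump_response(b"<html>...</html>", "dump.zst"))
```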

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    One might ask: what about Scrapy? I'll be honest: I don't really keep up with its updates. But I haven't heard of Zyte doing anything to bypass TLS fingerprinting, so out of the box Scrapy will also be blocked. Still, nothing stops you from using curl_cffi in your Scrapy spider.

  • aiofiles

    File support for asyncio


  • h2

    HTTP/2 State-Machine based protocol implementation (by python-hyper)


  • lxml

    The lxml XML toolkit for Python

    My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as Import.io and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when requests and lxml/beautifulsoup were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :)

  • Python-Tls-Client

    Advanced HTTP Library

    The first way to solve this problem is to use tls-client

  • tls-client

    net/http.Client like HTTP Client with options to select specific client TLS Fingerprints to use for requests.

    This Python wrapper around the Golang library provides an API similar to requests.
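    A sketch of the tls-client approach, assuming `pip install tls-client`; the client_identifier value and target URL are illustrative:

```python
TARGET_URL = "https://www.wayfair.com/"  # illustrative target

def fetch_with_tls_client(url: str) -> int:
    # Deferred import so the sketch is readable without the dependency.
    import tls_client

    # client_identifier selects which browser's TLS fingerprint to present.
    session = tls_client.Session(client_identifier="chrome_112")
    return session.get(url).status_code

# Usage (real network call):
#     print(fetch_with_tls_client(TARGET_URL))
```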

  • curl_cffi

    Discontinued Python binding for curl-impersonate via cffi. An HTTP client that can impersonate browser TLS/JA3/HTTP2 fingerprints. [Moved to: https://github.com/lexiforest/curl_cffi] (by yifeikong)

    The second way to solve this problem is to use curl_cffi
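    A sketch of the curl_cffi approach, assuming `pip install curl_cffi`; the impersonation target and URL are illustrative:

```python
def fetch_impersonating(url: str, browser: str = "chrome110") -> int:
    # Deferred import so the sketch is readable without the dependency.
    from curl_cffi import requests as curl_requests

    # impersonate makes libcurl present that browser's TLS/JA3/HTTP2 fingerprint.
    return curl_requests.get(url, impersonate=browser).status_code

# Usage (real network call):
#     print(fetch_impersonating("https://www.wayfair.com/"))
```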

  • ruff

    An extremely fast Python linter and code formatter, written in Rust.

    Personally, I would like to see an HTTP client for Python developed and maintained by a major company. And Rust is proving to be an excellent library language for Python; just look at Ruff and Pydantic v2.

  • playwright_stealth

    playwright stealth

    Candidate #1 Playwright + playwright-stealth
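    A sketch of this candidate, assuming `pip install playwright playwright-stealth` and `playwright install chromium`; the function name and URL are my own:

```python
def scrape_with_stealth(url: str) -> str:
    # Deferred imports so the sketch is readable without the dependencies.
    from playwright.sync_api import sync_playwright
    from playwright_stealth import stealth_sync

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        stealth_sync(page)  # patch common automation giveaways before navigating
        page.goto(url)
        html = page.content()
        browser.close()
    return html

# Usage (launches a real Chromium):
#     print(len(scrape_with_stealth("https://example.com/")))
```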

  • undetected-chromedriver

    Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / Datadome / Cloudflare IUAM)

    Candidate #2 undetected_chromedriver
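    A sketch of this candidate, assuming `pip install undetected-chromedriver` and a local Chrome install; the function name and URL are my own:

```python
def scrape_with_uc(url: str) -> str:
    # Deferred import so the sketch is readable without the dependency.
    import undetected_chromedriver as uc

    driver = uc.Chrome(headless=True)  # patched chromedriver binary
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

# Usage (launches a real Chrome):
#     print(len(scrape_with_uc("https://example.com/")))
```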

  • botasaurus-driver

    Super Fast, Super Anti-Detect, and Super Intuitive Web Driver

    A new framework for web scraping using browser automation, built on botasaurus-driver. The initial commit was made on May 9, 2023.

  • scrapy-playwright

    🎭 Playwright integration for Scrapy

    Middleware libraries written by the community extend its functionality; scrapy-playwright, for example.
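    Wiring it up takes a couple of project settings; the values below are the ones documented in the scrapy-playwright README, while the comments are mine:

```python
# settings.py: route downloads through Playwright's handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

    Individual requests then opt in via `meta={"playwright": True}`.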

  • crawlee

    Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Developed by Apify, it is a Python adaptation of their famous JS framework crawlee, first released on Jul 9, 2019.



Related posts

  • Multiparadigm Web Scraper!

    2 projects | /r/devpt | 14 May 2023
  • Daily News Scraper and Sms Notifications - Part One

    2 projects | dev.to | 16 Mar 2021
  • Scrapy, a fast high-level web crawling and scraping framework for Python

    1 project | news.ycombinator.com | 19 Aug 2024
  • Automate Spider Creation in Scrapy with Jinja2 and JSON

    2 projects | dev.to | 27 Jul 2024
  • Analyzing Svenskalag Data using DBT and DuckDB

    4 projects | dev.to | 17 Jun 2024
