Current problems and mistakes of web scraping in Python and tricks to solve them!

This page summarizes the projects mentioned and recommended in the original post on dev.to.

  1. botasaurus

    The All in One Framework to Build Undefeatable Scrapers

    I don't know how its developers pulled this off, but it works. Its main feature is that it was developed specifically for web scraping. It also has a higher-level library to work with - botasaurus.

  3. httpx

    A next generation HTTP client for Python. 🦋

    Let's look at a simple code example. This will work for requests, httpx, and aiohttp with a clean installation and no extensions.

  4. brotli

    Brotli compression format

    The answer lies in the Accept-Encoding header. In the example above, I just copied it from my browser, so it lists all the compression methods my browser supports: "gzip, deflate, br, zstd". The Wayfair backend supports compression with "br", which is Brotli, and uses it as the most efficient method.

  5. crawlee-python

    Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.

    By the way, adding the brotli dependency was my first contribution to crawlee-python. They use httpx as the base HTTP client.

  6. httpcore

    A minimal HTTP client. ⚙️

    You might expect an equally elegant solution for httpx here, but there isn't one. About six months ago, you could use a "dirty trick" and change the basic httpcore parameters passed to h2, which control the HTTP/2 handshake. As I write this article, that no longer works.

  7. flow-pipeline

    A set of tools and examples to run a flow-pipeline (sFlow, NetFlow)

    Yes, that's right. 2023 brought us not only Large Language Models like ChatGPT but also improved Cloudflare protection.

  8. zstd

    Zstandard - Fast real-time compression algorithm

    You may have also noticed that zstd appeared some time ago as a newly supported compression format. I haven't seen any backends that use it yet, but httpx will support decompressing it in versions above 0.28.0. I already use it to compress server response dumps in my projects; it works very efficiently in asynchronous solutions with aiofiles.

  10. Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    One might ask, what about Scrapy? I'll be honest: I don't really keep up with their updates. But I haven't heard about Zyte doing anything to bypass TLS fingerprinting. So out of the box Scrapy will also be blocked, but nothing is stopping you from using curl_cffi in your Scrapy Spider.

  11. aiofiles

    File support for asyncio

    You may have also noticed that zstd appeared some time ago as a newly supported compression format. I haven't seen any backends that use it yet, but httpx will support decompressing it in versions above 0.28.0. I already use it to compress server response dumps in my projects; it works very efficiently in asynchronous solutions with aiofiles.

  12. h2

    Pure-Python HTTP/2 protocol implementation (by python-hyper)

    You might expect an equally elegant solution for httpx here, but there isn't one. About six months ago, you could use a "dirty trick" and change the basic httpcore parameters passed to h2, which control the HTTP/2 handshake. As I write this article, that no longer works.

  13. lxml

    The lxml XML toolkit for Python

    My journey in web scraping started in 2016 when I was solving lead generation challenges for a small company. Initially, I used off-the-shelf solutions such as Import.io and Kimono Labs. However, I quickly encountered limitations such as blocking, inaccurate data extraction, and performance issues. This led me to learn Python. Those were the glory days when requests and lxml/beautifulsoup were enough to extract data from most websites. And if you knew how to work with threads, you were already a respected expert :)

  14. Python-Tls-Client

    Advanced HTTP Library

    The first way to solve this problem is to use tls-client

  15. tls-client

    net/http.Client like HTTP Client with options to select specific client TLS Fingerprints to use for requests.

    This Python wrapper around the Golang library provides an API similar to requests.

  16. curl_cffi

    Discontinued Python binding for curl-impersonate via cffi. A http client that can impersonate browser tls/ja3/http2 fingerprints. [Moved to: https://github.com/lexiforest/curl_cffi] (by yifeikong)

    The second way to solve this problem is to use curl_cffi

  17. ruff

    An extremely fast Python linter and code formatter, written in Rust.

    Personally, I would like to see an HTTP client for Python developed and maintained by a major company. Rust has also proven itself as a language for building Python libraries: consider Ruff and Pydantic v2.

  18. playwright_stealth

    playwright stealth

    Candidate #1 Playwright + playwright-stealth

  19. undetected-chromedriver

    Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM)

    Candidate #2 undetected_chromedriver

  20. botasaurus-driver

    Super Fast, Super Anti-Detect, and Super Intuitive Web Driver

    A new framework for web scraping using browser automation, built on botasaurus-driver. The initial commit was made on May 9, 2023.

  21. scrapy-playwright

    🎭 Playwright integration for Scrapy

    The community writes middleware libraries that extend Scrapy's functionality, for example scrapy-playwright.

  22. crawlee

    Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

    Developed by Apify, it is a Python adaptation of their famous JS framework crawlee, first released on Jul 9, 2019.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
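
The Accept-Encoding point quoted in the brotli entry can be sketched in code. A header copied from a browser advertises every format the browser can decode ("gzip, deflate, br, zstd"), so the server may answer with Brotli even if the Python client has no Brotli decoder installed. The sketch below only advertises encodings that look decodable locally. The helper names (`supported_encodings`, `build_headers`) and the module names in `DECODERS` are assumptions for illustration; which modules a given HTTP client actually checks for depends on the client and its version.

```python
import importlib.util

# Maps an Accept-Encoding token to the modules that could decode it.
# None means the Python standard library already handles the format.
# The module names for "br" and "zstd" are assumptions; check what your
# HTTP client actually imports for decompression.
DECODERS = {
    "gzip": None,
    "deflate": None,
    "br": ("brotli", "brotlicffi"),
    "zstd": ("zstandard",),
}


def _module_installed(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


def supported_encodings(is_installed=_module_installed):
    """List the encoding tokens for which a decoder appears to be available."""
    tokens = []
    for token, modules in DECODERS.items():
        if modules is None or any(is_installed(m) for m in modules):
            tokens.append(token)
    return tokens


def build_headers(browser_headers, is_installed=_module_installed):
    """Copy browser headers, replacing Accept-Encoding with a safe value."""
    # Drop any existing Accept-Encoding entry, whatever its letter case.
    headers = {
        key: value
        for key, value in browser_headers.items()
        if key.lower() != "accept-encoding"
    }
    headers["Accept-Encoding"] = ", ".join(supported_encodings(is_installed))
    return headers
```

The resulting dictionary can be passed as the `headers` argument of a request in requests or httpx. The alternative, as the crawlee-python entry describes, is to install the brotli dependency so the client can decode "br" responses.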

