Web Scraping in Python – The Complete Guide

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • shot-scraper

    A command-line utility for taking automated screenshots of websites

  • I strongly recommend adding Playwright to your set of tools for Python web scraping. It's by far the most powerful and best-designed browser automation tool I've ever worked with.

    I use it for my shot-scraper CLI tool: https://shot-scraper.datasette.io/ - which lets you scrape web pages directly from the command line by running JavaScript against pages to extract JSON data: https://shot-scraper.datasette.io/en/stable/javascript.html
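
    For flavor, here is a minimal sketch of that workflow using the plain Playwright-for-Python library: load the page in a real browser, then evaluate a JavaScript expression against it to pull out structured data. The URL and the extracted fields are illustrative placeholders, not part of shot-scraper itself.

    ```python
    # Minimal Playwright sketch: render the page in a real browser, then
    # evaluate JavaScript in the page context to extract structured data.
    # The URL and the extracted fields are illustrative placeholders.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")
        data = page.evaluate(
            """() => ({
                title: document.title,
                links: Array.from(document.querySelectorAll('a'), a => a.href),
            })"""
        )
        browser.close()

    print(data)
    ```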

  • flyscrape

    Flyscrape is a command-line web scraping tool designed for those without advanced programming skills.

  • Shameless plug:

    Flyscrape [0] lets you eliminate a lot of the boilerplate code that is otherwise necessary when building a scraper from scratch, while still giving you the flexibility to extract data that perfectly fits your needs.

    It comes as a single binary executable and runs small JavaScript files without you having to deal with npm or Node.

    You can have a collection of small, isolated scraping scripts rather than full-on Node projects.

    [0]: https://github.com/philippta/flyscrape

  • Playwright

    Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

  • Hah yeah that's confusing.

    https://playwright.dev/python/docs/intro is actually the documentation for pytest-playwright - their pytest plugin.

    https://playwright.dev/python/docs/library is the documentation for their automation library.

    I just filed an issue pointing out that this is confusing. https://github.com/microsoft/playwright/issues/29579
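
    In practice the split is: the library is driven imperatively (you launch, navigate, and tear down yourself), while the pytest plugin injects ready-made fixtures. A minimal pytest-playwright test, assuming `pip install pytest-playwright` and `playwright install` have been run, looks roughly like this:

    ```python
    # Sketch of the pytest-playwright style: the plugin supplies a `page`
    # fixture (a fresh browser page per test), so there is no manual
    # browser launch or teardown as with the plain library.
    from playwright.sync_api import Page

    def test_example_title(page: Page):
        page.goto("https://example.com")
        assert "Example Domain" in page.title()
    ```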

  • discogs

    parsing data from discogs

  • Check out the cloudscraper library if you are having speed/CPU issues with sites that require JS or have Cloudflare defending them. That, plus a proxy list, plus threading lets me make 300 requests a minute across 32 different proxies. I recently implemented it for a project: https://github.com/rezaisrad/discogs/tree/main/src/managers
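
    A sketch of that setup, with hypothetical proxy addresses and URLs; cloudscraper returns a requests-compatible session, so the usual proxies= argument applies:

    ```python
    # Sketch: a cloudscraper session + a proxy list + a thread pool, in the
    # spirit of the comment above. Proxies and URLs are placeholders.
    from concurrent.futures import ThreadPoolExecutor

    import cloudscraper

    PROXIES = [f"http://proxy{i}.example:8080" for i in range(32)]

    # Drop-in requests.Session subclass that solves Cloudflare's JS challenge.
    # For heavy use, consider one scraper per thread; requests sessions are
    # not strictly thread-safe.
    scraper = cloudscraper.create_scraper()

    def fetch(job):
        i, url = job
        proxy = PROXIES[i % len(PROXIES)]  # spread requests across proxies
        resp = scraper.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
        return url, resp.status_code

    urls = [f"https://example.com/release/{n}" for n in range(300)]
    with ThreadPoolExecutor(max_workers=32) as pool:
        for url, status in pool.map(fetch, enumerate(urls)):
            print(status, url)
    ```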

  • captcha-solver

    Basic Google reCAPTCHA solver using llava-v1.6-7b (by notune)

  • If you have a decent GPU (16 GB+ VRAM) and are using Linux, then this tool I wrote a few days ago might do the trick, at least for Google's reCAPTCHA. For now, you have to run main.py every time you see a captcha on a site, and you need the GUI, since it works purely from screenshots, no HTML or similar. (Sorry that it's not yet well optimized. I am currently very busy with lots of other things, but next week I should have time to improve it further. It should still work for basic scraping, though.) https://github.com/notune/captcha-solver/

  • Nokogiri

    Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.

  • skyscraper

    Structural scraping for the rest of us. (by nathell)

  • Yes!

    My Clojure scraping framework [0] facilitates that kind of workflow, and I’ve been using it to scrape/restructure massive sites (millions of pages). I guess I’m going to write a blog post about scraping with it at scale. Although it doesn’t really scale much above that – it’s meant for single-machine loads at the moment – it could be enhanced to support that kind of workflow rather easily.

    [0]: https://github.com/nathell/skyscraper

  • pypandoc

    Thin wrapper for "pandoc" (MIT)

  • I recently used [0] Playwright for Python and [1] pypandoc to build a scraper that fetches a webpage and turns the content into sane Markdown so that it can be passed into an AI coding chat [2].

    They are both very gentle dependencies to add to a project. Both packages contain built-in or scriptable methods to install their underlying platform-specific binary dependencies. This means you don't need to ask end users to wrestle with some complex, platform-specific package manager to install Playwright and Pandoc.

    Playwright lets you scrape pages that rely on JS. Pandoc is great at turning HTML into sensible Markdown. Below is an excerpt of the OpenAI pricing docs [3] that have been scraped to Markdown [4] in this manner; a sketch of the pipeline follows the excerpt.

    [0] https://playwright.dev/python/docs/intro

    [1] https://github.com/JessicaTegner/pypandoc

    [2] https://github.com/paul-gauthier/aider

    [3] https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turb...

    [4] https://gist.githubusercontent.com/paul-gauthier/95a1434a28d...

      ## GPT-4 and GPT-4 Turbo
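
    A minimal sketch of that pipeline: render the page with Playwright so JS-built content is present, then hand the resulting HTML to pypandoc. The URL is a placeholder, and it assumes the Playwright browsers and a pandoc binary are installed (both packages ship helpers for that).

    ```python
    # Sketch: fetch the fully rendered DOM with Playwright, then convert
    # the HTML to Markdown with pypandoc. The URL is a placeholder.
    import pypandoc
    from playwright.sync_api import sync_playwright

    def page_to_markdown(url: str) -> str:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            html = page.content()  # rendered DOM, not the raw response body
            browser.close()
        # "gfm" = GitHub-flavored Markdown; plain "markdown" also works
        return pypandoc.convert_text(html, "gfm", format="html")

    print(page_to_markdown("https://example.com"))
    ```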

  • aider

    aider is AI pair programming in your terminal

  • cheerio

    The fast, flexible, and elegant library for parsing and manipulating HTML and XML.

  • > I'm not sure why Python web scraping is so popular compared to Node.js web scraping

    Take this with a grain of salt, since I am fully cognizant that I'm the outlier in most of these conversations, but Scrapy is A++ the no-kidding best framework for this activity that has been created thus far. So, if there were a scrapyjs, maybe I'd look into it, but there isn't (that I'm aware of), so here we are. This conversation often comes up in any "well, I just use requests & ..." thread, and if one is happy with main.py and a bunch of requests invocations, I'm glad for you, but I don't want to try to cobble together all the side-band stuff that Scrapy and its ecosystem provide for me in a reusable and predictable way.

    Also, those conversations often conflate the server-side language with the "scrape using a headed browser" language, which happens to be the same one. So, if one is using cheerio <https://github.com/cheeriojs/cheerio>, then sure, Node can be a fine thing; but if the blog post is all "fire up Puppeteer, what could go wrong?!" then that is the road to ruin of doing battle with all kinds of detection problems, since it's kind of a browser but kind of not.

    I, under no circumstances, want the target site running its JS during my crawl runs. I fully accept responsibility for reproducing any XHR or auth or whatever to find the 3 URLs I care about, without downloading every thumbnail, marketing script, and beacon. I'm also cognizant that my traffic will thus stand out, since it uniquely does not make the beacon and marketing calls, but my experience has been that I get the ban hammer less often with my targeted fetches than by trying to pretend to be a browser with a human on the keyboard/mouse when there isn't one.
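
    For readers who haven't used it, a minimal Scrapy spider looks like the sketch below; the value of the framework is everything around this parse logic (scheduling, dedup, retries, throttling, export pipelines) that you would otherwise hand-roll. The site and selectors are from Scrapy's own tutorial sandbox:

    ```python
    # Minimal Scrapy spider sketch. Run with: scrapy runspider quotes_spider.py
    # Scrapy supplies scheduling, dedup, retries, throttling, and export
    # pipelines around this parse logic.
    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination; Scrapy schedules and dedups the requests.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)
    ```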

  • forward-proxy-manager

    Request distributor for web scraping

  • I've found myself writing the same session/proxy/rate-limiting/header-faking management code over and over for my scrapers. I've extracted it into its own service that runs in Docker and acts as a MITM proxy between you and the target. It is client-language agnostic, so you can write scrapers in Python, Node, or whatever and still get great performance.

    I highly recommend this approach: it lets you separate the infrastructure code, which gets highly complex as you need more requests, from the actual spider/parser code, which is usually pretty straightforward and project-specific.

    https://github.com/jkelin/forward-proxy-manager
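
    The client side of that architecture stays trivial. A sketch with requests, assuming the proxy manager is listening locally; the address below is hypothetical, so check the project's README for the real port and TLS setup:

    ```python
    # Sketch: route ordinary requests traffic through a local MITM proxy
    # service such as forward-proxy-manager. The scraper stays simple; the
    # proxy layer handles sessions, upstream proxies, rate limits, and headers.
    import requests

    PROXY = "http://localhost:8080"  # hypothetical listen address

    session = requests.Session()
    session.proxies = {"http": PROXY, "https": PROXY}
    # A MITM proxy re-signs TLS; in real use, verify against the proxy's CA
    # certificate instead of disabling verification.
    session.verify = False  # illustrative shortcut only

    resp = session.get("https://example.com/product/123")
    print(resp.status_code, len(resp.text))
    ```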
