Web Scraping 101 with Python

This page summarizes the projects mentioned and recommended in the original post on Hacker News (news.ycombinator.com).

  • pyppeteer

    Headless chrome/chromium automation library (unofficial port of puppeteer)

  • playwright-python

    Python version of the Playwright testing and automation library.

    There's an official Python library for Playwright as well: https://github.com/microsoft/playwright-python

  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

    I think the Scrapy library written in Python (https://scrapy.org) is excellent for writing and deploying scrapers.

  • memex-program-index

    A list of memex-related tools and their repository URLs

    There was (is?) a DARPA project called "Memex", built to crawl the hidden web. It has many tools for tasks such as crawling with authentication, automatic registration, machine learning to detect search forms, and automatic detection of pagination: https://github.com/darpa-i2o/memex-program-index

  • scraper

    Open source Node.js web scraper. It scrapes, stores and exports data. Use it from your own JavaScript/TypeScript code, from the command line, or in a Docker container. Supports multiple storage options: SQLite, MySQL, PostgreSQL. Supports multiple browser or DOM-like clients: Puppeteer, Playwright, Cheerio, jsdom. (by get-set-fetch)

    I'm using this exact strategy to scrape content directly from the DOM using APIs like document.querySelectorAll. You can use the same code in both headless browser clients like Puppeteer or Playwright and DOM clients like cheerio or jsdom (assuming you have a wrapper over the document API). Depending on how a web page was fetched (opened in a browser tab or fetched via Node.js http/https requests), ExtractHtmlContentPlugin and ExtractUrlsPlugin use different DOM wrappers (native, cheerio, jsdom) to scrape the content.

    [1] https://github.com/get-set-fetch/scraper/blob/main/src/plugi...
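
The wrapper idea in the last comment can be sketched in Python, the language of the original post. This is only an illustration of the pattern, not the get-set-fetch API: `DomWrapper`, `HtmlParserDom`, `ElementTreeDom`, and `extract_urls` are hypothetical names, and the two standard-library parsers merely stand in for a lenient and a strict DOM client. The point is that the extraction function is written once against a small interface and never needs to know which backend parsed the page.

```python
from html.parser import HTMLParser
from typing import List, Protocol
import xml.etree.ElementTree as ET


class DomWrapper(Protocol):
    """Hypothetical minimal interface that the extraction code relies on."""

    def attr_values(self, tag: str, attr: str) -> List[str]: ...


class _TagCollector(HTMLParser):
    """Records every start tag with its attributes, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


class HtmlParserDom:
    """Backend for raw HTML, parsed leniently with html.parser."""

    def __init__(self, html: str) -> None:
        collector = _TagCollector()
        collector.feed(html)
        collector.close()
        self._elements = collector.elements

    def attr_values(self, tag: str, attr: str) -> List[str]:
        return [attrs[attr] for name, attrs in self._elements
                if name == tag and attrs.get(attr)]


class ElementTreeDom:
    """Backend for well-formed XHTML only; stands in for a stricter client."""

    def __init__(self, xhtml: str) -> None:
        self._root = ET.fromstring(xhtml)

    def attr_values(self, tag: str, attr: str) -> List[str]:
        return [el.get(attr) for el in self._root.iter(tag) if el.get(attr)]


def extract_urls(dom: DomWrapper) -> List[str]:
    """Extraction logic written once, independent of the backend."""
    return dom.attr_values("a", "href")


# Assumed sample page, small enough to be valid for both backends.
PAGE = ('<html><body><a href="/a">A</a>'
        '<p><a href="/b">B</a></p><a>no href</a></body></html>')

print(extract_urls(HtmlParserDom(PAGE)))
print(extract_urls(ElementTreeDom(PAGE)))
```

Both calls print the same list of links for this sample page. Supporting another client, for example a headless browser session, would under these assumptions only require one more class with an `attr_values` method.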
