Web Scraping 101 with Python

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • pyppeteer

    Headless chrome/chromium automation library (unofficial port of puppeteer)

  • playwright-python

    Python version of the Playwright testing and automation library.

  • There's an official Python library for Playwright as well: https://github.com/microsoft/playwright-python

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • Scrapy

    Scrapy, a fast high-level web crawling & scraping framework for Python.

  • I think the Scrapy library written in Python (https://scrapy.org) is excellent for writing and deploying scrapers.

  • memex-program-index

    A list of memex-related tools and their repository URLs

  • There was (is?) a DARPA project called "Memex" that was built to crawl the hidden web that has many tools like crawling with authentication, automatic registration, machine-learning to detect search-forms, auto detecting pagination etc etc etc etc https://github.com/darpa-i2o/memex-program-index

  • scraper

    Nodejs web scraper. Contains a command line, docker container, terraform module and ansible roles for distributed cloud scraping. Supported databases: SQLite, MySQL, PostgreSQL. Supported headless clients: Puppeteer, Playwright, Cheerio, JSdom. (by get-set-fetch)

  • I'm using this exact strategy to scrape content directly from DOM using APIs like document.querySelectorAll. You can use the same code in both headless browser clients like Puppeteer or Playwright and DOM clients like cheerio or jsdom (assuming you have a wrapper over document API). Depending on the way a web page was fetched (opened in a browser tab or fetched via nodejs http/https requests), ExtractHtmlContentPlugin, ExtractUrlsPlugin use different DOM wrappers (native, cheerio, jsdom) to scrape the content.

    [1] https://github.com/get-set-fetch/scraper/blob/main/src/plugi...

  • WorkOS

    The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.

    WorkOS logo
NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts