Web Scraping 101 with Python

pyppeteer

Mentions: 17 | Stars: 3,418 | Activity: 5.6 | Language: Python

Headless Chrome/Chromium automation library (unofficial port of Puppeteer)

playwright-python

Mentions: 31 | Stars: 10,675 | Activity: 9.0 | Language: Python

Python version of the Playwright testing and automation library.

There's an official Python library for Playwright as well: https://github.com/microsoft/playwright-python

Scrapy

Mentions: 180 | Stars: 50,896 | Activity: 9.6 | Language: Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

I think the Scrapy library written in Python (https://scrapy.org) is excellent for writing and deploying scrapers.

memex-program-index

Mentions: 1 | Stars: 142 | Activity: 0.0

A list of memex-related tools and their repository URLs

There was (is?) a DARPA project called "Memex", built to crawl the hidden web, that produced many tools for things like crawling with authentication, automatic registration, machine learning to detect search forms, and automatic pagination detection: https://github.com/darpa-i2o/memex-program-index

scraper

Mentions: 12 | Stars: 98 | Activity: 0.0 | Language: TypeScript

Node.js web scraper. Includes a command-line interface, Docker container, Terraform module and Ansible roles for distributed cloud scraping. Supported databases: SQLite, MySQL, PostgreSQL. Supported headless clients: Puppeteer, Playwright, Cheerio, jsdom. (by get-set-fetch)

I'm using this exact strategy to scrape content directly from the DOM using APIs like document.querySelectorAll. You can use the same code in both headless browser clients like Puppeteer or Playwright and DOM clients like cheerio or jsdom (assuming you have a wrapper over the document API). Depending on how a web page was fetched (opened in a browser tab or fetched via Node.js http/https requests), ExtractHtmlContentPlugin and ExtractUrlsPlugin use different DOM wrappers (native, cheerio, jsdom) to scrape the content.
[1] https://github.com/get-set-fetch/scraper/blob/main/src/plugi...
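The wrapper idea in this comment can be sketched in Python using only the standard library. This is not get-set-fetch's code: the `Dom` interface, its `query_all` method, both backend classes, and the sample `PAGE` are hypothetical names made up for illustration. The point is only that the extractor is written once against a small interface, and the parsing backend is chosen separately.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import Protocol


class Dom(Protocol):
    """Hypothetical minimal wrapper interface: all that extractors rely on."""

    def query_all(self, tag: str) -> list[dict[str, str]]: ...


class HtmlParserDom(HTMLParser):
    """Backend 1: tolerant HTML parsing with the stdlib html.parser."""

    def __init__(self, html: str) -> None:
        super().__init__()
        self._elements: list[tuple[str, dict[str, str]]] = []
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        # Record every opening tag with its attributes (None values become "").
        self._elements.append((tag, {k: v or "" for k, v in attrs}))

    def query_all(self, tag: str) -> list[dict[str, str]]:
        return [attrs for name, attrs in self._elements if name == tag]


class ElementTreeDom:
    """Backend 2: strict parsing with ElementTree (needs well-formed markup)."""

    def __init__(self, markup: str) -> None:
        self._root = ET.fromstring(markup)

    def query_all(self, tag: str) -> list[dict[str, str]]:
        return [dict(el.attrib) for el in self._root.iter(tag)]


def extract_urls(dom: Dom) -> list[str]:
    """Backend-agnostic extractor: collect href values from all <a> elements."""
    return [attrs["href"] for attrs in dom.query_all("a") if attrs.get("href")]


# Sample page, invented for this sketch.
PAGE = (
    "<html><body>"
    '<a href="/a">A</a><p>text</p><a href="/b">B</a><a>no href</a>'
    "</body></html>"
)

# The same extractor runs unchanged against either backend.
urls_from_html_parser = extract_urls(HtmlParserDom(PAGE))
urls_from_element_tree = extract_urls(ElementTreeDom(PAGE))
```

Swapping in a browser-backed or third-party parser would then only mean writing another class with a `query_all` method; the extractor itself would not change.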


NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Tags: Puppeteer, Web Crawling, Automation, Scraper, Playwright
Post date: 10 Feb 2021
