Mastering Web Scraping in Python: Crawling from Scratch

skyscraper

Mentions: 3 | Stars: 401 | Activity: 4.9 | Language: Clojure

Structural scraping for the rest of us. (by nathell)

I’ve done a fair share of scraping, and I learned that on a large scale, there are a lot of cross-cutting repetitive concerns. Things like caching, fetching HTML (preferably in parallel), throttling, retries, navigation, emitting the output as a dataset…
My library, Skyscraper [0], attempts to help with these. It’s written in Clojure (based on Enlive or Reaver, both counterparts to Beautiful Soup), but the principles should be readily transferable everywhere.
[0]: https://github.com/nathell/skyscraper
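
As a rough illustration of the cross-cutting concerns listed above (caching, parallel fetching, throttling, retries), here is a minimal Python sketch using only the standard library. It is not Skyscraper's design or API; the `Fetcher` class, its parameters, and the injectable `fetch` callable are all hypothetical names chosen for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def http_get(url: str) -> str:
    """Default fetcher: a plain GET via the standard library."""
    with urlopen(url, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class Fetcher:
    """Hypothetical helper bundling cache, throttle, retry, and parallelism."""

    def __init__(self, fetch: Callable[[str], str] = http_get,
                 retries: int = 3, backoff: float = 0.5,
                 min_interval: float = 0.0) -> None:
        self.fetch = fetch
        self.retries = retries            # total attempts per URL
        self.backoff = backoff            # base delay, doubled per failed attempt
        self.min_interval = min_interval  # minimum spacing between requests
        self.cache: Dict[str, str] = {}
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def _throttle(self) -> None:
        # Reserve the next request slot under a lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_slot - now)
            self._next_slot = max(now, self._next_slot) + self.min_interval
        if wait:
            time.sleep(wait)

    def get(self, url: str) -> str:
        if url in self.cache:             # cache hit: no network call
            return self.cache[url]
        last_error = None
        for attempt in range(self.retries):
            self._throttle()
            try:
                body = self.fetch(url)
            except Exception as exc:      # retry on any fetch failure
                last_error = exc
                time.sleep(self.backoff * 2 ** attempt)
            else:
                self.cache[url] = body
                return body
        raise RuntimeError(f"giving up on {url}") from last_error

    def get_many(self, urls: Iterable[str], workers: int = 4) -> Dict[str, str]:
        # Fetch several URLs in parallel using a thread pool.
        urls = list(urls)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(zip(urls, pool.map(self.get, urls)))
```

Passing the fetch function in as a parameter keeps the retry and cache logic testable without network access; a real crawler would likely also want a persistent cache and per-host throttling.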

WebDumper

Mentions: 2 | Stars: 131 | Activity: 0.0 | Language: TypeScript

A tool for scraping, dumping and unpacking (webpacked) javascript source files.

I've had decent experience using WebDumper [1] to scrape SPAs (single-page applications), which rely almost entirely on DOM rendering and client-side JS.
I'm curious whether folks here have any other recommendations for scraping SPAs, i.e. React or Angular applications.
[1]: https://github.com/EllyMandliel/WebDumper
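
One way to unpack bundled JavaScript, assuming the site publishes a version 3 source map that includes `sourcesContent`, is to read the original files straight out of that map. The Python sketch below shows the idea; whether WebDumper works this way is not stated in the comment above, and `unpack_source_map` is a hypothetical helper name.

```python
import json
from pathlib import PurePosixPath
from typing import Dict


def unpack_source_map(raw: str) -> Dict[str, str]:
    """Map each original file path to its source text.

    Assumes a v3 source map whose "sources" and "sourcesContent"
    arrays line up index by index.
    """
    data = json.loads(raw)
    sources = data.get("sources", [])
    contents = data.get("sourcesContent") or []
    files: Dict[str, str] = {}
    for name, content in zip(sources, contents):
        if content is None:
            continue  # the map lists this file but does not embed it
        # Drop a scheme prefix such as "webpack://" if present.
        path = name.split("://", 1)[-1]
        # Remove root, "." and ".." parts so output stays inside one folder.
        parts = [p for p in PurePosixPath(path).parts
                 if p.strip("/") and p not in (".", "..")]
        if parts:
            files["/".join(parts)] = content
    return files
```

The returned dictionary can then be written to disk file by file. Bundles shipped without source maps need a different approach, which this sketch does not cover.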

ChromeController

Mentions: 1 | Stars: 209 | Activity: 3.3 | Language: Python

Comprehensive wrapper and execution manager for the Chrome browser using the Chrome Debugging Protocol.

colly

Mentions: 39 | Stars: 22,205 | Activity: 5.7 | Language: Go

Elegant Scraper and Crawler Framework for Golang

If performance, especially concurrency, matters, then we should include Go-based libraries in the discussion as well.
Colly [1] is a batteries-included scraping library, but it could be a bit intimidating for someone new to HTML scraping.
rod [2] is a browser automation library based on the DevTools Protocol. It adheres to the Go way of doing things, so it's very intuitive, but it comes with the overhead of running a Chromium browser, even if it's headless.
[1] https://github.com/gocolly/colly
[2] https://github.com/go-rod/rod

rod

Mentions: 20 | Stars: 4,808 | Activity: 7.9 | Language: Go

A Devtools driver for web automation and scraping

grub-2.0

Mentions: 4 | Stars: 19 | Activity: 0.0 | Language: Python

Grub is an AI powered Web crawler.

I’ve found that rendering the page to an image and running OCR on that image works quite well for text extraction. Many pages on the Internet render with JavaScript, which means Beautiful Soup (BS) may not see the text, because it is not in the HTML as served.
Here is the code to do some of that: https://github.com/kordless/grub-2.0
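
To make the JavaScript-rendering problem concrete, here is a small Python sketch that extracts visible text from raw HTML with the standard library's `html.parser`. It does not perform OCR and is not taken from grub-2.0; the two sample pages and the `visible_text` helper are invented for illustration. A static page yields its text, while a JavaScript-driven shell yields nothing, since the text only exists after a browser runs the script.

```python
from html.parser import HTMLParser
from typing import List

# Tags whose contents are never shown to the reader as page text.
SKIP = {"script", "style"}


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented sample pages: one server-rendered, one client-rendered shell.
STATIC_PAGE = "<html><body><h1>Prices</h1><p>Widget: $5</p></body></html>"
SPA_SHELL = (
    '<html><body><div id="root"></div>'
    '<script>render("Widget: $5")</script></body></html>'
)

print(repr(visible_text(STATIC_PAGE)))  # text is present in the served HTML
print(repr(visible_text(SPA_SHELL)))    # empty: the text appears only after JS runs
```

Getting the text out of the second page requires something that executes the script first, such as a headless browser, whether the rendered result is then read from the DOM or captured as an image for OCR.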

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

Squirm - This was the night of the crawling terror!
4 projects | /r/crystal_programming | 6 May 2023

Scrape and automate with golang
2 projects | /r/golang | 11 Jun 2022

Show HN: Extracting structured data from the web with LLMs
2 projects | news.ycombinator.com | 1 May 2024

Sometimes things simply don't work
3 projects | dev.to | 23 Apr 2024

Tutorial: Extracting structured data from websites using Groq and Firecrawl
1 project | news.ycombinator.com | 22 Apr 2024

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Scraper Crawler Testing Chrome Nginx
Post date: 11 Aug 2021
