Mastering Web Scraping in Python: Crawling from Scratch

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • skyscraper

    Structural scraping for the rest of us. (by nathell)

  • I’ve done a fair share of scraping, and I learned that on a large scale, there are a lot of cross-cutting repetitive concerns. Things like caching, fetching HTML (preferably in parallel), throttling, retries, navigation, emitting the output as a dataset…

    My library, Skyscraper [0], attempts to help with these. It’s written in Clojure (based on Enlive or Reaver, both counterparts to Beautiful Soup), but the principles should be readily transferable everywhere.

    [0]: https://github.com/nathell/skyscraper
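The cross-cutting concerns listed in the comment above (caching, parallel fetching, throttling, retries) can be sketched in Python with the standard library alone. This is a minimal illustration, not Skyscraper's design or API: `PoliteFetcher`, `fetch_with_urllib`, and all parameter names are hypothetical, and the cache is a plain in-memory dict.

```python
import threading
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_with_urllib(url):
    """Hypothetical default fetcher: download a page and decode it as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class PoliteFetcher:
    """Wraps a fetch callable with an in-memory cache, throttling, and retries."""

    def __init__(self, fetch=fetch_with_urllib, min_interval=1.0,
                 max_retries=3, backoff=0.5):
        self.fetch = fetch                # callable: url -> str
        self.min_interval = min_interval  # seconds between request starts
        self.max_retries = max_retries    # assumed to be at least 1
        self.backoff = backoff            # base delay for exponential backoff
        self.cache = {}
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def _throttle(self):
        # Reserve the next request slot under a lock, then sleep outside it,
        # so that all threads share one global request rate.
        with self._lock:
            now = time.monotonic()
            wait = max(0.0, self._next_slot - now)
            self._next_slot = max(now, self._next_slot) + self.min_interval
        if wait:
            time.sleep(wait)

    def get(self, url):
        if url in self.cache:
            return self.cache[url]
        last_error = None
        for attempt in range(self.max_retries):
            self._throttle()
            try:
                body = self.fetch(url)
            except Exception as exc:      # a sketch: real code would be pickier
                last_error = exc
                time.sleep(self.backoff * 2 ** attempt)
            else:
                self.cache[url] = body
                return body
        raise last_error

    def get_many(self, urls, workers=4):
        # Fetch in parallel threads. Two threads asking for the same uncached
        # URL at the same moment may both download it; the sketch accepts that.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return dict(zip(urls, pool.map(self.get, urls)))
```

Passing the fetch function in as a parameter keeps the wrapper independent of any particular HTTP client, which also makes it easy to exercise with a fake fetcher.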

  • WebDumper

    A tool for scraping, dumping and unpacking (webpacked) javascript source files.

  • I've had decent experience using WebDumper [1] to scrape SPAs (single-page applications), which rely almost entirely on DOM rendering and client-side JS.

    I'm curious whether folks here have any other recommendations for scraping SPAs, i.e., React or Angular applications.

    1: https://github.com/EllyMandliel/WebDumper
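One way bundled JavaScript can be unpacked, when a site publishes source maps, is to read the `sources` and `sourcesContent` fields of a version 3 source map. The sketch below assumes that approach for illustration only; it is not taken from WebDumper, and `unpack_source_map` and `safe_relative_path` are hypothetical names.

```python
import json


def safe_relative_path(name):
    """Turn a source-map entry such as 'webpack:///./src/a.js' into 'src/a.js'."""
    if "://" in name:
        name = name.split("://", 1)[1]   # drop a scheme such as webpack://
    # Drop empty, '.', and '..' segments so the result stays inside one folder.
    parts = [p for p in name.split("/") if p not in ("", ".", "..")]
    return "/".join(parts)


def unpack_source_map(source_map_text):
    """Return {relative_path: original_source} for entries with embedded content."""
    data = json.loads(source_map_text)
    sources = data.get("sources", [])
    contents = data.get("sourcesContent") or []
    files = {}
    for index, name in enumerate(sources):
        content = contents[index] if index < len(contents) else None
        if content is None:              # the map may omit some file bodies
            continue
        files[safe_relative_path(name)] = content
    return files
```

The function only returns a dictionary; writing the files to disk is left to the caller, so the path cleaning can be checked before anything touches the filesystem.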

  • ChromeController

    Comprehensive wrapper and execution manager for the Chrome browser using the Chrome Debugging Protocol.

  • colly

    Elegant Scraper and Crawler Framework for Golang

  • If performance, especially concurrency, matters, then we should include Go-based libraries in the discussion as well.

    Colly [1] is a batteries-included scraping library, but it could be a bit intimidating for someone new to HTML scraping.

    rod [2] is a browser automation library based on the DevTools Protocol. It adheres to the Go way of doing things, so it's very intuitive, but it comes with the overhead of running a Chromium browser, even if it's headless.

    [1] https://github.com/gocolly/colly

    [2] https://github.com/go-rod/rod
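For readers new to HTML scraping, the core loop that a crawler framework automates can be sketched in a few lines of Python. This is a sequential, standard-library illustration under stated assumptions, not Colly's API: `crawl`, `extract_links`, and the injected `fetch` callable are hypothetical names, and the crawl is limited to the start URL's host.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs, without fragments, for the links in a page."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of one host; fetch is any callable url -> html."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}                   # never enqueue the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:                # skip pages that fail to download
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A framework adds concurrency, rate limits, and callbacks on top of this loop; the queue and the `seen` set are the parts that stay the same.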

  • rod

    A Devtools driver for web automation and scraping

  • grub-2.0

    Grub is an AI powered Web crawler.

  • I’ve found that imaging the page and running OCR on the image works quite well for text extraction. Many pages on the Internet render their content with JavaScript, which means Beautiful Soup may not see the text in the HTML it parses.

    Here is the code to do some of that: https://github.com/kordless/grub-2.0
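The point about JavaScript-rendered pages can be shown with a small sketch. It uses the standard library's `html.parser` as a stand-in for Beautiful Soup, on the assumption that both only see the HTML as delivered and do not run scripts. `VisibleText`, `static_text`, and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text a static parser can see, joined by single spaces."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A made-up single-page app: the visible text exists only after the script runs.
SPA_PAGE = (
    '<html><body><div id="root"></div>'
    '<script>document.getElementById("root").textContent = "Widget: 5 USD";'
    "</script></body></html>"
)

# static_text(SPA_PAGE) is an empty string, because the parser never runs the
# script that fills in the <div>.
```

This is the gap that rendering the page in a browser closes, whether the rendered result is then read from the DOM or, as the comment suggests, captured as an image and passed through OCR.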
