The State of Web Scraping in 2021

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Playwright

    Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

  • I moved from Selenium to Playwright. It has a pleasant API and things just work out of the box. I ran into odd problems with Selenium before, especially when waiting for a specific element: Selenium didn't register it, even though I could see it load.

    It was uncharacteristic of me, because I tend to use boring, older technologies. But this gamble paid off for me.

    https://playwright.dev/

  • pyppeteer

    Headless chrome/chromium automation library (unofficial port of puppeteer)

  • In my own experience, Puppeteer is much better and more capable than Selenium, but the problem is that Puppeteer requires Node.js. Its Python wrapper, https://github.com/pyppeteer/pyppeteer, was not as good as Selenium if you prefer to use Python.

  • colly

    Elegant Scraper and Crawler Framework for Golang

  • If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:

    https://gotripod.com/insights/super-simple-site-crawling-and...

    [1] http://go-colly.org/

  • lambdasoup

    Functional HTML scraping and rewriting with CSS in OCaml

  • OCaml’s Lambda Soup (https://aantron.github.io/lambdasoup/) is an amazing library, especially for those who prefer functional programming.

  • selectolax

    Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).

  • Lazyweb link: https://github.com/rushter/selectolax

    although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_

    > Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.

    although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor

    ---

    > It looks like the author of the article just googled some libraries for each language and didn't research the topic

    Heh, oh, new to the Internet, are you? :-D

  • lexbor

    Lexbor is an open source HTML renderer library under development. https://lexbor.com

  • utls

    Fork of the Go standard TLS library, providing low-level access to the ClientHello for mimicry purposes.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Related posts

  • Show HN: Flyscrape – A standalone and scriptable web scraper in Go

    6 projects | news.ycombinator.com | 11 Nov 2023
  • Colly: Elegant Scraper and Crawler Framework for Golang

    1 project | news.ycombinator.com | 23 Aug 2023
  • HTML Templates | Why would you use them over react?

    8 projects | /r/golang | 16 Apr 2023
  • colly VS scrapemate - a user suggested alternative

    2 projects | 15 Apr 2023
  • Web scraper help

    1 project | /r/golang | 1 Mar 2023