The State of Web Scraping in 2021

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Playwright

    Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

    I moved from Selenium to Playwright. It has a pleasant API and things just work out of the box. I had run into odd problems with Selenium before, especially when waiting for a specific element: Selenium didn't register it, even though I could see it load.

    It was uncharacteristic of me, because I tend to use boring, older technologies. But this gamble paid off for me.

    https://playwright.dev/

  • pyppeteer

    Headless chrome/chromium automation library (unofficial port of puppeteer)

    In my experience, puppeteer is much more capable than selenium, but the problem is that puppeteer requires Node.js. Its Python port, https://github.com/pyppeteer/pyppeteer, was not as good as selenium if you prefer to work in Python.

  • colly

    Elegant Scraper and Crawler Framework for Golang

    If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:

    https://gotripod.com/insights/super-simple-site-crawling-and...

    [1] http://go-colly.org/

  • lambdasoup

    Functional HTML scraping and rewriting with CSS in OCaml

    OCaml’s Lambda Soup (https://aantron.github.io/lambdasoup/) is an amazing library, especially for those who prefer functional programming.

  • selectolax

    Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).

    Lazyweb link: https://github.com/rushter/selectolax

    although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_

    > Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.

    although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor

    ---

    > It looks like the author of the article just googled some libraries for each language and didn't research the topic

    Heh, oh, new to the Internet, are you? :-D

  • lexbor

    Lexbor is an open source HTML renderer library under development. https://lexbor.com

  • utls

    Fork of the Go standard TLS library, providing low-level access to the ClientHello for mimicry purposes.

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
