The State of Web Scraping in 2021

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • Playwright

    Playwright is a framework for Web Testing and Automation. It allows testing Chromium, Firefox and WebKit with a single API.

    I moved from Selenium to Playwright. It has a pleasant API and things just work out of the box. I ran into odd problems with Selenium before, especially when waiting for a specific element: Selenium didn't register it, even though I could see it load.

    It was uncharacteristic of me, because I tend to use boring, older technologies. But this gamble paid off for me.

    https://playwright.dev/

  • pyppeteer

    Headless chrome/chromium automation library (unofficial port of puppeteer)

    In my own experience, Puppeteer is much better and more capable than Selenium, but the problem is that Puppeteer requires Node.js. Its Python wrapper, https://github.com/pyppeteer/pyppeteer, was not as good as Selenium if you prefer to use Python.

  • colly

    Elegant Scraper and Crawler Framework for Golang

    If you're familiar with Go, there's Colly too [1]. I liked its simplicity and approach and even wrote a little wrapper around it to run it via Docker and a config file:

    https://gotripod.com/insights/super-simple-site-crawling-and...

    [1] http://go-colly.org/

  • lambdasoup

    Functional HTML scraping and rewriting with CSS in OCaml

    OCaml’s Lambda Soup (https://aantron.github.io/lambdasoup/) is an amazing library, especially for those who prefer functional programming.

  • selectolax

    Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).

    Lazyweb link: https://github.com/rushter/selectolax

    although I don't follow the need to have what appears to be two completely separate HTML parsing C libraries as dependencies; seeing this in the readme for Modest gives me the shivers because lxml has _seen some shit_

    > Modest is a fast HTML renderer implemented as a pure C99 library with no outside dependencies.

    although its other dep seems much more cognizant about the HTML5 standard, for whatever that's worth: https://github.com/lexbor/lexbor#lexbor

    ---

    > It looks like the author of the article just googled some libraries for each language and didn't research the topic

    Heh, oh, new to the Internet, are you? :-D

  • lexbor

    Lexbor is development of an open source HTML Renderer library. http://lexbor.com

  • utls

    Fork of the Go standard TLS library, providing low-level access to the ClientHello for mimicry purposes.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
