Web Scraping Open Knowledge

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com.

  1. Webscraping Open Project

    Discontinued. The Web Scraping Open Project repository aims to share knowledge and experience about web scraping with Python. [Moved to: https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero]

    The page about canvas fingerprinting [0] mentions only Cloudflare. From what I can tell, reCAPTCHA v3 also uses canvas fingerprinting [1].

    [0] https://github.com/reanalytics-databoutique/webscraping-open...

    [1] https://brianwjoe.com/2019/02/06/how-does-recaptcha-v3-work/

  2. CodeRabbit

    CodeRabbit: AI Code Reviews for Developers. Revolutionize your code reviews with AI. CodeRabbit offers PR summaries, code walkthroughs, 1-click suggestions, and AST-based analysis. Boost productivity and code quality across all major languages with each PR.

  3. docker-selenium-lambda

    The simplest demo of Chrome automation with Python and Selenium in AWS Lambda

    This has been the most helpful thing for me regarding scraping: running Selenium on AWS Lambda, using this container: https://github.com/umihico/docker-selenium-lambda
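
As a rough illustration of the setup the comment describes, a Lambda handler driving headless Chrome might look like the sketch below. The binary paths, the flag list, and the `url` event field are assumptions made for this example and are not taken from the linked repository; Selenium is imported lazily so the module loads even where it is not installed.

```python
# Flags commonly passed to Chrome in constrained container environments.
# This particular list is an assumption, not the container's documented set.
CHROME_FLAGS = [
    "--headless=new",
    "--no-sandbox",
    "--disable-gpu",
    "--disable-dev-shm-usage",
    "--single-process",
]


def build_driver(chrome_binary="/opt/chrome/chrome", driver_binary="/opt/chromedriver"):
    """Create a headless Chrome driver. Both default paths are hypothetical."""
    # Imported here so that importing this module does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.binary_location = chrome_binary
    for flag in CHROME_FLAGS:
        options.add_argument(flag)
    return webdriver.Chrome(
        service=Service(executable_path=driver_binary), options=options
    )


def handler(event, context):
    """Lambda entry point: load a page and return its title."""
    url = event.get("url", "https://example.com/")
    driver = build_driver()
    try:
        driver.get(url)
        return {"url": url, "title": driver.title}
    finally:
        # Always release the browser, even if the page load fails.
        driver.quit()
```

Invoking the function with an event such as `{"url": "https://example.com/"}` would return the page title, provided the container image ships Chrome and chromedriver at the assumed paths.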

  4. hextuples

    An RDF serialization format designed for performance in the browser

  6. openstates-scrapers

    source for Open States scrapers

    Here’s an example of parsing some particularly annoying old-school HTML. I’m not claiming it’s the _best_ way to do it, just that you can, and I’m not sure this one is doable with selectors. https://github.com/openstates/openstates-scrapers/blob/40246...
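
To make the point about selectors concrete, here is a minimal sketch of event-based parsing with the standard library's `html.parser`. It is not the code from the linked scraper; the sample markup and the `LabelValueParser` class are invented for illustration. The values sit in bare text nodes between `<b>` labels and `<br>` tags, which a CSS selector cannot target on its own.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect label/value pairs from markup like <b>Label:</b> value<br>."""

    def __init__(self):
        super().__init__()
        self.in_label = False   # True while inside a <b> element
        self.label = None       # most recent label still waiting for a value
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_label = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_label = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_label:
            self.label = text.rstrip(":")
        elif self.label is not None:
            # First non-empty text after a label is treated as its value.
            self.pairs[self.label] = text
            self.label = None


# Hypothetical sample of table-era markup with unclosed <br> tags.
SAMPLE = "<td><b>Sponsor:</b> Jane Doe<br><b>Status:</b> Passed<br></td>"

parser = LabelValueParser()
parser.feed(SAMPLE)
parser.close()
print(parser.pairs)
```

The parser keeps a small amount of state between events, which is the trade-off of this approach: it copes with markup that has no useful structure, at the cost of hand-written logic for each page layout.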

  7. cloudscraper

    A Python module to bypass Cloudflare's anti-bot page.

    Anyone with a stake in bypassing anti-bot measures isn't going to share their tactics, since sharing them will lead to the workaround being patched or mitigated, forcing them to research new bot-detection workarounds.

    Projects like cloudscraper[0] are often linked to point and say "look! they broke Cloudflare!" but CF and the rest of the industry have detections for tools like this, and instead of rolling out blocks for these tools, they give website owners tools like bot score[1] to manage their own risk level on a per-page basis.

    0: https://github.com/VeNoMouS/cloudscraper

    1: https://developers.cloudflare.com/bots/concepts/bot-score/
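
The idea of managing risk per page can be sketched as a small policy function. This is only an illustration in Python: Cloudflare site owners would express such rules in Cloudflare's own configuration, and the paths, thresholds, and action names below are hypothetical. The one grounded assumption is that a bot score runs from 1 to 99, with lower values meaning the request is more likely automated.

```python
# Hypothetical per-path thresholds: a request whose score is below the
# threshold for its path gets challenged. Sensitive pages are stricter.
THRESHOLDS = {"/login": 30, "/search": 10}
DEFAULT_THRESHOLD = 2  # elsewhere, challenge only near-certain automation


def action_for(path: str, bot_score: int) -> str:
    """Return "challenge" or "allow" for a request path and its bot score."""
    if not 1 <= bot_score <= 99:
        raise ValueError("bot score is expected to be in the range 1-99")
    threshold = next(
        (limit for prefix, limit in THRESHOLDS.items() if path.startswith(prefix)),
        DEFAULT_THRESHOLD,
    )
    return "challenge" if bot_score < threshold else "allow"


print(action_for("/login", 15))      # strict page, low score
print(action_for("/blog/post", 15))  # lenient page, same score
```

The same score leads to different outcomes depending on the page, which is the point of the comment: detection is decoupled from enforcement, so a tool being detectable does not mean it is blocked everywhere.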

  8. domonic

    Create HTML with Python 3 using a standard DOM API. Includes a Python port of JavaScript for interoperability and tons of other cool features. A fast prototyping library.

    I'm not sure about quicker. Doesn't Scrapy use elementpath, which converts a CSS query to XPath under the hood because there is no complete CSSOM available for Python? There is likely no modern, standards-based Python DOM to operate on, so doing it on an lxml tree is probably the best option. I find the main difference is that XPath can return an attribute value, whereas CSS returns the node. You can use either from the terminal in my lib: https://github.com/byteface/domonic (it uses elementpath, like Scrapy).
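
The translation idea in the comment can be shown with a toy example. The sketch below is not how Scrapy, elementpath, or domonic implement it; `css_to_xpath` is a hypothetical helper that handles only `tag`, `tag#id`, and `tag.class`, and it targets the limited XPath subset of the standard library's `xml.etree.ElementTree`. It also shows that a CSS-style query yields nodes, so an attribute value has to be read from the node afterwards.

```python
import re
import xml.etree.ElementTree as ET


def css_to_xpath(selector: str) -> str:
    """Translate a tiny CSS subset (tag, tag#id, tag.class) to ElementTree XPath."""
    match = re.fullmatch(r"([A-Za-z][\w-]*|\*)?(?:([#.])([\w-]+))?", selector)
    if not selector or not match:
        raise ValueError(f"unsupported selector: {selector!r}")
    tag, kind, value = match.groups()
    xpath = ".//" + (tag or "*")
    if kind == "#":
        xpath += f"[@id='{value}']"
    elif kind == ".":
        # Exact attribute match only; real translators handle multiple classes.
        xpath += f"[@class='{value}']"
    return xpath


# Hypothetical, well-formed sample document.
DOC = ET.fromstring(
    "<html><body>"
    "<a id='home' href='/index'>Home</a>"
    "<a class='ext' href='https://example.com'>Example</a>"
    "</body></html>"
)


def select(selector: str):
    """Run a CSS-style query by translating it to XPath first."""
    return DOC.findall(css_to_xpath(selector))


nodes = select("a.ext")                        # the query returns element nodes
hrefs = [node.get("href") for node in nodes]   # attributes are read afterwards
print(css_to_xpath("a.ext"), hrefs)
```

With a full XPath engine the attribute could be requested directly in the expression, which is the difference the comment points out.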

  9. SaaSHub

    SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives

  10. morph

    Take the hassle out of web scraping (by openaustralia)

    This is the one I know about: https://morph.io/ and https://github.com/openaustralia/morph#readme (AGPLv3). It used to sit at the intersection of "Heroku for scrapers" and DoltHub (e.g. https://www.dolthub.com/repositories/dolthub/us-businesses/d...), since the scrapers would run and then make their data available as CSV, SQLite, or whatever. But when I just tried to load one of the morph.io scrapers, the page only said "creating new template", so I'm guessing it has gone the way of ScraperWiki.com, which preceded it. It turns out that free hosted compute isn't free.
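
The pattern described in the comment, where a scraper stores rows and the data is then offered as SQLite or CSV, can be sketched with the standard library alone. The table name, column names, and sample rows below are assumptions made for this example, and an in-memory database stands in for the file a hosted scraper would write.

```python
import csv
import io
import sqlite3


def save_rows(conn, rows):
    """Upsert scraped rows so that re-running the scraper does not duplicate data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data (name TEXT PRIMARY KEY, state TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO data (name, state) VALUES (:name, :state)", rows
    )
    conn.commit()


def export_csv(conn) -> str:
    """Render the stored table as CSV text, header row first."""
    cursor = conn.execute("SELECT name, state FROM data ORDER BY name")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Hypothetical scraped records.
ROWS = [{"name": "Acme", "state": "CA"}, {"name": "Zenith", "state": "NY"}]

conn = sqlite3.connect(":memory:")
save_rows(conn, ROWS)
csv_text = export_csv(conn)
print(csv_text)
```

Keying the table on a stable identifier is what makes scheduled re-runs safe: each run replaces existing rows instead of appending duplicates.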
