Web Scraping Open Knowledge

Webscraping Open Project

11 1,307 0.0 Python

Discontinued. The web scraping open project repository aims to share knowledge and experiences about web scraping with Python. [Moved to: https://github.com/TheWebScrapingClub/webscraping-from-0-to-hero]

The page about canvas fingerprinting[0] only mentions Cloudflare. From what I can tell, reCAPTCHA v3 also uses canvas fingerprinting[1].
[0] https://github.com/reanalytics-databoutique/webscraping-open...
[1] https://brianwjoe.com/2019/02/06/how-does-recaptcha-v3-work/

docker-selenium-lambda

1 449 8.3 Dockerfile

The simplest demo of chrome automation by python and selenium in AWS Lambda

The most helpful thing for me regarding scraping has been running Selenium on AWS Lambda with this container: https://github.com/umihico/docker-selenium-lambda
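As a rough illustration of what running Selenium inside a Lambda container can look like, here is a minimal sketch. It is not taken from the linked repository: the environment variable names CHROME_BINARY and CHROMEDRIVER_PATH, their default paths, and the exact flag list are assumptions for illustration, and the handler only works where the selenium package, Chrome, and chromedriver are actually installed.

```python
import os


def chrome_arguments():
    """Flags often passed to headless Chrome in small, sandboxed containers.

    The exact set is an assumption; adjust it to the image you run.
    """
    return [
        "--headless=new",           # run without a display
        "--no-sandbox",             # the container is usually the sandbox
        "--disable-gpu",            # no GPU available in Lambda
        "--disable-dev-shm-usage",  # /dev/shm is small in many containers
    ]


def handler(event, context):
    """Lambda-style entry point: open a URL and return the page title."""
    # Imported lazily so this module can be loaded without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    # Hypothetical default paths; the real locations depend on the image.
    options.binary_location = os.environ.get("CHROME_BINARY", "/opt/chrome/chrome")
    for argument in chrome_arguments():
        options.add_argument(argument)

    service = Service(
        executable_path=os.environ.get("CHROMEDRIVER_PATH", "/opt/chromedriver")
    )
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(event.get("url", "https://example.com/"))
        return {"title": driver.title}
    finally:
        # Always release the browser, even if the page load fails.
        driver.quit()
```

Keeping the flag list in its own function makes it easy to inspect or change without touching the browser setup.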

hextuples

2 28 2.2

An RDF serialization format designed for performance in the browser
webscraping-open

2 - -


openstates-scrapers

18 834 9.8 Python

source for Open States scrapers

Here’s an example of parsing some particularly annoying old-school HTML. I’m not claiming it’s the _best_ way to do it, just that you can, and I’m not sure this one is doable with selectors. https://github.com/openstates/openstates-scrapers/blob/40246...

cloudscraper

19 3,974 1.5 Python

A Python module to bypass Cloudflare's anti-bot page.

Anyone with a stake in bypassing anti-bot measures isn't going to share their tactics, since sharing them leads to the workaround being patched or mitigated, forcing them to research new bot detection workarounds.
Projects like cloudscraper[0] are often pointed to as proof that "look! they broke Cloudflare!", but CF and the rest of the industry have detections for tools like this. Instead of rolling out blocks for these tools, they give website owners tools like bot score[1] to manage their own risk level on a per-page basis.
0: https://github.com/VeNoMouS/cloudscraper
1: https://developers.cloudflare.com/bots/concepts/bot-score/
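To make the "per-page risk level" idea concrete, here is a minimal sketch of how a site owner might map a bot score to an action per path. It assumes the documented score range of 1 (almost certainly automated) to 99 (almost certainly human); the paths, thresholds, action names, and the function action_for are hypothetical and are not Cloudflare's rule syntax.

```python
# Hypothetical per-path thresholds: requests scoring below the threshold
# are challenged. Sensitive pages use a stricter (higher) threshold.
THRESHOLDS = {
    "/login": 30,
    "/search": 10,
}
DEFAULT_THRESHOLD = 1  # with scores starting at 1, nothing falls below this


def action_for(path, bot_score):
    """Return "challenge" or "allow" for a request path and bot score."""
    if not 1 <= bot_score <= 99:
        raise ValueError("bot score is expected to be between 1 and 99")

    # Pick the threshold of the longest configured prefix matching the path.
    threshold = DEFAULT_THRESHOLD
    best_length = -1
    for prefix, value in THRESHOLDS.items():
        if path.startswith(prefix) and len(prefix) > best_length:
            threshold, best_length = value, len(prefix)

    return "challenge" if bot_score < threshold else "allow"
```

With these example values, a request scoring 5 would be challenged on /login but allowed on a page with no configured threshold, which is the kind of per-page trade-off the comment describes.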

domonic

32 130 6.1 Python

Create HTML with python 3 using a standard DOM API. Includes a python port of JavaScript for interoperability and tons of other cool features. A fast prototyping library.

I'm not sure about quicker. Doesn't Scrapy use elementpath, which converts a CSS query to XPath under the hood, since there is no complete CSSOM available for Python? That is likely because there is no modern, standards-based Python DOM to operate on, so running it on an lxml tree is probably the best option. I find the main difference is that XPath can return an attribute value, whereas CSS returns the node. You can use either from the terminal in my lib: https://github.com/byteface/domonic (as it uses elementpath like Scrapy).
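The CSS-to-XPath conversion mentioned above can be sketched for a tiny subset of CSS. This is a toy translator, not the algorithm used by elementpath or any other library: the helper names compound_to_xpath and css_to_xpath are hypothetical, and only tag names, #id, .class, and the descendant (space) combinator are handled.

```python
import re

# One compound selector: optional tag (or *), then any number of #id/.class.
_COMPOUND = re.compile(r"(?P<tag>[A-Za-z][\w-]*|\*)?(?P<rest>(?:[#.][\w-]+)*)$")


def compound_to_xpath(compound):
    """Translate e.g. 'a.link' or 'div#main' into an XPath node test."""
    match = _COMPOUND.match(compound)
    if not match:
        raise ValueError(f"unsupported selector part: {compound!r}")

    tag = match.group("tag") or "*"
    predicates = []
    for kind, name in re.findall(r"([#.])([\w-]+)", match.group("rest")):
        if kind == "#":
            predicates.append(f"@id='{name}'")
        else:
            # Match one class token inside a space-separated class attribute.
            predicates.append(
                f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"
            )
    return tag + "".join(f"[{p}]" for p in predicates)


def css_to_xpath(selector):
    """Translate a space-separated chain of compound selectors to XPath."""
    parts = selector.split()
    if not parts:
        raise ValueError("empty selector")
    return "//" + "//".join(compound_to_xpath(part) for part in parts)


# CSS can only select the node; XPath can go one step further and
# select the attribute value itself.
node_query = css_to_xpath("div#main a.link")
href_query = node_query + "/@href"
```

The last two lines illustrate the difference the comment points out: the translated CSS query selects elements, while appending /@href yields an XPath query for the attribute itself.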

morph

1 463 0.0 Ruby

Take the hassle out of web scraping (by openaustralia)

This is the one I know about: https://morph.io/ and https://github.com/openaustralia/morph#readme (AGPLv3). They used to sit at the intersection of "Heroku for scrapers" and DoltHub (e.g. https://www.dolthub.com/repositories/dolthub/us-businesses/d...), since the scrapers would run and then make their data available as CSV, SQLite, or whatever. But when I just tried to load one of the morph.io scrapers, the page only said "creating new template", so I'm guessing they've gone the way of the ScraperWiki.com that preceded them: it turns out that hosted compute for free isn't free.

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives, so a higher number indicates a more popular project.