alltheplaces VS scrapy-playwright

Compare alltheplaces vs scrapy-playwright and see what their differences are.

                 alltheplaces                              scrapy-playwright
Mentions         6                                         11
Stars            559                                       837
Growth           7.2%                                      3.1%
Activity         10.0                                      7.8
Latest commit    5 days ago                                3 months ago
Language         Python                                    Python
License          GNU General Public License v3.0 or later  BSD 3-clause "New" or "Revised" License
The number of mentions indicates the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

alltheplaces

Posts with mentions or reviews of alltheplaces. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-12-09.
  • Differentiating between hypermarkets and supermarkets.
    1 project | /r/openstreetmap | 9 Dec 2023
    Maybe a different approach? https://www.alltheplaces.xyz/ has stores grouped by name
  • Meta, Microsoft and Amazon team up on maps project
    1 project | news.ycombinator.com | 26 Jul 2023
    I have an interest in this topic as a contributor to AllThePlaces[1], an open source project collating Scrapy spiders that crawl websites of franchises/retail chains that you'd find listed in name-suggestion-index[2] to retrieve location data. The project is now just shy of collecting 3 million points of interest from almost 1700 spiders.

    Overture Maps appears to be quite a closed and proprietary project, with claims of openness limited to being able to download a data set and accompanying schema specification. Some issues that immediately come to mind:

    1. There is no published description for how the data was generated. End users thus are given no assurance of how accurate and complete the data is.

    a. As an example, administrative boundaries are frightfully complex and include disputed boundaries, significant ambiguity in definition of boundaries, and trade-off between precision of boundaries versus performance of algorithms using administrative boundary data. Which definition of a boundary does Overture Maps adhere to, or can it support multiple definitions?

    b. It's probable that Microsoft have contributed ld+json/microdata geographic data from BingBot crawls of the Internet. This data is notoriously incorrect: fields are mixed up or invalidly repurposed, "CLOSED" appears in field names to denote a place that closed 5 years ago even though the web page remains online, and opening-hours specifications are highly ambiguous. For AllThePlaces, many of the spiders developed require human consideration, sometimes of considerable complexity, to piece together the horribly messy data published by shop and restaurant franchises and other organisations providing location data via their websites.

    c. For location information where +/- 1-5m accuracy and precision may be required (e.g. individual shops within a shopping centre[3]), source data is typically provided by the authoritative sources with 1mm precision and +/- 10-100m accuracy. AllThePlaces, Overture Maps, Google Maps and similar still need human editors (OpenStreetMap editors) to do on-the-ground surveys to pinpoint precise locations and to standardise the definition of a location (e.g. for a point, should it be the centroid of the largest regular polygon which could be placed in the overall irregular polygon, the center of mass of a planar lamina, the location of the main entrance, or some other definition?).

    d. If Overture Maps is dependent on BingBot for place data, they'll miss an enormous number of points of interest that BingBot would never be able to find. For example, an undocumented REST/JSON/GraphQL API call, or a modification of parameters to an observed store locator API call, may be necessary to return all locations and relevant fields of data. Website developers routinely do stupid things with robots.txt, such as instructing a bot to crawl 10k pages (1GB+) from a sitemap last updated 5 years ago rather than make 10 fast API calls for up-to-date data (5MB). Overture Maps would be free to consume data from AllThePlaces as it is CC-0 licensed, and possibly correlate it with other data sources such as BingBot crawl data, a government database of licensed commercial premises, or postal address geocoding data. However, the messiness of data in the various sources would be near impossible to reconcile, even for humans, and Overture Maps would likely have to decide whether to err on the side of having duplicates or on the side of lacking completeness.

    2. There is no published tooling for how someone else can reproduce the same data.

    a. AllThePlaces users fairly frequently experience the wrath of Cloudflare, Imperva and other Internet-breaking third parties, as well as custom geographic blocking schemes and, more rarely, overzealous rate limiting mechanisms. If Overture Maps is dependent on BingBot crawls, they'll have a slight advantage over AllThePlaces due to deliberate whitelisting of BingBot by the likes of Cloudflare, Imperva, customer firewalls, etc. However, no matter whether you're AllThePlaces or Overture Maps or anyone else, if you want to capture as many points of interest as possible across the world, use of residential ISP subnets and anti-bot-detection software is increasingly required. They'd need people in dozens of countries, each crawling the websites targeted at their own country using residential ISP address space. Otherwise they end up with an American view of the world, or a European view of the world, or something else that isn't the full picture.

    b. If Overture Maps has locations incorrect for a franchise/brand due to a data cleansing problem or to sourcing data from a bad (perhaps non-authoritative) source, there are no software repositories against which the franchise/brand can raise an issue or submit a patch.

    [1] https://www.alltheplaces.xyz/

    [2] http://nsi.guide/

    [3] Example Australian shopping centre as captured by AllThePlaces: https://www.alltheplaces.xyz/map/#18.07/-33.834646/150.98952...
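
    To illustrate the kind of spider the comment above describes, here is a minimal sketch (not an actual AllThePlaces spider): it assumes a hypothetical franchise site whose store pages embed schema.org LocalBusiness ld+json, and the sitemap URL and field names are purely illustrative.

    ```python
    import json

    import scrapy


    class ExampleStoreSpider(scrapy.Spider):
        """Sketch of an AllThePlaces-style spider: walk a (hypothetical) store
        sitemap and turn each page's LocalBusiness ld+json into a POI item."""

        name = "example_store"
        # Hypothetical starting point; real spiders target a brand's store locator.
        start_urls = ["https://example.com/stores/sitemap.xml"]

        def parse(self, response):
            # Follow every store page listed in the sitemap.
            for url in response.xpath("//*[local-name()='loc']/text()").getall():
                yield scrapy.Request(url, callback=self.parse_store)

        def parse_store(self, response):
            # ld+json is often the cleanest source but, as noted above, it is
            # frequently mislabelled or stale, so real spiders sanity-check it.
            for blob in response.xpath('//script[@type="application/ld+json"]/text()').getall():
                try:
                    data = json.loads(blob)
                except json.JSONDecodeError:
                    continue
                if not isinstance(data, dict) or data.get("@type") != "LocalBusiness":
                    continue
                geo = data.get("geo") or {}
                addr = data.get("address")
                yield {
                    "ref": response.url,
                    "name": data.get("name"),
                    "lat": geo.get("latitude"),
                    "lon": geo.get("longitude"),
                    "addr_full": addr.get("streetAddress") if isinstance(addr, dict) else addr,
                    "opening_hours": data.get("openingHours"),
                }
    ```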
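
    The same comment notes that some locations can only be retrieved through an undocumented store-locator API rather than crawlable pages. A sketch of that pattern, with a purely hypothetical endpoint and parameter names:

    ```python
    import requests

    # Hypothetical, undocumented store-locator endpoint observed in browser dev tools.
    # Widening the radius/page-size parameters (where the site permits it) often returns
    # every location in a handful of requests instead of thousands of page crawls.
    API = "https://example.com/api/store-locator"

    def fetch_all_stores():
        stores, page = [], 1
        while True:
            resp = requests.get(
                API,
                params={"radius": 50000, "per_page": 500, "page": page},
                timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json().get("stores", [])
            if not batch:
                break
            stores.extend(batch)
            page += 1
        return stores
    ```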

  • An aggressive, stealthy web spider operating from Microsoft IP space
    1 project | news.ycombinator.com | 18 Jan 2023
    I recently made some contributions to https://github.com/alltheplaces/alltheplaces, which is a set of Scrapy spiders for extracting locations of (primarily) franchise business stores, for use with projects such as OpenStreetMap. After these contributions I am now in shock at how hostile a lot of businesses are about allowing people the privilege of accessing their websites to find store locations or store information, or to purchase an item.

    Some examples of what I found you can't do at perhaps 20% of business websites:

    * Use the website near a country border where people often commute freely across borders on a daily or regular basis (example: Schengen region).

    * Plan a holiday figuring out when a shop closes for the day in another country.

    * Order something from an Internet connection in another country to arrive when you return home.

    * Look up product information whilst on a work trip overseas.

    * Access the website via a workplace proxy server that is mandated to be used and just so happens to be hosted in a data centre (the likes of AWS and Azure included), which is blocked simply because it's not a residential ISP. A more complex check would be the website verifying that the time taken for a request issued by JavaScript code executing on the client to reach the server matches the time the server thinks it should have taken (estimated by ICMP ping or similar to the origin IP address).

    * Use the website simply because the geo IP database the company uses hasn't been updated to reflect block reassignments by the regional registry, and thinks that your French residential IP address is a defunct Latvian business IP address.

    * Find the business on a price comparison website.

    * Find the business on a search engine that isn't Google.

    * Access the website without first allowing obfuscated Javascript to execute (example: [1]).

    * Use the website if you had certain disabilities.

    * Access the website with IPv6 or using 464XLAT (shared origin IPv4 address with potentially a large pool of other users).

    The answer to me appears obvious: store and product information is for the most part static and can be stored in static HTML and JSON files that are rewritten on a regular cycle, or at least cached until their next update. The JSON files can be advertised freely to the public, so anyone trying to obtain information from the website could do so with a single request for a small file, causing as little impact as possible.

    Of course, none of the anti-bot measures implemented by websites actually stop the bots they're most seeking to block. There are services specifically designed for scraping websites that have hundreds of thousands of residential IP addresses in their pools around the world (fixed and mobile ISPs), and it's trivial with software such as selenium-stealth to make each request look legitimate from a JavaScript perspective, as if the request was from a Chrome browser running on Windows 10 with a single 2160p screen, etc. If you force bots down the path of working around anti-bot measures, it's a never-ending battle the website will ultimately lose, because the website will end up blocking legitimate customers and will have extremely poor performance that is worsened by forcing bots to make 10x the number of requests just to look legitimate or pass client-side tests.

    [1] https://www.imperva.com/products/web-application-firewall-wa...
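
    For reference, the selenium-stealth mentioned above is typically wired up roughly as follows, using its documented stealth() helper; the target URL and fingerprint values here are illustrative only:

    ```python
    from selenium import webdriver
    from selenium_stealth import stealth

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)

    # Patch the usual automation tells (navigator.webdriver, plugins, WebGL vendor,
    # languages, ...) so the headless session resembles an ordinary desktop Chrome.
    stealth(
        driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    driver.get("https://example.com/store-locator")  # placeholder URL
    print(driver.page_source[:200])
    driver.quit()
    ```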

  • What does Overture Map mean for the future of OpenStreetMap
    1 project | news.ycombinator.com | 24 Dec 2022
    They don't care about sharing it.

    https://www.alltheplaces.xyz/ scrapes location maps and digests them into consistent line-delimited GeoJSON.
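
    "Line-delimited GeoJSON" means one complete GeoJSON Feature per line, so the output can be streamed without parsing a whole FeatureCollection. A minimal reader sketch; the file name is a placeholder for a downloaded AllThePlaces output file:

    ```python
    import json

    # Each line is a standalone GeoJSON Feature; "output.ndjson" is a placeholder name.
    with open("output.ndjson", encoding="utf-8") as fh:
        for line in fh:
            feature = json.loads(line)
            if not feature.get("geometry"):
                continue
            lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order: lon, lat
            print(feature["properties"].get("name"), lat, lon)
    ```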

  • Datasette AllThePlaces
    2 projects | news.ycombinator.com | 10 Nov 2022
    Note that [AllThePlaces](https://www.alltheplaces.xyz/) is not OSM. This website only contains data from [567 different websites](https://alltheplaces-datasette.fly.dev/alltheplaces/count_by...). Still, the fact that this is all served from SQLite makes it an accessible way to get started with large geospatial data. [GitHub repo](https://github.com/eyeseast/alltheplaces-datasette)
  • What is the best website or interface to compile store locations?
    1 project | /r/openstreetmap | 9 Oct 2022
    Check my project All the Places: https://github.com/alltheplaces/alltheplaces

scrapy-playwright

Posts with mentions or reviews of scrapy-playwright. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-07-06.

What are some alternatives?

When comparing alltheplaces and scrapy-playwright you can also consider the following projects:

scrapyrt - HTTP API for Scrapy spiders

scrapy-splash - Scrapy+Splash for JavaScript integration

openstates-scrapers - source for Open States scrapers

scrapy-cloudflare-middleware - A Scrapy middleware to bypass Cloudflare's anti-bot protection

scrapy-yle-kuntavaalit2021 - Fetch YLE kuntavaalit 2021 data

Scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.

scrapy-folder-tree - A scrapy pipeline which stores files using folder trees.

scrapy-rotating-proxies - use multiple proxies with Scrapy

topojson - Encode spatial data as topology in Python! 🌍 https://mattijn.github.io/topojson

scrapy-fake-useragent - Random User-Agent middleware based on fake-useragent

alltheplaces-datasette - AllThePlaces in Datasette

ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...