alltheplaces VS scrapy-playwright

Compare alltheplaces vs scrapy-playwright and see what their differences are.

                 alltheplaces                              scrapy-playwright
Mentions         6                                         11
Stars            559                                       837
Growth           7.2%                                      3.1%
Activity         10.0                                      7.8
Latest commit    5 days ago                                3 months ago
Language         Python                                    Python
License          GNU General Public License v3.0 or later  BSD 3-clause "New" or "Revised" License
The number of mentions indicates the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.

alltheplaces

Posts with mentions or reviews of alltheplaces. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-12-09.
  • Differentiating between hypermarkets and supermarkets.
    1 project | /r/openstreetmap | 9 Dec 2023
    Maybe a different approach? https://www.alltheplaces.xyz/ has stores grouped by name
  • Meta, Microsoft and Amazon team up on maps project
    1 project | news.ycombinator.com | 26 Jul 2023
    I have an interest in this topic as a contributor to AllThePlaces[1], an open source project collating Scrapy spiders that crawl websites of franchises/retail chains that you'd find listed in name-suggestion-index[2] to retrieve location data. The project is now just shy of collecting 3 million points of interest from almost 1700 spiders.

    Overture Maps appears to be quite a closed and proprietary project, with claims of openness limited to being able to download a data set and accompanying schema specification. Some issues that immediately come to mind:

    1. There is no published description for how the data was generated. End users thus are given no assurance of how accurate and complete the data is.

    a. As an example, administrative boundaries are frightfully complex and include disputed boundaries, significant ambiguity in definition of boundaries, and trade-off between precision of boundaries versus performance of algorithms using administrative boundary data. Which definition of a boundary does Overture Maps adhere to, or can it support multiple definitions?

    b. It's probable that Microsoft have contributed ld+json/microdata geographic data from BingBot crawls of the Internet. This data is notoriously incorrect: fields are mixed up or invalidly repurposed, "CLOSED" appears in field names to denote a place that closed 5 years ago even though the web page remains online, and opening-hours specifications are highly ambiguous. For AllThePlaces, many of the spiders developed require human consideration, sometimes of considerable complexity, to piece together the horribly messy data published by shop and restaurant franchises and other organisations providing location data via their websites.

    c. For location information where +/- 1-5m accuracy and precision may be required (e.g. individual shops within a shopping centre[3]), source data is typically provided by the authoritative sources with 1mm precision and +/- 10-100m accuracy. AllThePlaces, Overture Maps, Google Maps and similar still need human editors (OpenStreetMap editors) to do on-the-ground surveys to pinpoint precise locations and to standardise the definition of a location (e.g. for a point, should it be the centroid of the largest regular polygon which could be placed in the overall irregular polygon, the center of mass of a planar lamina, the location of the main entrance, or some other definition?).

    d. If Overture Maps is dependent on BingBot for place data, they'll miss an enormous number of points of interest that BingBot would never be able to find. For example, an undocumented REST/JSON/GraphQL API call, or a modification of parameters to an observed store locator API call, may be necessary to return all locations and relevant fields of data. Website developers routinely do stupid things with robots.txt, such as instructing a bot to crawl 10k pages (1GB+) from a sitemap last updated 5 years ago rather than make 10 fast API calls for up-to-date data (5MB). Overture Maps would be free to consume data from AllThePlaces as it is CC-0 licensed, and possibly correlate it with other data sources such as BingBot crawl data, a government database of licensed commercial premises, or postal address geocoding data. However, the messiness of data in the various sources would be near impossible to reconcile, even for humans, and Overture Maps would likely have to decide whether to err on the side of having duplicates or on the side of lacking completeness.

    2. There is no published tooling for how someone else can reproduce the same data.

    a. AllThePlaces users fairly frequently experience the wrath of Cloudflare, Imperva and other Internet-breaking third parties, as well as custom geographic blocking schemes and, more rarely, overzealous rate limiting mechanisms. If Overture Maps is dependent on BingBot crawls, they'll have a slight advantage over AllThePlaces due to deliberate whitelisting of BingBot by the likes of Cloudflare, Imperva, customer firewalls, etc. However, no matter whether you're AllThePlaces or Overture Maps or anyone else, if you want to capture as many points of interest as possible across the world, use of residential ISP subnets and anti-bot-detection software is increasingly required. They'd need people in dozens of countries, each crawling the websites targeted at their own country using residential ISP address space. Otherwise they end up with an American view of the world, or a European view of the world, or something else that isn't the full picture.

    b. If Overture Maps has locations incorrect for a franchise/brand due to a data cleansing problem or to sourcing data from a bad (perhaps non-authoritative) source, there are no software repositories against which the franchise/brand can raise an issue or submit a patch.

    [1] https://www.alltheplaces.xyz/

    [2] http://nsi.guide/

    [3] Example Australian shopping centre as captured by AllThePlaces: https://www.alltheplaces.xyz/map/#18.07/-33.834646/150.98952...
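
    To illustrate the kind of spider the comment above describes, here is a minimal sketch (not an actual AllThePlaces spider): it assumes a hypothetical franchise site whose store pages embed schema.org LocalBusiness ld+json, and the sitemap URL and field names are purely illustrative.

    ```python
    import json

    import scrapy


    class ExampleStoreSpider(scrapy.Spider):
        """Sketch of an AllThePlaces-style spider: walk a (hypothetical) store
        sitemap and turn each page's LocalBusiness ld+json into a POI item."""

        name = "example_store"
        # Hypothetical starting point; real spiders target a brand's store locator.
        start_urls = ["https://example.com/stores/sitemap.xml"]

        def parse(self, response):
            # Follow every store page listed in the sitemap.
            for url in response.xpath("//*[local-name()='loc']/text()").getall():
                yield scrapy.Request(url, callback=self.parse_store)

        def parse_store(self, response):
            # ld+json is often the cleanest source but, as noted above, it is
            # frequently mislabelled or stale, so real spiders sanity-check it.
            for blob in response.xpath('//script[@type="application/ld+json"]/text()').getall():
                try:
                    data = json.loads(blob)
                except json.JSONDecodeError:
                    continue
                if not isinstance(data, dict) or data.get("@type") != "LocalBusiness":
                    continue
                geo = data.get("geo") or {}
                addr = data.get("address")
                yield {
                    "ref": response.url,
                    "name": data.get("name"),
                    "lat": geo.get("latitude"),
                    "lon": geo.get("longitude"),
                    "addr_full": addr.get("streetAddress") if isinstance(addr, dict) else addr,
                    "opening_hours": data.get("openingHours"),
                }
    ```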
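
    The same comment notes that some locations can only be retrieved through an undocumented store-locator API rather than crawlable pages. A sketch of that pattern, with a purely hypothetical endpoint and parameter names:

    ```python
    import requests

    # Hypothetical, undocumented store-locator endpoint observed in browser dev tools.
    # Widening the radius/page-size parameters (where the site permits it) often returns
    # every location in a handful of requests instead of thousands of page crawls.
    API = "https://example.com/api/store-locator"

    def fetch_all_stores():
        stores, page = [], 1
        while True:
            resp = requests.get(
                API,
                params={"radius": 50000, "per_page": 500, "page": page},
                timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json().get("stores", [])
            if not batch:
                break
            stores.extend(batch)
            page += 1
        return stores
    ```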

  • An aggressive, stealthy web spider operating from Microsoft IP space
    1 project | news.ycombinator.com | 18 Jan 2023
    I recently made some contributions to https://github.com/alltheplaces/alltheplaces, which is a set of Scrapy spiders for extracting locations of (primarily) franchise business stores, for use with projects such as OpenStreetMap. After these contributions I am now in shock at how hostile a lot of businesses are about allowing people the privilege of accessing their websites to find store locations or store information, or to purchase an item.

    Some examples of what I found you can't do at perhaps 20% of business websites:

    * Use the website near a country border where people often commute freely across borders on a daily or regular basis (example: Schengen region).

    * Plan a holiday figuring out when a shop closes for the day in another country.

    * Order something from an Internet connection in another country to arrive when you return home.

    * Look up product information whilst on a work trip overseas.

    * Access the website via a workplace proxy server that is mandated to be used and just so happens to be hosted in a data centre (the likes of AWS and Azure included), which is blocked simply because it's not a residential ISP. A more complex check would be the website verifying that the time taken for a request issued by JavaScript code executing on the client to reach the server matches the time the server thinks it should have taken (estimated by ICMP ping or similar to the origin IP address).

    * Use the website simply because the geo IP database the company uses hasn't been updated to reflect block reassignments by the regional registry, and thinks that your French residential IP address is a defunct Latvian business IP address.

    * Find the business on a price comparison website.

    * Find the business on a search engine that isn't Google.

    * Access the website without first allowing obfuscated Javascript to execute (example: [1]).

    * Use the website if you had certain disabilities.

    * Access the website with IPv6 or using 464XLAT (shared origin IPv4 address with potentially a large pool of other users).

    The answer to me appears obvious: store and product information is for the most part static and can be stored in static HTML and JSON files that are rewritten on a regular cycle, or at least cached until their next update. The JSON files can be advertised freely to the public, so anyone trying to obtain information from the website could do so with a single request for a small file, causing as little impact as possible.

    Of course, none of the anti-bot measures implemented by websites actually stop the bots they're most seeking to block. There are services specifically designed for scraping websites that have hundreds of thousands of residential IP addresses in their pools around the world (fixed and mobile ISPs), and it's trivial with software such as selenium-stealth to make each request look legitimate from a JavaScript perspective, as if the request was from a Chrome browser running on Windows 10 with a single 2160p screen, etc. If you force bots down the path of working around anti-bot measures, it's a never-ending battle the website will ultimately lose, because the website will end up blocking legitimate customers and will have extremely poor performance that is worsened by forcing bots to make 10x the number of requests just to look legitimate or pass client-side tests.

    [1] https://www.imperva.com/products/web-application-firewall-wa...
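
    For reference, the selenium-stealth mentioned above is typically wired up roughly as follows, using its documented stealth() helper; the target URL and fingerprint values here are illustrative only:

    ```python
    from selenium import webdriver
    from selenium_stealth import stealth

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)

    # Patch the usual automation tells (navigator.webdriver, plugins, WebGL vendor,
    # languages, ...) so the headless session resembles an ordinary desktop Chrome.
    stealth(
        driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    driver.get("https://example.com/store-locator")  # placeholder URL
    print(driver.page_source[:200])
    driver.quit()
    ```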

  • What does Overture Map mean for the future of OpenStreetMap
    1 project | news.ycombinator.com | 24 Dec 2022
    They don't care about sharing it.

    https://www.alltheplaces.xyz/ scrapes location maps and digests them into consistent line-delimited GeoJSON.
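
    "Line-delimited GeoJSON" means one complete GeoJSON Feature per line, so the output can be streamed without parsing a whole FeatureCollection. A minimal reader sketch; the file name is a placeholder for a downloaded AllThePlaces output file:

    ```python
    import json

    # Each line is a standalone GeoJSON Feature; "output.ndjson" is a placeholder name.
    with open("output.ndjson", encoding="utf-8") as fh:
        for line in fh:
            feature = json.loads(line)
            if not feature.get("geometry"):
                continue
            lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order: lon, lat
            print(feature["properties"].get("name"), lat, lon)
    ```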

  • Datasette AllThePlaces
    2 projects | news.ycombinator.com | 10 Nov 2022
    Note that [AllThePlaces](https://www.alltheplaces.xyz/) is not OSM. This website only contains data from [567 different websites](https://alltheplaces-datasette.fly.dev/alltheplaces/count_by...). Still, the fact that this is all served from SQLite makes it an accessible way to get started with large geospatial data. [GitHub repo](https://github.com/eyeseast/alltheplaces-datasette)
  • What is the best website or interface to compile store locations?
    1 project | /r/openstreetmap | 9 Oct 2022
    Check my project All the Places: https://github.com/alltheplaces/alltheplaces

scrapy-playwright

Posts with mentions or reviews of scrapy-playwright. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-07-06.

What are some alternatives?

When comparing alltheplaces and scrapy-playwright you can also consider the following projects:

scrapyrt - HTTP API for Scrapy spiders

scrapy-splash - Scrapy+Splash for JavaScript integration

openstates-scrapers - source for Open States scrapers

scrapy-cloudflare-middleware - A Scrapy middleware to bypass Cloudflare's anti-bot protection

scrapy-yle-kuntavaalit2021 - Fetch YLE kuntavaalit 2021 data

Scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.

scrapy-folder-tree - A scrapy pipeline which stores files using folder trees.

scrapy-rotating-proxies - use multiple proxies with Scrapy

topojson - Encode spatial data as topology in Python! 🌍 https://mattijn.github.io/topojson

scrapy-fake-useragent - Random User-Agent middleware based on fake-useragent

alltheplaces-datasette - AllThePlaces in Datasette

ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...