alltheplaces

A set of spiders and scrapers to extract location information from places that post their location on the internet. (by alltheplaces)

alltheplaces reviews and mentions

Posts with mentions or reviews of alltheplaces. We have used some of these posts to build our list of alternatives and similar projects. The most recent was on 2023-12-09.
  • Differentiating between hypermarkets and supermarkets.
    1 project | /r/openstreetmap | 9 Dec 2023
    Maybe a different approach? https://www.alltheplaces.xyz/ has stores grouped by name
  • Meta, Microsoft and Amazon team up on maps project
    1 project | news.ycombinator.com | 26 Jul 2023
    I have an interest in this topic as a contributor to AllThePlaces[1], an open source project collating Scrapy spiders that crawl websites of franchises/retail chains that you'd find listed in name-suggestion-index[2] to retrieve location data. The project is now just shy of collecting 3 million points of interest from almost 1700 spiders.

    Overture Maps appears to be quite a closed and proprietary project, with claims of openness limited to being able to download a data set and accompanying schema specification. Some issues that immediately come to mind:

    1. There is no published description for how the data was generated. End users thus are given no assurance of how accurate and complete the data is.

    a. As an example, administrative boundaries are frightfully complex and include disputed boundaries, significant ambiguity in definition of boundaries, and trade-off between precision of boundaries versus performance of algorithms using administrative boundary data. Which definition of a boundary does Overture Maps adhere to, or can it support multiple definitions?

    b. It's probable that Microsoft have contributed ld+json/microdata geographic data from BingBot crawls of the Internet. This data is notoriously incorrect: fields are mixed up or invalidly repurposed, "CLOSED" appears in field names to mark a place that shut 5 years ago while its web page remains online, and opening hours specifications are highly ambiguous. For AllThePlaces, many of the spiders developed require human consideration, sometimes of considerable complexity, to piece together horribly messy data published by shop and restaurant franchises and other organisations providing location data via their websites.

    c. For location information where +/- 1-5m accuracy and precision may be required (e.g. individual shops within a shopping centre[3]), source data is typically provided by the authoritative sources with 1mm precision and +/- 10-100m accuracy. AllThePlaces, Overture Maps, Google Maps and similar still need human editors (OpenStreetMap editors) to do on-the-ground surveys to pinpoint precise locations and to standardise the definition of a location (e.g. for a point, should it be the centroid of the largest regular polygon which could be placed in the overall irregular polygon, the center of mass of a planar lamina, the location of the main entrance, or some other definition?).

    d. If Overture Maps is dependent on BingBot for place data, they'll miss an enormous number of points of interest that BingBot would never be able to find. For example, an undocumented REST/JSON/GraphQL API call or modification to parameters to an observed store locator API call may be necessary to return all locations and relevant fields of data. Website developers routinely do stupid things with robots.txt such as instruct a bot to crawl 10k pages (1GB+) from a sitemap last updated 5 years ago rather than make 10 fast API calls for up-to-date data (5MB). Overture Maps would be free to consume data from AllThePlaces as it is CC-0 licensed, and possibly correlate it with other data sources such as BingBot crawl data, a government database of licensed commercial premises or postal address geocoding data. However the messiness of data in various sources would be approaching impossible to reconcile, even for humans, and Overture Maps would possibly have to decide whether to err on the side of having duplicates, or lack completeness.

    2. There is no published tooling for how someone else can reproduce the same data.

    a. AllThePlaces users fairly frequently experience the wrath of Cloudflare, Imperva and other Internet-breaking third parties, as well as custom geographic blocking schemes and more rarely, overzealous rate limiting mechanisms. If Overture Maps is dependent on BingBot crawls, they'll have a slight advantage over AllThePlaces due to deliberate whitelisting of BingBot from the likes of Cloudflare, Imperva, customer firewalls, etc. However, no matter whether you're AllThePlaces or Overture Maps or anyone else, if you want to capture as many points of interest as possible across the world, use of residential ISP subnets and anti-bot-detection software is increasingly required. They'll need people in dozens of countries each crawling websites targeted to the same country, using residential ISP address space. Otherwise they end up with an American view of the world, or a European view of the world, or something else that isn't the full picture.

    b. If Overture Maps has locations incorrect for a franchise/brand due to a data cleansing problem or sourcing data from a bad (perhaps non-authoritative) source, there are no software repositories against which the franchise/brand can raise an issue or submit a patch.

    [1] https://www.alltheplaces.xyz/

    [2] http://nsi.guide/

    [3] Example Australian shopping centre as captured by AllThePlaces: https://www.alltheplaces.xyz/map/#18.07/-33.834646/150.98952...
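
The ld+json messiness described above can be made concrete with a short sketch. Everything here is hypothetical (the sample HTML, field handling, and coordinate coercion are invented for illustration), but it shows the kind of defensive parsing a spider ends up doing when coordinates arrive as strings inside inconsistent or invalid ld+json blocks:

```python
# Hypothetical sketch: extracting a point of interest from messy
# schema.org ld+json embedded in a store page. Sample HTML and field
# handling are invented for illustration.
import json
import re


def extract_features(html: str):
    """Pull every application/ld+json block and coerce any entries with
    geo coordinates into GeoJSON Point features."""
    features = []
    for block in re.findall(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    ):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # real-world ld+json is frequently invalid JSON
        items = data if isinstance(data, list) else [data]
        for item in items:
            geo = item.get("geo") or {}
            try:
                # Coordinates are often published as strings, so coerce.
                lon = float(geo["longitude"])
                lat = float(geo["latitude"])
            except (KeyError, TypeError, ValueError):
                continue
            features.append({
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"name": item.get("name")},
            })
    return features


SAMPLE = """
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Example Diner",
 "geo": {"latitude": "-33.8346", "longitude": "150.9895"}}
</script>
"""

print(extract_features(SAMPLE))
```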

  • An aggressive, stealthy web spider operating from Microsoft IP space
    1 project | news.ycombinator.com | 18 Jan 2023
    I recently made some contributions to https://github.com/alltheplaces/alltheplaces which is a set of Scrapy spiders for extracting locations of primarily franchise business stores, for use with projects such as OpenStreetMap. After these contributions I am shocked at how hostile many businesses are about letting people access their websites to find store locations, look up store information or purchase an item.

    Some examples of what I found you can't do at perhaps 20% of business websites:

    * Use the website near a country border where people often commute freely across borders on a daily or regular basis (example: Schengen region).

    * Plan a holiday figuring out when a shop closes for the day in another country.

    * Order something from an Internet connection in another country to arrive when you return home.

    * Look up product information whilst on a work trip overseas.

    * Access the website via a workplace proxy server that is mandated for use and happens to be hosted in a data centre (the likes of AWS and Azure included), blocked simply because it's not a residential ISP. A more complex check has the website verify that the round-trip time of a request issued by JavaScript running on the client matches the latency the server independently measures to the origin IP address (via ICMP ping or similar).

    * Use the website simply because the geo IP database the company uses hasn't been updated to reflect block reassignments by the regional registry, and thinks that your French residential IP address is a defunct Latvian business IP address.

    * Find the business on a price comparison website.

    * Find the business on a search engine that isn't Google.

    * Access the website without first allowing obfuscated Javascript to execute (example: [1]).

    * Use the website if you have certain disabilities.

    * Access the website with IPv6 or using 464XLAT (shared origin IPv4 address with potentially a large pool of other users).

    The answer to me appears obvious: store and product information is for the most part static and can be stored in static HTML and JSON files that are rewritten on a regular cycle, or at least cached until the next update. The JSON files could be advertised freely to the public, so anyone trying to obtain information from the website could do so with a single request for a small file, with as little impact as possible.

    Of course, none of the anti-bot measures implemented by websites actually stop the bots they most want to block. There are services specifically designed for scraping websites that have hundreds of thousands of residential IP addresses in their pools around the world (fixed and mobile ISPs), and it's trivial with software such as selenium-stealth to make each request look legitimate from a JavaScript perspective, as if it came from a Chrome browser running on Windows 10 with a single 2160p screen, etc. If you force bots down the path of working around anti-bot measures, it's a never-ending battle the website will ultimately lose, because the website will end up blocking legitimate customers and suffering extremely poor performance, worsened by forcing bots to make 10x the number of requests just to look legitimate or pass client tests.

    [1] https://www.imperva.com/products/web-application-firewall-wa...
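
The static-file approach suggested above can be sketched in a few lines: a periodic build step that serializes store records into one small JSON document published at a stable URL, so anyone wanting the data needs only a single request. The record fields and payload layout here are invented for illustration:

```python
# Sketch of the static-file idea: a build step that serializes store
# records into one compact JSON document suitable for publishing at a
# stable URL (e.g. a hypothetical /locations.json) and caching until the
# next update. All fields and values are invented.
import json

STORES = [
    {"id": "au-001", "name": "Example Store Parramatta",
     "lat": -33.8151, "lon": 151.0011,
     "opening_hours": "Mo-Su 09:00-17:00"},
    {"id": "au-002", "name": "Example Store Blacktown",
     "lat": -33.7668, "lon": 150.9054,
     "opening_hours": "Mo-Fr 08:30-18:00"},
]


def build_static_locations(stores) -> str:
    """Return a compact JSON document a scraper (or anyone) can fetch
    with one request instead of crawling thousands of pages."""
    return json.dumps(
        {"updated": "2024-01-01", "stores": stores},
        separators=(",", ":"),
    )


payload = build_static_locations(STORES)
print(len(payload), "bytes for", len(STORES), "stores")
```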

  • What does Overture Map mean for the future of OpenStreetMap
    1 project | news.ycombinator.com | 24 Dec 2022
    They don't care about sharing it.

    https://www.alltheplaces.xyz/ scrapes location maps and digests them into consistent line delimited GeoJSON.
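
A minimal sketch of consuming such line-delimited GeoJSON, assuming the usual one-Feature-per-line layout; each line is parsed independently so a single malformed line doesn't abort the whole stream (sample data invented):

```python
# Consume line-delimited GeoJSON: one Feature per line, parsed
# independently, skipping malformed lines. Sample data is invented.
import json


def iter_features(lines):
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            feature = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip bad lines rather than aborting the stream
        if feature.get("type") == "Feature":
            yield feature


NDJSON = """\
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [151.0, -33.8]}, "properties": {"name": "Shop A"}}
{"type": "Feature", "geometry": {"type": "Point", "coordinates": [150.9, -33.7]}, "properties": {"name": "Shop B"}}
"""

names = [f["properties"]["name"] for f in iter_features(NDJSON.splitlines())]
print(names)
```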

  • Datasette AllThePlaces
    2 projects | news.ycombinator.com | 10 Nov 2022
    Note that [AllThePlaces](https://www.alltheplaces.xyz/) is not OSM. This website only contains data from [567 different websites](https://alltheplaces-datasette.fly.dev/alltheplaces/count_by...). Still, the fact that this is all served from SQLite makes it an accessible way to get started with large geospatial data. [GitHub repo](https://github.com/eyeseast/alltheplaces-datasette)
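
To illustrate why serving point data from SQLite is so approachable, here is a small sketch: load a few point rows into an in-memory table and answer a bounding-box query with plain SQL. The table layout and rows are invented and are not the actual Datasette schema:

```python
# Bounding-box query over point data in plain SQLite. Table layout and
# sample rows are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, lon REAL, lat REAL)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?, ?)",
    [
        ("Shop A", 150.9895, -33.8346),
        ("Shop B", 151.2093, -33.8688),
        ("Shop C", 144.9631, -37.8136),  # outside the box below
    ],
)

# All places inside a small bounding box around Sydney.
rows = conn.execute(
    "SELECT name FROM places "
    "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ? ORDER BY name",
    (150.9, 151.3, -33.9, -33.8),
).fetchall()
print([r[0] for r in rows])
```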
  • What is the best website or interface to compile store locations?
    1 project | /r/openstreetmap | 9 Oct 2022
    Check my project All the Places: https://github.com/alltheplaces/alltheplaces

