Meta, Microsoft and Amazon team up on maps project

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • alltheplaces

    A set of spiders and scrapers to extract location information from places that post their location on the internet.

  • I have an interest in this topic as a contributor to AllThePlaces[1], an open source project collating Scrapy spiders that crawl the websites of franchises and retail chains (the kind you'd find listed in name-suggestion-index[2]) to retrieve location data. The project now collects just shy of 3 million points of interest from almost 1700 spiders.
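
    To illustrate how the project gathers this data, here is a minimal sketch of an AllThePlaces-style Scrapy spider. The brand, URL and JSON field names are invented for illustration; real spiders live in the AllThePlaces repository and are often considerably more involved.

        import scrapy

        class ExampleBrandSpider(scrapy.Spider):
            name = "example_brand"
            start_urls = ["https://www.example-brand.test/api/stores"]

            def parse(self, response):
                # Assume the store locator endpoint returns a JSON array of stores.
                for store in response.json():
                    yield {
                        "ref": store.get("id"),
                        "name": store.get("name"),
                        "lat": store.get("latitude"),
                        "lon": store.get("longitude"),
                        "addr_full": store.get("address"),
                    }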

    Overture Maps appears to be quite a closed and proprietary project, with its claims of openness limited to the ability to download a data set and an accompanying schema specification. Some issues that immediately come to mind:

    1. There is no published description of how the data was generated. End users are thus given no assurance of how accurate or complete the data is.

    a. As an example, administrative boundaries are frightfully complex: there are disputed boundaries, significant ambiguity in how boundaries are defined, and a trade-off between the precision of boundary geometry and the performance of algorithms that consume it. Which definition of a boundary does Overture Maps adhere to, or can it support multiple definitions?

    b. It's probable that Microsoft has contributed ld+json/microdata geographic data from BingBot crawls of the Internet. This data is notoriously incorrect: fields are mixed up or invalidly repurposed, "CLOSED" is embedded in name fields to denote a place that shut 5 years ago while its web page remains online, and opening hours specifications are riddled with ambiguity. For AllThePlaces, many of the spiders require human judgement, sometimes of considerable complexity, to piece together the horribly messy data published by shop and restaurant franchises and other organisations that provide location data via their websites.
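
    As a rough illustration of the cleanup involved, the sketch below sanity-checks a schema.org JSON-LD record of the sort a crawler might lift from a page. The record is invented, but the defects shown (swapped coordinates, "CLOSED" embedded in the name) are typical of what spiders have to handle.

        import json

        # Invented JSON-LD record; real-world data frequently looks like this.
        raw = """
        {
          "@type": "LocalBusiness",
          "name": "Example Cafe (CLOSED)",
          "geo": {"latitude": "151.2093", "longitude": "-33.8688"},
          "openingHours": "Mo-Su"
        }
        """

        record = json.loads(raw)
        lat = float(record["geo"]["latitude"])
        lon = float(record["geo"]["longitude"])

        # A latitude outside [-90, 90] usually means the coordinates were swapped.
        if not -90 <= lat <= 90:
            lat, lon = lon, lat

        # "CLOSED" baked into the name often marks a place that shut years ago
        # while its page stayed online; such records need human review.
        suspect_closed = "CLOSED" in record["name"].upper()
        print(lat, lon, suspect_closed)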

    c. For location information where +/- 1-5m accuracy and precision may be required (e.g. individual shops within a shopping centre[3]), source data is typically provided by the authoritative sources with 1mm precision but only +/- 10-100m accuracy. AllThePlaces, Overture Maps, Google Maps and similar still need human editors (OpenStreetMap editors) to do on-the-ground surveys to pinpoint precise locations and to standardise the definition of a location (e.g. for a point, should it be the centroid of the largest regular polygon that fits within the overall irregular polygon, the centre of mass of the planar lamina, the location of the main entrance, or some other definition?).
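
    The sketch below, using the shapely library and an invented L-shaped shop footprint, shows how much the published point can move depending on which definition is chosen; the centroid of an L-shaped footprint can fall outside the shop entirely.

        from shapely.geometry import Polygon

        # Invented L-shaped shop footprint (units are arbitrary).
        footprint = Polygon([(0, 0), (10, 0), (10, 2), (2, 2), (2, 10), (0, 10)])

        # Centre of mass of the planar lamina; for this L-shape it lies
        # outside the footprint altogether.
        print(footprint.centroid)

        # A point guaranteed to lie inside the polygon, but not a
        # standardised "centre" in any cartographic sense.
        print(footprint.representative_point())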

    d. If Overture Maps is dependent on BingBot for place data, they'll miss an enormous number of points of interest that BingBot would never be able to find. For example, an undocumented REST/JSON/GraphQL API call, or a modification to the parameters of an observed store locator API call, may be necessary to return all locations and all relevant fields of data. Website developers routinely do stupid things with robots.txt, such as instructing a bot to crawl 10k pages (1GB+) from a sitemap last updated 5 years ago rather than making 10 fast API calls for up-to-date data (5MB). Overture Maps would be free to consume data from AllThePlaces, as it is CC-0 licensed, and possibly correlate it with other data sources such as BingBot crawl data, a government database of licensed commercial premises, or postal address geocoding data. However, the messiness of the data across these sources would be nearly impossible to reconcile, even for humans, and Overture Maps would likely have to decide whether to err on the side of duplicates or on the side of incompleteness.
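
    As an example of the kind of direct API call a generic crawler would never attempt, the sketch below queries a hypothetical store locator endpoint with an oversized radius and limit so that every location comes back in one response. The endpoint, parameters and response shape are all assumptions for illustration.

        import requests

        # Hypothetical store locator endpoint and parameters.
        resp = requests.get(
            "https://www.example-brand.test/api/store-locator",
            params={"lat": 0, "lon": 0, "radius_km": 50000, "limit": 10000},
            timeout=30,
        )
        for store in resp.json()["stores"]:
            print(store["id"], store["latitude"], store["longitude"])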

    2. There is no published tooling with which someone else could reproduce the same data.

    a. AllThePlaces users fairly frequently experience the wrath of Cloudflare, Imperva and other Internet-breaking third parties, as well as custom geographic blocking schemes and, more rarely, overzealous rate limiting mechanisms. If Overture Maps is dependent on BingBot crawls, they'll have a slight advantage over AllThePlaces due to the deliberate whitelisting of BingBot by the likes of Cloudflare, Imperva, customers' firewalls, etc. However, whether you're AllThePlaces, Overture Maps or anyone else, if you want to capture as many points of interest as possible across the world, the use of residential ISP subnets and anti-bot-detection software is increasingly required. They'd need people in dozens of countries, each crawling websites targeted at their own country from residential ISP address space. Otherwise they end up with an American view of the world, or a European view of the world, or something else that isn't the full picture.
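
    As a rough sketch of what that looks like in practice, the Scrapy spider below keeps a polite crawl rate and routes its requests through a hypothetical residential proxy in the target country. None of this reflects actual AllThePlaces or Overture Maps infrastructure.

        import scrapy

        class CountryAwareSpider(scrapy.Spider):
            name = "example_country_aware"
            custom_settings = {
                "USER_AGENT": "Mozilla/5.0 (compatible; ExampleCrawler/1.0)",
                "DOWNLOAD_DELAY": 2,   # polite rate; bursts tend to trip WAF rules
                "ROBOTSTXT_OBEY": True,
            }

            def start_requests(self):
                # Hypothetical residential proxy endpoint in the target country.
                yield scrapy.Request(
                    "https://www.example-brand.test/stores",
                    meta={"proxy": "http://au-residential-proxy.example:8080"},
                )

            def parse(self, response):
                self.logger.info("Fetched %d bytes", len(response.body))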

    b. If Overture Maps has incorrect locations for a franchise/brand due to a data cleansing problem or to data sourced from a bad (perhaps non-authoritative) source, there are no software repositories against which the franchise/brand could raise an issue or submit a patch.

    [1] https://www.alltheplaces.xyz/

    [2] http://nsi.guide/

    [3] Example Australian shopping centre as captured by AllThePlaces: https://www.alltheplaces.xyz/map/#18.07/-33.834646/150.98952...
