Data-Mining Wikipedia for Fun and Profit

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • EasierRDF

    Making RDF easy enough for most developers

  • wikibase-cli

    read and edit a Wikibase instance from the command line

  • I learned SPARQL recently, and would agree it's complicated to get info out of Wikidata.

    However, having read the article, they didn't have an easy time scraping Wikipedia either.

    So I'd probably still recommend people look into Wikidata and SPARQL if they want to do this kind of thing.

    There are a few tools that generate queries for you, and some CLI tools as well:

    https://github.com/maxlath/wikibase-cli#readme

    It makes Wikipedia better too, in a virtuous cycle: some infoboxes, like the ones he scraped, are being converted to be populated automatically from Wikidata.
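
For readers who have not seen the SPARQL route the comment recommends, here is a minimal sketch of sending one query to the public Wikidata Query Service over HTTP. The example query (parent taxon of the lion, using `wd:Q140` and `wdt:P171`), the function names, and the User-Agent string are illustrative assumptions and do not come from the thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Public endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: P171 is "parent taxon", Q140 is the lion.
PARENT_TAXON_QUERY = """
SELECT ?parent ?parentLabel WHERE {
  wd:Q140 wdt:P171 ?parent .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_request(query, endpoint=WDQS_ENDPOINT):
    """Build a GET request carrying the query as a URL parameter."""
    url = endpoint + "?" + urlencode({"query": query, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder identifier; replace with your own tool name and contact.
        "User-Agent": "wikidata-sketch/0.1 (hypothetical example)",
    }
    return Request(url, headers=headers)


def run_query(query, endpoint=WDQS_ENDPOINT, timeout=60):
    """Send the query and return the decoded JSON response."""
    with urlopen(build_request(query, endpoint), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Performs a real network call; skipped when the module is imported.
    print(json.dumps(run_query(PARENT_TAXON_QUERY), indent=2))
```

Keeping `build_request` separate from `run_query` makes it easy to inspect the generated URL without touching the network.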

  • relatedhow

    A website to quickly find out how species are related

  • I am doubtful. I tried for a long time to use it to get data for my taxonomic graph project (https://relatedhow.kodare.com/), and SPARQL was just not usable at all. The biggest problem was the 60-second time limit, which was totally unworkable for what I wanted. I also had issues with seemingly inconsistent results, but it was hard to tell.

    I ended up loading the full nightly DB dump and filtering it as a stream straight from the zip instead. That was faster, and it actually worked.

    The code to do that is at https://github.com/boxed/relatedhow
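
The streaming approach described in that comment might look roughly like the sketch below. It is not taken from the relatedhow repository. It assumes the gzip-compressed Wikidata JSON dump, where the file is one large JSON array with a single entity per line, and it keeps only "parent taxon" (`P171`) links. The function names are hypothetical.

```python
import gzip
import json


def iter_entities(lines):
    """Yield one entity dict per dump line, skipping the array brackets."""
    for raw in lines:
        line = raw.strip().rstrip(",")  # every entity line ends with a comma
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def parent_taxa(entity):
    """Return the item IDs referenced by the entity's P171 claims."""
    ids = []
    for claim in entity.get("claims", {}).get("P171", []):
        # "somevalue"/"novalue" snaks carry no datavalue, hence the .get chain.
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def taxon_edges(path):
    """Stream (child, parent) pairs without loading the dump into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for entity in iter_entities(fh):
            for parent in parent_taxa(entity):
                yield entity["id"], parent
```

Because each line is parsed and discarded in turn, memory use stays flat regardless of dump size, and there is no query time limit to work around.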

  • qlever

    Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.

  • There's an alternate Wikidata query engine available here: https://qlever.cs.uni-freiburg.de/wikidata (from https://github.com/ad-freiburg/QLever)

    Currently it doesn't support some SPARQL features, but I've found it to generally be quite a bit faster for most queries.
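
Switching engines should mostly be a matter of changing the endpoint URL, since both services speak the SPARQL protocol. The sketch below assumes the QLever API lives at `https://qlever.cs.uni-freiburg.de/api/wikidata` (the comment only links the web UI, so treat this URL as an assumption) and that the response uses the standard SPARQL JSON results layout. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed API endpoint; not stated in the comment above.
QLEVER_ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/wikidata"


def fetch(query, endpoint=QLEVER_ENDPOINT, timeout=60):
    """Send a SPARQL query via GET and return the decoded JSON response."""
    url = endpoint + "?" + urlencode({"query": query})
    req = Request(url, headers={"Accept": "application/sparql-results+json"})
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def rows(result):
    """Flatten SPARQL JSON results into plain dicts, one per result row.

    Variables that are unbound in a given row are reported as None.
    """
    variables = result["head"]["vars"]
    for binding in result["results"]["bindings"]:
        yield {
            var: (binding[var]["value"] if var in binding else None)
            for var in variables
        }
```

Since `rows` depends only on the results format, the same helper can be pointed at either engine to compare answers or timings for a given query.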
