Data-Mining Wikipedia for Fun and Profit

EasierRDF

3 258 0.0 Python

Making RDF easy enough for most developers

wikibase-cli

1 216 7.9 JavaScript

read and edit a Wikibase instance from the command line

I learned SPARQL recently, and I would agree it's complicated to get info out of Wikidata.
However, having read the article, they didn't have an easy time scraping Wikipedia either.
So I'd probably still recommend that people look into Wikidata and SPARQL if they want to do this kind of thing.
There are a few tools that generate queries for you, and some CLI tools as well:
https://github.com/maxlath/wikibase-cli#readme
Using Wikidata makes Wikipedia better too, in a virtuous cycle: some infoboxes, like the ones he scraped, are being converted to be populated automatically from Wikidata.
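
For readers who have not seen what such a query looks like, here is a minimal sketch in Python that builds a SPARQL query and an HTTP request for the public Wikidata Query Service. The helper names (`build_query`, `build_request`, `run_query`) and the User-Agent string are made up for illustration; the endpoint URL and the `wdt:P31` ("instance of") property are the standard Wikidata ones. This is not the article author's method and not what wikibase-cli does internally.

```python
import json
import re
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(class_qid: str, limit: int = 10) -> str:
    """Return a SPARQL query listing items that are instances of class_qid."""
    if not re.fullmatch(r"Q\d+", class_qid):
        raise ValueError(f"not a Wikidata item id: {class_qid!r}")
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:{class_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


def build_request(query: str, endpoint: str = WDQS_ENDPOINT) -> urllib.request.Request:
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Hypothetical value: replace with something that identifies your script.
        "User-Agent": "example-wikidata-script/0.1",
    }
    return urllib.request.Request(url, headers=headers)


def run_query(query: str, endpoint: str = WDQS_ENDPOINT) -> list:
    """Send the query over the network and return the result bindings."""
    # The 60 second client-side timeout is an arbitrary choice.
    with urllib.request.urlopen(build_request(query, endpoint), timeout=60) as resp:
        return json.load(resp)["results"]["bindings"]
```

Calling `run_query(build_query("Q146", limit=5))` would send the query for instances of item `Q146`. Only `run_query` touches the network, so the query text and the request can be inspected without it.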

relatedhow

2 3 5.7 Python

A website to quickly find out how species are related

I am doubtful. I tried for a long time to use it to get data for my taxonomic graph project (https://relatedhow.kodare.com/), and SPARQL was just not usable at all. The biggest problem was the 60s time limit, which was totally unworkable for what I wanted. I also had issues with seemingly inconsistent results, but it was hard to tell.
I ended up loading the full nightly DB dump and filtering it as a stream straight from the zip instead. That was faster, and it actually worked.
The code to do that is at https://github.com/boxed/relatedhow
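
The general idea of streaming a dump can be sketched roughly as follows. This is not the code from the repository above. It is a minimal illustration that assumes the gzip-compressed Wikidata JSON dump, where the file is one large JSON array with a single entity per line, and that uses property `P171` ("parent taxon") to pull out child/parent pairs. All function names here are hypothetical.

```python
import gzip
import json
from typing import Iterable, Iterator, List, Tuple


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per dump line, skipping the array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")  # each entity line ends with a comma
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def parent_taxa(entity: dict) -> List[str]:
    """Return the item ids listed under P171 (parent taxon), if any."""
    parents = []
    for claim in entity.get("claims", {}).get("P171", []):
        # Snaks of type "novalue" or "somevalue" carry no datavalue.
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
        if "id" in value:
            parents.append(value["id"])
    return parents


def taxon_edges(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (child_id, parent_id) pairs without holding the dump in memory."""
    for entity in iter_entities(lines):
        for parent in parent_taxa(entity):
            yield entity["id"], parent


def taxon_edges_from_dump(path: str) -> Iterator[Tuple[str, str]]:
    """Stream edges straight from a gzip-compressed dump file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        yield from taxon_edges(fh)
```

Because every step is a generator, only one entity is decoded at a time, which is what makes a pass over a very large dump feasible on an ordinary machine. There is also no query time limit to run into, since nothing is sent to a server.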

qlever

3 275 9.3 C++

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.

There's an alternate Wikidata query engine available here: https://qlever.cs.uni-freiburg.de/wikidata (from https://github.com/ad-freiburg/QLever)
Currently it doesn't support some SPARQL features, but I've found it to generally be quite a bit faster for most queries.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
