[Python] How can I clean up Wikipedia's XML backup dump to create dictionaries of commonly used words for multiple languages?

This page summarizes the projects mentioned and recommended in the original post on /r/learnprogramming

  • mwparserfromhell

    A Python parser for MediaWiki wikicode

  • In particular, the page content you're looking at is not XML but wikitext. I found a discussion on Stack Overflow about the same problem of extracting plain text from wikitext. Since you already have the dump, the most promising solution in Python seems to be running each page through mwparserfromhell, as the top Stack Overflow answer suggests.
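
A minimal sketch of that approach, not the code from the Stack Overflow answer. It assumes the dump is an uncompressed MediaWiki XML export whose page bodies sit in `<text>` elements, and that `mwparserfromhell` is installed. The helper names (`iter_page_texts`, `strip_wikitext`, `count_words`) and the word regex are illustrative choices, not part of any library.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Runs of Unicode letters (no digits, no underscores); an assumption about
# what counts as a "word", which will not suit every language.
WORD_RE = re.compile(r"[^\W\d_]+")


def strip_wikitext(wikitext):
    """Turn wikitext into plain text using mwparserfromhell (third-party)."""
    import mwparserfromhell  # imported lazily; install with pip

    return mwparserfromhell.parse(wikitext).strip_code()


def iter_page_texts(dump_file):
    """Yield the raw wikitext of each <text> element in a MediaWiki export."""
    for _event, elem in ET.iterparse(dump_file):
        # Tags carry a versioned XML namespace, so compare the local name only.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text":
            yield elem.text or ""
        elif local_name == "page":
            elem.clear()  # drop the finished page's contents to limit memory use


def count_words(texts, to_plain=strip_wikitext):
    """Count lowercased words across all pages after stripping markup."""
    counts = Counter()
    for wikitext in texts:
        counts.update(w.lower() for w in WORD_RE.findall(to_plain(wikitext)))
    return counts


# Hypothetical usage; "dump.xml" is a placeholder file name:
# with open("dump.xml", "rb") as f:
#     common_words = count_words(iter_page_texts(f)).most_common(10000)
```

Running one such pass per language edition gives one frequency list per language. The stripped text will probably still contain some leftovers, such as category names or image captions, so expect to filter the results further.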

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • Processing Wikipedia Dumps With Python

    1 project | /r/programming | 18 May 2023
  • How can I clean up Wikipedia's XML backup dump to create dictionaries of commonly used words for multiple languages?

    2 projects | /r/learnpython | 10 Oct 2021
  • I spent the 2 weeks building a complex data parsing program for a data project and today I found out that such a library already exists.

    1 project | /r/learnprogramming | 14 May 2022
  • [UPDATE] Here's the transcript of the 1781 most-used German Nouns according to a 4.2 million word corpus research performed by Routledge

    1 project | /r/German | 9 Jul 2021
  • The Future of MySQL is PostgreSQL: an extension for the MySQL wire protocol

    1 project | news.ycombinator.com | 26 Apr 2024