YSK: You can freely and legally download the entire Wikipedia database

This page summarizes the projects mentioned and recommended in the original post on /r/YouShouldKnow

  • scripts

  • # Script https://github.com/apertium/WikiExtractor to extract text from enwiki-20220101-pages-articles-multistream.xml.bz2
    git clone https://github.com/apertium/WikiExtractor.git

    # cd to where I downloaded the torrent
    # https://meta.wikimedia.org/wiki/Data_dump_torrents
    # https://nicdex.com/files/wikipedia/enwiki-20220101-pages-articles-multistream.torrent
    cd /run/media/jjenkx/easystore/DB/enwiki-20220101-pages-articles-multistream/

    # Extract text with the python script
    python3 /home/jjenkx/.local/scripts/WikiExtractor.py --infn enwiki-20220101-pages-articles-multistream.xml.bz2

    # The python script wrote its output file as wiki.txt
    # Split wiki.txt into multiple files
    split --number=128 wiki.txt

    # Compress the files to .xz
    # Adjust the -j value from 8 to your max threads or max RAM; needs approx. 1 GB RAM per core
    find -name '*' -type f -regextype posix-extended -iregex '(^.*\/x\w\w$)' | sort -g | parallel -j 8 --eta --bar --delay "$((RANDOM % 4))" "pixz -9e < {} > {}.wiki.xz"

    # I search with this script
    # It prompts for a ripgrep pcre2 search term
    # https://github.com/JJenkx/Personal/blob/main/searchwikipedia.sh
    # (hedged download and search sketches follow this list)

  • PowerShell

    PowerShell for every system!

  • WikiExtractor

    Extracts and cleans text from a Wikipedia database dump and stores the output in a number of similarly sized files in a given directory. (by apertium)

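The walkthrough above starts from a torrent that has already been downloaded. As a minimal, hypothetical sketch only (not part of the original post), the dump could be fetched with aria2c before running those steps; the choice of client and the --seed-time=0 flag are assumptions, and any BitTorrent client, or a direct download from dumps.wikimedia.org, works just as well.

    # Hypothetical download step (assumption, not from the original post):
    # fetch the torrent linked above and stop seeding as soon as the download completes
    aria2c --seed-time=0 'https://nicdex.com/files/wikipedia/enwiki-20220101-pages-articles-multistream.torrent'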
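The searchwikipedia.sh script linked above is not reproduced on this page. As a rough sketch of that kind of search, assuming ripgrep (rg) built with PCRE2 support and xz available on the PATH so the .wiki.xz chunks can be decompressed on the fly, the prompt-and-search step might look like the following; the variable name is illustrative, not taken from the actual script.

    #!/usr/bin/env bash
    # Illustrative sketch only -- the real script lives at the URL above.
    # Prompt for a PCRE2 pattern, then search every compressed chunk with ripgrep.
    read -r -p "ripgrep pcre2 search term: " pattern
    # -z/--search-zip decompresses xz files on the fly (needs xz in PATH); -P enables PCRE2
    rg -z -P -i -- "$pattern" ./*.wiki.xz

pixz writes standard .xz output, so rg's -z flag can read the chunks directly without a separate decompression step.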

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
