WikiExtractor
Extracts and cleans text from a Wikipedia database dump and stores the output in a number of similarly sized files in a given directory. (by apertium)
# Script https://github.com/apertium/WikiExtractor to extract text from enwiki-20220101-pages-articles-multistream.xml.bz2
git clone https://github.com/apertium/WikiExtractor.git

# cd to where I downloaded the torrent
# https://meta.wikimedia.org/wiki/Data_dump_torrents
# https://nicdex.com/files/wikipedia/enwiki-20220101-pages-articles-multistream.torrent
cd /run/media/jjenkx/easystore/DB/enwiki-20220101-pages-articles-multistream/

# Extract the text with the Python script; it writes its output to wiki.txt
python3 /home/jjenkx/.local/scripts/WikiExtractor.py --infn enwiki-20220101-pages-articles-multistream.xml.bz2

# Split wiki.txt into multiple files (xaa, xab, ...)
split --number=128 wiki.txt

# Compress the split files to .xz
# Adjust the -j value from 8 to match your thread count or RAM; needs approx. 1 GB of RAM per core
find . -type f -regextype posix-extended -iregex '(^.*\/x\w\w$)' | sort -g | parallel -j 8 --eta --bar --delay "$((RANDOM % 4))" "pixz -9e < {} > {}.wiki.xz"

# I search with this script; it prompts for a ripgrep PCRE2 search term
# https://github.com/JJenkx/Personal/blob/main/searchwikipedia.sh