WikiExtractor

Extracts and cleans text from a Wikipedia database dump and stores the output in a number of files of similar size in a given directory. (by apertium)

WikiExtractor Alternatives

Similar projects and alternatives to WikiExtractor based on common topics and language

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number indicates a better WikiExtractor alternative or a more similar project.

WikiExtractor reviews and mentions

Posts with mentions or reviews of WikiExtractor. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2022-08-07.
  • YSK: You can freely and legally download the entire Wikipedia database
    4 projects | /r/YouShouldKnow | 7 Aug 2022
    # Script https://github.com/apertium/WikiExtractor to extract text from enwiki-20220101-pages-articles-multistream.xml.bz2
    git clone https://github.com/apertium/WikiExtractor.git
    # cd to where I downloaded torrent
    # https://meta.wikimedia.org/wiki/Data_dump_torrents
    # https://nicdex.com/files/wikipedia/enwiki-20220101-pages-articles-multistream.torrent
    cd /run/media/jjenkx/easystore/DB/enwiki-20220101-pages-articles-multistream/
    # Extract text with python script
    python3 /home/jjenkx/.local/scripts/WikiExtractor.py --infn enwiki-20220101-pages-articles-multistream.xml.bz2
    # python script wrote output file as wiki.txt
    # Split wiki.txt into multiple files
    split --number=128 wiki.txt
    # Compress the files to .xz
    # Adjust the -j value from 8 to whatever your max threads or max RAM is. Need approx. 1 GB RAM per core
    find -name '*' -type f -regextype posix-extended -iregex '(^.*\/x\w\w$)' | sort -g | parallel -j 8 --eta --bar --delay "$((RANDOM % 4))" "pixz -9e < {} > {}.wiki.xz"
    # I search with this script
    # It prompts for ripgrep pcre2 search term
    # https://github.com/JJenkx/Personal/blob/main/searchwikipedia.sh
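
The post links to its search script but does not show it. As a rough illustration of the last step only, here is a minimal Python sketch that scans the compressed chunks for a regular expression. It is not the linked searchwikipedia.sh (which uses ripgrep); the function name search_chunks and the default glob "*.wiki.xz" are assumptions made for this example, and it assumes the pixz output can be read as ordinary .xz data by Python's lzma module. Because split --number=128 cuts by byte count, a chunk may begin or end in the middle of a line or a UTF-8 character, so undecodable bytes are replaced rather than treated as errors.

```python
import lzma
import re
from pathlib import Path
from typing import Iterator, Tuple


def search_chunks(
    directory: str, pattern: str, glob: str = "*.wiki.xz"
) -> Iterator[Tuple[str, int, str]]:
    """Yield (file name, line number, line) for each line matching pattern.

    Hypothetical helper for illustration; the names and defaults are assumptions.
    """
    regex = re.compile(pattern)
    for path in sorted(Path(directory).glob(glob)):
        # Text mode decompresses the chunk as a stream, so the whole file
        # never has to fit in memory at once.
        with lzma.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path.name, line_number, line.rstrip("\n")
```

Calling search_chunks(".", r"Alan Turing") from the directory that holds the compressed chunks would yield one tuple per matching line. A match that straddles two chunks is missed by this sketch, since each chunk is read independently.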

Stats

Basic WikiExtractor repo stats
Mentions: 1
Stars: 21
Activity: 2.2
Last commit: 8 months ago

The primary programming language of WikiExtractor is Python.

