WikiExtractor Alternatives
Similar projects and alternatives to WikiExtractor based on common topics and language
-
packaging
Debian, Fedora, Windows, macOS packaging scripts for Apertium, HFST, CG-3, and related techs. (by apertium)
NOTE:
The number of mentions on this list counts mentions in common posts plus user-suggested alternatives.
Hence, a higher number means a better WikiExtractor alternative or higher similarity.
WikiExtractor reviews and mentions
Posts with mentions or reviews of WikiExtractor.
We have used some of these posts to build our list of alternatives and similar projects. The most recent post was on 2022-08-07.
-
YSK: You can freely and legally download the entire Wikipedia database
# Script https://github.com/apertium/WikiExtractor to extract text from enwiki-20220101-pages-articles-multistream.xml.bz2
git clone https://github.com/apertium/WikiExtractor.git

# cd to where I downloaded the torrent
# https://meta.wikimedia.org/wiki/Data_dump_torrents
# https://nicdex.com/files/wikipedia/enwiki-20220101-pages-articles-multistream.torrent
cd /run/media/jjenkx/easystore/DB/enwiki-20220101-pages-articles-multistream/

# Extract text with the Python script
python3 /home/jjenkx/.local/scripts/WikiExtractor.py --infn enwiki-20220101-pages-articles-multistream.xml.bz2

# The Python script wrote its output file as wiki.txt
# Split wiki.txt into multiple files
split --number=128 wiki.txt

# Compress the files to .xz
# Adjust the -j value from 8 to match your max threads or max RAM. Needs approx. 1 GB of RAM per core
find -name '*' -type f -regextype posix-extended -iregex '(^.*\/x\w\w$)' | sort -g | parallel -j 8 --eta --bar --delay "$((RANDOM % 4))" "pixz -9e < {} > {}.wiki.xz"

# I search with this script
# It prompts for a ripgrep PCRE2 search term
# https://github.com/JJenkx/Personal/blob/main/searchwikipedia.sh
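One detail of the split step may be worth a closer look: with GNU split, `--number=128` divides the input by byte count, so a line can end up cut across two chunk files. If line-oriented searching of the chunks matters, the `l/N` form splits only at line boundaries. The following is a minimal sketch of that variant on a tiny stand-in file, assuming GNU coreutils; `sample.txt`, the `chunk_` prefix, and the chunk count of 4 are hypothetical names chosen for illustration, not part of the original post.

```shell
#!/bin/sh
# Sketch: split a text file into chunks without cutting any line in half.
# Assumes GNU coreutils split. sample.txt and chunk_ are hypothetical names.
set -eu

workdir=$(mktemp -d)

# Build a small stand-in for wiki.txt: 10 lines of text.
seq 1 10 | sed 's/^/article line /' > "$workdir/sample.txt"

# l/4 means "4 chunks, split only at line boundaries".
split --number=l/4 "$workdir/sample.txt" "$workdir/chunk_"

# Show how many complete lines landed in each chunk.
wc -l "$workdir"/chunk_*

# Rejoining the chunks in name order reproduces the input exactly.
cat "$workdir"/chunk_* | cmp - "$workdir/sample.txt" && echo "chunks rejoin cleanly"
```

Applied to the original pipeline, this would presumably mean `split --number=l/128 wiki.txt`; the chunk sizes become slightly uneven, but every chunk starts and ends on a full line.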
Stats
Basic WikiExtractor repo stats
1
21
2.2
8 months ago
The primary programming language of WikiExtractor is Python.