WikiExtractor
Extracts and cleans text from a Wikipedia database dump and stores the output in a number of similarly sized files in a given directory. (by apertium)
# Script https://github.com/apertium/WikiExtractor to extract text from enwiki-20220101-pages-articles-multistream.xml.bz2
git clone https://github.com/apertium/WikiExtractor.git

# cd to where I downloaded the torrent
# https://meta.wikimedia.org/wiki/Data_dump_torrents
# https://nicdex.com/files/wikipedia/enwiki-20220101-pages-articles-multistream.torrent
cd /run/media/jjenkx/easystore/DB/enwiki-20220101-pages-articles-multistream/

# Extract the text with the Python script; it writes its output to wiki.txt
python3 /home/jjenkx/.local/scripts/WikiExtractor.py --infn enwiki-20220101-pages-articles-multistream.xml.bz2

# Split wiki.txt into multiple files (xaa, xab, ...)
split --number=128 wiki.txt

# Compress the split files to .xz
# Adjust the -j value from 8 to match your thread count or RAM; needs approx. 1 GB of RAM per core
find . -type f -regextype posix-extended -iregex '(^.*\/x\w\w$)' | sort -g | parallel -j 8 --eta --bar --delay "$((RANDOM % 4))" "pixz -9e < {} > {}.wiki.xz"

# I search with this script; it prompts for a ripgrep PCRE2 search term
# https://github.com/JJenkx/Personal/blob/main/searchwikipedia.sh