= wiki2corpus =

 * downloads articles from Wikipedia for a given language id (URL prefix)
 * works with the Wikipedia API (HTML output), as it is not straightforward to turn MediaWiki syntax into plain text
 * HTML files are converted into plain text using jusText; some paragraphs, tables and lists (those containing text of little use) are removed
 * title, categories, translations, number of paragraphs and number of characters are stored as attributes of the `<doc>` structure

== Requirements ==

 * Python >= 3.6
 * [wiki:Justext jusText]

== Get wiki2corpus ==

{{{
wget https://corpus.tools/raw-attachment/wiki/Downloads/wiki2corpus-2.0.py
}}}

== Usage ==

{{{
usage: wiki2corpus.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
                      [--links LINKS] [--nicetitles]
                      langcode wordlist

Wikipedia downloader

positional arguments:
  langcode       Wikipedia language prefix
  wordlist       Path to a list of ~2000 most frequent words in the language
                 (UTF-8, one per line)

optional arguments:
  -h, --help     show this help message and exit
  --cache CACHE  Directory with previously downloaded pages and data
  --wait WAIT    Time interval between GET requests
  --newest       Download the newest versions of articles
  --links LINKS  Gather external links from Wikipedia
  --nicetitles   Download only titles starting with alphabetical character
}}}

== Example ==

Let us say you want to download the Belarusian Wikipedia. The ISO 639-1 language code is "be". You also need a list of Belarusian words. If you have jusText installed, one may be found e.g. at `/usr/lib/python2.7/site-packages/justext/stoplists/Belarusian.txt` (the exact path depends on your Python version and installation).

{{{
python wiki2corpus.py be Belarusian.txt > bewiki.prevert
}}}

The `bewiki.prevert` file can be fed into a pipeline for further processing of the data. It is in a simple XML-like format with the element `<doc>`. For example, you can
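tokenize the prevertical with this script (`wikipipe.sh`):

{{{
#!/bin/bash
# Usage: bash wikipipe.sh LANGCODE WORDLIST
# LANGCODE is used instead of LANG so the locale variable is not overwritten.
LANGCODE=$1
LIST=$2

# Reuse the cache directory from a previous run if it exists.
if [ -d "${LANGCODE}wiki.cache" ]; then
    CACHE="--cache ${LANGCODE}wiki.cache"
else
    CACHE=""
fi

# $CACHE is left unquoted on purpose so that it expands to two words (or none).
python wiki2corpus.py "$LANGCODE" "$LIST" $CACHE --links "${LANGCODE}wiki.links" |\
unitok --trim 200 /usr/lib/python2.7/site-packages/unitok/configs/other.py |\
onion -m -s -n 7 -t 0.7 -d doc 2> /dev/null |\
xz - > "${LANGCODE}wiki.vert.xz"
}}}

It can be invoked with the command `bash wikipipe.sh be Belarusian.txt`. Both tools used by the script are also available: [wiki:Unitok unitok], [wiki:Onion onion].

== Licence ==

wiki2corpus is licensed under the [http://choosealicense.com/licenses/mit/ MIT licence].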
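
== Reading the prevertical in Python ==

Since the prevertical stores article metadata as attributes of the document element, a short script can list them without a full XML parser. The following is only a minimal sketch, not part of wiki2corpus. It assumes that every document starts on its own line with an opening `doc` tag and that attribute values are written in double quotes. The function name `iter_doc_attributes` and the attribute names in the sample are hypothetical; check the actual output of your wiki2corpus version before relying on them.

```python
import re
from typing import Dict, Iterable, Iterator

# Assumption: a document starts with a line such as <doc title="..." ...>
DOC_OPEN = re.compile(r'^<doc(\s[^>]*)?>\s*$')
# Assumption: attributes are written as name="value" with double quotes
ATTR = re.compile(r'([\w:-]+)="([^"]*)"')


def iter_doc_attributes(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield the attributes of every opening doc tag as a dictionary."""
    for line in lines:
        match = DOC_OPEN.match(line)
        if match:
            # group(1) is None when the tag has no attributes at all
            yield dict(ATTR.findall(match.group(1) or ""))


if __name__ == "__main__":
    # Hypothetical sample; the attribute names are illustrative only
    sample = [
        '<doc title="Мінск" paragraphs="1">',
        '<p>',
        'Першы абзац.',
        '</p>',
        '</doc>',
    ]
    for attrs in iter_doc_attributes(sample):
        print(attrs)
```

To process a real file, pass an open file object instead of the sample list, e.g. `iter_doc_attributes(open("bewiki.prevert", encoding="utf-8"))`. All values are returned as strings, so numeric attributes need an explicit conversion.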