= wiki2corpus =

 * downloads articles from Wikipedia for a given language ID (URL prefix)
 * works with the Wikipedia API (HTML output), because turning MediaWiki syntax into plain text is not straightforward
 * converts the HTML files into plain text using jusText
 * tokenizes the texts using unitok

== Requirements ==

 * jusText
 * Unitok

== Get wiki2corpus ==

See [wiki:Downloads] for the latest version.

== Licence ==

Unitok is licensed under the [http://choosealicense.com/licenses/mit/ MIT licence].
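
== Example sketch ==

The first steps listed at the top of this page (fetch the rendered HTML of an article through the Wikipedia API, then strip boilerplate with jusText) can be illustrated roughly as follows. This is a minimal sketch, not the wiki2corpus source: the function names `api_url`, `fetch_html` and `html_to_text`, the `User-Agent` string, and the choice of the `action=parse` API call are all assumptions made for illustration, and the real tool may query the API differently.

```python
import json
import urllib.parse
import urllib.request


def api_url(lang, title):
    """Build a MediaWiki API URL returning the rendered HTML of one article.

    `lang` is the language ID used as the URL prefix, e.g. "en" or "cs".
    (Hypothetical helper; not taken from wiki2corpus.)
    """
    query = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    })
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, query)


def fetch_html(lang, title):
    """Download the article HTML (requires network access)."""
    request = urllib.request.Request(
        api_url(lang, title),
        # Assumed User-Agent value, for illustration only.
        headers={"User-Agent": "wiki2corpus-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    # With formatversion=2 the HTML is expected as a plain string here.
    return data["parse"]["text"]


def html_to_text(html, stoplist_name="English"):
    """Convert HTML to plain text, keeping only non-boilerplate paragraphs."""
    import justext  # third-party dependency, imported only when needed

    paragraphs = justext.justext(html, justext.get_stoplist(stoplist_name))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```

With network access and jusText installed, `html_to_text(fetch_html("en", "Corpus linguistics"))` would be expected to return the article body as plain text. Tokenization is left out of the sketch; in wiki2corpus that step is handled by unitok.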