= wiki2corpus =

 * downloads articles from Wikipedia for a given language id (the URL prefix, e.g. `en`)
 * works with the Wikipedia API (HTML output), because it is not straightforward to turn MediaWiki syntax into plain text
 * HTML files are converted into plain text using jusText
 * texts are tokenized using unitok

== Requirements ==

 * [wiki:Justext jusText]
 * [wiki:Unitok unitok]

== Get wiki2corpus ==

See [wiki:Downloads] for the latest version.

== Licence ==

wiki2corpus is licensed under the [http://choosealicense.com/licenses/mit/ MIT licence].
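
== Example sketch ==

The steps listed at the top of this page (fetch rendered HTML through the API, strip boilerplate with jusText, tokenize) can be illustrated roughly as follows. This is a minimal sketch and not the actual wiki2corpus code: the function names `api_url`, `fetch_html`, `html_to_text` and `simple_tokenize` are hypothetical, the use of the MediaWiki `action=parse` request is an assumption about how the HTML could be obtained, and the regular-expression tokenizer is only a stand-in for unitok, which is considerably more careful.

```python
import json
import re
import urllib.parse
import urllib.request


def api_url(lang, title):
    """Build a MediaWiki API URL that returns the rendered HTML of one article.

    `lang` is the Wikipedia language id, i.e. the URL prefix such as "en".
    """
    query = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",
    })
    return f"https://{lang}.wikipedia.org/w/api.php?{query}"


def fetch_html(lang, title):
    """Download the rendered HTML of an article (requires network access)."""
    request = urllib.request.Request(
        api_url(lang, title),
        # Hypothetical User-Agent string, used here only for illustration.
        headers={"User-Agent": "wiki2corpus-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    text = data["parse"]["text"]
    # Older response formats wrap the HTML in a {"*": ...} object.
    return text if isinstance(text, str) else text["*"]


def html_to_text(html, stoplist_name="English"):
    """Keep only the non-boilerplate paragraphs, using jusText."""
    import justext  # third-party package, imported lazily

    paragraphs = justext.justext(html, justext.get_stoplist(stoplist_name))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)


def simple_tokenize(text):
    """Naive stand-in for unitok: split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)
```

With network access and jusText installed, the pieces would be chained as `simple_tokenize(html_to_text(fetch_html("en", "Corpus linguistics")))`. The jusText stoplist name has to match the language of the downloaded Wikipedia.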