= wiki2corpus =

 * downloads articles from Wikipedia for a given language id (URL prefix)
 * works with the Wikipedia API (HTML output), since it is not straightforward to turn MediaWiki syntax into plain text
 * HTML files are converted into plain text using jusText (see the sketch at the end of this page)
 * texts are tokenized using unitok
 * title, categories, translations, number of paragraphs and number of characters are stored as attributes of the document structure in the output

== Requirements ==

 * [wiki:Justext jusText]
 * [wiki:Unitok unitok]

== Get wiki2corpus ==

See [wiki:Downloads] for the latest version.

== Usage ==

{{{
usage: wikidownloader.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
                         [--links LINKS]
                         langcode

Wikipedia downloader

positional arguments:
  langcode       Wikipedia language prefix

optional arguments:
  -h, --help     show this help message and exit
  --cache CACHE  Directory with previously downloaded pages and data
  --wait WAIT    Time interval between GET requests
  --newest       Download the newest versions of articles (do not use cache)
  --links LINKS  Gather external links from Wikipedia (Reference section)
}}}

== Example ==

Let us say you want to download be.wikipedia.org. You can use this command:

{{{
python wikidownloader.py be --wait 7 --links bewiki.links > bewiki.prevert
}}}

The `.prevert` file can then be fed into a pipeline for further processing of the data (see the reading sketch at the end of this page).

== Licence ==

wiki2corpus is licensed under the [http://choosealicense.com/licenses/mit/ MIT licence].
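== Sketch: downloading and cleaning one article ==

The following is a minimal sketch of what the download-and-clean step described above does: fetch the rendered HTML of one article through the MediaWiki API and strip boilerplate with jusText. It is an illustration, not the tool's actual code; the function name, parameters and the chosen article are made up for the example.

{{{
#!python
# Illustration only -- wiki2corpus's own request and cleanup code may differ.
import requests
import justext

def fetch_clean_text(langcode, title, stoplist_language="English"):
    """Return the boilerplate-free plain text of one Wikipedia article."""
    # The MediaWiki "parse" API returns the rendered HTML of a page.
    api = "https://%s.wikipedia.org/w/api.php" % langcode
    params = {"action": "parse", "page": title, "prop": "text", "format": "json"}
    html = requests.get(api, params=params).json()["parse"]["text"]["*"]
    # jusText classifies paragraphs; boilerplate (navigation, footers) is dropped.
    paragraphs = justext.justext(html.encode("utf-8"),
                                 justext.get_stoplist(stoplist_language))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

if __name__ == "__main__":
    print(fetch_clean_text("en", "Corpus linguistics"))
}}}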
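== Sketch: reading the output ==

The sketch below shows one way the `.prevert` output could be consumed downstream. It assumes the usual prevertical convention of documents wrapped in `<doc ...>` ... `</doc>` lines carrying the attributes listed above (title, categories, ...); check this against the actual output before relying on it.

{{{
#!python
# Illustration only -- the document structure name and attributes are assumptions.
import re

DOC_OPEN = re.compile(r'<doc\s+(.*?)>')
ATTR = re.compile(r'(\w+)="(.*?)"')

def iter_docs(lines):
    """Yield (attributes, text) pairs for each document in a prevert stream."""
    attrs, buf = None, []
    for line in lines:
        m = DOC_OPEN.match(line)
        if m:
            # Opening line: collect the attribute-value pairs.
            attrs, buf = dict(ATTR.findall(m.group(1))), []
        elif line.startswith("</doc>"):
            yield attrs, "".join(buf)
            attrs, buf = None, []
        elif attrs is not None:
            buf.append(line)

if __name__ == "__main__":
    with open("bewiki.prevert", encoding="utf-8") as fh:
        for attrs, text in iter_docs(fh):
            print(attrs.get("title"), len(text))
}}}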