= wiki2corpus =

 * downloads articles from Wikipedia for a given language id (URL prefix)
 * works with the Wikipedia API (HTML output), since it is not straightforward to turn MediaWiki syntax into plain text
 * HTML files are converted into plain text using jusText (see the sketch at the end of this page)
 * texts are tokenized using unitok
 * title, categories, translations, number of paragraphs and number of characters are stored as attributes of the document structure in the output

== Requirements ==

 * [wiki:Justext jusText]
 * [wiki:Unitok unitok]

== Get wiki2corpus ==

See [wiki:Downloads] for the latest version.

== Usage ==

{{{
usage: wikidownloader.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
                         [--links LINKS]
                         langcode

Wikipedia downloader

positional arguments:
  langcode       Wikipedia language prefix

optional arguments:
  -h, --help     show this help message and exit
  --cache CACHE  Directory with previously downloaded pages and data
  --wait WAIT    Time interval between GET requests
  --newest       Download the newest versions of articles (do not use cache)
  --links LINKS  Gather external links from Wikipedia (Reference section)
}}}

== Example ==

Let us say you want to download be.wikipedia.org. You can use this command:

{{{
python wikidownloader.py be --wait 7 --links bewiki.links > bewiki.prevert
}}}

The `.prevert` file can then be fed into a pipeline for further processing of the data (see the reading sketch at the end of this page).

== Licence ==

wiki2corpus is licensed under the [http://choosealicense.com/licenses/mit/ MIT licence].
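== Sketch: downloading and cleaning one article ==

The following is a minimal sketch of what the download-and-clean step described above does: fetch the rendered HTML of one article through the MediaWiki API and strip boilerplate with jusText. It is an illustration, not the tool's actual code; the function name, parameters and the chosen article are made up for the example.

{{{
#!python
# Illustration only -- wiki2corpus's own request and cleanup code may differ.
import requests
import justext

def fetch_clean_text(langcode, title, stoplist_language="English"):
    """Return the boilerplate-free plain text of one Wikipedia article."""
    # The MediaWiki "parse" API returns the rendered HTML of a page.
    api = "https://%s.wikipedia.org/w/api.php" % langcode
    params = {"action": "parse", "page": title, "prop": "text", "format": "json"}
    html = requests.get(api, params=params).json()["parse"]["text"]["*"]
    # jusText classifies paragraphs; boilerplate (navigation, footers) is dropped.
    paragraphs = justext.justext(html.encode("utf-8"),
                                 justext.get_stoplist(stoplist_language))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

if __name__ == "__main__":
    print(fetch_clean_text("en", "Corpus linguistics"))
}}}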
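== Sketch: reading the output ==

The sketch below shows one way the `.prevert` output could be consumed downstream. It assumes the usual prevertical convention of documents wrapped in `<doc ...>` ... `</doc>` lines carrying the attributes listed above (title, categories, ...); check this against the actual output before relying on it.

{{{
#!python
# Illustration only -- the document structure name and attributes are assumptions.
import re

DOC_OPEN = re.compile(r'<doc\s+(.*?)>')
ATTR = re.compile(r'(\w+)="(.*?)"')

def iter_docs(lines):
    """Yield (attributes, text) pairs for each document in a prevert stream."""
    attrs, buf = None, []
    for line in lines:
        m = DOC_OPEN.match(line)
        if m:
            # Opening line: collect the attribute-value pairs.
            attrs, buf = dict(ATTR.findall(m.group(1))), []
        elif line.startswith("</doc>"):
            yield attrs, "".join(buf)
            attrs, buf = None, []
        elif attrs is not None:
            buf.append(line)

if __name__ == "__main__":
    with open("bewiki.prevert", encoding="utf-8") as fh:
        for attrs, text in iter_docs(fh):
            print(attrs.get("title"), len(text))
}}}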