
wiki2corpus

  • downloads articles from Wikipedia for a given language code (the URL prefix, e.g. "be" for be.wikipedia.org)
  • uses the Wikipedia API's HTML output, since it is not straightforward to turn MediaWiki syntax into plain text directly
  • the HTML pages are converted into plain text using jusText (see the sketch after this list)
  • the texts are tokenized using unitok
  • the title, categories, translations, number of paragraphs and number of characters are stored as attributes of the <doc> structure
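
A minimal sketch of the download-and-clean step, for illustration only (this is not the tool's actual code; fetch_plain_text is a hypothetical helper, and the requests and justext packages are assumed to be installed):

import requests
import justext

def fetch_plain_text(langcode, title):
    # action=parse returns the rendered HTML of an article
    response = requests.get(
        "https://%s.wikipedia.org/w/api.php" % langcode,
        params={"action": "parse", "page": title,
                "prop": "text", "format": "json"},
    )
    html = response.json()["parse"]["text"]["*"]
    # jusText keeps content paragraphs and drops navigation boilerplate;
    # use the stoplist matching the wiki's language
    paragraphs = justext.justext(html, justext.get_stoplist("Belarusian"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)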

Requirements

The script needs Python, jusText (for the HTML-to-text conversion) and unitok (for tokenization).

Get wiki2corpus

See Downloads for the latest version.

Usage

usage: wikidownloader.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
                         [--links LINKS]
                         langcode

Wikipedia downloader

positional arguments:
  langcode       Wikipedia language prefix

optional arguments:
  -h, --help     show this help message and exit
  --cache CACHE  Directory with previously downloaded pages and data
  --wait WAIT    Time interval between GET requests
  --newest       Download the newest versions of articles (do not use cache)
  --links LINKS  Gather external links from Wikipedia (Reference section)
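
For instance, to keep downloaded pages in a local directory so that they are reused on later runs (the directory name is illustrative):

python wikidownloader.py be --cache bewiki.cache > bewiki.prevert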

Example

Let us say you want to download the Belarusian Wikipedia. Its ISO 639-1 language code is "be", so you want to download articles from be.wikipedia.org. You can use this command:

python wikidownloader.py be --wait 7 --links bewiki.links > bewiki.prevert

This waits seven seconds between GET requests and writes external links gathered from the articles to bewiki.links. The resulting bewiki.prevert file can be used as input to a pipeline for further processing of the data; it is in a simple XML-like format whose top-level element is <doc>.
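
For illustration, one record in the prevert output might look like this (the attribute names and the <p> markup are assumptions based on the feature list above; the values are made up):

<doc title="Example article" categories="Category A|Category B" translations="en|ru" paragraphs="2" chars="118">
<p>
First paragraph of plain text extracted by jusText.
</p>
<p>
Second paragraph.
</p>
</doc>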

Licence

wiki2corpus is licensed under the MIT licence.