
wiki2corpus

  • downloads articles from Wikipedia for a given language code (URL prefix)
  • works with the Wikipedia API (HTML output), since it is not straightforward to turn MediaWiki syntax into plain text
  • HTML files are converted into plain text using jusText; paragraphs, tables and lists containing boilerplate text are removed
  • title, categories, translations, number of paragraphs and number of characters are stored as attributes of the <doc> structure (see the sketch below)
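
To give an idea of the output, a hedged sketch of one <doc> element follows; the attribute names and values are illustrative only, not copied from the tool's actual output:

<doc title="Мінск" categories="Гарады" translations="en:Minsk" paragraphs="42" chars="15730">
<p>
First paragraph of the article as plain text...
</p>
</doc>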

Requirements

  • Python
  • jusText (used to convert the downloaded HTML into plain text)

Get wiki2corpus

See Downloads for the latest version.

Usage

usage: wiki2corpus.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
                            [--links LINKS] [--nicetitles]
                            langcode wordlist

Wikipedia downloader

positional arguments:
  langcode       Wikipedia language prefix
  wordlist       Path to a list of ~2000 most frequent words in the language
                 (UTF-8, one per line)

optional arguments:
  -h, --help     show this help message and exit
  --cache CACHE  Directory with previously downloaded pages and data
  --wait WAIT    Time interval between GET requests
  --newest       Download the newest versions of articles
  --links LINKS  Gather external links from Wikipedia
  --nicetitles   Download only titles starting with alphabetical character

Example

Let us say you want to download the Belarusian Wikipedia. The ISO 639-1 language code is "be". You also need a list of frequent Belarusian words. If you have jusText installed, such a list may already be present, e.g. at /usr/lib/python2.7/site-packages/justext/stoplists/Belarusian.txt
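
If no such list is at hand, one can be derived from any large plain-text sample of the language. A minimal Python 3 sketch, assuming an input file given on the command line (the script name is hypothetical; only the ~2000-word size comes from the usage above):

# make_wordlist.py -- hypothetical helper, not part of wiki2corpus
# Prints the 2000 most frequent words of a UTF-8 text, one per line.
import collections
import re
import sys

counts = collections.Counter()
with open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        # \w+ is Unicode-aware in Python 3; drop purely numeric tokens
        counts.update(w for w in re.findall(r'\w+', line.lower()) if not w.isdigit())

for word, _ in counts.most_common(2000):
    print(word)

Run it e.g. as python make_wordlist.py sample.txt > be_words.txt. With a wordlist in place, run wiki2corpus: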

python wiki2corpus.py be /usr/lib/python2.7/site-packages/justext/stoplists/Belarusian.txt > bewiki.prevert

The resulting bewiki.prevert file can be fed into a pipeline for further processing of the data. It is in a simple XML-like format whose top-level element is <doc>.
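
Each <doc ...> header carries the metadata listed above, so the file can be skimmed without a full XML parser. A minimal Python 3 sketch, assuming the headers sit on their own lines and use a title attribute (both assumptions, as above; the script name is hypothetical):

# list_titles.py -- hypothetical helper, not part of wiki2corpus
# Prints the title of every document in a prevertical file.
import re
import sys

with open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        # document headers look like <doc title="..." ...>
        m = re.match(r'<doc[^>]*\btitle="([^"]*)"', line)
        if m:
            print(m.group(1))

Run it as python list_titles.py bewiki.prevert.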

For example, the following script (wikipipe.sh) tokenizes and de-duplicates the prevertical:

#!/bin/bash
# wikipipe.sh -- download, tokenize and de-duplicate a Wikipedia corpus

# named LANGCODE rather than LANG so the locale environment is not clobbered
LANGCODE=$1   # Wikipedia language code, e.g. "be"
LIST=$2       # path to the list of frequent words

# reuse previously downloaded pages if a cache directory exists
if [ -d "${LANGCODE}wiki.cache" ]; then
    CACHE="--cache ${LANGCODE}wiki.cache"
else
    CACHE=""
fi

# download articles as a prevertical (collecting external links on the way),
# tokenize with unitok, de-duplicate with onion, compress with xz
python wiki2corpus.py "$LANGCODE" "$LIST" $CACHE --links ${LANGCODE}wiki.links |\
unitok --trim 200 /usr/lib/python2.7/site-packages/unitok/configs/other_2.py |\
onion -m -s -n 7 -t 0.7 -d doc 2> /dev/null |\
xz - > ${LANGCODE}wiki.vert.xz

It can be invoked e.g. as bash wikipipe.sh be /usr/lib/python2.7/site-packages/justext/stoplists/Belarusian.txt (the script expects both the language code and the wordlist path). Both tools used in the pipeline are available for download as well: unitok, onion.
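
The pipeline leaves an xz-compressed vertical file, which can be previewed without unpacking it:

xzcat bewiki.vert.xz | head -n 50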

Licence

wiki2corpus is licensed under the MIT licence.