Changes between Version 6 and Version 7 of wiki2corpus


Timestamp: 03/10/17 17:21:50 (8 years ago)
Author: admin
Comment:


Legend: Unmodified | Added | Removed | Modified
  • wiki2corpus (v6 → v7)

  * downloads articles from Wikipedia for a given language id (URL prefix)
- * works with Wikipedia API (HTML output) as it is not straightforward to turn mediawiki syntax into plain text
- * HTML files are converted into plain text using jusText
+ * works with Wikipedia API (HTML output) as it is not straightforward to turn MediaWiki syntax into plain text
+ * HTML files are converted into plain text using jusText; paragraphs, tables and lists containing useless text are removed
  * title, categories, translations, number of paragraphs and number of chars are put as attributes to `<doc>` structure

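For illustration, a `<doc>` header carrying the attributes listed above might look like the sketch below; the attribute names, their order and the sample values are illustrative guesses, not taken from the tool's source:

{{{
<doc title="Minsk" categories="Cities in Belarus" translations="be|Мінск" paragraphs="42" chars="15360">
...plain-text paragraphs of the article...
</doc>
}}}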
     

  {{{
- usage: wikidownloader.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
-                          [--links LINKS]
-                          langcode
+ usage: wiki2corpus.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
+                       [--links LINKS] [--nicetitles]
+                       langcode wordlist

  Wikipedia downloader

  positional arguments:
    langcode       Wikipedia language prefix
+   wordlist       Path to a list of ~2000 most frequent words in the language
+                  (UTF-8, one per line)

  optional arguments:
     
    --cache CACHE  Directory with previously downloaded pages and data
    --wait WAIT    Time interval between GET requests
-   --newest       Download the newest versions of articles (do not use cache)
-   --links LINKS  Gather external links from Wikipedia (Reference section)
+   --newest       Download the newest versions of articles
+   --links LINKS  Gather external links from Wikipedia
+   --nicetitles   Download only titles starting with alphabetical character
  }}}

  == Example ==

- Let us say you want to download the Belarusian Wikipedia. The ISO 639-1 language code is "be", so you want to download articles from be.wikipedia.org. You can use this command:
+ Let us say you want to download the Belarusian Wikipedia. The ISO 639-1 language code is "be". You also need a list of Belarusian words; if you have jusText installed, one may be found e.g. at `/usr/lib/python2.7/site-packages/justext/stoplists/Belarusian.txt`.

  {{{
- python wikidownloader.py be --wait 7 --links bewiki.links > bewiki.prevert
+ python wiki2corpus.py be /usr/lib/python2.7/site-packages/justext/stoplists/Belarusian.txt > bewiki.prevert
  }}}

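The wordlist does not have to come from jusText. If no stoplist exists for your language, a rough frequency list can be built from any plain-text corpus you already have. Below is a minimal sketch, assuming a UTF-8 `corpus.txt`; the file names are illustrative, and note that `tr` is not Unicode-aware in every locale, so for Cyrillic text a Unicode-aware tokenizer may work better:

```shell
# Split crudely on whitespace and punctuation, lowercase,
# count occurrences, and keep the ~2000 most frequent words
# (one word per line, as wiki2corpus expects).
tr -s '[:space:][:punct:]' '\n' < corpus.txt |
  tr '[:upper:]' '[:lower:]' |
  sort | uniq -c | sort -rn |
  head -n 2000 | awk '{print $2}' > wordlist.txt
```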
     

  LANG=$1
+ LIST=$2

  if [ -d "${LANG}wiki.cache" ]; then
     
  fi

- python wikidownloader.py $LANG $CACHE --links ${LANG}wiki.links |\
+ python wiki2corpus.py $LANG $LIST $CACHE --links ${LANG}wiki.links |\
  unitok --trim 200 /usr/lib/python2.7/site-packages/unitok/configs/other_2.py |\
  onion -m -s -n 7 -t 0.7 -d doc 2> /dev/null |\