Changes between Version 6 and Version 7 of wiki2corpus
- Timestamp: 03/10/17 17:21:50
wiki2corpus
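The v7 usage text recorded in this diff lists two positional arguments (`langcode`, `wordlist`) and the options `--cache`, `--wait`, `--newest`, `--links` and `--nicetitles`. As a rough illustration only, an `argparse` parser exposing that interface might look like the sketch below. It is not the actual source of wiki2corpus.py: the function name `build_parser`, the `float` type for `--wait` and the absence of defaults are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the v7 usage text (not the real source)."""
    parser = argparse.ArgumentParser(
        prog="wiki2corpus.py",
        description="Wikipedia downloader",
    )
    # Positional arguments; wordlist is new in v7.
    parser.add_argument("langcode", help="Wikipedia language prefix")
    parser.add_argument(
        "wordlist",
        help="Path to a list of ~2000 most frequent words in the language "
             "(UTF-8, one per line)",
    )
    # Optional arguments; the float type for --wait is an assumption.
    parser.add_argument(
        "--cache", help="Directory with previously downloaded pages and data")
    parser.add_argument(
        "--wait", type=float, help="Time interval between GET requests")
    parser.add_argument(
        "--newest", action="store_true",
        help="Download the newest versions of articles")
    parser.add_argument(
        "--links", help="Gather external links from Wikipedia")
    # New in v7.
    parser.add_argument(
        "--nicetitles", action="store_true",
        help="Download only titles starting with alphabetical character")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["be", "Belarusian.txt", "--wait", "7"])
print(args.langcode, args.wordlist, args.wait)
```

Passing an explicit list to `parse_args` keeps the sketch independent of the real command line; a real script would call `parse_args()` without arguments.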
 v6  v7
  2   2
  3   3   * downloads articles from Wikipedia for a given language id (URL prefix)
  4     - * works with Wikipedia API (HTML output) as it is not straightforward to turn mediawiki syntax into plain text
  5     - * HTML files are converted into plain text using jusText
      4 + * works with Wikipedia API (HTML output) as it is not straightforward to turn MediaWiki syntax into plain text
      5 + * HTML files are converted into plain text using jusText, some paragraphs, tables, lists are removed (with unuseful texts)
  6   6   * title, categories, translations, number of paragraphs and number of chars are put as attributes to `<doc>` structure
  7   7
  …   …
 17  17
 18  18   {{{
 19     - usage: wikidownloader.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
 20     -                          [--links LINKS]
 21     -                          langcode
     19 + usage: wiki2corpus.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
     20 +                       [--links LINKS] [--nicetitles]
     21 +                       langcode wordlist
 22  22
 23  23   Wikipedia downloader
  …   …
 25  25   positional arguments:
 26  26     langcode       Wikipedia language prefix
     27 +   wordlist       Path to a list of ~2000 most frequent words in the language
     28 +                  (UTF-8, one per line)
 27  29
 28  30   optional arguments:
  …   …
 30  32     --cache CACHE  Directory with previously downloaded pages and data
 31  33     --wait WAIT    Time interval between GET requests
 32     -   --newest       Download the newest versions of articles (do not use cache)
 33     -   --links LINKS  Gather external links from Wikipedia (Reference section)
     34 +   --newest       Download the newest versions of articles
     35 +   --links LINKS  Gather external links from Wikipedia
     36 +   --nicetitles   Download only titles starting with alphabetical character
 34  37   }}}
 35  38
 36  39   == Example ==
 37  40
 38     - Let us say you want to download Belarusian Wikipedia. The ISO 639-1 language code is "be", so you want to download articles from be.wikipedia.org. You can use this command:
     41 + Let us say you want to download Belarusian Wikipedia. The ISO 639-1 language code is "be". You need also a list of Belarusian words. If you have jusText installed, it might be e.g. here `/usr/lib/python2.7/site-packages/justext/stoplists/Belarusian.txt`
 39  42
 40  43   {{{
 41     - python wikidownloader.py be --wait 7 --links bewiki.links > bewiki.prevert
     44 + python wiki2corpus.py be usr/lib/python2.7/site-packages/justext/stoplists/Belarusian.txt > bewiki.prevert
 42  45   }}}
 43  46
  …   …
 50  53
 51  54   LANG=$1
     55 + LIST=$2
 52  56
 53  57   if [ -d "${LANG}wiki.cache" ]; then
  …   …
 57  61   fi
 58  62
 59     - python wikidownloader.py $LANG $CACHE --links ${LANG}wiki.links |\
     63 + python wiki2corpus.py $LANG $LIST $CACHE --links ${LANG}wiki.links |\
 60  64   unitok --trim 200 /usr/lib/python2.7/site-packages/unitok/configs/other_2.py |\
 61  65   onion -m -s -n 7 -t 0.7 -d doc 2> /dev/null |\