Changes between Version 5 and Version 6 of wiki2corpus
- Timestamp: 12/07/16 17:09:44 (8 years ago)
wiki2corpus
--- wiki2corpus (v5)
+++ wiki2corpus (v6)
@@ -4,5 +4,4 @@
 * works with Wikipedia API (HTML output) as it is not straightforward to turn mediawiki syntax into plain text
 * HTML files are converted into plain text using jusText
-* texts are tokenized using unitok
 * title, categories, translations, number of paragraphs and number of chars are put as attributes to `<doc>` structure
 
@@ -10,5 +9,4 @@
 
 * [wiki:Justext jusText]
-* [wiki:Unitok unitok]
 
 == Get wiki2corpus ==
@@ -46,4 +44,25 @@
 The `bewiki.prevert` file can be used to feed a pipeline for following processing of the data. It is in a simple XML-like format with element `<doc>`.
 
+You may e.g. tokenize the prevertical with this script (`wikipipe.sh`):
+
+{{{
+#!/bin/bash
+
+LANG=$1
+
+if [ -d "${LANG}wiki.cache" ]; then
+    CACHE="--cache ${LANG}wiki.cache"
+else
+    CACHE=""
+fi
+
+python wikidownloader.py $LANG $CACHE --links ${LANG}wiki.links |\
+    unitok --trim 200 /usr/lib/python2.7/site-packages/unitok/configs/other_2.py |\
+    onion -m -s -n 7 -t 0.7 -d doc 2> /dev/null |\
+    xz - > ${LANG}wiki.vert.xz
+}}}
+
+It can be invoked with the command `bash wikipipe.sh be`. Both required tools are available too: [wiki:Unitok unitok], [wiki:Onion onion].
+
 == Licence ==
 
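
The `wikipipe.sh` script added in version 6 stores its argument in `LANG`, which is also the standard locale environment variable. When `LANG` is already exported by the calling shell, every tool in the pipeline would then see the Wikipedia language code as its locale. The sketch below shows one possible way around that; it is not the wiki's version. It assumes the same tools, options and paths as the script above, and the names `wiki_lang`, `downloader_args` and `run_pipeline` are hypothetical. It also assumes language codes contain no whitespace, since the argument string is word-split on purpose.

```shell
#!/bin/bash
# Sketch of a wikipipe.sh variant. wiki_lang, downloader_args and run_pipeline
# are illustrative names only; they are not part of wiki2corpus.

# Print the wikidownloader.py arguments for one language code.
# --cache is included only when the "<lang>wiki.cache" directory exists.
downloader_args() {
    wiki_lang=$1
    if [ -d "${wiki_lang}wiki.cache" ]; then
        echo "$wiki_lang --cache ${wiki_lang}wiki.cache --links ${wiki_lang}wiki.links"
    else
        echo "$wiki_lang --links ${wiki_lang}wiki.links"
    fi
}

# Download, tokenize, deduplicate and compress, as in the original script.
run_pipeline() {
    wiki_lang=$1
    # Unquoted on purpose: the argument string must be split into words.
    python wikidownloader.py $(downloader_args "$wiki_lang") |
        unitok --trim 200 /usr/lib/python2.7/site-packages/unitok/configs/other_2.py |
        onion -m -s -n 7 -t 0.7 -d doc 2> /dev/null |
        xz - > "${wiki_lang}wiki.vert.xz"
}

# Run the pipeline only when a language code is given on the command line.
if [ "$#" -gt 0 ]; then
    run_pipeline "$1"
fi
```

Saved under the same file name, it would be invoked the same way: `bash wikipipe.sh be`.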