Changes between Version 5 and Version 6 of wiki2corpus


Timestamp:
12/07/16 17:09:44 (7 years ago)
Author:
admin
Comment:

--

Legend: lines prefixed with "-" were removed in v6, lines prefixed with "+" were added in v6, and lines prefixed with a space are unmodified.

  • wiki2corpus (v5 to v6)

  * works with Wikipedia API (HTML output) as it is not straightforward to turn mediawiki syntax into plain text
  * HTML files are converted into plain text using jusText
- * texts are tokenized using unitok
  * title, categories, translations, number of paragraphs and number of chars are stored as attributes of the `<doc>` structure


  * [wiki:Justext jusText]
- * [wiki:Unitok unitok]

  == Get wiki2corpus ==
     
  The `bewiki.prevert` file can be used to feed a pipeline for further processing of the data. It is in a simple XML-like format with element `<doc>`.

+ For example, the following script (`wikipipe.sh`) downloads the data, tokenizes it with unitok, deduplicates it with onion and compresses the result:
+
+ {{{
+ #!/bin/bash
+
+ # Wikipedia language code, e.g. "be"; not named LANG, which is the locale variable
+ WIKI_LANG=$1
+
+ # Reuse the download cache if one exists
+ if [ -d "${WIKI_LANG}wiki.cache" ]; then
+     CACHE="--cache ${WIKI_LANG}wiki.cache"
+ else
+     CACHE=""
+ fi
+
+ # $CACHE is left unquoted on purpose so it splits into option and value
+ python wikidownloader.py "$WIKI_LANG" $CACHE --links "${WIKI_LANG}wiki.links" |\
+ unitok --trim 200 /usr/lib/python2.7/site-packages/unitok/configs/other_2.py |\
+ onion -m -s -n 7 -t 0.7 -d doc 2> /dev/null |\
+ xz - > "${WIKI_LANG}wiki.vert.xz"
+ }}}
+
+ It can be invoked with the command `bash wikipipe.sh be`. Both required tools are also available: [wiki:Unitok unitok], [wiki:Onion onion].
+
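As a quick sanity check on the pipeline output, the following is a minimal sketch of a helper that counts documents and tokens in a vertical stream. The function name `vert_stats` is hypothetical and not part of wiki2corpus. It assumes one token per line and that every line starting with `<` is a structure tag, which may miscount a literal `<` token if the tokenizer emits one unescaped.

```bash
#!/bin/bash

# Hypothetical helper (not part of wiki2corpus): summarise a vertical
# stream read from standard input.
vert_stats() {
    awk '
        /^<doc[ >]/ { docs++; next }    # opening <doc ...> tags
        /^</        { next }            # any other structure tag, e.g. </doc>
        NF          { tokens++ }        # non-empty line: assumed to be one token
        END         { printf "docs=%d tokens=%d\n", docs, tokens }
    '
}
```

Assuming the script above produced `bewiki.vert.xz`, it could be used as `xz -dc bewiki.vert.xz | vert_stats`.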
  == Licence ==
