Changes between Version 4 and Version 5 of wiki2corpus


Timestamp:
12/07/16 17:02:20 (8 years ago)
Author:
admin
Comment:

--

  • wiki2corpus

    v4   v5
    5    5    * HTML files are converted into plain text using jusText
    6    6    * texts are tokenized using unitok
    7         * title, categories, translations, number of paragraphs and number of chars are put as attributes to ```<doc>``` structure
         7    * title, categories, translations, number of paragraphs and number of chars are put as attributes to `<doc>` structure
    8    8
    9    9    == Requirements ==
     
    38   38   == Example ==
    39   39
    40        Let us say you want to download fr.wikipedia.org. You can use this command:
         40   Let us say you want to download Belarusian Wikipedia. The ISO 639-1 language code is "be", so you want to download articles from be.wikipedia.org. You can use this command:
         41
    41   42   {{{
    42   43   python wikidownloader.py be --wait 7 --links bewiki.links > bewiki.prevert
    43   44   }}}
    44   45
    45        The ```.prevert``` file can be used to feed a pipeline for following processing of the data.
         46   The `bewiki.prevert` file can be used to feed a pipeline for following processing of the data. It is in a simple XML-like format with element `<doc>`.
    46   47
    47   48   == Licence ==
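
Version 5 describes `bewiki.prevert` as a simple XML-like format built around a `<doc>` element whose attributes carry the title, categories and similar metadata. As a rough illustration of how a downstream pipeline might read such a file, here is a minimal Python sketch. It rests on assumptions the page does not confirm: that each `<doc ...>` opening tag and each `</doc>` closing tag sits on its own line, and that attributes are written as `name="value"` pairs. The function name `iter_docs` and the attribute names used in any example are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: attributes appear as name="value" pairs inside the <doc ...> tag.
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def iter_docs(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], List[str]]]:
    """Yield (attributes, body_lines) for every <doc ...> ... </doc> block.

    Hypothetical helper: it assumes the opening and closing tags each
    occupy a whole line, which the source page does not state.
    """
    attrs = None  # None means "currently outside any <doc> block"
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if attrs is None and line.startswith(("<doc ", "<doc>")):
            # Opening tag: collect its attributes and start a fresh body.
            attrs = dict(ATTR_RE.findall(line))
            body = []
        elif attrs is not None and line.strip() == "</doc>":
            # Closing tag: hand the finished document to the caller.
            yield attrs, body
            attrs = None
        elif attrs is not None:
            # Any other line inside a document is kept verbatim.
            body.append(line)
```

Because `iter_docs` accepts any iterable of lines, it could be fed an open file object, for example `open("bewiki.prevert", encoding="utf-8")`, so that large dumps are processed one document at a time rather than loaded whole.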