Changes between Version 3 and Version 4 of wiki2corpus


Timestamp:
12/07/16 16:55:17 (8 years ago)
Author:
admin
Comment:

--

  • wiki2corpus

    v3 v4  
    55* HTML files are converted into plain text using jusText
    66* texts are tokenized using unitok
     7* title, categories, translations, number of paragraphs and number of characters are stored as attributes of the ```<doc>``` structure
    78
    89== Requirements ==
     
    1516See [wiki:Downloads] for the latest version.
    1617
     18== Usage ==
     19
     20{{{
     21usage: wikidownloader.py [-h] [--cache CACHE] [--wait WAIT] [--newest]
     22                         [--links LINKS]
     23                         langcode
     24
     25Wikipedia downloader
     26
     27positional arguments:
     28  langcode       Wikipedia language prefix
     29
     30optional arguments:
     31  -h, --help     show this help message and exit
     32  --cache CACHE  Directory with previously downloaded pages and data
     33  --wait WAIT    Time interval between GET requests
     34  --newest       Download the newest versions of articles (do not use cache)
     35  --links LINKS  Gather external links from Wikipedia (Reference section)
     36}}}
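
The usage text above looks like standard `argparse` output, so the command line can be reconstructed roughly as follows. This is only a sketch under that assumption, not the actual source of `wikidownloader.py`: the function name `build_parser`, the `float` type and the default for `--wait` are all hypothetical.

```python
import argparse


def build_parser():
    """Build a parser whose options mirror the usage text (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="wikidownloader.py",
        description="Wikipedia downloader",
    )
    parser.add_argument("langcode", help="Wikipedia language prefix")
    parser.add_argument(
        "--cache",
        help="Directory with previously downloaded pages and data",
    )
    # Assumption: the interval is given in seconds; type and default are guesses.
    parser.add_argument(
        "--wait",
        type=float,
        default=1.0,
        help="Time interval between GET requests",
    )
    parser.add_argument(
        "--newest",
        action="store_true",
        help="Download the newest versions of articles (do not use cache)",
    )
    parser.add_argument(
        "--links",
        help="Gather external links from Wikipedia (Reference section)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is self-contained.
args = build_parser().parse_args(["be", "--wait", "7", "--links", "bewiki.links"])
print(args.langcode, args.wait, args.links, args.newest)
```

Passing an explicit list to `parse_args` makes it easy to check how a given command line would be interpreted without running a download.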
     37
     38== Example ==
     39
     40Let us say you want to download be.wikipedia.org. You can use this command:
     41{{{
     42python wikidownloader.py be --wait 7 --links bewiki.links > bewiki.prevert
     43}}}
     44
     45The ```.prevert``` file can then be fed into a pipeline for further processing of the data.
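
As an illustration of such further processing, the sketch below walks through a prevertical file and collects the attributes of each `<doc>` structure. It assumes that every `<doc ...>` and `</doc>` tag sits on its own line and that attribute values are double-quoted; the attribute names in the sample (`title`, `paragraphs`) and the helper name `iter_docs` are hypothetical, not taken from the tool's actual output.

```python
import re

# Assumed layout: one opening <doc ...> tag per line, attributes in double quotes.
DOC_OPEN = re.compile(r"<doc\b([^>]*)>")
ATTR = re.compile(r'(\w+)="([^"]*)"')


def iter_docs(lines):
    """Yield (attributes, body_lines) for each <doc> ... </doc> block."""
    attrs, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            # Start of a new document: parse its attributes into a dict.
            attrs, body = dict(ATTR.findall(match.group(1))), []
        elif line.startswith("</doc>"):
            if attrs is not None:
                yield attrs, body
            attrs = None
        elif attrs is not None:
            # Any other line inside a document is kept as body text or markup.
            body.append(line)


# Hypothetical sample standing in for the contents of a .prevert file.
sample = [
    '<doc title="Example" paragraphs="1">\n',
    "<p>\n",
    "Some text.\n",
    "</p>\n",
    "</doc>\n",
]
docs = list(iter_docs(sample))
print(docs)
```

Because `iter_docs` accepts any iterable of lines, an open file object (for example one opened on `bewiki.prevert` with UTF-8 encoding) can be passed in directly, so large dumps are processed one document at a time.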
    1746
    1847== Licence ==