Changes between Version 4 and Version 5 of wiki2corpus
Timestamp: 12/07/16 17:02:20 (8 years ago)
wiki2corpus
v4  v5
5   5     * HTML files are converted into plain text using jusText
6   6     * texts are tokenized using unitok
7       - * title, categories, translations, number of paragraphs and number of chars are put as attributes to ` ``<doc>``` structure
    7   + * title, categories, translations, number of paragraphs and number of chars are put as attributes to `<doc>` structure
8   8
9   9     == Requirements ==
…   …
38  38    == Example ==
39  39
40      - Let us say you want to download fr.wikipedia.org. You can use this command:
    40  + Let us say you want to download Belarusian Wikipedia. The ISO 639-1 language code is "be", so you want to download articles from be.wikipedia.org. You can use this command:
41  41
    42  + {{{
42  43    python wikidownloader.py be --wait 7 --links bewiki.links > bewiki.prevert
43  44    }}}
44  45
45      - The ` ``.prevert``` file can be used to feed a pipeline for following processing of the data.
    46  + The `bewiki.prevert` file can be used to feed a pipeline for following processing of the data. It is in a simple XML-like format with element `<doc>`.
46  47
47  48    == Licence ==
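
The v5 text describes the `.prevert` output as a simple XML-like format whose `<doc>` elements carry the title, categories, translations, and paragraph and character counts as attributes. The following is a minimal sketch of how a downstream step might read those attributes. The attribute names (`title`, `categories`, `paragraphs`, `chars`), the `<p>` wrapper, and the sample data are all assumptions made for illustration; the page does not specify them, so check them against real output before relying on them.

```python
import re

# Hypothetical sample of a .prevert file. The attribute names and the <p>
# wrapper are assumptions for illustration, not the tool's documented output.
SAMPLE = """<doc title="Minsk" categories="Cities" paragraphs="1" chars="5">
<p>
Minsk
</p>
</doc>
<doc title="Brest" categories="" paragraphs="1" chars="5">
<p>
Brest
</p>
</doc>
"""

# Matches an opening <doc ...> tag and captures its raw attribute string.
DOC_OPEN = re.compile(r'<doc\b([^>]*)>')
# Matches one name="value" pair inside that attribute string.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def iter_doc_attributes(lines):
    """Yield a dict of attributes for every opening <doc ...> tag."""
    for line in lines:
        match = DOC_OPEN.match(line.strip())
        if match:
            yield dict(ATTR.findall(match.group(1)))


docs = list(iter_doc_attributes(SAMPLE.splitlines()))
for attrs in docs:
    print(attrs.get("title"), attrs.get("chars"))
```

To process a real file, the same generator could be fed an open file handle, for example `iter_doc_attributes(open("bewiki.prevert", encoding="utf-8"))`, assuming the file is UTF-8 encoded. A regular expression is used rather than an XML parser because the format is only described as XML-like, so it may not be well-formed XML.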