Changes between Initial Version and Version 1 of wiki2corpus


Timestamp: 12/07/16 13:24:39 (8 years ago)
Author: admin
Comment: --

  • wiki2corpus

    = wiki2corpus =

    * downloads articles from Wikipedia for a given language id (the URL prefix, e.g. `en`)
    * uses the Wikipedia API's HTML output, because converting MediaWiki syntax to plain text is not straightforward
    * converts the HTML files into plain text using jusText
    * tokenizes the texts using unitok
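
The first steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual wiki2corpus code: the function names, the `User-Agent` string and the choice of the MediaWiki `action=parse` endpoint with `formatversion=2` are assumptions made for this example. The jusText calls (`justext.justext`, `justext.get_stoplist`) come from that library's public Python interface; tokenization with unitok is left out because its interface is not described on this page.

```python
import json
import urllib.parse
import urllib.request


def build_parse_url(lang, title):
    """Return a MediaWiki API URL asking for the rendered HTML of one article.

    `lang` is the Wikipedia language id (URL prefix), e.g. "en".
    """
    query = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",  # assumption: with version 2, parse.text is a plain string
    })
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, query)


def extract_html(api_response):
    """Pull the HTML string out of a decoded action=parse JSON response."""
    return api_response["parse"]["text"]


def fetch_article_html(lang, title):
    """Download one article as HTML (requires network access)."""
    request = urllib.request.Request(
        build_parse_url(lang, title),
        # Hypothetical client name, used only for this example.
        headers={"User-Agent": "wiki2corpus-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_html(json.load(response))


def html_to_text(html, stoplist_language="English"):
    """Keep only the paragraphs that jusText does not classify as boilerplate."""
    import justext  # third-party; imported lazily so the helpers above work without it

    paragraphs = justext.justext(html, justext.get_stoplist(stoplist_language))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```

With network access and jusText installed, `html_to_text(fetch_article_html("en", "Corpus linguistics"))` would return the article's main text, which could then be passed to a tokenizer.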

    == Requirements ==

    * justext
    * unitok

    == Get wiki2corpus ==

    See [wiki:Downloads] for the latest version.

    == Licence ==

    Unitok is licensed under the [http://choosealicense.com/licenses/mit/ MIT licence].