wiki2corpus
- downloads articles from Wikipedia for a given language id (URL prefix)
- uses the Wikipedia API (HTML output), because converting MediaWiki syntax directly into plain text is not straightforward
- HTML files are converted into plain text using jusText
- texts are tokenized using unitok
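The first steps above can be sketched in Python as follows. This is a minimal illustration, not wiki2corpus's actual code: the function names (`build_parse_url`, `extract_html`, `fetch_article_html`, `html_to_paragraphs`), the choice of the `action=parse` API module, and the `User-Agent` string are all assumptions made for the example. The tokenization step with unitok is left out, since its interface is not described here.

```python
import json
import urllib.parse
import urllib.request


def build_parse_url(lang, title):
    """Build a MediaWiki API URL that asks for the rendered HTML of one article.

    `lang` is the language id, i.e. the URL prefix such as "en" or "cs".
    """
    params = {
        "action": "parse",      # render a page
        "page": title,
        "prop": "text",         # only the HTML body
        "format": "json",
        "formatversion": "2",   # makes parse.text a plain string
    }
    query = urllib.parse.urlencode(params)
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, query)


def extract_html(api_response):
    """Pull the HTML string out of a decoded action=parse response."""
    return api_response["parse"]["text"]


def fetch_article_html(lang, title):
    """Download one article as HTML (needs network access)."""
    request = urllib.request.Request(
        build_parse_url(lang, title),
        # Hypothetical User-Agent, used only for this sketch.
        headers={"User-Agent": "wiki2corpus-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_html(json.load(response))


def html_to_paragraphs(html, stoplist_name="English"):
    """Convert HTML to plain-text paragraphs with jusText, dropping boilerplate."""
    # Third-party package; imported here so the helpers above work without it.
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(stoplist_name))
    return [p.text for p in paragraphs if not p.is_boilerplate]
```

With network access and jusText installed, `html_to_paragraphs(fetch_article_html("en", "Alan Turing"))` would return the article's text paragraphs, which could then be passed to a tokenizer.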
Requirements
- jusText
- Unitok
Get wiki2corpus
See Downloads for the latest version.
Licence
Unitok is licensed under the MIT licence.