Version 1 (modified by admin, 8 years ago)

wiki2corpus

  • downloads Wikipedia articles for a given language ID (the URL prefix, such as "en" in en.wikipedia.org)
  • uses the Wikipedia API's HTML output, because converting MediaWiki syntax directly into plain text is not straightforward
  • converts the HTML files into plain text using jusText
  • tokenizes the texts using unitok
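
The first steps of the list above can be sketched in Python as follows. This is a minimal illustration and not the actual wiki2corpus source: the function names (api_url, fetch_html, html_to_text), the User-Agent string, and the choice of the API's action=parse request are assumptions made for the example. The justext calls follow that library's documented interface. Tokenization with unitok is left out because its invocation is not described on this page.

```python
import json
import urllib.parse
import urllib.request


def api_url(lang, title):
    """Build a MediaWiki API URL that returns one article as rendered HTML.

    `lang` is the Wikipedia language id (URL prefix), e.g. "en" or "cs".
    Hypothetical helper, not part of wiki2corpus.
    """
    query = urllib.parse.urlencode({
        "action": "parse",       # ask the API to render the page
        "page": title,
        "prop": "text",          # only the HTML body is needed
        "format": "json",
        "formatversion": "2",    # makes parse.text a plain string
    })
    return f"https://{lang}.wikipedia.org/w/api.php?{query}"


def fetch_html(lang, title):
    """Download the rendered HTML of a single article (needs network access)."""
    request = urllib.request.Request(
        api_url(lang, title),
        headers={"User-Agent": "wiki2corpus-sketch/0.1"},  # assumed value
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["text"]


def html_to_text(html, stoplist_name="English"):
    """Strip boilerplate with jusText and return the remaining paragraphs."""
    import justext  # third-party; imported here so the URL helper works without it

    paragraphs = justext.justext(html, justext.get_stoplist(stoplist_name))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)


if __name__ == "__main__":
    # Example run: fetch one English article and print its cleaned text.
    print(html_to_text(fetch_html("en", "Prague")))
```

Requesting rendered HTML and cleaning it with jusText avoids writing a MediaWiki syntax parser, which matches the reasoning given in the list. The stoplist name passed to jusText has to match the language of the downloaded Wikipedia.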

Requirements

  • justext
  • unitok

Get wiki2corpus

See Downloads for the latest version.

Licence

Unitok is licensed under the MIT licence.