Version 2 (modified by admin, 7 years ago)

wiki2corpus

  • downloads Wikipedia articles for a given language id (the URL prefix, e.g. "en" in en.wikipedia.org)
  • fetches articles as HTML through the Wikipedia API, because converting MediaWiki syntax into plain text is not straightforward
  • converts the HTML into plain text using jusText
  • tokenizes the resulting text using unitok
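
The first three steps above can be sketched roughly as follows. This is a minimal illustration, not the wiki2corpus source: it assumes the `action=parse` endpoint of the MediaWiki API is used to obtain HTML, and the function names (`parse_api_url`, `extract_html`, `fetch_article_html`, `html_to_text`) and the User-Agent string are hypothetical. Tokenization with unitok is left out because its interface is not described here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def parse_api_url(lang, title):
    """Build a MediaWiki API URL that returns the rendered HTML of one article.

    Assumption: the action=parse endpoint is used; wiki2corpus may differ.
    """
    query = urlencode({
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",  # with version 2, "text" is a plain string
    })
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, query)


def extract_html(api_response):
    """Pull the HTML string out of a decoded action=parse JSON response."""
    return json.loads(api_response)["parse"]["text"]


def fetch_article_html(lang, title):
    """Download one article as HTML (needs network access)."""
    request = Request(
        parse_api_url(lang, title),
        headers={"User-Agent": "wiki2corpus-sketch/0.1 (example)"},  # hypothetical
    )
    with urlopen(request) as response:
        return extract_html(response.read().decode("utf-8"))


def html_to_text(html, stoplist_name="English"):
    """Keep only the paragraphs jusText does not classify as boilerplate."""
    import justext  # third-party package, imported lazily
    paragraphs = justext.justext(html, justext.get_stoplist(stoplist_name))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```

With network access and jusText installed, `html_to_text(fetch_article_html("en", "Corpus linguistics"))` would give the plain text of one article, ready to be passed to a tokenizer.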

Requirements

  • jusText
  • Unitok

Get wiki2corpus

See Downloads for the latest version.

Licence

Unitok is licensed under the MIT licence.