= wiki2corpus =

* downloads articles from Wikipedia for a given language id (URL prefix)
* works with the Wikipedia API (HTML output), as it is not straightforward to turn MediaWiki syntax into plain text
* HTML files are converted into plain text using jusText
* texts are tokenized using unitok

== Requirements ==

* justext
* unitok

== Get wiki2corpus ==

See [wiki:Downloads] for the latest version.

== Licence ==

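
The first steps of the pipeline above can be sketched roughly as follows. This is an illustrative sketch only, not the actual wiki2corpus code: the function names `api_url`, `fetch_html` and `html_to_text` and the User-Agent string are hypothetical, and it assumes the standard MediaWiki `action=parse` endpoint and the `justext` Python package. Tokenization with unitok is left out, since its programmatic interface is not described here.

```python
import json
import urllib.parse
import urllib.request


def api_url(lang, title):
    """Build a MediaWiki API request for the rendered HTML of one article.

    `lang` is the Wikipedia URL prefix (for example "en" or "cs").
    """
    query = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
        "formatversion": "2",  # returns the HTML as a plain string
    })
    return f"https://{lang}.wikipedia.org/w/api.php?{query}"


def fetch_html(lang, title):
    """Download the rendered HTML of an article (requires network access)."""
    request = urllib.request.Request(
        api_url(lang, title),
        headers={"User-Agent": "wiki2corpus-sketch/0.1"},  # hypothetical value
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["text"]


def html_to_text(html, stoplist="English"):
    """Strip boilerplate with jusText and return the remaining paragraphs."""
    import justext  # third-party package; imported here so the URL helper works without it

    paragraphs = justext.justext(html, justext.get_stoplist(stoplist))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```

For example, `html_to_text(fetch_html("en", "Text corpus"))` would return the plain text of a single article, which could then be passed to unitok for tokenization.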
wiki2corpus is licensed under the [http://choosealicense.com/licenses/mit/ MIT licence].