= wiki2corpus =

* downloads articles from Wikipedia for a given language ID (the URL prefix, e.g. en)
* uses the Wikipedia API's HTML output, because converting MediaWiki syntax directly into plain text is not straightforward
* converts the HTML files into plain text using jusText
* tokenizes the texts using unitok

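The first steps listed above can be illustrated with a minimal sketch. This is not the wiki2corpus source code: the function names, the User-Agent string, and the choice of the MediaWiki `action=parse` endpoint are assumptions made for illustration. jusText is imported lazily so the URL helper works without it, and the unitok step is left out because its interface is not described here.

```python
import json
import urllib.parse
import urllib.request


def build_parse_url(lang_id, title):
    """Build a MediaWiki API URL asking for the rendered HTML of one article.

    lang_id is the Wikipedia URL prefix (for example "en" or "cs").
    """
    query = urllib.parse.urlencode({
        "action": "parse",       # render a page
        "page": title,           # article title
        "prop": "text",          # return only the HTML body
        "format": "json",
        "formatversion": "2",    # HTML is returned as a plain string
    })
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang_id, query)


def fetch_html(lang_id, title):
    """Download the rendered HTML of one article (requires network access)."""
    request = urllib.request.Request(
        build_parse_url(lang_id, title),
        # Hypothetical client name; Wikimedia asks clients to identify themselves.
        headers={"User-Agent": "wiki2corpus-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["parse"]["text"]


def html_to_text(html, stoplist_name="English"):
    """Strip boilerplate with jusText and return the remaining paragraphs."""
    import justext  # third-party dependency, imported only when needed

    paragraphs = justext.justext(html, justext.get_stoplist(stoplist_name))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
```

A call such as `html_to_text(fetch_html("en", "Text corpus"))` would then yield plain text ready to be passed to a tokenizer. The stoplist name should match the language of the downloaded Wikipedia.
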
== Requirements ==

* justext
* unitok

== Get wiki2corpus ==

See [wiki:Downloads] for the latest version.

== Licence ==

wiki2corpus is licensed under the [http://choosealicense.com/licenses/mit/ MIT licence].