unitok
- splits input text into tokens (one token per line)
- recognizes URLs, e-mail addresses, DNS domains, IP addresses
- recognizes abbreviations and clitics (such as 've or n't in English) for specified languages
- preserves XML-like tags
- replaces entities with their Unicode equivalents
- adds glue (<g/>) tags between tokens not separated by space
Python 2.7 is required; Python 3 compatibility will be added soon.
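To make the output format described in the list above concrete, here is a minimal sketch of a tokenizer that emits one token per line and inserts `<g/>` between tokens not separated by whitespace. It is not unitok's code: the regular expressions, the `TOKEN_RE` name, and the `tokenize` function are assumptions made for illustration, and the sketch is written for Python 3 standard library only. It ignores language-specific abbreviations and clitics, DNS domains, and IP addresses, and it places glue next to XML-like tags just as it does next to any other token, which may differ from what unitok does.

```python
import html
import re

# Simplified, illustrative patterns tried in order (not unitok's own rules).
TOKEN_RE = re.compile(
    r"""
    (?P<tag><[^<>]+>)                                   # XML-like tag, kept verbatim
  | (?P<entity>&(?:\#[0-9]+|\#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);)
  | (?P<url>https?://[^\s<>]+[^\s<>.,;:!?)])            # URL without trailing punctuation
  | (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)             # e-mail address
  | (?P<word>\w+(?:[-']\w+)*)                           # word, inner hyphen/apostrophe kept
  | (?P<other>[^\w\s])                                  # any other single symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return output lines: one token per line, with <g/> glue lines."""
    lines = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # Only whitespace is ever skipped by the pattern, so two matches that
        # touch were not separated by space in the input.
        if prev_end is not None and match.start() == prev_end:
            lines.append("<g/>")
        token = match.group()
        if match.lastgroup == "entity":
            # Replace the entity with its Unicode equivalent.
            token = html.unescape(token)
        lines.append(token)
        prev_end = match.end()
    return lines


if __name__ == "__main__":
    print("\n".join(tokenize("Mail joe@example.com, or see http://example.com/x.")))
```

Run as a script, the sketch prints `Mail`, `joe@example.com`, `<g/>`, `,`, `or`, `see`, `http://example.com/x`, `<g/>`, and `.` on separate lines. Tracking the end offset of the previous match is what allows glue to be detected without a separate pass over the whitespace.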
Paper | Cite | Licence
Get unitok
See Downloads for the latest version.
Licence
Unitok is licensed under Mozilla Public License Version 2.0.
Last modified on 12/11/25 13:27:39

