- splits input text into tokens (one token per line)
- recognizes URLs, e-mail addresses, DNS domains, IP addresses
- recognizes abbreviations and clitics (such as 've or n't in English) for specified languages
- preserves XML-like tags
- replaces entities with Unicode equivalents
- adds glue (<g/>) tags between tokens not separated by space
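
The behaviour listed above can be illustrated with a small Python sketch. This is not Unitok's code or its API: the patterns and the `tokenize` function are hypothetical and heavily simplified, and language-specific rules (abbreviations, clitics), DNS domains, and IP addresses are left out. It only shows the general idea of emitting one token per item, keeping XML-like tags intact, decoding entities, and inserting `<g/>` where two tokens were not separated by whitespace. Treating tags like ordinary tokens for glue purposes is an assumption of this sketch.

```python
import html
import re

# Hypothetical, simplified token patterns (not Unitok's actual rules).
# Earlier alternatives take priority over later ones.
PATTERNS = [
    r"<[^>]+>",                          # XML-like tag, kept as a single token
    r"https?://[^\s<>]*[^\s<>.,;:!?)]",  # URL, without trailing punctuation
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",     # e-mail address
    r"&[#\w]+;",                         # character entity such as &amp; or &#233;
    r"\w+",                              # plain word or number
    r"[^\w\s]",                          # any other single non-space character
]
TOKEN_RE = re.compile("|".join(PATTERNS))


def tokenize(text):
    """Return a list of tokens, with "<g/>" between tokens that touch."""
    out = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # No whitespace between the previous token and this one: add glue.
        if prev_end is not None and match.start() == prev_end:
            out.append("<g/>")
        token = match.group()
        # Decode entities in everything except tags, which are preserved as-is.
        if not token.startswith("<"):
            token = html.unescape(token)
        out.append(token)
        prev_end = match.end()
    return out


if __name__ == "__main__":
    # One token per line, as in the description above.
    print("\n".join(tokenize("Mail bob@example.com, see http://example.com/a.")))
```

Joining the returned list with newlines gives the one-token-per-line layout; the comma and the final full stop are each preceded by a `<g/>` line because they directly follow another token in the input.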
See Downloads for the latest version.
Unitok is licensed under Mozilla Public License Version 2.0.
Last modified on Aug 7, 2015, 3:11:55 PM