
unitok

  • splits input text into tokens (one token per line)
  • recognizes URLs, e-mail addresses, DNS domains, IP addresses
  • for specified languages recognizes abbreviations and clitics (such as 've or n't in English)
  • preserves XML-like tags
  • replaces entities with unicode equivalents
  • adds glue (<g/>) tags between tokens not separated by space
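
To make the vertical output and the glue tags more concrete, here is a minimal sketch of the idea in Python 3. It is not Unitok's code: the pattern `TOKEN_RE` and the function `tokenize_with_glue` are hypothetical names chosen for illustration, and the sketch only separates words from punctuation. It does not recognize URLs, abbreviations, clitics or XML-like tags. Entities are decoded with the standard library's `html.unescape`, which is assumed here as a stand-in for the entity replacement step.

```python
import html
import re

# Hypothetical, heavily simplified pattern: a run of word characters,
# or a single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_with_glue(text):
    """Return a list of output lines: one token per line, with <g/>
    between tokens that were not separated by whitespace."""
    # Replace entities such as &amp; with their Unicode equivalents.
    text = html.unescape(text)
    lines = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # Emit glue when this token starts exactly where the previous one ended.
        if prev_end is not None and match.start() == prev_end:
            lines.append("<g/>")
        lines.append(match.group())
        prev_end = match.end()
    return lines


if __name__ == "__main__":
    print("\n".join(tokenize_with_glue("Hello, world!")))
```

Run as a script, this prints `Hello`, `<g/>`, `,`, `world`, `<g/>` and `!`, each on its own line: the comma and the exclamation mark are glued to the preceding word, while `world` is not glued because a space precedes it.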

Requires Python 2.7; Python 3 compatibility will be added soon.

Paper | Cite | Licence

Get unitok

See Downloads for the latest version.

Licence

Unitok is licensed under the Mozilla Public License, version 2.0.

Last modified on 12/11/25 13:27:39