
unitok

  • splits input text into tokens (one token per line)
  • recognizes URLs, e-mail addresses, DNS domains and IP addresses
  • for specified languages, recognizes abbreviations and clitics (such as 've or n't in English)
  • preserves XML-like tags
  • replaces entities with their Unicode equivalents
  • adds glue (<g/>) tags between tokens not separated by space
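
To make the behaviour listed above more concrete, here is a minimal Python sketch of this kind of tokenization. It is not unitok's code and does not use its API: the name tokenize and all the patterns are hypothetical and heavily simplified. It covers only tags, URLs, e-mail addresses, a few English clitics and entities; DNS domains, IP addresses and abbreviations are left out. It also assumes that glue is emitted only between two adjacent non-tag tokens, which may differ from what unitok actually does.

```python
import html
import re

# Hypothetical, simplified patterns for illustration only;
# they are NOT the rules used by unitok itself.
TOKEN_RE = re.compile(
    r"""
    (?P<tag><[^<>]+>)                             # XML-like tag, kept verbatim
  | (?P<url>https?://[^\s<>]*[^\s<>.,;:!?)])      # URL without trailing punctuation
  | (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)       # e-mail address
  | (?P<clitic>n't|'(?:ve|ll|re|s|d|m))(?!\w)     # a few English clitics
  | (?P<entity>&[#]?\w+;)                         # character entity
  | (?P<word>\w+?(?=n't(?!\w))|\w+)               # word, stopping before n't
  | (?P<punct>[^\w\s])                            # any other single character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return a list of output lines: one token, tag or <g/> per item."""
    lines = []
    prev_end = None        # end offset of the previous match
    prev_was_tag = False
    for m in TOKEN_RE.finditer(text):
        is_tag = m.lastgroup == "tag"
        # Assumption: glue marks two non-tag tokens with no space between them.
        if prev_end == m.start() and not is_tag and not prev_was_tag:
            lines.append("<g/>")
        # Tags pass through unchanged; entities in other tokens are decoded.
        lines.append(m.group() if is_tag else html.unescape(m.group()))
        prev_end = m.end()
        prev_was_tag = is_tag
    return lines


if __name__ == "__main__":
    # One token per line, as in the description above.
    print("\n".join(tokenize("<p>I don't know.</p>")))
```

Running the script prints <p>, I, do, <g/>, n't, know, <g/>, . and </p>, each on its own line. The sketch decodes entities only after matching, so that an escaped &lt; is not mistaken for the start of a tag.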

Get unitok

See Downloads for the latest version.

Licence

Unitok is licensed under the Mozilla Public License, version 2.0.

Last modified on Aug 7, 2015, 3:11:55 PM