unitok
- splits input text into tokens (one token per line)
 - recognizes URLs, e-mail addresses, DNS domains, IP addresses
 - for specified languages, recognizes abbreviations and clitics (such as 've or n't in English)
 - preserves XML-like tags
 - replaces entities with their Unicode equivalents
 - adds glue (<g/>) tags between tokens not separated by space
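
The behaviour listed above can be illustrated with a small Python sketch. This is not unitok's code or its API: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names made up for this example, the rules are heavily simplified, and only a handful of English clitics are covered. Entity replacement, abbreviations, DNS domains and IP addresses are left out, and treating XML-like tags as ordinary tokens when placing glue is an assumption that may differ from what unitok does.

```python
import re

# Hypothetical, simplified token pattern; NOT unitok's actual rules.
# Alternatives are tried in order, so the more specific ones come first.
TOKEN_RE = re.compile(
    r"""
    <[^<>]+>                           # XML-like tag, kept as one token
  | https?://[^\s<>]*[^\s<>.,;:!?)]    # URL without trailing punctuation
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+       # e-mail address
  | n't\b                              # English clitic n't
  | '(?:ve|re|ll|s|d|m)\b              # a few other English clitics
  | \w+(?=n't\b)                       # word stem directly before n't
  | \w+                                # ordinary word or number
  | [^\w\s]                            # any other single symbol
    """,
    re.VERBOSE | re.IGNORECASE,
)


def tokenize(text):
    """Return a list of tokens, with <g/> between tokens not separated by space."""
    out = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # Two tokens touch when the new one starts where the last one ended.
        if prev_end is not None and match.start() == prev_end:
            out.append("<g/>")
        out.append(match.group())
        prev_end = match.end()
    return out


if __name__ == "__main__":
    # One token per line, mirroring the output format described above.
    print("\n".join(tokenize("I don't know.")))
```

Run as a script, the sketch prints each token on its own line, with a `<g/>` line wherever two tokens were adjacent in the input (between `do` and `n't`, and between `know` and the full stop).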
 
Get unitok
See Downloads for the latest version.
Licence
Unitok is licensed under the Mozilla Public License, version 2.0.

        
        