unitok
- splits input text into tokens (one token per line)
 - recognizes URLs, e-mail addresses, DNS domains, IP addresses
 - for specified languages recognizes abbreviations and clitics (such as 've or n't in English)
 - preserves XML-like tags
 - replaces entities with Unicode equivalents
 - adds glue (<g/>) tags between tokens not separated by space
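
The output format described in the list above can be illustrated with a rough sketch. This is not unitok's code and does not reproduce its rules or interface: `toy_tokenize` and `TOKEN_RE` are hypothetical names, the pattern is a deliberately small assumption covering only tags, entities, URLs, e-mail addresses and a few English clitics, and abbreviations, DNS domains and IP addresses are left out.

```python
import html
import re

# Toy pattern for illustration only; these are NOT unitok's actual rules.
TOKEN_RE = re.compile(
    r"""
    </?\w[^<>]*>                        # XML-like tag, kept as a single token
  | &\#?\w+;                            # character entity such as &amp;
  | https?://[^\s<>]*[^\s<>.,;:!?)]     # URL without trailing punctuation
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+        # e-mail address
  | \w+(?=n't\b)                        # word stem before the clitic n't
  | n't\b                               # English clitic n't
  | '(?:ve|re|ll|s|d|m)\b               # other English clitics
  | \w+                                 # ordinary word
  | \S                                  # any other non-space character
    """,
    re.VERBOSE,
)


def toy_tokenize(text):
    """Return output lines: one token per line, with <g/> between
    tokens that were not separated by whitespace in the input."""
    lines = []
    prev_end = None
    for match in TOKEN_RE.finditer(text):
        # Adjacent tokens (no whitespace in between) get a glue tag.
        if prev_end is not None and match.start() == prev_end:
            lines.append("<g/>")
        token = match.group()
        # Tags are preserved as-is; entities become Unicode characters.
        if not token.startswith("<"):
            token = html.unescape(token)
        lines.append(token)
        prev_end = match.end()
    return lines


if __name__ == "__main__":
    print("\n".join(toy_tokenize("<p>I don't know, see http://example.com.</p>")))
```

Run as a script, the sketch prints `<p>`, `<g/>`, `I`, `do`, `<g/>`, `n't`, `know`, `<g/>`, `,`, `see`, `http://example.com`, `<g/>`, `.`, `<g/>`, `</p>`, each on its own line. Real unitok output may differ in detail, for example in where glue tags are placed around markup.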
 
Get unitok
See Downloads for the latest version.
Licence
Unitok is licensed under Mozilla Public License Version 2.0.
          
          
            Last modified on 06/18/20 21:01:58
          
        
      
        
        