= unitok =

Universal text tokenizer for scripts using whitespace to separate tokens:
 * splits input text into tokens (one token per line)
 * recognizes URLs, e-mail addresses, DNS domains, IP addresses
 * for specified languages recognizes abbreviations and clitics (such as 've or n't in English)
 * preserves XML-like tags
 * replaces entities with Unicode equivalents
 * adds glue () tags between tokens not separated by space

Requires a configuration file defining tokens in the target language. Configuration files are provided in the directory configs. configs/other.py is the default configuration and can be used for any language written in a script that uses whitespace to separate tokens. Language-specific configuration files contain language-specific token regexps, e.g. abbreviations common in the language.

The input is expected to be uninormed, i.e. it has to be character-normalized with the standalone script uninorm.py; see the usage example below.

{{{
#!html
Paper | Cite | Licence
}}}

== Usage example (English) ==

{{{
python uninorm.py < input.txt | python unitok.py --trim 100 configs/english.py > output.vert
}}}

== Get unitok ==

See [wiki:Downloads] for the latest version.

== Licence ==

Unitok is licensed under [https://www.mozilla.org/MPL/2.0 Mozilla Public License Version 2.0].
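
== Reading the output (sketch) ==

The feature list above describes output with one token per line, XML-like tags preserved, and glue tags between tokens that were not separated by a space. As a rough illustration of how such output could be turned back into running text, here is a minimal Python sketch. It is not part of unitok: the function name join_vertical is hypothetical, the glue tag spelling `<g/>` is an assumption passed in as a parameter, and the sample lines are written by hand rather than produced by the tokenizer.

```python
def join_vertical(lines, glue_tag="<g/>"):
    """Rebuild running text from one-token-per-line input.

    glue_tag is an ASSUMED spelling of the glue tag; adjust it to
    whatever the tokenizer actually emits.
    """
    parts = []
    glued = False  # True when the previous line was a glue tag
    for raw in lines:
        token = raw.strip()
        if not token:
            continue  # ignore empty lines
        if token == glue_tag:
            glued = True
            continue
        if token.startswith("<") and token.endswith(">"):
            # Crude check: treat any other <...> line as a structural
            # tag and leave it out of the text.
            glued = False
            continue
        if parts and not glued:
            parts.append(" ")  # tokens were separated by a space
        parts.append(token)
        glued = False
    return "".join(parts)


# Hand-written sample, not real tokenizer output.
sample = ["<doc>", "I", "<g/>", "'ve", "seen", "it", "<g/>", ".", "</doc>"]
text = join_vertical(sample)
```

For the sample above, join_vertical returns "I've seen it.". The tag check is deliberately simple; a real consumer of the output would likely need to handle tags with attributes and tokens that themselves look like tags.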