= unitok =
Universal text tokenizer for scripts that use whitespace to separate tokens:
* splits input text into tokens (one token per line)
* recognizes URLs, e-mail addresses, DNS domains, IP addresses
* for specified languages, recognizes abbreviations and clitics (such as 've or n't in English)
* preserves XML-like tags
* replaces entities with their Unicode equivalents
* adds glue (`<g/>`) tags between tokens not separated by space
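
The one-token-per-line output with glue tags can be illustrated with a minimal sketch. This is not unitok's implementation: the function name `tokenize_with_glue` and the pattern `TOKEN_RE` are hypothetical, the glue tag is assumed to be written `<g/>`, and the pattern ignores URLs, abbreviations, clitics and XML-like tags, which the real tokenizer takes from its configuration file.

```python
import re

# Hypothetical token pattern (an assumption, not unitok's own regexps):
# a run of word characters, or a single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize_with_glue(text):
    """Return tokens in order, with "<g/>" between any two tokens
    that were not separated by whitespace in the input."""
    lines = []
    for chunk in text.split():            # whitespace-separated chunks
        pieces = TOKEN_RE.findall(chunk)  # tokens inside one chunk
        for i, piece in enumerate(pieces):
            if i > 0:
                lines.append("<g/>")      # no space between these two tokens
            lines.append(piece)
    return lines

# One token (or glue tag) per line, as in the vertical output format.
print("\n".join(tokenize_with_glue("Hello, world!")))
```

For the input `Hello, world!` the sketch prints `Hello`, `<g/>`, `,`, `world`, `<g/>`, `!` on separate lines, so the original spacing can be restored from the token stream.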
Requires a configuration file defining tokens in the target language.
Configuration files are provided in the configs directory.
configs/other.py is the default configuration that can be used for any language
written in a script using whitespace to separate tokens.
Language-specific configuration files contain language-specific token regexps,
e.g. abbreviations common in the language.
The input is expected to be character-normalized beforehand with the
standalone script uninorm.py; see the usage example below.
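
As a rough illustration of what character normalization means, the sketch below uses Python's standard `unicodedata` module. It does not reproduce uninorm.py: which normalization form or additional character fixes that script applies is not stated here, so the NFC default and the function name `normalize_chars` are assumptions.

```python
import unicodedata

def normalize_chars(text, form="NFC"):
    """Return text in the given Unicode normalization form.
    NFC is an assumed default, not necessarily what uninorm.py uses."""
    return unicodedata.normalize(form, text)

# "e" followed by a combining acute accent is composed into a single "é".
decomposed = "cafe\u0301"
composed = normalize_chars(decomposed)
print(len(decomposed), len(composed))
```

Normalizing first means the same visible word always reaches the tokenizer as the same character sequence.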
== Usage example (English) ==
{{{
python uninorm.py < input.txt | python unitok.py --trim 100 configs/english.py > output.vert
}}}
== Get unitok ==
See [wiki:Downloads] for the latest version.
== Licence ==
Unitok is licensed under [https://www.mozilla.org/MPL/2.0 Mozilla Public License Version 2.0].