This is a joint portal of the [http://nlp.fi.muni.cz Masaryk University's NLP Centre] and [http://www.sketchengine.co.uk Lexical Computing] focusing on various tools for corpus processing. {{{ #!html
Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set.
Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.
JusText is an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.
}}}
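
To illustrate the “vertical” format mentioned above, here is a minimal Python sketch of one-token-per-line output that keeps XML-like tags whole. It is not Unitok's implementation: the pattern `TOKEN_RE` and the function `to_vertical` are hypothetical names introduced for this example, and real tokenisation rules are language-specific and considerably richer.

{{{
#!python
import re

# Illustrative pattern only (an assumption, not Unitok's rules):
# an XML-like tag, a run of word characters, or a single other symbol.
TOKEN_RE = re.compile(r"<[^>]+>|\w+|[^\w\s]")

def to_vertical(text):
    """Return the text as one token per line, keeping XML-like tags whole."""
    return "\n".join(TOKEN_RE.findall(text))

if __name__ == "__main__":
    print(to_vertical('<doc id="1">Hello, world!</doc>'))
}}}

Each tag, word and punctuation mark ends up on its own line, so the metadata in `<doc id="1">` survives as a single token.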
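
The idea of dropping duplicates based on a threshold can be sketched as follows. This is a simplified illustration under stated assumptions, not Onion's actual algorithm: it treats a paragraph as a duplicate when the share of its word n-grams already seen in earlier paragraphs exceeds the threshold. The names `shingles` and `deduplicate` and the default values `n=3` and `threshold=0.5` are hypothetical choices made for this example.

{{{
#!python
def shingles(text, n=3):
    """Return the set of word n-grams in the text (lower-cased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def deduplicate(paragraphs, threshold=0.5, n=3):
    """Keep paragraphs whose share of already-seen n-grams is at most the threshold."""
    seen = set()   # n-grams of all paragraphs kept so far
    kept = []
    for paragraph in paragraphs:
        grams = shingles(paragraph, n)
        # Paragraphs shorter than n words have no n-grams and are always kept.
        ratio = len(grams & seen) / len(grams) if grams else 0.0
        if ratio > threshold:
            continue  # mostly seen before: treat as a duplicate
        kept.append(paragraph)
        seen |= grams
    return kept

if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "an entirely different sentence about corpus tools",
    ]
    for paragraph in deduplicate(sample):
        print(paragraph)
}}}

Raising the threshold keeps more near-duplicates; lowering it makes the filter stricter.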