{{{ #!html

This is the joint portal of the Masaryk University's NLP Centre and Lexical Computing focusing on various tools for corpus processing.

Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set. Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.
JusText is a HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.
}}}