This is the joint portal of Masaryk University's NLP Centre and Lexical Computing, focusing on tools for corpus processing.
Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set.
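To illustrate threshold-based de-duplication in general, here is a minimal Python sketch. It is not Onion's implementation: the function names, the use of word trigrams, and the default threshold of 0.5 are all assumptions chosen for illustration. A paragraph is dropped when the share of its n-grams already seen in earlier paragraphs exceeds the threshold.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless more than `threshold` of its n-grams were seen before.

    Hypothetical helper for illustration; n and threshold are assumed defaults.
    """
    seen = set()
    kept = []
    for text in paragraphs:
        grams = ngrams(text.lower().split(), n)
        if not grams:
            # Fewer than n tokens: nothing to compare, so keep the paragraph.
            kept.append(text)
            continue
        overlap = len(grams & seen) / len(grams)
        if overlap <= threshold:
            kept.append(text)
        # Remember these n-grams so later paragraphs are compared against them.
        seen |= grams
    return kept


paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an entirely different sentence about corpus processing tools",
]
unique = deduplicate(paragraphs)
```

In this example the second paragraph shares six of its seven trigrams with the first, so it is dropped. Raising the threshold keeps more near-duplicates; lowering it removes them more aggressively.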
Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.
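The "vertical" format can be sketched with a small regular-expression tokeniser in Python. This is only an illustration of the output shape, not Unitok's rules: the pattern and the function name `to_vertical` are assumptions, and real language-specific handling (abbreviations, clitics, URLs, and so on) is far more involved.

```python
import re

# Assumed, simplified pattern: an XML-like tag is kept as one token,
# then runs of word characters, then any single punctuation character.
TOKEN_RE = re.compile(r"<[^<>]+>|\w+|[^\w\s]")


def to_vertical(text):
    """Return the text as newline-separated tokens, with tags left intact."""
    return "\n".join(TOKEN_RE.findall(text))


vertical = to_vertical('<doc id="1">Hello, world!</doc>')
```

Each token ends up on its own line, and the `<doc id="1">` and `</doc>` tags pass through unchanged, so metadata attributes stay attached to the text they describe.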
JusText is an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.
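The idea of separating full-sentence text from navigation can be sketched with Python's standard library alone. This is a deliberately crude stand-in, not JusText's classifier: the class and function names, the list of block tags, and both cut-offs (`min_words`, `max_link_density`) are assumptions made for the example. A block is kept only if it is long enough and little of its text sits inside links.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a text block.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "td", "nav", "header", "footer"}


class BlockCollector(HTMLParser):
    """Collect (text, link_chars) pairs, one per block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        # Normalise whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(html, min_words=8, max_link_density=0.3):
    """Return blocks that look like running text (hypothetical thresholds)."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # store any trailing text outside a block tag
    kept = []
    for text, link_chars in parser.blocks:
        link_density = link_chars / len(text)
        if len(text.split()) >= min_words and link_density <= max_link_density:
            kept.append(text)
    return kept


page = (
    '<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
    "<p>JusText keeps paragraphs that look like full sentences "
    "and drops short navigation blocks.</p>"
    "<footer>Copyright 2010</footer>"
)
main_text = extract_main_text(page)
```

Here the navigation bar and the footer are discarded as too short, and only the paragraph survives. The sketch does not skip `<script>` or `<style>` content, which a real tool would need to handle.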