This is a joint portal of [http://nlp.fi.muni.cz Masaryk University's NLP Centre] and [http://www.sketchengine.co.uk Lexical Computing], dedicated to software tools for corpus processing. If you have questions or suggestions, please join the [https://groups.google.com/a/sketchengine.co.uk/forum/#!forum/noske NoSketch Engine] Google group, where you can discuss them with the developers and other users.

{{{
#!html

JusText is an HTML boilerplate removal tool. It strips navigation links, headers, footers and similar page furniture from HTML pages, leaving only regular text made up of full sentences.

Chared is a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.

Spiderling is a web spider for linguistics. It crawls text-rich parts of the web and collects large amounts of data suitable for text corpora.

Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It measures the similarity of paragraphs or whole documents and drops duplicates according to a threshold you set.

Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.

NoSketch Engine is the open-source little brother of the corpus querying system Sketch Engine.

wiki2corpus is a script that downloads Wikipedia articles for a given language and outputs them in prevertical format, which can be further processed by other corpus tools.

Language Filter is a language-discriminating tool that works with the vertical format. It determines the language of paragraphs and documents from pre-defined word lists with corpus frequencies.

}}}
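
To make the "vertical" format and threshold-based de-duplication described above more concrete, here is a minimal Python sketch. It is not the code of Unitok or Onion: the token pattern, the function names (`to_vertical`, `ngrams`, `deduplicate`) and the default values `n=3` and `threshold=0.5` are assumptions made purely for illustration, and the real tools use language-specific rules and far more scalable data structures.

```python
import re

# Hypothetical token pattern (an assumption, not Unitok's rules):
# an XML-like tag, a run of word characters, or a single symbol.
TOKEN_RE = re.compile(r"<[^<>]+>|\w+|[^\w\s]")


def to_vertical(text):
    """Return the text as one token per line, keeping XML-like tags whole."""
    return "\n".join(TOKEN_RE.findall(text))


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) occurring in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph only if the share of its n-grams already seen
    in earlier kept paragraphs is at most `threshold`."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        tokens = [t.lower() for t in TOKEN_RE.findall(paragraph)]
        grams = ngrams(tokens, n)
        # Paragraphs shorter than n tokens have no n-grams and are kept.
        if grams:
            overlap = len(grams & seen) / len(grams)
            if overlap > threshold:
                continue  # near-duplicate of earlier text: drop it
        kept.append(paragraph)
        seen |= grams
    return kept
```

Under these assumptions, `to_vertical('<doc id="1">Hello, world!</doc>')` yields six lines (the opening tag, `Hello`, `,`, `world`, `!` and the closing tag), and `deduplicate` drops an exact repeat of an earlier paragraph. Raising `threshold` makes the filter more tolerant of partial overlap.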