Corpus tools

This is a joint portal of Masaryk University's NLP Centre and Lexical Computing, dedicated to software tools for corpus processing.

If you have any questions or suggestions, please subscribe to the NoSketch Engine Google group, where you can get involved in the discussion with the developers and other users.

JusText is an HTML boilerplate removal tool. It strips navigation links, headers, footers, and similar page furniture from HTML pages and leaves only regular text containing full sentences.

Paper | Cite | Licence
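
To illustrate the idea of boilerplate removal, here is a minimal sketch that keeps only text blocks that are long enough and contain enough common function words. This is not JusText's actual algorithm. The stopword list, the thresholds, and the function names `looks_like_text` and `strip_boilerplate` are assumptions made up for this example.

```python
# Tiny illustrative stopword list (an assumption, not a real language model).
STOPWORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "it",
    "that", "for", "on", "with", "as", "was",
}


def looks_like_text(block, min_words=8, min_stopword_ratio=0.25):
    """Return True if a block looks like running text rather than page furniture."""
    words = block.lower().split()
    if len(words) < min_words:
        # Short blocks are typically menus, captions, or footers.
        return False
    stopword_count = sum(1 for w in words if w.strip(".,;:!?") in STOPWORDS)
    # Full sentences tend to contain a high share of function words.
    return stopword_count / len(words) >= min_stopword_ratio


def strip_boilerplate(blocks):
    """Keep only the blocks that pass the heuristic above."""
    return [block for block in blocks if looks_like_text(block)]
```

Navigation bars and copyright notices usually fail both checks, while ordinary paragraphs pass. A real tool would also have to consider link density and the classification of neighbouring blocks.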

Chared is a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.

Paper | Cite | Licence
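
Knowing the language makes encoding detection easier, because a wrong decoding usually yields characters that the language does not use. The sketch below is a deliberately simplified illustration of that idea and does not reproduce Chared's models. The function name `guess_encoding`, the candidate list, and the Czech alphabet string are assumptions for this example.

```python
# Characters expected in Czech text (an illustrative assumption).
CZECH_ALPHABET = set("abcdefghijklmnopqrstuvwxyzáčďéěíňóřšťúůýž ")


def guess_encoding(data, candidates=("utf-8", "cp1250", "iso8859_2"),
                   alphabet=CZECH_ALPHABET):
    """Pick the candidate encoding whose decoding best fits the alphabet."""
    best_encoding = None
    best_score = -1
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            # Bytes are not valid in this encoding at all.
            continue
        # Count decoded characters that belong to the expected alphabet.
        score = sum(1 for ch in text.lower() if ch in alphabet)
        if score > best_score:
            best_encoding, best_score = encoding, score
    return best_encoding
```

For example, `guess_encoding("Příliš žluťoučký kůň".encode("cp1250"))` selects `cp1250`, because the other candidates either fail to decode or produce characters outside the alphabet.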

Spiderling is a web spider for linguistics. It crawls text-rich parts of the web and collects large amounts of data suitable for text corpora.

Paper | Cite | Licence

Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set.

Paper | Cite | Licence
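
Threshold-based de-duplication can be sketched as follows: each paragraph is split into word n-grams, and a paragraph is dropped when the share of its n-grams already seen in earlier paragraphs exceeds the threshold. This is an illustration only and not Onion's implementation. The function names and the default values `n=3` and `threshold=0.5` are assumptions chosen for this example.

```python
def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep paragraphs whose share of already-seen n-grams is at most `threshold`."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = ngrams(paragraph.lower().split(), n)
        if grams:
            duplicate_ratio = len(grams & seen) / len(grams)
        else:
            # Fewer than n tokens: nothing to compare, so keep the paragraph.
            duplicate_ratio = 0.0
        if duplicate_ratio > threshold:
            continue  # near-duplicate of earlier content
        kept.append(paragraph)
        seen |= grams
    return kept
```

Lowering the threshold removes more near-duplicates at the cost of discarding paragraphs that merely share common phrases.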

Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.

Paper | Cite | Licence
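
The "vertical" format can be illustrated with a minimal regex-based sketch: XML-like tags are kept whole, and the remaining text is split into words and punctuation marks, one token per line. This is not Unitok's rule set. The pattern `TOKEN_RE` and the function `to_vertical` are assumptions for this example, and no language-specific handling is attempted.

```python
import re

# Order matters: try a whole XML-like tag first, then a word, then a single
# punctuation or symbol character. Whitespace is skipped.
TOKEN_RE = re.compile(r"<[^<>]+>|\w+|[^\w\s]")


def to_vertical(text):
    """Return the text as newline-separated tokens, keeping tags intact."""
    return "\n".join(TOKEN_RE.findall(text))


print(to_vertical('<doc id="1">Hello, world!</doc>'))
```

The call above prints the opening tag, `Hello`, `,`, `world`, `!`, and the closing tag on six separate lines.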

NoSketch Engine is the open-sourced little brother of the corpus querying system Sketch Engine.

Paper | Cite | Licence

wiki2corpus is a script that downloads Wikipedia articles for a given language and outputs them in prevertical format, which can be further processed by other corpus tools.
