Corpus tools

This is a joint portal of the NLP Centre at Masaryk University and Lexical Computing, dedicated to software tools for corpus processing, including the well-known corpus manager Sketch Engine.

If you have any questions or suggestions, please subscribe to the NoSketch Engine Google group, where you can get involved in the discussion with the developers and other users.

JusText is an HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.

Paper | Cite | Licence
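
To make the idea of boilerplate removal concrete, here is a minimal sketch of a paragraph classifier. It is not JusText's implementation or API: the function name `is_boilerplate`, the tiny `STOPWORDS` set, the `link_chars` argument and all thresholds are assumptions chosen for illustration. It only shows the general intuition that regular text tends to be longer, contains many function words, and has few characters inside links.

```python
# Hypothetical, simplified heuristic; not the JusText algorithm or API.
# A tiny English stopword sample, assumed for this example only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "was"}


def is_boilerplate(text, link_chars=0, min_length=40,
                   min_stopword_density=0.2, max_link_density=0.3):
    """Return True if a paragraph looks like navigation, header or footer text.

    text       -- plain text of one paragraph
    link_chars -- number of characters of `text` that sit inside links
    """
    words = text.lower().split()
    if not words:
        return True
    # Paragraphs made mostly of links are typically menus or footers.
    if link_chars / len(text) > max_link_density:
        return True
    # Very short paragraphs rarely contain full sentences.
    if len(text) < min_length:
        return True
    # Full sentences contain a fair share of function words.
    hits = sum(w.strip(".,;:!?") in STOPWORDS for w in words)
    return hits / len(words) < min_stopword_density


menu = "Home | About | Contact"
sentence = ("The spider crawls text-rich parts of the web and "
            "collects a lot of data for corpora.")
print(is_boilerplate(menu, link_chars=18))
print(is_boilerplate(sentence))
```

A real tool would also need to parse the HTML into paragraphs and would likely take the classification of neighbouring paragraphs into account; the sketch leaves both out.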

Chared is a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.

Paper | Cite | Licence

Spiderling is a web spider for linguistics. It can crawl text-rich parts of the web and collect large amounts of data suitable for text corpora.

Paper | Cite | Licence

Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicates according to a threshold you set.

Paper | Cite | Licence
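
The threshold-based de-duplication that Onion is described as performing can be illustrated with a small sketch. This is not Onion's code or command-line interface: the function names, the choice of word n-grams as the similarity measure, and the default values of `n` and `threshold` are assumptions made for this example.

```python
# Illustrative sketch only; names and defaults are hypothetical.
def ngrams(tokens, n):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless too many of its n-grams were already seen.

    A paragraph is dropped when the share of its n-grams that occurred in
    previously kept paragraphs is greater than `threshold`.
    """
    seen = set()
    kept = []
    for text in paragraphs:
        grams = ngrams(text.lower().split(), n)
        # Paragraphs shorter than n tokens have no n-grams and are kept.
        overlap = len(grams & seen) / len(grams) if grams else 0.0
        if overlap > threshold:
            continue
        kept.append(text)
        seen |= grams
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about corpus linguistics",
]
print(deduplicate(docs))
```

Lowering `threshold` makes the filter stricter (near-duplicates are dropped more readily), while raising it keeps more partially overlapping text.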

Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.

Paper | Cite | Licence
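
The "vertical" format mentioned for Unitok, one token per line with XML-like tags kept intact, can be shown with a short sketch. This is not Unitok itself and does not reflect its language-specific settings: the regular expression `TOKEN_RE` and the function `to_vertical` are hypothetical and handle only simple cases.

```python
import re

# Hypothetical pattern, not Unitok's rules. Alternatives, in order:
#   1. an XML-like tag, kept as a single token
#   2. a word, optionally with internal hyphens or apostrophes
#   3. any single non-space, non-word character (punctuation)
TOKEN_RE = re.compile(r"<[^<>]+>|\w+(?:[-']\w+)*|[^\w\s]")


def to_vertical(text):
    """Return the text as newline-separated tokens ("vertical" format)."""
    return "\n".join(TOKEN_RE.findall(text))


print(to_vertical('<doc id="1">Hello, world!</doc>'))
```

The tag alternative comes first in the pattern so that metadata such as `<doc id="1">` is not split at its spaces or quotes. A production tokeniser needs many more rules (abbreviations, URLs, numbers, clitics), which is where per-language settings come in.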

NoSketch Engine is the open-source little brother of the corpus querying system Sketch Engine.

Paper | Cite | Licence