Corpus tools


Corpus.Tools is a joint portal of Masaryk University's NLP Centre and Lexical Computing, dedicated to a range of software tools for text corpus processing, including the widely used corpus software Sketch Engine.

It offers advanced corpus tools for language processing and research. The tools cover corpus building and corpus analysis, and help linguists, language technology experts, and NLP engineers efficiently process large volumes of language data.

These corpus tools streamline working with large text datasets across many languages. They are designed to clean and deduplicate documents and text data, to compile and annotate them, and to analyse them using linguistic and statistical criteria. The tools are language-independent and suitable for major languages as well as low-resource and minority languages.

If you have questions, join the NoSketch Engine Google group to connect with the developers and other users.

JusText is an HTML boilerplate removal tool. It removes navigation links, headers, footers, and similar content from HTML pages and keeps only the main body of text containing complete sentences. It is especially useful for collecting texts suitable for linguistic analysis.

Paper | Cite | Licence
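
To illustrate the kind of decision a boilerplate remover has to make, here is a minimal Python sketch. It is not JusText's implementation: the `Block` type, the thresholds, and the tiny stop-word list are all hypothetical, chosen only to show how text length, link density, and stop-word density might separate running text from navigation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, tiny stop-word list; a real tool would ship one per language.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "that"}


@dataclass
class Block:
    text: str        # visible text of one paragraph-like HTML block
    link_chars: int  # number of characters of that text inside <a> elements


def is_boilerplate(block: Block,
                   min_length: int = 40,
                   max_link_density: float = 0.3,
                   min_stopword_density: float = 0.2) -> bool:
    """Return True if the block looks like navigation, a header or a footer."""
    words = block.text.lower().split()
    if not words or len(block.text) < min_length:
        return True  # too short to count as running text
    link_density = block.link_chars / len(block.text)
    if link_density > max_link_density:
        return True  # mostly link text, typical of menus
    stopword_hits = sum(w.strip(".,;:!?") in STOPWORDS for w in words)
    return stopword_hits / len(words) < min_stopword_density


def main_text(blocks: List[Block]) -> List[str]:
    """Keep only the blocks that are not classified as boilerplate."""
    return [b.text for b in blocks if not is_boilerplate(b)]
```

In this sketch a short menu line such as "Home | About | Contact" is dropped on length alone, while a full sentence with ordinary function words and no links is kept.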

Chared is a tool for detecting the character encoding of a text in a known language. It contains models for a wide range of languages.

Paper | Cite | Licence
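
Knowing the language in advance makes encoding detection much easier, as the following Python sketch suggests. It is a simplified illustration and does not reproduce Chared's models: the `CZECH_CHARS` set, the `score` function, and the list of candidate encodings are assumptions made for the example. Each candidate decoding is scored by how many of its characters are expected in Czech text.

```python
# Hypothetical "model" for Czech: characters expected in ordinary text.
CZECH_CHARS = set("aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž0123456789 .,;:!?-\n")


def score(text, expected):
    """Fraction of characters that belong to the expected character set."""
    if not text:
        return 0.0
    return sum(ch.lower() in expected for ch in text) / len(text)


def detect_encoding(data, candidates, expected=CZECH_CHARS):
    """Pick the candidate encoding whose decoding looks most like the language."""
    best_name, best_score = None, -1.0
    for name in candidates:
        try:
            decoded = data.decode(name)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        current = score(decoded, expected)
        if current > best_score:  # ties keep the earlier candidate
            best_name, best_score = name, current
    if best_name is None:
        raise ValueError("no candidate encoding could decode the data")
    return best_name
```

A usage example under the same assumptions: `detect_encoding(raw_bytes, ["utf-8", "cp1250", "iso-8859-2"])` returns the name of the candidate whose decoding contains the largest share of expected Czech characters.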

Spiderling is a web spider for linguistics. It crawls the web and collects linguistically valuable text-rich web pages, suitable for building text corpora and linguistic datasets.

Paper | Cite | Licence

Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It measures the similarity of paragraphs or whole documents and removes duplicate texts based on the threshold set by the user. It is mainly useful for removing duplicated (shared, reposted, republished) content from texts intended for text corpora.

Paper | Cite | Licence
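
The idea of removing texts whose similarity to earlier content exceeds a user-set threshold can be sketched in Python as follows. This is a simplified illustration, not Onion's implementation: the function names, the choice of word 5-grams, and the default threshold are assumptions made for the example.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep a paragraph unless too many of its n-grams were already seen.

    `threshold` is the maximum allowed share of previously seen n-grams.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = ngrams(paragraph.lower().split(), n)
        if grams:
            duplicated = sum(g in seen for g in grams) / len(grams)
        else:
            duplicated = 0.0  # shorter than n tokens: nothing to compare
        if duplicated > threshold:
            continue  # near-duplicate of earlier text, drop it
        kept.append(paragraph)
        seen.update(grams)
    return kept
```

Raising `threshold` keeps more partially overlapping paragraphs; lowering it removes them more aggressively. Exact repeats are dropped for any threshold below 1.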

Unitok is a universal text tokenizer with customizable settings for many languages. It turns plain text into a sequence of newline-separated tokens (vertical format) while preserving XML-like tags containing metadata. It is designed for fast tokenization of extensive text collections, enabling the creation of large text corpora.

Paper | Cite | Licence
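
The vertical format can be illustrated with a short Python sketch. It is not Unitok's implementation and ignores language-specific settings: the regular expression and the `to_vertical` function are assumptions made for the example. It shows only the basic idea of one token per line with XML-like tags kept whole.

```python
import re

# One alternation, tried left to right: an XML-like tag, a word (optionally
# with internal hyphens or apostrophes), or any single non-space symbol.
TOKEN_RE = re.compile(r"<[^<>]+>|\w+(?:[-'’]\w+)*|[^\w\s]")


def to_vertical(text):
    """Return the text as newline-separated tokens, keeping tags intact."""
    return "\n".join(TOKEN_RE.findall(text))
```

With this sketch, a tag such as `<doc id="1">` stays on a single line, while punctuation marks are split from the neighbouring words and each placed on a line of their own.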

NoSketch Engine is the open-source little brother of the Sketch Engine corpus system. It includes tools such as a concordancer, frequency lists, keyword extraction, advanced searching using linguistic criteria, and many others.
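
For readers unfamiliar with these terms, the following Python sketch shows what a frequency list and a keyword-in-context concordance compute on a tokenized text. It is only an illustration and says nothing about how NoSketch Engine works internally: the function names and the `width` parameter are assumptions made for the example.

```python
from collections import Counter


def frequency_list(tokens, top=10):
    """Return the `top` most frequent lower-cased tokens with their counts."""
    return Counter(t.lower() for t in tokens).most_common(top)


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: (left context, keyword, right context)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

Each concordance line pairs one occurrence of the keyword with up to `width` tokens of context on either side, which is the basic view a concordancer presents.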