Changes between Version 46 and Version 47 of WikiStart


Timestamp:
08/28/25 09:04:38
Author:
admin
Comment:

improving texts from the view of SEO and including important keywords

  • WikiStart

    v46 v47  
    1 This is a joint portal of the [http://nlp.fi.muni.cz Masaryk University's NLP Centre] and [https://lexicalcomputing.com/ Lexical Computing] dedicated to a number of software tools for corpus processing including a well-known corpus manager [https://www.sketchengine.eu/ Sketch Engine].
     1Corpus.Tools is a joint portal of [http://nlp.fi.muni.cz Masaryk University's NLP Centre] and [https://lexicalcomputing.com/ Lexical Computing], dedicated to a range of software tools for text corpus processing, including the widely used corpus software [https://www.sketchengine.eu/ Sketch Engine].
    22
    3 If you have any questions or suggestions, please subscribe to the [https://groups.google.com/a/sketchengine.co.uk/g/noske NoSketch Engine] Google group, where you can get involved in the discussion with the developers and other users.
     3
     4
     5It offers advanced corpus tools for language processing and research. There are tools for corpus analysis and corpus building, helping linguists, language technology experts, and NLP engineers efficiently process large volumes of language data.
     6
     7These corpus tools streamline working with large text datasets across many languages. They are designed to clean and deduplicate documents and text data, compile and annotate them, and analyse them using linguistic and statistical criteria. The tools are language-independent and suitable for major languages as well as low-resourced and minority languages.
     8
     9If you have questions, join the [https://groups.google.com/a/sketchengine.co.uk/g/noske NoSketch Engine Google group] to connect with the developers and other users.
     10
     11
     12
     13
    414
    515{{{
     
    919<td class="app" style="background-color:#DDA0DD ; background-image:url('/chrome/site/justext_nb.png')">
    1020<p><a href="/wiki/Justext">
    11 JusText is a HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.</a><p>
     21JusText is an HTML boilerplate removal tool. It can remove navigation links, headers, footers, etc. from HTML pages and keep only the main body of text containing complete sentences. It is especially useful for collecting texts suitable for linguistic analysis.</a><p>
    1222<p>
    1323<a class="lnk" href="http://is.muni.cz/th/45523/fi_d/phdthesis.pdf">Paper</a>
     
    3444
    3545<td class="app" style="background-color:#20B2AA ; background-image:url('/chrome/site/spiderling_nb.png')">
    36 <p><a href="/wiki/SpiderLing">Spiderling is a web spider for linguistics. It can crawl text-rich parts of the web and collect a lot of data suitable for text corpora.
     46<p><a href="/wiki/SpiderLing">Spiderling is a web spider for linguistics. It crawls the web and collects linguistically valuable text-rich web pages, suitable for building text corpora and linguistic datasets.
    3747</a><p>
    3848<p>
     
    4757<td class="app" style="background-color:#6B8E23 ; background-image:url('/chrome/site/onion_nb.png')">
    4858<p><a href="/wiki/Onion">
    49 Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set.</a></p>
     59Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It measures the similarity of paragraphs or whole documents and removes duplicate texts based on the threshold set by the user. It is mainly useful for removing duplicated (shared, reposted, republished) content from texts intended for text corpora.</a></p>
    5060<p>
    5161<a class="lnk" href="http://is.muni.cz/th/45523/fi_d/phdthesis.pdf">Paper</a>
     
    6171<td class="app" style="background-color:#9ACD32 ; background-image:url('/chrome/site/unitok_nb.png')">
    6272<p style="color:white;"><a href="/wiki/Unitok">
    63 Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.</a></p>
     73Unitok is a universal text tokenizer with customizable settings for many languages. It can turn plain text into a sequence of newline-separated tokens (vertical format) while preserving XML-like tags containing metadata. It is designed for fast tokenization of extensive text collections, enabling the creation of large text corpora.</a></p>
    6474<p>
    6575<a class="lnk" href="http://nlp.fi.muni.cz/raslan/raslan14.pdf#page=79">Paper</a>
     
    7282
    7383<td class="app" style="background-color:#9932CC ; background-image:url('/chrome/site/noske_icon_logo_only_white.png')">
    74 <p style="color:white;"><a href="http://nlp.fi.muni.cz/trac/noske">NoSketch Engine is the open-sourced little brother of the corpus querying system Sketch Engine.
     84<p style="color:white;"><a href="http://nlp.fi.muni.cz/trac/noske">NoSketch Engine is the open-sourced little brother of the Sketch Engine corpus system. It includes tools such as a concordancer, frequency lists, keyword extraction, advanced search by linguistic criteria, and many others.
    7585</a><p>
    7686<p>
     
    8797
    8898<td class="app black" style="background-color:#A52A2A; background-image: url('/chrome/site/w2c_44.png');">
    89 <p><a href="/wiki/wiki2corpus" style="color:white;">wiki2corpus is a script which downloads Wikipedia articles (for a given language) and outputs them in the form of prevertical which can be further processed by other corpus tools.
     99<p><a href="/wiki/wiki2corpus" style="color:white;">wiki2corpus is a script which downloads the main body of text from Wikipedia articles (for a given language) and outputs them in the form of a prevertical (plain text format) which can be fed to corpus tools including Sketch Engine and NoSketch Engine.
    90100</a><p>
    91101
     
    96106
    97107<td class="app black" style="background-color:#191970; background-image: url('/chrome/site/noske_nb.png');">
    98 <p><a href="/wiki/languagefilter" style="color:white;">Language Filter is a language discriminating tool. It works with the vertical format. The language of paragraphs and documents is determined according to pre-defined lists of words with corpus frequency.
     108<p><a href="/wiki/languagefilter" style="color:white;">Language Filter is a tool for distinguishing languages. It works with the vertical format. The language of paragraphs and documents is determined according to pre-defined word frequency lists (i.e. wordlists generated from large web corpora).
    99109</a><p>
    100110<p>