Changes between Version 46 and Version 47 of WikiStart
Timestamp: 08/28/25 09:04:38 (5 days ago)
WikiStart
v46 v47
  1     - This is a joint portal of the [http://nlp.fi.muni.cz Masaryk University's NLP Centre] and [https://lexicalcomputing.com/ Lexical Computing] dedicated to a number of software tools for corpus processing including a well-known corpus manager[https://www.sketchengine.eu/ Sketch Engine].
      1 + Corpus.Tools is a joint portal of [http://nlp.fi.muni.cz Masaryk University's NLP Centre] and [https://lexicalcomputing.com/ Lexical Computing], dedicated to a range of software tools for text corpus processing, including the widely used corpus software [https://www.sketchengine.eu/ Sketch Engine].
  2   2
  3     - If you have any questions or suggestions, please subscribe to the [https://groups.google.com/a/sketchengine.co.uk/g/noske NoSketch Engine] Google group, where you can get involved in the discussion with the developers and other users.
      3 +
      4 +
      5 + It offers advanced corpus tools for language processing and research. There are tools for corpus analysis and corpus building, helping linguists, experts in language technology, and NLP engineers process efficiently large language data.
      6 +
      7 + These corpus tools streamline working with large text datasets across many languages. They are designed to clean and deduplicate documents and text data, compile and annotate them, and to analyse them using linguistic and statistical criteria. The tools are language-independent, suitable for major languages as well as low-resourced and minority languages.
      8 +
      9 + If you have questions, join the [https://groups.google.com/a/sketchengine.co.uk/g/noske NoSketch Engine Google group] to connect with the developers and other users.
     10 +
     11 +
     12 +
     13 +
  4  14
  5  15   {{{
  …   …
  9  19   <td class="app" style="background-color:#DDA0DD ; background-image:url('/chrome/site/justext_nb.png')">
 10  20   <p><a href="/wiki/Justext">
 11     - JusText is a HTML boilerplate removal tool. It can strip navigation links, headers, footers, etc. from HTML pages and leave just regular text containing full sentences.</a><p>
     21 + JusText is a HTML boilerplate removal tool. It can remove navigation links, headers, footers, etc. from HTML pages and keep only the main body of text containing complete sentences. It is especially useful for collecting linguistically valuable texts suitable for linguistic analysis.</a><p>
 12  22   <p>
 13  23   <a class="lnk" href="http://is.muni.cz/th/45523/fi_d/phdthesis.pdf">Paper</a>
  …   …
 34  44
 35  45   <td class="app" style="background-color:#20B2AA ; background-image:url('/chrome/site/spiderling_nb.png')">
 36     - <p><a href="/wiki/SpiderLing">Spiderling is a web spider for linguistics. It can crawl text-rich parts of the web and collect a lot of data suitable for text corpora.
     46 + <p><a href="/wiki/SpiderLing">Spiderling is a web spider for linguistics. It crawls the web and collects linguistically valuable text-rich web pages, suitable for building text corpora and linguistic datasets.
 37  47   </a><p>
 38  48   <p>
  …   …
 47  57   <td class="app" style="background-color:#6B8E23 ; background-image:url('/chrome/site/onion_nb.png')">
 48  58   <p><a href="/wiki/Onion">
 49     - Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It can measure the similarity of paragraphs or whole documents and drop duplicate ones based on the threshold you set.</a></p>
     59 + Onion (ONe Instance ONly) is a de-duplicator for large collections of texts. It measures the similarity of paragraphs or whole documents and removes duplicate texts based on the threshold set by the user. It is mainly useful for removing duplicated (shared, reposted, republished) content from texts intended for text corpora.</a></p>
 50  60   <p>
 51  61   <a class="lnk" href="http://is.muni.cz/th/45523/fi_d/phdthesis.pdf">Paper</a>
  …   …
 61  71   <td class="app" style="background-color:#9ACD32 ; background-image:url('/chrome/site/unitok_nb.png')">
 62  72   <p style="color:white;"><a href="/wiki/Unitok">
 63     - Unitok is a universal text tokeniser with specific settings for many languages. It can turn plain text into a sequence of newline-separated tokens (“vertical” format), while preserving XML-like tags containing metadata.</a></p>
     73 + Unitok is a universal text tokenizer with customizable settings for many languages. It can turn plain text into a sequence of newline-separated tokens (vertical format) while preserving XML-like tags containing metadata. Designed for fast tokenization of extensive text collections, enabling the creation of large text corpora.</a></p>
 64  74   <p>
 65  75   <a class="lnk" href="http://nlp.fi.muni.cz/raslan/raslan14.pdf#page=79">Paper</a>
  …   …
 72  82
 73  83   <td class="app" style="background-color:#9932CC ; background-image:url('/chrome/site/noske_icon_logo_only_white.png')">
 74     - <p style="color:white;"><a href="http://nlp.fi.muni.cz/trac/noske">NoSketch Engine is the open-sourced little brother of the corpus querying system Sketch Engine.
     84 + <p style="color:white;"><a href="http://nlp.fi.muni.cz/trac/noske">NoSketch Engine is the open-sourced little brother of the Sketch Engine corpus system. It includes tools such as concordancer, frequency lists, keyword extraction, advanced searching using linguistic criteria and many others.
 75  85   </a><p>
 76  86   <p>
  …   …
 87  97
 88  98   <td class="app black" style="background-color:#A52A2A; background-image: url('/chrome/site/w2c_44.png');">
 89     - <p><a href="/wiki/wiki2corpus" style="color:white;">wiki2corpus is a script which downloads Wikipedia articles (for a given language) and outputs them in the form of prevertical which can be further processed by other corpus tools.
     99 + <p><a href="/wiki/wiki2corpus" style="color:white;">wiki2corpus is a script which downloads the main body of text from Wikipedia articles (for a given language) and outputs them in the form of a prevertical (plain text format) which can be fed to corpus tools including Sketch Engine and NoSketch Engine.
 90 100   </a><p>
 91 101
  …   …
 96 106
 97 107   <td class="app black" style="background-color:#191970; background-image: url('/chrome/site/noske_nb.png');">
 98     - <p><a href="/wiki/languagefilter" style="color:white;">Language Filter is a language discriminating tool. It works with the vertical format. The language of paragraphs and documents is determined according to pre-defined lists of words with corpus frequency.
    108 + <p><a href="/wiki/languagefilter" style="color:white;">Language Filter is a tool for distinguishing languages. It works with the vertical format. The language of paragraphs and documents is determined according to pre-defined word frequency lists (i.e. wordlists generated from large web corpora).
 99 109   </a><p>
100 110   <p>