Changes between Initial Version and Version 1 of SpiderLing


Timestamp:
08/06/15 14:02:33
Author:
admin

  • SpiderLing

     1= !SpiderLing =
     2!SpiderLing — a web spider for linguistics — is software for obtaining
     3text from the web useful for building text corpora. Many documents on
     4the web only contain material not suitable for text corpora, such as
     5site navigation, lists of links, lists of products, and other kinds of
     6text not composed of full sentences. In fact, such pages represent the
     7vast majority of the web. Therefore, by doing unrestricted web crawls,
     8we typically download a lot of data which gets filtered out during
     9post-processing. This makes the process of web corpus collection
     10inefficient. The aim of our work is to focus the crawling on the
     11text-rich parts of the web and maximize the number of words in the final
     12corpus per downloaded megabyte.
     13
     14== Publications ==
     15We presented our results at the following venues:
     16
     17[http://trac.sketchengine.co.uk/raw-attachment/wiki/AK/Papers/tentens_14may2013.docx
     18The TenTen Corpus Family]\\
     19by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel\\
     20at [http://ucrel.lancs.ac.uk/cl2013/ 7th International Corpus
     21Linguistics Conference], Lancaster, July 2013.
     22
     23[http://nlp.fi.muni.cz/~xsuchom2/papers/BaisaSuchomel_TurkicResources.pdf
     24Large Corpora for Turkic Languages and Unsupervised Morphological
     25Analysis]\\
     26by Vít Baisa, Vít Suchomel\\
     27at [http://multisaund.eu/lrec2012_turkiclanguage.php Language
     28Resources and Technologies for Turkic Languages] (at the LREC conference),
     29Istanbul, May 2012.
     30
     31[http://nlp.fi.muni.cz/~xsuchom2/papers/PomikalekSuchomel_SpiderlingEfficiency.pdf
     32Efficient Web Crawling for Large Text Corpora]\\
     33by Jan Pomikálek, Vít Suchomel\\
     34at [http://sigwac.org.uk/wiki/WAC7 ACL SIGWAC Web as Corpus] (at
     35the WWW conference), Lyon, April 2012.
     36
     37
     38== Large textual corpora built using !SpiderLing since October 2011 ==
     39||= language  =||= raw data size [GB] =||= cleaned data size [GB] =||= yield rate =||= corpus size [billion tokens] =||= crawling duration [days] =||
     40||American Spanish ||   1874||   44||  2.36%||   8.7||   14||
     41||Arabic           ||   2015||   58||  2.89%||   6.6||   14||
     42||Bulgarian        ||       ||     ||       ||   0.9||    8||
     43||Czech            ||  ~4000||     ||       ||   5.8||  ~40||
     44||English          ||   2859||  108||  3.78%||  17.8||   17||
     45||Estonian         ||    100||    3||  2.67%||   0.3||   14||
     46||French           ||   3273||   72||  2.19%||  12.4||   15||
     47||German           ||   5554||  145||  2.61%||  19.7||   30||
     48||Hungarian        ||       ||     ||       ||   3.1||   20||
     49||Japanese         ||   2806||   61||  2.19%||  11.1||   28||
     50||Korean           ||       ||     ||       ||   0.5||   20||
     51||Polish           ||       ||     ||       ||   9.5||   17||
     52||Russian          ||   4142||  198||  4.77%||  20.2||   14||
     53||Turkish          ||   2700||   26||  0.97%||   4.1||   14||
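
The page does not define the yield rate column. The following minimal sketch assumes it is the cleaned data size divided by the raw data size, which roughly reproduces the listed percentages (the sizes in the table are rounded, so the last digit may differ). It also estimates the number of tokens in the final corpus per downloaded megabyte, the quantity the crawler aims to maximize. The function names, and the choice of 1 GB = 1000 MB, are illustrative assumptions and not part of SpiderLing.

```python
# Three rows copied from the table above:
# (language, raw data size [GB], cleaned data size [GB], corpus size [billion tokens])
ROWS = [
    ("English", 2859, 108, 17.8),
    ("Russian", 4142, 198, 20.2),
    ("Turkish", 2700, 26, 4.1),
]


def yield_rate(raw_gb, cleaned_gb):
    """Assumed definition: cleaned data size divided by raw data size."""
    return cleaned_gb / raw_gb


def tokens_per_megabyte(raw_gb, billion_tokens):
    """Tokens in the final corpus per downloaded megabyte (1 GB taken as 1000 MB)."""
    return billion_tokens * 1e9 / (raw_gb * 1000)


for language, raw_gb, cleaned_gb, billion_tokens in ROWS:
    rate = yield_rate(raw_gb, cleaned_gb)
    density = tokens_per_megabyte(raw_gb, billion_tokens)
    print(f"{language}: yield rate {rate:.2%}, about {density:.0f} tokens per MB")
```

Under these assumptions, a higher yield rate means less of the downloaded data is discarded during post-processing, so the two figures together give a rough view of crawl efficiency per language.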
     54
     55
     56== Acknowledgements ==
     57This software is developed at the [http://nlp.fi.muni.cz/en/nlpc
     58Natural Language Processing Centre] of Masaryk University in Brno,
     59Czech Republic \\
     60in cooperation with [http://lexicalcomputing.com/ Lexical Computing
     61Ltd.], UK, a [http://www.sketchengine.co.uk/ corpus tool] company.
     62
     63== Contact ==
     64{{{
     65#!html
     66<SCRIPT TYPE="text/javascript">
     67  emailE='fi.muni.cz';
     68  emailE=('xsuchom2' + '@' + emailE);
     69  document.write('Vít Suchomel: <A href="mailto:' + emailE + '">' +
     70emailE + '</a>');
     71</SCRIPT>
     72}}}
     73
     74== Licence ==
     75This software is the result of project LM2010013 (LINDAT-Clarin -
     76Vybudování a provoz českého uzlu pan-evropské infrastruktury pro
     77výzkum). This result is consistent with the expected objectives of the
     78project. The owner of the result is Masaryk University, a public
     79university, ID: 00216224. Masaryk University allows other companies
     80and individuals to use this software free of charge and without
     81territorial restrictions under the terms of the
     82[http://www.gnu.org/licenses/gpl.txt GPL license].
     83
     84This permission is granted for the duration of property rights.
     85
     86This software is not subject to special information treatment
     87according to Act No. 412/2005 Coll., as amended. In case that a person
     88who will use the software under this license offer violates the
     89license terms, the permission to use the software terminates.
     90
     91== Source code ==
     92[attachment:spiderling-src-0.77.tar.xz]