= !SpiderLing =
!SpiderLing, a web spider for linguistics, is software for obtaining
text from the web that is useful for building text corpora. Many
documents on the web contain only material unsuitable for text
corpora, such as site navigation, lists of links, lists of products,
and other kinds of text not composed of full sentences. In fact, such
pages represent the vast majority of the web. Unrestricted web crawls
therefore typically download a lot of data that is filtered out during
post-processing, which makes web corpus collection inefficient. The
aim of our work is to focus the crawling on the text-rich parts of the
web and to maximize the number of words in the final corpus per
downloaded megabyte.

== Publications ==
We presented our results at the following venues:

[http://trac.sketchengine.co.uk/raw-attachment/wiki/AK/Papers/tentens_14may2013.docx The TenTen Corpus Family]\\
by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel\\
at [http://ucrel.lancs.ac.uk/cl2013/ 7th International Corpus Linguistics Conference], Lancaster, July 2013.

[http://nlp.fi.muni.cz/~xsuchom2/papers/BaisaSuchomel_TurkicResources.pdf Large Corpora for Turkic Languages and Unsupervised Morphological Analysis]\\
by Vít Baisa, Vít Suchomel\\
at [http://multisaund.eu/lrec2012_turkiclanguage.php Language Resources and Technologies for Turkic Languages] (at the LREC conference), Istanbul, May 2012.

[http://nlp.fi.muni.cz/~xsuchom2/papers/PomikalekSuchomel_SpiderlingEfficiency.pdf Efficient Web Crawling for Large Text Corpora]\\
by Jan Pomikálek, Vít Suchomel\\
at [http://sigwac.org.uk/wiki/WAC7 ACL SIGWAC Web as Corpus] (at the WWW conference), Lyon, April 2012.

== Large textual corpora built using !SpiderLing since October 2011 ==
||= language =||= raw data size [GB] =||= cleaned data size [GB] =||= yield rate =||= corpus size [billion tokens] =||= crawling duration [days] =||
||American Spanish || 1874|| 44|| 2.36%|| 8.7|| 14||
||Arabic || 2015|| 58|| 2.89%|| 6.6|| 14||
||Bulgarian || || || || 0.9|| 8||
||Czech || ~4000|| || || 5.8|| ~40||
||English || 2859|| 108|| 3.78%|| 17.8|| 17||
||Estonian || 100|| 3|| 2.67%|| 0.3|| 14||
||French || 3273|| 72|| 2.19%|| 12.4|| 15||
||German || 5554|| 145|| 2.61%|| 19.7|| 30||
||Hungarian || || || || 3.1|| 20||
||Japanese || 2806|| 61|| 2.19%|| 11.1|| 28||
||Korean || || || || 0.5|| 20||
||Polish || || || || 9.5|| 17||
||Russian || 4142|| 198|| 4.77%|| 20.2|| 14||
||Turkish || 2700|| 26|| 0.97%|| 4.1|| 14||
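The yield rate figures are consistent with the ratio of cleaned data size to raw data size. The following is a minimal Python sketch of that calculation under this assumption; it is not part of !SpiderLing. The function names `yield_rate` and `tokens_per_megabyte` are hypothetical, and the conversion 1 GB = 1000 MB is an assumption. Because the published sizes are rounded, recomputed values for some rows differ slightly from the table.

```python
def yield_rate(cleaned_gb: float, raw_gb: float) -> float:
    """Return the cleaned data size as a percentage of the raw data size.

    Assumption: yield rate = cleaned size / raw size.
    """
    if raw_gb <= 0:
        raise ValueError("raw_gb must be positive")
    return cleaned_gb / raw_gb * 100


def tokens_per_megabyte(billion_tokens: float, raw_gb: float) -> float:
    """Return corpus tokens obtained per downloaded megabyte.

    Assumption: 1 GB = 1000 MB.
    """
    if raw_gb <= 0:
        raise ValueError("raw_gb must be positive")
    return billion_tokens * 1e9 / (raw_gb * 1000)


# Rows copied from the table: (raw GB, cleaned GB, billion tokens).
rows = {
    "English": (2859, 108, 17.8),
    "German": (5554, 145, 19.7),
}

for language, (raw_gb, cleaned_gb, billion_tokens) in rows.items():
    rate = yield_rate(cleaned_gb, raw_gb)
    density = tokens_per_megabyte(billion_tokens, raw_gb)
    print(f"{language}: yield rate {rate:.2f}%, {density:.0f} tokens per MB")
```

For the English row, the computed yield rate rounds to 3.78%, which matches the table.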

== Acknowledgements ==
This software is developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of Masaryk University in Brno, Czech Republic,\\
in cooperation with [http://lexicalcomputing.com/ Lexical Computing Ltd.], UK, a [http://www.sketchengine.co.uk/ corpus tool] company.

== Contact ==
{{{
#!html
<script type="text/javascript">
// Build the address from its parts at run time.
var emailE = 'xsuchom2' + '@' + 'fi.muni.cz';
document.write('Vít Suchomel: <a href="mailto:' + emailE + '">' +
emailE + '</a>');
</script>
}}}

== Licence ==
This software is the result of project LM2010013 (LINDAT-Clarin -
Vybudování a provoz českého uzlu pan-evropské infrastruktury pro
výzkum; in English: Establishment and operation of the Czech node of
the pan-European infrastructure for research). This result is
consistent with the expected objectives of the project. The owner of
the result is Masaryk University, a public university, ID: 00216224.
Masaryk University allows other companies and individuals to use this
software free of charge and without territorial restrictions under the
terms of the [http://www.gnu.org/licenses/gpl.txt GPL license].

This permission is granted for the duration of property rights.

This software is not subject to special information treatment
according to Act No. 412/2005 Coll., as amended. If a person using the
software under this license offer violates the license terms, the
permission to use the software terminates.

== Source code ==
[attachment:spiderling-src-0.77.tar.xz]