SpiderLing
SpiderLing, a web spider for linguistics, is software for obtaining text from the web that is useful for building text corpora. Many documents on the web contain only material unsuitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not made up of full sentences. In fact, such pages represent the vast majority of the web. Unrestricted web crawls therefore typically download a lot of data that gets filtered out during post-processing, which makes web corpus collection inefficient. The aim of our work is to focus the crawling on the text-rich parts of the web and to maximize the number of words in the final corpus per downloaded megabyte.
Source code
Publications
We presented our results at the following venues:
[http://trac.sketchengine.co.uk/raw-attachment/wiki/AK/Papers/tentens_14may2013.docx The TenTen Corpus Family]
by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel
at [http://ucrel.lancs.ac.uk/cl2013/ 7th International Corpus Linguistics Conference], Lancaster, July 2013.
[http://nlp.fi.muni.cz/~xsuchom2/papers/BaisaSuchomel_TurkicResources.pdf Large Corpora for Turkic Languages and Unsupervised Morphological Analysis]
by Vít Baisa, Vít Suchomel
at [http://multisaund.eu/lrec2012_turkiclanguage.php Language Resources and Technologies for Turkic Languages] (at conference LREC), Istanbul, May 2012.
[http://nlp.fi.muni.cz/~xsuchom2/papers/PomikalekSuchomel_SpiderlingEfficiency.pdf Efficient Web Crawling for Large Text Corpora]
by Jan Pomikálek, Vít Suchomel
at ACL SIGWAC Web as Corpus (at conference WWW), Lyon, April 2012.
Large textual corpora built using SpiderLing since October 2011
language | raw data size [GB] | cleaned data size [GB] | yield rate | corpus size [billion tokens] | crawling duration [days] |
---|---|---|---|---|---|
American Spanish | 1874 | 44 | 2.36% | 8.7 | 14 |
Arabic | 2015 | 58 | 2.89% | 6.6 | 14 |
Bulgarian | | | | 0.9 | 8 |
Czech | ~4000 | | | 5.8 | ~40 |
English | 2859 | 108 | 3.78% | 17.8 | 17 |
Estonian | 100 | 3 | 2.67% | 0.3 | 14 |
French | 3273 | 72 | 2.19% | 12.4 | 15 |
German | 5554 | 145 | 2.61% | 19.7 | 30 |
Hungarian | | | | 3.1 | 20 |
Japanese | 2806 | 61 | 2.19% | 11.1 | 28 |
Korean | | | | 0.5 | 20 |
Polish | | | | 9.5 | 17 |
Russian | 4142 | 198 | 4.77% | 20.2 | 14 |
Turkish | 2700 | 26 | 0.97% | 4.1 | 14 |
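The yield rate column can be reproduced from the two size columns. The sketch below assumes that yield rate means cleaned data size divided by raw data size, which agrees with the figures in the table up to rounding of the published sizes. The function names and the conversion of 1 GB to 1000 MB are illustrative assumptions and are not part of SpiderLing itself.

```python
def yield_rate(raw_gb: float, cleaned_gb: float) -> float:
    """Fraction of the downloaded data that survives cleaning.

    Assumption: yield rate = cleaned size / raw size.
    """
    if raw_gb <= 0:
        raise ValueError("raw_gb must be positive")
    return cleaned_gb / raw_gb


def tokens_per_downloaded_mb(corpus_billion_tokens: float, raw_gb: float) -> float:
    """Corpus tokens obtained per downloaded megabyte.

    Assumption: 1 GB = 1000 MB.
    """
    if raw_gb <= 0:
        raise ValueError("raw_gb must be positive")
    return corpus_billion_tokens * 1e9 / (raw_gb * 1000)


# (raw GB, cleaned GB, billion tokens) copied from three rows of the table
rows = {
    "English": (2859, 108, 17.8),
    "Russian": (4142, 198, 20.2),
    "Turkish": (2700, 26, 4.1),
}

for language, (raw, cleaned, tokens) in rows.items():
    print(
        f"{language}: yield rate {yield_rate(raw, cleaned):.2%}, "
        f"{tokens_per_downloaded_mb(tokens, raw):.0f} tokens per downloaded MB"
    )
```

Because the table lists rounded sizes, a recomputed yield rate can differ from the published one in the last digit.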
Acknowledgements
This software is developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of Masaryk University in Brno, Czech Republic, in cooperation with [http://lexicalcomputing.com/ Lexical Computing Ltd.], UK, a corpus tool company.
Contact
Licence
This software is the result of project LM2010013 (LINDAT-Clarin - Vybudování a provoz českého uzlu pan-evropské infrastruktury pro výzkum). This result is consistent with the expected objectives of the project. The owner of the result is Masaryk University, a public university, ID: 00216224. Masaryk University allows other companies and individuals to use this software free of charge and without territorial restrictions under the terms of the GPL license.
This permission is granted for the duration of property rights.
This software is not subject to special information treatment according to Act No. 412/2005 Coll., as amended. If a person using the software under this license offer violates the license terms, the permission to use the software terminates.