= !SpiderLing =
!SpiderLing (a web spider for linguistics) is software for obtaining
text from the web that is useful for building text corpora. Many
documents on the web contain only material that is unsuitable for text
corpora, such as site navigation, lists of links, lists of products,
and other kinds of text that do not consist of full sentences. In fact,
such pages represent the vast majority of the web. Unrestricted web
crawls therefore typically download a lot of data that is filtered out
during post-processing, which makes web corpus collection inefficient.
The aim of our work is to focus the crawling on the text-rich parts of
the web and to maximize the number of words in the final corpus per
downloaded megabyte.

== Get SpiderLing ==
See [wiki:Downloads] for the latest version.

=== Changelog 0.77 → 0.81 ===
* Improvements proposed by Vlado Benko or Nikola Ljubešić:
 - escape square brackets and backslashes in URLs
 - doc attributes: timestamp with hours, IP address, meta/chared encoding
 - doc id added to arc output
 - MAX_DOCS_CLEANED limit per domain
 - create the temp dir if needed
* Support for processing doc/docx/ps/pdf (not working well yet; URLs of documents are saved to a separate file for manual download and processing)
* Crawling multiple languages (inspired by Nikola's contribution; not tested yet)
* Stop crawling by sending SIGTERM to the main process
* Domain distances (distance of web domains from seed web domains, will be used in scheduling in the future)
* Config values tweaked
 - MAX_URL_QUEUE, MAX_URL_SELECT greatly increased
 - better spread of domains in the crawling queue => faster crawling
* Python → !PyPy
 - scheduler and crawler processes are dynamically compiled by !PyPy
 - saves approx. 1/4 of RAM
 - the gain in CPU efficiency is less visible (time is spent waiting for host/IP timeouts and for doc processors)
 - process.py requires lxml, which does not work with !PyPy (will be replaced by lxml-cffi in the future)
* Readme updated (more information, known bugs)
* Bug fixes
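
The changelog item about escaping square brackets and backslashes does not say which escaping scheme is used. The following is a minimal Python sketch that assumes percent-encoding is meant. The helper name `escape_url` and the example URL are hypothetical and are not taken from the !SpiderLing source. The sketch also ignores bracketed IPv6 host literals, where square brackets are legitimate.

```python
# Hypothetical helper for illustration; not the SpiderLing implementation.
# Assumption: "escape" means percent-encoding the three characters.
_ESCAPES = {"[": "%5B", "]": "%5D", "\\": "%5C"}


def escape_url(url: str) -> str:
    """Percent-encode square brackets and backslashes in a URL string."""
    return "".join(_ESCAPES.get(ch, ch) for ch in url)


if __name__ == "__main__":
    # Made-up URL containing all three characters.
    print(escape_url("http://example.com/list[1]\\item"))
```

All other characters are left untouched, so a URL that is already clean passes through unchanged.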

== Publications ==
We presented our results at the following venues:

[http://trac.sketchengine.co.uk/raw-attachment/wiki/AK/Papers/tentens_14may2013.docx The TenTen Corpus Family]\\
by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel\\
at the [http://ucrel.lancs.ac.uk/cl2013/ 7th International Corpus Linguistics Conference], Lancaster, July 2013.

[http://nlp.fi.muni.cz/~xsuchom2/papers/BaisaSuchomel_TurkicResources.pdf Large Corpora for Turkic Languages and Unsupervised Morphological Analysis]\\
by Vít Baisa, Vít Suchomel\\
at [http://multisaund.eu/lrec2012_turkiclanguage.php Language Resources and Technologies for Turkic Languages] (at the LREC conference), Istanbul, May 2012.

[http://nlp.fi.muni.cz/~xsuchom2/papers/PomikalekSuchomel_SpiderlingEfficiency.pdf Efficient Web Crawling for Large Text Corpora]\\
by Jan Pomikálek, Vít Suchomel\\
at [http://sigwac.org.uk/wiki/WAC7 ACL SIGWAC Web as Corpus] (at the WWW conference), Lyon, April 2012.


== Large textual corpora built using !SpiderLing since October 2011 ==
||= language  =||= raw data size [GB] =||= cleaned data size [GB] =||= yield rate =||= corpus size [billion tokens] =||= crawling duration [days] =||
||American Spanish ||   1874||   44||  2.36%||   8.7||   14||
||Arabic           ||   2015||   58||  2.89%||   6.6||   14||
||Bulgarian        ||       ||     ||       ||   0.9||    8||
||Czech            ||  ~4000||     ||       ||   5.8||  ~40||
||English          ||   2859||  108||  3.78%||  17.8||   17||
||Estonian         ||    100||    3||  2.67%||   0.3||   14||
||French           ||   3273||   72||  2.19%||  12.4||   15||
||German           ||   5554||  145||  2.61%||  19.7||   30||
||Hungarian        ||       ||     ||       ||   3.1||   20||
||Japanese         ||   2806||   61||  2.19%||  11.1||   28||
||Korean           ||       ||     ||       ||   0.5||   20||
||Polish           ||       ||     ||       ||   9.5||   17||
||Russian          ||   4142||  198||  4.77%||  20.2||   14||
||Turkish          ||   2700||   26||  0.97%||   4.1||   14||
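
The table does not define the yield rate column. The listed values are consistent with the cleaned data size divided by the raw data size, which is the assumption made in this short Python sketch. The function name `yield_rate` is illustrative only, and the sizes are copied from three rows of the table.

```python
def yield_rate(raw_gb: float, cleaned_gb: float) -> float:
    """Return cleaned data as a percentage of raw downloaded data.

    Assumption: yield rate = cleaned size / raw size.
    """
    if raw_gb <= 0:
        raise ValueError("raw data size must be positive")
    return 100.0 * cleaned_gb / raw_gb


if __name__ == "__main__":
    # (raw GB, cleaned GB) as listed in the table above.
    rows = {
        "English": (2859, 108),
        "Russian": (4142, 198),
        "Turkish": (2700, 26),
    }
    for language, (raw, cleaned) in rows.items():
        print(f"{language}: {yield_rate(raw, cleaned):.2f}%")
```

Because the sizes in the table are rounded to whole gigabytes, values recomputed this way can differ from the listed percentages in the last digit.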


== Acknowledgements ==
This software is developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of Masaryk University in Brno, Czech Republic,\\
in cooperation with [http://lexicalcomputing.com/ Lexical Computing Ltd.], UK, a [http://www.sketchengine.co.uk/ corpus tool] company.

== Contact ==
{{{
#!html
<SCRIPT TYPE="text/javascript">
  emailE='fi.muni.cz';
  emailE=('xsuchom2' + '@' + emailE);
  document.write('Vít Suchomel: <A href="mailto:' + emailE + '">' +
emailE + '</a>');
</SCRIPT>
}}}

== Licence ==
This software is the result of project LM2010013 (LINDAT-Clarin -
Vybudování a provoz českého uzlu pan-evropské infrastruktury pro
výzkum). This result is consistent with the expected objectives of the
project. The owner of the result is Masaryk University, a public
university, ID: 00216224. Masaryk University allows other companies
and individuals to use this software free of charge and without
territorial restrictions under the terms of the
[http://www.gnu.org/licenses/gpl.txt GPL license].

This permission is granted for the duration of property rights.

This software is not subject to special information treatment
according to Act No. 412/2005 Coll., as amended. If a person using the
software under this license offer violates the license terms, the
permission to use the software terminates.