SpiderLing

SpiderLing — a web spider for linguistics — is software for obtaining text from the web that is useful for building text corpora. Many documents on the web contain only material unsuitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not comprised of full sentences. In fact, such pages represent the vast majority of the web. Therefore, unrestricted web crawls typically download a lot of data that gets filtered out during post-processing, which makes web corpus collection inefficient. The aim of our work is to focus the crawling on the text-rich parts of the web and maximize the number of words in the final corpus per downloaded megabyte. Nevertheless, the crawler can be configured to ignore the yield rate of web domains and download from low-yield sites too.

Get SpiderLing

Download the latest version. Please note the software is distributed as is, without guaranteed support.

Changelog 0.82 → 0.85

  • Fixed links mistakenly ignored by process.py
    • the bug prevented a significant part of the web from being crawled
  • Multithreading issues fixed (possible race conditions in shared memory)
    • delegated classes for explicit locking of all shared objects (esp. DomainQueue)
  • Fixed a deadlock observed during large scale crawling (present up to v. 0.84)
  • Several Domain and DomainQueue methods were rewritten for bulk operations (e.g. adding all paths into a single domain together) to improve performance
  • External Robot exclusion protocol module rewritten
    • unused code removed
    • performance issues and a bug in re fixed -- requires re2 now
  • Chunked HTTP response and URL handling methods rewritten (performance, bug fixes)
  • Justext configuration
    • more permissive setting than the justext default
    • configurable in util/config.py

Changelog 0.81 → 0.82

  • Crawling multiple languages improved
    • recognise multiple languages, accept a subset of these languages

Changelog 0.77 → 0.81

  • Improvements proposed by Vlado Benko and Nikola Ljubešić:
    • escape square brackets and backslash in url
    • doc attributes: timestamp with hours, IP address, meta/chared encoding
    • doc id added to arc output
    • MAX_DOCS_CLEANED limit per domain
    • create the temp dir if needed
  • Support for processing doc/docx/ps/pdf (not working well yet, URLs of documents are saved to a separate file for manual download and processing)
  • Crawling multiple languages (Inspired by Nikola's contribution) (not tested yet)
  • Stop crawling by sending SIGTERM to the main process
  • Domain distances (distance of web domains from seed web domains, will be used in scheduling in the future)
  • Config values tweaked
    • MAX_URL_QUEUE, MAX_URL_SELECT greatly increased
    • better spread of domains in the crawling queue => faster crawling
  • Python --> PyPy
    • scheduler and crawler processes dynamically compiled by pypy
    • saves approx. 1/4 RAM
    • the gain in CPU efficiency is less visible (waiting for host/IP timeouts, waiting for doc processors)
    • process.py requires lxml which does not work with pypy (will be replaced by lxml-cffi in the future)
  • Readme updated (more information, known bugs)
  • Bug fixes

Publications

We presented our results at the following venues:

The TenTen Corpus Family
by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel
at 7th International Corpus Linguistics Conference, Lancaster, July 2013.

Large Corpora for Turkic Languages and Unsupervised Morphological Analysis
by Vít Baisa, Vít Suchomel
at Language Resources and Technologies for Turkic Languages (at conference LREC), Istanbul, May 2012

Efficient Web Crawling for Large Text Corpora
by Jan Pomikálek, Vít Suchomel
at ACL SIGWAC Web as Corpus (at conference WWW), Lyon, April 2012

Large textual corpora built using SpiderLing since October 2011

language          raw data size [GB]  cleaned data size [GB]  yield rate  corpus size [billion tokens]  crawling duration [days]
American Spanish  1874                44                      2.36%       8.7                           14
Arabic            2015                58                      2.89%       6.6                           14
Bulgarian         -                   -                       -           0.9                           8
Czech             ~4000               -                       -           5.8                           ~40
English           2859                108                     3.78%       17.8                          17
Estonian          100                 3                       2.67%       0.3                           14
French            3273                72                      2.19%       12.4                          15
German            5554                145                     2.61%       19.7                          30
Hungarian         -                   -                       -           3.1                           20
Japanese          2806                61                      2.19%       11.1                          28
Korean            -                   -                       -           0.5                           20
Polish            -                   -                       -           9.5                           17
Russian           4142                198                     4.77%       20.2                          14
Turkish           2700                26                      0.97%       4.1                           14
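The yield rate column is simply the cleaned data size divided by the raw data size. An illustrative one-liner (the sizes in the table are rounded to whole gigabytes, so recomputed rates for some languages may differ from the listed value in the last digit):

```python
# Illustrative check of the yield rate column above:
# yield rate = cleaned data size / raw data size.
def yield_rate(raw_gb, cleaned_gb):
    return float(cleaned_gb) / raw_gb

# English: 108 GB cleaned out of 2859 GB downloaded
print("%.2f%%" % (100 * yield_rate(2859, 108)))  # → 3.78%
```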

Requires

  • 2.7.9 <= python < 3,
  • pypy >= 2.2.1,
  • justext >= 1.2 (http://corpus.tools/wiki/Justext),
  • chared >= 1.3 (http://corpus.tools/wiki/Chared),
  • lxml >= 2.2.4 (http://lxml.de/),
  • openssl >= 1.0.1,
  • re2 (python, https://pypi.python.org/pypi/re2/, make sure `import re2` works), or alternatively cffi_re2 (python/pypy, https://pypi.python.org/pypi/cffi_re2, make sure `import cffi_re2` works),
  • pdftotext, ps2ascii, antiword (text processing tools).

Runs on Linux; tested on Fedora and Ubuntu. Minimum hardware configuration (very small crawls):

  • 2 core CPU,
  • 4 GB system memory,
  • some storage space,
  • broadband internet connection.

Recommended hardware configuration (crawling ~30 bn words of English text):

  • 4-24 core CPU (the more cores the faster the processing of crawled data),
  • 8-250 GB RAM (the more RAM the more domains kept in memory and thus more websites visited),
  • lots of storage space,
  • connection to an internet backbone line.

Includes

  • A robot exclusion rules parser for Python (v. 1.6.2) by Philip Semanchuk, BSD Licence (see util/robotparser.py),
  • Language detection using character trigrams by Douglas Bagnall, Python Software Foundation Licence (see util/trigrams.py),
  • docx2txt by Sandeep Kumar, GNU GPL 3+ (see util/doc2txt.pl).

Installation

  • unpack,
  • install required tools,
  • check justext.core and chared.detector can be imported by pypy,
  • make sure the crawler can write to its directory and config.PIPE_DIR.
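The import check above can be scripted; a minimal sketch, to be run with pypy (`check_deps` is a hypothetical helper, not part of SpiderLing — the module names follow the Requires section):

```python
import importlib

# Quick check that the crawler's dependencies are importable;
# module names taken from the Requires and Installation notes.
REQUIRED = ["justext.core", "chared.detector", "lxml", "re2"]

def check_deps(modules):
    """Map each module name to True (importable) or False (missing)."""
    status = {}
    for name in modules:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status

if __name__ == "__main__":
    for name, ok in sorted(check_deps(REQUIRED).items()):
        print("%-20s %s" % (name, "OK" if ok else "MISSING"))
```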

Settings -- edit util/config.py

  • !!!IMPORTANT!!! Set AGENT, AGENT_URL, USER_AGENT,
  • raise ulimit -n according to MAX_OPEN_CONNS,
  • set MAX_RUN_TIME to specify max crawling time in seconds,
  • set DOC_PROCESSOR_COUNT to (partially) control CPU usage,
  • configure language dependent settings,
  • set MAX_DOMS_READY to (partially) control memory usage,
  • set MAX_DOMS_WAITING_FOR_SINGLE_IP, MAX_IP_REPEAT,
  • set MAX_OPEN_CONNS, IP_CONN_INTERVAL, HOST_CONN_INTERVAL,
  • set and mkdir PIPE_DIR (pipes for communication of subprocesses).
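The setting names above come from util/config.py; the values below are illustrative assumptions only, not recommended defaults — a sketch of what an edited config fragment might look like:

```python
# util/config.py -- illustrative fragment; all values are assumptions,
# tune them for your hardware, bandwidth and target crawl size.
AGENT = 'MyCrawler'                    # identify yourself to webmasters
AGENT_URL = 'http://example.org/crawler-info'
USER_AGENT = '%s (+%s)' % (AGENT, AGENT_URL)

MAX_RUN_TIME = 7 * 24 * 3600           # stop crawling after one week
DOC_PROCESSOR_COUNT = 4                # partially controls CPU usage
MAX_DOMS_READY = 100000                # partially controls memory usage
MAX_OPEN_CONNS = 500                   # keep `ulimit -n` above this value
IP_CONN_INTERVAL = 10                  # seconds between connections to one IP
HOST_CONN_INTERVAL = 20                # seconds between connections to one host
PIPE_DIR = '/tmp/spiderling_pipes/'    # must exist and be writable
```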

Language models

  • plaintext in the target language in util/lang_samples/, e.g. put plaintexts from several dozen English web documents and English Wikipedia articles in ./util/lang_samples/English
  • jusText stoplist for that language in jusText stoplist path, e.g. <justext directory>/stoplists/English.txt
  • chared model for that language, e.g. <chared directory>/models/English

Usage

pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP] or pypy spiderling.py [SAVEPOINT_TIMESTAMP] < SEED_URLS

SEED_URLS is a text file containing seed URLs (the crawling starts there), one per line, specify at least 50 URLs. SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded, e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.

Python can be used instead of pypy if the latter is not available. It is recommended to run the crawler in screen.

Files created by the crawler:

  • *.log.* .. log & debug files,
  • *.arc.gz .. gzipped arc files (raw http responses),
  • *.prevert_d .. preverticals with duplicate documents,
  • *.duplicates .. files listing duplicate document IDs,
  • *.domain_{bad,oversize,distance} .. see util/domain.py,
  • *.links_unproc .. unprocessed urls from removed domains,
  • *.links_ignored .. urls not passing domain blacklist or TLD filter,
  • *.links_binfile .. urls of binary files (pdf, ps, doc, docx), saved but not processed when config.CONVERSION_ENABLED is disabled,
  • *.state* .. savepoints that can be used for a new run (not tested much).

To remove duplicate documents from preverticals, run

rm spiderling.prevert
for i in $(seq 0 <config.DOC_PROCESSOR_COUNT - 1>)
do 
    pypy util/remove_duplicates.py spiderling.${i}.duplicates \
        < spiderling.${i}.prevert_d >> spiderling.prevert
done

File spiderling.prevert is the final output.

To stop the crawler before MAX_RUN_TIME, send SIGTERM to the main process (spiderling.py). To re-process arc files with current process.py and util/config.py, run zcat spiderling.*.arc.gz | pypy reprocess.py

Performance tips

  • Using PyPy reduces CPU and memory cost (saves approx. 1/4 RAM).
  • Set ALLOWED_TLDS_RE to avoid crawling domains not in the target language (otherwise detecting the language of such domains wastes resources).
  • Set LOG_LEVEL to debug and set INFO_PERIOD, then check the debug output in *.log.crawl and *.log.eval to see where the bottleneck is, and modify the settings accordingly, e.g. add doc processors if the doc queue is always full.

Known bugs

  • Some non-critical I/O errors are output to stderr.
  • Domain distances should be made part of document metadata instead of storing them in a separate file.
  • Processing binary files (pdf, ps, doc, docx) is disabled by default since it was not tested and may slow processing significantly.
  • DNS resolvers are implemented as blocking threads, so running more than one is useless; they will be changed to separate processes in the future.
  • Compressed connections are not accepted/processed. Some servers might decline to send an uncompressed response (not tested).
  • Some advanced features of robots.txt are not observed, e.g. Crawl-delay. It would require major changes in the design of the download scheduler. A warning is emitted when the crawl delay < config.HOST_CONN_INTERVAL.

Support

There is no guaranteed support offered. (The author may help you a bit in his free time.) Please note the tool is distributed as is, it may not work under your conditions.

Acknowledgements

The author would like to express many thanks to Jan Pomikálek, Pavel Rychlý and Miloš Jakubíček for guidance, key design advice and help with debugging. Thanks also to Vlado Benko and Nikola Ljubešić for ideas for improvement.

This software is developed at the Natural Language Processing Centre of Masaryk University in Brno, Czech Republic
in cooperation with Lexical Computing Ltd., UK, a corpus tool company.

Contact

'zc.inum.if@2mohcusx'[::-1]

Licence

This software is the result of project LM2010013 (LINDAT-Clarin - Vybudování a provoz českého uzlu pan-evropské infrastruktury pro výzkum). This result is consistent with the expected objectives of the project. The owner of the result is Masaryk University, a public university, ID: 00216224. Masaryk University allows other companies and individuals to use this software free of charge and without territorial restrictions under the terms of the GPL license.

This permission is granted for the duration of property rights.

This software is not subject to special information treatment according to Act No. 412/2005 Coll., as amended. Should a person using the software under this licence offer violate the licence terms, the permission to use the software terminates.

Last modified on Oct 28, 2016, 2:34:14 PM