SpiderLing
SpiderLing (a web spider for linguistics) is software for obtaining text from the web that is useful for building text corpora. Many documents on the web contain only material that is unsuitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not made up of full sentences. In fact, such pages represent the vast majority of the web. Unrestricted web crawls therefore typically download a lot of data that is filtered out during post-processing, which makes web corpus collection inefficient. The aim of our work is to focus the crawling on the text-rich parts of the web and to maximize the number of words in the final corpus per downloaded megabyte. Nevertheless, the crawler can be configured to ignore the yield rate of web domains and download from low-yield sites too.
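The idea of steering the crawl by yield rate can be sketched in a few lines of Python. This is only an illustration: it assumes that yield rate means cleaned bytes divided by downloaded bytes, and the names `DomainStats`, `MIN_YIELD_RATE` and `MIN_SAMPLE_BYTES` as well as their values are hypothetical, not SpiderLing's actual identifiers or defaults.

```python
# Hypothetical sketch: track how much clean text each web domain yields
# and stop downloading from domains that yield too little.

MIN_YIELD_RATE = 0.01        # assumed threshold: 1 % of downloaded bytes survive cleaning
MIN_SAMPLE_BYTES = 1000000   # assumed warm-up: judge a domain only after ~1 MB


class DomainStats(object):
    """Per-domain counters of downloaded and cleaned data."""

    def __init__(self):
        self.raw_bytes = 0
        self.clean_bytes = 0

    def add_document(self, raw_size, clean_size):
        """Record one downloaded document and the text left after cleaning."""
        self.raw_bytes += raw_size
        self.clean_bytes += clean_size

    def yield_rate(self):
        """Cleaned bytes per downloaded byte (0.0 before any download)."""
        if self.raw_bytes == 0:
            return 0.0
        return float(self.clean_bytes) / self.raw_bytes

    def worth_crawling(self):
        """Keep crawling until enough data was seen, then apply the threshold."""
        if self.raw_bytes < MIN_SAMPLE_BYTES:
            return True
        return self.yield_rate() >= MIN_YIELD_RATE
```

Ignoring the yield rate, as the last sentence above allows, would correspond to setting the threshold to zero in this sketch.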
Get SpiderLing
See Downloads for the latest version. Please note that the software is distributed as is, without guaranteed support.
Changelog 0.82 → 0.84
- Mistakenly ignored links fixed in process.py
  - the bug caused a significant part of the web not to be crawled
- Multithreading issues fixed (possible race conditions in shared memory)
  - delegated classes for explicit locking of all shared objects (esp. DomainQueue)
- Deadlock -- the scheduler stops working after some time of successful crawling
  - observed in large-scale crawling
  - may be related to the code changes made because of the multithreading problems
  - IMPORTANT -- may not be resolved yet, testing in progress, 0.81 should be stable
- Several Domain and DomainQueue methods were rewritten for bulk operations (e.g. adding all paths into a single domain together) to improve performance
- Justext configuration
  - more permissive setting than the justext default
  - configurable in util/config.py
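The "delegated classes for explicit locking" item can be pictured with the following minimal Python sketch. It is an illustration only: `LockedDelegate` and `ToyDomainQueue` are hypothetical names, and SpiderLing's actual classes may be built quite differently.

```python
import threading


class LockedDelegate(object):
    """Forward method calls to a wrapped object while holding a lock."""

    def __init__(self, target):
        self._target = target
        self._lock = threading.RLock()

    def __getattr__(self, name):
        # Called only for names not found on the delegate itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr  # plain attributes are read without locking

        def locked_call(*args, **kwargs):
            with self._lock:
                return attr(*args, **kwargs)
        return locked_call


class ToyDomainQueue(object):
    """Stand-in for a shared queue; not SpiderLing's DomainQueue."""

    def __init__(self):
        self.items = []

    def add_all(self, paths):
        # Read-modify-write sequence that could race without a lock.
        current = list(self.items)
        current.extend(paths)
        self.items = current

    def size(self):
        return len(self.items)
```

Wrapping the shared object once, e.g. `queue = LockedDelegate(ToyDomainQueue())`, and handing only the wrapper to worker threads makes every method call hold the lock, at the cost of serialising access to that object.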
Changelog 0.81 → 0.82
- Crawling multiple languages improved
  - recognise multiple languages, accept a subset of these languages
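Recognising several languages and accepting only some of them can be sketched with character trigram profiles, the technique the bundled util/trigrams.py is described as using. The code below is a simplified stand-in written for this page, not that module: the function names, the cosine similarity measure and the `min_similarity` value are all assumptions.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams in lower-cased, space-normalised text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity of two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values())) *
            math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def detect_language(text, models):
    """Return (language, similarity) of the best matching model."""
    profile = trigram_profile(text)
    scored = ((lang, cosine(profile, model)) for lang, model in models.items())
    return max(scored, key=lambda pair: pair[1])


def accept(text, models, accepted, min_similarity=0.3):
    """Recognise among all models, but keep only documents in accepted languages."""
    lang, score = detect_language(text, models)
    return lang in accepted and score >= min_similarity
```

Here `models` would map each recognised language to a profile built with `trigram_profile` from sample text, and `accepted` is the subset of languages to keep.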
Changelog 0.77 → 0.81
- Improvements proposed by Vlado Benko or Nikola Ljubešić:
  - escape square brackets and backslash in url
  - doc attributes: timestamp with hours, IP address, meta/chared encoding
  - doc id added to arc output
  - MAX_DOCS_CLEANED limit per domain
  - create the temp dir if needed
- Support for processing doc/docx/ps/pdf (not working well yet, URLs of documents are saved to a separate file for manual download and processing)
- Crawling multiple languages (inspired by Nikola's contribution) (not tested yet)
- Stop crawling by sending SIGTERM to the main process
- Domain distances (distance of web domains from seed web domains, will be used in scheduling in the future)
- Config values tweaked
  - MAX_URL_QUEUE, MAX_URL_SELECT greatly increased
  - better spread of domains in the crawling queue => faster crawling
- Python --> PyPy
  - scheduler and crawler processes dynamically compiled by pypy
  - saves approx. 1/4 RAM
  - the gain in CPU efficiency is less visible (waiting for host/IP timeouts, waiting for doc processors)
  - process.py requires lxml which does not work with pypy (will be replaced by lxml-cffi in the future)
- Readme updated (more information, known bugs)
- Bug fixes
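As an illustration of the URL escaping item in this changelog, the following sketch percent-encodes square brackets and backslashes. The function name `escape_url` is hypothetical and the code is not taken from SpiderLing; it also deliberately ignores the case of bracketed IPv6 host literals, which would need to be left intact.

```python
# Percent-encode characters that are not allowed unescaped in URL paths
# and queries: backslash, "[" and "]".
_ESCAPES = (("\\", "%5C"), ("[", "%5B"), ("]", "%5D"))


def escape_url(url):
    """Return the URL with backslashes and square brackets percent-encoded."""
    for char, encoded in _ESCAPES:
        url = url.replace(char, encoded)
    return url
```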
Publications
We presented our results at the following venues:
The TenTen Corpus Family
by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel
at 7th International Corpus Linguistics Conference, Lancaster, July 2013.
Large Corpora for Turkic Languages and Unsupervised Morphological Analysis
by Vít Baisa, Vít Suchomel
at Language Resources and Technologies for Turkic Languages (at conference LREC), Istanbul, May 2012
Efficient Web Crawling for Large Text Corpora
by Jan Pomikálek, Vít Suchomel
at ACL SIGWAC Web as Corpus (at conference WWW), Lyon, April 2012
Large textual corpora built using SpiderLing since October 2011
language | raw data size [GB] | cleaned data size [GB] | yield rate | corpus size [billion tokens] | crawling duration [days] |
---|---|---|---|---|---|
American Spanish | 1874 | 44 | 2.36% | 8.7 | 14 |
Arabic | 2015 | 58 | 2.89% | 6.6 | 14 |
Bulgarian | | | | 0.9 | 8 |
Czech | ~4000 | | | 5.8 | ~40 |
English | 2859 | 108 | 3.78% | 17.8 | 17 |
Estonian | 100 | 3 | 2.67% | 0.3 | 14 |
French | 3273 | 72 | 2.19% | 12.4 | 15 |
German | 5554 | 145 | 2.61% | 19.7 | 30 |
Hungarian | | | | 3.1 | 20 |
Japanese | 2806 | 61 | 2.19% | 11.1 | 28 |
Korean | | | | 0.5 | 20 |
Polish | | | | 9.5 | 17 |
Russian | 4142 | 198 | 4.77% | 20.2 | 14 |
Turkish | 2700 | 26 | 0.97% | 4.1 | 14 |
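The yield rate column appears to be the cleaned data size divided by the raw data size. The short check below recomputes it for a few rows copied from the table; the small differences from the published figures presumably come from the sizes being rounded to whole gigabytes.

```python
# (raw GB, cleaned GB, published yield rate in %) copied from the table above
ROWS = {
    "American Spanish": (1874, 44, 2.36),
    "English": (2859, 108, 3.78),
    "German": (5554, 145, 2.61),
    "Russian": (4142, 198, 4.77),
    "Turkish": (2700, 26, 0.97),
}


def yield_rate_percent(raw_gb, cleaned_gb):
    """Share of the downloaded data that survives cleaning, in percent."""
    return 100.0 * cleaned_gb / raw_gb


for language, (raw, cleaned, published) in sorted(ROWS.items()):
    computed = yield_rate_percent(raw, cleaned)
    print("%-17s computed %.2f %%  published %.2f %%" % (language, computed, published))
```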
README
== Requires ==
2.7.9 <= python < 3,
pypy >= 2.2.1,
justext >= 1.2 (http://corpus.tools/wiki/Justext),
chared >= 1.3 (http://corpus.tools/wiki/Chared),
lxml >= 2.2.4 (http://lxml.de/),
openssl >= 1.0.1,
pdftotext, ps2ascii, antiword (text processing tools).
Runs in Linux, tested in Fedora and Ubuntu.

Minimum hardware configuration (very small crawls):
- 2 core CPU,
- 4 GB system memory,
- some storage space,
- broadband internet connection.

Recommended hardware configuration (crawling ~30 bn words of English text):
- 4-24 core CPU (the more CPUs the faster the processing of crawled data),
- 8-250 GB operational memory
  (the more RAM the more domains kept in memory and thus more websites visited),
- lots of storage space,
- connection to an internet backbone line.

== Includes ==
A robot exclusion rules parser for Python (v. 1.6.2)
- by Philip Semanchuk, BSD Licence
- see util/robotparser.py
Language detection using character trigrams
- by Douglas Bagnall, Python Software Foundation Licence
- see util/trigrams.py
docx2txt
- by Sandeep Kumar, GNU GPL 3+
- see util/doc2txt.pl

== Installation ==
- unpack,
- install required tools,
- check justext.core and chared.detector can be imported by pypy,
- make sure the crawler can write to its directory and config.PIPE_DIR.

== Settings -- edit util/config.py ==
- !!!IMPORTANT!!! Set AGENT, AGENT_URL, USER_AGENT,
- raise ulimit -n according to MAX_OPEN_CONNS,
- set MAX_RUN_TIME to specify max crawling time in seconds,
- set DOC_PROCESSOR_COUNT to (partially) control CPU usage,
- configure language dependent settings,
- set MAX_DOMS_READY to (partially) control memory usage,
- set MAX_DOMS_WAITING_FOR_SINGLE_IP, MAX_IP_REPEAT,
- set MAX_OPEN_CONNS, IP_CONN_INTERVAL, HOST_CONN_INTERVAL,
- set and mkdir PIPE_DIR (pipes for communication of subprocesses).

== Language models ==
- plaintext in the target language in util/lang_samples/,
  e.g. put plaintexts from several dozen English web documents and
  English Wikipedia articles in ./util/lang_samples/English
- jusText stoplist for that language in jusText stoplist path,
  e.g. <justext directory>/stoplists/English.txt
- chared model for that language,
  e.g. <chared directory>/models/English

== Usage ==
    pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]

SEED_URLS is a text file containing seed URLs (the crawling starts there),
one per line, specify at least 50 URLs.
SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded,
e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
Running the crawler in the background is recommended.

The crawler creates
- *.log.* .. log & debug files,
- *.arc.gz .. gzipped arc files (raw http responses),
- *.prevert_d .. preverticals with duplicate documents,
- *.duplicates .. files with duplicate document IDs,
- *.domain_{bad,oversize,distance} .. see util/domain.py,
- *.ignored_refs .. links ignored for configurable or hardcoded reasons,
- *.unproc_urls .. urls of non html documents that were not processed (bug).

To remove duplicate documents from preverticals, run
    rm spiderling.prevert
    for i in $(seq 0 15)
    do
        pypy util/remove_duplicates.py spiderling.${i}.duplicates \
            < spiderling.${i}.prevert_d >> spiderling.prevert
    done
File spiderling.prevert is the final output.

To stop the crawler before MAX_RUN_TIME, send SIGTERM to the main process
(spiderling.py).

To re-process arc files with current process.py and util/config.py, run
    zcat spiderling.*.arc.gz | pypy reprocess.py

== Performance tips ==
- Using PyPy reduces CPU and memory cost (saves approx. 1/4 RAM).
- Set ALLOWED_TLDS_RE to avoid crawling domains not in the target language
  (it takes some resources to detect it otherwise).
- Set LOG_LEVEL to debug and set INFO_PERIOD to check the debug output in
  *.log.crawl and *.log.eval to see where the bottleneck is, modify the
  settings accordingly, e.g. add doc processors if the doc queue is always full.

== Known bugs ==
- Domain distances should be made part of document metadata instead of
  storing them in a separate file.
- Non html documents are not processed (urls stored in *.unproc_urls instead).
- DNS resolvers are implemented as blocking threads => useless to have more
  than one, will be changed to separate processes in the future.
- Compressed connections are not accepted/processed. Some servers might be
  discouraged from sending an uncompressed response (not tested).
- Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
  It would require major changes in the design of the download scheduler.

== Support ==
There is no guaranteed support offered. (The author may help you a bit in his
free time.) Please note the tool is distributed as is, it may not work under
your conditions.
Acknowledgements
The author would like to express many thanks to Jan Pomikálek, Pavel Rychlý and Miloš Jakubíček for guidance, key design advice and help with debugging. Thanks also to Vlado Benko and Nikola Ljubešić for ideas for improvement.
This software is developed at the Natural Language Processing Centre of Masaryk University in Brno, Czech Republic, in cooperation with Lexical Computing Ltd., UK, a corpus tool company.
Contact
'zc.inum.if@2mohcusx'[::-1]
Licence
This software is the result of project LM2010013 (LINDAT-Clarin - Vybudování a provoz českého uzlu pan-evropské infrastruktury pro výzkum). This result is consistent with the expected objectives of the project. The owner of the result is Masaryk University, a public university, ID: 00216224. Masaryk University allows other companies and individuals to use this software free of charge and without territorial restrictions under the terms of the GPL license.
This permission is granted for the duration of property rights.
This software is not subject to special information treatment according to Act No. 412/2005 Coll., as amended. In case that a person who will use the software under this license offer violates the license terms, the permission to use the software terminates.