Changes between Version 6 and Version 7 of SpiderLing


Timestamp: 10/13/16 17:36:23
Author: admin
Comment: 0.84

Legend:
  unmodified: lines with no marker (shown with a leading space)
  added (v7): lines prefixed with '+'
  removed (v6): lines prefixed with '-'
  modified: shown as a removed line followed by an added line

• SpiderLing (v6 → v7)
 = !SpiderLing =
-!SpiderLing — a web spider for linguistics — is software for obtaining
-text from the web useful for building text corpora. Many documents on
-the web only contain material not suitable for text corpora, such as
-site navigation, lists of links, lists of products, and other kind of
-text not comprised of full sentences. In fact such pages represent the
-vast majority of the web. Therefore, by doing unrestricted web crawls,
-we typically download a lot of data which gets filtered out during
-post-processing. This makes the process of web corpus collection
-inefficient. The aim of our work is to focus the crawling on the text
-rich parts of the web and maximize the number of words in the final
-corpus per downloaded megabyte.
+!SpiderLing — a web spider for linguistics — is software for obtaining text from the web that is useful for building text corpora.
+Many documents on the web contain only material not suitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not comprised of full sentences. In fact, such pages represent the vast majority of the web. Therefore, by doing unrestricted web crawls, we typically download a lot of data that gets filtered out during post-processing, which makes web corpus collection inefficient.
+The aim of our work is to focus the crawling on the text-rich parts of the web and maximize the number of words in the final corpus per downloaded megabyte. Nevertheless, the crawler can be configured to ignore the yield rate of web domains and download from low-yield sites too.
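The yield rate mentioned above is the number of words kept in the final corpus per downloaded megabyte. A minimal Python sketch of how such a per-domain threshold could work in principle; the function names, variable names and the threshold value are assumptions for illustration, not SpiderLing's actual code or configuration:
{{{
# Illustrative sketch only; names and the threshold value are assumptions,
# not SpiderLing's implementation or its util/config.py settings.

MIN_YIELD_RATE = 100.0      # hypothetical threshold: words per downloaded MB
IGNORE_YIELD_RATE = False   # the crawler can be told to keep low-yield domains too

def yield_rate(clean_words, downloaded_bytes):
    """Words kept in the final corpus per downloaded megabyte."""
    megabytes = downloaded_bytes / (1024.0 * 1024.0)
    return clean_words / megabytes if megabytes else 0.0

def keep_crawling_domain(domain_clean_words, domain_downloaded_bytes):
    """Decide whether a domain is worth further crawling."""
    if IGNORE_YIELD_RATE:
        return True
    return yield_rate(domain_clean_words, domain_downloaded_bytes) >= MIN_YIELD_RATE

print(keep_crawling_domain(50000, 200 * 1024 * 1024))  # 250 words/MB -> True
}}}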

 == Get SpiderLing ==
-See [wiki:Downloads] for the latest version.
+See [wiki:Downloads] for the latest version. '''Please note that the software is distributed as is, without guaranteed support.'''
+
+=== Changelog 0.82 → 0.84 ===
+* Mistakenly ignored links fixed in process.py
+ - the bug caused a significant part of the web not to be crawled
+* Multithreading issues fixed (possible race conditions in shared memory)
+ - delegated classes for explicit locking of all shared objects (esp. DomainQueue)
+* Deadlock -- the scheduler stops working after some time of successful crawling
+ - observed in the case of large-scale crawling
+ - may be related to code changes made because of the multithreading problems
+ - IMPORTANT -- may not be resolved yet, testing in progress; 0.81 should be stable
+* Several Domain and DomainQueue methods were rewritten for bulk operations (e.g. adding all paths into a single domain together) to improve performance
+* Justext configuration
+ - a more permissive setting than the justext default
+ - configurable in util/config.py (see the illustrative sketch below)
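As a hedged illustration of the "more permissive than the default" jusText setting mentioned above, the call below loosens a few thresholds of the standard justext API. The concrete parameter values are assumptions made for this example only; they are not the values shipped in util/config.py.
{{{
# Sketch only: the parameter values are illustrative assumptions,
# not SpiderLing's actual util/config.py settings.
import justext

def extract_clean_paragraphs(html, lang="English"):
    paragraphs = justext.justext(
        html,
        justext.get_stoplist(lang),
        length_low=50,          # accept shorter paragraphs than the default
        stopwords_low=0.2,      # tolerate a lower stop-word density
        stopwords_high=0.3,
        max_link_density=0.3,   # tolerate more links per paragraph
    )
    # Keep only paragraphs that jusText did not classify as boilerplate.
    return [p.text for p in paragraphs if not p.is_boilerplate]

html = b"<html><body><p>An example paragraph long enough to be considered.</p></body></html>"
print(extract_clean_paragraphs(html))
}}}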

 === Changelog 0.81 → 0.82 ===
…
 {{{
 == Requires ==
+2.7.9 <= python < 3,
 pypy >= 2.2.1,
-2.7 <= python < 3,
-justext >= 1.2,
-chared >= 1.3,
-lxml >= 2.2.4,
-text processing tools: pdftotext, ps2ascii, antiword,
-works in Ubuntu (Debian Linux), does not work in Windows.
+justext >= 1.2 (http://corpus.tools/wiki/Justext),
+chared >= 1.3 (http://corpus.tools/wiki/Chared),
+lxml >= 2.2.4 (http://lxml.de/),
+openssl >= 1.0.1,
+pdftotext, ps2ascii, antiword (text processing tools).
+Runs in Linux, tested in Fedora and Ubuntu.
 Minimum hardware configuration (very small crawls):
 - 2 core CPU,
…

 == Includes ==
-A robot exclusion rules parser for Python by Philip Semanchuk (v. 1.6.2)
-... see util/robotparser.py
-Language detection using character trigrams by Douglas Bagnall
-... see util/trigrams.py
-docx2txt by Sandeep Kumar
-... see util/doc2txt.pl
+A robot exclusion rules parser for Python (v. 1.6.2)
+- by Philip Semanchuk, BSD Licence
+- see util/robotparser.py
+Language detection using character trigrams
+- by Douglas Bagnall, Python Software Foundation Licence
+- see util/trigrams.py
+docx2txt
+- by Sandeep Kumar, GNU GPL 3+
+- see util/doc2txt.pl

 == Installation ==
…
 - jusText stoplist for that language in jusText stoplist path,
     e.g. <justext directory>/stoplists/English.txt
 - chared model for that language,
     e.g. <chared directory>/models/English

 == Usage ==
 pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]
 SEED_URLS is a text file containing seed URLs (the crawling starts there),
     one per line, specify at least 50 URLs.
 SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded,
     e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
 Running the crawler in the background is recommended.
 The crawler creates
 - *.log.* .. log & debug files,
 - *.arc.gz .. gzipped arc files (raw http responses),
 - *.prevert_d .. preverticals with duplicate documents,
 - *.duplicates .. files listing duplicate document IDs,
-- *.unproc_urls .. urls of non html documents that were not processed (bug)
-To remove duplicate documents from preverticals, run
+- *.domain_{bad,oversize,distance} .. see util/domain.py,
+- *.ignored_refs .. links ignored for configurable or hardcoded reasons,
+- *.unproc_urls .. urls of non html documents that were not processed (bug).
+To remove duplicate documents from preverticals, run
 rm spiderling.prevert
 for i in $(seq 0 15)
 do
     pypy util/remove_duplicates.py spiderling.${i}.duplicates \
         < spiderling.${i}.prevert_d >> spiderling.prevert
…

 == Known bugs ==
+- Domain distances should be made part of document metadata instead of storing
+  them in a separate file.
 - Non html documents are not processed (urls stored in *.unproc_urls instead).
 - DNS resolvers are implemented as blocking threads => useless to have more
…
 - Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
   It would require major changes in the design of the download scheduler.
-- Https requests are not implemented properly, may not work at all.
-- Path bloating, e.g. http://example.com/a/ yielding the same content as
-  http://example.com/a/a/ etc., should be avoided. Might be a bot trap.
+
+== Support ==
+There is no guaranteed support offered. (The author may help you a bit in his
+free time.) Please note the tool is distributed as is; it may not work under
+your conditions.
+
 }}}
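The character-trigram language detection listed in the Includes section above (util/trigrams.py) can be illustrated with the following minimal sketch. It shows the general technique only, not the bundled implementation, and the tiny training samples are made up for the example.
{{{
# Minimal sketch of character-trigram language detection.
# Not the code from util/trigrams.py; the training samples are invented.
from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Relative frequencies of character trigrams in the text."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = float(sum(counts.values())) or 1.0
    return dict((t, c / total) for t, c in counts.items())

def cosine_similarity(p, q):
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def detect_language(text, models):
    """models maps a language name to a trigram profile of its training text."""
    profile = trigram_profile(text)
    return max(models, key=lambda lang: cosine_similarity(profile, models[lang]))

models = {
    "English": trigram_profile("this is a small sample of plain english text"),
    "Czech": trigram_profile("toto je maly vzorek ceskeho textu"),
}
print(detect_language("another short piece of english text", models))  # -> English
}}}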

 == Acknowledgements ==
-The author would like to express many thanks to Jan Pomikalek and Pavel Rychly for guidance and key design advice.
-Thanks also to Vlado Benko and Nikola Ljubesic for ideas for improvement.
+The author would like to express many thanks to Jan Pomikálek, Pavel Rychlý
+and Miloš Jakubíček for guidance, key design advice and help with debugging.
+Thanks also to Vlado Benko and Nikola Ljubešić for ideas for improvement.

 This software is developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of Masaryk University in Brno, Czech Republic \\