Changes between Version 8 and Version 9 of SpiderLing


Timestamp: 10/28/16 14:34:14 (8 years ago)
Author: admin
Comment: 0.85


    v8 v9  
    55
    66== Get SpiderLing ==
    7 See [wiki:Downloads] for the latest version. Please note the software is distributed as is, without guaranteed support.
    8 
    9 === Changelog 0.82 → 0.84 ===
     7Download [https://nlp.fi.muni.cz/projects/spiderling/ the latest version]. Please note the software is distributed as is, without guaranteed support.
     8
     9=== Changelog 0.82 → 0.85 ===
    1010* Mistakenly ignored links fixed in process.py
    1111 - the bug caused a significant part of the web not to be crawled
    1212* Multithreading issues fixed (possible race conditions in shared memory)
    1313 - delegated classes for explicit locking of all shared objects (esp. !DomainQueue)
    14 * Deadlock -- the scheduler stops working after some time of successful crawling
    15  - observed in the case of a large scale crawling
    16  - may be related to changing the code because of multithreading problems
    17  - IMPORTANT -- may not be resolved yet, testing in progress, 0.81 should be stable
     14* Deadlock observed during large-scale crawling fixed (present up to v. 0.84)
    1815* Several Domain and !DomainQueue methods were rewritten for bulk operations (e.g. adding all paths into a single domain together) to improve performance
     16* External Robot exclusion protocol module rewritten
     17 - unused code removed
     18 - performance issues and a bug in re fixed -- requires re2 now
     19* Chunked HTTP response and URL handling methods rewritten (performance, bugs)
    1920* Justext configuration
    2021 - more permissive setting than the justext default
     
    8081||Turkish          ||   2700||   26||  0.97%||   4.1||   14||
    8182
    82 == README ==
     83== Requires ==
    8384{{{
    84 == Requires ==
    85852.7.9 <= python < 3,
    8686pypy >= 2.2.1,
     
    8989lxml >= 2.2.4 (http://lxml.de/),
    9090openssl >= 1.0.1,
     91re2 (python, https://pypi.python.org/pypi/re2/, make sure `import re2` works),
     92or alternatively cffi_re2 (python/pypy, https://pypi.python.org/pypi/cffi_re2,
     93make sure `import cffi_re2` works in this case),
    9194pdftotext, ps2ascii, antiword (text processing tools).
     95}}}
    9296Runs in Linux, tested in Fedora and Ubuntu.
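A quick sanity check of the requirements above could look like this (a sketch only; which interpreter each module must work under follows the notes in the list above, so adjust python/pypy to your setup):
{{{
python -c "import re2"                   # re2 for python, or alternatively:
pypy -c "import cffi_re2"                # cffi_re2 for python/pypy
python -c "import lxml"                  # lxml
command -v pdftotext ps2ascii antiword   # external text processing tools
}}}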
    9397Minimum hardware configuration (very small crawls):
     
    98102Recommended hardware configuration (crawling ~30 bn words of English text):
    99103- 4-24 core CPU (the more CPUs the faster the processing of crawled data),
    100 - 8-250 GB operational memory
    101     (the more RAM the more domains kept in memory and thus more webs visited),
     104- 8-250 GB operational memory (the more RAM the more domains kept in memory and thus more webs visited),
    102105- lots of storage space,
    103106- connection to an internet backbone line.
    104107
    105108== Includes ==
    106 A robot exclusion rules parser for Python (v. 1.6.2)
    107 - by Philip Semanchuk, BSD Licence
    108 - see util/robotparser.py
    109 Language detection using character trigrams
    110 - by Douglas Bagnall, Python Software Foundation Licence
    111 - see util/trigrams.py
    112 docx2txt
    113 - by Sandeep Kumar, GNU GPL 3+
    114 - see util/doc2txt.pl
     109A robot exclusion rules parser for Python (v. 1.6.2) by Philip Semanchuk, BSD Licence (see util/robotparser.py)
     110Language detection using character trigrams by Douglas Bagnall, Python Software Foundation Licence (see util/trigrams.py)
     111docx2txt by Sandeep Kumar, GNU GPL 3+ (see util/doc2txt.pl)
    115112
    116113== Installation ==
     
    132129
    133130== Language models ==
    134 - plaintext in the target language in util/lang_samples/,
    135     e.g. put plaintexts from several dozen English web documents and
    136     English Wikipedia articles in ./util/lang_samples/English
    137 - jusText stoplist for that language in jusText stoplist path,
    138     e.g. <justext directory>/stoplists/English.txt
    139 - chared model for that language,
    140     e.g. <chared directory>/models/English
     131- plaintext in the target language in util/lang_samples/, e.g. put plaintexts from several dozen English web documents and English Wikipedia articles in ./util/lang_samples/English
     132- jusText stoplist for that language in jusText stoplist path, e.g. <justext directory>/stoplists/English.txt
     133- chared model for that language, e.g. <chared directory>/models/English
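For English, preparing these resources could look roughly like the sketch below. The two input file names are hypothetical, the sketch assumes the language sample is a single concatenated plaintext file, and the justext/chared paths depend on where those packages are installed.
{{{
# concatenate plaintexts of several dozen English web documents and Wikipedia articles
cat english_web_pages.txt english_wikipedia.txt > util/lang_samples/English
# check that the jusText stoplist and the chared model for English are present
ls <justext directory>/stoplists/English.txt
ls <chared directory>/models/English
}}}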
    141134
    142135== Usage ==
    143 pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]
    144 SEED_URLS is a text file containing seed URLs (the crawling starts there),
    145     one per line, specify at least 50 URLs.
    146 SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded,
    147     e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
    148 Running the crawler in the background is recommended.
    149 The crawler creates
     136{{{pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]}}}
     137or
     138{{{pypy spiderling.py [SAVEPOINT_TIMESTAMP] < SEED_URLS}}}
     139
     140{{{SEED_URLS}}} is a text file containing seed URLs (the crawling starts there),
     141one per line; specify at least 50 URLs.
     142{{{SAVEPOINT_TIMESTAMP}}} causes the state from the specified savepoint to be loaded,
     143e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
     144
     145Python can be used instead of pypy if the latter is not available.
     146It is recommended to run the crawler in `screen`.
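A typical session might then look like this (seed_urls.txt is a hypothetical file name; the savepoint timestamp is the example used above):
{{{
screen -S spiderling                      # run the crawler inside screen
pypy spiderling.py seed_urls.txt          # start crawling from the seed URLs
# later, resume from a savepoint, i.e. from files spiderling.state-121111224202-*:
pypy spiderling.py seed_urls.txt 121111224202
}}}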
     147
     148Files created by the crawler:
    150149- *.log.* .. log & debug files,
    151150- *.arc.gz .. gzipped arc files (raw http responses),
     
    153152- *.duplicates .. files with IDs of duplicate documents,
    154153- *.domain_{bad,oversize,distance} .. see util/domain.py,
    155 - *.ignored_refs .. links ignored for configurable or hardcoded reasons,
    156 - *.unproc_urls .. urls of non html documents that were not processed (bug).
    157 To remove duplicate documents from preverticals, run
     154- *.links_unproc .. unprocessed urls from removed domains,
     155- *.links_ignored .. urls not passing domain blacklist or TLD filter,
     156- *.links_binfile .. urls of binary files (pdf, ps, doc, docx) not processed
     157    when config.CONVERSION_ENABLED is disabled,
     158- *.state* .. savepoints that can be used for a new run (not tested much).
     159
     160To remove duplicate documents from preverticals, run
     161{{{
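# replace <config.DOC_PROCESSOR_COUNT - 1> below with the value from util/config.py,
# e.g. with 16 document processors the loop header is: for i in $(seq 0 15)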
    158162rm spiderling.prevert
    159 for i in $(seq 0 15)
     163for i in $(seq 0 <config.DOC_PROCESSOR_COUNT - 1>)
    160164do
    161165    pypy util/remove_duplicates.py spiderling.${i}.duplicates \
    162166        < spiderling.${i}.prevert_d >> spiderling.prevert
    163167done
     168}}}
    164169File spiderling.prevert is the final output.
     170
    165171To stop the crawler before MAX_RUN_TIME, send SIGTERM to the main process
    166172(spiderling.py).
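One possible way to do that (a sketch; it assumes a single spiderling.py instance is running on the machine):
{{{
pkill -TERM -f spiderling.py
}}}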
    167173To re-process arc files with the current process.py and util/config.py, run
    168 zcat spiderling.*.arc.gz | pypy reprocess.py
     174{{{zcat spiderling.*.arc.gz | pypy reprocess.py}}}
    169175
    170176== Performance tips ==
     
    177183
    178184== Known bugs ==
     185- Some non-critical I/O errors are output to stderr.
    179186- Domain distances should be made part of document metadata instead of storing
    180187  them in a separate file.
    181 - Non html documents are not processed (urls stored in *.unproc_urls instead).
     188- Processing binary files (pdf, ps, doc, docx) is disabled by default since it
     189  was not tested and may slow processing significantly.
    182190- DNS resolvers are implemented as blocking threads => useless to have more
    183191  than one; they will be changed to separate processes in the future.
     
    186194- Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
    187195  Supporting them would require major changes in the design of the download scheduler.
     196  A warning is emitted when the crawl delay < config.HOST_CONN_INTERVAL.
    188197
    189198== Support ==
     
    191200free time.) Please note the tool is distributed as is; it may not work under
    192201your conditions.
    193 
    194 }}}
    195202
    196203== Acknowledgements ==