Changes between Version 8 and Version 9 of SpiderLing


Timestamp: 10/28/16 14:34:14 (8 years ago)
Author: admin
Comment: 0.85


    v8 v9  
    55
    66== Get SpiderLing ==
    7 See [wiki:Downloads] for the latest version. Please note the software is distributed as is, without guaranteed support.
    8 
    9 === Changelog 0.82 → 0.84 ===
     7Download [https://nlp.fi.muni.cz/projects/spiderling/ the latest version]. Please note the software is distributed as is, without guaranteed support.
     8
     9=== Changelog 0.82 → 0.85 ===
    1010* Mistakenly ignored links fixed in process.py
    1111 - the bug caused a significant part of the web not to be crawled
    1212* Multithreading issues fixed (possible race conditions in shared memory)
    1313 - delegated classes for explicit locking of all shared objects (esp. !DomainQueue)
    14 * Deadlock -- the scheduler stops working after some time of successful crawling
    15  - observed in the case of a large scale crawling
    16  - may be related to changing the code because of multithreading problems
    17  - IMPORTANT -- may not be resolved yet, testing in progress, 0.81 should be stable
     14* Deadlock observed during large-scale crawling fixed (present up to v. 0.84)
    1815* Several Domain and !DomainQueue methods were rewritten for bulk operations (e.g. adding all paths into a single domain together) to improve performance
     16* External Robot exclusion protocol module rewritten
     17 - unused code removed
     18 - performance issues and a bug in re fixed -- requires re2 now
     19* Chunked HTTP response and URL handling methods rewritten (performance, bugs)
    1920* Justext configuration
    2021 - more permissive setting than the justext default
     
    8081||Turkish          ||   2700||   26||  0.97%||   4.1||   14||
    8182
    82 == README ==
     83== Requires ==
    8384{{{
    84 == Requires ==
    85852.7.9 <= python < 3,
    8686pypy >= 2.2.1,
     
    8989lxml >= 2.2.4 (http://lxml.de/),
    9090openssl >= 1.0.1,
     91re2 (python, https://pypi.python.org/pypi/re2/, make sure `import re2` works),
     92or alternatively cffi_re2 (python/pypy, https://pypi.python.org/pypi/cffi_re2,
     93make sure `import cffi_re2` works in this case),
    9194pdftotext, ps2ascii, antiword (text processing tools).
     95}}}
    9296Runs in Linux, tested in Fedora and Ubuntu.
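A quick sanity check of the requirements above could look like this (a sketch only; which interpreter each module must work under follows the notes in the list above, so adjust python/pypy to your setup):
{{{
python -c "import re2"                   # re2 for python, or alternatively:
pypy -c "import cffi_re2"                # cffi_re2 for python/pypy
python -c "import lxml"                  # lxml
command -v pdftotext ps2ascii antiword   # external text processing tools
}}}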
    9397Minimum hardware configuration (very small crawls):
     
    98102Recommended hardware configuration (crawling ~30 bn words of English text):
    99103- 4-24 core CPU (the more CPUs the faster the processing of crawled data),
    100 - 8-250 GB operational memory
    101     (the more RAM the more domains kept in memory and thus more webs visited),
     104- 8-250 GB operational memory (the more RAM the more domains kept in memory and thus more webs visited),
    102105- lots of storage space,
    103106- connection to an internet backbone line.
    104107
    105108== Includes ==
    106 A robot exclusion rules parser for Python (v. 1.6.2)
    107 - by Philip Semanchuk, BSD Licence
    108 - see util/robotparser.py
    109 Language detection using character trigrams
    110 - by Douglas Bagnall, Python Software Foundation Licence
    111 - see util/trigrams.py
    112 docx2txt
    113 - by Sandeep Kumar, GNU GPL 3+
    114 - see util/doc2txt.pl
     109A robot exclusion rules parser for Python (v. 1.6.2) by Philip Semanchuk, BSD Licence (see util/robotparser.py)
     110Language detection using character trigrams by Douglas Bagnall, Python Software Foundation Licence (see util/trigrams.py)
     111docx2txt by Sandeep Kumar, GNU GPL 3+ (see util/doc2txt.pl)
    115112
    116113== Installation ==
     
    132129
    133130== Language models ==
    134 - plaintext in the target language in util/lang_samples/,
    135     e.g. put plaintexts from several dozen English web documents and
    136     English Wikipedia articles in ./util/lang_samples/English
    137 - jusText stoplist for that language in jusText stoplist path,
    138     e.g. <justext directory>/stoplists/English.txt
    139 - chared model for that language,
    140     e.g. <chared directory>/models/English
     131- plaintext in the target language in util/lang_samples/, e.g. put plaintexts from several dozen English web documents and English Wikipedia articles in ./util/lang_samples/English
     132- jusText stoplist for that language in jusText stoplist path, e.g. <justext directory>/stoplists/English.txt
     133- chared model for that language, e.g. <chared directory>/models/English
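For English, preparing these resources could look roughly like the sketch below. The two input file names are hypothetical, the sketch assumes the language sample is a single concatenated plaintext file, and the justext/chared paths depend on where those packages are installed.
{{{
# concatenate plaintexts of several dozen English web documents and Wikipedia articles
cat english_web_pages.txt english_wikipedia.txt > util/lang_samples/English
# check that the jusText stoplist and the chared model for English are present
ls <justext directory>/stoplists/English.txt
ls <chared directory>/models/English
}}}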
    141134
    142135== Usage ==
    143 pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]
    144 SEED_URLS is a text file containing seed URLs (the crawling starts there),
    145     one per line, specify at least 50 URLs.
    146 SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded,
    147     e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
    148 Running the crawler in the background is recommended.
    149 The crawler creates
     136{{{pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]}}}
     137or
     138{{{pypy spiderling.py [SAVEPOINT_TIMESTAMP] < SEED_URLS}}}
     139
     140{{{SEED_URLS}}} is a text file containing seed URLs (the crawling starts there),
     141one per line; specify at least 50 URLs.
     142{{{SAVEPOINT_TIMESTAMP}}} causes the state from the specified savepoint to be loaded,
     143e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
     144
     145Python can be used instead of pypy if the latter is not available.
     146It is recommended to run the crawler in `screen`.
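A typical session might then look like this (seed_urls.txt is a hypothetical file name; the savepoint timestamp is the example used above):
{{{
screen -S spiderling                      # run the crawler inside screen
pypy spiderling.py seed_urls.txt          # start crawling from the seed URLs
# later, resume from a savepoint, i.e. from files spiderling.state-121111224202-*:
pypy spiderling.py seed_urls.txt 121111224202
}}}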
     147
     148Files created by the crawler:
    150149- *.log.* .. log & debug files,
    151150- *.arc.gz .. gzipped arc files (raw http responses),
     
    153152- *.duplicates .. files with IDs of duplicate documents,
    154153- *.domain_{bad,oversize,distance} .. see util/domain.py,
    155 - *.ignored_refs .. links ignored for configurable or hardcoded reasons,
    156 - *.unproc_urls .. urls of non html documents that were not processed (bug).
    157 To remove duplicate documents from preverticals, run
     154- *.links_unproc .. unprocessed urls from removed domains,
     155- *.links_ignored .. urls not passing domain blacklist or TLD filter,
     156- *.links_binfile .. urls of binary files (pdf, ps, doc, docx) not processed
     157    when config.CONVERSION_ENABLED is disabled,
     158- *.state* .. savepoints that can be used for a new run (not tested much).
     159
     160To remove duplicate documents from preverticals, run
     161{{{
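# replace <config.DOC_PROCESSOR_COUNT - 1> below with the value from util/config.py,
# e.g. with 16 document processors the loop header is: for i in $(seq 0 15)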
    158162rm spiderling.prevert
    159 for i in $(seq 0 15)
     163for i in $(seq 0 <config.DOC_PROCESSOR_COUNT - 1>)
    160164do
    161165    pypy util/remove_duplicates.py spiderling.${i}.duplicates \
    162166        < spiderling.${i}.prevert_d >> spiderling.prevert
    163167done
     168}}}
    164169File spiderling.prevert is the final output.
     170
    165171To stop the crawler before MAX_RUN_TIME, send SIGTERM to the main process
    166172(spiderling.py).
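One possible way to do that (a sketch; it assumes a single spiderling.py instance is running on the machine):
{{{
pkill -TERM -f spiderling.py
}}}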
    167173To re-process arc files with the current process.py and util/config.py, run
    168 zcat spiderling.*.arc.gz | pypy reprocess.py
     174{{{zcat spiderling.*.arc.gz | pypy reprocess.py}}}
    169175
    170176== Performance tips ==
     
    177183
    178184== Known bugs ==
     185- Some non-critical I/O errors are output to stderr.
    179186- Domain distances should be made part of document metadata instead of storing
    180187  them in a separate file.
    181 - Non html documents are not processed (urls stored in *.unproc_urls instead).
     188- Processing binary files (pdf, ps, doc, docx) is disabled by default since it
     189  was not tested and may slow processing significantly.
    182190- DNS resolvers are implemented as blocking threads => useless to have more
    183191  than one; they will be changed to separate processes in the future.
     
    186194- Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
    187195  Supporting them would require major changes in the design of the download scheduler.
     196  A warning is emitted when the crawl delay < config.HOST_CONN_INTERVAL.
    188197
    189198== Support ==
     
    191200free time.) Please note the tool is distributed as is; it may not work under
    192201your conditions.
    193 
    194 }}}
    195202
    196203== Acknowledgements ==