Changes between Version 8 and Version 9 of SpiderLing
Timestamp: 10/28/16 14:34:14
== Get SpiderLing ==
Download [https://nlp.fi.muni.cz/projects/spiderling/ the latest version]. Please note the software is distributed as is, without guaranteed support.

=== Changelog 0.82 → 0.85 ===
 * Mistakenly ignored links fixed in process.py
   - the bug caused a significant part of the web not to be crawled
 * Multithreading issues fixed (possible race conditions in shared memory)
   - delegated classes for explicit locking of all shared objects (esp. !DomainQueue)
 * Deadlock observed in the case of large scale crawling fixed (up to v. 0.84)
 * Several Domain and !DomainQueue methods were rewritten for bulk operations (e.g. adding all paths into a single domain together) to improve performance
 * External Robot exclusion protocol module rewritten
   - unused code removed
   - performance issues and a bug in re fixed -- requires re2 now
 * Chunked HTTP response and URL handling methods rewritten (performance, bugs)
 * Justext configuration
   - more permissive setting than the justext default
…
||Turkish || 2700|| 26|| 0.97%|| 4.1|| 14||

== Requires ==
{{{
2.7.9 <= python < 3,
pypy >= 2.2.1,
…
lxml >= 2.2.4 (http://lxml.de/),
openssl >= 1.0.1,
re2 (python, https://pypi.python.org/pypi/re2/, make sure `import re2` works),
or alternatively cffi_re2 (python/pypy, https://pypi.python.org/pypi/cffi_re2,
make sure `import cffi_re2` works in this case),
pdftotext, ps2ascii, antiword (text processing tools).
}}}
A quick check of the re2/cffi_re2 requirement is sketched at the end of this section.

Runs in Linux, tested in Fedora and Ubuntu.
Minimum hardware configuration (very small crawls):
…
Recommended hardware configuration (crawling ~30 bn words of English text):
- 4-24 core CPU (the more CPUs, the faster the processing of crawled data),
- 8-250 GB operational memory (the more RAM, the more domains kept in memory and thus more webs visited),
- lots of storage space,
- connection to an internet backbone line.
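Since the README asks you to make sure `import re2` (or `import cffi_re2`) works, it may be worth checking this before starting a long crawl. A minimal sketch, assuming `python` is a CPython 2 interpreter and `pypy` is on the PATH; only the binding matching the interpreter you run the crawler with is needed:
{{{
# Each command prints OK if the binding is importable,
# otherwise the import error is shown.
python -c 'import re2; print "re2 OK"'
pypy -c 'import cffi_re2; print "cffi_re2 OK"'
}}}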
== Includes ==
A robot exclusion rules parser for Python (v. 1.6.2) by Philip Semanchuk, BSD Licence (see util/robotparser.py)
Language detection using character trigrams by Douglas Bagnall, Python Software Foundation Licence (see util/trigrams.py)
docx2txt by Sandeep Kumar, GNU GPL 3+ (see util/doc2txt.pl)

== Installation ==
…

== Language models ==
- plaintext in the target language in util/lang_samples/, e.g. put plaintext from several dozen English web documents and English Wikipedia articles in ./util/lang_samples/English (a layout sketch follows this list),
- a jusText stoplist for that language in the jusText stoplist path, e.g. <justext directory>/stoplists/English.txt,
- a chared model for that language, e.g. <chared directory>/models/English.
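A minimal sketch of preparing these resources for one target language (English here), assuming the samples are kept as a single concatenated plaintext file named after the language; the sample source directories and the two *_DIR paths below are placeholders, not part of SpiderLing:
{{{
# Concatenate plaintext from web documents and Wikipedia articles into the
# language sample file (assumed layout: one file per language under util/lang_samples/).
mkdir -p util/lang_samples
cat english_web_pages/*.txt english_wikipedia/*.txt > util/lang_samples/English

# Placeholder install paths -- replace with your actual jusText and chared locations,
# then check that the stoplist and the chared model exist.
JUSTEXT_DIR=/path/to/justext
CHARED_DIR=/path/to/chared
ls "$JUSTEXT_DIR/stoplists/English.txt" "$CHARED_DIR/models/English"
}}}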
== Usage ==
{{{pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]}}}
or
{{{pypy spiderling.py [SAVEPOINT_TIMESTAMP] < SEED_URLS}}}

{{{SEED_URLS}}} is a text file containing seed URLs (the crawling starts there), one per line; specify at least 50 URLs.
{{{SAVEPOINT_TIMESTAMP}}} causes the state from the specified savepoint to be loaded, e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.

Python can be used instead of pypy if the latter is not available.
It is recommended to run the crawler in `screen`.

Files created by the crawler:
- *.log.* .. log & debug files,
- *.arc.gz .. gzipped arc files (raw http responses),
…
- *.duplicates .. files with duplicate document IDs,
- *.domain_{bad,oversize,distance} .. see util/domain.py,
- *.links_unproc .. unprocessed urls from removed domains,
- *.links_ignored .. urls not passing the domain blacklist or TLD filter,
- *.links_binfile .. urls of binary files (pdf, ps, doc, docx) not processed when config.CONVERSION_ENABLED is disabled,
- *.state* .. savepoints that can be used for a new run (not tested much).

To remove duplicate documents from preverticals, run
{{{
rm spiderling.prevert
for i in $(seq 0 <config.DOC_PROCESSOR_COUNT - 1>)
do
    pypy util/remove_duplicates.py spiderling.${i}.duplicates \
        < spiderling.${i}.prevert_d >> spiderling.prevert
done
}}}
File spiderling.prevert is the final output.

To stop the crawler before MAX_RUN_TIME, send SIGTERM to the main process (spiderling.py).
To re-process arc files with the current process.py and util/config.py, run
{{{zcat spiderling.*.arc.gz | pypy reprocess.py}}}

== Performance tips ==
…

== Known bugs ==
- Some non-critical I/O errors are output to stderr.
- Domain distances should be made part of document metadata instead of being stored in a separate file.
- Processing binary files (pdf, ps, doc, docx) is disabled by default since it was not tested and may slow processing significantly.
- DNS resolvers are implemented as blocking threads => useless to have more than one; will be changed to separate processes in the future.
…
- Some advanced features of robots.txt are not observed, e.g. Crawl-delay. Observing them would require major changes in the design of the download scheduler. A warning is emitted when the crawl delay < config.HOST_CONN_INTERVAL.

== Support ==
…
free time.) Please note the tool is distributed as is, it may not work under your conditions.

== Acknowledgements ==