== README ==
{{{
== Requires ==
pypy >= 2.2.1,
2.7 <= python < 3,
justext >= 1.2,
chared >= 1.3,
lxml >= 2.2.4,
text processing tools: pdftotext, ps2ascii, antiword,
works on Ubuntu (Debian Linux), does not work on Windows.
Minimum hardware configuration (very small crawls):
- 2 core CPU,
- 4 GB system memory,
- some storage space,
- broadband internet connection.
Recommended hardware configuration (crawling ~30 bn words of English text):
- 4-24 core CPU (the more cores, the faster the crawled data is processed),
- 8-250 GB RAM
  (the more RAM, the more domains are kept in memory and thus more websites visited),
- lots of storage space,
- connection to an internet backbone line.

== Includes ==
A robot exclusion rules parser for Python by Philip Semanchuk (v. 1.6.2)
... see util/robotparser.py
Language detection using character trigrams by Douglas Bagnall
... see util/trigrams.py
docx2txt by Sandeep Kumar
... see util/doc2txt.pl

== Installation ==
- unpack,
- install the required tools,
- check that justext.core and chared.detector can be imported by pypy
  (a quick check is sketched below),
- make sure the crawler can write to its directory and to config.PIPE_DIR.
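
The following sketch, run with pypy from the crawler directory, checks the
Python modules and the external text processing tools at once; the module and
tool names come from the requirements above, the script itself is not part of
the distribution:
# check_deps.py -- rough sanity check of the requirements
import distutils.spawn
for module in ('justext.core', 'chared.detector', 'lxml.etree'):
    try:
        __import__(module)
        print('OK      %s' % module)
    except ImportError as exc:
        print('MISSING %s (%s)' % (module, exc))
for tool in ('pdftotext', 'ps2ascii', 'antiword'):
    found = distutils.spawn.find_executable(tool)
    print('%s %s' % ('OK     ' if found else 'MISSING', tool))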

== Settings -- edit util/config.py ==
- !!!IMPORTANT!!! Set AGENT, AGENT_URL, USER_AGENT,
- raise ulimit -n according to MAX_OPEN_CONNS,
- set MAX_RUN_TIME to specify the maximum crawling time in seconds,
- set DOC_PROCESSOR_COUNT to (partially) control CPU usage,
- configure language-dependent settings,
- set MAX_DOMS_READY to (partially) control memory usage,
- set MAX_DOMS_WAITING_FOR_SINGLE_IP, MAX_IP_REPEAT,
- set MAX_OPEN_CONNS, IP_CONN_INTERVAL, HOST_CONN_INTERVAL,
- set and mkdir PIPE_DIR (pipes for communication of subprocesses);
  example values are sketched below.
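
The setting names below are the ones listed above; the values are only
illustrative and the comments guess at their meaning, so adjust everything to
your hardware, bandwidth and target:
# excerpt of util/config.py -- example values only
AGENT = 'MyCrawler'                       # identify your crawler
AGENT_URL = 'http://example.org/crawler'  # page describing your crawl
USER_AGENT = 'MyCrawler (+http://example.org/crawler)'
MAX_RUN_TIME = 7 * 24 * 3600              # stop crawling after one week
DOC_PROCESSOR_COUNT = 4                   # more document processors = more CPU used
MAX_DOMS_READY = 20000                    # more domains kept ready = more RAM used
MAX_DOMS_WAITING_FOR_SINGLE_IP = 10
MAX_IP_REPEAT = 10
MAX_OPEN_CONNS = 400                      # keep `ulimit -n` above this value
IP_CONN_INTERVAL = 10                     # seconds between connections to one IP (assumed)
HOST_CONN_INTERVAL = 10                   # seconds between connections to one host (assumed)
PIPE_DIR = 'pipes/'                       # must exist and be writable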

== Language models ==
- plaintext in the target language in util/lang_samples/,
  e.g. put plaintexts from several dozens of English web documents and
  English Wikipedia articles in ./util/lang_samples/English,
- jusText stoplist for that language in the jusText stoplist path,
  e.g. <justext directory>/stoplists/English.txt,
- chared model for that language,
  e.g. <chared directory>/models/English.
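
To check that all three resources are in place for a language, e.g. English,
something like the following sketch can be used; it assumes the jusText and
chared directories are those of the installed packages:
# check_lang.py -- rough check of the language resources
import os
import justext.core, chared.detector
LANG = 'English'
for path in (
        os.path.join('util', 'lang_samples', LANG),
        os.path.join(os.path.dirname(justext.core.__file__), 'stoplists', LANG + '.txt'),
        os.path.join(os.path.dirname(chared.detector.__file__), 'models', LANG)):
    print('%s %s' % ('OK     ' if os.path.exists(path) else 'MISSING', path))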

== Usage ==
pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]
SEED_URLS is a text file containing seed URLs (the crawling starts there),
one per line; specify at least 50 URLs.
SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded,
e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
Running the crawler in the background is recommended.
The crawler creates
- *.log.* .. log & debug files,
- *.arc.gz .. gzipped arc files (raw HTTP responses),
- *.prevert_d .. preverticals with duplicate documents,
- *.duplicates .. files listing duplicate document IDs,
- *.unproc_urls .. URLs of non-HTML documents that were not processed (known bug).
To remove duplicate documents from the preverticals, run
rm spiderling.prevert
for i in $(seq 0 15)
do
  pypy util/remove_duplicates.py spiderling.${i}.duplicates \
    < spiderling.${i}.prevert_d >> spiderling.prevert
done
(adjust the sequence to the number of spiderling.*.prevert_d files produced).
File spiderling.prevert is the final output.
To stop the crawler before MAX_RUN_TIME elapses, send SIGTERM to the main
process (spiderling.py).
To re-process the arc files with the current process.py and util/config.py, run
zcat spiderling.*.arc.gz | pypy reprocess.py

== Performance tips ==
- Using PyPy reduces the CPU and memory cost (saves approx. 1/4 of the RAM).
- Set ALLOWED_TLDS_RE to avoid crawling domains not in the target language
  (detecting that otherwise takes some resources); an example follows the list.
- Set LOG_LEVEL to debug and set INFO_PERIOD, then check the debug output in
  *.log.crawl and *.log.eval to see where the bottleneck is and modify the
  settings accordingly, e.g. add doc processors if the doc queue is always full.
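
For instance, a crawl targeting English might allow only a few national TLDs;
the regular expression below is only an illustration, check util/config.py for
the exact form it expects:
import re
# hypothetical value: accept only hosts whose name ends in one of these TLDs
ALLOWED_TLDS_RE = re.compile(r'\.(uk|ie|au|nz|ca|us)$')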

== Known bugs ==
- Non-HTML documents are not processed (their URLs are stored in *.unproc_urls
  instead).
- DNS resolvers are implemented as blocking threads, so it is useless to have
  more than one; they will be changed to separate processes in the future.
- Compressed connections are not accepted/processed. Some servers might be
  discouraged from sending an uncompressed response (not tested).
- Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
  Observing them would require major changes in the design of the download
  scheduler.
- HTTPS requests are not implemented properly and may not work at all.
- Path bloating, e.g. http://example.com/a/ yielding the same content as
  http://example.com/a/a/ etc., should be avoided; it might be a bot trap.
  A detection sketch follows the list.
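
One possible heuristic for spotting such URLs in your own post-processing
(a sketch, not part of the crawler):
from urlparse import urlsplit  # Python 2 / PyPy

def looks_bloated(url, max_repeats=2):
    # flag URLs in which some path segment occurs more than max_repeats times
    segments = [s for s in urlsplit(url).path.split('/') if s]
    return any(segments.count(s) > max_repeats for s in set(segments))

print(looks_bloated('http://example.com/a/a/a/'))  # True
print(looks_bloated('http://example.com/a/b/'))    # False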
}}}