Changes between Version 5 and Version 6 of SpiderLing


Timestamp: 01/21/16 14:16:37
Author: admin
Comment: update, url fixes, contact, readme

We presented our results at the following venues:

[http://trac.sketchengine.co.uk/raw-attachment/wiki/AK/Papers/tentens_14may2013.docx The TenTen Corpus Family]\\
by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel\\
at [http://ucrel.lancs.ac.uk/cl2013/ 7th International Corpus Linguistics Conference], Lancaster, July 2013.

[http://nlp.fi.muni.cz/~xsuchom2/papers/BaisaSuchomel_TurkicResources.pdf Large Corpora for Turkic Languages and Unsupervised Morphological Analysis]\\
by Vít Baisa, Vít Suchomel\\
at [http://multisaund.eu/lrec2012_turkiclanguage.php Language Resources and Technologies for Turkic Languages] (at conference LREC), Istanbul, May 2012

[http://nlp.fi.muni.cz/~xsuchom2/papers/PomikalekSuchomel_SpiderlingEfficiency.pdf Efficient Web Crawling for Large Text Corpora]\\
by Jan Pomikálek, Vít Suchomel\\
at [http://sigwac.org.uk/wiki/WAC7 ACL SIGWAC Web as Corpus] (at conference WWW), Lyon, April 2012

||Turkish          ||   2700||   26||  0.97%||   4.1||   14||

== README ==
{{{
== Requires ==
pypy >= 2.2.1,
2.7 <= python < 3,
justext >= 1.2,
chared >= 1.3,
lxml >= 2.2.4,
text processing tools: pdftotext, ps2ascii, antiword (install hint below),
works on Ubuntu (Debian Linux), does not work on Windows.
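The external text processing tools can be installed on Ubuntu e.g. with:
sudo apt-get install poppler-utils ghostscript antiword
(pdftotext comes with poppler-utils, ps2ascii with ghostscript).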
Minimum hardware configuration (very small crawls):
- 2 core CPU,
- 4 GB system memory,
- some storage space,
- broadband internet connection.
Recommended hardware configuration (crawling ~30 bn words of English text):
- 4-24 core CPU (the more cores, the faster the processing of crawled data),
- 8-250 GB system memory
    (the more RAM, the more domains kept in memory and thus more websites visited),
- lots of storage space,
- connection to an internet backbone line.

== Includes ==
A robot exclusion rules parser for Python by Philip Semanchuk (v. 1.6.2)
... see util/robotparser.py
Language detection using character trigrams by Douglas Bagnall
... see util/trigrams.py
docx2txt by Sandeep Kumar
... see util/doc2txt.pl
== Installation ==
- unpack,
- install required tools,
- check that justext.core and chared.detector can be imported by pypy (see the check below),
- make sure the crawler can write to its directory and config.PIPE_DIR.
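A quick way to run the import check, e.g.:
pypy -c 'import justext.core, chared.detector'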

== Settings -- edit util/config.py ==
- !!!IMPORTANT!!! Set AGENT, AGENT_URL, USER_AGENT,
- raise ulimit -n according to MAX_OPEN_CONNS,
- set MAX_RUN_TIME to specify max crawling time in seconds,
- set DOC_PROCESSOR_COUNT to (partially) control CPU usage,
- configure language dependent settings,
- set MAX_DOMS_READY to (partially) control memory usage,
- set MAX_DOMS_WAITING_FOR_SINGLE_IP, MAX_IP_REPEAT,
- set MAX_OPEN_CONNS, IP_CONN_INTERVAL, HOST_CONN_INTERVAL,
- set and mkdir PIPE_DIR (pipes used for communication between subprocesses); an illustrative excerpt follows.
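An excerpt of util/config.py might look like this (the variable names come from
the list above, the values are examples only -- tune them to your site and hardware):
AGENT = 'MyCrawler'          # identify your crawler honestly
AGENT_URL = 'http://example.org/crawler-info.html'
USER_AGENT = 'MyCrawler (+http://example.org/crawler-info.html)'
MAX_RUN_TIME = 72 * 3600     # max crawling time in seconds (three days here)
DOC_PROCESSOR_COUNT = 4      # more processors use more CPU
MAX_OPEN_CONNS = 400         # raise ulimit -n above this value
PIPE_DIR = '/tmp/spiderling_pipes/'  # must exist (mkdir it) and be writable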

== Language models ==
- plaintext in the target language in util/lang_samples/,
    e.g. put plaintexts from several dozen English web documents and
    English Wikipedia articles in ./util/lang_samples/English (see the sketch below),
- jusText stoplist for that language in the jusText stoplist path,
    e.g. <justext directory>/stoplists/English.txt
- chared model for that language,
    e.g. <chared directory>/models/English
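For example, the English sample data can be built from a few dozen plain text
files (the input file names below are only illustrative):
mkdir -p util/lang_samples
cat english_web_doc_*.txt english_wiki_*.txt > util/lang_samples/English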

== Usage ==
pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]
SEED_URLS is a text file containing seed URLs (the crawling starts there),
    one per line, specify at least 50 URLs.
SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded,
    e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
Running the crawler in the background is recommended.
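For example (my_seed_urls and spiderling.out are just illustrative file names):
nohup pypy spiderling.py my_seed_urls > spiderling.out 2>&1 &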
The crawler creates
- *.log.* .. log & debug files,
- *.arc.gz .. gzipped arc files (raw http responses),
- *.prevert_d .. preverticals with duplicate documents,
- *.duplicates .. files with IDs of duplicate documents,
- *.unproc_urls .. URLs of non-HTML documents that were not processed (known bug).
To remove duplicate documents from preverticals, run
rm spiderling.prevert
for i in $(seq 0 15)
do
    pypy util/remove_duplicates.py spiderling.${i}.duplicates \
        < spiderling.${i}.prevert_d >> spiderling.prevert
done
File spiderling.prevert is the final output.
To stop the crawler before MAX_RUN_TIME, send SIGTERM to the main process
(spiderling.py).
To re-process arc files with current process.py and util/config.py, run
zcat spiderling.*.arc.gz | pypy reprocess.py

== Performance tips ==
- Using PyPy reduces CPU and memory cost (saves approx. 1/4 of RAM).
- Set ALLOWED_TLDS_RE to avoid crawling domains not in the target language
  (detecting the language otherwise takes some resources).
- Set LOG_LEVEL to debug and set INFO_PERIOD to check the debug output in
  *.log.crawl and *.log.eval to see where the bottleneck is; then modify the
  settings accordingly, e.g. add doc processors if the doc queue is always full.

== Known bugs ==
- Non-HTML documents are not processed (URLs stored in *.unproc_urls instead).
- DNS resolvers are implemented as blocking threads, so it is useless to have
  more than one; they will be changed to separate processes in the future.
- Compressed connections are not accepted/processed. Some servers might be
  discouraged from sending an uncompressed response (not tested).
- Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
  Observing them would require major changes in the design of the download scheduler.
- HTTPS requests are not implemented properly and may not work at all.
- Path bloating, e.g. http://example.com/a/ yielding the same content as
  http://example.com/a/a/ etc., should be avoided. It might be a bot trap.
}}}

== Acknowledgements ==
The author would like to express many thanks to Jan Pomikálek and Pavel Rychlý for guidance and key design advice.
Thanks also to Vlado Benko and Nikola Ljubešić for ideas for improvement.

This software is developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of Masaryk University in Brno, Czech Republic \\
in cooperation with [http://lexicalcomputing.com/ Lexical Computing Ltd.], UK, a [http://www.sketchengine.co.uk/ corpus tool] company.

== Contact ==
{{{'zc.inum.if@2mohcusx'[::-1]}}}

== Licence ==