== README ==
{{{
== Requires ==
pypy >= 2.2.1,
2.7 <= python < 3,
justext >= 1.2,
chared >= 1.3,
lxml >= 2.2.4,
text processing tools: pdftotext, ps2ascii, antiword,
works on Ubuntu (Debian Linux), does not work on Windows.
Minimum hardware configuration (very small crawls):
- 2 core CPU,
- 4 GB system memory,
- some storage space,
- broadband internet connection.
Recommended hardware configuration (crawling ~30 bn words of English text):
- 4-24 core CPU (the more cores, the faster the crawled data is processed),
- 8-250 GB RAM
  (the more RAM, the more domains are kept in memory and thus more websites visited),
- lots of storage space,
- connection to an internet backbone line.

== Includes ==
A robot exclusion rules parser for Python by Philip Semanchuk (v. 1.6.2)
... see util/robotparser.py
Language detection using character trigrams by Douglas Bagnall
... see util/trigrams.py
docx2txt by Sandeep Kumar
... see util/doc2txt.pl

== Installation ==
- unpack,
- install the required tools,
- check that justext.core and chared.detector can be imported by pypy
  (a quick check is sketched below),
- make sure the crawler can write to its directory and to config.PIPE_DIR.
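
The following sketch, run with pypy from the crawler directory, checks the
Python modules and the external text processing tools at once; the module and
tool names come from the requirements above, the script itself is not part of
the distribution:
# check_deps.py -- rough sanity check of the requirements
import distutils.spawn
for module in ('justext.core', 'chared.detector', 'lxml.etree'):
    try:
        __import__(module)
        print('OK      %s' % module)
    except ImportError as exc:
        print('MISSING %s (%s)' % (module, exc))
for tool in ('pdftotext', 'ps2ascii', 'antiword'):
    found = distutils.spawn.find_executable(tool)
    print('%s %s' % ('OK     ' if found else 'MISSING', tool))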

== Settings -- edit util/config.py ==
- !!!IMPORTANT!!! Set AGENT, AGENT_URL, USER_AGENT,
- raise ulimit -n according to MAX_OPEN_CONNS,
- set MAX_RUN_TIME to specify the maximum crawling time in seconds,
- set DOC_PROCESSOR_COUNT to (partially) control CPU usage,
- configure language-dependent settings,
- set MAX_DOMS_READY to (partially) control memory usage,
- set MAX_DOMS_WAITING_FOR_SINGLE_IP, MAX_IP_REPEAT,
- set MAX_OPEN_CONNS, IP_CONN_INTERVAL, HOST_CONN_INTERVAL,
- set and mkdir PIPE_DIR (pipes for communication of subprocesses);
  example values are sketched below.
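
The setting names below are the ones listed above; the values are only
illustrative and the comments guess at their meaning, so adjust everything to
your hardware, bandwidth and target:
# excerpt of util/config.py -- example values only
AGENT = 'MyCrawler'                       # identify your crawler
AGENT_URL = 'http://example.org/crawler'  # page describing your crawl
USER_AGENT = 'MyCrawler (+http://example.org/crawler)'
MAX_RUN_TIME = 7 * 24 * 3600              # stop crawling after one week
DOC_PROCESSOR_COUNT = 4                   # more document processors = more CPU used
MAX_DOMS_READY = 20000                    # more domains kept ready = more RAM used
MAX_DOMS_WAITING_FOR_SINGLE_IP = 10
MAX_IP_REPEAT = 10
MAX_OPEN_CONNS = 400                      # keep `ulimit -n` above this value
IP_CONN_INTERVAL = 10                     # seconds between connections to one IP (assumed)
HOST_CONN_INTERVAL = 10                   # seconds between connections to one host (assumed)
PIPE_DIR = 'pipes/'                       # must exist and be writable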

== Language models ==
- plaintext in the target language in util/lang_samples/,
  e.g. put plaintexts from several dozens of English web documents and
  English Wikipedia articles in ./util/lang_samples/English,
- jusText stoplist for that language in the jusText stoplist path,
  e.g. <justext directory>/stoplists/English.txt,
- chared model for that language,
  e.g. <chared directory>/models/English.
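
To check that all three resources are in place for a language, e.g. English,
something like the following sketch can be used; it assumes the jusText and
chared directories are those of the installed packages:
# check_lang.py -- rough check of the language resources
import os
import justext.core, chared.detector
LANG = 'English'
for path in (
        os.path.join('util', 'lang_samples', LANG),
        os.path.join(os.path.dirname(justext.core.__file__), 'stoplists', LANG + '.txt'),
        os.path.join(os.path.dirname(chared.detector.__file__), 'models', LANG)):
    print('%s %s' % ('OK     ' if os.path.exists(path) else 'MISSING', path))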

== Usage ==
pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]
SEED_URLS is a text file containing seed URLs (the crawling starts there),
one per line; specify at least 50 URLs.
SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded,
e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
Running the crawler in the background is recommended.
The crawler creates
- *.log.* .. log & debug files,
- *.arc.gz .. gzipped arc files (raw HTTP responses),
- *.prevert_d .. preverticals with duplicate documents,
- *.duplicates .. files listing duplicate document IDs,
- *.unproc_urls .. URLs of non-HTML documents that were not processed (known bug).
To remove duplicate documents from the preverticals, run
rm spiderling.prevert
for i in $(seq 0 15)
do
  pypy util/remove_duplicates.py spiderling.${i}.duplicates \
    < spiderling.${i}.prevert_d >> spiderling.prevert
done
(adjust the sequence to the number of spiderling.*.prevert_d files produced).
File spiderling.prevert is the final output.
To stop the crawler before MAX_RUN_TIME elapses, send SIGTERM to the main
process (spiderling.py).
To re-process the arc files with the current process.py and util/config.py, run
zcat spiderling.*.arc.gz | pypy reprocess.py

== Performance tips ==
- Using PyPy reduces the CPU and memory cost (saves approx. 1/4 of the RAM).
- Set ALLOWED_TLDS_RE to avoid crawling domains not in the target language
  (detecting that otherwise takes some resources); an example follows the list.
- Set LOG_LEVEL to debug and set INFO_PERIOD, then check the debug output in
  *.log.crawl and *.log.eval to see where the bottleneck is and modify the
  settings accordingly, e.g. add doc processors if the doc queue is always full.
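
For instance, a crawl targeting English might allow only a few national TLDs;
the regular expression below is only an illustration, check util/config.py for
the exact form it expects:
import re
# hypothetical value: accept only hosts whose name ends in one of these TLDs
ALLOWED_TLDS_RE = re.compile(r'\.(uk|ie|au|nz|ca|us)$')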

== Known bugs ==
- Non-HTML documents are not processed (their URLs are stored in *.unproc_urls
  instead).
- DNS resolvers are implemented as blocking threads, so it is useless to have
  more than one; they will be changed to separate processes in the future.
- Compressed connections are not accepted/processed. Some servers might be
  discouraged from sending an uncompressed response (not tested).
- Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
  Observing them would require major changes in the design of the download
  scheduler.
- HTTPS requests are not implemented properly and may not work at all.
- Path bloating, e.g. http://example.com/a/ yielding the same content as
  http://example.com/a/a/ etc., should be avoided; it might be a bot trap.
  A detection sketch follows the list.
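
One possible heuristic for spotting such URLs in your own post-processing
(a sketch, not part of the crawler):
from urlparse import urlsplit  # Python 2 / PyPy

def looks_bloated(url, max_repeats=2):
    # flag URLs in which some path segment occurs more than max_repeats times
    segments = [s for s in urlsplit(url).path.split('/') if s]
    return any(segments.count(s) > max_repeats for s in set(segments))

print(looks_bloated('http://example.com/a/a/a/'))  # True
print(looks_bloated('http://example.com/a/b/'))    # False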
}}}