Changes between Version 6 and Version 7 of SpiderLing


Timestamp: 10/13/16 17:36:23
Author: admin
Comment: 0.84

Legend:
  unmodified: lines with no marker (shown with a leading space)
  added (v7): lines prefixed with '+'
  removed (v6): lines prefixed with '-'
  modified: shown as a removed line followed by an added line

• SpiderLing (v6 → v7)
 = !SpiderLing =
-!SpiderLing — a web spider for linguistics — is software for obtaining
-text from the web useful for building text corpora. Many documents on
-the web only contain material not suitable for text corpora, such as
-site navigation, lists of links, lists of products, and other kind of
-text not comprised of full sentences. In fact such pages represent the
-vast majority of the web. Therefore, by doing unrestricted web crawls,
-we typically download a lot of data which gets filtered out during
-post-processing. This makes the process of web corpus collection
-inefficient. The aim of our work is to focus the crawling on the text
-rich parts of the web and maximize the number of words in the final
-corpus per downloaded megabyte.
+!SpiderLing — a web spider for linguistics — is software for obtaining text from the web that is useful for building text corpora.
+Many documents on the web contain only material not suitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not comprised of full sentences. In fact, such pages represent the vast majority of the web. Therefore, by doing unrestricted web crawls, we typically download a lot of data that gets filtered out during post-processing, which makes web corpus collection inefficient.
+The aim of our work is to focus the crawling on the text-rich parts of the web and maximize the number of words in the final corpus per downloaded megabyte. Nevertheless, the crawler can be configured to ignore the yield rate of web domains and download from low-yield sites too.
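The yield rate mentioned above is the number of words kept in the final corpus per downloaded megabyte. A minimal Python sketch of how such a per-domain threshold could work in principle; the function names, variable names and the threshold value are assumptions for illustration, not SpiderLing's actual code or configuration:
{{{
# Illustrative sketch only; names and the threshold value are assumptions,
# not SpiderLing's implementation or its util/config.py settings.

MIN_YIELD_RATE = 100.0      # hypothetical threshold: words per downloaded MB
IGNORE_YIELD_RATE = False   # the crawler can be told to keep low-yield domains too

def yield_rate(clean_words, downloaded_bytes):
    """Words kept in the final corpus per downloaded megabyte."""
    megabytes = downloaded_bytes / (1024.0 * 1024.0)
    return clean_words / megabytes if megabytes else 0.0

def keep_crawling_domain(domain_clean_words, domain_downloaded_bytes):
    """Decide whether a domain is worth further crawling."""
    if IGNORE_YIELD_RATE:
        return True
    return yield_rate(domain_clean_words, domain_downloaded_bytes) >= MIN_YIELD_RATE

print(keep_crawling_domain(50000, 200 * 1024 * 1024))  # 250 words/MB -> True
}}}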

 == Get SpiderLing ==
-See [wiki:Downloads] for the latest version.
+See [wiki:Downloads] for the latest version. '''Please note that the software is distributed as is, without guaranteed support.'''
+
+=== Changelog 0.82 → 0.84 ===
+* Mistakenly ignored links fixed in process.py
+ - the bug caused a significant part of the web not to be crawled
+* Multithreading issues fixed (possible race conditions in shared memory)
+ - delegated classes for explicit locking of all shared objects (esp. DomainQueue)
+* Deadlock -- the scheduler stops working after some time of successful crawling
+ - observed in the case of large-scale crawling
+ - may be related to code changes made because of the multithreading problems
+ - IMPORTANT -- may not be resolved yet, testing in progress; 0.81 should be stable
+* Several Domain and DomainQueue methods were rewritten for bulk operations (e.g. adding all paths into a single domain together) to improve performance
+* Justext configuration
+ - a more permissive setting than the justext default
+ - configurable in util/config.py (see the illustrative sketch below)
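As a hedged illustration of the "more permissive than the default" jusText setting mentioned above, the call below loosens a few thresholds of the standard justext API. The concrete parameter values are assumptions made for this example only; they are not the values shipped in util/config.py.
{{{
# Sketch only: the parameter values are illustrative assumptions,
# not SpiderLing's actual util/config.py settings.
import justext

def extract_clean_paragraphs(html, lang="English"):
    paragraphs = justext.justext(
        html,
        justext.get_stoplist(lang),
        length_low=50,          # accept shorter paragraphs than the default
        stopwords_low=0.2,      # tolerate a lower stop-word density
        stopwords_high=0.3,
        max_link_density=0.3,   # tolerate more links per paragraph
    )
    # Keep only paragraphs that jusText did not classify as boilerplate.
    return [p.text for p in paragraphs if not p.is_boilerplate]

html = b"<html><body><p>An example paragraph long enough to be considered.</p></body></html>"
print(extract_clean_paragraphs(html))
}}}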

 === Changelog 0.81 → 0.82 ===
…
 {{{
 == Requires ==
+2.7.9 <= python < 3,
 pypy >= 2.2.1,
-2.7 <= python < 3,
-justext >= 1.2,
-chared >= 1.3,
-lxml >= 2.2.4,
-text processing tools: pdftotext, ps2ascii, antiword,
-works in Ubuntu (Debian Linux), does not work in Windows.
+justext >= 1.2 (http://corpus.tools/wiki/Justext),
+chared >= 1.3 (http://corpus.tools/wiki/Chared),
+lxml >= 2.2.4 (http://lxml.de/),
+openssl >= 1.0.1,
+pdftotext, ps2ascii, antiword (text processing tools).
+Runs in Linux, tested in Fedora and Ubuntu.
 Minimum hardware configuration (very small crawls):
 - 2 core CPU,
…

 == Includes ==
-A robot exclusion rules parser for Python by Philip Semanchuk (v. 1.6.2)
-... see util/robotparser.py
-Language detection using character trigrams by Douglas Bagnall
-... see util/trigrams.py
-docx2txt by Sandeep Kumar
-... see util/doc2txt.pl
+A robot exclusion rules parser for Python (v. 1.6.2)
+- by Philip Semanchuk, BSD Licence
+- see util/robotparser.py
+Language detection using character trigrams
+- by Douglas Bagnall, Python Software Foundation Licence
+- see util/trigrams.py
+docx2txt
+- by Sandeep Kumar, GNU GPL 3+
+- see util/doc2txt.pl

 == Installation ==
…
 - jusText stoplist for that language in jusText stoplist path,
     e.g. <justext directory>/stoplists/English.txt
 - chared model for that language,
     e.g. <chared directory>/models/English

 == Usage ==
 pypy spiderling.py SEED_URLS [SAVEPOINT_TIMESTAMP]
 SEED_URLS is a text file containing seed URLs (the crawling starts there),
     one per line, specify at least 50 URLs.
 SAVEPOINT_TIMESTAMP causes the state from the specified savepoint to be loaded,
     e.g. '121111224202' causes loading files 'spiderling.state-121111224202-*'.
 Running the crawler in the background is recommended.
 The crawler creates
 - *.log.* .. log & debug files,
 - *.arc.gz .. gzipped arc files (raw http responses),
 - *.prevert_d .. preverticals with duplicate documents,
 - *.duplicates .. files listing duplicate document IDs,
-- *.unproc_urls .. urls of non html documents that were not processed (bug)
-To remove duplicate documents from preverticals, run
+- *.domain_{bad,oversize,distance} .. see util/domain.py,
+- *.ignored_refs .. links ignored for configurable or hardcoded reasons,
+- *.unproc_urls .. urls of non html documents that were not processed (bug).
+To remove duplicate documents from preverticals, run
 rm spiderling.prevert
 for i in $(seq 0 15)
 do
     pypy util/remove_duplicates.py spiderling.${i}.duplicates \
         < spiderling.${i}.prevert_d >> spiderling.prevert
…

 == Known bugs ==
+- Domain distances should be made part of document metadata instead of storing
+  them in a separate file.
 - Non html documents are not processed (urls stored in *.unproc_urls instead).
 - DNS resolvers are implemented as blocking threads => useless to have more
…
 - Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
   It would require major changes in the design of the download scheduler.
-- Https requests are not implemented properly, may not work at all.
-- Path bloating, e.g. http://example.com/a/ yielding the same content as
-  http://example.com/a/a/ etc., should be avoided. Might be a bot trap.
+
+== Support ==
+There is no guaranteed support offered. (The author may help you a bit in his
+free time.) Please note the tool is distributed as is; it may not work under
+your conditions.
+
 }}}
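The character-trigram language detection listed in the Includes section above (util/trigrams.py) can be illustrated with the following minimal sketch. It shows the general technique only, not the bundled implementation, and the tiny training samples are made up for the example.
{{{
# Minimal sketch of character-trigram language detection.
# Not the code from util/trigrams.py; the training samples are invented.
from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Relative frequencies of character trigrams in the text."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = float(sum(counts.values())) or 1.0
    return dict((t, c / total) for t, c in counts.items())

def cosine_similarity(p, q):
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def detect_language(text, models):
    """models maps a language name to a trigram profile of its training text."""
    profile = trigram_profile(text)
    return max(models, key=lambda lang: cosine_similarity(profile, models[lang]))

models = {
    "English": trigram_profile("this is a small sample of plain english text"),
    "Czech": trigram_profile("toto je maly vzorek ceskeho textu"),
}
print(detect_language("another short piece of english text", models))  # -> English
}}}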

 == Acknowledgements ==
-The author would like to express many thanks to Jan Pomikalek and Pavel Rychly for guidance and key design advice.
-Thanks also to Vlado Benko and Nikola Ljubesic for ideas for improvement.
+The author would like to express many thanks to Jan Pomikálek, Pavel Rychlý
+and Miloš Jakubíček for guidance, key design advice and help with debugging.
+Thanks also to Vlado Benko and Nikola Ljubešić for ideas for improvement.

 This software is developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of Masaryk University in Brno, Czech Republic \\