Changes between Version 3 and Version 4 of SpiderLing


Timestamp: 11/20/15 13:50:10
Author: admin
Comment: Changelog 0.77 --> 0.81

  • SpiderLing

    v3 v4  
    14 14 == Get SpiderLing ==
    15 15 See [wiki:Downloads] for the latest version.
       16
       17 === Changelog 0.77 → 0.81 ===
       18 * Improvements proposed by Vlado Benko or Nikola Ljubešić:
       19  - escape square brackets and backslashes in URLs
       20  - doc attributes: timestamp with hours, IP address, meta/chared encoding
       21  - doc id added to arc output
       22  - MAX_DOCS_CLEANED limit per domain
       23  - create the temp dir if needed
       24 * Support for processing doc/docx/ps/pdf (not working well yet; URLs of such documents are saved to a separate file for manual download and processing)
       25 * Crawling multiple languages (inspired by Nikola's contribution; not tested yet)
       26 * Stop crawling by sending SIGTERM to the main process
       27 * Domain distances (distance of web domains from the seed web domains; will be used in scheduling in the future)
       28 * Config values tweaked
       29  - MAX_URL_QUEUE, MAX_URL_SELECT greatly increased
       30  - better spread of domains in the crawling queue => faster crawling
       31 * Python --> !PyPy
       32  - scheduler and crawler processes dynamically compiled by pypy
       33  - saves approx. 1/4 of RAM
       34  - CPU efficiency gain is less visible (processes spend time waiting for host/IP timeouts and for doc processors)
       35  - process.py requires lxml, which does not work with pypy (will be replaced by lxml-cffi in the future)
       36 * Readme updated (more information, known bugs)
       37 * Bug fixes
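
The SIGTERM-based stop listed in the changelog can be pictured with the following minimal Python sketch. It is not SpiderLing's actual code: `StopFlag`, `crawl_loop`, and `on_item` are hypothetical names, and the sketch only assumes that the main process installs a handler which sets a flag that the main loop checks between work items.

```python
import os
import signal


class StopFlag:
    """Holds the stop request set by the signal handler (hypothetical helper)."""

    def __init__(self):
        self.stop_requested = False

    def handle(self, signum, frame):
        # Only record the request here; the main loop decides when to exit.
        self.stop_requested = True


def crawl_loop(flag, work_items, on_item):
    """Process items until they run out or a stop has been requested."""
    processed = []
    for item in work_items:
        if flag.stop_requested:
            break  # finish cleanly instead of being killed mid-item
        on_item(item)
        processed.append(item)
    return processed


# Assumption: the handler is registered once, in the main process.
flag = StopFlag()
signal.signal(signal.SIGTERM, flag.handle)
```

From a shell, the stop would then be requested with `kill -TERM <pid>`, where `<pid>` is the PID of the main process.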
    16 38
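One plausible reading of the domain distance entry in the changelog is the minimum number of inter-domain link hops from any seed domain. Below is a minimal sketch under that assumption, using breadth-first search. The function `domain_distances` and the `links` mapping are hypothetical and are not taken from SpiderLing.

```python
from collections import deque


def domain_distances(seed_domains, links):
    """Return {domain: hop count from the nearest seed domain}.

    `links` maps a domain to the set of domains it links to (assumed input).
    Domains unreachable from the seeds are absent from the result.
    """
    distances = {domain: 0 for domain in seed_domains}
    queue = deque(seed_domains)
    while queue:
        current = queue.popleft()
        for target in links.get(current, ()):
            if target not in distances:
                # First visit in BFS order is the shortest distance.
                distances[target] = distances[current] + 1
                queue.append(target)
    return distances


links = {
    "seed.example": {"a.example", "b.example"},
    "a.example": {"c.example"},
    "b.example": {"c.example", "seed.example"},
}
distances = domain_distances(["seed.example"], links)
# seed.example -> 0, a.example -> 1, b.example -> 1, c.example -> 2
```

A scheduler could use such a value to prefer domains close to the seeds, which matches the stated intent to use distances in scheduling later.
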
    17 39 == Publications ==