Changes between Version 12 and Version 13 of SpiderLing


Timestamp: 07/23/21 14:06:34
Author: admin
Comment: version 2.0

  • SpiderLing

    v12  v13
    16   16
    17   17   == Get SpiderLing ==
    18        Download [https://nlp.fi.muni.cz/projects/spiderling/ the latest version]. Please note the software is distributed as is, without a guaranteed support.
         18   Download [https://corpus.tools/raw-attachment/wiki/Downloads/spiderling-src-2.0.tar.xz the latest version]. Please note the software is distributed as is, without a guaranteed support.
    19   19
    20   20   == Publications ==

    75   75     - pdftotext (from poppler-utils),
    76   76     - ps2ascii (from ghostscript-core),
    77          - antiword (from antiword),
         77     - antiword (from antiword + perl-Time-HiRes),
         78     - odfpy,
    78   79   - nice (coreutils) (optional),
    79   80   - ionice (util-linux) (optional),

    88   89   Recommended hardware configuration (crawling ~30 bn words of English text):
    89   90   - 8-32 core CPU (the more CPUs the faster the processing of crawled data),
    90        - 32-256 GB system memory
         91   - 32-512 GB system memory
    91   92       (the more RAM the more domains kept in memory and thus more webs visited),
    92   93   - lots of storage space,

    105  106
    106  107  == Installation ==
    107       - unpack,
         108  - unpack: tar -xJvf spiderling-src-*.tar.xz,
    108  109  - install required tools, see install_rpm.sh for rpm based systems
    109  110  - check importing the following dependences by pypy3/python3:

    119  120  - raise ulimit -n accoring to MAX_OPEN_CONNS;
    120  121  - then increase MAX_OPEN_CONNS and OPEN_AT_ONCE;
    121       - configure language dependent settings.
         122  - configure language and TLD dependent settings.
    122  123
    123  124  == Language models for all recognised languages ==