= !SpiderLing =
!SpiderLing — a web spider for linguistics — is software for obtaining text from the web that is useful for building text corpora.

Many documents on the web contain only material not suitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not comprised of full sentences. In fact, such pages represent the vast majority of the web. Therefore, by doing unrestricted web crawls, we typically download a lot of data which gets filtered out during post-processing. This makes the process of web corpus collection inefficient.

The aim of our work is to focus the crawling on the text-rich parts of the web and maximize the number of words in the final corpus per downloaded megabyte. Nevertheless, the crawler can be configured to ignore the yield rate of web domains and download from low-yield sites too.

== Get SpiderLing ==
Download [https://nlp.fi.muni.cz/projects/spiderling/ the latest version]. Please note that the software is distributed as is, without guaranteed support.

== Publications ==
We presented our results at the following venues:

[http://nlp.fi.muni.cz/~xsuchom2/papers/PomikalekSuchomel_SpiderlingEfficiency.pdf Efficient Web Crawling for Large Text Corpora]\\
by Jan Pomikálek, Vít Suchomel\\
at [http://sigwac.org.uk/wiki/WAC7 ACL SIGWAC Web as Corpus] (at the WWW conference), Lyon, April 2012

[http://nlp.fi.muni.cz/~xsuchom2/papers/BaisaSuchomel_TurkicResources.pdf Large Corpora for Turkic Languages and Unsupervised Morphological Analysis]\\
by Vít Baisa, Vít Suchomel\\
at [http://multisaund.eu/lrec2012_turkiclanguage.php Language Resources and Technologies for Turkic Languages] (at the LREC conference), Istanbul, May 2012

[http://trac.sketchengine.co.uk/raw-attachment/wiki/AK/Papers/tentens_14may2013.docx The TenTen Corpus Family]\\
by Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel\\
at [http://ucrel.lancs.ac.uk/cl2013/ 7th International Corpus Linguistics Conference], Lancaster, July 2013.


== Large textual corpora built using !SpiderLing ==
=== Since 2017 ===
Corpora with a total size of ca. 200 billion tokens in various languages (mostly English) were built from data crawled by !SpiderLing between 2017 and March 2020.

=== From 2011 to 2014 ===
||= language  =||= raw data size [GB] =||= cleaned data size [GB] =||= yield rate =||= corpus size [billion tokens] =||= crawling duration [days] =||
||American Spanish ||   1874||   44||  2.36%||   8.7||   14||
||Arabic           ||   2015||   58||  2.89%||   6.6||   14||
||Bulgarian        ||       ||     ||       ||   0.9||    8||
||Czech            ||  ~4000||     ||       ||   5.8||  ~40||
||English          ||   2859||  108||  3.78%||  17.8||   17||
||Estonian         ||    100||    3||  2.67%||   0.3||   14||
||French           ||   3273||   72||  2.19%||  12.4||   15||
||German           ||   5554||  145||  2.61%||  19.7||   30||
||Hungarian        ||       ||     ||       ||   3.1||   20||
||Japanese         ||   2806||   61||  2.19%||  11.1||   28||
||Korean           ||       ||     ||       ||   0.5||   20||
||Polish           ||       ||     ||       ||   9.5||   17||
||Russian          ||   4142||  198||  4.77%||  20.2||   14||
||Turkish          ||   2700||   26||  0.97%||   4.1||   14||

== Requires ==
- python >= 3.6,
- pypy3 >= 5.5 (optional),
- justext >= 3.0 (http://corpus.tools/wiki/Justext),
- chared >= 2.0 (http://corpus.tools/wiki/Chared),
- lxml >= 4.2 (http://lxml.de/),
- openssl >= 1.1,
- pyre2 >= 0.2.23 (https://github.com/andreasvc/pyre2),
- text processing tools (if binary format conversion is on):
  - pdftotext (from poppler-utils),
  - ps2ascii (from ghostscript-core),
  - antiword (from antiword),
- nice (coreutils) (optional),
- ionice (util-linux) (optional),
- gzip (optional).

Runs in Linux, tested in Fedora and Ubuntu.
Minimum hardware configuration (very small crawls):
- 4 core CPU,
- 8 GB system memory,
- some storage space,
- broadband internet connection.
Recommended hardware configuration (crawling ~30 bn words of English text):
- 8-32 core CPU (the more CPU cores, the faster the processing of crawled data),
- 32-256 GB system memory
    (the more RAM, the more domains kept in memory and thus more websites visited),
- lots of storage space,
- connection to an internet backbone line.

== Includes ==
A robot exclusion rules parser for Python (v. 1.6.2)
- by Philip Semanchuk, BSD Licence
- see util/robotparser.py
Language detection using character trigrams
- by Douglas Bagnall, Python Software Foundation Licence
- see util/trigrams.py
docx2txt
- by Sandeep Kumar, GNU GPL 3+
- see util/doc2txt.pl

== Installation ==
- unpack,
- install required tools (see install_rpm.sh for RPM-based systems; a sketch for Debian/Ubuntu is shown below),
- check that the following dependencies can be imported by pypy3/python3:
  python3 -c 'import justext.core, chared.detector, ssl, lxml, re2'
  pypy3 -c 'import ssl; from ssl import PROTOCOL_TLS'
- make sure the crawler can write to config.RUN_DIR and config.PIPE_DIR.
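
On Debian/Ubuntu systems, the required tools can be installed roughly as follows. This is only a sketch: the package names are assumptions, and chared and pyre2 may have to be installed from their project pages listed in the Requires section.
{{{
#system packages and converters for binary formats (names assumed, adjust to your distribution)
sudo apt install python3 python3-pip pypy3 poppler-utils ghostscript antiword gzip
#Python libraries available from PyPI
pip3 install justext lxml
#chared and pyre2: install from http://corpus.tools/wiki/Chared and
#https://github.com/andreasvc/pyre2 if no suitable PyPI package is available
}}}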

== Settings -- edit util/config.py ==
- !!!IMPORTANT!!! Set AGENT, AGENT_URL, USER_AGENT;
- set MAX_RUN_TIME to specify the maximum crawling time in seconds;
- set DOC_PROCESSOR_COUNT to (partially) control CPU usage;
- set MAX_OPEN_CONNS, IP_CONN_INTERVAL, HOST_CONN_INTERVAL;
- raise ulimit -n according to MAX_OPEN_CONNS (see the example after this list);
- then increase MAX_OPEN_CONNS and OPEN_AT_ONCE;
- configure language dependent settings.
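
The open file limit has to be raised in the shell that launches the crawler when MAX_OPEN_CONNS is high. A sketch (the value is an example, derive it from your MAX_OPEN_CONNS):
{{{
ulimit -n        #show the current limit on open files
ulimit -n 4096   #raise the limit for this shell before starting the crawler
}}}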

== Language models for all recognised languages ==
- Plaintext in util/lang_samples/,
    e.g. put plaintexts from several dozen good web documents and Wikipedia
    articles, manually checked, in util/lang_samples/{Czech,Slovak,English};
- jusText wordlists in util/justext_wordlists/,
    e.g. use the jusText default or the 2000 most frequent manually cleaned words,
    one per line, in util/justext_wordlists/{Czech,Slovak,English};
- chared model in util/chared_models/,
    e.g. copy the default chared models {czech,slovak,english}.edm to
    util/chared_models/{Czech,Slovak,English}.
See the default English resources in the respective directories.
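
A minimal sketch of preparing the resources for one more language, here Czech; the source paths are placeholders and the exact layout (single file or directory) should be checked against the default English resources:
{{{
#manually checked plaintext sample of the target language
cp ~/samples/czech_plaintext.txt util/lang_samples/Czech
#jusText wordlist, one word per line
cp ~/wordlists/czech_top2000.txt util/justext_wordlists/Czech
#default chared model renamed to the language name used by the crawler
cp /path/to/chared/models/czech.edm util/chared_models/Czech
}}}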

== Usage ==
See {{{./spiderling.py -h}}}.

It is recommended to run the crawler in `screen`.
Example:
{{{
screen -S crawling
./spiderling.py < seed_urls &> run/out &
}}}
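
The progress of a running crawl can be watched from another shell, e.g.:
{{{
tail -f run/out   #main output redirected in the example above
ls -lh run/       #log files and output directories (described below)
}}}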

Files created by the crawler in run/:
- *.log.* .. log & debug files,
- arc/*.arc.gz .. gzipped arc files (raw http responses),
- prevert/*.prevert_d .. preverticals with duplicate documents,
- prevert/duplicate_ids .. file with duplicate document IDs,
- ignored/* .. ignored URLs (binary files (pdf, ps, doc, docx) which were
    not processed and URLs not passing the domain blacklist or the TLD filter),
- save/* .. savepoints that can be used for a new run,
- other directories can be erased after stopping the crawler.
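
To get a rough idea of how much data has been obtained so far, count the documents in the preverticals; the sketch below assumes the usual prevertical convention of one <doc ...> header line per document:
{{{
grep -c '^<doc ' run/prevert/*.prevert_d   #documents (incl. duplicates) per file
du -sh run/prevert/                        #total size of the prevertical output
}}}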

To remove duplicate documents from preverticals, run
{{{
for pdup in run/prevert/*.prevert_d
do
    p=`echo $pdup | sed -r 's,prevert_d$,prevert,'`
    pypy3 util/remove_duplicates.py run/prevert/duplicate_ids < $pdup > $p
done
}}}
Files run/prevert/*.prevert are the final output.

Onion (http://corpus.tools/wiki/Onion) is recommended to remove near-duplicate
paragraphs of text.

To stop the crawler before MAX_RUN_TIME, send SIGTERM to the main process
(pypy/python spiderling.py).
Example:
{{{
ps aux | grep 'spiderling\.py' #find the PID of the main process
kill -s SIGTERM <PID>
}}}

To re-process arc files with the current process.py and util/config.py, run
{{{
for arcf in run/arc/*.arc.gz
do
    #derive the output file name from the arc file name
    p=`echo $arcf | sed -r 's,run/arc/([0-9]+)\.arc\.gz$,\1.prevert_re_d,'`
    zcat $arcf | pypy3 reprocess.py > $p
done
}}}

To re-start crawling from the last saved state:
{{{
mv -iv run old #rename the old `run' directory
mkdir run
for d in docmeta dompath domrobot domsleep robots
do
    ln -s ../old/$d/ run/$d
    #make the new run continue from the previous state by symlinking old data
done
screen -r crawling
./spiderling.py \
    --state-files=old/save/domains_T,old/save/raw_hashes_T,old/save/txt_hashes_T \
    --old-tuples < old/url/urls_waiting &> run/out &
}}}
(Assuming T is the timestamp of the last save, e.g. 20190616151500.)

== Performance tips ==
- Start with thousands of seed URLs and give more URLs per domain
  (a quick check of the seed list is sketched after this list).
  It is possible to start with tens of millions of seed URLs.
  If you need to start with a small number of seed URLs, set
  VERY_SMALL_START = True in util/config.py.
- Using !PyPy reduces CPU cost but may increase RAM usage.
- Set TLD_WHITELIST_RE to avoid crawling domains not in the target language
  (it takes some resources to detect the language otherwise).
- Set LOG_LEVEL to debug and set INFO_PERIOD to check the debug output in
  *.log.crawl and *.log.eval to see where the bottleneck is, then modify the
  settings accordingly, e.g. add doc processors if the doc queue is always full.
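
A quick way to check how many URLs per domain the seed list provides (a rough sketch assuming one URL per line in seed_urls):
{{{
#count seed URLs per host, most frequent hosts first
sed -E 's,^https?://([^/]+).*,\1,' seed_urls | sort | uniq -c | sort -rn | head
#number of distinct hosts in the seed list
sed -E 's,^https?://([^/]+).*,\1,' seed_urls | sort -u | wc -l
}}}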

== Known bugs ==
- Domain distances should be made part of document metadata instead of storing
  them in a separate file. This will be resolved in the next version.
- Processing binary files (pdf, ps, doc, docx) is disabled by default since it
  was not tested and may slow down processing significantly.
  Also, the quality of the text output of these converters may suffer from
  various problems: headers, footers, lines broken by hyphenated words, etc.
- Compressed connections are not accepted/processed. Some servers might be
  discouraged from sending an uncompressed response (not tested).
- Some advanced features of robots.txt are not observed, e.g. Crawl-delay.
  Observing them would require major changes in the design of the download scheduler.
  A warning is emitted when the crawl delay < config.HOST_CONN_INTERVAL.

== Support ==
There is no guarantee of support. (The author may help you a bit in his free
time.) Please note the tool is distributed as is; it may not work under your
conditions.

== Acknowledgements ==
The author would like to express many thanks to Jan Pomikálek, Pavel Rychlý
and Miloš Jakubíček for guidance, key design advice and help with debugging.
Thanks also to Vlado Benko and Nikola Ljubešić for ideas for improvement.

This software is developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of Masaryk University in Brno, Czech Republic, in cooperation with [http://lexicalcomputing.com/ Lexical Computing], a [http://www.sketchengine.eu/ corpus tool] company.

This work was partly supported by the Ministry of Education of the Czech Republic within the
LINDAT-Clarin project LM2015071.
This work was also partly supported by the Norwegian Financial Mechanism 2009–2014
and the Ministry of Education, Youth and Sports under Project Contract no.
MSMT-28477/2014 within the HaBiT Project 7F14047.

== Contact ==
{{{'zc.inum.if@2mohcusx'[::-1]}}}

== Licence ==
This software is the result of project LM2010013 (LINDAT-Clarin -
Vybudování a provoz českého uzlu pan-evropské infrastruktury pro
výzkum; in English: Building and Operation of the Czech Node of the
pan-European Infrastructure for Research). This result is consistent
with the expected objectives of the project. The owner of the result
is Masaryk University, a public university, ID: 00216224. Masaryk
University allows other companies and individuals to use this
software free of charge and without territorial restrictions under
the terms of the [http://www.gnu.org/licenses/gpl.txt GPL license].

This permission is granted for the duration of property rights.

This software is not subject to special information treatment
according to Act No. 412/2005 Coll., as amended. In case a person
using the software under this license violates the license terms,
the permission to use the software terminates.

== Changelog ==
=== Changelog 1.0 → 1.1 ===
* Important bug fix: Store the best extracted paragraph data in process.py. (Fixes a bug that caused the Chared model for the last tested language, rather than the best matching language, to be used for character decoding. E.g. the koi8-r encoding could have been assumed for Czech documents in some cases when the last language in config.LANGUAGES was Russian.) Also, chared encodings were renamed to canonical encoding names.
* Encoding detection simplified, Chared is preferred now
* Session IDs (and similar strings) removed from paths in domains to prevent downloading the same content again
* Path scheduling priority by path length
* Memory consumption improvements
* More robust error handling, e.g. socket errors in crawl.py updated to OSError
* reprocess.py can work with wpage files too
* config.py: some default values changed for better performance (e.g. increasing the maximum open connections limit helps a lot), added a switch for the case of starting with a small count of seed URLs, tidied up
* Debug logging of memory size of data structures

=== Changelog 0.95 → 1.0 ===
* Python 3.6+ compatible
* Domain scheduling priority by domain distance and hostname length
* Bug fixes: domain distance, domain loading, Chared returns "utf_8" instead of "utf-8", '<', '>' and "'" in doc.title/doc.url, robots.txt redirected from http to https, missing content-type and more
* More robust handling of some issues
* Less used features such as state file loading and data reprocessing better explained
* Global domain blacklist added
* English models and sample added
* Program run examples in README

=== Changelog 0.82 → 0.95 ===
* Interprocess communication rewritten to files
* Write URLs that cannot be downloaded soon into a "wpage" file -- speeds up the downloader
* New reprocess.py allowing reprocessing of arc files
* Https hosts separated from http hosts with the same hostname
* Many features, e.g. redirections, made in a more robust way
* More options exposed to allow configuration, more logging and debug info
* Big/small crawling profiles setting multiple variables in config
* Performance and memory saving improvements
* Bugfixes: chunked HTTP, XML parsing, double quotes in URL, HTTP redirection to the same URL, SSL layer not ready and more

=== Changelog 0.82 → 0.85 ===

* Readme updated (more information, known bugs)
* Bug fixes