!SpiderLing — a web spider for linguistics — is software for obtaining text from the web useful for building text corpora.
Many documents on the web contain only material not suitable for text corpora, such as site navigation, lists of links, lists of products, and other kinds of text not comprised of full sentences. In fact, such pages represent the vast majority of the web. Therefore, by doing unrestricted web crawls, we typically download a lot of data that gets filtered out during post-processing. This makes the process of web corpus collection inefficient.
The aim of our work is to focus the crawling on the text-rich parts of the web and maximize the number of words in the final corpus per downloaded megabyte (the yield rate, sketched below). Nevertheless, the crawler can be configured to ignore the yield rate of web domains and download from low-yield sites too.
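The yield rate criterion can be pictured with a minimal sketch (illustrative only; the names and the threshold value are made up for this example and are not taken from SpiderLing's source):

{{{
#!python
# Sketch of the yield rate criterion (names and threshold are hypothetical).

def yield_rate(word_count, bytes_downloaded):
    # words obtained per downloaded megabyte of raw data
    return word_count / (bytes_downloaded / 1024.0 / 1024.0)

MIN_YIELD_RATE = 100.0  # hypothetical threshold: words per MB

def keep_crawling(domain_word_count, domain_bytes):
    # a low yield domain is abandoned unless the crawler is configured
    # to ignore yield rates of web domains
    return yield_rate(domain_word_count, domain_bytes) >= MIN_YIELD_RATE
}}}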
See [wiki:Downloads] for the latest version. '''Please note the software is distributed as is, without guaranteed support.'''

=== Changelog 0.82 → 0.84 ===
* Mistakenly ignored links fixed in process.py
- the bug caused a significant part of the web not to be crawled
* Multithreading issues fixed (possible race conditions in shared memory)
- delegated classes for explicit locking of all shared objects (esp. DomainQueue); the pattern is sketched after this list
* Deadlock -- the scheduler stops working after some time of successful crawling
- observed in the case of large scale crawling
- may be related to the code changes addressing the multithreading problems
- IMPORTANT -- may not be resolved yet, testing in progress, 0.81 should be stable
* Several Domain and DomainQueue methods were rewritten for bulk operations (e.g. adding all paths into a single domain at once) to improve performance
* Justext configuration
- a more permissive setting than the justext default; an example follows this list
- configurable in util/config.py
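The explicit locking of shared objects mentioned above can be pictured with a small delegate class (a sketch of the general pattern only, not the actual code wrapping DomainQueue):

{{{
#!python
# Sketch of a locking delegate for an object shared between threads.
import threading

class LockedDelegate(object):
    def __init__(self, shared_list):
        self._shared = shared_list
        self._lock = threading.Lock()

    def add(self, item):
        with self._lock:
            self._shared.append(item)

    def pop_first(self):
        with self._lock:
            return self._shared.pop(0) if self._shared else None
}}}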
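The more permissive jusText setting might look like the following call (the values shown are illustrative, not the ones actually used; see util/config.py for those, and note that the paragraph attributes follow the current jusText API):

{{{
#!python
# Sketch of a more permissive jusText call; the particular values are
# hypothetical, the real configuration lives in util/config.py.
import justext

html = '<html><body><p>Some page text long enough to keep ...</p></body></html>'
paragraphs = justext.justext(
    html,
    justext.get_stoplist('English'),
    length_low=50,         # jusText default: 70
    length_high=140,       # jusText default: 200
    stopwords_low=0.20,    # jusText default: 0.30
    stopwords_high=0.25,   # jusText default: 0.32
    max_link_density=0.4)  # jusText default: 0.2
clean_text = '\n'.join(p.text for p in paragraphs if not p.is_boilerplate)
}}}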
2.7 <= python < 3,
justext >= 1.2 (http://corpus.tools/wiki/Justext),
chared >= 1.3 (http://corpus.tools/wiki/Chared),
lxml >= 2.2.4 (http://lxml.de/),
openssl >= 1.0.1,
pdftotext, ps2ascii, antiword (text processing tools).
Runs in Linux, tested in Fedora and Ubuntu.
A robot exclusion rules parser for Python (v. 1.6.2)
- by Philip Semanchuk, BSD Licence
- see util/robotparser.py
Language detection using character trigrams
- by Douglas Bagnall, Python Software Foundation Licence
- see util/trigrams.py (the technique is sketched after this list)
docx2txt
- by Sandeep Kumar, GNU GPL 3+
- see util/doc2txt.pl
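To picture what the trigram-based detection does, here is a self-contained sketch of the general technique (not the code in util/trigrams.py):

{{{
#!python
# Sketch of character trigram language identification: build a normalised
# trigram distribution per language and pick the most similar model.
from collections import Counter

def trigram_profile(text):
    text = ' ' + text.lower() + ' '
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = float(sum(counts.values()))
    return dict((t, c / total) for t, c in counts.items())

def similarity(profile_a, profile_b):
    # dot product of two trigram distributions
    return sum(w * profile_b.get(t, 0.0) for t, w in profile_a.items())

models = dict((lang, trigram_profile(sample)) for lang, sample in [
    ('en', 'the quick brown fox jumps over the lazy dog'),
    ('de', 'der schnelle braune fuchs springt ueber den faulen hund')])
doc = trigram_profile('a fox jumps over a dog')
print(max(models, key=lambda lang: similarity(models[lang], doc)))
}}}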
- *.domain_{bad,oversize,distance} .. see util/domain.py,
- *.ignored_refs .. links ignored for configurable or hardcoded reasons,
- *.unproc_urls .. URLs of non-HTML documents that were not processed (bug).
To remove duplicate documents from preverticals, run