Changes between Version 16 and Version 17 of SpiderLing

Timestamp: 12/01/22 15:39:54
Author: admin
Comment: Changelog 2.0 → 2.1 → 2.2

SpiderLing

-              v16
+              v17
 == Changelog ==
+=== Changelog 2.0 → 2.1 → 2.2 ===
+Language detection:
+* CLD2 is now used together with the simple character trigram model at the document level. CLD2 is more accurate and can identify languages that the old model does not know. The old model is more permissive but lets in texts in other languages that have to be filtered out later. In short: CLD2 for precision, simple trigrams for recall.
+* Language detection at the paragraph level was removed since a paragraph was too short for reliable predictions.
+* Lowercase and trim all strings for the trigram langid.
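The normalisation step for the trigram langid can be sketched as follows. This is a minimal illustration, not SpiderLing's actual code: the function names `normalise`, `trigrams` and `cosine` are hypothetical, and cosine similarity against a language profile is an assumed scoring method that the changelog does not specify.

```python
import math
from collections import Counter


def normalise(text):
    # Lowercase and trim the string before counting trigrams.
    return text.strip().lower()


def trigrams(text):
    # Count overlapping character trigrams of the normalised text.
    s = normalise(text)
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    # Cosine similarity of two trigram count vectors (assumed scoring method).
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With this normalisation, `trigrams("  Hello ")` and `trigrams("hello")` produce the same counts, so case and surrounding whitespace no longer influence the score.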
+Performance improvements:
+* Gzip compression of files to save space where performance is not crucial (i.e. in document processors, not in the downloader).
+* Split the ARC output into 100 GB parts and the prevertical output into 10 GB parts to allow post-processing of closed parts before the crawl is done.
+* When the document processors are overloaded, only a limited number of documents from each domain is processed from a single wpage file (the rest is postponed). This keeps some data flowing steadily from the processors to the scheduler for every domain, so that domains are not dropped from crawling because of perceived inactivity caused by overloaded processors rather than by depleted domains.
+* Ignore edit pages and diff pages which usually contribute no new texts.
+* Limit the size of the wpage processing queue so that big crawls do not eat up the memory: config.MAX_WPAGE_PROCESSING_COUNT.
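The per-domain limit applied under overload can be illustrated with a short sketch. It rests on assumptions: the function `select_documents`, its parameters and the `(domain, document)` input format are hypothetical and do not come from the SpiderLing sources.

```python
from collections import defaultdict


def select_documents(docs, per_domain_limit, overloaded):
    """Split (domain, document) pairs into those to process now and those postponed.

    Hypothetical helper: when processors are overloaded, at most
    per_domain_limit documents per domain are taken from one batch.
    """
    if not overloaded:
        return list(docs), []
    taken = defaultdict(int)  # documents accepted so far per domain
    to_process, postponed = [], []
    for domain, doc in docs:
        if taken[domain] < per_domain_limit:
            taken[domain] += 1
            to_process.append((domain, doc))
        else:
            postponed.append((domain, doc))
    return to_process, postponed
```

Because every domain in the batch contributes at least one document, each domain keeps reporting some activity even while most of its documents wait.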
+Bugfixes:
+* Write the rest of duplicate doc IDs after termination.
+* Fixed a missing primary multilanguage flag in re-processing code.
+Other:
+* More domains in the default blacklist.
+* Pad prevertical file names to two digits to ease their processing and sorting.
+* Allow keeping arc output when re-processing wpage data.
+* Use --saved-hashes to initialize the hashes of raw data and plaintext by savepoint values when re-processing wpage data.
+* url/download_queue added to the restart command line to load a saved downloader queue.
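The combination of gzip compression, size-limited output parts and two-digit part numbers from this changelog can be sketched as a small writer class. This is an assumption-laden illustration: the class `PartWriter`, the file name pattern `prefix.NN.gz` and the choice to count uncompressed bytes are hypothetical and are not taken from SpiderLing itself.

```python
import gzip
import os


class PartWriter:
    """Write records to gzip-compressed parts of limited (uncompressed) size."""

    def __init__(self, directory, prefix, max_bytes):
        self.directory = directory
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.part = 0            # number of the part currently being written
        self.written = 0         # uncompressed bytes written to the current part
        self.closed_parts = []   # finished parts, ready for post-processing
        self._file = None

    def _path(self, part):
        # Zero-pad the part number to two digits so file names sort correctly.
        return os.path.join(self.directory, "%s.%02d.gz" % (self.prefix, part))

    def write(self, data):
        # Start a new part when this record would exceed the size limit.
        if self._file is not None and self.written + len(data) > self.max_bytes:
            self._roll()
        if self._file is None:
            self._file = gzip.open(self._path(self.part), "wb")
        self._file.write(data)
        self.written += len(data)

    def _roll(self):
        # Close the current part and record it as finished.
        self._file.close()
        self.closed_parts.append(self._path(self.part))
        self.part += 1
        self.written = 0
        self._file = None

    def close(self):
        if self._file is not None:
            self._roll()
```

Paths listed in `closed_parts` are complete gzip files, so they could be post-processed while later parts are still being written.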
 === Changelog 1.3 → 2.0 ===
 Major bugfixes
 …
 * binary files discard fixed (text extraction from pdf, doc,... works now)
 Major updates
-* multilingual webiste support (see util/config.py)
+* multilingual website support (see util/config.py)
 * keeping near-good or even bad paragraphs allowed
 Minor updates