Changes between Version 13 and Version 14 of SpiderLing

SpiderLing

-              v13
+              v14
 == Changelog ==
+=== Changelog 1.3 → 2.0 ===
+Major bugfixes
+* redirection to path + "/" is no longer ignored
+* discarding of binary files fixed (text extraction from pdf, doc, ... works now)
+Major updates
+* multilingual website support (see util/config.py)
+* keeping near-good or even bad paragraphs allowed
+Minor updates
+* machine translation filter (based on some known MT identifiers in HTML)
+* extract text from ODF format (.odt files)
+* get file type from Content-Type from the HTTP header
+* add HTTP Last-Modified date to prevertical
+* Justext classification added to paragraph attributes
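The "get file type from Content-Type" item can be illustrated with a minimal sketch. It is not SpiderLing's code: the function names `media_type` and `file_kind` and the `KIND_BY_MEDIA_TYPE` table are hypothetical, and only the MIME type strings themselves are standard registered values.

```python
# Hypothetical mapping from a media type to a short file kind label.
KIND_BY_MEDIA_TYPE = {
    "text/html": "html",
    "application/pdf": "pdf",
    "application/msword": "doc",
    "application/vnd.oasis.opendocument.text": "odt",
}


def media_type(content_type_header):
    """Return the lowercased media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def file_kind(content_type_header, default="unknown"):
    """Look up the file kind for a raw Content-Type header value."""
    return KIND_BY_MEDIA_TYPE.get(media_type(content_type_header), default)
```

For example, `file_kind("text/html; charset=UTF-8")` yields `"html"`, while an unlisted type falls back to the default, so a caller could then consult the URL suffix instead.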
+=== Changelog 1.1 → 1.3 ===
+* decode IDNA hostnames in prevertical
+* adding URLs to download on-the-fly enabled
+* bugfixes
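As an illustration of the IDNA item above, the following sketch decodes a Punycode hostname with Python's built-in `idna` codec. The helper name `decode_hostname` is an assumption, and the fallback to the original string on failure is a design choice of this sketch, not documented SpiderLing behaviour.

```python
def decode_hostname(hostname):
    """Return the Unicode form of an IDNA (xn--) hostname.

    Falls back to the input unchanged if it cannot be decoded
    (an assumption of this sketch).
    """
    try:
        return hostname.encode("ascii").decode("idna")
    except UnicodeError:
        return hostname
```

Plain ASCII hostnames pass through unchanged, so the helper can be applied to every hostname before it is written out.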
 === Changelog 1.0 → 1.1 ===
 * Important bug fix: Store the best extracted paragraph data in process.py. (Fixes a bug that caused the Chared model for the last tested language, rather than the best tested language, to be used for character decoding. E.g. koi8-r encoding could have been assumed for Czech documents in some cases when the last language in config.LANGUAGES was Russian.) Also, Chared encodings were renamed to canonical encoding names.
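The kind of bug described in this fix can be sketched as follows. This is not the code from process.py: the function names, the tuple layout and the scores in the usage note are made up for illustration. The buggy pattern keeps whatever the loop saw last, while the fixed pattern remembers the best result.

```python
def pick_last(candidates):
    """Buggy pattern: the result of the last tested language wins."""
    chosen = None
    for candidate in candidates:
        chosen = candidate  # overwritten on every iteration
    return chosen


def pick_best(candidates):
    """Fixed pattern: keep the candidate with the highest score."""
    best = None
    for candidate in candidates:
        # candidate is (language, score, encoding); compare by score only
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best
```

With the illustrative input `[("Czech", 0.9, "windows-1250"), ("Russian", 0.2, "koi8-r")]`, `pick_last` returns the Russian entry and hence koi8-r, while `pick_best` returns the Czech entry.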