Changes between Version 13 and Version 14 of SpiderLing


Timestamp: 07/23/21 15:49:44
Author: admin
Comment: Changelog

== Changelog ==
=== Changelog 1.3 → 2.0 ===
Major bugfixes
* ignored redirection to path + "/" fixed
* discarding of binary files fixed (text extraction from pdf, doc,... works now)
Major updates
* multilingual website support (see util/config.py)
* keeping near-good or even bad paragraphs allowed
Minor updates
* machine translation filter (based on some known MT identifiers in HTML)
* extract text from ODF format (.odt files)
* get file type from Content-Type in the HTTP header
* add HTTP Last-Modified date to prevertical
* jusText classification added to paragraph attributes

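The "get file type from Content-Type" item can be illustrated with a minimal sketch. This is not SpiderLing's code: the function names `parse_content_type` and `guess_kind` and the `KNOWN_KINDS` table are hypothetical, and only the Python standard library is assumed.

```python
from email.message import Message

# Hypothetical mapping from media type to a coarse document kind;
# the kinds listed here are illustrative, not SpiderLing's own table.
KNOWN_KINDS = {
    "text/html": "html",
    "application/pdf": "pdf",
    "application/msword": "doc",
    "application/vnd.oasis.opendocument.text": "odt",
}

def parse_content_type(header_value):
    """Split a Content-Type header value into (media type, charset)."""
    if not header_value:
        return (None, None)
    msg = Message()
    msg["Content-Type"] = header_value
    # Both values are returned lowercased; charset is None when absent.
    return (msg.get_content_type(), msg.get_content_charset())

def guess_kind(header_value):
    """Return a coarse document kind, or None for unknown media types."""
    media_type, _charset = parse_content_type(header_value)
    return KNOWN_KINDS.get(media_type)
```

A crawler could use `guess_kind(response_headers.get("Content-Type"))` to choose a text extractor before falling back to other heuristics such as the file extension.
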
=== Changelog 1.1 → 1.3 ===
* decode IDNA hostnames in prevertical
* adding URLs to download on-the-fly enabled
* bugfixes

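As a rough illustration of the IDNA item, the sketch below decodes a Punycode hostname back to Unicode with Python's built-in `idna` codec (which implements IDNA 2003). The helper name `decode_hostname` is hypothetical and is not taken from the SpiderLing sources.

```python
def decode_hostname(hostname: str) -> str:
    """Decode an IDNA (Punycode) hostname to Unicode.

    Plain ASCII hostnames pass through unchanged; input that cannot
    be decoded is returned as-is rather than raising.
    """
    try:
        return hostname.encode("ascii").decode("idna")
    except UnicodeError:
        return hostname

readable = decode_hostname("xn--mnchen-3ya.de")
```

Returning the input unchanged on failure keeps the output usable even for malformed hostnames, at the cost of leaving some names in their encoded form.
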
=== Changelog 1.0 → 1.1 ===
* Important bug fix: Store the best extracted paragraph data in process.py. (Fixes a bug that caused the Chared model for the last tested language, rather than for the best tested language, to be used for character decoding. E.g. koi8-r encoding could have been assumed for Czech documents in some cases when the last language in config.LANGUAGES was Russian.) Also, Chared encodings were renamed to canonical encoding names.
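
The class of bug described above can be sketched as follows. This is an illustration only, not the code in process.py: the tuple layout, the scores, and the function names are all hypothetical.

```python
def pick_last(results):
    """Buggy pattern: whatever was tested last wins."""
    chosen = None
    for candidate in results:
        chosen = candidate  # overwritten on every iteration
    return chosen

def pick_best(results):
    """Fixed pattern: remember the highest-scoring candidate."""
    best = None
    for candidate in results:
        # candidate is (language, encoding, score); higher score is better
        if best is None or candidate[2] > best[2]:
            best = candidate
    return best

# Hypothetical per-language results for one Czech document,
# in the order the languages appear in the configuration.
results = [
    ("Czech", "windows-1250", 0.93),
    ("Russian", "koi8-r", 0.12),
]
```

With these made-up scores, `pick_last(results)` selects the Russian entry and its koi8-r encoding, while `pick_best(results)` keeps the Czech entry.
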