== v4 (2023-08-08) ==
 * The old XHTML parser was replaced by an HTML5 parser.
 * Some HTML5 elements are now interpreted as paragraphs, so their text content is no longer ignored.
 * HTML5 entities are now supported, e.g. {{{&aacute;}}} is translated to {{{á}}}.
 * Python 3.11 compatible.
 * Requires [https://pypi.org/project/html5-parser Python HTML5 parser].

== v3.0 (Feb 06 2019) ==
 * Python 3.6 & Python 2.7 compatible
 * Joining of text separated by {{{<br>}}} or inline tags prevented (thanks to [https://github.com/msklvsk/ Bohdan Moskalevskyi], partially based on [https://github.com/msklvsk/justext/commit/5153f756a61e776a89eb5ebc5e0472333a80707f his commit])
 * Lowercased stoplist
 * Various stoplists updated

== Older changes ==
 * Apr 11 2018: Norwegian joined wordlist
 * Apr 11 2018: More wordlists
 * Sep 11 2017: Lowercased stoplist
 * Aug 24 2017: New and updated wordlists
 * Aug 24 2017: jusText 1.4
 * Aug 24 2017: Web demo
 * Aug 24 2017: {{{max_good_distance}}}, a new context classification parameter: the maximum distance (in paragraphs) of a short paragraph from a good paragraph for the short paragraph to be re-classified as good.
 * Jun 30 2017: Minor package updates
 * Jun 30 2017: jusText 1.3
 * Jun 29 2017: Preprocessing split into {{{get_html_root}}} and {{{preprocess_html_root}}}. This allows using the DOM root before the head (and other possibly useful elements) is removed, which is needed to get the page title from the head.
 * Apr 12 2017: New README
 * Apr 12 2017: Filter out HTML(5) elements
 * Feb 24 2017: Remove words containing Latin characters from the Korean stoplist
 * Jan 12 2015: Move {{{*}}} out of {{{trunk/}}}
 * Nov 11 2012: Temporary workaround for issue #2: remove any text nodes that cannot be decoded.
 * Jan 26 2012: Added stoplists for Kazakh, Kyrgyz, Turkmen and Uzbek.
 * Dec 6 2011: Fixed inserting spaces between text nodes. Before, content such as "abcefg" became "abc efg" after processing. Now it correctly becomes "abcefg".
 * Aug 8 2011: jusText 1.2
 * Aug 8 2011: Edited wiki page Algorithm through web user interface.
 * Aug 4 2011: Use character counts instead of word counts where possible ({{{length-low}}}, {{{length-high}}}, {{{max-heading-distance}}} and for computing link density). This makes the algorithm work well in the language-independent mode (without a stoplist) for languages where counting words is not easy (Japanese, Chinese, Thai, etc.). The default thresholds have been adjusted correspondingly.
 * Aug 4 2011: More robust parsing of meta tags containing the charset information.
 * Jun 6 2011: Bug fix: corrected decoding of HTML entities {{{&#128;}}} to {{{&#159;}}} (€ to Ÿ).
 * Mar 28 2011: Edited wiki page Algorithm through web user interface (2 edits).
 * Mar 23 2011: Edited wiki page Algorithm through web user interface.
 * Mar 17 2011: Edited wiki page Algorithm through web user interface.
 * Mar 9 2011: Edited wiki page Algorithm through web user interface (4 edits).
 * Mar 9 2011: Created wiki page through web user interface.
 * Mar 9 2011: jusText 1.1
 * Mar 9 2011: Initial import.
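
The entity-related entries (v4 and Jun 6 2011) can be illustrated with Python's standard library. This is only a sketch of the expected decoding behaviour using {{{html.unescape}}}. It is not jusText's own code, and the sample strings are chosen purely for illustration.

```python
import html

# A named entity is replaced by the character it stands for.
named = html.unescape("&aacute;")

# Numeric references 128-159 do not denote printable characters in Unicode;
# HTML5 decoding maps them to the corresponding Windows-1252 characters.
euro = html.unescape("&#128;")
y_diaeresis = html.unescape("&#159;")

print(named, euro, y_diaeresis)
```

The two numeric references are the endpoints of the range mentioned in the Jun 6 2011 entry.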
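
The {{{max_good_distance}}} entry (Aug 24 2017) describes a distance limit for re-classifying short paragraphs. A minimal sketch of that idea follows, assuming paragraphs are already labelled {{{good}}}, {{{bad}}} or {{{short}}}. The function name {{{reclassify_short}}} and the simplified rule are hypothetical; the actual context-sensitive classification in jusText takes more into account.

```python
def reclassify_short(labels, max_good_distance):
    """Relabel 'short' paragraphs that lie within max_good_distance
    paragraphs of a 'good' one (hypothetical, simplified rule)."""
    good_positions = [i for i, label in enumerate(labels) if label == "good"]
    result = list(labels)
    for i, label in enumerate(labels):
        if label != "short":
            continue
        # Distance is measured against the original labels, so a newly
        # promoted paragraph does not trigger further promotions.
        if any(abs(i - g) <= max_good_distance for g in good_positions):
            result[i] = "good"
    return result


labels = ["good", "short", "short", "short", "bad"]
reclassified = reclassify_short(labels, max_good_distance=2)
print(reclassified)
```

With a limit of 2, only the short paragraphs at most two positions away from the good one are promoted; the third stays short.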