Justext_changelog – Corpus tools

v4 (2023-08-08)

The old XHTML parser was replaced by an HTML5 parser.
- Some HTML5 elements – e.g. <blockquote/> – are now interpreted as paragraphs, so their text content is no longer ignored.
- HTML5 entities are now supported – e.g. &aacute; is translated to á.
Python 3.11 compatible.
Requires Python HTML5 parser.
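
As a rough illustration of what decoding HTML5 entities means, the sketch below uses Python's standard library rather than jusText itself. The helper name decode_entities is hypothetical; html.unescape follows the HTML5 table of named character references, so it also handles names that older HTML versions did not define.

```python
import html


def decode_entities(text):
    """Replace named and numeric character references with their characters.

    Illustration only: jusText relies on its HTML5 parser for this step;
    html.unescape is used here to show the expected behaviour.
    """
    return html.unescape(text)


print(decode_entities("&aacute;"))      # á
print(decode_entities("caf&eacute;"))   # café
print(decode_entities("&check; done"))  # a name defined in the HTML5 entity table
```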

v3.0 (Feb 06 2019)

Python 3.6 & Python 2.7 compatible
Text separated by <br/> or inline tags is no longer joined – thanks to Bohdan Moskalevskyi; partially based on his commit
Lowercased stoplist
Various stoplists updated

Apr 11 2018

Norwegian joined wordlist

Apr 11 2018

More wordlists

Sep 11 2017

Lowercased stoplist

Aug 24 2017

New and updated wordlists

Aug 24 2017

Justext 1.4

Aug 24 2017

Web demo

Aug 24 2017

max_good_distance, a new context classification parameter: the maximum distance (in paragraphs) of a short paragraph from a good paragraph to re-classify the short paragraph as good.
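
The effect of such a distance limit can be sketched as follows. This is a deliberately simplified model, not the jusText context classification itself (which also considers other neighbouring classes); the function reclassify_short and the plain string labels are assumptions made for the example.

```python
def reclassify_short(classes, max_good_distance):
    """Turn 'short' labels into 'good' or 'bad' based on distance.

    A short paragraph becomes 'good' if some paragraph that was
    originally 'good' lies at most max_good_distance paragraphs away,
    and 'bad' otherwise. Simplified sketch, not the real algorithm.
    """
    good_positions = [i for i, c in enumerate(classes) if c == "good"]
    result = []
    for i, c in enumerate(classes):
        if c != "short":
            result.append(c)
            continue
        near_good = any(abs(i - g) <= max_good_distance for g in good_positions)
        result.append("good" if near_good else "bad")
    return result


labels = ["good", "short", "short", "short", "bad"]
print(reclassify_short(labels, max_good_distance=2))
# ['good', 'good', 'good', 'bad', 'bad']
```

With a limit of 2, the third short paragraph is three paragraphs away from the only good one, so it is not promoted.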

Jun 30 2017

Minor package updates

Jun 30 2017

Justext 1.3

Jun 29 2017

Preprocess split into get_html_root and preprocess_html_root. This allows using the DOM root before the head (and other possibly useful elements) is removed. Needed to get the page title from the head.

Apr 12 2017

new README

Apr 12 2017

filter out HTML(5) elements

Feb 24 2017

remove words containing Latin characters from Korean stoplist

Jan 12 2015

Move * out of trunk/

Nov 11 2012

Temporary workaround for issue #2: Remove any text nodes that cannot be decoded.

Jan 26 2012

Added stoplists for Kazakh, Kyrgyz, Turkmen and Uzbek.

Dec 6 2011

Fixed inserting spaces between text nodes. Before, content such as "abc<b>efg</b>" became "abc efg" after processing. Now it correctly becomes "abcefg".
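
The difference between the two behaviours can be reproduced with a small standard-library sketch. It does not use jusText's own parser; the TextCollector class and the extract_text helper are hypothetical names introduced for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(fragment, separator=""):
    """Join all text nodes of the fragment with the given separator."""
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    return separator.join(collector.chunks)


print(extract_text("abc<b>efg</b>"))       # abcefg  (fixed behaviour)
print(extract_text("abc<b>efg</b>", " "))  # abc efg (old behaviour)
```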

Aug 8 2011

jusText 1.2

Aug 8 2011

Edited wiki page Algorithm through web user interface.

Aug 4 2011

Use character counts instead of word counts where possible (length-low, length-high, max-heading-distance and for computing link density). This makes the algorithm work well in the language-independent mode (without a stoplist) for languages where counting words is not easy (Japanese, Chinese, Thai, etc.). The default thresholds have been adjusted accordingly.
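
A minimal sketch of why character counts are more robust: whitespace-based word counting collapses for text written without spaces, while character-based measures still behave sensibly. The function names and the sample sentence are assumptions for the example and are not taken from the jusText source.

```python
def word_count(text):
    """Naive word count based on whitespace splitting."""
    return len(text.split())


def link_density(text, link_text):
    """Share of a paragraph's characters that belong to link text."""
    if not text:
        return 0.0
    return len(link_text) / len(text)


japanese = "これは日本語の文章です。"
print(word_count(japanese))  # 1, the whole sentence counts as one "word"
print(len(japanese))         # 12 characters

print(link_density("abcdefghij", "abc"))  # 0.3
```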

Aug 4 2011

More robust parsing of meta tags containing the information about used charset.
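
For illustration, one way to pull a declared charset out of meta tags is a tolerant regular expression over the raw bytes. This is only a sketch under that assumption, not the code used in jusText; META_CHARSET and sniff_charset are hypothetical names, and the default of utf-8 is chosen for the example.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_charset(html_bytes, default="utf-8"):
    """Return the charset named in the first matching meta tag."""
    match = META_CHARSET.search(html_bytes)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


print(sniff_charset(b'<meta charset="ISO-8859-2">'))  # iso-8859-2
print(sniff_charset(
    b'<META http-equiv="Content-Type" content="text/html; charset=windows-1250">'
))  # windows-1250
print(sniff_charset(b"<html><body>no declaration</body></html>"))  # utf-8
```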

Jun 6 2011

Bug fix: Corrected decoding of HTML entities  to 

Mar 28 2011

Edited wiki page Algorithm through web user interface.

Mar 28 2011

Edited wiki page Algorithm through web user interface.

Mar 23 2011

Edited wiki page Algorithm through web user interface.

Mar 17 2011

Edited wiki page Algorithm through web user interface.

Mar 9 2011

Edited wiki page Algorithm through web user interface.

Mar 9 2011

Edited wiki page Algorithm through web user interface.

Mar 9 2011

Edited wiki page Algorithm through web user interface.

Mar 9 2011

Edited wiki page Algorithm through web user interface.

Mar 9 2011

Created wiki page through web user interface.

Mar 9 2011

jusText 1.1

Mar 9 2011

Initial import.
