v4 (2023-08-08)
- The old XHTML parser was replaced by an HTML5 parser.
- Some HTML5 elements, e.g. <blockquote/>, are now interpreted as paragraphs, so their text content is no longer ignored.
- HTML5 entities are now supported, e.g. the entity for á is translated to the character á.
- Python 3.11 compatible.
- Requires Python HTML5 parser.
v3.0 (Feb 06 2019)
- Python 3.6 & Python 2.7 compatible
- Text separated by <br/> or inline tags is no longer joined; thanks to Bohdan Moskalevskyi, partially based on his commit
- Lowercased stoplist
- Various stoplists updated
Apr 11 2018
Norwegian joined wordlist
Apr 11 2018
More wordlists
Sep 11 2017
Lowercased stoplist
Aug 24 2017
New and updated wordlists
Aug 24 2017
Justext 1.4
Aug 24 2017
Web demo
Aug 24 2017
max_good_distance, a new context classification parameter: the maximum distance (in paragraphs) of a short paragraph from a good paragraph for the short paragraph to be re-classified as good.
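The effect of such a distance limit can be illustrated with a minimal sketch. It assumes paragraphs already carry the labels "good", "bad" or "short"; the helper name `reclassify_short` and the label strings are illustrative assumptions, not the jusText implementation, which applies further context rules.

```python
def reclassify_short(labels, max_good_distance):
    """Promote 'short' paragraphs that lie within max_good_distance
    paragraphs of a 'good' one. Hypothetical helper, not the jusText API."""
    # Positions of paragraphs that were good before any re-classification.
    good_positions = [i for i, label in enumerate(labels) if label == "good"]
    result = list(labels)
    for i, label in enumerate(labels):
        if label != "short":
            continue
        # Distance is counted in paragraphs to the nearest originally good one.
        if any(abs(i - g) <= max_good_distance for g in good_positions):
            result[i] = "good"
    return result


labels = ["good", "short", "short", "short", "bad"]
print(reclassify_short(labels, max_good_distance=2))
```

With a limit of 2, only the first two short paragraphs are promoted; the third is three paragraphs away from the good one and stays short.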
Jun 30 2017
Minor package updates
Jun 30 2017
Justext 1.3
Jun 29 2017
Preprocess split into get_html_root and preprocess_html_root. This allows using the DOM root before the head (and other possibly useful elements) is removed, which is needed to get the page title from the head.
Apr 12 2017
new README
Apr 12 2017
filter out HTML(5) elements
Feb 24 2017
remove words containing Latin characters from Korean stoplist
Jan 12 2015
Move * out of trunk/
Nov 11 2012
Temporary workaround for issue #2: Remove any text nodes that cannot be decoded.
Jan 26 2012
Added stoplists for Kazakh, Kyrgyz, Turkmen and Uzbek.
Dec 6 2011
Fixed inserting spaces between text nodes. Before, content such as "abc<b>efg</b>" became "abc efg" after processing. Now it correctly becomes "abcefg".
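The behaviour described in this entry can be reproduced with a small sketch that uses only Python's standard `html.parser` module rather than jusText itself. The names `TextJoiner` and `extract_text` are made up for illustration; the point is that adjacent text nodes are concatenated verbatim, with no separator added.

```python
from html.parser import HTMLParser


class TextJoiner(HTMLParser):
    """Collects text nodes verbatim, without inserting separators."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Each text node is stored exactly as it appears in the markup.
        self.chunks.append(data)


def extract_text(html):
    parser = TextJoiner()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(extract_text("abc<b>efg</b>"))   # abcefg
print(extract_text("abc <b>efg</b>"))  # abc efg
```

Whitespace that is present in the source is kept, so words separated by a real space stay separated.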
Aug 8 2011
jusText 1.2
Aug 8 2011
Edited wiki page Algorithm through web user interface.
Aug 4 2011
Use character counts instead of word counts where possible (length-low, length-high, max-heading-distance and for computing link density). This makes the algorithm work well in the language-independent mode (without a stoplist) for languages where counting words is not easy (Japanese, Chinese, Thai, etc.). The default thresholds have been adjusted accordingly.
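A rough sketch of why character counts are the more robust measure, and of a character-based link density. The function `link_density` and its arguments are assumptions made for illustration and do not reflect the actual jusText code or its default thresholds.

```python
def link_density(text, link_texts):
    """Fraction of a paragraph's characters that sit inside links.
    Hypothetical helper; link_texts holds the text of each link."""
    if not text:
        return 0.0
    link_chars = sum(len(link_text) for link_text in link_texts)
    return link_chars / len(text)


# Whitespace splitting finds a single "word" in text written without
# spaces, while the character count still reflects the paragraph length.
japanese = "これは日本語の文章です"
print(len(japanese.split()))  # 1
print(len(japanese))          # 11

print(link_density("abcdefghij", ["abc", "de"]))  # 0.5
```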
Aug 4 2011
More robust parsing of meta tags containing information about the charset used.
Jun 6 2011
Bug fix: Corrected decoding of the HTML entities for the characters € to Ÿ.
Mar 28 2011
Edited wiki page Algorithm through web user interface.
Mar 28 2011
Edited wiki page Algorithm through web user interface.
Mar 23 2011
Edited wiki page Algorithm through web user interface.
Mar 17 2011
Edited wiki page Algorithm through web user interface.
Mar 9 2011
Edited wiki page Algorithm through web user interface.
Mar 9 2011
Edited wiki page Algorithm through web user interface.
Mar 9 2011
Edited wiki page Algorithm through web user interface.
Mar 9 2011
Edited wiki page Algorithm through web user interface.
Mar 9 2011
Created wiki page through web user interface.
Mar 9 2011
jusText 1.1
Mar 9 2011
Initial import.