Changes between Version 1 and Version 2 of Justext/Algorithm


Timestamp:
Mar 24, 2015, 4:12:07 PM (3 years ago)
Author:
admin
Comment:

--

  • Justext/Algorithm

= jusText algorithm =

== Introduction ==
The algorithm uses a simple segmentation approach. By default, Web browsers render the contents of some HTML tags as visual blocks. The idea is to form textual blocks by splitting the HTML page at these tags. The full list of block-level tags used is: BLOCKQUOTE, CAPTION, CENTER, COL, COLGROUP, DD, DIV, DL, DT, FIELDSET, FORM, H1, H2, H3, H4, H5, H6, LEGEND, LI, OPTGROUP, OPTION, P, PRE, TABLE, TD, TEXTAREA, TFOOT, TH, THEAD, TR, UL. A sequence of two or more BR tags also separates blocks.
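
The segmentation described above can be sketched with Python's standard `html.parser` module. This is only an illustration of the splitting rule, not jusText's actual implementation: the names `BLOCK_TAGS`, `BlockSplitter` and `split_into_blocks` are hypothetical, and the sketch assumes that text inside non-block tags (including SCRIPT and STYLE) is simply kept as-is.

```python
from html.parser import HTMLParser

# Block-level tags listed in the text above (HTMLParser reports tags in lower case).
BLOCK_TAGS = frozenset(
    "blockquote caption center col colgroup dd div dl dt fieldset form "
    "h1 h2 h3 h4 h5 h6 legend li optgroup option p pre table td textarea "
    "tfoot th thead tr ul".split()
)


class BlockSplitter(HTMLParser):
    """Hypothetical helper: collects text and cuts it into blocks at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished textual blocks
        self._buf = []     # text pieces of the block being built
        self._br_run = 0   # number of consecutive BR tags seen so far

    def _flush(self):
        # Normalize whitespace and store the block if it is not empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self._br_run += 1
            if self._br_run >= 2:
                self._flush()          # two or more BR tags separate blocks
            else:
                self._buf.append(" ")  # a single BR is only a line break
            return
        self._br_run = 0
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "br":
            return  # the closing half of <br/> must not reset the BR run
        self._br_run = 0
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self._br_run = 0  # real text between BR tags ends the run
        self._buf.append(data)


def split_into_blocks(html):
    """Return the list of textual blocks found in an HTML string."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text after the last tag
    return parser.blocks


if __name__ == "__main__":
    page = "<div><h1>Title</h1><p>First <b>para</b>.</p>one<br>two<br><br>three</div>"
    for block in split_into_blocks(page):
        print(block)
```

Inline tags such as B do not split a block, and a single BR is treated as whitespace; only a run of at least two BR tags starts a new block.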
     
{{{
#!python
if link_density > MAX_LINK_DENSITY:
    return 'bad'