Changes between Version 17 and Version 18 of Justext


Timestamp: 11/08/23 17:26:29 (13 months ago)
Author: admin
Comment:
  1. 4

  • Justext

    v17  v18
      1       - = jusText 3 =
           1  + = jusText 4 =
      2    2
      3    3    jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.
      …    …
     20   20
     21   21    == Installation ==
     22       - 1. Make sure you have Python 2.7 and lxml library version 4.1 or later installed.
          22  + 1. Make sure you have Python 3.6 or newer and lxml library version 4.1 or later (system-wide, e.g. python3-lxml in Fedora, or pip install --user lxml) and Python HTML 5 parser (system-wide, e.g. python3-html5-parser in Fedora, or pip install --user html5-parser) installed.
     23   23    2. Download the sources:
     24   24    {{{
     25       - wget http://corpus.tools/raw-attachment/wiki/Downloads/justext-3.0.tar.gz
          25  + wget http://corpus.tools/raw-attachment/wiki/Downloads/justext-4.2.tar.gz
     26   26    }}}
     27   27    3. Extract the downloaded file:
     28   28    {{{
     29       - tar xzvf justext-3.0.tar.gz
          29  + tar xzvf justext-4.2.tar.gz
     30   30    }}}
     31   31    4. Install the package (omit --user to install for all users):
     32   32    {{{
     33       - cd justext-3.0/
          33  + cd justext-4.2/
     34   34    python3 setup.py install --user
     35   35    }}}
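
The prerequisites named in step 1 of the v18 text (Python 3.6 or newer, lxml 4.1 or later, and the Python HTML 5 parser) can be checked before running the install step. The following is only a minimal sketch and not part of jusText: the helper names meets_minimum and check_prerequisites are hypothetical, and it assumes that the html5-parser package is importable under the module name html5_parser.

```python
import importlib
import sys

# Minimum versions taken from step 1 of the v18 installation text.
MIN_PYTHON = (3, 6)
MIN_LXML = (4, 1)


def meets_minimum(found, required):
    """Compare version tuples on as many fields as `required` provides."""
    return tuple(found[:len(required)]) >= tuple(required)


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (ok) or False."""
    report = {"python": meets_minimum(sys.version_info, MIN_PYTHON)}

    # lxml exposes its own version as a tuple in lxml.etree.LXML_VERSION.
    try:
        etree = importlib.import_module("lxml.etree")
        report["lxml"] = meets_minimum(etree.LXML_VERSION, MIN_LXML)
    except ImportError:
        report["lxml"] = False

    # Assumption: the html5-parser package installs a module named html5_parser.
    try:
        importlib.import_module("html5_parser")
        report["html5_parser"] = True
    except ImportError:
        report["html5_parser"] = False

    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print("{}: {}".format(name, "ok" if ok else "missing or too old"))
```

Any entry reported as missing or too old would need to be installed (system-wide or with pip install --user, as in step 1) before running python3 setup.py install.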