= jusText 4 =
jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora.
[wiki:Justext_changelog Changelog]
== How it works ==
See what is kept and what is discarded from a [attachment:nlp_jusText_fi.jpg typical web page].
Read a description of the jusText [Justext/Algorithm algorithm].
== Installation ==
1. Make sure you have Python 3.6 or newer. The required packages are installed automatically by pip; alternatively, install them system-wide (e.g. python3-lxml and python3-html5-parser on Fedora).
2. Download, extract, install:
{{{
wget https://corpus.tools/raw-attachment/wiki/Downloads/justext-4.3.tar.gz
tar xzvf justext-4.3.tar.gz
cd justext-4.3/
pip install --user . #omit --user to install for all users
}}}
=== Legacy installation (using distutils, deprecated) ===
{{{
wget https://corpus.tools/raw-attachment/wiki/Downloads/justext-4.2.5.tar.gz
tar xzvf justext-4.2.5.tar.gz
cd justext-4.2.5/
python3 setup.py install --user #omit --user to install for all users
}}}
== Quick start ==
{{{
wget -O page.html https://planetpython.org/
justext -s English page.html > cleaned-page.txt
}}}
For usage information see:
{{{
justext --help
}}}
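To clean several downloaded pages in one go, the quick-start command can be wrapped in a small script. The following is only a sketch: the helper names `build_command` and `clean_file` and the script name are hypothetical, and it assumes that the `justext` executable is on the PATH and writes the cleaned text to standard output, as in the quick start above.

```python
import subprocess
import sys


def build_command(html_path, language="English"):
    """Return the argument list for the quick-start command (hypothetical helper)."""
    return ["justext", "-s", language, str(html_path)]


def clean_file(html_path, out_path, language="English"):
    """Run justext on html_path and save its standard output to out_path."""
    # Binary mode: the bytes produced by justext are stored unchanged.
    with open(out_path, "wb") as out:
        subprocess.run(build_command(html_path, language), stdout=out, check=True)


if __name__ == "__main__":
    # Assumed usage: python clean_pages.py page1.html page2.html ...
    for path in sys.argv[1:]:
        clean_file(path, path + ".txt")
```

With `check=True`, a failing `justext` run raises `subprocess.CalledProcessError` instead of silently leaving an empty output file.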
== Python API ==
Compatible with Python 3.6 and Python 2.7.
{{{
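import justext
import requests

# Download the page and pass it to jusText as UTF-8 encoded bytes
page = requests.get('https://planetpython.org/').text.encode('utf-8')
paragraphs = justext.justext(page, justext.get_stoplist('English'))
# Print only the paragraphs classified as good (non-boilerplate) text
for paragraph in paragraphs:
    if paragraph['class'] == 'good':
        print(paragraph['text'])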
}}}
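The filtering step from the example above can be factored into a small helper when the cleaned text is needed as a single string rather than printed. This is a minimal sketch: `good_text` is a hypothetical name, it assumes paragraphs are dictionaries with `'class'` and `'text'` keys as in the example above, and the sample data is made up for illustration.

```python
def good_text(paragraphs):
    """Join the text of all paragraphs whose class is 'good' (hypothetical helper)."""
    return "\n".join(p["text"] for p in paragraphs if p["class"] == "good")


# Made-up paragraphs in the same shape as the API example; not real jusText output.
sample = [
    {"class": "bad", "text": "Home | About | Contact"},
    {"class": "good", "text": "jusText keeps paragraphs made of full sentences."},
    {"class": "good", "text": "Boilerplate is dropped."},
]

cleaned = good_text(sample)
print(cleaned)
```

In real use, the list returned by `justext.justext(...)` would be passed in place of `sample`.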
== Online demo ==
[https://nlp.fi.muni.cz/projects/justext/]
== Acknowledgements ==
This software has been developed at the [https://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [https://www.muni.cz/ Masaryk University in Brno] with financial support from [https://presemt.eu PRESEMT] and [https://www.sketchengine.eu Lexical Computing Ltd.] It also relates to Jan Pomikálek's [https://is.muni.cz/th/45523/fi_d/phdthesis.pdf PhD research].
== Licence ==
jusText is licensed under the [https://opensource.org/licenses/BSD-3-Clause BSD 3-Clause License].