

jusText 3

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora.


How it works

See what is kept and what is discarded from a typical web page.

Read a description of the jusText algorithm.


Installation

  1. Make sure you have Python 3.6 or Python 2.7 and the lxml library, version 4.1 or later, installed.
  2. Download the sources:
  3. Extract the downloaded file:
    tar xzvf justext-3.0.tar.gz
  4. Install the package (omit --user to install for all users):
    cd justext-3.0/
    python3 setup.py install --user
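
The lxml requirement from step 1 can be checked before installing. The following is a minimal sketch for Python 3 only, not part of jusText: the helper names version_at_least and lxml_ok are illustrative, and the check relies on lxml exposing its version as the tuple lxml.etree.LXML_VERSION.

```python
import importlib.util


def version_at_least(found, required):
    """Compare two version tuples, e.g. (4, 9, 3, 0) against (4, 1)."""
    return tuple(found[:len(required)]) >= tuple(required)


def lxml_ok(required=(4, 1)):
    """Return True if lxml is importable and at least `required`."""
    if importlib.util.find_spec("lxml") is None:
        return False
    from lxml import etree
    # LXML_VERSION is a tuple of integers such as (4, 9, 3, 0)
    return version_at_least(etree.LXML_VERSION, required)


print("lxml >= 4.1 available:", lxml_ok())
```

If this reports False, install or upgrade lxml before running the install step.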

Quick start

wget -O page.html
justext -s English page.html > cleaned-page.txt

For usage information see:

justext --help
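
To run the same command over several files from a script, the command line above can be wrapped in Python. This is only a sketch under assumptions: the function names justext_command and clean_file are hypothetical, the justext executable is assumed to be on the PATH, and only the -s option shown above is used.

```python
import subprocess


def justext_command(html_path, stoplist="English"):
    """Build the argument list for the command-line call shown above."""
    return ["justext", "-s", stoplist, html_path]


def clean_file(html_path, out_path, stoplist="English"):
    """Run justext on html_path and write its standard output to out_path."""
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError if justext exits with an error
        subprocess.run(justext_command(html_path, stoplist), stdout=out, check=True)
```

Calling clean_file("page.html", "cleaned-page.txt") would then correspond to the shell redirection in the quick start.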

Python API

Python 3.6 & Python 2.7 compatible

import justext
import requests
page = requests.get('').text.encode('utf-8')

paragraphs = justext.justext(page, justext.get_stoplist('English'))
for paragraph in paragraphs:
    if paragraph['class'] == 'good':
        # Assumption: the paragraph text is stored under the 'text' key
        print(paragraph['text'])
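
The filtering step can also be tried without downloading a page. The sketch below works on hand-written sample data rather than real jusText output. It assumes each paragraph is a dictionary with a 'class' key, as in the loop above, and a 'text' key, which is an assumption not confirmed on this page. The helper name good_text is hypothetical.

```python
def good_text(paragraphs):
    """Join the text of all paragraphs classified as 'good'."""
    return "\n".join(
        p["text"] for p in paragraphs if p["class"] == "good"
    )


# Hypothetical sample data in the shape assumed above, not real jusText output
sample = [
    {"class": "bad", "text": "Home | About | Contact"},
    {"class": "good", "text": "jusText keeps paragraphs made of full sentences."},
    {"class": "bad", "text": "Copyright notice"},
]
print(good_text(sample))
```

Collecting the text in one string, rather than printing inside the loop, makes it easier to write the cleaned page to a file or pass it on to other corpus tools.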

Online demo


Acknowledgements

This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to Jan Pomikálek's PhD research.


License

jusText is licensed under the BSD 3-Clause License.
