jusText 4

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora.

Changelog

Paper | Cite | Licence

How it works

See what is kept and what is discarded from a typical web page.

Read a description of the jusText algorithm.

Installation

  1. Make sure you have Python 3.11 or newer. The required packages are installed automatically by pip; alternatively, install them system-wide (e.g. python3-lxml and python3-html5-parser on Fedora).
  2. Download, extract, install:
    wget https://corpus.tools/raw-attachment/wiki/Downloads/justext-4.3.tar.gz
    tar xzvf justext-4.3.tar.gz
    cd justext-4.3/
    pip install --user .  # omit --user to install for all users

Legacy installation using deprecated distutils

Compatible with Python 3.6 and Python 2.7.

wget https://corpus.tools/raw-attachment/wiki/Downloads/justext-4.2.5.tar.gz
tar xzvf justext-4.2.5.tar.gz
cd justext-4.2.5/
python3 setup.py install --user  # omit --user to install for all users

Quick start

wget -O page.html https://planetpython.org/
justext -s English page.html > cleaned-page.txt

For usage information see:

justext --help

Python API

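import justext
import requests

# Download the page and hand it to jusText as UTF-8 encoded bytes.
page = requests.get('https://planetpython.org/').text.encode('utf-8')

# Classify every paragraph of the page using the English stoplist.
paragraphs = justext.justext(page, justext.get_stoplist('English'))

# Print only the paragraphs classified as 'good', i.e. not boilerplate.
for paragraph in paragraphs:
    if paragraph['class'] == 'good':
        print(paragraph['text'])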
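
To collect the cleaned text into a single string rather than printing it, the filtering step can be wrapped in a small helper. This is a minimal sketch, not part of jusText: `good_text` is a hypothetical name, and the code assumes only that each paragraph can be indexed with `'class'` and `'text'`, as in the example above. The `sample` list is hand-written stand-in data, not real jusText output, so the snippet runs without network access.

```python
def good_text(paragraphs, separator="\n\n"):
    """Join the text of all paragraphs whose class is 'good'.

    Hypothetical helper (not part of jusText). It assumes each paragraph
    supports paragraph['class'] and paragraph['text'].
    """
    return separator.join(
        p["text"] for p in paragraphs if p["class"] == "good"
    )


# Hand-written stand-in for the list returned by justext.justext();
# the values are made up for illustration.
sample = [
    {"class": "bad", "text": "Home | About | Contact"},
    {"class": "good", "text": "First full sentence of the article."},
    {"class": "good", "text": "Second paragraph of running text."},
]

print(good_text(sample))
```

With real input, the `paragraphs` list from the example above would be passed in place of `sample`.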

Online demo

https://nlp.fi.muni.cz/projects/justext/

Acknowledgements

This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to Jan Pomikálek's PhD research.

Licence

jusText is licensed under the BSD 3-Clause License.

Last modified on 12/05/23 17:11:29
