wiki:Justext

Version 3 (modified by admin, 11 years ago) ( diff )
--

jusText

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.

What's new

Mišo Belica created a jusText fork on GitHub with some tweaks.

jusText is now also available on PyPi.

Changelog

Installation

Make sure you have Python and lxml library version 2.2.4 or later installed.

Download the sources:

wget http://corpus.tools/attachment/wiki/Downloads/justext-1.2.tar.gz

Extract the downloaded file:
```
tar xzvf justext-1.2.tar.gz
```
Install the package (you may need sudo or a root shell for the latter command):
```
cd justext-1.2/
python setup.py install
```

Quick start

wget -O page.html http://planet.python.org/
justext -s English page.html > cleaned-page.txt

For usage information see:

justext --help

Python API

import urllib2
import justext

page = urllib2.urlopen('http://planet.python.org/').read()
paragraphs = justext.justext(page, justext.get_stoplist('English'))
for paragraph in paragraphs:
    if paragraph['class'] == 'good':
        print paragraph['text']

Online demo

http://nlp.fi.muni.cz/projects/justext/

Acknowledgements

This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to Jan Pomikálek's PhD research.