jusText 4
jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages. It is designed to preserve mainly text containing full sentences and it is therefore well suited for creating linguistic resources such as Web corpora.
Paper | Cite | Licence
How it works
See what is kept and what is discarded from a typical web page.
Read a description of the jusText algorithm.
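As a rough intuition for why a stoplist helps tell full sentences apart from navigation text, the following toy sketch measures the share of common function words in a piece of text. It is an illustration only and is not the jusText implementation: the names `TINY_STOPLIST` and `stopword_density` are hypothetical, and the nine-word stoplist is a placeholder for a real one.

```python
# Toy illustration only: this is NOT the jusText implementation.
# TINY_STOPLIST and stopword_density are hypothetical names.
TINY_STOPLIST = frozenset(
    {"the", "a", "of", "and", "is", "to", "in", "it", "for"}
)


def stopword_density(text, stoplist=TINY_STOPLIST):
    """Return the fraction of words in `text` that occur in `stoplist`."""
    # Split on whitespace, strip surrounding punctuation, lowercase.
    words = [w.strip(".,;:!?\"'()|").lower() for w in text.split()]
    # Drop tokens that were punctuation only (e.g. a bare "|").
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in stoplist for w in words) / len(words)


sentence = "It is designed to preserve the text of a page."
menu = "Home | Downloads | Contact | Login"

# A full sentence tends to score higher than a navigation menu.
print(stopword_density(sentence))
print(stopword_density(menu))
```

Ordinary sentences contain many function words, while menus and link lists contain almost none, so this kind of signal gives one plausible way to prefer sentence-like text.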
Installation
- Make sure you have Python 3.6 or newer. Required packages are installed via pip automatically (or system-wide, e.g. python3-lxml and python3-html5-parser in Fedora).
- Download, extract, install:
  wget http://corpus.tools/raw-attachment/wiki/Downloads/justext-4.3.tar.gz
  tar xzvf justext-4.3.tar.gz
  cd justext-4.3/
  pip install --user .  # omit --user to install for all users
Legacy installation (using distutils, deprecated)
wget http://corpus.tools/raw-attachment/wiki/Downloads/justext-4.2.5.tar.gz
tar xzvf justext-4.2.5.tar.gz
cd justext-4.2.5/
python3 setup.py install --user  # omit --user to install for all users
Quick start
wget -O page.html http://planet.python.org/
justext -s English page.html > cleaned-page.txt
For usage information see:
justext --help
Python API
Python 3.6 & Python 2.7 compatible
import justext
import requests

page = requests.get('http://planet.python.org/').text.encode('utf-8')
paragraphs = justext.justext(page, justext.get_stoplist('English'))
for paragraph in paragraphs:
    if paragraph['class'] == 'good':
        print(paragraph['text'])
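If the cleaned text is needed as a single string rather than printed paragraph by paragraph, the filtering step can be wrapped in a small helper. This is a minimal sketch: `good_text` is a hypothetical name, not part of jusText, and it assumes only what the snippet above shows, namely that each paragraph can be indexed with 'class' and 'text'. The sample list is hand-made stand-in data (its 'bad' label is a placeholder) so that the block runs without jusText or network access.

```python
def good_text(paragraphs):
    """Join the text of all paragraphs whose class is 'good'.

    Hypothetical helper, not part of jusText. `paragraphs` is assumed to
    be an iterable of dict-like items with 'class' and 'text' keys.
    """
    return "\n".join(p['text'] for p in paragraphs if p['class'] == 'good')


# Hand-made stand-in for the list returned by justext.justext(...).
sample = [
    {'class': 'bad', 'text': 'Home | About | Login'},
    {'class': 'good', 'text': 'This is the first real paragraph.'},
    {'class': 'good', 'text': 'And this is the second one.'},
]

print(good_text(sample))
```

With real data, the call would be `good_text(paragraphs)`, where `paragraphs` is the result of `justext.justext(...)` as obtained above.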
Online demo
http://nlp.fi.muni.cz/projects/justext/
Acknowledgements
This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to Jan Pomikálek's PhD research.
Licence
jusText is licensed under the BSD 3-Clause License.