Changes between Initial Version and Version 1 of Justext


Timestamp: 01/12/15 18:19:57
Author: admin
= justext =

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora.

== What's new ==
Mišo Belica created a [https://github.com/miso-belica/jusText jusText fork on GitHub] with some tweaks.

jusText is now also [https://pypi.python.org/pypi/jusText available on PyPI].

[http://corpus.tools/browser/justext/CHANGES Changelog]

== Installation ==
1. Make sure you have Python and the lxml library (version 2.2.4 or later) installed.
2. Download the sources:
{{{
wget http://corpus.tools/attachment/wiki/Downloads/justext-1.2.tar.gz
}}}
3. Extract the downloaded file:
{{{
tar xzvf justext-1.2.tar.gz
}}}
4. Install the package (you may need sudo or a root shell for the latter command):
{{{
cd justext-1.2/
python setup.py install
}}}

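The lxml requirement from step 1 can be checked before installing. The following is only a minimal sketch: the helper name `lxml_is_recent_enough` is made up for illustration and is not part of jusText; it relies on `lxml.etree.LXML_VERSION`, which lxml exposes as a version tuple.

```python
# Minimum lxml version named in the installation steps above.
REQUIRED_LXML = (2, 2, 4)


def lxml_is_recent_enough(installed, required=REQUIRED_LXML):
    """Hypothetical helper: compare version tuples component by component."""
    return tuple(installed[:len(required)]) >= tuple(required)


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed")
    else:
        # etree.LXML_VERSION is a tuple such as (2, 2, 4, 0).
        version = ".".join(str(part) for part in etree.LXML_VERSION)
        if lxml_is_recent_enough(etree.LXML_VERSION):
            print("lxml %s is recent enough" % version)
        else:
            print("lxml %s is older than 2.2.4" % version)
```
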
== Quick start ==
{{{
wget -O page.html http://planet.python.org/
justext -s English page.html > cleaned-page.txt
}}}
For usage information see:
{{{
justext --help
}}}

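== Python API ==
{{{
# Python 2 example (urllib2 and the print statement are Python 2 only).
import urllib2
import justext

# Download the page and classify its paragraphs using the English stoplist.
page = urllib2.urlopen('http://planet.python.org/').read()
paragraphs = justext.justext(page, justext.get_stoplist('English'))
for paragraph in paragraphs:
    # Print only the paragraphs classified as 'good' (non-boilerplate).
    if paragraph['class'] == 'good':
        print paragraph['text']
}}}
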
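To collect the cleaned text into a single string instead of printing it, the same filter can be wrapped in a small function. This is a sketch under assumptions: `good_text` is a hypothetical helper, not part of jusText, and the sample list is hand-written data that only mimics the dictionary layout (`'class'` and `'text'` keys) used in the example above.

```python
def good_text(paragraphs, separator="\n\n"):
    """Hypothetical helper: join the text of paragraphs classified as 'good'."""
    return separator.join(
        p['text'] for p in paragraphs if p['class'] == 'good'
    )


# Hand-written stand-in for the list returned by justext.justext().
sample = [
    {'class': 'bad', 'text': 'Home | About | Contact'},
    {'class': 'good', 'text': 'First real sentence of the article.'},
    {'class': 'good', 'text': 'Second real sentence.'},
]

print(good_text(sample))
```

With real data, the `sample` list would be replaced by the `paragraphs` value from the example above.
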
== Online demo ==
[http://nlp.fi.muni.cz/projects/justext/]

== Acknowledgements ==
This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.] It also relates to Jan Pomikálek's [http://is.muni.cz/th/45523/fi_d/phdthesis.pdf PhD research].