Changes between Initial Version and Version 1 of Chared


Timestamp:
08/06/15 13:44:38 (9 years ago)
Author:
admin

= Chared =

Chared is a tool for detecting the character encoding of a text in a known language. The language of the text must be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.

== Installation ==
1. Make sure you have Python 2.6 or later and the lxml library version 2.2.4 or later installed.
2. Download the sources:
{{{
wget http://chared.googlecode.com/files/chared-1.2.tar.gz
}}}
3. Extract the downloaded file:
{{{
tar xzvf chared-1.2.tar.gz
}}}
4. Install the package (you may need sudo or a root shell for the second command):
{{{
cd chared-1.2/
python setup.py install
}}}
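The version requirements from step 1 can be checked before installing. The following is a minimal sketch and not part of chared: `meets_minimum` and `check_prerequisites` are hypothetical helper names, and the check relies on `lxml.etree.LXML_VERSION`, the version tuple that lxml exposes.

```python
import sys

# Minimum versions taken from step 1 of the installation instructions.
MIN_PYTHON = (2, 6)
MIN_LXML = (2, 2, 4)


def meets_minimum(version, minimum):
    """Return True if a version tuple is at least the required minimum."""
    return tuple(version[:len(minimum)]) >= tuple(minimum)


def check_prerequisites():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not meets_minimum(sys.version_info, MIN_PYTHON):
        problems.append("Python 2.6 or later is required")
    try:
        from lxml import etree
    except ImportError:
        problems.append("lxml is not installed")
    else:
        if not meets_minimum(etree.LXML_VERSION, MIN_LXML):
            problems.append("lxml 2.2.4 or later is required")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Run as a script, it prints one line per unmet requirement and nothing when both requirements are satisfied.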

== Quick start ==
Detect the character encoding for a file or URL:
{{{
chared -m czech http://nlp.fi.muni.cz/cs/nlplab
}}}
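To call the same command from a script, one option is a thin wrapper around the command-line tool. This is a sketch under assumptions: `build_detect_command` and `detect_encoding` are hypothetical names, `chared` is assumed to be on the `PATH`, and because the output format of the command is not described here, the raw text is returned unparsed. `subprocess.check_output` requires Python 2.7 or later.

```python
import subprocess


def build_detect_command(language, target):
    """Build the argument list for: chared -m LANGUAGE TARGET."""
    return ["chared", "-m", language, target]


def detect_encoding(language, target):
    """Run chared and return whatever it prints, stripped of whitespace.

    Assumption: the output is returned as-is, without any parsing.
    """
    output = subprocess.check_output(build_detect_command(language, target))
    return output.decode("utf-8", "replace").strip()
```

Passing the arguments as a list avoids shell quoting problems when the target is a URL containing characters such as `&` or `?`.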
Create a custom character encoding detection model from a collection of HTML pages (e.g. for Swahili):
{{{
chared-learn -o swahili.edm swahili_pages/*.html
}}}
Or, if you have a sample text in Swahili (plain text, UTF-8) and want to apply language filtering to the input HTML files (recommended):
{{{
chared-learn -o swahili.edm -S swahili_sample.txt swahili_pages/*.html
}}}
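When the training command is assembled by a script, the optional `-S` argument is easy to get wrong. The sketch below builds the argument list for both variants shown above; `build_learn_command` and `html_pages` are hypothetical helper names, and only the `-o` and `-S` options from the examples are used.

```python
import glob


def build_learn_command(model_path, pages, sample_path=None):
    """Build the argument list for chared-learn.

    sample_path, if given, is passed with -S to enable language filtering.
    """
    command = ["chared-learn", "-o", model_path]
    if sample_path is not None:
        command += ["-S", sample_path]
    return command + list(pages)


def html_pages(directory):
    """Return the paths matching DIRECTORY/*.html in sorted order."""
    return sorted(glob.glob(directory.rstrip("/") + "/*.html"))
```

For example, `build_learn_command("swahili.edm", html_pages("swahili_pages"), "swahili_sample.txt")` yields a list that can be handed to `subprocess.call`.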
For usage information, see:
{{{
chared --help
chared-learn --help
}}}

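== Python API ==

Load the model for a language and classify the raw bytes of a page (Python 2 session):
{{{
>>> import urllib2
>>> import chared.detector
>>> # Download the raw, undecoded bytes of the page.
>>> page = urllib2.urlopen('http://nlp.fi.muni.cz/cs/nlplab').read()
>>> # Locate and load the Czech model shipped with the package.
>>> cz_model_path = chared.detector.get_model_path('czech')
>>> cz_model = chared.detector.EncodingDetector.load(cz_model_path)
>>> # classify() returns a list of detected encoding names.
>>> cz_model.classify(page)
['utf_8']
}}}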
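A common next step is to decode the page with the detected encoding. The helper below is a sketch, not part of the chared API: `decode_with_detector` and `FixedDetector` are hypothetical names. It assumes only what the session above shows, namely that `classify()` returns a list of Python codec names; handling of an empty list or several candidates is an added assumption.

```python
def decode_with_detector(detector, data, fallback="utf_8"):
    """Decode bytes using the first detected encoding that works.

    detector is any object with a classify(data) method returning a list
    of Python codec names, such as the model loaded in the session above.
    Returns a (text, encoding) pair.
    """
    for encoding in detector.classify(data):
        try:
            return data.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or bytes that do not fit
    # Assumption: fall back to a lossy decode when nothing else works.
    return data.decode(fallback, "replace"), fallback


class FixedDetector(object):
    """Hypothetical stand-in used only to exercise the helper."""

    def __init__(self, encodings):
        self.encodings = encodings

    def classify(self, data):
        return list(self.encodings)
```

With a real model, the call would be `decode_with_detector(cz_model, page)`; the stand-in class exists only so the helper can be tried without chared installed.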

== Acknowledgements ==
This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.]

== See also ==
[http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html Unicode over 60 percent of the web] at Google blog