
Chared

Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.

Paper | Cite | Licence

Installation

  1. Make sure you have Python 3.6 or later and the lxml library (version 4.1 or later) installed.
  2. Download the sources:
    wget http://corpus.tools/raw-attachment/wiki/Downloads/chared-2.0.tar.gz
    
  3. Extract the downloaded file:
    tar xzvf chared-2.0.tar.gz
    cd chared-2.0/
    
  4. Install the package (to install for all users, omit --user; you may need sudo or a root shell); a quick check of the installation is sketched below the list:
    python3 setup.py install --user
    
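To check that the installation succeeded, you can confirm that the Python package is importable and that the bundled models can be located (a minimal sketch; get_model_path and the 'czech' model name are taken from the Python API example below, and chared --help should also work from the command line):

>>> import chared.detector                          # fails with ImportError if the install did not succeed
>>> path = chared.detector.get_model_path('czech')  # filesystem path of the bundled Czech model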

Quick start

Detect the character encoding for a file or URL:

chared -m czech http://nlp.fi.muni.cz/cs/nlplab

Create a custom character encoding detection model from a collection of HTML pages (e.g. for Swahili):

chared-learn -o swahili.edm swahili_pages/*.html

... or, if you have a sample text in Swahili (plain text, UTF-8) and want to apply language filtering to the input HTML files (recommended):

chared-learn -o swahili.edm -S swahili_sample.txt swahili_pages/*.html
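The resulting swahili.edm file should then be usable like the bundled models, assuming a model written by chared-learn can be loaded with EncodingDetector.load as in the Python API example below (a sketch; some_page.html stands for any HTML file you want to classify):

>>> import chared.detector
>>> sw_model = chared.detector.EncodingDetector.load('swahili.edm')  # load the custom model by file path
>>> sw_model.classify(open('some_page.html', 'rb').read())           # list of candidate encodings for the page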

For usage information see:

chared --help
chared-learn --help

Python API

>>> from urllib.request import urlopen
>>> import chared.detector
>>> page = urlopen('https://nlp.fi.muni.cz/').read()
>>> cz_model_path = chared.detector.get_model_path('czech')
>>> cz_model = chared.detector.EncodingDetector.load(cz_model_path)
>>> cz_model.classify(page)
['utf_8']
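
classify() returns a list of candidate encodings; in this example the returned name is a valid Python codec alias, so the downloaded bytes can be decoded with it directly (a possible continuation of the session above):

>>> encodings = cz_model.classify(page)
>>> text = page.decode(encodings[0])  # decode the raw bytes using the detected encoding ('utf_8' here)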

Acknowledgements

This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd.

See also

Unicode over 60 percent of the web (Google blog post)