Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms that have no language constraints.



Installation

  1. Make sure you have Python 3.6 or later and the lxml library version 4.1 or later installed.
  2. Download the sources:
  3. Extract the downloaded file:
    tar xzvf chared-2.0.tar.gz
    cd chared-2.0/
  4. Install the package (to install for all users, omit --user; you may need sudo or a root shell):
    python3 setup.py install --user

Quick start

Detect the character encoding for a file or URL:

chared -m czech
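
The same command can be driven from a Python script, for example when checking many files in a loop. The following is only a minimal sketch: the helper names chared_command and run_chared are hypothetical, it assumes the file or URL is passed as a positional argument after the model name, and it returns the tool's output unparsed because the output format is not described here.

```python
import subprocess


def chared_command(model, target):
    """Build the argument list for `chared -m MODEL TARGET`.

    Assumption: the file or URL follows the model name as a positional argument.
    """
    return ["chared", "-m", model, target]


def run_chared(model, target):
    """Run the chared command-line tool and return its standard output as text.

    The output is returned unparsed. A non-zero exit status raises
    subprocess.CalledProcessError.
    """
    result = subprocess.run(
        chared_command(model, target),
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode; also works on Python 3.6
        check=True,
    )
    return result.stdout
```

With the package installed, run_chared('czech', 'page.html') would run the tool on a hypothetical local file named page.html.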

Create a custom character encoding detection model from a collection of HTML pages (e.g. for Swahili):

chared-learn -o swahili.edm swahili_pages/*.html

... or if you have a sample text in Swahili (plain text, UTF-8) and want to apply language filtering on the input HTML files (recommended):

chared-learn -o swahili.edm -S swahili_sample.txt swahili_pages/*.html

For usage information see:

chared --help
chared-learn --help

Python API

>>> from urllib.request import urlopen
>>> import chared.detector
>>> page = urlopen('').read()
>>> cz_model_path = chared.detector.get_model_path('czech')
>>> cz_model = chared.detector.EncodingDetector.load(cz_model_path)
>>> cz_model.classify(page)
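
Once an encoding has been detected, the usual next step is to decode the bytes. The sketch below is one possible way to do that, not part of Chared itself: decode_with_detector is a hypothetical helper, and it assumes that classify() returns a list of Python codec names. It accepts any object with a classify(data) method, so the loaded model from the session above could be passed in as the detector.

```python
def decode_with_detector(data, detector, fallback="utf-8"):
    """Decode bytes with the first usable encoding proposed by the detector.

    Assumption: detector.classify(data) returns a list of Python codec names.
    Returns a (text, encoding_name) pair.
    """
    for name in detector.classify(data):
        try:
            return data.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or the bytes are not valid in it: try the next one.
            continue
    # No proposed encoding worked: decode leniently with the fallback.
    return data.decode(fallback, errors="replace"), fallback


# With Chared installed, usage might look like this:
#   text, encoding = decode_with_detector(page, cz_model)
```

The fallback replaces undecodable bytes instead of raising, so the function always returns text. Callers that prefer a hard failure can check the returned encoding name against the fallback.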


This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd.

See also

Unicode over 60 percent of the web at Google blog

Last modified on 06/18/20 20:55:20