Chared

Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. Because the language is known, detection should generally be more accurate than with character encoding detection algorithms that place no constraint on the language.
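
To illustrate why knowing the language helps, here is a toy sketch in Python. It is not Chared's actual algorithm or model format: it simply encodes a known sample of the language in each candidate encoding, collects the byte trigrams, and picks the encoding whose trigrams overlap most with the unknown input. The function names and the sample sentence are assumptions made up for this illustration.

```python
def byte_trigrams(data):
    """Return the set of all 3-byte sequences occurring in data."""
    return {data[i:i + 3] for i in range(len(data) - 2)}


def build_models(sample_text, encodings):
    """Build one toy 'model' (a set of byte trigrams) per candidate encoding."""
    models = {}
    for enc in encodings:
        try:
            models[enc] = byte_trigrams(sample_text.encode(enc))
        except UnicodeEncodeError:
            # The sample cannot be represented in this encoding; skip it.
            continue
    return models


def guess_encoding(data, models):
    """Pick the encoding whose model shares the most trigrams with data."""
    observed = byte_trigrams(data)
    return max(models, key=lambda enc: len(observed & models[enc]))


# Hypothetical Czech sample text standing in for a real language model.
sample = "Příliš žluťoučký kůň úpěl ďábelské ódy"
models = build_models(sample, ["utf_8", "cp1250", "iso8859_2"])

unknown = "žluťoučký kůň".encode("cp1250")
print(guess_encoding(unknown, models))  # prints: cp1250
```

The cp1250 and iso8859_2 encodings agree on most Czech letters and differ only on a few (such as ž and ť), so a detector without any language knowledge has little to go on; a model built from text in the right language makes the difference visible.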

Paper | Cite | Licence

Installation

Make sure you have Python 3.6 or newer. Required packages are installed automatically by pip; alternatively, you can install them system-wide beforehand (e.g. python3-lxml in Fedora).

Download, extract, install:

wget https://corpus.tools/raw-attachment/wiki/Downloads/chared-2.1.tar.gz
tar xzvf chared-2.1.tar.gz
cd chared-2.1/
pip install --user .  # omit --user to install for all users

Legacy installation (using distutils, deprecated)

wget https://corpus.tools/raw-attachment/wiki/Downloads/chared-2.0.tar.gz
tar xzvf chared-2.0.tar.gz
cd chared-2.0/
python3 setup.py install --user  # omit --user to install for all users

Quick start

Detect the character encoding for a file or URL:

chared -m czech http://nlp.fi.muni.cz/cs/nlplab

Create a custom character encoding detection model from a collection of HTML pages (e.g. for Swahili):

chared-learn -o swahili.edm swahili_pages/*.html

... or if you have a sample text in Swahili (plain text, UTF-8) and want to apply language filtering on the input HTML files (recommended):

chared-learn -o swahili.edm -S swahili_sample.txt swahili_pages/*.html

For usage information see:

chared --help
chared-learn --help

Python API

>>> from urllib.request import urlopen
>>> import chared.detector
>>> page = urlopen('https://nlp.fi.muni.cz/').read()
>>> cz_model_path = chared.detector.get_model_path('czech')
>>> cz_model = chared.detector.EncodingDetector.load(cz_model_path)
>>> cz_model.classify(page)
['utf_8']
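
Since classify returns a list of encoding names, a natural next step is to decode the page with the detected encoding. The following is a minimal sketch, not part of Chared: decode_with_candidates is a hypothetical helper, and it assumes that the list may hold zero or several names, trying each in order before falling back to UTF-8 with replacement characters.

```python
def decode_with_candidates(data, candidates, fallback="utf_8"):
    """Decode bytes with the first candidate encoding that works.

    Returns a (text, encoding_used) pair. If no candidate succeeds,
    the fallback encoding is used and undecodable bytes are replaced.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    return data.decode(fallback, errors="replace"), fallback


# Stand-in for a downloaded page; with Chared installed, the candidate
# list would come from cz_model.classify(page) as shown above.
page_bytes = "Příliš žluťoučký kůň".encode("cp1250")
text, used = decode_with_candidates(page_bytes, ["utf_8", "cp1250"])
print(used, text)
```

Here the bytes are not valid UTF-8, so the first candidate is rejected and the second one is used.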

Acknowledgements

This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd.