Chared

Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. Because it relies on a language-specific model, it should generally be more accurate than character encoding detection algorithms that work without language constraints.
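
To illustrate why knowing the language helps, here is a toy sketch. It is not chared's algorithm or model format: the letter set, the candidate list and the function name are all hypothetical stand-ins for a real language model. It decodes the input with each candidate encoding and keeps the one that yields the most letters expected in the given language.

```python
# Toy illustration only; this is NOT how chared works internally.
# CZECH_LETTERS is a hypothetical stand-in for a real language model.
CZECH_LETTERS = set("áčďéěíňóřšťúůýž")
# Assumed candidate encodings (Python codec names).
CANDIDATES = ["utf_8", "iso8859_2", "cp1250"]


def guess_encoding(data, expected=CZECH_LETTERS, candidates=CANDIDATES):
    """Return the candidate whose decoding yields the most expected letters."""
    best, best_score = None, -1
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        # Count characters that are typical for the target language.
        score = sum(ch in expected for ch in text.lower())
        if score > best_score:
            best, best_score = enc, score
    return best
```

Without the language-specific letter set, ISO-8859-2 and Windows-1250 would be hard to tell apart, since both decode most byte sequences without error.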

Paper | Cite | Licence

Installation

  1. Make sure you have Python 3.6 or newer. Required packages are installed automatically by pip; alternatively, install them system-wide (e.g. python3-lxml in Fedora).
  2. Download, extract, install:
    wget https://corpus.tools/raw-attachment/wiki/Downloads/chared-2.1.tar.gz
    tar xzvf chared-2.1.tar.gz
    cd chared-2.1/
    pip install --user .  # omit --user to install for all users
    

Legacy installation (using distutils, deprecated)

wget https://corpus.tools/raw-attachment/wiki/Downloads/chared-2.0.tar.gz
tar xzvf chared-2.0.tar.gz
cd chared-2.0/
python3 setup.py install --user  # omit --user to install for all users

Quick start

Detect the character encoding for a file or URL:

chared -m czech http://nlp.fi.muni.cz/cs/nlplab

Create a custom character encoding detection model from a collection of HTML pages (e.g. for Swahili):

chared-learn -o swahili.edm swahili_pages/*.html

Alternatively, if you have a sample text in Swahili (plain text, UTF-8), pass it with -S to apply language filtering to the input HTML files (recommended):

chared-learn -o swahili.edm -S swahili_sample.txt swahili_pages/*.html

For usage information see:

chared --help
chared-learn --help

Python API

>>> from urllib.request import urlopen
>>> import chared.detector
>>> page = urlopen('https://nlp.fi.muni.cz/').read()
>>> cz_model_path = chared.detector.get_model_path('czech')
>>> cz_model = chared.detector.EncodingDetector.load(cz_model_path)
>>> cz_model.classify(page)
['utf_8']
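
The result of classify() can then be used to decode the page. The following is a minimal sketch, not part of chared: the helper names decode_with_detector and load_detector are hypothetical, and it assumes that classify() returns a list of Python codec names (as in the session above) and that an empty list means no encoding was detected.

```python
import codecs


def decode_with_detector(data, detector, fallback="utf_8"):
    """Decode bytes using the first encoding suggested by detector.classify()."""
    candidates = detector.classify(data)  # e.g. ['utf_8'], as in the session above
    # Assumption: an empty list means that no encoding was detected.
    encoding = candidates[0] if candidates else fallback
    try:
        codecs.lookup(encoding)  # make sure Python knows this codec name
    except LookupError:
        encoding = fallback
    return data.decode(encoding, errors="replace"), encoding


def load_detector(language):
    """Load a bundled model by language name, using the calls shown above."""
    # Imported lazily so that decode_with_detector works without chared installed.
    import chared.detector
    model_path = chared.detector.get_model_path(language)
    return chared.detector.EncodingDetector.load(model_path)
```

With chared installed, load_detector('czech') returns a detector that can be passed to decode_with_detector together with the raw bytes of a page; the function returns the decoded text and the encoding that was actually used.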

Acknowledgements

This software has been developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd.

See also

"Unicode over 60 percent of the web" on the Google blog

Last modified on 11/16/23 17:18:21