= Chared =

Chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms that operate with no language constraints.

{{{
#!html
Paper | Cite | Licence
}}}

== Installation ==

1. Make sure you have Python 3.6 or later and the lxml library, version 4.1 or later, installed.
2. Download the sources:
{{{
wget http://corpus.tools/raw-attachment/wiki/Downloads/chared-2.0.tar.gz
}}}
3. Extract the downloaded file:
{{{
tar xzvf chared-2.0.tar.gz
cd chared-2.0/
}}}
4. Install the package (to install for all users, omit --user; you may need sudo or a root shell):
{{{
python3 setup.py install --user
}}}

== Quick start ==

Detect the character encoding of a file or URL:
{{{
chared -m czech http://nlp.fi.muni.cz/cs/nlplab
}}}

Create a custom character encoding detection model from a collection of HTML pages (e.g. for Swahili):
{{{
chared-learn -o swahili.edm swahili_pages/*.html
}}}
... 
or, if you have a sample text in Swahili (plain text, UTF-8) and want to apply language filtering to the input HTML files (recommended):
{{{
chared-learn -o swahili.edm -S swahili_sample.txt swahili_pages/*.html
}}}

For usage information, see:
{{{
chared --help
chared-learn --help
}}}

== Python API ==

{{{
>>> from urllib.request import urlopen
>>> import chared.detector
>>> page = urlopen('https://nlp.fi.muni.cz/').read()
>>> cz_model_path = chared.detector.get_model_path('czech')
>>> cz_model = chared.detector.EncodingDetector.load(cz_model_path)
>>> cz_model.classify(page)
['utf_8']
}}}

== Acknowledgements ==

This software has been developed at the [http://nlp.fi.muni.cz/en/nlpc Natural Language Processing Centre] of [http://www.muni.cz/ Masaryk University in Brno] with financial support from [http://presemt.eu PRESEMT] and [http://www.sketchengine.co.uk Lexical Computing Ltd.]

== See also ==

[http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html Unicode over 60 percent of the web] at the Google blog
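As shown in the Python API example above, the classifier returns a list of Python codec names rather than decoded text. A minimal sketch of turning such a result into a decoded string follows; the `candidates` list is hard-coded here to the example output above (it is not produced by chared in this snippet), and the sample bytes are an illustrative stand-in for a downloaded page:

```python
import codecs

# Hypothetical classifier output: a list of Python codec names.
candidates = ['utf_8']

# Stand-in for raw page bytes fetched over HTTP.
page = 'Přirozený jazyk'.encode('utf_8')

# Try the candidate encodings in order until one decodes cleanly.
text = None
for enc in candidates:
    try:
        # codecs.lookup() normalises the codec name and raises
        # LookupError for names Python does not recognise.
        text = page.decode(codecs.lookup(enc).name)
        break
    except (UnicodeDecodeError, LookupError):
        continue

print(text)
```

Trying candidates in order and catching `UnicodeDecodeError` keeps the loop robust if the top-ranked encoding turns out not to fit the actual bytes.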