Last week, I has been fighting against the hordes of the character encoding. My new task in tha job-tasks-pool is develop a friendly web app to manage configuration files of cluster apps. Ohh, great idea!, why didn’t I realize it before? (irony) . Only I need 4 things: one parser for each conffile syntax, a stable and secure way to get/save remote files, work with non-controlled kind of differents char encodings, … and a fancy GUI. As you can intuit, the task is not just a bit app, but this post only refer to a very useful python lib ( chardet discovered a few days ago. This lib can be auto-detect the encoding of a file very reliable. I suggest you that visit chardet homesite to see some clear examples.
Using it in a couple of lines of code:
import io
import chardets=io.open(‘channels_info’, ‘r’, errors=’replace’)
r=s.read()# For example:
# fileencoding=”iso-8859-15″
fe = chardet.detect(r)[‘encoding’]
fl = fl.decode(fe)
print fl