Request for advice on character set conversions (those damn Excel files, again ...)

Emmanuel Charpentier wrote:
In full generality it is impossible, but you might get somewhere if you 
make certain assumptions. If the file converts cleanly from UTF-8, then 
it probably is UTF-8 (or ASCII, but in either case you're done). 
Otherwise, if the language can be assumed to be French, it is a 
single-byte 8-bit encoding. If it uses bytes between 0x80 and 0x9f, then 
it is not latin1 (which has only control codes in that range) but rather 
cp437, cp850, or cp1252. The tricky bit is that although the presence of, 
say, 0x82 suggests cp437 or cp850 (e acute) rather than cp1252, it just 
might be cp1252 after all (single low quote). 
Some sort of naive Bayes classifier might work.
(If so, please respell "occasional". Ouch! And actually, the cent/yen 
thing is also a difference between cp437 and cp850, so the story may be 
a bit too colourful.)
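
The decision steps above (try UTF-8 first, then look for bytes in the 0x80 to 0x9f range) could be sketched in Python roughly as follows. This is only an illustration under the same assumptions as the text (French-language content, single-byte fallback); the function name `guess_encoding` is made up for the example, and falling back to latin1 when no such bytes are found is an extra assumption, not something the heuristic guarantees. It narrows the candidates but does not attempt the final choice between cp437, cp850 and cp1252.

```python
def guess_encoding(data: bytes) -> list[str]:
    """Return candidate encodings for `data` (hypothetical helper)."""
    try:
        # Non-UTF-8 text with accented characters rarely decodes cleanly,
        # so success here is strong evidence for UTF-8 (or plain ASCII).
        data.decode("utf-8")
        return ["utf-8"]
    except UnicodeDecodeError:
        pass

    # In latin1, 0x80-0x9f are control codes, so ordinary text that uses
    # them is more likely a DOS or Windows code page.
    if any(0x80 <= b <= 0x9F for b in data):
        # Still ambiguous: 0x82 is e acute in cp437/cp850 but a single
        # low quote in cp1252. Picking one needs a statistical model.
        return ["cp437", "cp850", "cp1252"]

    # Assumption: with no bytes in that range, treat the data as latin1.
    return ["latin-1"]
```

For instance, `guess_encoding("café".encode("cp850"))` returns the three code-page candidates, because e acute is stored as the byte 0x82 there, while the same word encoded as latin1 uses 0xe9 and falls through to the last branch.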