Request for advice on character set conversions (those damn Excel files, again ...)
On Mon, 08 Sep 2008 01:45:51 +0200, Peter Dalgaard wrote:
Emmanuel Charpentier wrote:
Dear list,
[ Snip ... ]
This looks reasonably sane, I think. The last loop could be d[] <- lapply(d, conv1, from, to), but I think that is cosmetic. You can't really do much better because there is no simple way of distinguishing between the various 8-bit character sets.
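For readers without the snipped code, here is a minimal sketch of what that one-liner does. The body of conv1() below is a hypothetical stand-in (the original was snipped), and the sample data frame and the latin1-to-UTF-8 direction are assumptions made only for illustration.

```r
# Hypothetical stand-in for the snipped conv1(): re-encode a single column.
# Character vectors and factor levels are converted; anything else is
# returned untouched.
conv1 <- function(x, from, to) {
  if (is.character(x)) {
    iconv(x, from = from, to = to)
  } else if (is.factor(x)) {
    levels(x) <- iconv(levels(x), from = from, to = to)
    x
  } else {
    x
  }
}

# Illustrative data: the strings are written as raw latin1 bytes.
d <- data.frame(n = 1:2,
                s = c("caf\xe9", "na\xefve"),
                stringsAsFactors = FALSE)

from <- "latin1"   # assumed source encoding
to   <- "UTF-8"    # assumed target encoding

# d[] keeps d a data frame while replacing every column in place.
d[] <- lapply(d, conv1, from, to)
```

Assigning into d[] rather than d is what keeps the result a data frame instead of a plain list.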
Thank you, Peter! Could you point me to some not-so-simple (or even doubleplusunsimple) ways? I run into this problem fairly often, and I'd like to pull this shard outta my poor tired foot once and for all... and I suppose that I am not alone in this predicament.
You could presumably set up some heuristics, like the fact that the occurrence of 0x82 or 0x8a probably indicates cp437, but it gets tricky. (At least, in French, you don't have the Danish/Norwegian peculiarity that upper/lowercase o-slash were missing in cp437, and therefore often replaced the yen and cent symbols in matrix printer ROMs. We still get the occasional parcel addressed to "¥ster Farimagsgade".)
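A rough sketch of such a heuristic, working on the raw bytes of one string. The function name guess_8bit() is hypothetical, and the fallback to latin1 is an assumption, not something the bytes can prove. The only rule taken from the message above is the 0x82/0x8a test: in cp437 those bytes are e-acute and e-grave, whereas in latin1 they are rarely used control codes.

```r
# Hypothetical helper: guess the encoding of a single string from its bytes.
# This is a heuristic only; 8-bit character sets cannot be told apart reliably.
guess_8bit <- function(x) {
  b  <- as.integer(charToRaw(x))
  hi <- b[b >= 0x80]                      # bytes outside 7-bit ASCII

  if (length(hi) == 0L) return("ASCII")   # nothing to disambiguate
  if (validUTF8(x))     return("UTF-8")   # well-formed multibyte sequences

  # 0x82 / 0x8a: e-acute / e-grave in cp437, control codes in latin1.
  if (any(hi %in% c(0x82, 0x8a))) return("CP437")

  "latin1"                                # assumed default, not a detection
}

guess_8bit("caf\x82")   # bytes as a cp437 system would write "cafe" with e-acute
guess_8bit("caf\xe9")   # the same word in latin1
```

The guess can then be fed to iconv(x, from = guess, to = "UTF-8"). Whether the name "CP437" is accepted depends on the platform's iconv implementation; iconvlist() shows what is available.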
Peter, you're gravely underestimating the ingenuity of some Excel l^Husers... (and your story is a possible candidate for a fortune() entry...).
Emmanuel Charpentier