Dear list,
I have to read a not-so-small bunch of not-so-small Excel files, which
seem to have traversed Window 3.1, Windows95 and Windows NT versions of
the thing (with maybe a Mac or two thrown in for good measure...).
The problem is that 1) I need to read strings, and 2) those
strings may have various encodings. In the same sheet of the same file,
some cells may be latin1, some UTF-8 and some CP437 (!).
read.xls() alows me to read those things in sets of dataframes. my
problem is to convert the encodings to UTF8 without cloberring those who
are already (looking like) UTF8.
I came to the following solution :
foo<-function(d, from="latin1",to="UTF-8"){
# Semi-smart conversion of a dataframe between charsets.
# Needed to ease use of those [@!] Excel files
# that have survived the Win3.1 --> Win95 --> NT transition,
# usually in poor shape..
conv1<-function(v,from,to) {
condconv<-function(v,from,to) {
cnv<-is.na(iconv(v,to,to))
v[cnv]<-iconv(v[cnv],from,to)
return(v)
}
if (is.factor(v)) {
l<-condconv(levels(v),from,to)
levels(v)<-l
return(v)
}
else if (is.character(v)) return(condconv(v,from,to))
else return(v)
}
for(i in names(d)) d[,i]<-conv1(d[,i],from,to)
return(d)
}
Any advice for enhancement is welcome...
Sincerely yours,
Emmanuel Charpentier
Request for advice on character set conversions (those damn Excel files, again ...)
4 messages · Emmanuel Charpentier, Peter Dalgaard
Emmanuel Charpentier wrote:
Dear list,
I have to read a not-so-small bunch of not-so-small Excel files, which
seem to have traversed Window 3.1, Windows95 and Windows NT versions of
the thing (with maybe a Mac or two thrown in for good measure...).
The problem is that 1) I need to read strings, and 2) those
strings may have various encodings. In the same sheet of the same file,
some cells may be latin1, some UTF-8 and some CP437 (!).
read.xls() alows me to read those things in sets of dataframes. my
problem is to convert the encodings to UTF8 without cloberring those who
are already (looking like) UTF8.
I came to the following solution :
foo<-function(d, from="latin1",to="UTF-8"){
# Semi-smart conversion of a dataframe between charsets.
# Needed to ease use of those [@!] Excel files
# that have survived the Win3.1 --> Win95 --> NT transition,
# usually in poor shape..
conv1<-function(v,from,to) {
condconv<-function(v,from,to) {
cnv<-is.na(iconv(v,to,to))
v[cnv]<-iconv(v[cnv],from,to)
return(v)
}
if (is.factor(v)) {
l<-condconv(levels(v),from,to)
levels(v)<-l
return(v)
}
else if (is.character(v)) return(condconv(v,from,to))
else return(v)
}
for(i in names(d)) d[,i]<-conv1(d[,i],from,to)
return(d)
}
Any advice for enhancement is welcome...
This looks reasonably sane, I think. The last loop could be d[] <- lapply(d, conv1, from, to), but I think that is cosmetic. You can't really do much better because there is no simple way of distinguishing between the various 8-bit character sets. You could presumably setup some heuristics. like the fact that the occurrence of 0x82 or 0x8a probably indicates cp437, but it gets tricky. (At least, in French, you don't have the Danish/Norwegian peculiarity that upper/lowercase o-slash were missing in cp437, and therefore often replaced yen and cent symbols in matrix printer ROMs. We still get the occational parcel addressed to "?ster Farimagsgade".)
O__ ---- Peter Dalgaard ?ster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
On Mon, 08 Sep 2008 01:45:51 +0200, Peter Dalgaard wrote?:
Emmanuel Charpentier wrote:
Dear list,
[ Snip ... ]
This looks reasonably sane, I think. The last loop could be d[] <- lapply(d, conv1, from, to), but I think that is cosmetic. You can't really do much better because there is no simple way of distinguishing between the various 8-bit character sets.
Thank you Peter ! Could you point me to some not-so-simple (or even doubleplusunsimple) ways ? I get the problem not so rarely, and I'd like to pull this chard outta my poor tired foot one and for all... and I suppose that I am not alone in this predicament.
You could presumably setup some heuristics. like the fact that the occurrence of 0x82 or 0x8a probably indicates cp437, but it gets tricky. (At least, in French, you don't have the Danish/Norwegian peculiarity that upper/lowercase o-slash were missing in cp437, and therefore often replaced yen and cent symbols in matrix printer ROMs. We still get the occational parcel addressed to "?ster Farimagsgade".)
Peter, you're gravely underestimating the ingenuity of some Excel l^Husers... (and your story is a possible candidate for a fortune() entry...). Emmanuel Charpentier
Emmanuel Charpentier wrote:
On Mon, 08 Sep 2008 01:45:51 +0200, Peter Dalgaard wrote :
Emmanuel Charpentier wrote:
Dear list,
[ Snip ... ]
This looks reasonably sane, I think. The last loop could be d[] <-
lapply(d, conv1, from, to), but I think that is cosmetic. You can't
really do much better because there is no simple way of distinguishing
between the various 8-bit character sets.
Thank you Peter ! Could you point me to some not-so-simple (or even doubleplusunsimple) ways ? I get the problem not so rarely, and I'd like to pull this chard outta my poor tired foot one and for all... and I suppose that I am not alone in this predicament.
In full generality it is impossible, but you might get something if you make certain assumptions. If you can convert from UTF8, then it probably is UTF8 (or ASCII but in either case, you're done). Otherwise it is a single-byte 8-bit encoding if the language can be assumed to be French. If it uses characters between 0x80 and 0x9f, then it is not latin1 but rather cp437, 850, or 1252. The tricky bit is that although the presence of say 0x82 suggests that it is not cp1252, but rather 437 or 850 (e aigu) , it just might be 1252 after all (single low quote). Some sort of naive Bayes classifier might work.
You could presumably setup
some heuristics. like the fact that the occurrence of 0x82 or 0x8a
probably indicates cp437, but it gets tricky. (At least, in French, you
don't have the Danish/Norwegian peculiarity that upper/lowercase o-slash
were missing in cp437, and therefore often replaced yen and cent symbols
in matrix printer ROMs. We still get the occational parcel addressed to
"?ster Farimagsgade".)
Peter, you're gravely underestimating the ingenuity of some Excel l^Husers... (and your story is a possible candidate for a fortune() entry...)
(If so, please respell "occasional". Ouch! And actually, the cent/yen thing is also a difference between cp437 and cp850, so the story may be a bit too colourful.)
O__ ---- Peter Dalgaard ?ster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907