R 1.2.1 - read.table - factors problem or is it a data.frame problem
Brian Ripley notes:
On Fri, 2 Feb 2001, Martin Maechler wrote:
"PD" == Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk> writes:
PD> "Heberto Ghezzo" <Heberto at meakins.lan.mcgill.ca> writes:
>> I have some problems with read.table and floats turning up as
>> factors. In my case it was not a blank in the file but a unary
>> minus!! so 3.24,-57.23,... the 3.24 is numeric but -57.23 is a
>> factor. Yes I turned it into a numeric with
>> as.numeric(as.character(.. but I think it will be better to modify
>> somehow the read.table/read.csv code.
>> Thanks anyway.
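The workaround mentioned above can be sketched as follows. This is only an illustration with made-up values (the vector `f` is an assumption, not Heberto's data): calling as.numeric() directly on a factor returns the internal level codes, so the values have to go through as.character() first.

```r
# Hypothetical column that read.table turned into a factor
f <- factor(c("3.24", "-57.23", "10"))

# as.numeric() on a factor gives the integer level codes, not the data
codes <- as.numeric(f)

# Converting to character first recovers the numbers that were read
values <- as.numeric(as.character(f))
print(values)
```

The order of the level codes depends on how the locale sorts the strings, so only the second conversion is reliable.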
PD> That certainly sounds like a bug, but I can't reproduce it:
PD> $ cat > xx
PD> -1,2,3
PD> 1,-2,3
PD> $ R
PD> ...
PD> > summary(read.csv('xx',head=F))
PD>        V1             V2           V3
PD>  Min.   :-1.0   Min.   :-2   Min.   :3
PD>  1st Qu.:-0.5   1st Qu.:-1   1st Qu.:3
PD>  Median : 0.0   Median : 0   Median :3
PD>  Mean   : 0.0   Mean   : 0   Mean   :3
PD>  3rd Qu.: 0.5   3rd Qu.: 1   3rd Qu.:3
PD>  Max.   : 1.0   Max.   : 2   Max.   :3
PD> Could you give us some further details on the setup that is causing
PD> that effect?
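Peter's check can be repeated without a scratch file. The sketch below assumes a version of R in which read.csv() accepts a `text=` argument; the data are the two lines from his example, and the variable names are illustrative.

```r
# The same two records as in the transcript above, held in a string
csv_text <- "-1,2,3\n1,-2,3\n"

# header = FALSE because the data have no column names
d <- read.csv(text = csv_text, header = FALSE)

# Every column should come back numeric, not factor
col_classes <- sapply(d, class)
print(col_classes)
```

If a column shows up as "factor" or "character" here, at least one field in it did not parse as a number.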
Heberto uses a Windoze mailer, hence probably ..
It could be that the problem comes from the fact that some win users
use non-ASCII minus characters (i.e. not the ASCII "minus", but ones they find
on their keyboards when typing in the data ..):
In iso_8859-1 aka "latin-1" (of which most European MSWin localizations
are said to be a superset) there are three kinds of "-" :
Oct   Dec   Hex   Char   Description
--------------------------------------------------------------------
055    45   2D    -      Minus [The standard ASCII one]
255   173   AD    ­      SOFT HYPHEN
257   175   AF    ¯      MACRON
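One way to test this hypothesis on a suspect field is to look at its bytes rather than at how it prints. The sketch below is an assumption-laden illustration: the helper names `byte_values` and `has_non_ascii` are made up, and the "bad" field is constructed by hand from byte 173 (0xAD) from the table.

```r
# Decimal codes of the three characters listed in the table
candidates <- c(minus_ascii = 45L, soft_hyphen = 173L, macron = 175L)

# Return the decimal byte values of a string
byte_values <- function(s) as.integer(charToRaw(s))

# TRUE when a field contains any byte outside the 7-bit ASCII range
has_non_ascii <- function(s) any(byte_values(s) > 127L)

good <- "-57.23"
# Hypothetical bad field: byte 0xAD followed by the bytes of "57.23"
bad <- rawToChar(as.raw(c(candidates[["soft_hyphen"]], byte_values("57.23"))))

print(has_non_ascii(good))
print(has_non_ascii(bad))
print(byte_values(bad)[1])
```

Applied to the lines returned by readLines() on the data file, the same check would point at the records worth inspecting.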
Actually, not as far as I can find out (and I have been working on encodings for the next releases of R). The first really is hyphen in both latin-1 and WinAnsi (the main Windows char set: the other, WinOEM, is not a superset of latin-1). Minus is not in the WinAnsi char set, but it does have hyphen at 45 and 173 (it has two spaces too). Unfortunately Adobe's ISOLatin1 encoding for postscript is not the same as latin-1. That does have minus at 45 and (real) hyphen at 173.

As Windows NT/2000 machines support Unicode, on those the set of possible inputs is much wider and I don't think R will cope with Unicode-encoded files. In Unicode the minus sign is at U+2212 (and hyphen-minus at 45).

It's a possible explanation, but then I don't think as.numeric(as.character(..)) would work. My guess was that there was some other non-printing character in that field, but that has the same counter-argument.
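The counter-argument can be checked directly: if the sign were not the ASCII character 45, the as.numeric(as.character(..)) workaround would return NA rather than a number. The sketch below assumes, purely for illustration, that the odd character is the Unicode minus sign U+2212; the helper `not_numeric` is a made-up name.

```r
ascii_minus   <- "-57.23"
unicode_minus <- "\u221257.23"   # assumed: U+2212 MINUS SIGN, then 57.23

a <- as.numeric(ascii_minus)                       # parses normally
b <- suppressWarnings(as.numeric(unicode_minus))   # does not parse: NA

# List the non-missing entries of a character vector that fail to parse
not_numeric <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  x[is.na(parsed) & !is.na(x)]
}

fields <- c("3.24", ascii_minus, unicode_minus)
bad_fields <- not_numeric(fields)
print(length(bad_fields))
```

Running `not_numeric()` on as.character() of the offending column would list the fields that forced it to become a factor.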
I had sought help a few days earlier for a problem with some similarities. In my case I had failed to recognize the existence of some NA's. I had a data set which originated in 1966. Some IBM statistical packages of the era encoded NA's as binary negative zeros. These were propagated in passes through the SAS first edition. I can't remember how they were then encoded in EBCDIC by different FORTRAN compilers, nor ultimately in ASCII conversions. However, they relied on program filters and were otherwise invisible.

Gordon M. Harrington             Mail:  3720 Village Place, #6308
Professor Emeritus                      Waterloo, IA 50702-5848
University of Northern Iowa      Phone: 319-291-8535
gordon.harrington at uni.edu     Fax:   319-291-8491
dryfly at aya.yale.edu                  319-291-8324
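For legacy files like the one described above, unusual missing-value codes can be declared when the data are read. The sketch below is hypothetical throughout: it assumes the negative zeros survived conversion as the text "-0", which the message does not confirm, and the column names are invented.

```r
# Invented example data: the second score uses the assumed "-0" code
txt <- "id,score\n1,3.5\n2,-0\n3,0\n"

# Treat both the default "NA" and the assumed legacy code "-0" as missing
d <- read.csv(text = txt, na.strings = c("NA", "-0"))

print(is.na(d$score))
```

An ordinary zero in the same column is left alone, because na.strings is matched against the field text before conversion.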