
read.csv

6 messages · Gabor Grothendieck, (Ted Harding), Petr Savicky

#
If read.csv's colClasses= argument is NOT used then read.csv accepts
double quoted numerics:

> read.csv(stdin())
0: A,B
1: "1",1
2: "2",2
3:
  A B
1 1 1
2 2 2

However, if colClasses is used then it seems that it does not:
> read.csv(stdin(), colClasses = "numeric")
0: A,B
1: "1",1
2: "2",2
3:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'a real', got '"1"'

Is this really intended?  I would have expected a csv file in which
each field is surrounded by double quotes to be acceptable in both
cases.  This may be documented as is, yet it seems undesirable from
both a consistency viewpoint and the viewpoint that it should be
possible to double-quote fields in a csv file.
#
On 14-Jun-09 18:56:01, Gabor Grothendieck wrote:
Well, the default for colClasses is NA, for which ?read.csv says:
  [...]
  Possible values are 'NA' (when 'type.convert' is used),
  [...]
and then ?type.convert says:
  This is principally a helper function for 'read.table'. Given a
  character vector, it attempts to convert it to logical, integer,
  numeric or complex, and failing that converts it to factor unless
  'as.is = TRUE'.  The first type that can accept all the non-missing
  values is chosen.

It would seem that type 'logical' won't accept integer (naively one
might expect 1 --> TRUE, but see experiment below), so the first
acceptable type for "1" is integer, and that is what happens.
So it is indeed documented (in the R[ecursive] sense of "documented" :))
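The first-acceptable-type rule can be illustrated directly (a sketch;
as.is = TRUE merely keeps strings as character instead of factor):

```r
# "TRUE"/"FALSE" are valid logicals, so logical is chosen
type.convert(c("TRUE", "FALSE"), as.is = TRUE)   # TRUE FALSE

# "1","2" fail as logical, so the next type, integer, is chosen
type.convert(c("1", "2"), as.is = TRUE)          # 1 2 (integer)

# a value that fits no earlier type falls through to character
type.convert(c("1", "x"), as.is = TRUE)          # "1" "x"
```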

However, presumably when colClasses is used, type.convert() is not
called; in that case R sees itself being asked to assign a character
entity to a destination which it has been told shall be integer.
Nor would the as.is mechanism come to the rescue: its default is
  as.is = !stringsAsFactors
but ?read.csv says that stringsAsFactors "is overridden
bu [sic] 'as.is' and 'colClasses', both of which allow finer
control."

Experiment:
  X <- logical(10)
  class(X)
  # [1] "logical"
  X[1] <- 1
  X
  # [1] 1 0 0 0 0 0 0 0 0 0
  class(X)
  # [1] "numeric"
so R has converted X from class 'logical' to class 'numeric'
on being asked to assign a number to a logical; but in this
case its hands were not tied by colClasses.

Or am I missing something?!!

Ted.



--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 14-Jun-09                                       Time: 21:21:22
------------------------------ XFMail ------------------------------
#
On Sun, Jun 14, 2009 at 4:21 PM, Ted Harding <Ted.Harding at manchester.ac.uk> wrote:
The point of this is that the current behavior is not desirable: you can't
have quoted numeric fields if you specify colClasses = "numeric", yet you
can if you don't.  The concepts are not orthogonal but should be.  Whether
or not you specify colClasses, numeric fields ought to be treated the same
way, and if the documentation says otherwise that only means there is a
problem with the design.

One could define one's own type quotedNumeric as a workaround
(see below), but I think it would be better if specifying "numeric" and
not specifying it had the same effect.  The way it is now, the concepts
are intertwined and not orthogonal.

library(methods)
setClass("quotedNumeric")
setAs("character", "quotedNumeric",
  function(from) as.numeric(gsub("\"", "", from)))
Lines <- 'A,B
"1",1
"2",2'
read.csv(textConnection(Lines), colClasses = c("quotedNumeric", "numeric"))
1 day later
#
On Sun, Jun 14, 2009 at 09:21:24PM +0100, Ted Harding wrote:
In my opinion, you explain how it happens that there is a difference
in behavior between
  read.csv(stdin(), colClasses = "numeric")
and
  read.csv(stdin())
but not why it is so.

The algorithm "use the smallest type which accepts all non-missing values"
may equally well be applied to the input values either literally or after
removing the quotes. Is there a reason why
  read.csv(stdin())
removes quotes from the input values and
  read.csv(stdin(), colClasses = "numeric")
does not?

Using double-quote characters is part of the definition of a CSV file;
see, for example,
  http://en.wikipedia.org/wiki/Comma_separated_values
where one may find:
  "Fields may always be enclosed within double-quote characters, whether necessary or not."
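For example, a file with every field quoted reads fine as long as
colClasses is not given (a sketch using textConnection()):

```r
Lines <- '"A","B"
"1","x"
"2","y"'

d <- read.csv(textConnection(Lines))
d$A   # 1 2 -- quoting alone does not prevent numeric conversion
```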

Petr.
8 days later
#
On Sun, Jun 14, 2009 at 02:56:01PM -0400, Gabor Grothendieck wrote:
The problem is not specific to read.csv(). The same difference appears
for read.table().
  read.table(stdin())
  "1" 1
  2 "2"
  
  #   V1 V2
  # 1  1  1
  # 2  2  2
but
  read.table(stdin(), colClasses = "numeric")
  "1" 1
  2 "2"
  
  Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '"1"'

The error occurs in the call of scan() at line 152 in src/library/utils/R/readtable.R,
which is
  data <- scan(file = file, what = what, sep = sep, quote = quote, ...
(This is the third call of scan() in the source code of read.table().)

In this call, scan() gets the types of the columns in the "what" argument.
If the type is specified, scan() performs the conversion itself and fails
if a numeric field is quoted. If the type is not specified, the output of
scan() is of type character, but with quotes eliminated if there are any
in the input file. Columns of unknown type are then converted using
type.convert(), which receives the data already without quotes.
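This difference can be isolated from read.table() by calling scan()
directly (a minimal sketch; the column types go in the what argument):

```r
txt <- '"1" 1
"2" 2'

# Type specified: scan() converts while reading and fails on the quotes
res <- try(scan(textConnection(txt), what = list(numeric(0), numeric(0))),
           silent = TRUE)

# Type left as character: quotes are stripped on input,
# so type.convert() can finish the job afterwards
d <- scan(textConnection(txt), what = list(character(0), character(0)))
d[[1]]                              # "1" "2" -- no quotes left
type.convert(d[[1]], as.is = TRUE)  # 1 2 (integer)
```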

The call of type.convert() is contained in a loop
    for (i in (1L:cols)[do]) {
        data[[i]] <-
            if (is.na(colClasses[i]))
                type.convert(data[[i]], as.is = as.is[i], dec = dec,
                             na.strings = character(0L))
        ## as na.strings have already been converted to <NA>
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
    }
which also contains lines that could perform the conversion for columns
with a specified type, but these lines are never reached, since the
vector "do" is defined as
  do <- keep & !known
where "known" determines for which columns the type is known.
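The two-step path that this loop would provide can be imitated at user
level; the following is only a sketch of the idea (read everything as
character, then convert with methods::as() as the loop's final branch
does), not the internal code:

```r
Lines <- '"1" 1
"2" 2'

# Step 1: scan()/read.table() read as character, stripping the quotes
raw <- read.table(textConnection(Lines), colClasses = "character")

# Step 2: convert each column to its intended class afterwards
wanted <- c("numeric", "numeric")
out <- as.data.frame(Map(function(col, cls) methods::as(col, cls),
                         raw, wanted))
sapply(out, class)   # "numeric" "numeric"
```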

It is possible to modify the code so that scan() is called with all types
unspecified, leaving the conversion to the lines
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
above. Since this solution is already prepared in the code, the patch is
very simple:
  --- R-devel/src/library/utils/R/readtable.R     2009-05-18 17:53:08.000000000 +0200
  +++ R-devel-readtable/src/library/utils/R/readtable.R   2009-06-25 10:20:06.000000000 +0200
  @@ -143,9 +143,6 @@
       names(what) <- col.names
   
       colClasses[colClasses %in% c("real", "double")] <- "numeric"
  -    known <- colClasses %in%
  -                c("logical", "integer", "numeric", "complex", "character")
  -    what[known] <- sapply(colClasses[known], do.call, list(0))
       what[colClasses %in% "NULL"] <- list(NULL)
       keep <- !sapply(what, is.null)
   
  @@ -189,7 +186,7 @@
          stop(gettextf("'as.is' has the wrong length %d  != cols = %d",
                        length(as.is), cols), domain = NA)
   
  -    do <- keep & !known # & !as.is
  +    do <- keep & !as.is
       if(rlabp) do[1L] <- FALSE # don't convert "row.names"
       for (i in (1L:cols)[do]) {
           data[[i]] <-
(Also in attachment)

I did a test as follows
  d1 <- read.table(stdin())
  "1" TRUE   3.5
  2   NA     "0.1"
  NA  FALSE  0.1
  3   "TRUE" NA

  sapply(d1, typeof)
  #        V1        V2        V3 
  # "integer" "logical"  "double" 
  is.na(d1)
  #         V1    V2    V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE
  
  d2 <- read.table(stdin(), colClasses=c("integer", "logical", "double"))
  "1" TRUE   3.5
  2   NA     "0.1"
  NA  FALSE  0.1
  3   "TRUE" NA

  sapply(d2, typeof)
  #        V1        V2        V3 
  # "integer" "logical"  "double" 
  is.na(d2)
  #         V1    V2    V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE

I think there was a reason to let scan() perform the type conversion;
for example, it may be more efficient. So, if correct, the above patch is
one possible solution, but another may be more appropriate. In particular,
scan() could be modified to remove quotes also from fields specified as
numeric.

Petr.
#
I am sorry for not including the attachment mentioned in my
previous email. Attached now. Petr.
-------------- next part --------------