
read.csv

On Sun, Jun 14, 2009 at 02:56:01PM -0400, Gabor Grothendieck wrote:
The problem is not specific to read.csv(). The same difference appears
for read.table().
  read.table(stdin())
  "1" 1
  2 "2"
  
  #   V1 V2
  # 1  1  1
  # 2  2  2
but
  read.table(stdin(), colClasses = "numeric")
  "1" 1
  2 "2"
  
  Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '"1"'

The error occurs in the call of scan() at line 152 in src/library/utils/R/readtable.R,
which is
  data <- scan(file = file, what = what, sep = sep, quote = quote, ...
(This is the third call of scan() in the source code of read.table().)

In this call, scan() receives the column types in the "what" argument. If a
type is specified, scan() performs the conversion itself and fails if a numeric
field is quoted. If the type is not specified, scan() returns the column as
character, with any quotes from the input file already removed. Columns of
unknown type are then converted by type.convert(), which receives the data
without quotes.
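
The two paths described above can be tried directly, outside read.table().
This is only a small sketch: it assumes an R version in which scan() accepts
a "text" argument, and the variable names are made up for illustration. The
typed call is wrapped in tryCatch() because its behaviour on quoted numeric
fields is exactly what is under discussion here and may depend on the R
version.

```r
# Same two input lines as in the read.table() example above.
input <- c('"1" 1', '2 "2"')

# Untyped path: what = "" makes scan() return character fields,
# with the surrounding quotes already stripped.
fields <- scan(text = input, what = "", quiet = TRUE)

# type.convert() then sees plain digit strings and picks a type.
converted <- type.convert(fields, as.is = TRUE)

# Typed path: what = 0 asks scan() to convert to double itself.
# Any error is captured as a message instead of stopping the script.
typed <- tryCatch(scan(text = input, what = 0, quiet = TRUE),
                  error = function(e) conditionMessage(e))

print(fields)
print(converted)
print(typed)
```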

The call of type.convert() is inside the loop
    for (i in (1L:cols)[do]) {
        data[[i]] <-
            if (is.na(colClasses[i]))
                type.convert(data[[i]], as.is = as.is[i], dec = dec,
                             na.strings = character(0L))
        ## as na.strings have already been converted to <NA>
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
    }
This loop also contains branches that could perform the conversion for columns
with a specified type, but for the basic types they are never reached, since
the vector "do" is defined as
  do <- keep & !known 
where "known" marks the columns whose type is known, that is, those whose
colClasses entry is "logical", "integer", "numeric", "complex" or "character".
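
As a small illustration of how this mask behaves, here is a sketch with a
made-up colClasses vector. The definition of "known" is copied from
readtable.R; "keep" is simplified here (in the source it is derived from
"what"), so treat that line as an assumption.

```r
# Hypothetical colClasses for four columns.
colClasses <- c(NA, "numeric", "NULL", "Date")

# As in readtable.R: basic types that scan() converts itself.
known <- colClasses %in%
    c("logical", "integer", "numeric", "complex", "character")

# Simplified stand-in for keep <- !sapply(what, is.null):
# columns with class "NULL" are dropped.
keep <- !(colClasses %in% "NULL")

# Columns that reach the conversion loop.
do <- keep & !known
print(do)
```

Only the first column (unknown type) and the fourth ("Date") go through the
loop; the "numeric" column is left to scan().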

It is possible to modify the code so that scan() is called with all types
unspecified, leaving the conversion to the lines
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
above. Since this path already exists in the code, the patch is very simple:
  --- R-devel/src/library/utils/R/readtable.R     2009-05-18 17:53:08.000000000 +0200
  +++ R-devel-readtable/src/library/utils/R/readtable.R   2009-06-25 10:20:06.000000000 +0200
  @@ -143,9 +143,6 @@
       names(what) <- col.names
   
       colClasses[colClasses %in% c("real", "double")] <- "numeric"
  -    known <- colClasses %in%
  -                c("logical", "integer", "numeric", "complex", "character")
  -    what[known] <- sapply(colClasses[known], do.call, list(0))
       what[colClasses %in% "NULL"] <- list(NULL)
       keep <- !sapply(what, is.null)
   
  @@ -189,7 +186,7 @@
          stop(gettextf("'as.is' has the wrong length %d  != cols = %d",
                        length(as.is), cols), domain = NA)
   
  -    do <- keep & !known # & !as.is
  +    do <- keep & !as.is
       if(rlabp) do[1L] <- FALSE # don't convert "row.names"
       for (i in (1L:cols)[do]) {
           data[[i]] <-
(Also in attachment)
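
The idea behind the patch can also be emulated at user level, without
modifying R. The following is a rough sketch, not the patched read.table():
it assumes an R version in which read.table() accepts a "text" argument,
reads every column as character so that scan() strips the quotes, and then
converts each column with methods::as(), as the last branch of the loop does.
The names "input", "classes", "raw" and "d" are only for illustration.

```r
# Input with quotes around some non-character fields.
input <- c('"1" TRUE   3.5',
           '2   NA     "0.1"',
           'NA  FALSE  0.1',
           '3   "TRUE" NA')

# Desired column classes.
classes <- c("integer", "logical", "numeric")

# Step 1: read everything as character; quotes are removed at this stage.
raw <- read.table(text = input, colClasses = "character")

# Step 2: convert column by column after the quotes are gone.
d <- raw
for (i in seq_along(classes))
    d[[i]] <- methods::as(raw[[i]], classes[i])

print(sapply(d, typeof))
print(is.na(d))
```

This costs a second pass over the data, which is one reason why letting
scan() do the conversion may have been preferred originally.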

With the patch applied, I ran the following test:
  d1 <- read.table(stdin())
  "1" TRUE   3.5
  2   NA     "0.1"
  NA  FALSE  0.1
  3   "TRUE" NA

  sapply(d1, typeof)
  #        V1        V2        V3 
  # "integer" "logical"  "double" 
  is.na(d1)
  #         V1    V2    V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE
  
  d2 <- read.table(stdin(), colClasses=c("integer", "logical", "double"))
  "1" TRUE   3.5
  2   NA     "0.1"
  NA  FALSE  0.1
  3   "TRUE" NA

  sapply(d2, typeof)
  #        V1        V2        V3 
  # "integer" "logical"  "double" 
  is.na(d2)
  #         V1    V2    V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE

I think there was a reason to let scan() perform the type conversion; for
example, it may be more efficient. So, if correct, the above patch is one possible
solution, but another may be more appropriate. In particular, scan() could be
modified to remove quotes also from fields specified as numeric.

Petr.