read.table() with quoted integers
On Oct 4, 2013, at 17:10 , Henrik Bengtsson wrote:
On Fri, Oct 4, 2013 at 4:55 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 13-10-04 7:31 AM, Joshua Ulrich wrote:
On Tue, Oct 1, 2013 at 11:29 AM, David Winsemius <dwinsemius at comcast.net> wrote:
On Sep 30, 2013, at 6:38 AM, Joshua Ulrich wrote:
On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
Hi! It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider quoted integers as an acceptable value for columns for which colClasses="integer". But when colClasses is omitted, these columns are read as integer anyway. For example, let's consider a file named file.dat, containing:
"1"
"2"
read.table("file.dat", colClasses="integer")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  scan() expected 'an integer' and got '"1"'
But:
str(read.table("file.dat"))
'data.frame': 2 obs. of 1 variable:
$ V1: int 1 2
The latter result is indeed documented in ?read.table:
Unless 'colClasses' is specified, all columns are read as
character columns and then converted using 'type.convert' to
logical, integer, numeric, complex or (depending on 'as.is')
factor as appropriate. Quotes are (by default) interpreted in all
fields, so a column of values like '"42"' will result in an
integer column.
Should the former behavior be considered a bug?
No. If you tell read.table the column is integer and it's actually character on disk, it should be an error.
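The two behaviours described at the start of the thread can be reproduced with a temporary file. This is only a minimal sketch: the temporary path stands in for file.dat, and the error from the colClasses="integer" call is captured rather than assumed, since its exact wording may differ between R versions.

```r
# Write the two-line example file (quoted integers) to a temporary path.
path <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), path)

# Default: columns are read as character, quotes are interpreted,
# and type.convert() then turns the column into integer.
default_read <- read.table(path)
str(default_read)

# Explicit class: the column is requested as integer up front.
# Capture whatever happens instead of letting an error stop the script.
strict_read <- tryCatch(
  read.table(path, colClasses = "integer"),
  error = function(e) conditionMessage(e)
)
print(strict_read)

unlink(path)
```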
My reading of the `read.table` help page is that one should expect that when there is an 'integer'-class and an `as.integer` function and "integer" is the argument to colClasses, that `as.integer` will be applied to the values in the column. Should I be reading elsewhere?
I assume you're referring to the paragraph below.
Possible values are 'NA' (the default, when 'type.convert' is used), '"NULL"' (when the column is skipped), one of the atomic vector classes (logical, integer, numeric, complex, character, raw), or '"factor"', '"Date"' or '"POSIXct"'. Otherwise there needs to be an 'as' method (from package 'methods') for conversion from '"character"' to the specified formal class.
I read that as meaning that an "as" method is required for classes not already listed in the prior sentence. It doesn't say an "as" method will be applied if colClasses is one of the atomic, factor, Date, or POSIXct classes; but I can see how you might assume that, since all the atomic, factor, Date, and POSIXct classes already have "as" methods...
And this does suggest a workaround for ffdf: instead of declaring the class to be "integer", declare a class "ffdf_integer", and write a conversion method. Or simply read everything as character and call as.integer() explicitly.
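A minimal sketch of the custom-class workaround, assuming a hypothetical class name "quoted_integer" (not from the thread or from any package). Because the name is not one of the built-in colClasses values, read.table() reads the column as character, with quotes interpreted, and then converts it through the registered "as" method. The regular-expression check is one possible way to keep validation; it is an assumption, not what read.table() itself does.

```r
library(methods)

# Hypothetical virtual class; it exists only so that read.table() falls
# through to methods::as() instead of requesting integers from scan().
setClass("quoted_integer")

# Conversion from character, with an explicit validity check so that
# values such as "foo" raise an error instead of silently becoming NA.
setAs("character", "quoted_integer", function(from) {
  ok <- is.na(from) | grepl("^[+-]?[0-9]+$", from)
  if (!all(ok)) {
    stop("not an integer: ", paste(from[!ok], collapse = ", "))
  }
  as.integer(from)
})

df <- read.table(text = '"1"\n"2"', colClasses = "quoted_integer")
str(df)
```

The resulting column is a plain integer vector; the class name is used only to select the conversion.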
Just a note of concert since several proposed it:
concerN?
Using colClasses="character" followed by as.integer() or strtoi() misses the validation, e.g. "foo" will be turned into NA_integer_. Using read.table() or scan() gives an error.
The obvious fix for that would seem to be to use scan() on the character vector:
y <- c("1","2",3,4,5)
y
[1] "1" "2" "3" "4" "5"
scan(text=y)
Read 5 items
[1] 1 2 3 4 5
y <- c("1","2",3,4,"NA")
scan(text=y)
Read 5 items
[1]  1  2  3  4 NA
y <- c("1","2",3,4,"foo")
scan(text=y)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  scan() expected 'a real', got 'foo'
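The calls above use scan()'s default what=double(), which is why the error mentions 'a real'. To validate integers specifically, one could pass what=integer(). The helper name parse_integers below is hypothetical, and the sketch assumes each element holds a single token, since scan() splits on whitespace.

```r
# Hypothetical helper: parse a character vector as integers via scan(),
# so that non-integer input raises an error instead of yielding NA.
# Caveat: scan() splits on whitespace, so an element such as "1 2"
# would produce two items.
parse_integers <- function(x) {
  scan(text = x, what = integer(), quiet = TRUE)
}

good <- parse_integers(c("1", "2", "3"))
print(good)

# Invalid input: capture the error object rather than stopping.
res <- tryCatch(parse_integers(c("1", "2", "foo")), error = function(e) e)
print(inherits(res, "error"))
```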
/Henrik
Duncan Murdoch
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com