RFC: type conversion in read.table
Prof Brian Ripley wrote:
Currently read.table is rather limited in its type conversion. The algorithm is

0) Read as character.
1) Try to convert to numeric. If that works, quit.
2) Convert to factor unless as.is.

I am thinking about adding more flexibility and more classes by the following two changes.

A) Anticipating the arrival of classes for all R objects, add an argument, say `colClasses', that allows the user to specify the desired class for every column. This could default to "auto", or NA if people think "auto" might be a relevant class name one day. The effect would be equivalent to running

    data[[i]] <- as(data[[i]], colClasses[i])

instead of

    data[[i]] <- type.convert(data[[i]], as.is = as.is[i], dec = dec)

except that standard classes such as "numeric", "factor", "logical", "character" would be dispatched directly, and argument "dec" would be consulted where appropriate. colClasses = "character" would suppress all conversions, which cannot currently be done.

B) Make the default "auto" option somewhat cleverer. I am thinking of trying the following in turn:

    logical
    integer
    numeric
    complex
    factor (only if !as.is[i], for backwards compatibility)

The `dec' option needs to be used for numeric/complex. This would be done by a documented typeConvert function, and should normally be fast (just look at the first item to rule out much of the list).

This does mean that data frames would be much more likely to end up containing integer or logical variables (although they can now). I have already fixed model.frame/matrix to handle logical variables, and would need to check that they do handle integer variables.

Questions:

1) Is this desirable?
Yes, definitely. It also fits very well into the formal class idiom. Couple of suggestions below.
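For illustration, the "auto" cascade described in B) might look roughly like the sketch below. The helper name `auto_convert_sketch`, its `na.strings` argument, and the accepted logical spellings are assumptions made for this example; it is not the typeConvert function mentioned in the proposal, and it checks every element rather than using the "look at the first item" shortcut.

```r
# Hypothetical helper illustrating the proposed "auto" cascade; the name,
# arguments and details are assumptions, not the function from the thread.
auto_convert_sketch <- function(x, as.is = FALSE, dec = ".", na.strings = "NA") {
  stopifnot(is.character(x))
  x[x %in% na.strings] <- NA_character_
  ok <- !is.na(x)

  # 1) logical: accept only the common spellings of TRUE and FALSE
  if (all(x[ok] %in% c("TRUE", "FALSE", "T", "F"))) return(as.logical(x))

  # Numbers are parsed with "." as the decimal mark, so map `dec` onto it first.
  xs <- if (dec != ".") sub(dec, ".", x, fixed = TRUE) else x
  num <- suppressWarnings(as.numeric(xs))
  if (!anyNA(num[ok])) {
    # 2) integer: digits only, and within the range of R's integer type
    whole <- grepl("^[-+]?[0-9]+$", x[ok])
    if (all(whole) && all(abs(num[ok]) <= .Machine$integer.max)) {
      return(as.integer(num))
    }
    # 3) numeric
    return(num)
  }

  # 4) complex
  cpx <- suppressWarnings(as.complex(xs))
  if (!anyNA(cpx[ok])) return(cpx)

  # 5) factor, unless as.is asks for the strings to be left alone
  if (as.is) x else factor(x)
}

# Each column is tried against the cascade in turn.
cols <- list(flag = c("TRUE", "F", NA), n = c("1", "2", "3"),
             x = c("1", "2.5"), z = c("1+2i", "3"), g = c("a", "b"))
print(sapply(cols, function(col) class(auto_convert_sketch(col))))
```

Testing the cheap, restrictive cases first means a column only falls through to factor when nothing narrower fits, which matches the "without much information loss" aim of the default.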
2) Are the names sensible?

3) Is there any need to allow users to specify either the set of classes used by "auto" or lists of classes on a column-specific basis?
I think the most flexible way to get what you want is something like the following.

The natural default for the colClasses argument is the name of a class, but a "virtual" class in green book terminology. I've been playing around with some data-frame related software mostly as tests for the methods code (in SLanguage/SModels in the Omegahat tree). The class used there for this purpose is called "dataVariable", meaning anything that can conceptually be a variable in a data frame. Actual classes for variables extend this class, maybe trivially, maybe by some method.

What's needed for the default here is essentially a method to coerce class "character" to "dataVariable" (or whatever name one wants to use). When we are really using formal methods, this would be specified by a call to setAs (green book, p307). Then in effect

    data[[i]] <- as(data[[i]], colClasses[i])

applies in the default case as well.

Users could specialize the default by over-riding the setAs, but a better way would be to define a new virtual class, with its own method for coercion. Users would then have essentially unlimited flexibility, by supplying the name of that class in the colClasses argument.
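A minimal sketch of this idea, using the methods package as it exists in current R: a virtual class named "dataVariable" gets a coercion method from "character", and a second, user-defined virtual class gets its own. The class "percentVariable" and both coercion bodies are assumptions invented for this example, not code from the Omegahat tree; the default method simply delegates to type.convert.

```r
library(methods)

# Virtual class standing for "anything that can be a data frame variable".
setClass("dataVariable", representation("VIRTUAL"))

# Default coercion from character; delegating to type.convert is an assumption.
setAs("character", "dataVariable",
      function(from) type.convert(from, as.is = TRUE))

# A user-defined virtual class with its own coercion (hypothetical example):
# strings such as "50%" become the proportion 0.5.
setClass("percentVariable", representation("VIRTUAL"))
setAs("character", "percentVariable",
      function(from) as.numeric(sub("%", "", from, fixed = TRUE)) / 100)

# The same call form then covers both the default and the specialized case.
print(as(c("1", "2", "3"), "dataVariable"))
print(as(c("50%", "25%"), "percentVariable"))

# read.table passes class names it does not handle directly on to as().
df <- read.table(text = "x y\n1 50%\n2 25%", header = TRUE,
                 colClasses = c("dataVariable", "percentVariable"))
str(df)
```

Note that the coercion methods here return ordinary vectors rather than objects of the virtual class itself; the class name acts only as a label that selects the conversion.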
4) Currently the default is to get something without much information loss, and that would remain. My intention is that if a class is specified and conversion is not possible that the result would be (mainly?) NAs. Any problem with that?
As a default, seems fine. When the user supplies a class, this implies an as() method, which can then decide what to do in case of problems--error, NA, or whatever.
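As a rough illustration of letting the as() method choose the failure behaviour, the sketch below defines two coercions for the same kind of input: one that yields NA for unconvertible strings and one that stops with an error. The class names "lenientNumber" and "strictNumber" are hypothetical and exist only for this example.

```r
library(methods)

# Lenient policy: strings that are not numbers quietly become NA.
setClass("lenientNumber", representation("VIRTUAL"))
setAs("character", "lenientNumber",
      function(from) suppressWarnings(as.numeric(from)))

# Strict policy: any string that fails to convert is reported as an error.
setClass("strictNumber", representation("VIRTUAL"))
setAs("character", "strictNumber", function(from) {
  out <- suppressWarnings(as.numeric(from))
  bad <- is.na(out) & !is.na(from)   # failed conversions, not genuine NAs
  if (any(bad)) {
    stop("cannot convert to number: ", paste(from[bad], collapse = ", "))
  }
  out
})

x <- c("1.5", "oops", "3")
lenient <- as(x, "lenientNumber")
strict  <- tryCatch(as(x, "strictNumber"),
                    error = function(e) conditionMessage(e))
print(lenient)
print(strict)
```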
Brian

--
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
John
John M. Chambers                  jmc@bell-labs.com
Bell Labs, Lucent Technologies    office: (908)582-2681
700 Mountain Avenue, Room 2C-282  fax:    (908)582-3340
Murray Hill, NJ 07974             web: http://www.cs.bell-labs.com/~jmc