
RFC: type conversion in read.table

5 messages · John Chambers, Kurt Hornik, Brian Ripley +1 more

#
Currently read.table is rather limited in its type conversion.
The algorithm is

0) Read as character
1) Try to convert to numeric. If that works, quit
2) Convert to factor, unless as.is is TRUE.

I am thinking about adding more flexibility and more classes by the
following two changes.


A) Anticipating the arrival of classes for all R objects, add an
argument say `colClasses' that allows the user to specify the desired
class for every column.  This could default to "auto", or NA if people
think "auto" might be a relevant class name one day.

The effect would be equivalent to running

data[[i]] <- as(data[[i]], colClasses[i])

instead of

data[[i]] <- type.convert(data[[i]], as.is = as.is[i], dec = dec)

except that standard classes such as "numeric", "factor", "logical",
"character" would be dispatched directly, and argument "dec" would be
consulted where appropriate.

colClasses = "character" would suppress all conversions, which cannot
currently be done.
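A rough sketch of the per-column effect described above (the function
name convert_columns and the acceptance of a plain character data
frame are illustrative assumptions, not the proposed implementation):

```r
# Illustrative sketch only: coerce each column of a character-valued
# data frame to the class named in colClasses.  Standard classes are
# dispatched directly; anything else falls back to methods::as().
convert_columns <- function(data, colClasses) {
  for (i in seq_along(data)) {
    data[[i]] <- switch(colClasses[i],
      character = data[[i]],                  # suppress all conversion
      numeric   = as.numeric(data[[i]]),
      integer   = as.integer(data[[i]]),
      logical   = as.logical(data[[i]]),
      factor    = factor(data[[i]]),
      methods::as(data[[i]], colClasses[i]))  # formal coercion
  }
  data
}

d <- data.frame(a = c("1.5", "2"), b = c("yes", "no"),
                stringsAsFactors = FALSE)
d <- convert_columns(d, c("numeric", "character"))
```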


B) Make the default "auto" option somewhat cleverer.  I am thinking of
trying the following in turn

logical
integer
numeric
complex
factor   (only if !as.is[i] for backwards compatibility).

The `dec' option needs to be used for numeric/complex.

This would be done by a documented typeConvert function, and
should normally be fast (just look at the first item to rule
out much of the list).
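The try-in-turn heuristic might look roughly like the following (the
name auto_convert and the "no new NAs" acceptance test are assumptions
made for this sketch; the proposal's typeConvert may well differ):

```r
# Sketch of the proposed "auto" heuristic: try logical, integer,
# numeric, complex in turn, keeping the first conversion that
# introduces no new NAs; otherwise fall back to factor/character.
auto_convert <- function(x, as.is = FALSE, dec = ".") {
  xs <- if (dec != ".") gsub(dec, ".", x, fixed = TRUE) else x
  ok <- function(y) !any(is.na(y) & !is.na(x))   # no new NAs
  y <- suppressWarnings(as.logical(xs))
  if (ok(y)) return(y)
  n <- suppressWarnings(as.numeric(xs))
  if (ok(n)) {
    if (all(n == trunc(n), na.rm = TRUE) &&
        all(abs(n) < .Machine$integer.max, na.rm = TRUE))
      return(as.integer(n))                      # whole numbers
    return(n)
  }
  y <- suppressWarnings(as.complex(xs))
  if (ok(y)) return(y)
  if (as.is) x else factor(x)                    # last resort
}
```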


This does mean that data frames would be much more likely to end up
containing integer or logical variables (although they can now).
I have already fixed model.frame/matrix to handle logical variables,
and would need to check that they do handle integer variables.


Questions:

1) Is this desirable?

2) Are the names sensible?

3) Is there any need to allow users to specify either the set of
   classes used by "auto" or lists of classes on a column-specific
   basis?

4) Currently the default is to get something without much information
   loss, and that would remain.  My intention is that if a class is
   specified and conversion is not possible that the result would be
   (mainly?) NAs.  Any problem with that?
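For reference, base coercion already behaves this way when a value
cannot be converted, which is the behaviour proposed here for a
mismatched colClasses entry:

```r
# Unconvertible entries become NA (base R also raises a warning).
x <- suppressWarnings(as.numeric(c("3.14", "apple")))
```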


Brian
#
Prof Brian Ripley wrote:
Yes, definitely.  It also fits very well into the formal class idiom. 
A couple of suggestions below.
I think the most flexible way to get what you want is something like the
following.

The natural default for the colClasses argument is the name of a class,
but a "virtual" class in green book terminology.

I've been playing around with some data-frame related software mostly as
tests for the methods code (in SLanguage/SModels in the Omegahat tree).

The class used there for this purpose is called "dataVariable", meaning
anything that can conceptually be a variable in a data frame.  Actual
classes for variables extend this class, maybe trivially, maybe by some
method.

What's needed for the default here is essentially a method to coerce
class "character" to "dataVariable" (or whatever name one wants to
use).  When we are really using formal methods, this would be specified
by a call to setAs (green book, p307).  Then in effect
  data[[i]] <- as(data[[i]], colClasses[i])
applies in the default case as well.

Users could specialize the default by overriding the setAs, but a
better way would be to define a new virtual class, with its own method
for coercion.  Users would then have essentially unlimited flexibility,
by supplying the name of that class in the colClasses argument.
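A minimal sketch of that idiom ("dataVariable" is the name from the
SLanguage/SModels work mentioned above; "percentVariable" and its
coercion rule are made up here purely for illustration):

```r
library(methods)

# Virtual class: anything that can be a variable in a data frame.
setClass("dataVariable", representation("VIRTUAL"))

# A user-defined class extending it, with its own coercion from
# character -- e.g. percentages stored as "10%" on file.
setClass("percentVariable", contains = c("dataVariable", "numeric"))
setAs("character", "percentVariable", function(from) {
  new("percentVariable", as.numeric(sub("%$", "", from)) / 100)
})

# The generic idiom data[[i]] <- as(data[[i]], colClasses[i]) then
# works unchanged for the user's class:
p <- as(c("10%", "25%"), "percentVariable")
```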
As a default, seems fine.  When the user supplies a class, this implies
an as() method, which can then decide what to do in case of
problems--error, NA, or whatever.
John
6 days later
#
Just a small remark.  I would prefer `NA' to "auto" (or "unknown").  May
be too late to change this now :-)

I would also be happier if we did not refer to the variables explicitly
as `columns'.  (This sounds a bit stupid coming from the person who
wrote write.table and introduced the arguments `row.names' and
`col.names', although at least one of those was modelled after an
existing function.)  E.g. something like

	read.table(......, caseNames, varNames, varClasses, .....)

would be nice ...

-k
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
On Fri, 31 Aug 2001, Kurt Hornik wrote:

Anything can be changed up to 1.4.0 release.  In particular, the present
code will have to be changed unless as() is in base by then.
The problem is that what is being referred to *is* columns and not
variables.  If you have row names on the file, the numbering is different.
So it matters to use sufficiently precise terminology.

Brian
#
>> I would also be happier if we did not refer to the variables
    >> explicitly as `columns'.  (This sounds a bit stupid from the
    >> person who wrote write.table and introduced arguments
    >> `row.names' and `col.names'.  Although, at least one of these
    >> was modelled after an existing function).  E.g. something like
    >> 
    >> read.table(......, caseNames, varNames, varClasses, .....)
    >> 
    >> would be nice ...

    BDR> The problem is that what is being referred to *is* columns
    BDR> and not variables.  If you have row names on the file, the
    BDR> numbering is different.  So it matters to use sufficiently
    BDR> precise terminology.

I would tend to agree with Brian.  To me, caseNames / varNames sounds
rather arrogant, since there are a number of other "formats"
(contingency tables come to mind) for which read.table is one possible
way of slurping in the data prior to munging it, though I guess one
could argue that this is an abuse of the tools.

best,
-tony