read.table() and NULL for colClasses
NULL is not a valid value for colClasses and I don't see why you thought it was. colClasses has to be character according to the documentation, so "NULL" is allowed but not NULL. Your diff appears to be backwards for a patch. A patch against the current R-devel sources is what is needed, including some regression tests.
On Wed, 28 Jul 2004, Henrik Bengtsson wrote:
Hi, is there a reason for not supporting NULL or "NULL" values for argument colClasses in read.table(), much like you can use NULL values for argument 'what' in scan()? This would help quite a bit when reading large data files where only a few columns are of interest.
Is that a common enough case to make this worth the code complication, given that scan() (or better, a DBMS) can be used? The usual reason is that R is maintained by a small and overworked team, and it is adding complications that needs justification, not leaving them out.
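To make the scan() route concrete, here is a minimal sketch: a NULL component in the 'what' list makes scan() skip that field. The sample text and the field names (id, name, score) are made up for illustration and are not from the original message.

```r
# Hypothetical three-column input; only 'id' and 'score' are of interest.
txt <- c("id name score",
         "1 a 3.5",
         "2 b 4.0")

# A NULL component in 'what' tells scan() to skip that field.
fields <- scan(text = txt, skip = 1, quiet = TRUE,
               what = list(id = integer(), name = NULL, score = numeric()))

# Keep only the fields that were actually read.
kept <- as.data.frame(fields[!sapply(fields, is.null)])
print(kept)
```

The header line is skipped by hand here (skip = 1), which is part of the bookkeeping that read.table() would otherwise do for you.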
I've modified read.table() so that it also calls scan(what=...) with NULLs for the fields to be skipped. Here's the diff of readtable.R (from the R-1.9.1.tgz; 9,591,217 bytes):
diff readtable.new.R readtable.R
117,123d116
< # Skip NULL columns in scan()
< void <- sapply(colClasses, FUN=identical, "NULL") |
<   sapply(colClasses, FUN=is.null)
< # If all (data) columns are NULL, return empty data frame.
< if (sum(!void) <= 1*rlabp)
<   return(data.frame())
< what[void] <- list(NULL)
131c124
< nlines <- length(data[[which(!void)[1]]])
---
> nlines <- length(data[[1]])
161c154
< for (i in (1:cols)[!known & !void]) {
---
> for (i in 1:cols) {
171,178d163
< # Skipped row names equals row.names=NULL.
< if (rlabp) {
< if (void[1]) {
< row.names <- NULL
< data <- data[-1]
< }
< void <- void[-1]
< }
201,202d185
< # Remove NULL columns
< data[void] <- NULL
and a diff for read.table.Rd:
diff read.table.new.Rd read.table.Rd
102,104c102
< \code{NA} when \code{\link{type.convert}} is used. Columns for
< which the value is \code{"NULL"} (or \code{NULL} in a list) are
< skipped. NB: \code{as} is
---
> \code{NA} when \code{\link{type.convert}} is used. NB: \code{as} is
181,183c179
< the five atomic vector classes. Skipping columns with \code{"NULL"}
< (or \code{NULL}) will also require less memory.
<
---
> the five atomic vector classes.
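The readtable.R patch above relies on two list idioms that are easy to confuse: assigning list(NULL) keeps a slot but sets it to NULL, whereas assigning a bare NULL removes the element. A small standalone sketch follows; the four-column colClasses, what and data values are invented for illustration, and only the expressions involving 'void' are taken from the patch.

```r
# Hypothetical per-column specification: column 2 is NULL, column 3 is "NULL".
colClasses <- list("integer", NULL, "NULL", NA)

# Same test as in the patch: TRUE for columns to be skipped.
void <- sapply(colClasses, FUN = identical, "NULL") |
        sapply(colClasses, FUN = is.null)

# Assigning list(NULL) keeps the slots but sets them to NULL,
# which is the form scan() expects for fields to skip.
what <- list(integer(), character(), numeric(), logical())
what[void] <- list(NULL)

# Assigning a bare NULL removes the elements altogether,
# which is how the patch drops the skipped columns afterwards.
data <- list(1:2, c("a", "b"), c(3.5, 4), c(TRUE, FALSE))
data[void] <- NULL

print(void)
print(length(what))
print(length(data))
```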
Note that setting a colClasses entry to "NULL" already has an effect, which I assume is unintentional. In the data conversion, which happens *after* scan() has read the data anyway, "NULL" will NULL a column via as(x, "NULL"), but unfortunately the wrong column. If not the above modifications, maybe a warning for the latter?
That's not usage as documented so the effect is definitely unintentional. We can't catch all misuses!
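For readers on a later R release: the read.table() help page there lists "NULL" among the documented colClasses values, meaning the column is skipped. That does not apply to R 1.9.1 as discussed in this thread. A minimal sketch, assuming such a later release; the sample text and column names are made up for illustration.

```r
# Hypothetical three-column input; the middle column is to be dropped.
txt <- c("id name score",
         "1 a 3.5",
         "2 b 4.0")

# "NULL" (a character string, not NULL) marks the column to skip.
df <- read.table(text = txt, header = TRUE,
                 colClasses = c("integer", "NULL", "numeric"))
print(names(df))
```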
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595