On 11 February 2011 19:39, Ben Bolker <bbolker <at> gmail.com> wrote:
[snip]
Bump. Is there any opinion about this from R-core??
Will I be scolded if I submit this as a bug ... ??
What is dangerous/confusing is that R silently **wraps** lines that have
more fields than expected if fill=TRUE (which is the default for
read.csv): the extra fields spill over onto a new row. I encountered
this when working with a colleague on a long, messy CSV file that had
some phantom extra fields in some rows, which then turned into extra
empty rows in the data frame.
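
A minimal sketch of how the wrapping can be triggered, for anyone who
wants to see it for themselves. The file contents and the temporary file
are made up for illustration. It relies on the documented behaviour that
read.table() decides the number of columns from the first five lines of
input, so the over-long line is placed after those.

```r
## Hypothetical test file: header plus four well-formed rows, followed by
## one row carrying two phantom extra fields.
fn <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "1,2,3",
             "4,5,6",
             "7,8,9",
             "10,11,12",
             "13,14,15,16,17"),
           fn)

## read.csv() has header=TRUE and fill=TRUE by default
d <- read.csv(fn)
print(d)

## There are only five data lines in the file; if the long line has been
## wrapped, the data frame has more rows than that and still three columns.
cat("data lines in file:", length(readLines(fn)) - 1, "\n")
cat("rows in data frame:", nrow(d), "\n")
```

No warning is expected here, which is what makes the problem easy to
miss in a long file.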
[snip snip]
Here is an example and a workaround that runs count.fields on the
whole file to find the maximum number of fields in any line and sets
col.names accordingly. (It assumes you don't already have a file named
"test.csv" in your working directory ...)
I haven't dug in to try to write a patch for this. I wanted to test
the waters and see what people thought first. I realize that
read.table() is a very complicated piece of code that embodies a lot of
tradeoffs, so there could be many different approaches to mitigating
this problem. I appreciate very much how hard it is to write a
robust and general function to read data files, but I also think it's
really important to minimize the number of traps in read.table(), which
will often be the first part of R that new users encounter ...
A quick fix for this might be to allow the number of lines analyzed
for length to be settable by the user, or to allow a settable 'maxcols'
parameter, although those would only help in the case where the user
already knows there is a problem.
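
In the meantime, one way a user could find out that there is a problem
in the first place is to tabulate the field counts before reading. This
is only a sketch: check_fields() is a hypothetical helper, not something
in base R, and its quote and comment.char defaults are chosen to mirror
read.csv() rather than count.fields().

```r
## Hypothetical helper: report lines whose field count differs from that
## of the first (header) line.
check_fields <- function(fn, sep = ",", quote = "\"", comment.char = "") {
  n <- count.fields(fn, sep = sep, quote = quote,
                    comment.char = comment.char)
  expected <- n[1]
  ## count.fields() gives NA for the first line of a quoted field that
  ## spans several lines, so NAs are ignored here
  bad <- which(!is.na(n) & n != expected)
  ## 'line' counts non-blank lines only, because count.fields() skips
  ## blank lines by default
  data.frame(line = bad, fields = n[bad])
}

## Made-up example file with one over-long row
fn <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6,7", "8,9,10"), fn)
bad <- check_fields(fn)
print(bad)
```

An empty result suggests every line matches the header; anything else
is worth a look before trusting the output of read.csv().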
cheers
Ben Bolker
## assumes header=TRUE, fill=TRUE; should be a little more careful
## with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char
Read.csv <- function(fn, sep=",", ...) {
    ## column names from the header line
    colnames <- scan(fn, nlines=1, what="character", sep=sep,
                     quiet=TRUE, ...)
    ncolnames <- length(colnames)
    ## widest line in the whole file; count.fields gives NA for the
    ## first line of a quoted field that spans lines, hence na.rm
    maxcols <- max(count.fields(fn, sep=sep, ...), na.rm=TRUE)
    if (maxcols > ncolnames) {
        colnames <- c(colnames,
                      paste("V", (ncolnames+1):maxcols, sep=""))
    }
    ## assumes you don't have any other columns labeled "V[large number]"
    ## sep is a named argument, not part of ..., so pass it on explicitly
    read.csv(fn, sep=sep, ..., col.names=colnames)
}
Read.csv("test.csv")