read.csv trap

Bump.

  It's been a week since I posted this to r-devel.  Any
thoughts/discussion?  Would R-core be irritated if I submitted a bug report?

  cheers
    Ben

-------- Original Message --------
Subject: read.csv trap
Date: Fri, 04 Feb 2011 11:16:36 -0500
From: Ben Bolker <bbolker at gmail.com>
To: r-devel at stat.math.ethz.ch <r-devel at stat.math.ethz.ch>,  David Earn
<earn at math.mcmaster.ca>

  This is not specifically a bug, but an (implicitly/obscurely)
documented behavior of read.csv (or read.table with fill=TRUE) that can
be quite dangerous/confusing for users.  I would love to hear some
discussion from other users and/or R-core about this ...  As always, I
apologize if I have missed some obvious workaround or reason that this
is actually the desired behavior ...

  In a nutshell, when fill=TRUE R guesses the number of columns from the
first 5 rows of the data set.  That's fine, and ?read.table documents this:

   The number of data columns is determined by looking at the first
     five lines of input (or the whole file if it has less than five
     lines), or from the length of ?col.names? if it is specified and
     is longer.  This could conceivably be wrong if ?fill? or
     ?blank.lines.skip? are true, so specify ?col.names? if necessary.

What is dangerous/confusing is that R silently **wraps** longer lines if
fill=TRUE (which is the default for read.csv).  I encountered this when
working with a colleague on a long, messy CSV file that had some phantom
extra fields in some rows, which then turned into empty lines in the
data frame.

  Here is an example and a workaround that runs count.fields on the
whole file to find the maximum column length and set col.names
accordingly.  (It assumes you don't already have a file named "test.csv"
in your working directory ...)

  I haven't dug in to try to write a patch for this -- I wanted to test
the waters and see what people thought first, and I realize that
read.table() is a very complicated piece of code that embodies a lot of
tradeoffs, so there could be lots of different approaches to trying to
mitigate this problem. I appreciate very much how hard it is to write a
robust and general function to read data files, but I also think it's
really important to minimize the number of traps in read.table(), which
will often be the first part of R that new users encounter ...

  A quick fix for this might be to allow the number of lines analyzed
for length to be settable by the user, or to allow a settable 'maxcols'
parameter, although those would only help in the case where the user
already knows there is a problem.

  cheers
    Ben Bolker

===============
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"),
           con=file("test.csv"))

read.csv("test.csv")
try(read.csv("test.csv",fill=FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
##  with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn,sep=",",...) {
  colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
  ncolnames <- length(colnames)
  maxcols <- max(count.fields(fn,sep=sep,...))
  if (maxcols>ncolnames) {
    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
  }
  ## assumes you don't have any other columns labeled "V[large number]"
  read.csv(fn,...,col.names=colnames)
}

Read.csv("test.csv")