read.csv trap

4 messages · Ken.Williams at thomsonreuters.com, Laurent Gatto, Ben Bolker

#
Bump.

  It's been a week since I posted this to r-devel.  Any
thoughts/discussion?  Would R-core be irritated if I submitted a bug report?

  cheers
    Ben


-------- Original Message --------
Subject: read.csv trap
Date: Fri, 04 Feb 2011 11:16:36 -0500
From: Ben Bolker <bbolker at gmail.com>
To: r-devel at stat.math.ethz.ch <r-devel at stat.math.ethz.ch>,  David Earn
<earn at math.mcmaster.ca>

  This is not specifically a bug, but an (implicitly/obscurely)
documented behavior of read.csv (or read.table with fill=TRUE) that can
be quite dangerous/confusing for users.  I would love to hear some
discussion from other users and/or R-core about this ...  As always, I
apologize if I have missed some obvious workaround or reason that this
is actually the desired behavior ...

  In a nutshell, when fill=TRUE R guesses the number of columns from the
first 5 rows of the data set.  That's fine, and ?read.table documents this:

   The number of data columns is determined by looking at the first
     five lines of input (or the whole file if it has less than five
     lines), or from the length of 'col.names' if it is specified and
     is longer.  This could conceivably be wrong if 'fill' or
     'blank.lines.skip' are true, so specify 'col.names' if necessary.

What is dangerous/confusing is that R silently **wraps** longer lines if
fill=TRUE (which is the default for read.csv).  I encountered this when
working with a colleague on a long, messy CSV file that had some phantom
extra fields in some rows, which then turned into empty lines in the
data frame.
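The escape hatch the documentation suggests, passing col.names explicitly, can be sketched on a tiny made-up file (the file contents and column names here are mine, for illustration only):

```r
## Sketch: a ragged file where the last row has one extra field.
tf <- tempfile(fileext = ".csv")
writeLines(c("A,B,C",
             "1,2,3",
             "4,5,6,7"), tf)

## With col.names padded out to the widest row, nothing gets wrapped;
## the short data row is filled with NA instead.
d <- read.csv(tf, col.names = c("A", "B", "C", "D"))
d
##   A B C  D
## 1 1 2 3 NA
## 2 4 5 6  7
```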

  Here is an example and a workaround that runs count.fields on the
whole file to find the maximum number of fields and set col.names
accordingly.  (It assumes you don't already have a file named "test.csv"
in your working directory ...)

  I haven't dug in to try to write a patch for this -- I wanted to test
the waters and see what people thought first, and I realize that
read.table() is a very complicated piece of code that embodies a lot of
tradeoffs, so there could be lots of different approaches to trying to
mitigate this problem. I appreciate very much how hard it is to write a
robust and general function to read data files, but I also think it's
really important to minimize the number of traps in read.table(), which
will often be the first part of R that new users encounter ...

  A quick fix for this might be to allow the number of lines analyzed
for length to be settable by the user, or to allow a settable 'maxcols'
parameter, although those would only help in the case where the user
already knows there is a problem.
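One user-level defence, sketched below under the assumption that running count.fields over the whole file is affordable, is to check the field counts up front and warn when the rows are ragged (the function name check_fields is mine, not part of R):

```r
## Sketch: warn before reading if rows carry differing field counts.
## Note: count.fields() returns NA for lines inside multi-line quoted
## strings, which this simple sketch does not handle.
check_fields <- function(fn, sep = ",") {
  nf <- count.fields(fn, sep = sep)
  if (length(unique(nf)) > 1L) {
    warning(sprintf("ragged rows: field counts range from %d to %d",
                    min(nf), max(nf)))
  }
  invisible(nf)
}
```

Calling check_fields() on a file like the test file below raises a warning up front, rather than letting read.csv() silently wrap the long row.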

  cheers
    Ben Bolker

===============
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "3,a,i,j",
             "4,a,b,c",
             "5,d,e,f",
             "6,g,h,i,j,k,l,m,n"),
           con=file("test.csv"))


read.csv("test.csv")
try(read.csv("test.csv",fill=FALSE))

## assumes header=TRUE, fill=TRUE; should be a little more careful
##  with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn,sep=",",...) {
  ## read the header line only (quiet=TRUE suppresses scan()'s message)
  colnames <- scan(fn,nlines=1,what="character",sep=sep,quiet=TRUE,...)
  ncolnames <- length(colnames)
  ## scan the whole file for the widest row
  maxcols <- max(count.fields(fn,sep=sep,...))
  if (maxcols>ncolnames) {
    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
  }
  ## assumes you don't have any other columns labeled "V[large number]"
  read.csv(fn,...,col.names=colnames)
}

Read.csv("test.csv")
#
On 2/11/11 1:39 PM, "Ben Bolker" <bbolker at gmail.com> wrote:

Based on your description, I would be very irritated if I encountered the
behavior you describe.  I would consider it a bug, though my opinion
doesn't necessarily count for much.

--
Ken Williams
Senior Research Scientist
Thomson Reuters
Phone: 651-848-7712
ken.williams at thomsonreuters.com
http://labs.thomsonreuters.com
#
On 11 February 2011 19:39, Ben Bolker <bbolker at gmail.com> wrote:
[snip]
As a matter of fact, this is exactly what happened to a colleague of
mine yesterday and caused her quite a bit of trouble. On the other
hand, it could also be considered a 'bug' in the csv file. Although
no formal specification exists for the csv format, RFC 4180 [1]
indicates that 'each line should contain the same number of fields
throughout the file'.

[1] http://tools.ietf.org/html/rfc4180

Best wishes,

Laurent

#
On 02/11/2011 03:37 PM, Laurent Gatto wrote:
Asserting that the bug is in the CSV file is logically consistent, but
if this is true then the "fill=TRUE" argument (which is only needed when
the lines contain different numbers of fields) should not be allowed.

 I had never seen RFC4180 before -- interesting!  I note especially
points 5-7, which define the handling of double quotation marks (but
say nothing about single quotes or using backslashes as escape
characters).

  Dealing with read.[table|csv] seems a bit of an Augean task
<http://en.wikipedia.org/wiki/Augeas> (hmmm, maybe I should write a
parallel document to Burns's _Inferno_ ...)

  cheers
    Ben