The behaviour of read.csv().

16 messages · Phil Spector, Peter Ehlers, David Scott +3 more

#
I have recently been bitten by an aspect of the behaviour of
the read.csv() function.

Some lines in a (fairly large) *.csv file that I read in had
too many entries.  I would have hoped that this would cause
read.csv() to throw an error, or at least issue a warning,
but it read the file without complaint, wrapping the extra
entries into an additional row.

This behaviour is illustrated by the toy example in the
attached file ``junk.csv''.  Just do

	junk <- read.csv("junk.csv",header=TRUE)
	junk

to see the problem.

If the offending over-long line were in the fourth line of data
or earlier, an error would be thrown, but if it is in the fifth line
of data or later no error is given.

This is in a way compatible with what the help on read.csv()
says:

	The number of data columns is determined by looking at
	the first five lines of input (or the whole file if it
	has less than five lines), or from the length of col.names
	if it is specified and is longer.

However, the help for read.table() says the same thing.  And yet if
one does

	gorp <- read.table("junk.csv",sep=",",header=TRUE)

one gets an error, whereas read.csv() gives none.
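A minimal self-contained sketch of the behaviour described above (the file name and contents are a hypothetical stand-in for the attached junk.csv, with the over-long line placed sixth, i.e. past the five lines that read.csv() inspects):

```r
## Build a toy CSV whose sixth data line has one field too many.
tf <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "1,2,3", "4,5,6", "7,8,9", "10,11,12", "13,14,15",
             "16,17,18,19"),   # one entry too many
           tf)

junk <- read.csv(tf, header = TRUE)   # no error, no warning
junk                                  # the extra entry is wrapped into a new row

## read.table() with the same file stops with an error
## ("line 6 did not have 3 elements"):
gorp <- try(read.table(tf, sep = ",", header = TRUE))
```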

Am I correct in saying that this is inappropriate behaviour on
the part of read.csv(), or am I missing something?

		cheers,

			Rolf Turner



P. S.:
R version 2.12.0 (2010-10-15)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/C/C/en_NZ.UTF-8/en_NZ.UTF-8

attached base packages:
[1] datasets  utils     stats     graphics  grDevices methods   base     

other attached packages:
[1] misc_0.0-13     gtools_2.6.2    spatstat_1.21-2 deldir_0.0-13  
[5] mgcv_1.6-2      fortunes_1.4-0  MASS_7.3-8     

loaded via a namespace (and not attached):
[1] grid_2.12.0        lattice_0.19-13    Matrix_0.999375-44 nlme_3.1-97       
[5] tools_2.12.0
#
Rolf -
    I'd suggest using

     junk <- read.csv("junk.csv",header=TRUE,fill=FALSE)

if you don't want the behaviour you're seeing.
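A short sketch of the difference fill = FALSE makes (using a toy file in place of junk.csv, with the over-long line past the first five data lines):

```r
## Toy file standing in for junk.csv.
tf <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "1,2,3", "4,5,6", "7,8,9", "10,11,12", "13,14,15",
             "16,17,18,19"), tf)

read.csv(tf, header = TRUE)                     # silently wraps the extra entry
try(read.csv(tf, header = TRUE, fill = FALSE))  # now an error is thrown
```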

 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu
On Fri, 3 Dec 2010, Rolf Turner wrote:

#
On 3/12/2010, at 1:08 PM, Phil Spector wrote:

The point is not that I don't want this kind of behaviour.
The point is that it seems to me to be unexpected and dangerous.

I can indeed take precautions against it, now that I know about it,
by specifying fill=FALSE.  Given that I remember to do so.

Now that you've pointed it out I can see that this is the reason
for the different behaviour between read.table() and read.csv();
in read.table() fill=FALSE is effectively the default.

Having fill=TRUE being the default in read.csv() strikes me as
being counter-intuitive and dangerous.

	cheers,

		Rolf
#
On 2010-12-02 16:26, Rolf Turner wrote:
Rolf,
This is not to argue with your point re counter-intuitive,
but I always run a count.fields() first if I haven't seen
(or can't easily see) the file in my editor. I must have
learned that the hard way a long time ago.
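A sketch of that pre-flight check (the file here is a toy example with one ragged line):

```r
## count.fields() reports the number of fields on every line,
## so a ragged file shows up before read.csv() ever touches it.
tf <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6,7"), tf)

n <- count.fields(tf, sep = ",")
table(n)           # more than one distinct count means a ragged file
which(n != n[1])   # flags the offending line(s)
```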

Peter Ehlers
#
On 02/12/2010 8:04 PM, Peter Ehlers wrote:
I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0. 
The comment in the NEWS file suggests it was in response to some strange 
csv file coming out of Excel.

The real problem with the CSV format is that there really isn't a well 
defined standard for it.  The first RFC about it was published in 2005, 
and it doesn't claim to be authoritative.  Excel is kind of a standard, 
but it does some very weird things.  (For example:  enter the string 01 
into a field.  To keep the leading 0, you need to type it as '01.  Save 
the file, read it back:  goodbye 0.  At least that's what a website I 
was just on says about Excel, and what OpenOffice does.)

I've been burned so many times by storing data in .csv files, that I 
just avoid them whenever I can.

Duncan Murdoch
#
On 3/12/2010, at 2:04 PM, Peter Ehlers wrote:
<SNIP>
Sound advice!  Thanks.  I'd just like to point out however
that it might be an idea to set quote="\"" in the call to
count.fields() --- to make its idea of how many fields there
are consistent with that of read.csv().  In count.fields()
quote defaults to "\"'" whereas in read.csv() it defaults
to "\"".
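A small illustration of the discrepancy (toy data; the apostrophes are the point):

```r
## With count.fields()'s default quote = "\"'", a pair of apostrophes
## opens and closes a quoted region, hiding the comma between them.
tf <- tempfile(fileext = ".csv")
writeLines(c("name,team", "o'brien,o'hara"), tf)

count.fields(tf, sep = ",")                # second line miscounted
count.fields(tf, sep = ",", quote = "\"")  # agrees with read.csv(): 2 fields per line
```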

	cheers,

		Rolf
#
On Dec 2, 2010, at 8:33 PM, Duncan Murdoch wrote:
snipped
In both Excel and in OO.org you can select a column (or any other  
range) and set its format to text. (The default is numeric, not so  
different from read.table()'s default behavior.) Once a format has  
been set, you then do not need leading quotes. I just created a small  
example with OO.org Calc, entering leading "0"s without leading  
quotes, and this code runs as desired after copying the three cells  
to the clipboard:

 > read.table(pipe("pbpaste"), colClasses="character")
     V1
1   01
2  004
3 0005

The same applies to date fields in both OO.org and Excel. In this  
regard, it is simply a matter of understanding the defined behavior  
of your software and how one can manipulate it. This is no different  
from learning R's classes, coercing them to your ends, and dealing  
with other formatting issues.
No argument there. I know one physician whose weapon of choice is  
Stata who always uses "|" as his separator, but that's perhaps because  
he works entirely in Windows. I imagine that might not be the most  
uncommon character in *NIXen.

--

David Winsemius, MD
West Hartford, CT
#
On 02/12/2010 9:18 PM, David Winsemius wrote:
You're right, I shouldn't have picked on Excel particularly here, but it 
really is a bizarre format that says the default way to read a file 
containing

"V1"
"01"
"004"
"0005"

is to assume that the column contains numeric values.  (Yes, read.csv() 
makes this same assumption.)  My main complaint is with the format.
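A quick demonstration of that assumption, with colClasses as the escape hatch:

```r
## The same four lines, read with and without colClasses.
tf <- tempfile(fileext = ".csv")
writeLines(c('"V1"', '"01"', '"004"', '"0005"'), tf)

read.csv(tf)                             # V1 is coerced to numeric: 1, 4, 5
read.csv(tf, colClasses = "character")   # leading zeros survive
```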

Duncan Murdoch
#
On 03/12/10 14:33, Duncan Murdoch wrote:
Absolutely agree with this Duncan. Playing around with .csv files is 
like playing with some sort of unstable explosive. I also avoid them as 
much as possible.

David Scott

#
On 3/12/2010, at 3:48 PM, David Scott wrote:

<SNIP>
Where I work, everybody but me uses (yeuuccchhh!!!) Excel or SPSS.  If
we are to share data sets, *.csv files seem to be the most efficacious,
if not the only, way to go.

So far, we've had very few problems.  The one that started off this thread
is the only one I can think of that related to the *.csv format.

At least *.csv files have the virtue of being ASCII files, whence if things
go wrong it is at least possible to dig into them with a text editor and
figure out just what the problem is.

	cheers,

		Rolf
#
On Dec 2, 2010, at 9:33 PM, Duncan Murdoch wrote:

I'm a bit puzzled. Or maybe not. If you are criticizing the default  
behavior of R's read.table then I do understand (though my reading of  
the FM has taught me to expect "numeric" iff all of the first <n>  
entries _are_ coercible to "numeric" without generating NAs). Excel  
is offering text exactly in the instances it has been told that the  
cell format is "text".
Meaning the defaults chosen for read.csv()?
#
On 02/12/2010 9:59 PM, Rolf Turner wrote:
I was going to suggest using DIF rather than CSV.  It contains more 
internal information about the file (including the type of each entry), 
but has the disadvantage of being less readable, even though it is ASCII.

However, in putting together a little demo, I found a couple of bugs in 
the R implementation of read.DIF, and it looks as though it ignores the 
internal type information.  Sigh.

Duncan Murdoch
2 days later
#
On 03/12/2010 7:08 AM, Duncan Murdoch wrote:
As of r53778, the bugs I noticed should be fixed.  read.DIF now respects 
the internal type information, so it will keep character strings like 
"001" as type character (unless you ask it to change the type).

Duncan Murdoch
#
On 6/12/2010, at 3:00 AM, Duncan Murdoch wrote:

I don't think DIF is really the answer. My colleagues are familiar
with the *.csv concept; they have never heard of ``DIF''.

As I have said, we have had but few problems using *.csv.  Better the
devil you know ...

Furthermore I have to deal with data provided by various sources ``external''
to the research project that I work for.  I have to use the data that these
sources provide, in the format in which they provide it.  If they give me
*.csv files I count myself lucky.

Finally, there seems to be no ``write.DIF'' function, i.e. there is no way
to produce *.DIF output, as far as I can tell.  Hence it would not seem
practical to use *.DIF as a data exchange standard.
What does ``r53778'' mean?

		cheers,

			Rolf
#
On Dec 5, 2010, at 2:14 PM, Rolf Turner wrote:

I assumed it was a version sequence number:

http://cran.r-project.org/src/base-prerelease/
David Winsemius, MD
West Hartford, CT
#
On 05/12/2010 2:14 PM, Rolf Turner wrote:
Sure, those are good points.
Revision 53778 from the version control system.  When you start 
R-patched or R-devel it will print this in the startup message, e.g.

R version 2.13.0 Under development (unstable) (2010-12-05 r53775)
                                                           ^^^^^^

(from just before I saved the changes).

Duncan Murdoch