The behaviour of read.csv().
On Dec 2, 2010, at 9:33 PM, Duncan Murdoch wrote:
On 02/12/2010 9:18 PM, David Winsemius wrote:
On Dec 2, 2010, at 8:33 PM, Duncan Murdoch wrote: snipped
I think the fill=TRUE option arrived about 10 years ago, in R 1.2.0. The comment in the NEWS file suggests it was in response to some strange csv file coming out of Excel. The real problem with the CSV format is that there really isn't a well defined standard for it. The first RFC about it was published in 2005, and it doesn't claim to be authoritative. Excel is kind of a standard, but it does some very weird things. (For example: enter the string 01 into a field. To keep the leading 0, you need to type it as '01. Save the file, read it back: goodbye 0. At least that's what a website I was just on says about Excel, and what OpenOffice does.)
In both Excel and in OO,org you can select a column (or any other range) and set its format to text. (The default is numeric, not that different that read.table()'s default behavior.) Once a format has been set, you then do not need leading quotes. I just created a small example with OO.org Calc entered leading "0" without leading quotes and this code runs as desired after copying the three cells to the clipboard:
> read.table(pipe("pbpaste"), colClasses="character")
V1 1 01 2 004 3 0005 The same applies to date field in both OO.org and Excel. In this regard, it is simply a matter of understanding what is the defined behavior of your software and how one can manipulate it. This is no different than learning R's classes, coercing them to your ends, and dealing with other formatting issues.
You're right, I shouldn't have picked on Excel particularly here, but it really is a bizarre format that says the default way to read a file containing "V1" # minor quibble. The V1 was added by read.table() "01" "004" "0005" is to assume that the column contains numeric values.
I'm a bit puzzled. Or maybe not. If you are criticizing the default behavior of R's read.table then I do understand (but have been taught by my reading of the FM that "numeric" happens iff all first <n> _are_ coercible to "numeric" without NA generation is what one should expect). Excel is offering text exactly in the instances it has been told that the cell format is "text".
(Yes, read.csv() makes this same assumption.) My main complaint is with the format.
Meaning the defaults chosen for read.csv()?
David. > > Duncan Murdoch > > >> >>> >>> I've been burned so many times by storing data in .csv files, that I >>> just avoid them whenever I can. >> >> No argument there. I know one physician whose weapon of choice is >> Stata who always uses "|" as his separator, but that's perhaps >> because >> he works entirely in Windows. I imagine that might not be the most >> uncommon character in *NIXen. >> >> -- >> >> David Winsemius, MD >> West Hartford, CT >> > David Winsemius, MD West Hartford, CT