Skip to content

CSV value not being read as it appears

13 messages · jim holtman, Brian Ripley, Bob Green +5 more

#
I have a frustrating issue which I am hoping someone may have a suggestion
about.

I am running XP and R 2.12.0 and saved an EXCEL file that I was sent as a
csv file.

The initial code I ran follows.

dec <- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open <- subset (dec, Status == "Open")
table(dec.open$AMHS)

I was checking the output and noticed a difference between my manual count
and R output. Two subject's rows were not being detected by the subset
command:

For the AMHS where there was a discrepancy I then ran:
wm <- subset (dec, AMHS == "WM")

The problem appears to be that there is a space before the 'Open" value
for two indivduals, as per the example below.

10/02/2010  Open
22/08/2007   Open

Checking in EXCEL there does not appear to be a space and the format is
the same (e.g 'general').  I resolved the problem by copying over the
values for the two individuals where I identified  a problem.

Given this problem was not detected by visual scanning I would appreciate
advice on how this problem can be detected in future without my having to
manually check raw data against R output.

Any assistance is appreciated,

Bob
#
try strip.white=TRUE to strip out white space

Sent from my iPad
On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:

            
#
As a further note, this is a reminder that whenever you get data via a 
spreadsheet the first thing to do is examine it and clean up any 
problems. A basic requirement is to tabulate any categorical variable. 
Spreadsheets allow any sort of data to be entered, with no controls. My 
experience is that those who enter data into spreadsheets enter all 
sorts of variations of what a human would wish to treat as the same 
("Open", "Open  ", "open", etc.), even when told not to.

David Scott
On 14/01/2011 4:03 p.m., Jim Holtman wrote:

  
    
#
On Fri, 14 Jan 2011, David Scott wrote:

            
Another common problem is that they enter characters such as 
non-breaking space or zero-width characters: we added support for 
known encodings of NBSP to strip.white about five years ago.

  
    
#
Hello David,

Thanks for your e-mail. The data was a report derived from a statewide
database, saved in EXCEL format, so the usual issue of the vagaries of
human data entry variation wasn't the issue as the data was an automated
report, which is run every three months. I would not have even noticed
this problem if I hadn't been double checking the numbers of people by
district. Visual inspection didn't reveal this problem - no white space
was obvious and the spelling was identical. Tabulation via R wouldn't have
detected this - I was obtaining the EXCEL totals via filter which I then
compared with R output. I'm hoping I can skip this step, in future, with
Jim's suggestion.

regards

Bob
#
Brian,

Thanks. My response to David follows. I should add that this problem has
never occurred previously as far as I know (I have now checked the
previous report I was sent):

Hello David,

Thanks for your e-mail. The data was a report derived from a statewide
database, saved in EXCEL format, so the usual issue of the vagaries of
human data entry variation wasn't the issue as the data was an automated
report, which is run every three months. I would not have even noticed
this problem if I hadn't been double checking the numbers of people by
district. Visual inspection didn't reveal this problem - no white space
was obvious and the spelling was identical. Tabulation via R wouldn't have
detected this - I was obtaining the EXCEL totals via filter which I then
compared with R output. I'm hoping I can skip this step, in future, with
Jim's suggestion.

regards

Bob
#
On Fri, Jan 14, 2011 at 07:58:07PM +1000, bgreen at dyson.brisnet.org.au wrote:
If this problem occurs with computer generated data, it may also be
worthwhile to talk to whoever is in charge of that reporting system
and hope to get the bug fixed.

And just to add one of my favorite inital checks: I always double
check if the number of levels of each factor in my data.frame seems to
make sense.

cu
	Philipp
#
On 2011-01-14 02:09, bgreen at dyson.brisnet.org.au wrote:
This problem occurs to me frequently. Like Philipp and David,
I too always check imported categorical variables. The worst
cases are trailing spaces (in quoted text).

It is hardly R's fault that Excel users routinely commit
crimes against data.

Peter Ehlers
#
At 14.01.2011 07:09 -0800, Peter Ehlers wrote:
These are still the best "worst cases". My favourite "worst cases" 
are entries like "5-10" or similar that are trasformed into dates, 
e.g. 05Oct2011. My problem is, however that I don't know any other 
universally known format to exchange data with a medical  colleague 
or with a social scientist.

Heinz
#
I am a bit confused by this. You are doing a transfer from Excel (.xls 
or .xlsx) to .csv, then a subset in R and ending up with a couple of 
entries which are " Open" rather than "Open". So where are they coming 
from? You say they are not in the original Excel, so that suggests the 
transfer to .csv is the problem. I would be very surprised if the subset 
was a problem, but as others have indicated transfer to .csv can be 
downright ugly.

You can check the .csv file by opening it in an editor (I use Emacs). 
Just go to the line and have a look if the extra space is there nestling 
between two commas.

The other advice is, don't go through .csv. Go directly from Excel to R. 
My favourite tools are RODBC and xlsReadWrite for that step. Both work 
extremely well.

As others have indicated, the big bugbear in the .csv route is dates, or 
what Excel decides are dates. My experience was the conversion of New 
Zealand health ID numbers to dates. They are three letters then 4 
digits, so AUG2699 became a date.

David Scott
On 14/01/2011 10:58 p.m., bgreen at dyson.brisnet.org.au wrote:

  
    
#
David,

Thanks. When I am back at work I will try to find out some specifics
regarding the original data base and how the reports are generated. The
differencs are not apparent via manual inspection.

I will look at the csv file in an editor as well and look into xlsReadWrite.

I agree EXCEL formatting of dates can be a problem.

regards

Bob
2 days later
#
David Scott wrote:

            
I like using the ?describe? function in the ?Hmisc? package for this. If you 
run the result through the ?latex? function, you get an even nicer output, 
with small histograms for each numerical variable.
#
Peter Ehlers wrote:

            
A ?fortune? candidate?