I have a frustrating issue which I am hoping someone may have a suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent as a
csv file.
The initial code I ran follows.
dec <- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open <- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual count
and R output. Two subject's rows were not being detected by the subset
command:
For the AMHS where there was a discrepancy I then ran:
wm <- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open" value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would appreciate
advice on how this problem can be detected in future without my having to
manually check raw data against R output.
Any assistance is appreciated,
Bob
CSV value not being read as it appears
13 messages · jim holtman, Brian Ripley, Bob Green +5 more
try strip.white=TRUE to strip out white space Sent from my iPad
On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
I have a frustrating issue which I am hoping someone may have a suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent as a
csv file.
The initial code I ran follows.
dec <- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open <- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual count
and R output. Two subject's rows were not being detected by the subset
command:
For the AMHS where there was a discrepancy I then ran:
wm <- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open" value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would appreciate
advice on how this problem can be detected in future without my having to
manually check raw data against R output.
Any assistance is appreciated,
Bob
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
As a further note, this is a reminder that whenever you get data via a
spreadsheet the first thing to do is examine it and clean up any
problems. A basic requirement is to tabulate any categorical variable.
Spreadsheets allow any sort of data to be entered, with no controls. My
experience is that those who enter data into spreadsheets enter all
sorts of variations of what a human would wish to treat as the same
("Open", "Open ", "open", etc.), even when told not to.
David Scott
On 14/01/2011 4:03 p.m., Jim Holtman wrote:
try strip.white=TRUE to strip out white space Sent from my iPad On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
I have a frustrating issue which I am hoping someone may have a suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent as a
csv file.
The initial code I ran follows.
dec<- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open<- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual count
and R output. Two subject's rows were not being detected by the subset
command:
For the AMHS where there was a discrepancy I then ran:
wm<- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open" value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would appreciate
advice on how this problem can be detected in future without my having to
manually check raw data against R output.
Any assistance is appreciated,
Bob
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
_________________________________________________________________ David Scott Department of Statistics The University of Auckland, PB 92019 Auckland 1142, NEW ZEALAND Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055 Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018 Director of Consulting, Department of Statistics
On Fri, 14 Jan 2011, David Scott wrote:
As a further note, this is a reminder that whenever you get data via
a spreadsheet the first thing to do is examine it and clean up any
problems. A basic requirement is to tabulate any categorical
variable. Spreadsheets allow any sort of data to be entered, with no
controls. My experience is that those who enter data into
spreadsheets enter all sorts of variations of what a human would
wish to treat as the same ("Open", "Open ", "open", etc.), even when
told not to.
Another common problem is that they enter characters such as non-breaking space or zero-width characters: we added support for known encodings of NBSP to strip.white about five years ago.
David Scott On 14/01/2011 4:03 p.m., Jim Holtman wrote:
try strip.white=TRUE to strip out white space Sent from my iPad On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
I have a frustrating issue which I am hoping someone may have a suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent as a
csv file.
The initial code I ran follows.
dec<- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open<- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual count
and R output. Two subject's rows were not being detected by the subset
command:
For the AMHS where there was a discrepancy I then ran:
wm<- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open" value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would appreciate
advice on how this problem can be detected in future without my having to
manually check raw data against R output.
Any assistance is appreciated,
Bob
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
--
_________________________________________________________________ David Scott Department of Statistics The University of Auckland, PB 92019 Auckland 1142, NEW ZEALAND Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055 Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018 Director of Consulting, Department of Statistics ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Hello David, Thanks for your e-mail. The data was a report derived from a statewide database, saved in EXCEL format, so the usual issue of the vagaries of human data entry variation wasn't the issue as the data was an automated report, which is run every three months. I would not have even noticed this problem if I hadn't been double checking the numbers of people by district. Visual inspection didn't reveal this problem - no white space was obvious and the spelling was identical. Tabulation via R wouldn't have detected this - I was obtaining the EXCEL totals via filter which I then compared with R output. I'm hoping I can skip this step, in future, with Jim's suggestion. regards Bob
As a further note, this is a reminder that whenever you get data via a
spreadsheet the first thing to do is examine it and clean up any
problems. A basic requirement is to tabulate any categorical variable.
Spreadsheets allow any sort of data to be entered, with no controls. My
experience is that those who enter data into spreadsheets enter all
sorts of variations of what a human would wish to treat as the same
("Open", "Open ", "open", etc.), even when told not to.
David Scott
On 14/01/2011 4:03 p.m., Jim Holtman wrote:
try strip.white=TRUE to strip out white space Sent from my iPad On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
I have a frustrating issue which I am hoping someone may have a
suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent as
a
csv file.
The initial code I ran follows.
dec<- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open<- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual
count
and R output. Two subject's rows were not being detected by the subset
command:
For the AMHS where there was a discrepancy I then ran:
wm<- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open" value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would
appreciate
advice on how this problem can be detected in future without my having
to
manually check raw data against R output.
Any assistance is appreciated,
Bob
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
--
_________________________________________________________________ David Scott Department of Statistics The University of Auckland, PB 92019 Auckland 1142, NEW ZEALAND Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055 Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018 Director of Consulting, Department of Statistics
Brian, Thanks. My response to David follows. I should add that this problem has never occurred previously as far as I know (I have now checked the previous report I was sent): Hello David, Thanks for your e-mail. The data was a report derived from a statewide database, saved in EXCEL format, so the usual issue of the vagaries of human data entry variation wasn't the issue as the data was an automated report, which is run every three months. I would not have even noticed this problem if I hadn't been double checking the numbers of people by district. Visual inspection didn't reveal this problem - no white space was obvious and the spelling was identical. Tabulation via R wouldn't have detected this - I was obtaining the EXCEL totals via filter which I then compared with R output. I'm hoping I can skip this step, in future, with Jim's suggestion. regards Bob
On Fri, 14 Jan 2011, David Scott wrote:
As a further note, this is a reminder that whenever you get data via
a spreadsheet the first thing to do is examine it and clean up any
problems. A basic requirement is to tabulate any categorical
variable. Spreadsheets allow any sort of data to be entered, with no
controls. My experience is that those who enter data into
spreadsheets enter all sorts of variations of what a human would
wish to treat as the same ("Open", "Open ", "open", etc.), even when
told not to.
Another common problem is that they enter characters such as non-breaking space or zero-width characters: we added support for known encodings of NBSP to strip.white about five years ago.
David Scott On 14/01/2011 4:03 p.m., Jim Holtman wrote:
try strip.white=TRUE to strip out white space Sent from my iPad On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
I have a frustrating issue which I am hoping someone may have a
suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent
as a
csv file.
The initial code I ran follows.
dec<- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open<- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual
count
and R output. Two subject's rows were not being detected by the subset
command:
For the AMHS where there was a discrepancy I then ran:
wm<- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open"
value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format
is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would
appreciate
advice on how this problem can be detected in future without my having
to
manually check raw data against R output.
Any assistance is appreciated,
Bob
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
--
_________________________________________________________________ David Scott Department of Statistics The University of Auckland, PB 92019 Auckland 1142, NEW ZEALAND Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055 Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018 Director of Consulting, Department of Statistics ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
On Fri, Jan 14, 2011 at 07:58:07PM +1000, bgreen at dyson.brisnet.org.au wrote:
Thanks for your e-mail. The data was a report derived from a statewide database, saved in EXCEL format, so the usual issue of the vagaries of human data entry variation wasn't the issue as the data was an automated report, which is run every three months.
If this problem occurs with computer generated data, it may also be worthwhile to talk to whoever is in charge of that reporting system and hope to get the bug fixed. And just to add one of my favorite inital checks: I always double check if the number of levels of each factor in my data.frame seems to make sense. cu Philipp
Dr. Philipp Pagel Lehrstuhl f?r Genomorientierte Bioinformatik Technische Universit?t M?nchen Wissenschaftszentrum Weihenstephan Maximus-von-Imhof-Forum 3 85354 Freising, Germany http://webclu.bio.wzw.tum.de/~pagel/
On 2011-01-14 02:09, bgreen at dyson.brisnet.org.au wrote:
Brian, Thanks. My response to David follows. I should add that this problem has never occurred previously as far as I know (I have now checked the previous report I was sent):
This problem occurs to me frequently. Like Philipp and David, I too always check imported categorical variables. The worst cases are trailing spaces (in quoted text). It is hardly R's fault that Excel users routinely commit crimes against data. Peter Ehlers
Hello David, Thanks for your e-mail. The data was a report derived from a statewide database, saved in EXCEL format, so the usual issue of the vagaries of human data entry variation wasn't the issue as the data was an automated report, which is run every three months. I would not have even noticed this problem if I hadn't been double checking the numbers of people by district. Visual inspection didn't reveal this problem - no white space was obvious and the spelling was identical. Tabulation via R wouldn't have detected this - I was obtaining the EXCEL totals via filter which I then compared with R output. I'm hoping I can skip this step, in future, with Jim's suggestion. regards Bob
On Fri, 14 Jan 2011, David Scott wrote:
As a further note, this is a reminder that whenever you get data via
a spreadsheet the first thing to do is examine it and clean up any
problems. A basic requirement is to tabulate any categorical
variable. Spreadsheets allow any sort of data to be entered, with no
controls. My experience is that those who enter data into
spreadsheets enter all sorts of variations of what a human would
wish to treat as the same ("Open", "Open ", "open", etc.), even when
told not to.
Another common problem is that they enter characters such as non-breaking space or zero-width characters: we added support for known encodings of NBSP to strip.white about five years ago.
David Scott On 14/01/2011 4:03 p.m., Jim Holtman wrote:
try strip.white=TRUE to strip out white space Sent from my iPad On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
I have a frustrating issue which I am hoping someone may have a
suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent
as a
csv file.
The initial code I ran follows.
dec<- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open<- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual
count
and R output. Two subject's rows were not being detected by the subset
command:
For the AMHS where there was a discrepancy I then ran:
wm<- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open"
value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format
is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would
appreciate
advice on how this problem can be detected in future without my having
to
manually check raw data against R output.
Any assistance is appreciated,
Bob
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
--
_________________________________________________________________ David Scott Department of Statistics The University of Auckland, PB 92019 Auckland 1142, NEW ZEALAND Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055 Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018 Director of Consulting, Department of Statistics ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
At 14.01.2011 07:09 -0800, Peter Ehlers wrote:
On 2011-01-14 02:09, bgreen at dyson.brisnet.org.au wrote:
Brian, Thanks. My response to David follows. I should add that this problem has never occurred previously as far as I know (I have now checked the previous report I was sent):
This problem occurs to me frequently. Like Philipp and David, I too always check imported categorical variables. The worst cases are trailing spaces (in quoted text).
These are still the best "worst cases". My favourite "worst cases" are entries like "5-10" or similar that are trasformed into dates, e.g. 05Oct2011. My problem is, however that I don't know any other universally known format to exchange data with a medical colleague or with a social scientist. Heinz
It is hardly R's fault that Excel users routinely commit crimes against data. Peter Ehlers
Hello David, Thanks for your e-mail. The data was a report derived from a statewide database, saved in EXCEL format, so the usual issue of the vagaries of human data entry variation wasn't the issue as the data was an automated report, which is run every three months. I would not have even noticed this problem if I hadn't been double checking the numbers of people by district. Visual inspection didn't reveal this problem - no white space was obvious and the spelling was identical. Tabulation via R wouldn't have detected this - I was obtaining the EXCEL totals via filter which I then compared with R output. I'm hoping I can skip this step, in future, with Jim's suggestion. regards Bob
On Fri, 14 Jan 2011, David Scott wrote:
As a further note, this is a reminder that whenever you get data via
a spreadsheet the first thing to do is examine it and clean up any
problems. A basic requirement is to tabulate any categorical
variable. Spreadsheets allow any sort of data to be entered, with no
controls. My experience is that those who enter data into
spreadsheets enter all sorts of variations of what a human would
wish to treat as the same ("Open", "Open ", "open", etc.), even when
told not to.
Another common problem is that they enter characters such as non-breaking space or zero-width characters: we added support for known encodings of NBSP to strip.white about five years ago.
David Scott On 14/01/2011 4:03 p.m., Jim Holtman wrote:
try strip.white=TRUE to strip out white space Sent from my iPad On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
I have a frustrating issue which I am hoping someone may have a
suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent
as a
csv file.
The initial code I ran follows.
dec<- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open<- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual
count
and R output. Two subject's rows were not being detected by the subset
command:
For the AMHS where there was a discrepancy I then ran:
wm<- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open"
value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format
is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would
appreciate
advice on how this problem can be detected in future without my having
to
manually check raw data against R output.
Any assistance is appreciated,
Bob
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
--
_________________________________________________________________
David Scott Department of Statistics
The University of Auckland, PB 92019
Auckland 1142, NEW ZEALAND
Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055
Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018
Director of Consulting, Department of Statistics
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
I am a bit confused by this. You are doing a transfer from Excel (.xls or .xlsx) to .csv, then a subset in R and ending up with a couple of entries which are " Open" rather than "Open". So where are they coming from? You say they are not in the original Excel, so that suggests the transfer to .csv is the problem. I would be very surprised if the subset was a problem, but as others have indicated transfer to .csv can be downright ugly. You can check the .csv file by opening it in an editor (I use Emacs). Just go to the line and have a look if the extra space is there nestling between two commas. The other advice is, don't go through .csv. Go directly from Excel to R. My favourite tools are RODBC and xlsReadWrite for that step. Both work extremely well. As others have indicated, the big bugbear in the .csv route is dates, or what Excel decides are dates. My experience was the conversion of New Zealand health ID numbers to dates. They are three letters then 4 digits, so AUG2699 became a date. David Scott
On 14/01/2011 10:58 p.m., bgreen at dyson.brisnet.org.au wrote:
Hello David, Thanks for your e-mail. The data was a report derived from a statewide database, saved in EXCEL format, so the usual issue of the vagaries of human data entry variation wasn't the issue as the data was an automated report, which is run every three months. I would not have even noticed this problem if I hadn't been double checking the numbers of people by district. Visual inspection didn't reveal this problem - no white space was obvious and the spelling was identical. Tabulation via R wouldn't have detected this - I was obtaining the EXCEL totals via filter which I then compared with R output. I'm hoping I can skip this step, in future, with Jim's suggestion. regards Bob
As a further note, this is a reminder that whenever you get data via a
spreadsheet the first thing to do is examine it and clean up any
problems. A basic requirement is to tabulate any categorical variable.
Spreadsheets allow any sort of data to be entered, with no controls. My
experience is that those who enter data into spreadsheets enter all
sorts of variations of what a human would wish to treat as the same
("Open", "Open ", "open", etc.), even when told not to.
David Scott
On 14/01/2011 4:03 p.m., Jim Holtman wrote:
try strip.white=TRUE to strip out white space Sent from my iPad On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
I have a frustrating issue which I am hoping someone may have a
suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent as
a
csv file.
The initial code I ran follows.
dec<- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open<- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual
count
and R output. Two subject's rows were not being detected by the subset
command:
For the AMHS where there was a discrepancy I then ran:
wm<- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open" value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would
appreciate
advice on how this problem can be detected in future without my having
to
manually check raw data against R output.
Any assistance is appreciated,
Bob
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
--
_________________________________________________________________ David Scott Department of Statistics The University of Auckland, PB 92019 Auckland 1142, NEW ZEALAND Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055 Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018 Director of Consulting, Department of Statistics
_________________________________________________________________ David Scott Department of Statistics The University of Auckland, PB 92019 Auckland 1142, NEW ZEALAND Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055 Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018 Director of Consulting, Department of Statistics
David, Thanks. When I am back at work I will try to find out some specifics regarding the original data base and how the reports are generated. The differencs are not apparent via manual inspection. I will look at the csv file in an editor as well and look into xlsReadWrite. I agree EXCEL formatting of dates can be a problem. regards Bob
I am a bit confused by this. You are doing a transfer from Excel (.xls or .xlsx) to .csv, then a subset in R and ending up with a couple of entries which are " Open" rather than "Open". So where are they coming from? You say they are not in the original Excel, so that suggests the transfer to .csv is the problem. I would be very surprised if the subset was a problem, but as others have indicated transfer to .csv can be downright ugly. You can check the .csv file by opening it in an editor (I use Emacs). Just go to the line and have a look if the extra space is there nestling between two commas. The other advice is, don't go through .csv. Go directly from Excel to R. My favourite tools are RODBC and xlsReadWrite for that step. Both work extremely well. As others have indicated, the big bugbear in the .csv route is dates, or what Excel decides are dates. My experience was the conversion of New Zealand health ID numbers to dates. They are three letters then 4 digits, so AUG2699 became a date. David Scott On 14/01/2011 10:58 p.m., bgreen at dyson.brisnet.org.au wrote:
Hello David, Thanks for your e-mail. The data was a report derived from a statewide database, saved in EXCEL format, so the usual issue of the vagaries of human data entry variation wasn't the issue as the data was an automated report, which is run every three months. I would not have even noticed this problem if I hadn't been double checking the numbers of people by district. Visual inspection didn't reveal this problem - no white space was obvious and the spelling was identical. Tabulation via R wouldn't have detected this - I was obtaining the EXCEL totals via filter which I then compared with R output. I'm hoping I can skip this step, in future, with Jim's suggestion. regards Bob
As a further note, this is a reminder that whenever you get data via a
spreadsheet the first thing to do is examine it and clean up any
problems. A basic requirement is to tabulate any categorical variable.
Spreadsheets allow any sort of data to be entered, with no controls. My
experience is that those who enter data into spreadsheets enter all
sorts of variations of what a human would wish to treat as the same
("Open", "Open ", "open", etc.), even when told not to.
David Scott
On 14/01/2011 4:03 p.m., Jim Holtman wrote:
try strip.white=TRUE to strip out white space Sent from my iPad On Jan 13, 2011, at 21:44, bgreen at dyson.brisnet.org.au wrote:
I have a frustrating issue which I am hoping someone may have a
suggestion
about.
I am running XP and R 2.12.0 and saved an EXCEL file that I was sent
as
a
csv file.
The initial code I ran follows.
dec<- read.csv("g://FMH/FO30122010.csv",header=T)
dec.open<- subset (dec, Status == "Open")
table(dec.open$AMHS)
I was checking the output and noticed a difference between my manual
count
and R output. Two subject's rows were not being detected by the
subset
command:
For the AMHS where there was a discrepancy I then ran:
wm<- subset (dec, AMHS == "WM")
The problem appears to be that there is a space before the 'Open"
value
for two indivduals, as per the example below.
10/02/2010 Open
22/08/2007 Open
Checking in EXCEL there does not appear to be a space and the format
is
the same (e.g 'general'). I resolved the problem by copying over the
values for the two individuals where I identified a problem.
Given this problem was not detected by visual scanning I would
appreciate
advice on how this problem can be detected in future without my
having
to
manually check raw data against R output.
Any assistance is appreciated,
Bob
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
--
_________________________________________________________________ David Scott Department of Statistics The University of Auckland, PB 92019 Auckland 1142, NEW ZEALAND Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055 Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018 Director of Consulting, Department of Statistics
--
_________________________________________________________________ David Scott Department of Statistics The University of Auckland, PB 92019 Auckland 1142, NEW ZEALAND Phone: +64 9 923 5055, or +64 9 373 7599 ext 85055 Email: d.scott at auckland.ac.nz, Fax: +64 9 373 7018 Director of Consulting, Department of Statistics
2 days later
David Scott wrote:
As a further note, this is a reminder that whenever you get data via a spreadsheet the first thing to do is examine it and clean up any problems. A basic requirement is to tabulate any categorical variable.
I like using the ?describe? function in the ?Hmisc? package for this. If you run the result through the ?latex? function, you get an even nicer output, with small histograms for each numerical variable.
Karl Ove Hufthammer
Peter Ehlers wrote:
It is hardly R's fault that Excel users routinely commit crimes against data.
A ?fortune? candidate?
Karl Ove Hufthammer