As a tangent to this thread, there is a very relevant article in the
latest issue of the RSS magazine "Significance", which I have just
received:
Dr Fisher's Casebook
The trouble with data
Significance, Vol 4 (2007) Issue 2.
Full current contents at
http://www.blackwell-synergy.com/toc/sign/4/2
but unfortunately you can only read any of it by paying money to
Blackwell (unless you're an RSS member).
Best wishes to all,
Ted.
A lovely article. I'm not a member, but the local university has a
subscription.
The examples of "men who claimed to have cervical smears" (F) and
"women who were 5' tall weighing 15 stone" (T) ring true.
I've found people walking at 30 km/hr (F) and an addict using 240
needles a month (T). I've even found a set of 16 variables the study
designers had never heard of!
[ Arrggh, not "reply" but "reply to all"; crossing my fingers again. Sorry, Peter! ]
Hmm,
I don't think you need a retain statement.
if first.patientID ;
or
if last.patientID ;
ought to do it.
It's actually better than the Vilno version, I must admit; a bit more concise:
if ( not firstrow(patientID) ) deleterow ;
Ah well.
**********************************
For the folks asking for the location of the software (I know I posted
it, but it didn't connect to the thread, and you get a huge number of
posts each day; sorry):
Vilno , find at
http://code.google.com/p/vilno
DAP & PSPP, find at
http://directory.fsf.org/math/stats
Awk, find at lots of places,
http://www.gnu.org/software/gawk/gawk.html
Anything else? DAP & PSPP are hard to find; I'm sure there's more out there!
What about MDX? Nahh, not really the right problem domain.
Nobody uses MDX for this stuff.
******************************************************
If my examples using clinical trial data are boring and hard to
understand for those who asked for examples (and who presumably don't
work in clinical trials), let me know. Some of these other examples
I'm reading about are quite interesting.
It doesn't help that clinical trial databases cannot be public. Making
a fake database would take a lot of time.
The irony is, even with my deep understanding of data preparation in
clinical trials, the pharmas still don't want to give me a job
(because I was gone for many years).
********************************************************
Let's see if this post works: thanks to the folks who gave me advice
on how to properly respond to a post within a thread. (Although the
thread in my gmail account is only a subset of the posts visible in
the archives.) Crossing my fingers ....
On 6/10/07, Peter Dalgaard <p.dalgaard at biostat.ku.dk> wrote:
Douglas Bates wrote:
Frank Harrell indicated that it is possible to do a lot of difficult
data transformation within R itself if you try hard enough but that
sometimes means working against the S language and its "whole object"
view to accomplish what you want and it can require knowledge of
subtle aspects of the S language.
Actually, I think Frank's point was subtly different: It is *because* of
the differences in view that it sometimes seems difficult to find the
way to do something in R that is apparently straightforward in SAS.
I.e. the solutions exist and are often elegant, but may require some
lateral thinking.
Case in point: Finding the first or the last observation for each
subject when there are multiple records for each subject. The SAS way
would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
you can compare the subject ID with the one from the previous record,
working with data that are sorted appropriately.
You can do the same thing in R with a for loop, but there are better
ways e.g.
subset(df, !duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or
maybe
do.call("rbind",lapply(split(df,df$ID), head, 1)), resp. tail. Or
something involving aggregate(). (The latter approaches generalize
better to other within-subject functionals like cumulative doses, etc.).
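For concreteness, here is a small self-contained sketch of those idioms (the data frame `df` and its `ID` column are invented for illustration):

```r
# Toy data: multiple records per subject, sorted by ID
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3), x = 1:6)

# First record for each subject
first <- subset(df, !duplicated(ID))

# Last record for each subject: reverse, mark first occurrences, reverse back
last <- subset(df, rev(!duplicated(rev(ID))))

# Equivalent split/apply/combine form for the first record
first2 <- do.call("rbind", lapply(split(df, df$ID), head, 1))
```

The split/apply/combine form is more typing here, but swapping `head` for another function generalizes it to arbitrary within-subject summaries.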
The hardest cases that I know of are the ones where you need to turn one
record into many, such as occurs in survival analysis with
time-dependent, piecewise constant covariates. This may require
"transposing the problem", i.e. for each interval you find out which
subjects contribute and with what, whereas the SAS way would be a
within-subject loop over intervals containing an OUTPUT statement.
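A minimal sketch of that "one record into many" expansion; the data frame and the single 0/1 covariate switch are invented for illustration (in practice survSplit() in the survival package automates much of this):

```r
# One record per subject: follow-up time, event indicator, and the time
# at which a 0/1 covariate switches on (NA = it never switches)
d <- data.frame(id = 1:3, stop = c(5, 8, 10), event = c(1, 0, 1),
                switch = c(3, NA, 7))

# Expand a single record into (start, stop] intervals, with the
# covariate z constant within each interval
expand1 <- function(r) {
  if (is.na(r$switch) || r$switch >= r$stop)
    return(data.frame(id = r$id, start = 0, stop = r$stop,
                      z = 0, event = r$event))
  rbind(data.frame(id = r$id, start = 0, stop = r$switch,
                   z = 0, event = 0),
        data.frame(id = r$id, start = r$switch, stop = r$stop,
                   z = 1, event = r$event))
}
long <- do.call("rbind", lapply(split(d, d$id), expand1))
```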
Also, there are some really weird data formats, where e.g. the input
format is different in different records. Back in the 80s, when
punched-card input was still common, it was quite popular to have one
card with background information on a patient plus several cards
detailing visits, and you'd get a stack of cards containing both kinds.
In R you would most likely split on the card type using grep() and then
read the two kinds separately and merge() them later.
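A sketch of that split-and-merge approach, using an invented two-card format ("P" = patient card, "V" = visit card) and grepl() rather than grep() to pick out the lines:

```r
# A stack of "cards" of two kinds mixed together
lines <- c("P,1,Smith",
           "V,1,2007-01-05,120",
           "P,2,Jones",
           "V,2,2007-02-01,115",
           "V,2,2007-03-01,118")

# Split on the card type, read each kind separately, then merge
pat <- read.csv(textConnection(lines[grepl("^P", lines)]),
                header = FALSE, col.names = c("type", "id", "name"))
vis <- read.csv(textConnection(lines[grepl("^V", lines)]),
                header = FALSE, col.names = c("type", "id", "date", "bp"))
m <- merge(pat[-1], vis[-1], by = "id")   # one row per visit, background attached
```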
I am posting to this thread, which has been quiet for some time,
because I remembered the following question.
Christophe Pallier wrote:
Hi,
Can you provide examples of data formats that are problematic to read and
clean with R ?
Today I had a data manipulation problem that I didn't know how to do
in R, so I solved it with Perl. Since I'm always interested in
learning more about complex data manipulation in R, I am posting my
problem in the hope of receiving some hints for doing this in R.
If anyone has nothing better to do than play with other people's data,
I would be happy to send the raw files off-list.
Background:
I have been given data that contains two measurements of left
ventricular ejection fraction. One of the methods is echocardiogram
which sometimes gives a true quantitative value and other times a
semi-quantitative value. The desire is to compare echo with the
other method (MUGA). In most cases, patients had either quantitative
or semi-quantitative. Some patients had both. The data came
to me in Excel files with, basically, no patient identifiers to link
the "both" with the semi-quantitative patients (the "both" patients
were in multiple data sets).
What I wanted to do was extract from the semi-quantitative data file
those patients with only semi-quantitative. All I have to link with
are the semi-quantitative echo and the MUGA and these pairs of values
are not unique.
To make this more concrete, here are some portions of the raw data.
"Both"
"ID NUM","ECHO","MUGA","Semiquant","Quant"
"B",12,37,10,12
"D",13,13,10,13
"E",13,26,10,15
"F",13,31,10,13
"H",15,15,10,15
"I",15,21,10,15
"J",15,22,10,15
"K",17,22,10,17
"N",17.5,4,10,17.5
"P",18,25,10,18
"R",19,25,10,19
Semi-quantitative
"echo","muga","quant"
10,20,0 <-- keep
10,20,0 <-- keep
10,21,0 <-- remove
10,21,0 <-- keep
10,24,0 <-- keep
10,25,0 <-- remove
10,25,0 <-- remove
10,25,0 <-- keep
Here is the perl program I wrote for this.
#!/usr/bin/perl
open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row;
$_ = <BOTH>;
while(<BOTH>) {
chomp;
($id, $e, $m, $sq, $qu) = split(/,/);
$both{$sq,$m}++;
}
close(BOTH);
open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 2001;
open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
# Discard first row
$_ = <QUAL>;
while(<QUAL>) {
chomp;
($echo, $muga, $quant) = split(/,/);
if ($both{$echo,$muga} > 0) {
$both{$echo,$muga}--;
}
else {
print OUT "$pid,$echo,$muga,$quant\n";
$pid++;
}
}
close(QUAL);
close(OUT);
open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 3001;
open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row;
$_ = <BOTH>;
while(<BOTH>) {
chomp;
($id, $e, $m, $sq, $qu) = split(/,/);
print OUT "$pid,$sq,$m,0\n";
print OUT "$pid,$qu,$m,1\n";
$pid++;
}
close(BOTH);
close(OUT);
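The same multiset subtraction can be sketched in R; the data frames below are just the sample rows shown earlier in this post, not the real files:

```r
# Sample "Both" rows: the (Semiquant, MUGA) pairs to subtract
both <- data.frame(semiquant = rep(10, 11),
                   muga = c(37, 13, 26, 31, 15, 21, 22, 22, 4, 25, 25))

# Sample semi-quantitative rows
qual <- data.frame(echo = rep(10, 8),
                   muga = c(20, 20, 21, 21, 24, 25, 25, 25),
                   quant = 0)

# Count how often each (echo, muga) pair occurs in `both` ...
n.both <- table(paste(both$semiquant, both$muga))

# ... number the occurrences within each (echo, muga) group of `qual` ...
occ <- ave(seq_len(nrow(qual)), qual$echo, qual$muga, FUN = seq_along)

# ... and keep a row only once the matches from `both` are used up
hits <- as.vector(n.both[paste(qual$echo, qual$muga)])
hits[is.na(hits)] <- 0
qual.only <- qual[occ > hits, ]
```

Like the Perl hash, this consumes the earliest matching rows first, so it reproduces the keep/remove pattern annotated in the sample data above.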
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Department of Public Health Sciences
Faculty of Medicine, University of Toronto
email: kevin.thorpe at utoronto.ca Tel: 416.864.5776 Fax: 416.864.6057