Tools For Preparing Data For Analysis

29 messages · Robert Wilkins, Robert Duval, Frank E Harrell Jr +15 more

#
--- ted.harding at nessie.mcc.ac.uk wrote:

            
A lovely article.  I'm not a member but the local
university has a subscription.  

The examples of "men who claimed to have cervical
smears (F) and women who were 5' tall weighing 15
stone (T)" ring true.

I've found people walking at 30 km/hr (F) and an
addict using 240 needles a month (T). I've even found
a set of 16 variables the study designers never heard
of!
#
[Arrggh, not reply, but reply to all; crossing my fingers again. Sorry, Peter!]

Hmm,

I don't think you need a retain statement.

if first.patientID ;
or
if last.patientID ;

ought to do it.

It's actually better than the Vilno version, I must admit, a bit more concise:

if ( not firstrow(patientID) ) deleterow ;

Ah well.
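
For readers who use neither SAS nor Vilno, here is a rough Python sketch of what the first-row and last-row filters above do. It assumes the rows are already sorted by patient ID, as a SAS BY statement would require. The records, the `visit` column, and the helper names `first_rows` and `last_rows` are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical visit records, already sorted by patientID.
rows = [
    {"patientID": 101, "visit": 1},
    {"patientID": 101, "visit": 2},
    {"patientID": 102, "visit": 1},
    {"patientID": 103, "visit": 1},
    {"patientID": 103, "visit": 2},
    {"patientID": 103, "visit": 3},
]


def first_rows(records, key):
    """Keep the first record of each run of equal keys (like `if first.patientID;`)."""
    return [next(group) for _, group in groupby(records, key=itemgetter(key))]


def last_rows(records, key):
    """Keep the last record of each run of equal keys (like `if last.patientID;`)."""
    return [list(group)[-1] for _, group in groupby(records, key=itemgetter(key))]


firsts = first_rows(rows, "patientID")
lasts = last_rows(rows, "patientID")
```

Note that `groupby` only groups adjacent rows, so unsorted input would need sorting by the key first.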

**********************************
For the folks asking for the location of the software (I know I posted it, but
it didn't connect to the thread, and you get a huge number of posts
each day, sorry):

Vilno, find at
http://code.google.com/p/vilno

DAP & PSPP, find at
http://directory.fsf.org/math/stats

Awk, find at lots of places,
http://www.gnu.org/software/gawk/gawk.html

Anything else? DAP & PSPP are hard to find, I'm sure there's more out there!
What about MDX? Nahh, not really the right problem domain.
Nobody uses MDX for this stuff.

******************************************************

If my examples, which use clinical trial data, are boring or hard to
understand for those who asked for examples
(and presumably don't work in clinical trials), let me
know. Some of the other examples I'm reading about are quite interesting.
It doesn't help that clinical trial databases cannot be made public, and making
a fake database would take a lot of time.
The irony is that, even with my deep understanding of data preparation in
clinical trials,
the pharmas still don't want to give me a job (because I was gone for
many years).

********************************************************
Let's see if this post works: thanks to the folks who gave me advice
on how to properly respond to a post within a thread. (Although the
thread in my gmail account is only a subset of the posts visible in
the archives.) Crossing my fingers ....
On 6/10/07, Peter Dalgaard <p.dalgaard at biostat.ku.dk> wrote:
7 days later
#
I am posting to this thread that has been quiet for some time because I
remembered the following question.
Christophe Pallier wrote:
Today I had a data manipulation problem that I don't know how to do in R
so I solved it with perl.  Since I'm always interested in learning more
about complex data manipulation in R I am posting my problem in the
hopes of receiving some hints for doing this in R.

If anyone has nothing better to do than play with other people's data,
I would be happy to send the raw files off-list.

Background:

I have been given data that contains two measurements of left
ventricular ejection fraction.  One of the methods is echocardiogram
which sometimes gives a true quantitative value and other times a
semi-quantitative value.  The desire is to compare echo with the
other method (MUGA).  In most cases, patients had either quantitative
or semi-quantitative.  Some patients had both.  The data came
to me in excel files with, basically, no patient identifiers to link
the "both" with the semi-quantitative patients (the "both" patients
were in multiple data sets).

What I wanted to do was extract from the semi-quantitative data file
those patients with only semi-quantitative.  All I have to link with
are the semi-quantitative echo and the MUGA and these pairs of values
are not unique.

To make this more concrete, here are some portions of the raw data.

"Both"

"ID NUM","ECHO","MUGA","Semiquant","Quant"
"B",12,37,10,12
"D",13,13,10,13
"E",13,26,10,15
"F",13,31,10,13
"H",15,15,10,15
"I",15,21,10,15
"J",15,22,10,15
"K",17,22,10,17
"N",17.5,4,10,17.5
"P",18,25,10,18
"R",19,25,10,19

Semi-quantitative

"echo","muga","quant"
10,20,0      <-- keep
10,20,0      <-- keep
10,21,0      <-- remove
10,21,0      <-- keep
10,24,0      <-- keep
10,25,0      <-- remove
10,25,0      <-- remove
10,25,0      <-- keep

Here is the perl program I wrote for this.

#!/usr/bin/perl

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard the header row
$_ = <BOTH>;
# Count how many "both" patients share each (semi-quant echo, MUGA) pair
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    $both{$sq,$m}++;
}
close(BOTH);

open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 2001;

open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
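# Discard the header row
$_ = <QUAL>;
while(<QUAL>) {
    chomp;
    ($echo, $muga, $quant) = split(/,/);
    # A matching pair is treated as a "both" patient: skip the row and
    # use up one count, so each "both" patient removes at most one row
    if ($both{$echo,$muga} > 0) {
        $both{$echo,$muga}--;
    }
    # Otherwise the patient is semi-quantitative only: keep the row
    else {
        print OUT "$pid,$echo,$muga,$quant\n";
        $pid++;
    }
}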
close(QUAL);
close(OUT);

open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 3001;

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
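# Discard the header row
$_ = <BOTH>;
# Write two rows per "both" patient: semi-quantitative (quant=0)
# and quantitative (quant=1), sharing one pid
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    print OUT "$pid,$sq,$m,0\n";
    print OUT "$pid,$qu,$m,1\n";
    $pid++;
}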
close(BOTH);
close(OUT);
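
The matching step in the program above amounts to a multiset difference: each "both" patient cancels at most one semi-quantitative row with the same (echo, MUGA) pair. A minimal Python sketch of that idea follows. It is not the R solution being asked for. It uses the sample rows listed earlier, held in memory instead of read from the CSV files, and the function name `semi_only` is made up for illustration.

```python
from collections import Counter

# (Semiquant, MUGA) pairs taken from the "Both" listing above.
both_pairs = [(10, 37), (10, 13), (10, 26), (10, 31), (10, 15), (10, 21),
              (10, 22), (10, 22), (10, 4), (10, 25), (10, 25)]

# (echo, muga, quant) rows taken from the semi-quantitative listing above.
semi_rows = [(10, 20, 0), (10, 20, 0), (10, 21, 0), (10, 21, 0),
             (10, 24, 0), (10, 25, 0), (10, 25, 0), (10, 25, 0)]


def semi_only(semi, both):
    """Drop one semi-quantitative row per matching "both" pair; keep the rest."""
    remaining = Counter(both)  # how many "both" patients share each pair
    kept = []
    for echo, muga, quant in semi:
        if remaining[(echo, muga)] > 0:
            remaining[(echo, muga)] -= 1  # treated as a "both" patient: skip
        else:
            kept.append((echo, muga, quant))
    return kept


kept = semi_only(semi_rows, both_pairs)
```

As in the Perl version, the first matching rows are the ones dropped. Because the pairs are not unique, the choice of which duplicate to remove is arbitrary.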