Skip to content

How to remove rows based on frequency of factor and then difference date scores

7 messages · David Winsemius, Abhijit Dasgupta, Chris Beeley

#
Hello-

A basic question which has nonetheless floored me entirely. I have a
dataset which looks like this:

Type  ID     Date            Value
A       1    16/09/2020       8
A       1     23/09/2010      9
B       3     18/8/2010        7
B       1     13/5/2010        6

There are two Types, which correspond to different individuals in
different conditions, and loads of ID labels (1:50) corresponding to
the different individuals in each condition, and measurements at
different times (from 1 to 10 measurements) for each individual.

I want to perform the following operations:

1) Delete all individuals for whom only one measurement is available.
In the dataset above, you can see that I want to delete the row Type B
ID 3, and Type B ID 1, but without deleting the Type A ID 1 data
because there is more than one measurement for Type A ID 1 (but not
for Type B ID1)

2) Produce difference scores for each of the Dates, so each individual
(Type A ID1 and all the others for whom more than one measurement
exists) starts at Date "1" and goes up in integers according to how
many days have elapsed.

I just know there's some incredibly cunning R-ish way of doing this
but after many hours of fiddling I have had to admit defeat.

I would be very grateful for any words of advice.

Many thanks,
Chris Beeley,
Institute of Mental Health, UK
#
On Aug 24, 2010, at 1:19 PM, Chris Beeley wrote:

            
Not sure about terribly cunning. Let's assume your dataframe was read  
in with stringsAsFactors=FALSE and is called txt.df:


 > txt.df$dt2 <- as.Date(txt.df$Date, format="%d/%m/%Y")
 > txt.df
   Type ID       Date Value        dt2
1    A  1 16/09/2020     8 2020-09-16
2    A  1 23/09/2010     9 2010-09-23
3    B  3  18/8/2010     7 2010-08-18
4    B  1  13/5/2010     6 2010-05-13

 > txt.df$nn <- ave(txt.df$ID,txt.df$ID, FUN=length)
 > txt.df
   Type ID       Date Value        dt2 nn
1    A  1 16/09/2020     8 2020-09-16  3
2    A  1 23/09/2010     9 2010-09-23  3
3    B  3  18/8/2010     7 2010-08-18  1
4    B  1  13/5/2010     6 2010-05-13  3
 > txt.df[ -which( txt.df$nn <=1), ]
   Type ID       Date Value        dt2 nn
1    A  1 16/09/2020     8 2020-09-16  3
2    A  1 23/09/2010     9 2010-09-23  3
4    B  1  13/5/2010     6 2010-05-13  3

# Task #1 accomplished

 > tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x)
$`1`
Time differences in days
[1]    0 3646 3779

$`3`
Time difference of 0 days

 > unlist( tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x) )
   11   12   13    3
    0 3646 3779    0
 > txt.df$diffdays <- unlist( tapply(txt.df$dt2, txt.df$ID,  
function(x) x[1] -x) )
 > txt.df
   Type ID       Date Value        dt2 nn diffdays
1    A  1 16/09/2020     8 2020-09-16  3        0
2    A  1 23/09/2010     9 2010-09-23  3     3646
3    B  3  18/8/2010     7 2010-08-18  1     3779
4    B  1  13/5/2010     6 2010-05-13  3        0
 >
David Winsemius, MD
West Hartford, CT
#
An answer to 1)

 > x = data.frame(Type=c('A','A','B','B'), ID=c(1,1,3,1), Date = 
c('16/09/2010','23/09/2010','18/8/2010','13/5/2010'), Value=c(8,9,7,6))
 > x
   Type ID       Date Value
1    A  1 16/09/2010     8
2    A  1 23/09/2010     9
3    B  3  18/8/2010     7
4    B  1  13/5/2010     6
 > x$Date = as.Date(x$Date,format='%d/%m/%Y')
 > library(plyr)
 > x$uniqueID = paste(x$Type, x$ID, sep='')
 > nobs = daply(x, ~uniqueID, nrow)
 > keep = names(nobs)[nobs>1]
 > newx = x[x$uniqueID %in% keep,]

An answer to 2)
 > require(plyr)
 > ddply(newx, ~uniqueID, transform, newDate = as.numeric(Date - 
min(Date)+1))
On 08/24/2010 01:19 PM, Chris Beeley wrote:

  
    
#
The only problem with this is that Chris's unique individuals are a 
combination of Type and ID, as I understand it. So Type=A, ID=1 is a 
different individual from Type=B,ID=1. So we need to create a unique 
identifier per person, simplistically by uniqueID=paste(Type, ID, 
sep=''). Then, using this new identifier, everything follows.
On 08/24/2010 01:53 PM, David Winsemius wrote:

  
    
#
On Aug 24, 2010, at 1:59 PM, Abhijit Dasgupta, PhD wrote:

            
I see your point. I agree that a tapply method should present both  
factors in the indices argument.

 > new.df <- txt.df[ -which( txt.df$nn <=1), ]
 > new.df <- new.df[ with(new.df, order(Type, ID) ), ]  # and possibly  
needs to be ordered?
 > new.df$diffdays <- unlist( tapply(new.df$dt2, list(new.df$ID, new.df 
$Type), function(x) x[1] -x) )
 > new.df
   Type ID       Date Value        dt2 nn diffdays
1    A  1 16/09/2020     8 2020-09-16  3        0
2    A  1 23/09/2010     9 2010-09-23  3     3646
4    B  1  13/5/2010     6 2010-05-13  3        0

But do not agree that you need, in this case at least, to create a  
paste()-y index. Agreed, however, such a construction can be useful in  
other situations.
#
The paste-y argument is my usual trick in these situations. I forget 
that tapply can take multiple ordering arguments :)

Abhijit
On 08/24/2010 02:17 PM, David Winsemius wrote:

  
    
#
Many thanks to you both. I have now filed away for future reference the 2 factor tapply as well as the extremely useful looking plyr library. And the code worked beautifully :-)
On 24 Aug 2010, at 19:47, "Abhijit Dasgupta, PhD" <aikidasgupta at gmail.com> wrote: