Hello- A basic question which has nonetheless floored me entirely. I have a dataset which looks like this: Type ID Date Value A 1 16/09/2020 8 A 1 23/09/2010 9 B 3 18/8/2010 7 B 1 13/5/2010 6 There are two Types, which correspond to different individuals in different conditions, and loads of ID labels (1:50) corresponding to the different individuals in each condition, and measurements at different times (from 1 to 10 measurements) for each individual. I want to perform the following operations: 1) Delete all individuals for whom only one measurement is available. In the dataset above, you can see that I want to delete the row Type B ID 3, and Type B ID 1, but without deleting the Type A ID 1 data because there is more than one measurement for Type A ID 1 (but not for Type B ID1) 2) Produce difference scores for each of the Dates, so each individual (Type A ID1 and all the others for whom more than one measurement exists) starts at Date "1" and goes up in integers according to how many days have elapsed. I just know there's some incredibly cunning R-ish way of doing this but after many hours of fiddling I have had to admit defeat. I would be very grateful for any words of advice. Many thanks, Chris Beeley, Institute of Mental Health, UK
How to remove rows based on frequency of factor and then difference date scores
7 messages · David Winsemius, Abhijit Dasgupta, Chris Beeley
On Aug 24, 2010, at 1:19 PM, Chris Beeley wrote:
Hello- A basic question which has nonetheless floored me entirely. I have a dataset which looks like this: Type ID Date Value A 1 16/09/2020 8 A 1 23/09/2010 9 B 3 18/8/2010 7 B 1 13/5/2010 6 There are two Types, which correspond to different individuals in different conditions, and loads of ID labels (1:50) corresponding to the different individuals in each condition, and measurements at different times (from 1 to 10 measurements) for each individual. I want to perform the following operations: 1) Delete all individuals for whom only one measurement is available. In the dataset above, you can see that I want to delete the row Type B ID 3, and Type B ID 1, but without deleting the Type A ID 1 data because there is more than one measurement for Type A ID 1 (but not for Type B ID1) 2) Produce difference scores for each of the Dates, so each individual (Type A ID1 and all the others for whom more than one measurement exists) starts at Date "1" and goes up in integers according to how many days have elapsed. I just know there's some incredibly cunning R-ish way of doing this but after many hours of fiddling I have had to admit defeat.
Not sure about terribly cunning. Let's assume your dataframe was read
in with stringsAsFactors=FALSE and is called txt.df:
> txt.df$dt2 <- as.Date(txt.df$Date, format="%d/%m/%Y")
> txt.df
Type ID Date Value dt2
1 A 1 16/09/2020 8 2020-09-16
2 A 1 23/09/2010 9 2010-09-23
3 B 3 18/8/2010 7 2010-08-18
4 B 1 13/5/2010 6 2010-05-13
> txt.df$nn <- ave(txt.df$ID,txt.df$ID, FUN=length)
> txt.df
Type ID Date Value dt2 nn
1 A 1 16/09/2020 8 2020-09-16 3
2 A 1 23/09/2010 9 2010-09-23 3
3 B 3 18/8/2010 7 2010-08-18 1
4 B 1 13/5/2010 6 2010-05-13 3
> txt.df[ -which( txt.df$nn <=1), ]
Type ID Date Value dt2 nn
1 A 1 16/09/2020 8 2020-09-16 3
2 A 1 23/09/2010 9 2010-09-23 3
4 B 1 13/5/2010 6 2010-05-13 3
# Task #1 accomplished
> tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x)
$`1`
Time differences in days
[1] 0 3646 3779
$`3`
Time difference of 0 days
> unlist( tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x) )
11 12 13 3
0 3646 3779 0
> txt.df$diffdays <- unlist( tapply(txt.df$dt2, txt.df$ID,
function(x) x[1] -x) )
> txt.df
Type ID Date Value dt2 nn diffdays
1 A 1 16/09/2020 8 2020-09-16 3 0
2 A 1 23/09/2010 9 2010-09-23 3 3646
3 B 3 18/8/2010 7 2010-08-18 1 3779
4 B 1 13/5/2010 6 2010-05-13 3 0
>
I would be very grateful for any words of advice. Many thanks, Chris Beeley, Institute of Mental Health, UK
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT
An answer to 1)
> x = data.frame(Type=c('A','A','B','B'), ID=c(1,1,3,1), Date =
c('16/09/2010','23/09/2010','18/8/2010','13/5/2010'), Value=c(8,9,7,6))
> x
Type ID Date Value
1 A 1 16/09/2010 8
2 A 1 23/09/2010 9
3 B 3 18/8/2010 7
4 B 1 13/5/2010 6
> x$Date = as.Date(x$Date,format='%d/%m/%Y')
> library(plyr)
> x$uniqueID = paste(x$Type, x$ID, sep='')
> nobs = daply(x, ~uniqueID, nrow)
> keep = names(nobs)[nobs>1]
> newx = x[x$uniqueID %in% keep,]
An answer to 2)
> require(plyr)
> ddply(newx, ~uniqueID, transform, newDate = as.numeric(Date -
min(Date)+1))
On 08/24/2010 01:19 PM, Chris Beeley wrote:
Hello- A basic question which has nonetheless floored me entirely. I have a dataset which looks like this: Type ID Date Value A 1 16/09/2020 8 A 1 23/09/2010 9 B 3 18/8/2010 7 B 1 13/5/2010 6 There are two Types, which correspond to different individuals in different conditions, and loads of ID labels (1:50) corresponding to the different individuals in each condition, and measurements at different times (from 1 to 10 measurements) for each individual. I want to perform the following operations: 1) Delete all individuals for whom only one measurement is available. In the dataset above, you can see that I want to delete the row Type B ID 3, and Type B ID 1, but without deleting the Type A ID 1 data because there is more than one measurement for Type A ID 1 (but not for Type B ID1) 2) Produce difference scores for each of the Dates, so each individual (Type A ID1 and all the others for whom more than one measurement exists) starts at Date "1" and goes up in integers according to how many days have elapsed. I just know there's some incredibly cunning R-ish way of doing this but after many hours of fiddling I have had to admit defeat. I would be very grateful for any words of advice. Many thanks, Chris Beeley, Institute of Mental Health, UK
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Abhijit Dasgupta, PhD Director and Principal Statistician ARAASTAT Ph: 301.385.3067 E: adasgupta at araastat.com W: http://www.araastat.com
The only problem with this is that Chris's unique individuals are a combination of Type and ID, as I understand it. So Type=A, ID=1 is a different individual from Type=B,ID=1. So we need to create a unique identifier per person, simplistically by uniqueID=paste(Type, ID, sep=''). Then, using this new identifier, everything follows.
On 08/24/2010 01:53 PM, David Winsemius wrote:
On Aug 24, 2010, at 1:19 PM, Chris Beeley wrote:
Hello- A basic question which has nonetheless floored me entirely. I have a dataset which looks like this: Type ID Date Value A 1 16/09/2020 8 A 1 23/09/2010 9 B 3 18/8/2010 7 B 1 13/5/2010 6 There are two Types, which correspond to different individuals in different conditions, and loads of ID labels (1:50) corresponding to the different individuals in each condition, and measurements at different times (from 1 to 10 measurements) for each individual. I want to perform the following operations: 1) Delete all individuals for whom only one measurement is available. In the dataset above, you can see that I want to delete the row Type B ID 3, and Type B ID 1, but without deleting the Type A ID 1 data because there is more than one measurement for Type A ID 1 (but not for Type B ID1) 2) Produce difference scores for each of the Dates, so each individual (Type A ID1 and all the others for whom more than one measurement exists) starts at Date "1" and goes up in integers according to how many days have elapsed. I just know there's some incredibly cunning R-ish way of doing this but after many hours of fiddling I have had to admit defeat.
Not sure about terribly cunning. Let's assume your dataframe was read in with stringsAsFactors=FALSE and is called txt.df:
txt.df$dt2 <- as.Date(txt.df$Date, format="%d/%m/%Y") txt.df
Type ID Date Value dt2 1 A 1 16/09/2020 8 2020-09-16 2 A 1 23/09/2010 9 2010-09-23 3 B 3 18/8/2010 7 2010-08-18 4 B 1 13/5/2010 6 2010-05-13
txt.df$nn <- ave(txt.df$ID,txt.df$ID, FUN=length) txt.df
Type ID Date Value dt2 nn 1 A 1 16/09/2020 8 2020-09-16 3 2 A 1 23/09/2010 9 2010-09-23 3 3 B 3 18/8/2010 7 2010-08-18 1 4 B 1 13/5/2010 6 2010-05-13 3
txt.df[ -which( txt.df$nn <=1), ]
Type ID Date Value dt2 nn 1 A 1 16/09/2020 8 2020-09-16 3 2 A 1 23/09/2010 9 2010-09-23 3 4 B 1 13/5/2010 6 2010-05-13 3 # Task #1 accomplished
tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x)
$`1` Time differences in days [1] 0 3646 3779 $`3` Time difference of 0 days
unlist( tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x) )
11 12 13 3 0 3646 3779 0
txt.df$diffdays <- unlist( tapply(txt.df$dt2, txt.df$ID, function(x)
x[1] -x) )
txt.df
Type ID Date Value dt2 nn diffdays 1 A 1 16/09/2020 8 2020-09-16 3 0 2 A 1 23/09/2010 9 2010-09-23 3 3646 3 B 3 18/8/2010 7 2010-08-18 1 3779 4 B 1 13/5/2010 6 2010-05-13 3 0
I would be very grateful for any words of advice. Many thanks, Chris Beeley, Institute of Mental Health, UK
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD West Hartford, CT
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Abhijit Dasgupta, PhD Director and Principal Statistician ARAASTAT Ph: 301.385.3067 E: adasgupta at araastat.com W: http://www.araastat.com
On Aug 24, 2010, at 1:59 PM, Abhijit Dasgupta, PhD wrote:
The only problem with this is that Chris's unique individuals are a combination of Type and ID, as I understand it. So Type=A, ID=1 is a different individual from Type=B,ID=1. So we need to create a unique identifier per person, simplistically by uniqueID=paste(Type, ID, sep=''). Then, using this new identifier, everything follows.
I see your point. I agree that a tapply method should present both factors in the indices argument. > new.df <- txt.df[ -which( txt.df$nn <=1), ] > new.df <- new.df[ with(new.df, order(Type, ID) ), ] # and possibly needs to be ordered? > new.df$diffdays <- unlist( tapply(new.df$dt2, list(new.df$ID, new.df $Type), function(x) x[1] -x) ) > new.df Type ID Date Value dt2 nn diffdays 1 A 1 16/09/2020 8 2020-09-16 3 0 2 A 1 23/09/2010 9 2010-09-23 3 3646 4 B 1 13/5/2010 6 2010-05-13 3 0 But do not agree that you need, in this case at least, to create a paste()-y index. Agreed, however, such a construction can be useful in other situations.
David. > > On 08/24/2010 01:53 PM, David Winsemius wrote: >> >> On Aug 24, 2010, at 1:19 PM, Chris Beeley wrote: >> >>> Hello- >>> >>> A basic question which has nonetheless floored me entirely. I have a >>> dataset which looks like this: >>> >>> Type ID Date Value >>> A 1 16/09/2020 8 >>> A 1 23/09/2010 9 >>> B 3 18/8/2010 7 >>> B 1 13/5/2010 6 >>> >>> There are two Types, which correspond to different individuals in >>> different conditions, and loads of ID labels (1:50) corresponding to >>> the different individuals in each condition, and measurements at >>> different times (from 1 to 10 measurements) for each individual. >>> >>> I want to perform the following operations: >>> >>> 1) Delete all individuals for whom only one measurement is >>> available. >>> In the dataset above, you can see that I want to delete the row >>> Type B >>> ID 3, and Type B ID 1, but without deleting the Type A ID 1 data >>> because there is more than one measurement for Type A ID 1 (but not >>> for Type B ID1) >>> >>> 2) Produce difference scores for each of the Dates, so each >>> individual >>> (Type A ID1 and all the others for whom more than one measurement >>> exists) starts at Date "1" and goes up in integers according to how >>> many days have elapsed. >>> >>> I just know there's some incredibly cunning R-ish way of doing this >>> but after many hours of fiddling I have had to admit defeat. >> >> Not sure about terribly cunning. Let's assume your dataframe was >> read in with stringsAsFactors=FALSE and is called txt.df: >> >> >> > txt.df$dt2 <- as.Date(txt.df$Date, format="%d/%m/%Y") >> > txt.df >> Type ID Date Value dt2 >> 1 A 1 16/09/2020 8 2020-09-16 >> 2 A 1 23/09/2010 9 2010-09-23 >> 3 B 3 18/8/2010 7 2010-08-18 >> 4 B 1 13/5/2010 6 2010-05-13 >> >> > txt.df$nn <- ave(txt.df$ID,txt.df$ID, FUN=length) >> > txt.df >> Type ID Date Value dt2 nn >> 1 A 1 16/09/2020 8 2020-09-16 3 >> 2 A 1 23/09/2010 9 2010-09-23 3 >> 3 B 3 18/8/2010 7 2010-08-18 1 >> 4 B 1 13/5/2010 6 2010-05-13 3 >> > txt.df[ -which( txt.df$nn <=1), ] >> Type ID Date Value dt2 nn >> 1 A 1 16/09/2020 8 2020-09-16 3 >> 2 A 1 23/09/2010 9 2010-09-23 3 >> 4 B 1 13/5/2010 6 2010-05-13 3 >> >> # Task #1 accomplished >> >> > tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x) >> $`1` >> Time differences in days >> [1] 0 3646 3779 >> >> $`3` >> Time difference of 0 days >> >> > unlist( tapply(txt.df$dt2, txt.df$ID, function(x) x[1] -x) ) >> 11 12 13 3 >> 0 3646 3779 0 >> > txt.df$diffdays <- unlist( tapply(txt.df$dt2, txt.df$ID, >> function(x) x[1] -x) ) >> > txt.df >> Type ID Date Value dt2 nn diffdays >> 1 A 1 16/09/2020 8 2020-09-16 3 0 >> 2 A 1 23/09/2010 9 2010-09-23 3 3646 >> 3 B 3 18/8/2010 7 2010-08-18 1 3779 >> 4 B 1 13/5/2010 6 2010-05-13 3 0 >> > >> >> >> > David Winsemius, MD West Hartford, CT
The paste-y argument is my usual trick in these situations. I forget that tapply can take multiple ordering arguments :) Abhijit
On 08/24/2010 02:17 PM, David Winsemius wrote:
On Aug 24, 2010, at 1:59 PM, Abhijit Dasgupta, PhD wrote:
The only problem with this is that Chris's unique individuals are a combination of Type and ID, as I understand it. So Type=A, ID=1 is a different individual from Type=B,ID=1. So we need to create a unique identifier per person, simplistically by uniqueID=paste(Type, ID, sep=''). Then, using this new identifier, everything follows.
I see your point. I agree that a tapply method should present both factors in the indices argument.
new.df <- txt.df[ -which( txt.df$nn <=1), ] new.df <- new.df[ with(new.df, order(Type, ID) ), ] # and possibly
needs to be ordered?
new.df$diffdays <- unlist( tapply(new.df$dt2, list(new.df$ID,
new.df$Type), function(x) x[1] -x) )
new.df
Type ID Date Value dt2 nn diffdays 1 A 1 16/09/2020 8 2020-09-16 3 0 2 A 1 23/09/2010 9 2010-09-23 3 3646 4 B 1 13/5/2010 6 2010-05-13 3 0 But do not agree that you need, in this case at least, to create a paste()-y index. Agreed, however, such a construction can be useful in other situations.
Abhijit Dasgupta, PhD Director and Principal Statistician ARAASTAT Ph: 301.385.3067 E: adasgupta at araastat.com W: http://www.araastat.com
Many thanks to you both. I have now filed away for future reference the 2 factor tapply as well as the extremely useful looking plyr library. And the code worked beautifully :-)
On 24 Aug 2010, at 19:47, "Abhijit Dasgupta, PhD" <aikidasgupta at gmail.com> wrote:
The paste-y argument is my usual trick in these situations. I forget that tapply can take multiple ordering arguments :) Abhijit On 08/24/2010 02:17 PM, David Winsemius wrote:
On Aug 24, 2010, at 1:59 PM, Abhijit Dasgupta, PhD wrote:
The only problem with this is that Chris's unique individuals are a combination of Type and ID, as I understand it. So Type=A, ID=1 is a different individual from Type=B,ID=1. So we need to create a unique identifier per person, simplistically by uniqueID=paste(Type, ID, sep=''). Then, using this new identifier, everything follows.
I see your point. I agree that a tapply method should present both factors in the indices argument.
new.df <- txt.df[ -which( txt.df$nn <=1), ] new.df <- new.df[ with(new.df, order(Type, ID) ), ] # and possibly needs to be ordered? new.df$diffdays <- unlist( tapply(new.df$dt2, list(new.df$ID, new.df$Type), function(x) x[1] -x) ) new.df
Type ID Date Value dt2 nn diffdays 1 A 1 16/09/2020 8 2020-09-16 3 0 2 A 1 23/09/2010 9 2010-09-23 3 3646 4 B 1 13/5/2010 6 2010-05-13 3 0 But do not agree that you need, in this case at least, to create a paste()-y index. Agreed, however, such a construction can be useful in other situations.
-- Abhijit Dasgupta, PhD Director and Principal Statistician ARAASTAT Ph: 301.385.3067 E: adasgupta at araastat.com W: http://www.araastat.com