________________________________
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com>
Sent: Thursday, September 6, 2012 12:21 PM
Subject: Re: [R] R_closest date
Hi, Arun
Do you have idea about good package that split HUGE dataframe by two variables?
I was trying to use 'ddply' to calculate a mean of LDL-C values which had same date for
every patient.
Therefore, I need to break down my dataframe, first by patient ID, then by the date of
the LDL-C, and finally calculate the mean, if there are multiple LDL-C on a same day.
The example is:
? ? ? ? ?PT_ID ?BASE ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
DAYS_DIFF ? LDL_BASE ?rf
118485 9624295 164.2 2006-11-21 2009-02-17 ? ? 157.6 ? ? ? ?2 ?819 days 2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30 ? ? 208.0 ? ? ? ?2 ?-52 days 2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09 ? ? 123.8 ? ? ? ?2 ?505 days 2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26 ? ? ?17.4 ? ? ? ?1 ?462 days 2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26 ? ? 139.0 ? ? ? ?2 ?462 days 2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02 ? ? 107.2 ? ? ? ?2 ?223 days 2006-11-20 2.5
118491 9624295 164.2 2006-11-21 2007-02-27 ? ? ?86.0 ? ? ? ?1 ? 98 days 2006-11-20 2.5
118492 9624295 164.2 2006-11-21 2008-09-09 ? ? 131.2 ? ? ? ?2 ?658 days 2006-11-20 2.5
? ? ? ? ?REDUCTION ? ? ? ? ? ?GOAL ?FAILURE
118485 ?0.04019488 ? ? NOT AT GOAL ? ? FAIL
118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
118487 ?0.24604141 ? ? ? ? AT GOAL NOT FAIL
118488 ?0.89403167 ? ? ? ? AT GOAL NOT FAIL
118489 ?0.15347138 ? ? NOT AT GOAL ? ? FAIL
118490 ?0.34713764 ? ? ? ? AT GOAL NOT FAIL
118491 ?0.47624848 ? ? ? ? AT GOAL NOT FAIL
118492 ?0.20097442 ? ? NOT AT GOAL NOT FAIL
So, this patient has two LDL-C readings on '462 days', therefore, I want to get a mean of
these 17.4 and 139.0.
'ddply' did give me a mean when running on a test dataframe, but when I used it on my
dataframe with 200,000ish observations, the computer run for like 5 hours and return
error. Do you have idea about other good function, that focuses on split and apply
function, and rbind?
Best
Weijia
On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:
That is actually a great idea, thanks again!
Weijia Wang
On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
Hi,
No problem.
If you use join() instead of merge(), the original order of columns may not get altered.
dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
library(plyr)
?join(dat1,dat3,type="inner")
#Joining by: PT_ID, DAYS_DIFF
?# PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
#1 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
#2 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
A.K.
________________________________
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com>
Sent: Saturday, September 1, 2012 5:11 PM
Subject: Re: [R] R_closest date
Thank you Arun, for your help again.
Best
______________________________
WANG WEIJIA
Graudate Research and Teaching Assistant
Department of Environmental Medicine
New York University, School of Medicine
wwang.nyu at gmail.com
On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote:
Hi,
Try this:
dat1 <- read.table(text="
? PT_ID ? ?IDX_DT ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
13 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ?183 ? ? ? ?2
14 ?4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
15 ?4549 2002-08-21 2003-02-18 ? ? ?181 ? ? ? ?89 ? ? ? ?1
16 ?4549 2002-08-21 2003-05-15 ? ? ?267 ? ? ?109 ? ? ? ?2
17 ?4549 2002-08-21 2003-12-16 ? ? ?482 ? ? ? ?96 ? ? ? ?1
128 ?4839 2006-11-28 2006-11-28 ? ? ? ?0 ? ? ?179 ? ? ? ?2
", header=TRUE)
dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
merge(dat1,dat3)
# ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
#1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
#2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
#or,
dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
?row.names(dat4)<-1:nrow(dat4)
merge(dat1,dat4)
# ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
#1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
#2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
A.K.
----- Original Message -----
From: WANG WEIJIA <wwang.nyu at gmail.com>
To: "r-help at R-project.org" <r-help at r-project.org>
Cc:
Sent: Saturday, September 1, 2012 1:10 PM
Subject: [R] R_closest date
Hi,
I have encountered an issue about finding a date closest to another date
So this is how the data frame looks like:
? ? PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
13 ? 4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
14 ? 4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
15 ? 4549 2002-08-21 2003-02-18 ? ? ? 181 ? ? ? ?89 ? ? ? ?1
16 ? 4549 2002-08-21 2003-05-15 ? ? ? 267 ? ? ? 109 ? ? ? ?2
17 ? 4549 2002-08-21 2003-12-16 ? ? ? 482 ? ? ? ?96 ? ? ? ?1
128 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
I need to find, the single observation, which has the closest date of 'OBS_DATE' to
'IDX_DT'.
For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one
day away from IDX_DT.
I was thinking about using abs(), and I got this:
baseline<- function(x){
+
+ ?#remove all uncessary variables
+ ?baseline<- x[,c("PT_ID","DAYS_DIFF")]
+
+ ?#get a list of every unique ID
+ ?uniqueID <- unique(baseline$PT_ID)
+
+ ?#make a vector that will contain the smallest DAYS_DIFF
+ ?first <- rep(-99,length(uniqueID))
+
+ ?i = 1
+ ?#loop through each unique ID
+ ?for (PT_ID in uniqueID){
+
+ ?#for each iteration get the smallest DAYS_DIFF for that ID
+ ?first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
+
+ ?#up the iteration counter
+ ?i = i + 1
+
+ ?}
+ ?#make a data frame with the lowest DAYS_DIFF and ID
+ ?newdata <- data.frame(uniqueID,first)
+ ?names(newdata) <- c("PT_ID","DAYS_DIFF")
+
+ ?#return the data frame containing the lowest GPI for each ID
+ ?return(newdata)
+ ?}
ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique#
Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID),
abs(baseline$DAYS_DIFF)) :
? undefined columns selected
Can anyone help me in figuring out how to get the minimum value of the absolute
value of DAYS_DIFF for unique ID?
Thanks a lot
? ? [[alternative HTML version deleted]]
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.