HI,
For your question to split by two variables, try this:
dat1<-read.table(text="
ID???????? PT_ID? BASE???? IDX_DT?? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF?? LDL_BASE? rf
118485 9624295 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-25 2.5
118491 9624295 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-25 2.5
118492 9624295 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-25 2.5
118485 9624296 164.2 2006-11-21 2009-02-17???? 157.6??????? 2? 819? 2006-11-26 2.5
118486 9624296 164.2 2006-11-21 2006-09-30???? 208.0??????? 2? -52? 2006-11-26 2.5
118487 9624296 164.2 2006-11-21 2008-04-09???? 123.8??????? 2? 505? 2006-11-26 2.5
118488 9624296 164.2 2006-11-21 2008-02-26????? 17.4??????? 1? 462? 2006-11-27 2.5
118489 9624296 164.2 2006-11-21 2008-02-26???? 139.0??????? 2? 462? 2006-11-27 2.5
118490 9624296 164.2 2006-11-21 2007-07-02???? 107.2??????? 2? 223? 2006-11-27 2.5
118491 9624296 164.2 2006-11-21 2007-02-27????? 86.0??????? 1?? 98? 2006-11-27 2.5
118492 9624296 164.2 2006-11-21 2008-09-09???? 131.2??????? 2? 658? 2006-11-27 2.5
?",sep="",header=TRUE)
dat2<-split(dat1,list(dat1$LDL_BASE,dat1$PT_ID))
dat3<-list()
?for(i in seq_along(dat2)){
?dat3[[i]]<-list()
?dat3[[i]]<-ddply(dat2[[i]],.(DAYS_DIFF),summarize,Mean=mean(OBS_VALUE))
?}
dat3
do.call(rbind,dat3)
#?? DAYS_DIFF? Mean
#1??????? -52 208.0
#2??????? 462? 78.2
#3??????? 505 123.8
#4??????? 819 157.6
#5???????? 98? 86.0
#6??????? 223 107.2
#7??????? 658 131.2
#8??????? -52 208.0
#9??????? 505 123.8
#10?????? 819 157.6
#11??????? 98? 86.0
#12?????? 223 107.2
#13?????? 462? 78.2
#14?????? 658 131.2
#Not sure whether this will work or not in your huge dataset.? May be you can try lapply() also.?
A.K.
From: Weijia Wang <wwang.nyu at gmail.com>
To: arun <smartpink111 at yahoo.com>
Sent: Thursday, September 6, 2012 12:21 PM
Subject: Re: [R] R_closest date
To: arun <smartpink111 at yahoo.com>
Sent: Thursday, September 6, 2012 12:21 PM
Subject: Re: [R] R_closest date
Hi, Arun
Do you have idea about good package that split HUGE dataframe by two variables?
I was trying to use 'ddply' to calculate a mean of LDL-C values which had same date for every patient.?
Therefore, I need to break down my dataframe, first by patient ID, then by the date of the LDL-C, and finally calculate the mean, if there are multiple LDL-C on a same day.
The example is:
? ? ? ? ?PT_ID ?BASE ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY DAYS_DIFF ? LDL_BASE ?rf
118485 9624295 164.2 2006-11-21 2009-02-17 ? ? 157.6 ? ? ? ?2 ?819 days 2006-11-20 2.5
118486 9624295 164.2 2006-11-21 2006-09-30 ? ? 208.0 ? ? ? ?2 ?-52 days 2006-11-20 2.5
118487 9624295 164.2 2006-11-21 2008-04-09 ? ? 123.8 ? ? ? ?2 ?505 days 2006-11-20 2.5
118488 9624295 164.2 2006-11-21 2008-02-26 ? ? ?17.4 ? ? ? ?1 ?462 days 2006-11-20 2.5
118489 9624295 164.2 2006-11-21 2008-02-26 ? ? 139.0 ? ? ? ?2 ?462 days 2006-11-20 2.5
118490 9624295 164.2 2006-11-21 2007-07-02 ? ? 107.2 ? ? ? ?2 ?223 days 2006-11-20 2.5
118491 9624295 164.2 2006-11-21 2007-02-27 ? ? ?86.0 ? ? ? ?1 ? 98 days 2006-11-20 2.5
118492 9624295 164.2 2006-11-21 2008-09-09 ? ? 131.2 ? ? ? ?2 ?658 days 2006-11-20 2.5
? ? ? ? ?REDUCTION ? ? ? ? ? ?GOAL ?FAILURE
118485 ?0.04019488 ? ? NOT AT GOAL ? ? FAIL
118486 -0.26674787 PRE NOT AT GOAL NOT FAIL
118487 ?0.24604141 ? ? ? ? AT GOAL NOT FAIL
118488 ?0.89403167 ? ? ? ? AT GOAL NOT FAIL
118489 ?0.15347138 ? ? NOT AT GOAL ? ? FAIL
118490 ?0.34713764 ? ? ? ? AT GOAL NOT FAIL
118491 ?0.47624848 ? ? ? ? AT GOAL NOT FAIL
118492 ?0.20097442 ? ? NOT AT GOAL NOT FAIL
So, this patient has two LDL-C readings on '462 days', therefore, I want to get a mean of these 17.4 and 139.0.
'ddply' did give me a mean when running on a test dataframe, but when I used it on my dataframe with 200,000ish observations, the computer run for like 5 hours and return error. Do you have idea about other good function, that focuses on split and apply function, and rbind?
Best
Weijia
On Mon, Sep 3, 2012 at 2:08 AM, wwang.nyu <wwang.nyu at gmail.com> wrote:
That is actually a great idea, thanks again!
>
>Weijia Wang
>
>
>On Sep 2, 2012, at 12:12 PM, arun <smartpink111 at yahoo.com> wrote:
>
>> Hi,
>> No problem.
>>
>> If you use join() instead of merge(), the original order of columns may not get altered.
>>
>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>> library(plyr)
>> ?join(dat1,dat3,type="inner")
>> #Joining by: PT_ID, DAYS_DIFF
>> ?# PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>> #1 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>> #2 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>> A.K.
>>
>>
>>
>>
>>
>>
>> ________________________________
>> From: Weijia Wang <wwang.nyu at gmail.com>
>> To: arun <smartpink111 at yahoo.com>
>> Sent: Saturday, September 1, 2012 5:11 PM
>> Subject: Re: [R] R_closest date
>>
>>
>> Thank you Arun, for your help again.
>>
>> Best
>> ______________________________
>> WANG WEIJIA
>> Graudate Research and Teaching Assistant
>> Department of Environmental Medicine
>> New York University, School of Medicine
>> wwang.nyu at gmail.com
>>
>>
>>
>>
>> On Sep 1, 2012, at 5:04 PM, arun <smartpink111 at yahoo.com> wrote:
>>
>> Hi,
>>> Try this:
>>> dat1 <- read.table(text="
>>> ? PT_ID ? ?IDX_DT ?OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ?4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ?183 ? ? ? ?2
>>> 14 ?4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ?4549 2002-08-21 2003-02-18 ? ? ?181 ? ? ? ?89 ? ? ? ?1
>>> 16 ?4549 2002-08-21 2003-05-15 ? ? ?267 ? ? ?109 ? ? ? ?2
>>> 17 ?4549 2002-08-21 2003-12-16 ? ? ?482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ?0 ? ? ?179 ? ? ? ?2
>>> ", header=TRUE)
>>> dat3<-aggregate(DAYS_DIFF~PT_ID,data=dat1,min)
>>> merge(dat1,dat3)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>>
>>> #or,
>>> dat2<- tapply(dat1$DAYS_DIFF,dat1$PT_ID,min)
>>> dat4<-data.frame(PT_ID=row.names(data.frame(dat2)),DAYS_DIFF=dat2)
>>> ?row.names(dat4)<-1:nrow(dat4)
>>> merge(dat1,dat4)
>>> # ?PT_ID DAYS_DIFF ? ? IDX_DT ? OBS_DATE OBS_VALUE CATEGORY
>>> #1 ?4549 ? ? ? ?-1 2002-08-21 2002-08-20 ? ? ? 183 ? ? ? ?2
>>> #2 ?4839 ? ? ? ? 0 2006-11-28 2006-11-28 ? ? ? 179 ? ? ? ?2
>>> A.K.
>>>
>>>
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: WANG WEIJIA <wwang.nyu at gmail.com>
>>> To: "r-help at R-project.org" <r-help at r-project.org>
>>> Cc:
>>> Sent: Saturday, September 1, 2012 1:10 PM
>>> Subject: [R] R_closest date
>>>
>>> Hi,
>>>
>>> I have encountered an issue about finding a date closest to another date
>>>
>>> So this is how the data frame looks like:
>>>
>>> ? ? PT_ID ? ? IDX_DT ? OBS_DATE DAYS_DIFF OBS_VALUE CATEGORY
>>> 13 ? 4549 2002-08-21 2002-08-20 ? ? ? ?-1 ? ? ? 183 ? ? ? ?2
>>> 14 ? 4549 2002-08-21 2002-11-14 ? ? ? ?85 ? ? ? ?91 ? ? ? ?1
>>> 15 ? 4549 2002-08-21 2003-02-18 ? ? ? 181 ? ? ? ?89 ? ? ? ?1
>>> 16 ? 4549 2002-08-21 2003-05-15 ? ? ? 267 ? ? ? 109 ? ? ? ?2
>>> 17 ? 4549 2002-08-21 2003-12-16 ? ? ? 482 ? ? ? ?96 ? ? ? ?1
>>> 128 ?4839 2006-11-28 2006-11-28 ? ? ? ? 0 ? ? ? 179 ? ? ? ?2
>>>
>>> I need to find, the single observation, which has the closest date of 'OBS_DATE' to 'IDX_DT'.
>>>
>>> For example, for 'PT_ID' of 4549, I need row 13, of which the OBS_DATE is just one day away from IDX_DT.
>>>
>>> I was thinking about using abs(), and I got this:
>>>
>>> baseline<- function(x){
>>> +
>>> + ?#remove all uncessary variables
>>> + ?baseline<- x[,c("PT_ID","DAYS_DIFF")]
>>> +
>>> + ?#get a list of every unique ID
>>> + ?uniqueID <- unique(baseline$PT_ID)
>>> +
>>> + ?#make a vector that will contain the smallest DAYS_DIFF
>>> + ?first <- rep(-99,length(uniqueID))
>>> +
>>> + ?i = 1
>>> + ?#loop through each unique ID
>>> + ?for (PT_ID in uniqueID){
>>> +
>>> + ?#for each iteration get the smallest DAYS_DIFF for that ID
>>> + ?first[i] <- min(baseline[which(baseline$PT_ID==PT_ID),abs(baseline$DAYS_DIFF)])
>>> +
>>> + ?#up the iteration counter
>>> + ?i = i + 1
>>> +
>>> + ?}
>>> + ?#make a data frame with the lowest DAYS_DIFF and ID
>>> + ?newdata <- data.frame(uniqueID,first)
>>> + ?names(newdata) <- c("PT_ID","DAYS_DIFF")
>>> +
>>> + ?#return the data frame containing the lowest GPI for each ID
>>> + ?return(newdata)
>>> + ?}
>>>
>>> ldl.b<-baseline(ldl) #get all baseline ldl patient ID, total 11368 obs, all unique#
>>>> Error in `[.data.frame`(baseline, which(baseline$PT_ID == PT_ID), abs(baseline$DAYS_DIFF)) :
>>> ? undefined columns selected
>>>
>>> Can anyone help me in figuring out how to get the minimum value of the absolute value of DAYS_DIFF for unique ID?
>>>
>>> Thanks a lot
>>> ? ? [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>