how to subset based on other row values and multiplicity

9 messages · Williams Scott, jim holtman, John McKown +2 more

#
Hi R experts,
 
I have a dataset as sampled below. Values are only regarded as 'confirmed'
in an individual ('id') if they occur more than once at least 30 days apart.

 
id   date value
a    2000-01-01 x
a    2000-03-01 x
b    2000-11-11 w
c    2000-11-11 y
c    2000-10-01 y
c    2000-09-10 y
c    2000-12-12 z
c    2000-10-11 z
d    2000-11-11 w
d    2000-11-10 w

 
I wish to subset the data to retain rows where the value for the
individual is confirmed more than 30 days apart. So, after deleting all
rows with just one occurrence of id and value, the rest would be the
earliest occurrence of each value in each case id, provided 31 or more
days exist between the dates. If >1 value is present per id, each value
level needs to be assessed independently. This example would then reduce
to:

 
id   date       value
a    2000-01-01 x
c    2000-09-10 y
c    2000-10-11 z

 
 
I can do this via some crude loops and subsetting, but I am looking for as
much efficiency as possible
as the dataset has around 50 million rows to assess. Any suggestions
welcomed.

Thanks in advance
 
Scott Williams MD
Melbourne, Australia



This email (including any attachments or links) may contain 
confidential and/or legally privileged information and is 
intended only to be read or used by the addressee.  If you 
are not the intended addressee, any use, distribution, 
disclosure or copying of this email is strictly 
prohibited.  
Confidentiality and legal privilege attached to this email 
(including any attachments) are not waived or lost by 
reason of its mistaken delivery to you.
If you have received this email in error, please delete it 
and notify us immediately by telephone or email.  Peter 
MacCallum Cancer Centre provides no guarantee that this 
transmission is free of virus or that it has not been 
intercepted or altered and will not be liable for any delay 
in its receipt.
#
On Wed, Jul 16, 2014 at 8:07 AM, Williams Scott
<Scott.Williams at petermac.org> wrote:
Question: the c-y id-value pair occurs 3 times. In two cases
(2000-11-11 vs. 2000-10-01 & 2000-11-11 vs. 2000-09-10) the difference
is >30 days. Why isn't
c 2000-10-01 y
also part of the result? Is it because you only want a single id-value
pair in which the date is the minimal? Or the one in which the date
difference is maximal? Or did you overlook that particular match? I
can't figure it out from your description.
Hmm, is the source of this data in a relational database such as
Oracle, PostgreSQL, MySQL, MS-SQL, or SQLite (or "other")? I ask
because some of this processing might be easier to do in the database
using a "self join", instead of reading the entire relational table
into a data.frame and doing it in R.
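The self-join idea can be sketched from R itself with the sqldf package (an illustration only: sqldf is assumed to be installed, it runs the query through SQLite, and the julianday() arithmetic relies on the dates being stored as 'YYYY-MM-DD' text):

```r
# Illustrative sketch of the self join, run through SQLite via sqldf.
library(sqldf)

dat <- data.frame(
  id    = c("a","a","b","c","c","c","c","c","d","d"),
  date  = c("2000-01-01","2000-03-01","2000-11-11","2000-11-11","2000-10-01",
            "2000-09-10","2000-12-12","2000-10-11","2000-11-11","2000-11-10"),
  value = c("x","x","w","y","y","y","z","z","w","w"),
  stringsAsFactors = FALSE
)

# For each id/value pair, keep the earliest date that has a second reading
# at least 31 days later.
res <- sqldf("
  SELECT t1.id, MIN(t1.date) AS date, t1.value
  FROM   dat t1
  JOIN   dat t2
    ON   t1.id = t2.id
   AND   t1.value = t2.value
   AND   julianday(t2.date) - julianday(t1.date) >= 31
  GROUP  BY t1.id, t1.value
")
res
```

On the sample data this returns the same three confirmed rows as Scott's expected output (a/x 2000-01-01, c/y 2000-09-10, c/z 2000-10-11).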

  
    
#
I can reproduce what you requested, but there was the question about
what happens with the multiple 'c-y' values.

====================
library(data.table)
dt <- data.table(read.table(text = "id   date value
a    2000-01-01 x
a    2000-03-01 x
b    2000-11-11 w
c    2000-11-11 y
c    2000-10-01 y
c    2000-09-10 y
c    2000-12-12 z
c    2000-10-11 z
d    2000-11-11 w
d    2000-11-10 w", as.is = TRUE, header = TRUE))
dt[, date := as.Date(date)]
setkey(dt, id, value, date)       # sort dates within each id/value group
dt[
    , {
        if (.N == 1) val <- NULL  # only one occurrence -- delete
        else {
            # compare first and last dates in the (sorted) group; checking
            # only consecutive gaps could miss a pair far apart that has
            # closely spaced readings in between
            dif <- difftime(date[.N], date[1L], units = 'days')
            # return the earliest date if the span is 31 days or more
            if (dif >= 31) val <- list(date = date[1L])
            else val <- NULL
        }
        val
      }
    , keyby = 'id,value'
    ]
id value       date
1:  a     x 2000-01-01
2:  c     y 2000-09-10
3:  c     z 2000-10-11

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Wed, Jul 16, 2014 at 9:25 AM, John McKown
<john.archie.mckown at gmail.com> wrote:
#
On Wed, Jul 16, 2014 at 8:51 AM, jim holtman <jholtman at gmail.com> wrote:
Wow, I picked up a couple of _nice_ techniques from that one post!
Looks like "data.table" will let me do SQL-like things in R. I have a
warped brain; I think in "result sets" and "matrix operations".
Many thanks.
#
Thanks guys - amazingly prompt solutions from the R community as always.

Yes, the c-y value reverts to just the first date event - the spirit of
this is that I am trying to identify and confirm a list of diagnoses that
a patient has coded in government administrative data. Once a diagnosis is
made and confirmed, I am not interested in whether it is listed again and
again later on; I just need the date at which it first became apparent.
So in the multiple c-y case, the min date is the correct one. Some cases
will have the same diagnosis listed dozens of times, hence the very
bloated dataset.

Time to churn through the data is not a big issue, so I will have a go
with Jim's neat code he just sent on perhaps a few thousand rows and see
how I get on.
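A synthetic slice of a few thousand rows could be fabricated for such a trial run (a sketch only: the id pool, value set, and date range here are invented, and f1 refers to the base-R function Bill Dunlap posts later in the thread):

```r
# Sketch: fabricate a few thousand rows shaped like the real data for timing.
set.seed(42)
n <- 5000
test <- data.frame(
  id    = sample(sprintf("p%03d", 1:500), n, replace = TRUE),
  date  = as.character(as.Date("2000-01-01") + sample(0:365, n, replace = TRUE)),
  value = sample(c("w", "x", "y", "z"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# then time any candidate, e.g. system.time(f1(test))
```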

S
On 17/07/2014 12:09 am, "John McKown" <john.archie.mckown at gmail.com> wrote:

            
#
Hi,
If `dat` is the dataset

library(dplyr)
dat %>%
  group_by(id, value) %>%
  arrange(date = as.Date(date)) %>%
  filter(any(c(abs(diff(as.Date(date))), NA) > 31) & date == min(date))
#Source: local data frame [3 x 3]
#Groups: id, value
#
#  id       date value
#1  a 2000-01-01     x
#2  c 2000-09-10     y
#3  c 2000-10-11     z
A.K.
On Wednesday, July 16, 2014 9:10 AM, Williams Scott <Scott.Williams at petermac.org> wrote:
#
Using base R you can solve this by doing some sorting and comparing
the first and last dates in each id-value group.  Computing the first
and last dates can be vectorized.

f1 <- function(data) {
    # sort by id, break ties with value, break remaining ties with date
    sortedData <- data[with(data, order(id, value, date)), ]
    i <- seq_len(NROW(sortedData) - 1)
    # a 'group' has same id and value; entries in a group are sorted by date
    isBreakPoint <- with(sortedData, id[i] != id[i+1] | value[i] != value[i+1])
    isFirstInGroup <- c(TRUE, isBreakPoint)
    isLastInGroup <- c(isBreakPoint, TRUE)
    # keep a group's first row when its last date is 31+ days after its first
    sortedData[isFirstInGroup, ][sortedData[isLastInGroup, "date"] -
        sortedData[isFirstInGroup, "date"] >= 31, ]
}
dat <- read.table(colClasses=c("character", "Date", "character"),
header=TRUE, text=
"id   date value
a    2000-01-01 x
a    2000-03-01 x
b    2000-11-11 w
c    2000-11-11 y
c    2000-10-01 y
c    2000-09-10 y
c    2000-12-12 z
c    2000-10-11 z
d    2000-11-11 w
d    2000-11-10 w")
f1(dat)
  id       date value
1  a 2000-01-01     x
6  c 2000-09-10     y
8  c 2000-10-11     z

Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Wed, Jul 16, 2014 at 7:49 AM, arun <smartpink111 at yahoo.com> wrote:
#
Note that the 'date == min(date)' will cause superfluous output rows
when there are several readings on the initial date for a given id/value
pair.  E.g.,
Source: local data frame [2 x 3]
Groups: id, value

  id value       date
1  A     x 2000-10-02
2  A     x 2000-10-02

where f2 is your code wrapped up in a function (to make testing and use easier)

f2 <- function (data)
{
    library(dplyr)
    data %>% group_by(id, value) %>% arrange(date = as.Date(date)) %>%
        filter(any(c(abs(diff(as.Date(date))), NA) > 31) & date == min(date))
}
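dat1 itself is not shown in the thread; a minimal data set that would reproduce the duplicated rows might look like this (the id, value, and dates are invented for illustration; note also that current dplyr releases reject the named argument to arrange() used in f2):

```r
# Hypothetical reconstruction of dat1: two readings on the initial date,
# plus a confirming reading more than 31 days later.
dat1 <- data.frame(
  id    = c("A", "A", "A"),
  date  = c("2000-10-02", "2000-10-02", "2000-12-01"),
  value = c("x", "x", "x"),
  stringsAsFactors = FALSE
)
# f2(dat1) then returns both 2000-10-02 rows, because 'date == min(date)'
# is TRUE for each of them.
```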


Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Wed, Jul 16, 2014 at 7:49 AM, arun <smartpink111 at yahoo.com> wrote:
#
Hi Bill,

Modifying `f2` seems to solve the problem.

f2 <- function (data)
{
    library(dplyr)
    data %>%
    group_by(id, value) %>%
    mutate(date = as.Date(date)) %>%
    arrange(date) %>%
    filter(any(c(abs(diff(date)), NA) > 31) & date == min(date)) %>%
    filter(row_number() == 1)
}

f2(dat)
Source: local data frame [3 x 3]
Groups: id, value

  id       date value
1  a 2000-01-01     x
2  c 2000-09-10     y
3  c 2000-10-11     z

f2(dat1)
Source: local data frame [1 x 3]
Groups: id, value

  id value       date
1  A     x 2000-10-02



A.K.
On Wednesday, July 16, 2014 4:25 PM, William Dunlap <wdunlap at tibco.com> wrote: