How to pick columns from a ragged array?

7 messages · arun, root, Rui Barradas

#
Hello,

Using one of Arun's ideas from a few posts ago, I think this new function
returns a logical index into id.d of the rows that should be _removed_,
hence the names rm1 and rm2.



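getRepLogical <- function(x, first = TRUE){
     # Columns of x: 1 = ID, 2 = DATE, 3 = DG. Rows are assumed to be sorted
     # by ID, so that the unlist() calls below line up with the rows of x.
     # Inspect the first two rows of each ID, or the last two if first = FALSE.
     fun <- if(first) head else tail
     # Per ID: is the second of the two DATEs a repeat of the first?
     dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2)))
     len <- tapply(x[,2], x[,1], FUN = length)
     # Pad with FALSE up to the number of rows of each ID.
     lst <- lapply(seq_along(dte), function(i) c(dte[[i]],
          rep(FALSE, if(len[[i]] > 2) len[[i]] - 2 else 0)))
     # When first = FALSE, reverse each vector so that the padding comes first.
     lst <- if(first) lst else lapply(lst, rev)
     i1 <- unlist(lst)
     # Per ID: TRUE where a DG among the two rows is not a repeat.
     dg <- tapply(x[,3], x[,1], FUN = function(x) !duplicated(fun(x, 2)))
     lst <- lapply(seq_along(dte), function(i) c(dg[[i]],
          rep(FALSE, if(len[[i]] > 2) len[[i]] - 2 else 0)))
     lst <- if(first) lst else lapply(lst, rev)
     i2 <- unlist(lst)
     # Flag rows where the DATE repeats and the DG differs.
     i1 & i2
}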

rm1 <- getRepLogical(id.d)
rm2 <- getRepLogical(id.d, first = FALSE)

id.d[rm1, ]
id.d[rm2, ]

id.d$INCLUDE <- !(rm1 | rm2)
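
To see what the two index vectors look like, here is a minimal sketch on a toy subset of id.d. The 15 rows are assembled from rows quoted elsewhere in this thread (IDs 58, 167, 323, 841 and 1019); the full id.d is not shown, so this subset and its row numbers are an assumption made for illustration only.

```r
# Toy subset of id.d (assumption: only rows quoted in the thread).
id.d <- data.frame(
  ID   = c(58, 58, 58, 58, 167, 167, 323, 323, 323, 841, 841, 841, 1019, 1019, 1019),
  DATE = c(20060821, 20061207, 20080102, 20090904, 20040205, 20040205,
           20080407, 20080521, 20080521, 20050421, 20050421, 20060428,
           19870508, 19870508, 19880330),
  DG   = c(1, 2, 1, 1, 4, 4, 1, 2, 3, 1, 2, 1, 2, 1, 1)
)

# getRepLogical as posted in this message.
getRepLogical <- function(x, first = TRUE){
  fun <- if(first) head else tail
  dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2)))
  len <- tapply(x[,2], x[,1], FUN = length)
  lst <- lapply(seq_along(dte), function(i) c(dte[[i]],
    rep(FALSE, if(len[[i]] > 2) len[[i]] - 2 else 0)))
  lst <- if(first) lst else lapply(lst, rev)
  i1 <- unlist(lst)
  dg <- tapply(x[,3], x[,1], FUN = function(x) !duplicated(fun(x, 2)))
  lst <- lapply(seq_along(dte), function(i) c(dg[[i]],
    rep(FALSE, if(len[[i]] > 2) len[[i]] - 2 else 0)))
  lst <- if(first) lst else lapply(lst, rev)
  i2 <- unlist(lst)
  i1 & i2
}

rm1 <- getRepLogical(id.d)                 # flags from the first two rows per ID
rm2 <- getRepLogical(id.d, first = FALSE)  # flags from the last two rows per ID
id.d$INCLUDE <- !(rm1 | rm2)
id.d[!id.d$INCLUDE, ]                      # the rows marked for removal
```

On this subset the flagged rows are the second row of ID 841, the second row of ID 1019 and the middle row of ID 323; every other row keeps INCLUDE = TRUE.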


Hope this helps,

Rui Barradas
On 24-10-2012 16:41, Stuart Leask wrote:
#
Hi Rui,

I think our results now match, except in the INCLUDE column:

id.d[c(11:13,22:24,38:40),]
#     ID     DATE DG INCLUDE
#11  323 20080407  1    TRUE
#12  323 20080521  2   FALSE
#13  323 20080521  3    TRUE
#22  841 20050421  1    TRUE
#23  841 20050421  2   FALSE
#24  841 20060428  1    TRUE
#38 1019 19870508  2    TRUE
#39 1019 19870508  1   FALSE
#40 1019 19880330  1    TRUE


I thought all the rows with the above IDs would be FALSE (from my solution):

res4[c(11:13,22:24,38:40),]
#     ID     DATE DG INCLUDE
#11  323 20080407  1   FALSE
#12  323 20080521  2   FALSE
#13  323 20080521  3   FALSE
#22  841 20050421  1   FALSE
#23  841 20050421  2   FALSE
#24  841 20060428  1   FALSE
#38 1019 19870508  1   FALSE
#39 1019 19870508  2   FALSE
#40 1019 19880330  1   FALSE

A.K.




#
Hello,

I just realized that getRepLogical marks the second row of a duplicated
pair for removal, not the first (or, counting from the end, the
second-to-last rather than the last). The first tapply should be

     dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2), fromLast = TRUE))

in order to remove the first (or last).
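
The effect of fromLast can be checked on a pair of equal dates. This is only a small illustration of base R's duplicated(), using a two-element vector chosen for the example:

```r
d <- c(20040205, 20040205)       # two equal dates (illustrative values)

duplicated(d)                    # FALSE  TRUE: the second element is marked
duplicated(d, fromLast = TRUE)   #  TRUE FALSE: the first element is marked
```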

Rui Barradas
On 24-10-2012 18:41, Rui Barradas wrote:
#
Hello,

Inline.
On 24-10-2012 19:05, arun wrote:
Why? Look at the last ID, 1019. Its last row must be included, because
that date doesn't repeat. And one of the first two rows must also be
included; otherwise we would be excluding that date completely. Or at
least this is how I understand the problem.

Rui Barradas
#
Hi,

According to the OP: "So the function should only exclude an ID, having identified a first (or last) DATE duplicate, the DGs for these two dates are different."
Rui:
By running your modified function (using dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2),fromLast = TRUE))), 

id.d$INCLUDE <- !(rm1 | rm2)
head(id.d)
#     ID     DATE DG INCLUDE
#1    58 20060821  1    TRUE
#2    58 20061207  2    TRUE
#3    58 20080102  1    TRUE
#4    58 20090904  1    TRUE
#5   167 20040205  4   FALSE
#6   167 20040205  4   FALSE

For ID 167, the DGs are the same. Not sure whether to exclude it or not.


My modified solution is similar but I am excluding 167 and 814.


fun1<-function(dat){
  # Per ID: is the DATE of the first row (res1first) or last row (res1last) duplicated within that ID?
  res1first<- data.frame(flag=tapply(dat[,2],dat[,1],FUN=function(x) head(duplicated(x)|duplicated(x,fromLast=TRUE),1)))
  res1last<- data.frame(flag=tapply(dat[,2],dat[,1],FUN=function(x) tail(duplicated(x)|duplicated(x,fromLast=TRUE),1)))
  # Rows of the flagged IDs whose (ID, DATE) pair occurs more than once.
  res2first<-dat[dat[,1]%in%names(res1first[res1first$flag==TRUE,])&(duplicated(dat[,1:2])|duplicated(dat[,1:2],fromLast=TRUE)),]
  res2last<-dat[dat[,1]%in%names(res1last[res1last$flag==TRUE,])&(duplicated(dat[,1:2])|duplicated(dat[,1:2],fromLast=TRUE)),]
  # Drop IDs whose duplicated rows are fully identical (same DG as well).
  res3first<-res2first[!res2first$ID%in% res2first[duplicated(res2first)|duplicated(res2first,fromLast=TRUE),]$ID,]
  res3last<-res2last[!res2last$ID%in% res2last[duplicated(res2last)|duplicated(res2last,fromLast=TRUE),]$ID,]
  # The first (or last) remaining row of each ID gets INCLUDE = FALSE.
  res3firstsubset<-do.call(rbind,lapply(split(res3first,res3first$ID),head,1))
  res3firstsubset$INCLUDE<-FALSE
  res3lastsubset<-do.call(rbind,lapply(split(res3last,res3last$ID),tail,1))
  res3lastsubset$INCLUDE<-FALSE
  # Merge back with dat; rows left without a flag get INCLUDE = TRUE.
  res4<-merge(dat,merge(res3first,merge(res3firstsubset,merge(res3lastsubset,res3last,all=TRUE),all=TRUE),all=TRUE),all=TRUE)
  res4$INCLUDE[is.na(res4$INCLUDE)]<-TRUE
  res4
}

tail(fun1(id.d))
#     ID     DATE DG INCLUDE
#35  910 20080521  4    TRUE
#36  910 20091224  2    TRUE
#37  999 20050503  2    TRUE
#38 1019 19870508  1    TRUE
#39 1019 19870508  2   FALSE
#40 1019 19880330  1    TRUE
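
The idiom duplicated(x) | duplicated(x, fromLast = TRUE), used several times in fun1, marks every member of a duplicated group rather than only the later ones. A small sketch on five rows whose values are quoted in this thread (the data frame dat itself is assembled here for illustration) shows the difference between duplicated (ID, DATE) pairs and fully duplicated rows:

```r
dat <- data.frame(
  ID   = c(167, 167, 841, 841, 841),
  DATE = c(20040205, 20040205, 20050421, 20050421, 20060428),
  DG   = c(4, 4, 1, 2, 1)
)

# Every row whose (ID, DATE) pair occurs more than once.
dupDate <- duplicated(dat[, 1:2]) | duplicated(dat[, 1:2], fromLast = TRUE)

# Every row that is identical to another row in all three columns.
dupRow <- duplicated(dat) | duplicated(dat, fromLast = TRUE)

data.frame(dat, dupDate, dupRow)
```

Here ID 167 repeats its date with the same DG, so both of its rows are fully duplicated, while ID 841 repeats its date with different DGs, so only dupDate is TRUE for those two rows.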

A.K.












#
I mis-typed, missing an if. I think you've got it, but let me try again:

"The function should:
-  put FALSE in a column for every instance of an ID
IF ( that ID has a first (or last) DATE duplicated )
AND
IF (the DGs for the duplicated dates are different)."

So for the earliest/first date function, INCLUDE should be TRUE everywhere except for _all_ the instances of IDs 167, 841 and 1019, which should be FALSE.
For the latest/last date function, INCLUDE should be TRUE everywhere except for all the instances of ID 323, which should be FALSE.
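
The rule as restated above works at the ID level: every row of an ID is flagged, not just one row of the duplicated pair. A minimal sketch of that reading follows. The function name flagByID and the toy subset of id.d (assembled from rows quoted in this thread) are assumptions, not code from the thread. It also assumes the rows of each ID are in date order, so that the first and last rows carry the earliest and latest dates.

```r
# Toy subset of id.d (assumption: only rows quoted in the thread).
id.d <- data.frame(
  ID   = c(58, 58, 58, 58, 167, 167, 323, 323, 323, 841, 841, 841, 1019, 1019, 1019),
  DATE = c(20060821, 20061207, 20080102, 20090904, 20040205, 20040205,
           20080407, 20080521, 20080521, 20050421, 20050421, 20060428,
           19870508, 19870508, 19880330),
  DG   = c(1, 2, 1, 1, 4, 4, 1, 2, 3, 1, 2, 1, 2, 1, 1)
)

# Hypothetical helper: returns TRUE for rows to keep and FALSE for every row
# of an ID whose first (or last) DATE occurs with more than one distinct DG.
flagByID <- function(x, first = TRUE) {
  pick <- if (first) head else tail
  exclude <- sapply(split(x, x$ID), function(g) {
    d0  <- pick(g$DATE, 1)        # DATE of the first (or last) row of this ID
    dgs <- g$DG[g$DATE == d0]     # all DGs recorded on that DATE
    length(unique(dgs)) > 1       # more than one distinct DG on that DATE
  })
  !(as.character(x$ID) %in% names(exclude)[exclude])
}

includeFirst <- flagByID(id.d)
includeLast  <- flagByID(id.d, first = FALSE)

unique(id.d$ID[!includeFirst])   # IDs excluded by the first-date rule
unique(id.d$ID[!includeLast])    # IDs excluded by the last-date rule
```

On this subset the first-date rule excludes IDs 841 and 1019 and the last-date rule excludes ID 323. ID 167 has only two rows here, both with the same DG, so the rule as written leaves it included; the full id.d may differ.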

 Stuart

#
Hello,

Inline.
On 24-10-2012 22:40, Stuart Leask wrote:
In this case forget my last post and use getRepLogical as I posted it
originally.

Rui Barradas