How to pick colums from a ragged array?

Hi,

With the dataset that you provided, and assuming that my solution works (ID 323 included), the code below outputs a variable INCLUDE.
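# Flag each ID whose first or last DATE is duplicated
res1<- data.frame(flag=tapply(id.d[,2],id.d[,1],FUN=function(x) head(duplicated(x)|duplicated(x,fromLast=TRUE),1)|tail(duplicated(x)|duplicated(x,fromLast=TRUE),1)))
# Rows of the flagged IDs that share ID and DATE with another row
res2<-id.d[id.d[,1]%in%names(res1[res1$flag==TRUE,])&(duplicated(id.d[,1:2])|duplicated(id.d[,1:2],fromLast=TRUE)),]
# Drop IDs that have fully identical rows (same ID, DATE and DG) among these
res3<-res2[!res2$ID%in% res2[duplicated(res2)|duplicated(res2,fromLast=TRUE),]$ID,]
id.d1<-id.d
# All rows of the IDs to be excluded
bad<-id.d1[id.d1$ID%in%res3$ID,]
# Placeholder column; it is dropped again below as column 4
id.d1$id.d_INCLUDE<-TRUE
bad$INCLUDE<-FALSE
# Merge on the shared columns ID, DATE and DG; rows not in 'bad' get INCLUDE = NA
res4<-merge(id.d1,bad,all=TRUE)
res4$INCLUDE[is.na(res4$INCLUDE)]<-TRUE
res5<-res4[,-4]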
head(res5)
#   ID     DATE DG INCLUDE
#1  58 20060821  1    TRUE
#2  58 20061207  2    TRUE
#3  58 20080102  1    TRUE
#4  58 20090904  1    TRUE
#5 167 20040205  4    TRUE
#6 167 20040205  4    TRUE
tail(res5)
#     ID     DATE DG INCLUDE
#35  910 20080521  4    TRUE
#36  910 20091224  2    TRUE
#37  999 20050503  2    TRUE
#38 1019 19870508  1   FALSE
#39 1019 19870508  2   FALSE
#40 1019 19880330  1   FALSE
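
For comparison, here is a minimal sketch of the same rule written per ID, using the test data posted in this thread. The helper name ambiguous_end is hypothetical, and the sketch assumes that "first" and "last" mean the earliest and latest DATE within an ID (it uses range() rather than row position). It is one possible reading of the requirement, not a replacement for the code above.

```r
# Test data as posted in this thread
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323,
        547,794,814,814,814,814,814,814,841,841,841,841,841,
        841,841,841,841,910,910,910,910,910,910,999,1019,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080521,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20071205,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20080521,20080521,
          20091224,20050503,19870508,19870508,19880330)
DG <- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,
        3,3,3,4,3,2,2,2,1,1)
id.d <- data.frame(ID, DATE, DG)

# Hypothetical helper: TRUE if the earliest or the latest DATE of one ID
# carries more than one distinct DG (assumption: first/last = min/max DATE)
ambiguous_end <- function(g) {
  ends <- range(g$DATE)
  any(vapply(ends, function(d) length(unique(g$DG[g$DATE == d])) > 1,
             logical(1)))
}

# One logical per ID, named by ID
amb <- vapply(split(id.d, id.d$ID), ambiguous_end, logical(1))

# Map the per-ID result back to every row, keeping the original row order
id.d$INCLUDE <- unname(!amb[as.character(id.d$ID)])
```

On the test data this sets INCLUDE to FALSE for every row of IDs 323, 841 and 1019, which agrees with the rows for ID 1019 shown in the output above. Unlike the merge-based version, the rows stay in their original order.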
A.K.





----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: "arun (smartpink111 at yahoo.com)" <smartpink111 at yahoo.com>; PIKAL Petr <petr.pikal at precheza.cz>; "Rui Barradas (ruipbarradas at sapo.pt)" <ruipbarradas at sapo.pt>
Cc: 
Sent: Wednesday, October 24, 2012 11:41 AM
Subject: RE: [r] How to pick colums from a ragged array?

(And, considering the real application, the functions ideally should probably output a variable INCLUDE, the same length as the original data, with TRUE and FALSE for whether or not that row should be included...)

-----Original Message-----
From: Leask Stuart
Sent: 24 October 2012 16:25
To: arun (smartpink111 at yahoo.com); 'PIKAL Petr'; Rui Barradas (ruipbarradas at sapo.pt)
Subject: RE: [r] How to pick colums from a ragged array?

Arun, Petr, Rui, many thanks for your help, and the functions you have written.

You'll recall I wanted to remove these first (or last) duplicates, because they represented instances where two different diagnoses (in this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I can't say which was 'first' (or 'last').

Your functions have revealed something I wasn't expecting: In some cases, the diagnoses recorded on the duplicated DATEs are the same!
This is a surprise to me, but probably reflects someone going to two different departments in a clinic, and both departments submit data. I have to say that crazy things like this are often a feature of real data, which I'm sure you've come across yourselves.
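
One way to tell these two situations apart is to count, for each ID and DATE, how many records there are and how many distinct diagnoses they carry. The sketch below uses a small made-up data frame with the thread's column names (ID, DATE, DG); the STATUS column and its labels are illustrative assumptions.

```r
# Small made-up example with the thread's column names
x <- data.frame(
  ID   = c(167, 167, 1019, 1019, 1019),
  DATE = c(20040205, 20040205, 19870508, 19870508, 19880330),
  DG   = c(4, 4, 2, 1, 1)
)

# Number of records per ID and DATE, repeated for every row of the group
n_on_day <- ave(x$DG, x$ID, x$DATE, FUN = length)

# Number of distinct diagnoses per ID and DATE
n_dg_on_day <- ave(x$DG, x$ID, x$DATE, FUN = function(v) length(unique(v)))

# Illustrative labels: only "conflicting DG" makes the day ambiguous
x$STATUS <- ifelse(n_on_day == 1, "single record",
            ifelse(n_dg_on_day == 1, "repeated, same DG", "conflicting DG"))
x
```

Here the two records for ID 167 are labelled "repeated, same DG", the two same-day records for ID 1019 are labelled "conflicting DG", and the remaining record is a "single record".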

Of course, I don't want to remove records in which I can determine an unambiguous 'first diagnosis'.

You have all put in so much effort on my behalf, I'm ashamed to ask, but I wonder if any of the functions you've written could do this with a little more indexing and the 'duplicated' function.
So the function should only exclude an ID if, having identified a first (or last) DATE duplicate, the DGs of the two records on that date are different.

Test dataset:

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080521,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20071205
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521
,20091224,20050503,19870508,19870508,19880330)

DG<-
c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1)

id.d<-data.frame(ID,DATE,DG)
id.d

# Considering Rui's getRepeat function:

g.r<-getRepeat(id.d)   # defaults to first = TRUE; use getRepeat(id.d, first = FALSE) to get the last ones
g.rr<-do.call(rbind, g.r) # put the data into a matrix

# I can remove the date duplicates with:
g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]
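
To make the indexing above easier to follow, here is the same idiom unpacked on a stand-in for g.rr. getRepeat itself is not shown in this message, so the data frame below is an assumption: it supposes two consecutive rows per ID, using the first-date duplicates of IDs 167, 841 and 1019 from the test dataset.

```r
# Stand-in for g.rr (assumption: exactly two consecutive rows per ID)
pairs <- data.frame(
  ID   = c(167, 167, 841, 841, 1019, 1019),
  DATE = c(20040205, 20040205, 20050421, 20050421, 19870508, 19870508),
  DG   = c(4, 4, 1, 2, 2, 1)
)

# Index of the second row of each pair: 2, 4, 6, ...
second <- seq(2, nrow(pairs), by = 2)

# duplicated() is TRUE for a row identical to an earlier one, so the
# second row of a pair is flagged when ID, DATE and DG all repeat
differs <- !duplicated(pairs)[second]

# Expand the per-pair result to both rows of each pair
keep <- rep(differs, each = 2)

pairs[keep, ]
```

Only the pairs whose DG values differ remain (IDs 841 and 1019 here); the pair for ID 167, where both records carry the same diagnosis, is dropped.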

I'm not sure how to add this to your suggestions, Arun & Petr...


Stuart


-----Original Message-----
From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
Sent: 23 October 2012 15:24
To: Stuart Leask
Subject: RE: [r] How to pick colums from a ragged array?

Hi

I assumed that id.d is a data frame

id.d <- data.frame (ID,DATE )

and

fff(id.d)

works for me

Petr