Length Class Mode
[1,] 2 data.frame list
[2,] 2 data.frame list
[3,] 2 data.frame list
This leaves me with the same problem I had with my ragged array i.e. how do I put all the second elements from this long list of data frames, into a single list?
I need to end up with a list of all the IDs that have duplicate first (or last) DATES.
Stuart
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 13:17
To: 'Rui Barradas'
Cc: 'r-help at r-project.org'
Subject: RE: [R] [r] How to pick colums from a ragged array?
Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
Test data is all - see below it failing.
So, I remain very grateful for your function!
Stuart
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
# rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t ) # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1] # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2] # list of dup'd dates
earliest<-tapply(DATE,ID,min) # table of mins
rownames(earliest[earliest%in%nd.b]) # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, # and an earliest date that is the same as another ID's earliest+duplicate.
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many > 1, ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)
rownames(earliest[earliest%in%nd.b])
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply ( DATE, ID, min) # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,200912
10
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
fun <- if(first) head else tail
sp <- split(data.frame(x), x[,1])
first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
n <- unlist(lapply(lst, sum))
sp1 <- sp[n > 1]
i1 <- lst[n > 1]
lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first =
FALSE) # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.
I want to subset the data in two ways:
- define groups of patients by the first diagnosis given
- define groups of patients by the last diagnosis given.
The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These
individuals I need to identify and remove, as it's not possible to
say uniquely what their first (or last) diagnosis was. So I need to
identify and remove these individuals which have pairs of rows with
the same ID and (lowest or highest) DATE. The size of the dataset
precludes the option of doing this by eye.
I suspect there is a very elegant way of doing this in R.
This is what I've come up with:
- Sort by DATE then ID
- Make a ragged array of DATE by ID
- Remove IDs that only occur once.
- Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).
- (Then do the same to get the 'last appointment'
duplicates, by reversing the initial sort by DATE.)
I am stuck at the 'Subtract dates' step: I would like to get the
data out of the ragged array by columns (so e.g. I end up with a
matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by
column from the ragged array.
I hope someone can help. My ugly code is below, with some data for
testing.
Stuart
Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior
Lecturer and Honorary Consultant Pychiatrist Institute of Mental
Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK
Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)
id.d <- cbind (ID,DATE )
rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create
ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
row # Since DATE is in 'year mo da', if there's only one date, sum
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t ) # all the
IDs with >1 date
rag.am <- rag.a [ multi.dates ] # rag.am
only has IDs with > 1 Date
# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.
This message and any attachment are intended solely for the
addressee and may contain confidential information. If you have
received this message in error, please send it back to me, and immediately delete
it. Please do not use, copy or disclose the information contained
in this message or in any attachment. Any views or opinions
expressed by the author of this email do not necessarily reflect the
views of the University of Nottingham.
This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage
your computer
system:
you are advised to perform your own checks. Email communications
with the University of Nottingham may be monitored as permitted by
UK legislation.
[[alternative HTML version deleted]]