How to pick colums from a ragged array?

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20121023/dbd58294/attachment.pl>
Hello,

I'm not sure I understand it well, in the solution below the only 
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing. 
The first is the vector to average (to apply FUN to) and the second is 
one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date).
Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID and (lowest or highest) DATE. The size of the dataset precludes the option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date).

-          (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array.

I hope someone can help. My ugly code is below, with some data for testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA
Clinical Senior Lecturer and Honorary Consultant Pychiatrist
Institute of Mental Health, Innovation Park
Triumph Road, Nottingham, Notts. NG7 2TU. UK
Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it.   Please do not use, copy or disclose the information contained in this message or in any attachment.  Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham.

This message has been checked for viruses but the contents of an attachment
may still contain software viruses which could damage your computer system:
you are advised to perform your own checks. Email communications with the
University of Nottingham may be monitored as permitted by UK legislation.
	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hello,

Thinking again, if you just want the first/last in each ID that repeats 
the DATE, the following function does the job. Since there were no such 
cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
}

getRepeat(id.d)  # defaults to first = TRUE
getRepeat(id.d, first = FALSE)  # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello,

I'm not sure I understand it well, in the solution below the only 
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing. 
The first is the vector to average (to apply FUN to) and the second is 
one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These individuals 
I need to identify and remove, as it's not possible to say uniquely 
what their first (or last) diagnosis was. So I need to identify and 
remove these individuals which have pairs of rows with the same ID 
and (lowest or highest) DATE. The size of the dataset precludes the 
option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which 
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-          (Then do the same to get the 'last appointment' 
duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data 
out of the ragged array by columns (so e.g. I end up with a matrix of 
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from 
the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA
Clinical Senior Lecturer and Honorary Consultant Pychiatrist
Institute of Mental Health, Innovation Park
Triumph Road, Nottingham, Notts. NG7 2TU. UK
Tel. +44 115 82 30419 
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create 
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will 
be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the 
IDs with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am 
only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee 
and may contain confidential information. If you have received this 
message in error, please send it back to me, and immediately delete 
it.   Please do not use, copy or disclose the information contained 
in this message or in any attachment.  Any views or opinions 
expressed by the author of this email do not necessarily reflect the 
views of the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment
may still contain software viruses which could damage your computer 
system:
you are advised to perform your own checks. Email communications with 
the
University of Nottingham may be monitored as permitted by UK 
legislation.
    [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!) 
I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)                
rownames(earliest[earliest%in%nd.b])   

This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!

Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).

Stuart

************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <- 
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...

******************************************************************

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt] 
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)  # defaults to first = TRUE getRepeat(id.d, first = FALSE)  # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello,

I'm not sure I understand it well, in the solution below the only 
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) 
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing. 
The first is the vector to average (to apply FUN to) and the second is 
one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These individuals 
I need to identify and remove, as it's not possible to say uniquely 
what their first (or last) diagnosis was. So I need to identify and 
remove these individuals which have pairs of rows with the same ID 
and (lowest or highest) DATE. The size of the dataset precludes the 
option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which 
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-          (Then do the same to get the 'last appointment' 
duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data 
out of the ragged array by columns (so e.g. I end up with a matrix of 
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from 
the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer 
and Honorary Consultant Pychiatrist Institute of Mental Health, 
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 
115 82 30419 
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <- 
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create 
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum) #add up the dates per 
row # Since DATE is in 'year mo da', if there's only one date, sum 
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the 
IDs with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am 
only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee 
and may contain confidential information. If you have received this 
message in error, please send it back to me, and immediately delete
it.   Please do not use, copy or disclose the information contained 
in this message or in any attachment.  Any views or opinions 
expressed by the author of this email do not necessarily reflect the 
views of the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment may still contain software viruses which could damage your 
computer
system:
you are advised to perform your own checks. Email communications with 
the University of Nottingham may be monitored as permitted by UK 
legislation.
    [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi

I did not check your code and rather followed your explanation. BTW, thanks for test data.

small change in data frame to make DATE as Date class

datum<-as.Date(as.character(DATE), format="%Y%m%d")
id.d <- data.frame(ID,datum )

ordering by date

id.d<-id.d[order(id.d$datum),]

two functions to test if first two dates are the same or last two dates are the same

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]

change one last date in the data frame to be the same as previous

id.d[35,2]<-id.d[36,2]

and here are results

sapply(split(id.d, id.d$ID), testlast)
   58   167   323   547   794   814   841   910   999  1019 
FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
sapply(split(id.d, id.d$ID), testfirst)
58   167   323   547   794   814   841   910   999  1019 
FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE

Now you can select ID which is true and remove it from your data
which(sapply(split(id.d, id.d$ID), testlast))

and use it for your data frame to subset/remove
id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE

However I am not sure if this is exactly what you want.

Regards
Petr
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Stuart Leask
Sent: Tuesday, October 23, 2012 11:38 AM
To: r-help at r-project.org
Subject: [R] [r] How to pick colums from a ragged array?

I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than one
diagnosis at their first (or last) appointment. These individuals I
need to identify and remove, as it's not possible to say uniquely what
their first (or last) diagnosis was. So I need to identify and remove
these individuals which have pairs of rows with the same ID and (lowest
or highest) DATE. The size of the dataset precludes the option of doing
this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).

-          (Then do the same to get the 'last appointment' duplicates,
by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data
out of the ragged array by columns (so e.g. I end up with a matrix of
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
the ragged array.

I hope someone can help. My ugly code is below, with some data for
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
and Honorary Consultant Pychiatrist Institute of Mental Health,
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the
dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be
less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the IDs
with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am only
has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please send it back to me, and immediately delete it.
Please do not use, copy or disclose the information contained in this
message or in any attachment.  Any views or opinions expressed by the
author of this email do not necessarily reflect the views of the
University of Nottingham.

This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage your
computer system:
you are advised to perform your own checks. Email communications with
the University of Nottingham may be monitored as permitted by UK
legislation.
	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi

Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date

fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]

if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else
sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))

if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,]
}
fff(id.d)
ID     DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224
fff(id.d, remove=T)
ID     DATE
1    58 20060821
2    58 20061207
3    58 20080102
4    58 20090904
5   167 20040205
6   167 20040323
7   323 20051111
8   323 20060111
9   323 20071119
10  323 20080107
11  323 20080407
12  323 20080521
13  323 20080711
14  547 20041005
15  794 20070905
16  814 20020814
17  814 20021125
18  814 20040429
19  814 20040429
20  814 20071205
21  814 20080227
22  841 20050421
23  841 20060130
24  841 20060428
25  841 20060602
26  841 20060816
27  841 20061025
28  841 20061129
29  841 20070112
30  841 20070514
39  999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330

Regards
Petr
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of PIKAL Petr
Sent: Tuesday, October 23, 2012 1:49 PM
To: Stuart Leask; r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi

I did not check your code and rather followed your explanation. BTW,
thanks for test data.

small change in data frame to make DATE as Date class

datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <-
data.frame(ID,datum )

ordering by date

id.d<-id.d[order(id.d$datum),]

two functions to test if first two dates are the same or last two dates
are the same

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]

change one last date in the data frame to be the same as previous

id.d[35,2]<-id.d[36,2]

and here are results

sapply(split(id.d, id.d$ID), testlast)
   58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE

sapply(split(id.d, id.d$ID), testfirst)
   58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE

Now you can select ID which is true and remove it from your data
which(sapply(split(id.d, id.d$ID), testlast))

and use it for your data frame to subset/remove id.d$ID ==
as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))  [1]
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
TRUE  TRUE [37]  TRUE  TRUE  TRUE  TRUE

However I am not sure if this is exactly what you want.

Regards
Petr

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Stuart Leask
Sent: Tuesday, October 23, 2012 11:38 AM
To: r-help at r-project.org
Subject: [R] [r] How to pick colums from a ragged array?

I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These individuals
I need to identify and remove, as it's not possible to say uniquely
what their first (or last) diagnosis was. So I need to identify and
remove these individuals which have pairs of rows with the same ID
and
(lowest or highest) DATE. The size of the dataset precludes the
option
of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).

-          (Then do the same to get the 'last appointment'
duplicates,
by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data
out of the ragged array by columns (so e.g. I end up with a matrix of
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
the ragged array.

I hope someone can help. My ugly code is below, with some data for
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
and Honorary Consultant Pychiatrist Institute of Mental Health,
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the
dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will
be
less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the
IDs
with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am
only
has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please send it back to me, and immediately delete
it.
Please do not use, copy or disclose the information contained in this
message or in any attachment.  Any views or opinions expressed by the
author of this email do not necessarily reflect the views of the
University of Nottingham.

This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage your
computer system:
you are advised to perform your own checks. Email communications with
the University of Nottingham may be monitored as permitted by UK
legislation.
	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html and provide commented, minimal, self-contained,
reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html
and provide commented, minimal, self-contained, reproducible code.
Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate. 
Test data is all - see below it failing.

So, I remain very grateful for your function!

Stuart 

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )
# rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME
 # Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the dates per row
 # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]         # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]         # list of dup'd dates

earliest<-tapply(DATE,ID,min)  # table of mins
rownames(earliest[earliest%in%nd.b])   # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, 
# and an earliest date that is the same as another ID's earliest+duplicate.

-----Original Message-----
From: Leask Stuart 
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)                
rownames(earliest[earliest%in%nd.b])   

This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!

Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).

Stuart

************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...

******************************************************************

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)  # defaults to first = TRUE getRepeat(id.d, first = FALSE)  # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello,

I'm not sure I understand it well, in the solution below the only 
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) 
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing. 
The first is the vector to average (to apply FUN to) and the second is 
one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These individuals 
I need to identify and remove, as it's not possible to say uniquely 
what their first (or last) diagnosis was. So I need to identify and 
remove these individuals which have pairs of rows with the same ID 
and (lowest or highest) DATE. The size of the dataset precludes the 
option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which 
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-          (Then do the same to get the 'last appointment' 
duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data 
out of the ragged array by columns (so e.g. I end up with a matrix of 
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from 
the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer 
and Honorary Consultant Pychiatrist Institute of Mental Health, 
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create 
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum) #add up the dates per 
row # Since DATE is in 'year mo da', if there's only one date, sum 
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the 
IDs with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am 
only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee 
and may contain confidential information. If you have received this 
message in error, please send it back to me, and immediately delete
it.   Please do not use, copy or disclose the information contained 
in this message or in any attachment.  Any views or opinions 
expressed by the author of this email do not necessarily reflect the 
views of the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment may still contain software viruses which could damage your 
computer
system:
you are advised to perform your own checks. Email communications with 
the University of Nottingham may be monitored as permitted by UK 
legislation.
    [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi there.

Not sure I follow what you are doing.

I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.

I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date):

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

Correct output: 
"167"  "841"  "1019"

Stuart

-----Original Message-----
From: PIKAL Petr [mailto:petr.pikal at precheza.cz] 
Sent: 23 October 2012 13:15
To: Stuart Leask; r-help at r-project.org
Subject: RE: [r] How to pick colums from a ragged array?

Hi

Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date

fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]

if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))

if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
fff(id.d)
ID     DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224
fff(id.d, remove=T)
ID     DATE
1    58 20060821
2    58 20061207
3    58 20080102
4    58 20090904
5   167 20040205
6   167 20040323
7   323 20051111
8   323 20060111
9   323 20071119
10  323 20080107
11  323 20080407
12  323 20080521
13  323 20080711
14  547 20041005
15  794 20070905
16  814 20020814
17  814 20021125
18  814 20040429
19  814 20040429
20  814 20071205
21  814 20080227
22  841 20050421
23  841 20060130
24  841 20060428
25  841 20060602
26  841 20060816
27  841 20061025
28  841 20061129
29  841 20070112
30  841 20070514
39  999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330

Regards
Petr
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- 
project.org] On Behalf Of PIKAL Petr
Sent: Tuesday, October 23, 2012 1:49 PM
To: Stuart Leask; r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi

I did not check your code and rather followed your explanation. BTW, 
thanks for test data.

small change in data frame to make DATE as Date class

datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- 
data.frame(ID,datum )

ordering by date

id.d<-id.d[order(id.d$datum),]

two functions to test if first two dates are the same or last two 
dates are the same

testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) 
x[length(x),2]==x[length(x)-1,2]

change one last date in the data frame to be the same as previous

id.d[35,2]<-id.d[36,2]

and here are results

sapply(split(id.d, id.d$ID), testlast)
   58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE

sapply(split(id.d, id.d$ID), testfirst)
   58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE

Now you can select ID which is true and remove it from your data 
which(sapply(split(id.d, id.d$ID), testlast))

and use it for your data frame to subset/remove id.d$ID == 
as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))  [1] 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
FALSE TRUE  TRUE [37]  TRUE  TRUE  TRUE  TRUE

However I am not sure if this is exactly what you want.

Regards
Petr

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- 
project.org] On Behalf Of Stuart Leask
Sent: Tuesday, October 23, 2012 11:38 AM
To: r-help at r-project.org
Subject: [R] [r] How to pick colums from a ragged array?

I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These 
individuals I need to identify and remove, as it's not possible to 
say uniquely what their first (or last) diagnosis was. So I need to 
identify and remove these individuals which have pairs of rows with 
the same ID
and
(lowest or highest) DATE. The size of the dataset precludes the
option
of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-          (Then do the same to get the 'last appointment'
duplicates,
by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the 
data out of the ragged array by columns (so e.g. I end up with a 
matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by 
column from the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior 
Lecturer and Honorary Consultant Pychiatrist Institute of Mental 
Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK 
Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the
dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will
be
less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the
IDs
with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am
only
has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the 
addressee and may contain confidential information. If you have 
received this message in error, please send it back to me, and 
immediately delete
it.
Please do not use, copy or disclose the information contained in 
this message or in any attachment.  Any views or opinions expressed 
by the author of this email do not necessarily reflect the views of 
the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment may still contain software viruses which could damage 
your computer system:
you are advised to perform your own checks. Email communications 
with the University of Nottingham may be monitored as permitted by 
UK legislation.
	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting- 
guide.html and provide commented, minimal, self-contained, 
reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting- 
guide.html and provide commented, minimal, self-contained, 
reproducible code.
Sorry, I must be a bit thick.!
getRepeat gives me the data with duplicates - but I don't seem to be able to manipulate the result. It looks like a list of dataframes:
g.r<-getRepeat(id.d)
dim(g.r)
NULL
summary(g.r)
Length Class      Mode
[1,] 2      data.frame list
[2,] 2      data.frame list
[3,] 2      data.frame list

This leaves me with the same problem I had with my ragged array i.e. how do I put all the second elements from this long list of data frames, into a single list?

I need to end up with a list of all the IDs that have duplicate first (or last) DATES.

Stuart

-----Original Message-----
From: Leask Stuart 
Sent: 23 October 2012 13:17
To: 'Rui Barradas'
Cc: 'r-help at r-project.org'
Subject: RE: [R] [r] How to pick colums from a ragged array?

Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate. 
Test data is all - see below it failing.

So, I remain very grateful for your function!

Stuart 

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )
# rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME
 # Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the dates per row
 # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]         # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]         # list of dup'd dates

earliest<-tapply(DATE,ID,min)  # table of mins
rownames(earliest[earliest%in%nd.b])   # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, # and an earliest date that is the same as another ID's earliest+duplicate.

-----Original Message-----
From: Leask Stuart 
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)                
rownames(earliest[earliest%in%nd.b])   

This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!

Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).

Stuart

************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...

******************************************************************

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)  # defaults to first = TRUE getRepeat(id.d, first = FALSE)  # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello,

I'm not sure I understand it well, in the solution below the only 
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) 
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing. 
The first is the vector to average (to apply FUN to) and the second is 
one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These individuals 
I need to identify and remove, as it's not possible to say uniquely 
what their first (or last) diagnosis was. So I need to identify and 
remove these individuals which have pairs of rows with the same ID 
and (lowest or highest) DATE. The size of the dataset precludes the 
option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which 
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-          (Then do the same to get the 'last appointment' 
duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data 
out of the ragged array by columns (so e.g. I end up with a matrix of 
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from 
the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer 
and Honorary Consultant Pychiatrist Institute of Mental Health, 
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create 
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum) #add up the dates per 
row # Since DATE is in 'year mo da', if there's only one date, sum 
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the 
IDs with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am 
only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee 
and may contain confidential information. If you have received this 
message in error, please send it back to me, and immediately delete
it.   Please do not use, copy or disclose the information contained 
in this message or in any attachment.  Any views or opinions 
expressed by the author of this email do not necessarily reflect the 
views of the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment may still contain software viruses which could damage your 
computer
system:
you are advised to perform your own checks. Email communications with 
the University of Nottingham may be monitored as permitted by UK 
legislation.
    [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hello,

You're right, getRepeat returns a list of data.frames, one per each ID. 
To put them all in the same df use

do.call(rbind, g.r)

Rui Barradas
Em 23-10-2012 13:36, Stuart Leask escreveu:
Sorry, I must be a bit thick.!
getRepeat gives me the data with duplicates - but I don't seem to be able to manipulate the result. It looks like a list of dataframes:

g.r<-getRepeat(id.d)
dim(g.r)
NULL
summary(g.r)
      Length Class      Mode
[1,] 2      data.frame list
[2,] 2      data.frame list
[3,] 2      data.frame list

This leaves me with the same problem I had with my ragged array i.e. how do I put all the second elements from this long list of data frames, into a single list?

I need to end up with a list of all the IDs that have duplicate first (or last) DATES.

Stuart

-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 13:17
To: 'Rui Barradas'
Cc: 'r-help at r-project.org'
Subject: RE: [R] [r] How to pick colums from a ragged array?

Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
Test data is all - see below it failing.

So, I remain very grateful for your function!

Stuart

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
  ,20091224,20050503,19870508,19870508,19880330)

  id.d <- cbind (ID,DATE )
# rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME
  # Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the dates per row
  # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]         # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]         # list of dup'd dates

earliest<-tapply(DATE,ID,min)  # table of mins
rownames(earliest[earliest%in%nd.b])   # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, # and an earliest date that is the same as another ID's earliest+duplicate.

-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)
rownames(earliest[earliest%in%nd.b])

This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!

Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).

Stuart

************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
  ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
  ,20091224,20050503,19870508,19870508,19880330)

  id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...

******************************************************************

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
      fun <- if(first) head else tail
      sp <- split(data.frame(x), x[,1])
      first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
      lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
      n <- unlist(lapply(lst, sum))
      sp1 <- sp[n > 1]
      i1 <- lst[n > 1]
      lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)  # defaults to first = TRUE getRepeat(id.d, first = FALSE)  # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello,

I'm not sure I understand it well, in the solution below the only
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing.
The first is the vector to average (to apply FUN to) and the second is
one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These individuals
I need to identify and remove, as it's not possible to say uniquely
what their first (or last) diagnosis was. So I need to identify and
remove these individuals which have pairs of rows with the same ID
and (lowest or highest) DATE. The size of the dataset precludes the
option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).

-          (Then do the same to get the 'last appointment'
duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data
out of the ragged array by columns (so e.g. I end up with a matrix of
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
the ragged array.

I hope someone can help. My ugly code is below, with some data for
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
and Honorary Consultant Pychiatrist Institute of Mental Health,
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
row # Since DATE is in 'year mo da', if there's only one date, sum
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the
IDs with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am
only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please send it back to me, and immediately delete
it.   Please do not use, copy or disclose the information contained
in this message or in any attachment.  Any views or opinions
expressed by the author of this email do not necessarily reflect the
views of the University of Nottingham.

This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage your
computer
system:
you are advised to perform your own checks. Email communications with
the University of Nottingham may be monitored as permitted by UK
legislation.
     [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi
-----Original Message-----
From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
Sent: Tuesday, October 23, 2012 2:29 PM
To: PIKAL Petr; r-help at r-project.org
Subject: RE: [r] How to pick colums from a ragged array?

Hi there.

Not sure I follow what you are doing.

I want a list of all the IDs that have duplicate DATE entries, only
when the DATE is the earliest (or last) date for that ID.
And that is what the function (with 3 small modifications) does

fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]

if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst))))) else
sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast)))))

if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
}

See the result of your refined data

fff(id.d)
     ID       DATE
5   167 2004-02-05
6   167 2004-02-05
22  841 2005-04-21
23  841 2005-04-21
24  841 2006-04-28
25  841 2006-06-02
26  841 2006-08-16
27  841 2006-10-25
28  841 2006-11-29
29  841 2007-01-12
30  841 2007-05-14
38 1019 1987-05-08
39 1019 1987-05-08
40 1019 1988-03-30
fff(id.d, first=F)
ID       DATE
5 167 2004-02-05
6 167 2004-02-05
fff(id.d, remove=T)
ID       DATE
1   58 2006-08-21
2   58 2006-12-07
3   58 2008-01-02
4   58 2009-09-04
7  323 2005-11-11
8  323 2006-01-11
9  323 2007-11-19
10 323 2008-01-07
11 323 2008-04-07
12 323 2008-05-21
13 323 2008-07-11
14 547 2004-10-05
15 794 2007-09-05
16 814 2002-08-14
17 814 2002-11-25
18 814 2004-04-29
19 814 2004-04-29
20 814 2007-12-05
21 814 2008-02-27
31 910 1987-05-08
32 910 2004-02-05
33 910 2004-02-05
34 910 2009-11-20
35 910 2009-12-10
36 910 2009-12-24
37 999 2005-05-03

You can do surgery on fff function to see what result comes from some piece of the function e.g.

sapply(split(id.d, id.d[,1]), testlast)

Regards
Petr
I have refined my test dataset, to include some tests (e.g. 910 has the
same dup as 1019, but for 910 it's not the earliest date):

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

Correct output:
"167"  "841"  "1019"

Stuart

-----Original Message-----
From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
Sent: 23 October 2012 13:15
To: Stuart Leask; r-help at r-project.org
Subject: RE: [r] How to pick colums from a ragged array?

Hi

Rui's answer brought me to more elaborated solution which still needs
data frame to be ordered by date

fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]

if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]),
testfirst)))) else sel <- as.numeric(names(which(sapply(split(data,
data[,1]), testlast))))

if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }

fff(id.d)
    ID     DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224

fff(id.d, remove=T)
     ID     DATE
1    58 20060821
2    58 20061207
3    58 20080102
4    58 20090904
5   167 20040205
6   167 20040323
7   323 20051111
8   323 20060111
9   323 20071119
10  323 20080107
11  323 20080407
12  323 20080521
13  323 20080711
14  547 20041005
15  794 20070905
16  814 20020814
17  814 20021125
18  814 20040429
19  814 20040429
20  814 20071205
21  814 20080227
22  841 20050421
23  841 20060130
24  841 20060428
25  841 20060602
26  841 20060816
27  841 20061025
28  841 20061129
29  841 20070112
30  841 20070514
39  999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330

Regards
Petr

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of PIKAL Petr
Sent: Tuesday, October 23, 2012 1:49 PM
To: Stuart Leask; r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi

I did not check your code and rather followed your explanation. BTW,
thanks for test data.

small change in data frame to make DATE as Date class

datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <-
data.frame(ID,datum )

ordering by date

id.d<-id.d[order(id.d$datum),]

two functions to test if first two dates are the same or last two
dates are the same

testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x)
x[length(x),2]==x[length(x)-1,2]

change one last date in the data frame to be the same as previous

id.d[35,2]<-id.d[36,2]

and here are results

sapply(split(id.d, id.d$ID), testlast)
   58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE

sapply(split(id.d, id.d$ID), testfirst)
   58   167   323   547   794   814   841   910   999  1019
FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE

Now you can select ID which is true and remove it from your data
which(sapply(split(id.d, id.d$ID), testlast))

and use it for your data frame to subset/remove id.d$ID ==
as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))  [1]
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
FALSE TRUE  TRUE [37]  TRUE  TRUE  TRUE  TRUE

However I am not sure if this is exactly what you want.

Regards
Petr

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Stuart Leask
Sent: Tuesday, October 23, 2012 11:38 AM
To: r-help at r-project.org
Subject: [R] [r] How to pick colums from a ragged array?

I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on
that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These
individuals I need to identify and remove, as it's not possible to
say uniquely what their first (or last) diagnosis was. So I need to
identify and remove these individuals which have pairs of rows with
the same ID
and
(lowest or highest) DATE. The size of the dataset precludes the
option
of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for
which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).

-          (Then do the same to get the 'last appointment'
duplicates,
by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the
data out of the ragged array by columns (so e.g. I end up with a
matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by
column from the ragged array.

I hope someone can help. My ugly code is below, with some data for
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior
Lecturer and Honorary Consultant Pychiatrist Institute of Mental
Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU.
UK
Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up
the
dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will
be
less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the
IDs
with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am
only
has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the
addressee and may contain confidential information. If you have
received this message in error, please send it back to me, and
immediately delete
it.
Please do not use, copy or disclose the information contained in
this message or in any attachment.  Any views or opinions expressed
by the author of this email do not necessarily reflect the views of
the University of Nottingham.

This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage
your computer system:
you are advised to perform your own checks. Email communications
with the University of Nottingham may be monitored as permitted by
UK legislation.
	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html and provide commented, minimal, self-contained,
reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html and provide commented, minimal, self-contained,
reproducible code.
HI,
I was not following the thread.
May be this is what you are looking for:
new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]

tapply(new1$ID,new1$DATE,head,1)
#19870508 20040205 20040429 20050421 
? #? 1019????? 167????? 814????? 841 
A.K.

----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: PIKAL Petr <petr.pikal at precheza.cz>; "r-help at r-project.org" <r-help at r-project.org>
Cc: 
Sent: Tuesday, October 23, 2012 8:28 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi there.

Not sure I follow what you are doing.

I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.

I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date):

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

Correct output: 
"167"? "841"? "1019"

Stuart

-----Original Message-----
From: PIKAL Petr [mailto:petr.pikal at precheza.cz] 
Sent: 23 October 2012 13:15
To: Stuart Leask; r-help at r-project.org
Subject: RE: [r] How to pick colums from a ragged array?

Hi

Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date

fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]

if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))

if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
fff(id.d)
? ? ID? ?  DATE
31 910 20091105
32 910 20091105
33 910 20091117
34 910 20091119
35 910 20091120
36 910 20091210
37 910 20091224
38 910 20091224
fff(id.d, remove=T)
? ?  ID? ?  DATE
1? ? 58 20060821
2? ? 58 20061207
3? ? 58 20080102
4? ? 58 20090904
5?  167 20040205
6?  167 20040323
7?  323 20051111
8?  323 20060111
9?  323 20071119
10? 323 20080107
11? 323 20080407
12? 323 20080521
13? 323 20080711
14? 547 20041005
15? 794 20070905
16? 814 20020814
17? 814 20021125
18? 814 20040429
19? 814 20040429
20? 814 20071205
21? 814 20080227
22? 841 20050421
23? 841 20060130
24? 841 20060428
25? 841 20060602
26? 841 20060816
27? 841 20061025
28? 841 20061129
29? 841 20070112
30? 841 20070514
39? 999 20050503
40 1019 19870508
41 1019 19880223
42 1019 19880330
43 1019 19880330

Regards
Petr
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- 
project.org] On Behalf Of PIKAL Petr
Sent: Tuesday, October 23, 2012 1:49 PM
To: Stuart Leask; r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hi

I did not check your code and rather followed your explanation. BTW, 
thanks for test data.

small change in data frame to make DATE as Date class

datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- 
data.frame(ID,datum )

ordering by date

id.d<-id.d[order(id.d$datum),]

two functions to test if first two dates are the same or last two 
dates are the same

testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) 
x[length(x),2]==x[length(x)-1,2]

change one last date in the data frame to be the same as previous

id.d[35,2]<-id.d[36,2]

and here are results

sapply(split(id.d, id.d$ID), testlast)
? ? 58?  167?  323?  547?  794?  814?  841?  910?  999? 1019
FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE? TRUE? ? NA FALSE

sapply(split(id.d, id.d$ID), testfirst)
? ? 58?  167?  323?  547?  794?  814?  841?  910?  999? 1019
FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE FALSE? ? NA FALSE

Now you can select ID which is true and remove it from your data 
which(sapply(split(id.d, id.d$ID), testlast))

and use it for your data frame to subset/remove id.d$ID == 
as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))? [1] 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
FALSE TRUE? TRUE [37]? TRUE? TRUE? TRUE? TRUE

However I am not sure if this is exactly what you want.

Regards
Petr

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- 
project.org] On Behalf Of Stuart Leask
Sent: Tuesday, October 23, 2012 11:38 AM
To: r-help at r-project.org
Subject: [R] [r] How to pick colums from a ragged array?

I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-? ? ? ? ? define groups of patients by the first diagnosis given

-? ? ? ? ? define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These 
individuals I need to identify and remove, as it's not possible to 
say uniquely what their first (or last) diagnosis was. So I need to 
identify and remove these individuals which have pairs of rows with 
the same ID
and
(lowest or highest) DATE. The size of the dataset precludes the
option
of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-? ? ? ? ? Sort by DATE then ID

-? ? ? ? ? Make a ragged array of DATE by ID

-? ? ? ? ? Remove IDs that only occur once.

-? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-? ? ? ? ? (Then do the same to get the 'last appointment'
duplicates,
by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the 
data out of the ragged array by columns (so e.g. I end up with a 
matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by 
column from the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior 
Lecturer and Honorary Consultant Pychiatrist Institute of Mental 
Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK 
Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ?  # create
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ?  #add up the
dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will
be
less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ?  # all the
IDs
with >1 date
rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ?  # rag.am
only
has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the 
addressee and may contain confidential information. If you have 
received this message in error, please send it back to me, and 
immediately delete
it.
Please do not use, copy or disclose the information contained in 
this message or in any attachment.? Any views or opinions expressed 
by the author of this email do not necessarily reflect the views of 
the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment may still contain software viruses which could damage 
your computer system:
you are advised to perform your own checks. Email communications 
with the University of Nottingham may be monitored as permitted by 
UK legislation.
??? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting- 
guide.html and provide commented, minimal, self-contained, 
reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting- 
guide.html and provide commented, minimal, self-contained, 
reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
So I get my list of IDs to exclude from:

g.rr<-do.call(rbind, g.r)[1]
dim(g.rr)
g.rr[1:(dim(g.rr)[1]/2)]

Many thanks.

Stuart

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt] 
Sent: 23 October 2012 13:42
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: FW: [R] [r] How to pick colums from a ragged array?

Hello,

You're right, getRepeat returns a list of data.frames, one per each ID. 
To put them all in the same df use

do.call(rbind, g.r)

Rui Barradas
Em 23-10-2012 13:36, Stuart Leask escreveu:
Sorry, I must be a bit thick.!
getRepeat gives me the data with duplicates - but I don't seem to be able to manipulate the result. It looks like a list of dataframes:

g.r<-getRepeat(id.d)
dim(g.r)
NULL
summary(g.r)
      Length Class      Mode
[1,] 2      data.frame list
[2,] 2      data.frame list
[3,] 2      data.frame list

This leaves me with the same problem I had with my ragged array i.e. how do I put all the second elements from this long list of data frames, into a single list?

I need to end up with a list of all the IDs that have duplicate first (or last) DATES.

Stuart

-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 13:17
To: 'Rui Barradas'
Cc: 'r-help at r-project.org'
Subject: RE: [R] [r] How to pick colums from a ragged array?

Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
Test data is all - see below it failing.

So, I remain very grateful for your function!

Stuart

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
  ,20091224,20050503,19870508,19870508,19880330)

  id.d <- cbind (ID,DATE )
# rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME
  # Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up the dates per row
  # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]         # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]         # list of dup'd dates

earliest<-tapply(DATE,ID,min)  # table of mins
rownames(earliest[earliest%in%nd.b])   # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, # and an earliest date that is the same as another ID's earliest+duplicate.

-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)
rownames(earliest[earliest%in%nd.b])

This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!

Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).

Stuart

************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
  ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
  ,20091224,20050503,19870508,19870508,19880330)

  id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...

******************************************************************

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- 
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,200912
10
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
      fun <- if(first) head else tail
      sp <- split(data.frame(x), x[,1])
      first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
      lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
      n <- unlist(lapply(lst, sum))
      sp1 <- sp[n > 1]
      i1 <- lst[n > 1]
      lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)  # defaults to first = TRUE getRepeat(id.d, first = 
FALSE)  # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello,

I'm not sure I understand it well, in the solution below the only 
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) 
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing.
The first is the vector to average (to apply FUN to) and the second 
is one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-          define groups of patients by the first diagnosis given

-          define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These 
individuals I need to identify and remove, as it's not possible to 
say uniquely what their first (or last) diagnosis was. So I need to 
identify and remove these individuals which have pairs of rows with 
the same ID and (lowest or highest) DATE. The size of the dataset 
precludes the option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-          Sort by DATE then ID

-          Make a ragged array of DATE by ID

-          Remove IDs that only occur once.

-          Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-          (Then do the same to get the 'last appointment'
duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the 
data out of the ragged array by columns (so e.g. I end up with a 
matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by 
column from the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior 
Lecturer and Honorary Consultant Pychiatrist Institute of Mental 
Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK 
Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum) #add up the dates per 
row # Since DATE is in 'year mo da', if there's only one date, sum 
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )                         # all the
IDs with >1 date
rag.am <- rag.a [ multi.dates ]                           # rag.am
only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the 
addressee and may contain confidential information. If you have 
received this message in error, please send it back to me, and immediately delete
it.   Please do not use, copy or disclose the information contained
in this message or in any attachment.  Any views or opinions 
expressed by the author of this email do not necessarily reflect the 
views of the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment may still contain software viruses which could damage 
your computer
system:
you are advised to perform your own checks. Email communications 
with the University of Nottingham may be monitored as permitted by 
UK legislation.
     [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi Stuart,

This also should get you the IDs you wanted.
new1<-id.d[duplicated(id.d[,2])|duplicated(id.d[,2],fromLast=TRUE),]
earliest <- tapply ( DATE, ID, min)? ? ? ? ? ? ? ? 
?rownames(earliest[earliest%in% new1])
#[1] "167"? "841"? "1019"
A.K.

----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: Rui Barradas <ruipbarradas at sapo.pt>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Tuesday, October 23, 2012 7:37 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!) 
I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many? <-? ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many? > 1,? ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)? ? ? ? ? ? ? ? 
rownames(earliest[earliest%in%nd.b])? 

This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!

Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).

Stuart

************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <- 
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply? ( DATE, ID, min)? ? ? ? ? ? ? ? ? ? # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )?  # IDs of duplicates at the earliest date for that individual. I think...

******************************************************************

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt] 
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
? ?  fun <- if(first) head else tail
? ?  sp <- split(data.frame(x), x[,1])
? ?  first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
? ?  lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
? ?  n <- unlist(lapply(lst, sum))
? ?  sp1 <- sp[n > 1]
? ?  i1 <- lst[n > 1]
? ?  lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)? # defaults to first = TRUE getRepeat(id.d, first = FALSE)? # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello,

I'm not sure I understand it well, in the solution below the only 
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) 
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing. 
The first is the vector to average (to apply FUN to) and the second is 
one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-? ? ? ? ? define groups of patients by the first diagnosis given

-? ? ? ? ? define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These individuals 
I need to identify and remove, as it's not possible to say uniquely 
what their first (or last) diagnosis was. So I need to identify and 
remove these individuals which have pairs of rows with the same ID 
and (lowest or highest) DATE. The size of the dataset precludes the 
option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-? ? ? ? ? Sort by DATE then ID

-? ? ? ? ? Make a ragged array of DATE by ID

-? ? ? ? ? Remove IDs that only occur once.

-? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which 
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-? ? ? ? ? (Then do the same to get the 'last appointment' 
duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data 
out of the ragged array by columns (so e.g. I end up with a matrix of 
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from 
the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer 
and Honorary Consultant Pychiatrist Institute of Mental Health, 
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 
115 82 30419 
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <- 
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ?  # create 
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum) #add up the dates per 
row # Since DATE is in 'year mo da', if there's only one date, sum 
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ?  # all the 
IDs with >1 date
rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ?  # rag.am 
only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee 
and may contain confidential information. If you have received this 
message in error, please send it back to me, and immediately delete
it.?  Please do not use, copy or disclose the information contained 
in this message or in any attachment.? Any views or opinions 
expressed by the author of this email do not necessarily reflect the 
views of the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment may still contain software viruses which could damage your 
computer
system:
you are advised to perform your own checks. Email communications with 
the University of Nottingham may be monitored as permitted by UK 
legislation.
? ?  [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi,

?res1<-data.frame(col=sapply(tapply(DATE,ID,function(x)? duplicated(head(x,2))),function(x) x[2]))
row.names(subset(res1,col==TRUE))
#[1] "167"? "841"? "1019"
#assuming that dates are sorted

A.K.

----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: Rui Barradas <ruipbarradas at sapo.pt>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Tuesday, October 23, 2012 8:17 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate. 
Test data is all - see below it failing.

So, I remain very grateful for your function!

Stuart 

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )
# rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ?  # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ?  #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ?  # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ?  # rag.am only has IDs with > 1 Date

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]? ? ? ?  # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]? ? ? ?  # list of dup'd dates

earliest<-tapply(DATE,ID,min)? # table of mins
rownames(earliest[earliest%in%nd.b])?  # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, 
# and an earliest date that is the same as another ID's earliest+duplicate.

-----Original Message-----
From: Leask Stuart 
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many? <-? ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many? > 1,? ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)? ? ? ? ? ? ? ? 
rownames(earliest[earliest%in%nd.b])? 

This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!

Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).

Stuart

************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply? ( DATE, ID, min)? ? ? ? ? ? ? ? ? ? # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )?  # IDs of duplicates at the earliest date for that individual. I think...

******************************************************************

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
? ?  fun <- if(first) head else tail
? ?  sp <- split(data.frame(x), x[,1])
? ?  first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
? ?  lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
? ?  n <- unlist(lapply(lst, sum))
? ?  sp1 <- sp[n > 1]
? ?  i1 <- lst[n > 1]
? ?  lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)? # defaults to first = TRUE getRepeat(id.d, first = FALSE)? # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello,

I'm not sure I understand it well, in the solution below the only 
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) 
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing. 
The first is the vector to average (to apply FUN to) and the second is 
one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-? ? ? ? ? define groups of patients by the first diagnosis given

-? ? ? ? ? define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These individuals 
I need to identify and remove, as it's not possible to say uniquely 
what their first (or last) diagnosis was. So I need to identify and 
remove these individuals which have pairs of rows with the same ID 
and (lowest or highest) DATE. The size of the dataset precludes the 
option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-? ? ? ? ? Sort by DATE then ID

-? ? ? ? ? Make a ragged array of DATE by ID

-? ? ? ? ? Remove IDs that only occur once.

-? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which 
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-? ? ? ? ? (Then do the same to get the 'last appointment' 
duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data 
out of the ragged array by columns (so e.g. I end up with a matrix of 
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from 
the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer 
and Honorary Consultant Pychiatrist Institute of Mental Health, 
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ?  # create 
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum) #add up the dates per 
row # Since DATE is in 'year mo da', if there's only one date, sum 
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ?  # all the 
IDs with >1 date
rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ?  # rag.am 
only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee 
and may contain confidential information. If you have received this 
message in error, please send it back to me, and immediately delete
it.?  Please do not use, copy or disclose the information contained 
in this message or in any attachment.? Any views or opinions 
expressed by the author of this email do not necessarily reflect the 
views of the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment may still contain software viruses which could damage your 
computer
system:
you are advised to perform your own checks. Email communications with 
the University of Nottingham may be monitored as permitted by UK 
legislation.
? ?  [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Hi,
I read ur first post and I think I understand what you meant.? 
I guess this should work.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )
res1<- data.frame(flag=tapply(id.d[,2],id.d[,1],FUN=function(x) head(duplicated(x)|duplicated(x,fromLast=TRUE),1)|tail(duplicated(x)|duplicated(x,fromLast=TRUE),1)))
?id.d[id.d[,1]%in%names(res1[res1$flag==TRUE,]),]
?# ???? ID???? DATE
?#[1,]? 167 20040205
?#[2,]? 167 20040205
?#[3,]? 841 20050421
?#[4,]? 841 20050421
?#[5,]? 841 20060428
?#[6,]? 841 20060602
?#[7,]? 841 20060816
?#[8,]? 841 20061025
?#[9,]? 841 20061129
#[10,]? 841 20070112
#[11,]? 841 20070514
#[12,] 1019 19870508
#[13,] 1019 19870508
#[14,] 1019 19880330

So, these IDs should be deleted if I am correct.
A.K.

----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: Rui Barradas <ruipbarradas at sapo.pt>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Tuesday, October 23, 2012 8:17 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?

Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate. 
Test data is all - see below it failing.

So, I remain very grateful for your function!

Stuart 

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )
# rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ?  # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ?  #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ?  # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ?  # rag.am only has IDs with > 1 Date

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]? ? ? ?  # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]? ? ? ?  # list of dup'd dates

earliest<-tapply(DATE,ID,min)? # table of mins
rownames(earliest[earliest%in%nd.b])?  # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, 
# and an earliest date that is the same as another ID's earliest+duplicate.

-----Original Message-----
From: Leask Stuart 
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many? <-? ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many? > 1,? ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)? ? ? ? ? ? ? ? 
rownames(earliest[earliest%in%nd.b])? 

This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!

Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).

Stuart

************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)

id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply? ( DATE, ID, min)? ? ? ? ? ? ? ? ? ? # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )?  # IDs of duplicates at the earliest date for that individual. I think...

******************************************************************

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
? ?  fun <- if(first) head else tail
? ?  sp <- split(data.frame(x), x[,1])
? ?  first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
? ?  lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
? ?  n <- unlist(lapply(lst, sum))
? ?  sp1 <- sp[n > 1]
? ?  i1 <- lst[n > 1]
? ?  lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)? # defaults to first = TRUE getRepeat(id.d, first = FALSE)? # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello,

I'm not sure I understand it well, in the solution below the only 
returned value is ID == 814 but it's not the first nor the last DATE.

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) 
id.d[how.many > 1, ]

See the help page for ?ave if the repetition of id.d[,1] is confusing. 
The first is the vector to average (to apply FUN to) and the second is 
one of thw two vectors defining the groups.

Hope this helps,

Rui Barradas
Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID 
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
date).
Patients may have been assigned more than one diagnosis at any one 
appointment - leading to two rows, same ID and DATE but different 
DIAGNOSIS.
The diagnoses may change between appointments.

I want to subset the data in two ways:

-? ? ? ? ? define groups of patients by the first diagnosis given

-? ? ? ? ? define groups of patients by the last diagnosis given.

The problem:
Unfortunately, a small number of patients have been given more than 
one diagnosis at their first (or last) appointment. These individuals 
I need to identify and remove, as it's not possible to say uniquely 
what their first (or last) diagnosis was. So I need to identify and 
remove these individuals which have pairs of rows with the same ID 
and (lowest or highest) DATE. The size of the dataset precludes the 
option of doing this by eye.

I suspect there is a very elegant way of doing this in R.

This is what I've come up with:

-? ? ? ? ? Sort by DATE then ID

-? ? ? ? ? Make a ragged array of DATE by ID

-? ? ? ? ? Remove IDs that only occur once.

-? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which 
this = zero, as this will only be true for IDs for which the 
appointment is recorded twice (because there were two diagnoses 
recorded on this date).

-? ? ? ? ? (Then do the same to get the 'last appointment' 
duplicates, by reversing the initial sort by DATE.)

I am stuck at the 'Subtract dates' step: I would like to get the data 
out of the ragged array by columns (so e.g. I end up with a matrix of 
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from 
the ragged array.

I hope someone can help. My ugly code is below, with some data for 
testing.

Stuart

Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer 
and Honorary Consultant Pychiatrist Institute of Mental Health, 
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)

id.d <- cbind (ID,DATE )
rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ?  # create 
ragged array, 1-n DATES for every NAME

# Inelegant attempt to remove IDs that only have one entry:

rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum) #add up the dates per 
row # Since DATE is in 'year mo da', if there's only one date, sum 
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ?  # all the 
IDs with >1 date
rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ?  # rag.am 
only has IDs with > 1 Date

# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.

This message and any attachment are intended solely for the addressee 
and may contain confidential information. If you have received this 
message in error, please send it back to me, and immediately delete
it.?  Please do not use, copy or disclose the information contained 
in this message or in any attachment.? Any views or opinions 
expressed by the author of this email do not necessarily reflect the 
views of the University of Nottingham.

This message has been checked for viruses but the contents of an 
attachment may still contain software viruses which could damage your 
computer
system:
you are advised to perform your own checks. Email communications with 
the University of Nottingham may be monitored as permitted by UK 
legislation.
? ?  [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.