An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20121023/dbd58294/attachment.pl>
How to pick colums from a ragged array?
16 messages · PIKAL Petr, root, Rui Barradas +1 more
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: - define groups of patients by the first diagnosis given - define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID and (lowest or highest) DATE. The size of the dataset precludes the option of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: - Sort by DATE then ID - Make a ragged array of DATE by ID - Remove IDs that only occur once. - Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). - (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t ) # all the IDs with >1 date rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it. Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello,
Thinking again, if you just want the first/last in each ID that repeats
the DATE, the following function does the job. Since there were no such
cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
fun <- if(first) head else tail
sp <- split(data.frame(x), x[,1])
first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
n <- unlist(lapply(lst, sum))
sp1 <- sp[n > 1]
i1 <- lst[n > 1]
lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ])
}
getRepeat(id.d) # defaults to first = TRUE
getRepeat(id.d, first = FALSE) # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.
I want to subset the data in two ways:
- define groups of patients by the first diagnosis given
- define groups of patients by the last diagnosis given.
The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These individuals
I need to identify and remove, as it's not possible to say uniquely
what their first (or last) diagnosis was. So I need to identify and
remove these individuals which have pairs of rows with the same ID
and (lowest or highest) DATE. The size of the dataset precludes the
option of doing this by eye.
I suspect there is a very elegant way of doing this in R.
This is what I've come up with:
- Sort by DATE then ID
- Make a ragged array of DATE by ID
- Remove IDs that only occur once.
- Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).
- (Then do the same to get the 'last appointment'
duplicates, by reversing the initial sort by DATE.)
I am stuck at the 'Subtract dates' step: I would like to get the data
out of the ragged array by columns (so e.g. I end up with a matrix of
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
the ragged array.
I hope someone can help. My ugly code is below, with some data for
testing.
Stuart
Dr Stuart John Leask DM FRCPsych MB BChir MA
Clinical Senior Lecturer and Honorary Consultant Pychiatrist
Institute of Mental Health, Innovation Park
Triumph Road, Nottingham, Notts. NG7 2TU. UK
Tel. +44 115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)
id.d <- cbind (ID,DATE )
rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create
ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will
be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t ) # all the
IDs with >1 date
rag.am <- rag.a [ multi.dates ] # rag.am
only has IDs with > 1 Date
# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.
This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please send it back to me, and immediately delete
it. Please do not use, copy or disclose the information contained
in this message or in any attachment. Any views or opinions
expressed by the author of this email do not necessarily reflect the
views of the University of Nottingham.
This message has been checked for viruses but the contents of an
attachment
may still contain software viruses which could damage your computer
system:
you are advised to perform your own checks. Email communications with
the
University of Nottingham may be monitored as permitted by UK
legislation.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!)
I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many > 1, ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)
rownames(earliest[earliest%in%nd.b])
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply ( DATE, ID, min) # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
fun <- if(first) head else tail
sp <- split(data.frame(x), x[,1])
first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
n <- unlist(lapply(lst, sum))
sp1 <- sp[n > 1]
i1 <- lst[n > 1]
lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first = FALSE) # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.
I want to subset the data in two ways:
- define groups of patients by the first diagnosis given
- define groups of patients by the last diagnosis given.
The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These individuals
I need to identify and remove, as it's not possible to say uniquely
what their first (or last) diagnosis was. So I need to identify and
remove these individuals which have pairs of rows with the same ID
and (lowest or highest) DATE. The size of the dataset precludes the
option of doing this by eye.
I suspect there is a very elegant way of doing this in R.
This is what I've come up with:
- Sort by DATE then ID
- Make a ragged array of DATE by ID
- Remove IDs that only occur once.
- Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).
- (Then do the same to get the 'last appointment'
duplicates, by reversing the initial sort by DATE.)
I am stuck at the 'Subtract dates' step: I would like to get the data
out of the ragged array by columns (so e.g. I end up with a matrix of
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
the ragged array.
I hope someone can help. My ugly code is below, with some data for
testing.
Stuart
Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
and Honorary Consultant Pychiatrist Institute of Mental Health,
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)
id.d <- cbind (ID,DATE )
rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create
ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
row # Since DATE is in 'year mo da', if there's only one date, sum
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t ) # all the
IDs with >1 date
rag.am <- rag.a [ multi.dates ] # rag.am
only has IDs with > 1 Date
# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.
This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please send it back to me, and immediately delete
it. Please do not use, copy or disclose the information contained
in this message or in any attachment. Any views or opinions
expressed by the author of this email do not necessarily reflect the
views of the University of Nottingham.
This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage your
computer
system:
you are advised to perform your own checks. Email communications with
the University of Nottingham may be monitored as permitted by UK
legislation.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi I did not check your code and rather followed your explanation. BTW, thanks for test data. small change in data frame to make DATE as Date class datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- data.frame(ID,datum ) ordering by date id.d<-id.d[order(id.d$datum),] two functions to test if first two dates are the same or last two dates are the same testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) x[length(x),2]==x[length(x)-1,2] change one last date in the data frame to be the same as previous id.d[35,2]<-id.d[36,2] and here are results sapply(split(id.d, id.d$ID), testlast) 58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE TRUE NA FALSE
sapply(split(id.d, id.d$ID), testfirst)
58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE FALSE NA FALSE Now you can select ID which is true and remove it from your data which(sapply(split(id.d, id.d$ID), testlast)) and use it for your data frame to subset/remove id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast)))) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE [37] TRUE TRUE TRUE TRUE However I am not sure if this is exactly what you want. Regards Petr
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Stuart Leask Sent: Tuesday, October 23, 2012 11:38 AM To: r-help at r-project.org Subject: [R] [r] How to pick colums from a ragged array? I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: - define groups of patients by the first diagnosis given - define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID and (lowest or highest) DATE. The size of the dataset precludes the option of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: - Sort by DATE then ID - Make a ragged array of DATE by ID - Remove IDs that only occur once. - Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). - (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t ) # all the IDs with >1 date rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it. Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
Hi
Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date
fff<-function(data, first=TRUE, remove=FALSE) {
testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else
sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))
if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,]
}
fff(id.d)
ID DATE 31 910 20091105 32 910 20091105 33 910 20091117 34 910 20091119 35 910 20091120 36 910 20091210 37 910 20091224 38 910 20091224
fff(id.d, remove=T)
ID DATE 1 58 20060821 2 58 20061207 3 58 20080102 4 58 20090904 5 167 20040205 6 167 20040323 7 323 20051111 8 323 20060111 9 323 20071119 10 323 20080107 11 323 20080407 12 323 20080521 13 323 20080711 14 547 20041005 15 794 20070905 16 814 20020814 17 814 20021125 18 814 20040429 19 814 20040429 20 814 20071205 21 814 20080227 22 841 20050421 23 841 20060130 24 841 20060428 25 841 20060602 26 841 20060816 27 841 20061025 28 841 20061129 29 841 20070112 30 841 20070514 39 999 20050503 40 1019 19870508 41 1019 19880223 42 1019 19880330 43 1019 19880330
Regards Petr
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of PIKAL Petr Sent: Tuesday, October 23, 2012 1:49 PM To: Stuart Leask; r-help at r-project.org Subject: Re: [R] [r] How to pick colums from a ragged array? Hi I did not check your code and rather followed your explanation. BTW, thanks for test data. small change in data frame to make DATE as Date class datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- data.frame(ID,datum ) ordering by date id.d<-id.d[order(id.d$datum),] two functions to test if first two dates are the same or last two dates are the same testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) x[length(x),2]==x[length(x)-1,2] change one last date in the data frame to be the same as previous id.d[35,2]<-id.d[36,2] and here are results sapply(split(id.d, id.d$ID), testlast) 58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE TRUE NA FALSE
sapply(split(id.d, id.d$ID), testfirst)
58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE FALSE NA FALSE Now you can select ID which is true and remove it from your data which(sapply(split(id.d, id.d$ID), testlast)) and use it for your data frame to subset/remove id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast)))) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE [37] TRUE TRUE TRUE TRUE However I am not sure if this is exactly what you want. Regards Petr
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Stuart Leask Sent: Tuesday, October 23, 2012 11:38 AM To: r-help at r-project.org Subject: [R] [r] How to pick colums from a ragged array? I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: - define groups of patients by the first diagnosis given - define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID
and
(lowest or highest) DATE. The size of the dataset precludes the
option
of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: - Sort by DATE then ID - Make a ragged array of DATE by ID - Remove IDs that only occur once. - Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). - (Then do the same to get the 'last appointment'
duplicates,
by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will
be
less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t ) # all the
IDs
with >1 date rag.am <- rag.a [ multi.dates ] # rag.am
only
has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete
it.
Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
Test data is all - see below it failing.
So, I remain very grateful for your function!
Stuart
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
# rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t ) # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1] # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2] # list of dup'd dates
earliest<-tapply(DATE,ID,min) # table of mins
rownames(earliest[earliest%in%nd.b]) # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate,
# and an earliest date that is the same as another ID's earliest+duplicate.
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many > 1, ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)
rownames(earliest[earliest%in%nd.b])
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply ( DATE, ID, min) # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
fun <- if(first) head else tail
sp <- split(data.frame(x), x[,1])
first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
n <- unlist(lapply(lst, sum))
sp1 <- sp[n > 1]
i1 <- lst[n > 1]
lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first = FALSE) # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.
I want to subset the data in two ways:
- define groups of patients by the first diagnosis given
- define groups of patients by the last diagnosis given.
The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These individuals
I need to identify and remove, as it's not possible to say uniquely
what their first (or last) diagnosis was. So I need to identify and
remove these individuals which have pairs of rows with the same ID
and (lowest or highest) DATE. The size of the dataset precludes the
option of doing this by eye.
I suspect there is a very elegant way of doing this in R.
This is what I've come up with:
- Sort by DATE then ID
- Make a ragged array of DATE by ID
- Remove IDs that only occur once.
- Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).
- (Then do the same to get the 'last appointment'
duplicates, by reversing the initial sort by DATE.)
I am stuck at the 'Subtract dates' step: I would like to get the data
out of the ragged array by columns (so e.g. I end up with a matrix of
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
the ragged array.
I hope someone can help. My ugly code is below, with some data for
testing.
Stuart
Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
and Honorary Consultant Pychiatrist Institute of Mental Health,
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)
id.d <- cbind (ID,DATE )
rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create
ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
row # Since DATE is in 'year mo da', if there's only one date, sum
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t ) # all the
IDs with >1 date
rag.am <- rag.a [ multi.dates ] # rag.am
only has IDs with > 1 Date
# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.
This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please send it back to me, and immediately delete
it. Please do not use, copy or disclose the information contained
in this message or in any attachment. Any views or opinions
expressed by the author of this email do not necessarily reflect the
views of the University of Nottingham.
This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage your
computer
system:
you are advised to perform your own checks. Email communications with
the University of Nottingham may be monitored as permitted by UK
legislation.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi there.
Not sure I follow what you are doing.
I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.
I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date):
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
Correct output:
"167" "841" "1019"
Stuart
-----Original Message-----
From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
Sent: 23 October 2012 13:15
To: Stuart Leask; r-help at r-project.org
Subject: RE: [r] How to pick colums from a ragged array?
Hi
Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date
fff<-function(data, first=TRUE, remove=FALSE) {
testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))
if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
fff(id.d)
ID DATE 31 910 20091105 32 910 20091105 33 910 20091117 34 910 20091119 35 910 20091120 36 910 20091210 37 910 20091224 38 910 20091224
fff(id.d, remove=T)
ID DATE 1 58 20060821 2 58 20061207 3 58 20080102 4 58 20090904 5 167 20040205 6 167 20040323 7 323 20051111 8 323 20060111 9 323 20071119 10 323 20080107 11 323 20080407 12 323 20080521 13 323 20080711 14 547 20041005 15 794 20070905 16 814 20020814 17 814 20021125 18 814 20040429 19 814 20040429 20 814 20071205 21 814 20080227 22 841 20050421 23 841 20060130 24 841 20060428 25 841 20060602 26 841 20060816 27 841 20061025 28 841 20061129 29 841 20070112 30 841 20070514 39 999 20050503 40 1019 19870508 41 1019 19880223 42 1019 19880330 43 1019 19880330
Regards Petr
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of PIKAL Petr Sent: Tuesday, October 23, 2012 1:49 PM To: Stuart Leask; r-help at r-project.org Subject: Re: [R] [r] How to pick colums from a ragged array? Hi I did not check your code and rather followed your explanation. BTW, thanks for test data. small change in data frame to make DATE as Date class datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- data.frame(ID,datum ) ordering by date id.d<-id.d[order(id.d$datum),] two functions to test if first two dates are the same or last two dates are the same testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) x[length(x),2]==x[length(x)-1,2] change one last date in the data frame to be the same as previous id.d[35,2]<-id.d[36,2] and here are results sapply(split(id.d, id.d$ID), testlast) 58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE TRUE NA FALSE
sapply(split(id.d, id.d$ID), testfirst)
58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE FALSE NA FALSE Now you can select ID which is true and remove it from your data which(sapply(split(id.d, id.d$ID), testlast)) and use it for your data frame to subset/remove id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast)))) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE [37] TRUE TRUE TRUE TRUE However I am not sure if this is exactly what you want. Regards Petr
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Stuart Leask Sent: Tuesday, October 23, 2012 11:38 AM To: r-help at r-project.org Subject: [R] [r] How to pick colums from a ragged array? I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: - define groups of patients by the first diagnosis given - define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID
and
(lowest or highest) DATE. The size of the dataset precludes the
option
of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: - Sort by DATE then ID - Make a ragged array of DATE by ID - Remove IDs that only occur once. - Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). - (Then do the same to get the 'last appointment'
duplicates,
by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will
be
less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t ) # all the
IDs
with >1 date rag.am <- rag.a [ multi.dates ] # rag.am
only
has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete
it.
Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
Sorry, I must be a bit thick.! getRepeat gives me the data with duplicates - but I don't seem to be able to manipulate the result. It looks like a list of dataframes:
g.r<-getRepeat(id.d) dim(g.r)
NULL
summary(g.r)
Length Class Mode
[1,] 2 data.frame list
[2,] 2 data.frame list
[3,] 2 data.frame list
This leaves me with the same problem I had with my ragged array i.e. how do I put all the second elements from this long list of data frames, into a single list?
I need to end up with a list of all the IDs that have duplicate first (or last) DATES.
Stuart
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 13:17
To: 'Rui Barradas'
Cc: 'r-help at r-project.org'
Subject: RE: [R] [r] How to pick colums from a ragged array?
Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
Test data is all - see below it failing.
So, I remain very grateful for your function!
Stuart
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
# rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t ) # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1] # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2] # list of dup'd dates
earliest<-tapply(DATE,ID,min) # table of mins
rownames(earliest[earliest%in%nd.b]) # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, # and an earliest date that is the same as another ID's earliest+duplicate.
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many > 1, ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)
rownames(earliest[earliest%in%nd.b])
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply ( DATE, ID, min) # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
fun <- if(first) head else tail
sp <- split(data.frame(x), x[,1])
first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
n <- unlist(lapply(lst, sum))
sp1 <- sp[n > 1]
i1 <- lst[n > 1]
lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first = FALSE) # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.
I want to subset the data in two ways:
- define groups of patients by the first diagnosis given
- define groups of patients by the last diagnosis given.
The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These individuals
I need to identify and remove, as it's not possible to say uniquely
what their first (or last) diagnosis was. So I need to identify and
remove these individuals which have pairs of rows with the same ID
and (lowest or highest) DATE. The size of the dataset precludes the
option of doing this by eye.
I suspect there is a very elegant way of doing this in R.
This is what I've come up with:
- Sort by DATE then ID
- Make a ragged array of DATE by ID
- Remove IDs that only occur once.
- Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).
- (Then do the same to get the 'last appointment'
duplicates, by reversing the initial sort by DATE.)
I am stuck at the 'Subtract dates' step: I would like to get the data
out of the ragged array by columns (so e.g. I end up with a matrix of
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
the ragged array.
I hope someone can help. My ugly code is below, with some data for
testing.
Stuart
Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
and Honorary Consultant Pychiatrist Institute of Mental Health,
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)
id.d <- cbind (ID,DATE )
rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create
ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
row # Since DATE is in 'year mo da', if there's only one date, sum
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t ) # all the
IDs with >1 date
rag.am <- rag.a [ multi.dates ] # rag.am
only has IDs with > 1 Date
# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.
This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please send it back to me, and immediately delete
it. Please do not use, copy or disclose the information contained
in this message or in any attachment. Any views or opinions
expressed by the author of this email do not necessarily reflect the
views of the University of Nottingham.
This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage your
computer
system:
you are advised to perform your own checks. Email communications with
the University of Nottingham may be monitored as permitted by UK
legislation.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello, You're right, getRepeat returns a list of data.frames, one per each ID. To put them all in the same df use do.call(rbind, g.r) Rui Barradas Em 23-10-2012 13:36, Stuart Leask escreveu:
Sorry, I must be a bit thick.! getRepeat gives me the data with duplicates - but I don't seem to be able to manipulate the result. It looks like a list of dataframes:
g.r<-getRepeat(id.d) dim(g.r)
NULL
summary(g.r)
Length Class Mode
[1,] 2 data.frame list
[2,] 2 data.frame list
[3,] 2 data.frame list
This leaves me with the same problem I had with my ragged array i.e. how do I put all the second elements from this long list of data frames, into a single list?
I need to end up with a list of all the IDs that have duplicate first (or last) DATES.
Stuart
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 13:17
To: 'Rui Barradas'
Cc: 'r-help at r-project.org'
Subject: RE: [R] [r] How to pick colums from a ragged array?
Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
Test data is all - see below it failing.
So, I remain very grateful for your function!
Stuart
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
# rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t ) # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1] # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2] # list of dup'd dates
earliest<-tapply(DATE,ID,min) # table of mins
rownames(earliest[earliest%in%nd.b]) # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, # and an earliest date that is the same as another ID's earliest+duplicate.
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many > 1, ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)
rownames(earliest[earliest%in%nd.b])
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply ( DATE, ID, min) # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
fun <- if(first) head else tail
sp <- split(data.frame(x), x[,1])
first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
n <- unlist(lapply(lst, sum))
sp1 <- sp[n > 1]
i1 <- lst[n > 1]
lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first = FALSE) # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.
I want to subset the data in two ways:
- define groups of patients by the first diagnosis given
- define groups of patients by the last diagnosis given.
The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These individuals
I need to identify and remove, as it's not possible to say uniquely
what their first (or last) diagnosis was. So I need to identify and
remove these individuals which have pairs of rows with the same ID
and (lowest or highest) DATE. The size of the dataset precludes the
option of doing this by eye.
I suspect there is a very elegant way of doing this in R.
This is what I've come up with:
- Sort by DATE then ID
- Make a ragged array of DATE by ID
- Remove IDs that only occur once.
- Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).
- (Then do the same to get the 'last appointment'
duplicates, by reversing the initial sort by DATE.)
I am stuck at the 'Subtract dates' step: I would like to get the data
out of the ragged array by columns (so e.g. I end up with a matrix of
ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
the ragged array.
I hope someone can help. My ugly code is below, with some data for
testing.
Stuart
Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
and Honorary Consultant Pychiatrist Institute of Mental Health,
Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)
id.d <- cbind (ID,DATE )
rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create
ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
row # Since DATE is in 'year mo da', if there's only one date, sum
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t ) # all the
IDs with >1 date
rag.am <- rag.a [ multi.dates ] # rag.am
only has IDs with > 1 Date
# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.
This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please send it back to me, and immediately delete
it. Please do not use, copy or disclose the information contained
in this message or in any attachment. Any views or opinions
expressed by the author of this email do not necessarily reflect the
views of the University of Nottingham.
This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage your
computer
system:
you are advised to perform your own checks. Email communications with
the University of Nottingham may be monitored as permitted by UK
legislation.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi
-----Original Message----- From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] Sent: Tuesday, October 23, 2012 2:29 PM To: PIKAL Petr; r-help at r-project.org Subject: RE: [r] How to pick colums from a ragged array? Hi there. Not sure I follow what you are doing. I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.
And that is what the function (with 3 small modifications) does
fff<-function(data, first=TRUE, remove=FALSE) {
testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst))))) else
sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast)))))
if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
}
See the result of your refined data
fff(id.d)
ID DATE
5 167 2004-02-05
6 167 2004-02-05
22 841 2005-04-21
23 841 2005-04-21
24 841 2006-04-28
25 841 2006-06-02
26 841 2006-08-16
27 841 2006-10-25
28 841 2006-11-29
29 841 2007-01-12
30 841 2007-05-14
38 1019 1987-05-08
39 1019 1987-05-08
40 1019 1988-03-30
fff(id.d, first=F)
ID DATE 5 167 2004-02-05 6 167 2004-02-05
fff(id.d, remove=T)
ID DATE 1 58 2006-08-21 2 58 2006-12-07 3 58 2008-01-02 4 58 2009-09-04 7 323 2005-11-11 8 323 2006-01-11 9 323 2007-11-19 10 323 2008-01-07 11 323 2008-04-07 12 323 2008-05-21 13 323 2008-07-11 14 547 2004-10-05 15 794 2007-09-05 16 814 2002-08-14 17 814 2002-11-25 18 814 2004-04-29 19 814 2004-04-29 20 814 2007-12-05 21 814 2008-02-27 31 910 1987-05-08 32 910 2004-02-05 33 910 2004-02-05 34 910 2009-11-20 35 910 2009-12-10 36 910 2009-12-24 37 999 2005-05-03
You can do surgery on fff function to see what result comes from some piece of the function e.g. sapply(split(id.d, id.d[,1]), testlast) Regards Petr
I have refined my test dataset, to include some tests (e.g. 910 has the
same dup as 1019, but for 910 it's not the earliest date):
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
Correct output:
"167" "841" "1019"
Stuart
-----Original Message-----
From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
Sent: 23 October 2012 13:15
To: Stuart Leask; r-help at r-project.org
Subject: RE: [r] How to pick colums from a ragged array?
Hi
Rui's answer brought me to more elaborated solution which still needs
data frame to be ordered by date
fff<-function(data, first=TRUE, remove=FALSE) {
testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]),
testfirst)))) else sel <- as.numeric(names(which(sapply(split(data,
data[,1]), testlast))))
if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
fff(id.d)
ID DATE 31 910 20091105 32 910 20091105 33 910 20091117 34 910 20091119 35 910 20091120 36 910 20091210 37 910 20091224 38 910 20091224
fff(id.d, remove=T)
ID DATE 1 58 20060821 2 58 20061207 3 58 20080102 4 58 20090904 5 167 20040205 6 167 20040323 7 323 20051111 8 323 20060111 9 323 20071119 10 323 20080107 11 323 20080407 12 323 20080521 13 323 20080711 14 547 20041005 15 794 20070905 16 814 20020814 17 814 20021125 18 814 20040429 19 814 20040429 20 814 20071205 21 814 20080227 22 841 20050421 23 841 20060130 24 841 20060428 25 841 20060602 26 841 20060816 27 841 20061025 28 841 20061129 29 841 20070112 30 841 20070514 39 999 20050503 40 1019 19870508 41 1019 19880223 42 1019 19880330 43 1019 19880330
Regards Petr
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of PIKAL Petr Sent: Tuesday, October 23, 2012 1:49 PM To: Stuart Leask; r-help at r-project.org Subject: Re: [R] [r] How to pick colums from a ragged array? Hi I did not check your code and rather followed your explanation. BTW, thanks for test data. small change in data frame to make DATE as Date class datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- data.frame(ID,datum ) ordering by date id.d<-id.d[order(id.d$datum),] two functions to test if first two dates are the same or last two dates are the same testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) x[length(x),2]==x[length(x)-1,2] change one last date in the data frame to be the same as previous id.d[35,2]<-id.d[36,2] and here are results sapply(split(id.d, id.d$ID), testlast) 58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE TRUE NA FALSE
sapply(split(id.d, id.d$ID), testfirst)
58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE FALSE NA FALSE Now you can select ID which is true and remove it from your data which(sapply(split(id.d, id.d$ID), testlast)) and use it for your data frame to subset/remove id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast)))) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
FALSE TRUE TRUE [37] TRUE TRUE TRUE TRUE However I am not sure if this is exactly what you want. Regards Petr
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Stuart Leask Sent: Tuesday, October 23, 2012 11:38 AM To: r-help at r-project.org Subject: [R] [r] How to pick colums from a ragged array? I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on
that
date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: - define groups of patients by the first diagnosis given - define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID
and
(lowest or highest) DATE. The size of the dataset precludes the
option
of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: - Sort by DATE then ID - Make a ragged array of DATE by ID - Remove IDs that only occur once. - Subtract the first and second DATEs. Remove IDs for
which
this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). - (Then do the same to get the 'last appointment'
duplicates,
by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU.
UK
Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up
the
dates per row # Since DATE is in 'year mo da', if there's only one date, sum will
be
less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t ) # all the
IDs
with >1 date rag.am <- rag.a [ multi.dates ] # rag.am
only
has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete
it.
Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
HI,
I was not following the thread.
May be this is what you are looking for:
new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),]
tapply(new1$ID,new1$DATE,head,1)
#19870508 20040205 20040429 20050421
? #? 1019????? 167????? 814????? 841
A.K.
----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: PIKAL Petr <petr.pikal at precheza.cz>; "r-help at r-project.org" <r-help at r-project.org>
Cc:
Sent: Tuesday, October 23, 2012 8:28 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hi there.
Not sure I follow what you are doing.
I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID.
I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date):
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
Correct output:
"167"? "841"? "1019"
Stuart
-----Original Message-----
From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
Sent: 23 October 2012 13:15
To: Stuart Leask; r-help at r-project.org
Subject: RE: [r] How to pick colums from a ragged array?
Hi
Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date
fff<-function(data, first=TRUE, remove=FALSE) {
testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))
if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
fff(id.d)
? ? ID? ? DATE 31 910 20091105 32 910 20091105 33 910 20091117 34 910 20091119 35 910 20091120 36 910 20091210 37 910 20091224 38 910 20091224
fff(id.d, remove=T)
? ? ID? ? DATE 1? ? 58 20060821 2? ? 58 20061207 3? ? 58 20080102 4? ? 58 20090904 5? 167 20040205 6? 167 20040323 7? 323 20051111 8? 323 20060111 9? 323 20071119 10? 323 20080107 11? 323 20080407 12? 323 20080521 13? 323 20080711 14? 547 20041005 15? 794 20070905 16? 814 20020814 17? 814 20021125 18? 814 20040429 19? 814 20040429 20? 814 20071205 21? 814 20080227 22? 841 20050421 23? 841 20060130 24? 841 20060428 25? 841 20060602 26? 841 20060816 27? 841 20061025 28? 841 20061129 29? 841 20070112 30? 841 20070514 39? 999 20050503 40 1019 19870508 41 1019 19880223 42 1019 19880330 43 1019 19880330
Regards Petr
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of PIKAL Petr Sent: Tuesday, October 23, 2012 1:49 PM To: Stuart Leask; r-help at r-project.org Subject: Re: [R] [r] How to pick colums from a ragged array? Hi I did not check your code and rather followed your explanation. BTW, thanks for test data. small change in data frame to make DATE as Date class datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- data.frame(ID,datum ) ordering by date id.d<-id.d[order(id.d$datum),] two functions to test if first two dates are the same or last two dates are the same testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) x[length(x),2]==x[length(x)-1,2] change one last date in the data frame to be the same as previous id.d[35,2]<-id.d[36,2] and here are results sapply(split(id.d, id.d$ID), testlast) ? ? 58? 167? 323? 547? 794? 814? 841? 910? 999? 1019 FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE? TRUE? ? NA FALSE
sapply(split(id.d, id.d$ID), testfirst)
? ? 58? 167? 323? 547? 794? 814? 841? 910? 999? 1019 FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE FALSE? ? NA FALSE Now you can select ID which is true and remove it from your data which(sapply(split(id.d, id.d$ID), testlast)) and use it for your data frame to subset/remove id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))? [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE? TRUE [37]? TRUE? TRUE? TRUE? TRUE However I am not sure if this is exactly what you want. Regards Petr
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Stuart Leask Sent: Tuesday, October 23, 2012 11:38 AM To: r-help at r-project.org Subject: [R] [r] How to pick colums from a ragged array? I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: -? ? ? ? ? define groups of patients by the first diagnosis given -? ? ? ? ? define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID
and
(lowest or highest) DATE. The size of the dataset precludes the
option
of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: -? ? ? ? ? Sort by DATE then ID -? ? ? ? ? Make a ragged array of DATE by ID -? ? ? ? ? Remove IDs that only occur once. -? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). -? ? ? ? ? (Then do the same to get the 'last appointment'
duplicates,
by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ? # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ? #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will
be
less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all the
IDs
with >1 date rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ? # rag.am
only
has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete
it.
Please do not use, copy or disclose the information contained in this message or in any attachment.? Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. ??? [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
So I get my list of IDs to exclude from: g.rr<-do.call(rbind, g.r)[1] dim(g.rr) g.rr[1:(dim(g.rr)[1]/2)] Many thanks. Stuart -----Original Message----- From: Rui Barradas [mailto:ruipbarradas at sapo.pt] Sent: 23 October 2012 13:42 To: Stuart Leask Cc: r-help at r-project.org Subject: Re: FW: [R] [r] How to pick colums from a ragged array? Hello, You're right, getRepeat returns a list of data.frames, one per each ID. To put them all in the same df use do.call(rbind, g.r) Rui Barradas Em 23-10-2012 13:36, Stuart Leask escreveu:
Sorry, I must be a bit thick.! getRepeat gives me the data with duplicates - but I don't seem to be able to manipulate the result. It looks like a list of dataframes:
g.r<-getRepeat(id.d) dim(g.r)
NULL
summary(g.r)
Length Class Mode
[1,] 2 data.frame list
[2,] 2 data.frame list
[3,] 2 data.frame list
This leaves me with the same problem I had with my ragged array i.e. how do I put all the second elements from this long list of data frames, into a single list?
I need to end up with a list of all the IDs that have duplicate first (or last) DATES.
Stuart
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 13:17
To: 'Rui Barradas'
Cc: 'r-help at r-project.org'
Subject: RE: [R] [r] How to pick colums from a ragged array?
Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
Test data is all - see below it failing.
So, I remain very grateful for your function!
Stuart
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
# rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t ) # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1] # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2] # list of dup'd dates
earliest<-tapply(DATE,ID,min) # table of mins
rownames(earliest[earliest%in%nd.b]) # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, # and an earliest date that is the same as another ID's earliest+duplicate.
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many > 1, ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)
rownames(earliest[earliest%in%nd.b])
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply ( DATE, ID, min) # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,200912
10
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
fun <- if(first) head else tail
sp <- split(data.frame(x), x[,1])
first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
n <- unlist(lapply(lst, sum))
sp1 <- sp[n > 1]
i1 <- lst[n > 1]
lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first =
FALSE) # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID
(patient's name), DATE (of appointment) and DIAGNOSIS (given on that
date).
Patients may have been assigned more than one diagnosis at any one
appointment - leading to two rows, same ID and DATE but different
DIAGNOSIS.
The diagnoses may change between appointments.
I want to subset the data in two ways:
- define groups of patients by the first diagnosis given
- define groups of patients by the last diagnosis given.
The problem:
Unfortunately, a small number of patients have been given more than
one diagnosis at their first (or last) appointment. These
individuals I need to identify and remove, as it's not possible to
say uniquely what their first (or last) diagnosis was. So I need to
identify and remove these individuals which have pairs of rows with
the same ID and (lowest or highest) DATE. The size of the dataset
precludes the option of doing this by eye.
I suspect there is a very elegant way of doing this in R.
This is what I've come up with:
- Sort by DATE then ID
- Make a ragged array of DATE by ID
- Remove IDs that only occur once.
- Subtract the first and second DATEs. Remove IDs for which
this = zero, as this will only be true for IDs for which the
appointment is recorded twice (because there were two diagnoses
recorded on this date).
- (Then do the same to get the 'last appointment'
duplicates, by reversing the initial sort by DATE.)
I am stuck at the 'Subtract dates' step: I would like to get the
data out of the ragged array by columns (so e.g. I end up with a
matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by
column from the ragged array.
I hope someone can help. My ugly code is below, with some data for
testing.
Stuart
Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior
Lecturer and Honorary Consultant Pychiatrist Institute of Mental
Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK
Tel. +44
115 82 30419
stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
Google 'Dr Stuart Leask'
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
,20091224,20050503,19870508,19880223,19880330)
id.d <- cbind (ID,DATE )
rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create
ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
row # Since DATE is in 'year mo da', if there's only one date, sum
will be less than 2100000:
rag.t <- rag.s [ rag.s > 21000000 ]
multi.dates <- rownames ( rag.t ) # all the
IDs with >1 date
rag.am <- rag.a [ multi.dates ] # rag.am
only has IDs with > 1 Date
# But now I'm stuck.
# Each row of the array is rag.am$ID.
# So I can't pick columns of DATEs from the ragged array.
This message and any attachment are intended solely for the
addressee and may contain confidential information. If you have
received this message in error, please send it back to me, and immediately delete
it. Please do not use, copy or disclose the information contained
in this message or in any attachment. Any views or opinions
expressed by the author of this email do not necessarily reflect the
views of the University of Nottingham.
This message has been checked for viruses but the contents of an
attachment may still contain software viruses which could damage
your computer
system:
you are advised to perform your own checks. Email communications
with the University of Nottingham may be monitored as permitted by
UK legislation.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi Stuart,
This also should get you the IDs you wanted.
new1<-id.d[duplicated(id.d[,2])|duplicated(id.d[,2],fromLast=TRUE),]
earliest <- tapply ( DATE, ID, min)? ? ? ? ? ? ? ?
?rownames(earliest[earliest%in% new1])
#[1] "167"? "841"? "1019"
A.K.
----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: Rui Barradas <ruipbarradas at sapo.pt>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Tuesday, October 23, 2012 7:37 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!)
I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many? <-? ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many? > 1,? ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)? ? ? ? ? ? ? ?
rownames(earliest[earliest%in%nd.b])?
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply? ( DATE, ID, min)? ? ? ? ? ? ? ? ? ? # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )? # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
? ? fun <- if(first) head else tail
? ? sp <- split(data.frame(x), x[,1])
? ? first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
? ? lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
? ? n <- unlist(lapply(lst, sum))
? ? sp1 <- sp[n > 1]
? ? i1 <- lst[n > 1]
? ? lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d)? # defaults to first = TRUE getRepeat(id.d, first = FALSE)? # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: -? ? ? ? ? define groups of patients by the first diagnosis given -? ? ? ? ? define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID and (lowest or highest) DATE. The size of the dataset precludes the option of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: -? ? ? ? ? Sort by DATE then ID -? ? ? ? ? Make a ragged array of DATE by ID -? ? ? ? ? Remove IDs that only occur once. -? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). -? ? ? ? ? (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ? # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all the IDs with >1 date rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ? # rag.am only has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it.? Please do not use, copy or disclose the information contained in this message or in any attachment.? Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. ? ? [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi,
?res1<-data.frame(col=sapply(tapply(DATE,ID,function(x)? duplicated(head(x,2))),function(x) x[2]))
row.names(subset(res1,col==TRUE))
#[1] "167"? "841"? "1019"
#assuming that dates are sorted
A.K.
----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: Rui Barradas <ruipbarradas at sapo.pt>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Tuesday, October 23, 2012 8:17 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?
Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
Test data is all - see below it failing.
So, I remain very grateful for your function!
Stuart
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
# rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ? # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ? #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ? # rag.am only has IDs with > 1 Date
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]? ? ? ? # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]? ? ? ? # list of dup'd dates
earliest<-tapply(DATE,ID,min)? # table of mins
rownames(earliest[earliest%in%nd.b])? # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate,
# and an earliest date that is the same as another ID's earliest+duplicate.
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many? <-? ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many? > 1,? ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)? ? ? ? ? ? ? ?
rownames(earliest[earliest%in%nd.b])?
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply? ( DATE, ID, min)? ? ? ? ? ? ? ? ? ? # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )? # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
? ? fun <- if(first) head else tail
? ? sp <- split(data.frame(x), x[,1])
? ? first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
? ? lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
? ? n <- unlist(lapply(lst, sum))
? ? sp1 <- sp[n > 1]
? ? i1 <- lst[n > 1]
? ? lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d)? # defaults to first = TRUE getRepeat(id.d, first = FALSE)? # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: -? ? ? ? ? define groups of patients by the first diagnosis given -? ? ? ? ? define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID and (lowest or highest) DATE. The size of the dataset precludes the option of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: -? ? ? ? ? Sort by DATE then ID -? ? ? ? ? Make a ragged array of DATE by ID -? ? ? ? ? Remove IDs that only occur once. -? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). -? ? ? ? ? (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ? # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all the IDs with >1 date rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ? # rag.am only has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it.? Please do not use, copy or disclose the information contained in this message or in any attachment.? Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. ? ? [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi,
I read ur first post and I think I understand what you meant.?
I guess this should work.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
res1<- data.frame(flag=tapply(id.d[,2],id.d[,1],FUN=function(x) head(duplicated(x)|duplicated(x,fromLast=TRUE),1)|tail(duplicated(x)|duplicated(x,fromLast=TRUE),1)))
?id.d[id.d[,1]%in%names(res1[res1$flag==TRUE,]),]
?# ???? ID???? DATE
?#[1,]? 167 20040205
?#[2,]? 167 20040205
?#[3,]? 841 20050421
?#[4,]? 841 20050421
?#[5,]? 841 20060428
?#[6,]? 841 20060602
?#[7,]? 841 20060816
?#[8,]? 841 20061025
?#[9,]? 841 20061129
#[10,]? 841 20070112
#[11,]? 841 20070514
#[12,] 1019 19870508
#[13,] 1019 19870508
#[14,] 1019 19880330
So, these IDs should be deleted if I am correct.
A.K.
----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: Rui Barradas <ruipbarradas at sapo.pt>
Cc: "r-help at r-project.org" <r-help at r-project.org>
Sent: Tuesday, October 23, 2012 8:17 AM
Subject: Re: [R] [r] How to pick colums from a ragged array?
Ah, no, my method does fail.
Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
Test data is all - see below it failing.
So, I remain very grateful for your function!
Stuart
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
# rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ? # create ragged array, 1-n DATES for every NAME
# Inelegant attempt to remove IDs that only have one entry:
# rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ? #add up the dates per row
# Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
# rag.t <- rag.s [ rag.s > 21000000 ]
# multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all the IDs with >1 date
# rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ? # rag.am only has IDs with > 1 Date
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
#ni<-dim(nd.b)[1]
#nd.IDs<-nd.b[1:(ni/2)*2,1]? ? ? ? # list of IDs with dups
#nd.DATEs<-nd.b[1:(ni/2)*2,2]? ? ? ? # list of dup'd dates
earliest<-tapply(DATE,ID,min)? # table of mins
rownames(earliest[earliest%in%nd.b])? # IDs of dups with min
# This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate,
# and an earliest date that is the same as another ID's earliest+duplicate.
-----Original Message-----
From: Leask Stuart
Sent: 23 October 2012 12:38
To: 'Rui Barradas'
Cc: r-help at r-project.org
Subject: RE: [R] [r] How to pick colums from a ragged array?
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many? <-? ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many? > 1,? ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)? ? ? ? ? ? ? ?
rownames(earliest[earliest%in%nd.b])?
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply? ( DATE, ID, min)? ? ? ? ? ? ? ? ? ? # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )? # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
? ? fun <- if(first) head else tail
? ? sp <- split(data.frame(x), x[,1])
? ? first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
? ? lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
? ? n <- unlist(lapply(lst, sum))
? ? sp1 <- sp[n > 1]
? ? i1 <- lst[n > 1]
? ? lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d)? # defaults to first = TRUE getRepeat(id.d, first = FALSE)? # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:
I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: -? ? ? ? ? define groups of patients by the first diagnosis given -? ? ? ? ? define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID and (lowest or highest) DATE. The size of the dataset precludes the option of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: -? ? ? ? ? Sort by DATE then ID -? ? ? ? ? Make a ragged array of DATE by ID -? ? ? ? ? Remove IDs that only occur once. -? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). -? ? ? ? ? (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ? # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all the IDs with >1 date rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ? # rag.am only has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it.? Please do not use, copy or disclose the information contained in this message or in any attachment.? Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. ? ? [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.