Duplicates and duplicated
12 messages · christiaan pauw, Andrej Blejec, Linlin Yan +4 more
On Thu, May 14, 2009 at 2:16 PM, christiaan pauw <cjpauw at gmail.com> wrote:
Hi everybody.
I want to identify not only duplicate number but also the original number that has been duplicated.
Example:
x=c(1,2,3,4,4,5,6,7,8,9)
y=duplicated(x)
rbind(x,y)
gives:
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    4    5    6    7    8     9
y    0    0    0    0    1    0    0    0    0     0
i.e. the second 4 [,5] is a duplicate.
What I want is the first and second 4, i.e. [,4] and [,5], to be TRUE:
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    4    5    6    7    8     9
y    0    0    0    1    1    0    0    0    0     0
How about
rbind(x, duplicated(x) | duplicated(x, fromLast=TRUE))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x 1 2 3 4 4 5 6 7 8 9
0 0 0 1 1 0 0 0 0 0
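A minimal sketch of why the two-pass call above works, using the vector from the question. The variable names are only illustrative: a forward scan flags every copy after the first, a scan with fromLast=TRUE flags every copy before the last, and OR-ing the two flags all members of a repeated group.

```r
x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)

# Forward scan: TRUE for each element already seen earlier in x
later_copies <- duplicated(x)

# Backward scan: TRUE for each element that appears again later in x
earlier_copies <- duplicated(x, fromLast = TRUE)

# An element belongs to a repeated group if either scan flags it
all_copies <- later_copies | earlier_copies

which(later_copies)    # 5
which(earlier_copies)  # 4
which(all_copies)      # 4 5
```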
I assume it can be done by sorting the vector and then checking whether the next or the previous entry matches using identical(). I am just unsure how to write such a loop, the logic of which (I think) is as follows:
sort x
for every value of x, check if the next value is identical and return TRUE (or 1) if it is and FALSE (or 0) if it is not
AND
check if the previous value is identical and return TRUE (or 1) if it is and FALSE (or 0) if it is not
Am I thinking correctly, and can someone help to write such a function?
regards
Christiaan
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Try this
x%in%x[which(y)]
From your example
x=c(1,2,3,4,4,5,6,7,8,9)
y=duplicated(x)
rbind(x,y)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    4    5    6    7    8     9
y    0    0    0    0    1    0    0    0    0     0
which(y)
[1] 5
x[which(y)]
[1] 4
x%in%x[which(y)]
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
Andrej
--
Andrej Blejec
National Institute of Biology
Vecna pot 111 POB 141
SI-1000 Ljubljana
SLOVENIA
e-mail: andrej.blejec at nib.si
URL: http://ablejec.nib.si
tel: + 386 (0)59 232 789
fax: + 386 1 241 29 80
--------------------------
Organizer of Applied Statistics 2009 conference
http://conferences.nib.si/AS2009
The operator %in% is very good! And that can be simpler like this:
x %in% x[duplicated(x)]
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
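As a rough sketch, the %in% idea can be wrapped in a small helper. The name flag_repeated is made up for illustration and does not come from the thread. Logical indexing with duplicated(x) picks out the repeated values, and %in% then marks every element equal to one of them, so the which() step is not needed. It should work for character vectors as well as numbers.

```r
# Hypothetical helper name; the body is the one-liner from the message above
flag_repeated <- function(x) x %in% x[duplicated(x)]

x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)
repeated_values <- x[duplicated(x)]   # the values that occur more than once
flags_numeric <- flag_repeated(x)

labels <- c("a", "b", "a", "c")
flags_character <- flag_repeated(labels)

repeated_values    # 4
flags_character    # TRUE FALSE TRUE FALSE
```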
Noting that:
ave(x, x, FUN = length) > 1
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
try this:
rbind(x, dup = ave(x, x, FUN = length) > 1)
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x      1    2    3    4    4    5    6    7    8     9
dup    0    0    0    1    1    0    0    0    0     0
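To make the ave() step easier to follow, here is a small sketch that looks at the intermediate result. The variable name group_size is only illustrative. ave(x, x, FUN = length) groups x by its own values and writes each group's size back to every member of that group, so comparing with 1 flags every element of a repeated group.

```r
x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)

# For each element, the number of elements of x sharing its value
group_size <- ave(x, x, FUN = length)

# Elements whose value occurs more than once
is_repeated <- group_size > 1

group_size          # 1 1 1 2 2 1 1 1 1 1
which(is_repeated)  # 4 5
```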
... or, similar in character to Gabor's solution:
tbl <- table(x)
(tbl[as.character(sort(x))]>1)+0
Bert Gunter
Nonclinical Biostatistics
467-7374
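A possible variant of the table() suggestion, sketched under the assumption that the values convert cleanly to character (small whole numbers here). It looks counts up by as.character(x) rather than as.character(sort(x)), so the flags come back in the original order of the data. This is an illustration only, not something proposed in the thread in this form.

```r
x <- c(3, 1, 3, 2)

# Count occurrences of each distinct value; names are "1", "2", "3"
tbl <- table(x)

# Look up each element's count by name, in the original order of x,
# and drop the table attributes to get a plain logical vector
in_original_order <- as.vector(tbl[as.character(x)] > 1)

in_original_order  # TRUE FALSE TRUE FALSE
```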
The table()-based solution can have problems when there are
very closely spaced floating point numbers in x, as in
x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
It also relies on table(x) turning x into a factor with the default
levels=as.character(sort(x)) and that default may change.
It omits NA's from the result. (I think it also ought to put the results in
the original order of the data, so one can, e.g., omit or select values
which are duplicated.)
The ave()-based solution fails when there are NA's or NaN's in the data.
x2 <- c(1,2,3,NA,10,6,3)
The ave()-based solution can be slower than necessary on long datasets,
especially ones with few or no duplicates.
x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]
I think the following function avoids these problems. It never converts
the data to character, but uses match() on the original data to convert
it to a set of unique integers that tabulate can handle.
f2 <- function(x){
   ix<-match(x,x)
   tix<-tabulate(ix)
   retval<-logical(length(x))
   retval[which(tix!=1)]<-TRUE
   retval
}
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
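A short sketch of the closely spaced values mentioned above, for anyone who wants to try them. It only checks things that do not depend on how a given R version formats doubles as text: the three values are distinct as doubles, and an approach that compares the numbers themselves (here the duplicated() pair from earlier in the thread) flags exactly the repeated ones. What table() does with x1 is left for the reader to inspect, since it may vary with the R version.

```r
eps <- .Machine$double.eps
x1 <- c(1, 1 - eps, 1 + 2 * eps)[c(1, 2, 3, 2, 3)]

# Three different doubles, even though they print alike at default precision
n_distinct <- length(unique(x1))

# Comparison on the numeric values: positions 2 to 5 hold repeated values
flags <- duplicated(x1) | duplicated(x1, fromLast = TRUE)

n_distinct  # 3
flags       # FALSE TRUE TRUE TRUE TRUE
```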
Thanks, Bill. I also had some concerns about how reliable numeric values
converted to character might be, so I'm glad to have an authoritative
criticism. Of course, I was really just being cute with R's versatility.
But Jim Holtman's solution seems like the best way to go, anyway, does it
not?
-- Bert
Bert Gunter
Genentech Nonclinical Biostatistics
I don't think that that is the conclusion. All the solutions solve the original problem, and the additional "requirements" may or may not be what is wanted in any particular case. The ave solution propagates the NA, which seems like the right thing to do, whereas the f2 solution and the duplicated solution label it FALSE, which seems wrong (though it may be right if that were wanted). Also, the f2 solution does not pick up the 3 at the end, but again that may or may not be wanted.
x <- c(1, 2, 3, NA, 10, 6, 3)
ave(x, x, FUN = length) > 1
[1] FALSE FALSE  TRUE    NA FALSE FALSE  TRUE
f2(x)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
duplicated(x) | duplicated(x, fromLast=TRUE)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
so it all depends on what you want.
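If NA propagation is what is wanted, one way to get it from the duplicated() approach is to overwrite the flags at the missing positions afterwards. This is only a sketch of that idea with illustrative variable names, not a recommendation from the thread.

```r
x <- c(1, 2, 3, NA, 10, 6, 3)

# duplicated() treats a lone NA as an ordinary unique value, so it gets FALSE
flags <- duplicated(x) | duplicated(x, fromLast = TRUE)

# Optional step: report NA where the input itself is missing
flags_na <- flags
flags_na[is.na(x)] <- NA

flags     # FALSE FALSE TRUE FALSE FALSE FALSE TRUE
flags_na  # FALSE FALSE TRUE NA FALSE FALSE TRUE
```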
On Thursday, May 14, 2009 2:31 PM, Bert Gunter <gunter.berton at gene.com> wrote:
But Jim Holtman's solution seems like the best way to go, anyway, does it not?
That was
f3 <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)
which is equivalent to
function(x) duplicated(x) | rev(duplicated(rev(x)))
in S+, which doesn't have the fromLast= argument.
It avoids the problems involved in table() and ave(),
but it just seems sneaky to me.
Linlin Yan's
f4 <- function(x) x %in% x[duplicated(x)]
seems to me more direct and also avoids those problems.
Mine was wrong. It fails on
x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)
My intent was to provide one that would generalize to identifying
all elements that had n or more repetitions in the input vector.
(E.g., you may want to drop from some analysis subjects with
fewer than 5 observations on them.) The corrected version is
f2 <- function(x, n=2){
   ix <- match(x,x)
   tix <- tabulate(ix)
   ix %in% which(tix>=n)
}
E.g.,
rbind(x, f2(x), f3(x), f4(x)) # identify duplicated entries
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x 1 2 8 2 4 5 10 1 4 16 2
1 1 0 1 1 0 0 1 1 0 1
1 1 0 1 1 0 0 1 1 0 1
1 1 0 1 1 0 0 1 1 0 1
rbind(x, f2(x, n=3)) # find ones with >= 3 reps
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x 1 2 8 2 4 5 10 1 4 16 2
0 1 0 1 0 0 0 0 0 0 1
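To see what the corrected f2 is doing, it may help to print its intermediate values for the example vector. In this sketch, match(x, x) maps each element to the position of its first occurrence, tabulate() counts how often each first-occurrence position is hit, and the final %in% maps those counts back to every element. The earlier version indexed the result by the count positions directly, which is why it missed later copies such as the 2 in position 11.

```r
x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)

# Position of the first occurrence of each element's value
ix <- match(x, x)

# tix[i] = number of elements whose first occurrence is at position i
tix <- tabulate(ix)

# Elements whose value occurs at least 3 times
at_least_3 <- ix %in% which(tix >= 3)

ix                 # 1 2 3 2 5 6 7 1 5 10 2
tix                # 2 3 1 0 2 1 1 0 0 1
which(at_least_3)  # 2 4 11
```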
Gabor,
My f2 was just wrong. It should have been
f2 <- function(x, n=2){ ix<-match(x,x); tix<-tabulate(ix); ix %in% which(tix>=n) }
which would be roughly the same as your
f1 <- function(x, n=2) ave(x,x,FUN=length)>=n
and flags all elements of x with >= n repetitions.
ave() involves a call to factor, which folks on R-devel have been fiddling
with to change how it works with close-together numbers, so its results
may vary with the version of R. The ix<-match(x,x) is a way to avoid
the dependency on factor.
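To make the steps concrete, here is a small sketch (not part of the original message) that prints the intermediate values of the match()/tabulate() approach and compares it with the ave() version for two thresholds. The vector x_demo and the *_demo names are made up for illustration.

```r
# f1 and f2 as defined above, spread over several lines
f1 <- function(x, n = 2) ave(x, x, FUN = length) >= n
f2 <- function(x, n = 2) {
  ix <- match(x, x)        # position of the first occurrence of each value
  tix <- tabulate(ix)      # counts, stored at those first-occurrence positions
  ix %in% which(tix >= n)  # TRUE where the element's value occurs >= n times
}

x_demo <- c(1, 2, 2, 3, 3, 3)   # illustrative data, not from the thread

ix_demo <- match(x_demo, x_demo)
tix_demo <- tabulate(ix_demo)
print(ix_demo)    # 1 2 2 4 4 4
print(tix_demo)   # 1 2 0 3

res2 <- f2(x_demo)          # values occurring at least twice
res3 <- f2(x_demo, n = 3)   # values occurring at least three times
print(res2)
print(res3)
print(all(f1(x_demo) == res2) && all(f1(x_demo, n = 3) == res3))
```

The zero in the tabulate() output belongs to an index that is never a first occurrence, so it never matches anything in ix and does no harm.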
For very long vectors with few duplicates tabulate is faster than the many
calls to length in ave, and I think f2 uses less memory because of the
lists involved in the calls to split and lapply in ave. E.g., on a pretty
old Linux machine:
x<-c(1:5e5,5,5,5,7,7,2)
which(f2(x))
[1] 2 5 7 500001 500002 500003 500004 500005 500006
which(f1(x))
[1] 2 5 7 500001 500002 500003 500004 500005 500006
system.time(f1(x))
   user  system elapsed
 23.726   0.250  23.999
system.time(f2(x))
   user  system elapsed
  0.639   0.003   0.642

ave() is certainly easier to understand.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
Sent: Thursday, May 14, 2009 2:47 PM
To: William Dunlap
Cc: Bert Gunter; christiaan pauw; r-help at r-project.org
Subject: Re: [R] Duplicates and duplicated

I don't think that that is the conclusion. All the solutions solve the original problem, and the additional "requirements" may or may not be what is wanted in any particular case. The ave solution propagates the NA, which seems like the right thing to do, whereas the f2 solution and the duplicated solution label it FALSE, which seems wrong (though it may be right if that were wanted). Also, the f2 solution does not pick up the 3 at the end, but again that may or may not be wanted.
x <- c(1, 2, 3, NA, 10, 6, 3)
ave(x, x, FUN = length) > 1
[1] FALSE FALSE  TRUE    NA FALSE FALSE  TRUE
f2(x)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
duplicated(x) | duplicated(x, fromLast=TRUE)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

so it all depends on what you want.

On Thu, May 14, 2009 at 1:43 PM, William Dunlap <wdunlap at tibco.com> wrote:
The table()-based solution can have problems when there are very closely spaced floating point numbers in x, as in
  x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
It also relies on table(x) turning x into a factor with the default levels=as.character(sort(x)) and that default may change. It omits NA's from the result. (I think it also ought to put the results in the original order of the data, so one can, e.g., omit or select values which are duplicated.)

The ave()-based solution fails when there are NA's or NaN's in the data.
  x2 <- c(1,2,3,NA,10,6,3)
The ave()-based solution can be slower than necessary on long datasets, especially ones with few or no duplicates.
  x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]

I think the following function avoids these problems. It never converts the data to character, but uses match() on the original data to convert it to a set of unique integers that tabulate can handle.
f2 <- function(x){
  ix<-match(x,x)
  tix<-tabulate(ix)
  retval<-logical(length(x))
  retval[which(tix!=1)]<-TRUE
  retval
}
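As a rough check of the two test vectors x1 and x2 above, the sketch below applies the same match()/tabulate() idea, but looks up each element's own count, in the spirit of the correction to f2 posted elsewhere in this thread. The name flag_repeats and the tix[ix] lookup are illustrative choices, not code from the original messages.

```r
# Hypothetical helper: same match()/tabulate() idea as f2, but each
# element is tested against the count of its own value.
flag_repeats <- function(x, n = 2) {
  ix <- match(x, x)     # index of the first occurrence of each value
  tix <- tabulate(ix)   # how often each first-occurrence index appears
  tix[ix] >= n          # look up each element's own count
}

# Three distinct doubles that lie very close to 1
x1 <- c(1, 1 - .Machine$double.eps, 1 + 2 * .Machine$double.eps)[c(1, 2, 3, 2, 3)]
# A vector containing a single NA and a repeated 3
x2 <- c(1, 2, 3, NA, 10, 6, 3)

r1 <- flag_repeats(x1)
r2 <- flag_repeats(x2)
print(r1)
print(r2)
```

Because match() compares the doubles themselves rather than their character representations, the three nearby values in x1 stay distinct. The single NA in x2 is treated as an ordinary value that occurs once, so it comes out FALSE rather than NA.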
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
Sent: Thursday, May 14, 2009 9:10 AM
To: 'Gabor Grothendieck'; 'christiaan pauw'
Cc: r-help at r-project.org
Subject: Re: [R] Duplicates and duplicated

... or, similar in character to Gabor's solution:

tbl <- table(x)
(tbl[as.character(sort(x))]>1)+0

Bert Gunter
Nonclinical Biostatistics
467-7374

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Gabor Grothendieck
Sent: Thursday, May 14, 2009 7:34 AM
To: christiaan pauw
Cc: r-help at r-project.org
Subject: Re: [R] Duplicates and duplicated

Noting that:
ave(x, x, FUN = length) > 1
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

try this:

rbind(x, dup = ave(x, x, FUN = length) > 1)
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x      1    2    3    4    4    5    6    7    8     9
dup    0    0    0    1    1    0    0    0    0     0

On Thu, May 14, 2009 at 2:16 AM, christiaan pauw <cjpauw at gmail.com> wrote:
Hi everybody. I want to identify not only the duplicate number but also the original number that has been duplicated. Example:

x=c(1,2,3,4,4,5,6,7,8,9)
y=duplicated(x)
rbind(x,y)

gives:

  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    4    5    6    7    8     9
y    0    0    0    0    1    0    0    0    0     0

i.e. the second 4 [,5] is a duplicate. What I want is the first and second 4, i.e. [,4] and [,5], to be TRUE:

  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    4    5    6    7    8     9
y    0    0    0    1    1    0    0    0    0     0

I assume it can be done by sorting the vector and then checking whether the next or the previous entry matches using identical(). I am just unsure how to write such a loop, the logic of which (I think) is as follows:

sort x
for every value of x, check if the next value is identical and return TRUE (or 1) if it is and FALSE (or 0) if it is not
AND
check if the previous value is identical and return TRUE (or 1) if it is and FALSE (or 0) if it is not

Am I thinking correctly, and can someone help to write such a function?

regards
Christiaan