Duplicates and duplicated

12 messages · christiaan pauw, Andrej Blejec, Linlin Yan +4 more

#
On Thu, May 14, 2009 at 2:16 PM, christiaan pauw <cjpauw at gmail.com> wrote:
How about

rbind(x, duplicated(x) | duplicated(x, fromLast=TRUE))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    4    5    6    7    8     9
     0    0    0    1    1    0    0    0    0     0
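
For a self-contained run, here is that one-liner with the example vector implied by the output (the definition of x is my assumption, not part of the original post):

```r
# Example vector implied by the output above (an assumption)
x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)

# duplicated() marks only later occurrences; OR-ing it with the
# fromLast pass marks every occurrence of a repeated value.
# rbind() coerces the logical row to 0/1 alongside x.
rbind(x, duplicated(x) | duplicated(x, fromLast = TRUE))
```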
#
Try this:

> y <- duplicated(x)
> rbind(x, y)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    4    5    6    7    8     9
y    0    0    0    0    1    0    0    0    0     0
> which(y)
[1] 5
> x[which(y)]
[1] 4
> x %in% x[which(y)]
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

Andrej

--
Andrej Blejec
National Institute of Biology
Vecna pot 111 POB 141
SI-1000 Ljubljana
SLOVENIA
e-mail: andrej.blejec at nib.si
URL: http://ablejec.nib.si 
tel: + 386 (0)59 232 789
fax: + 386 1 241 29 80
--------------------------
Organizer of
Applied Statistics 2009 conference
http://conferences.nib.si/AS2009
#
The %in% operator is very good! And it can be made even simpler, like this:
x %in% x[duplicated(x)]
 [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
On Thu, May 14, 2009 at 4:43 PM, Andrej Blejec <Andrej.Blejec at nib.si> wrote:
#
Noting that:

> ave(x, x, FUN = length) > 1
 [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

try this:

> rbind(x, dup = (ave(x, x, FUN = length) > 1) + 0)
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x      1    2    3    4    4    5    6    7    8     9
dup    0    0    0    1    1    0    0    0    0     0
On Thu, May 14, 2009 at 2:16 AM, christiaan pauw <cjpauw at gmail.com> wrote:
#
... or, similar in character to Gabor's solution:

tbl <- table(x)
(tbl[as.character(sort(x))]>1)+0
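
Run against the thread's example vector (my assumption for x), this table()-based version gives the same 0/1 flags, though the result comes back named by value and in sorted order:

```r
x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)  # example vector from the thread (assumed)

tbl <- table(x)
# Look up each (sorted) element's count by name; ">1" marks duplicated
# values, and "+0" turns the logical into 0/1
(tbl[as.character(sort(x))] > 1) + 0
```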


Bert Gunter
Nonclinical Biostatistics
467-7374

#
The table()-based solution can have problems when there are
very closely spaced floating point numbers in x, as in
   x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
It also relies on table(x) turning x into a factor with the default
levels=as.character(sort(x)) and that default may change.
It omits NA's from the result. (I think it also ought to put the results in
the original order of the data, so one can, e.g., omit or select values
which are duplicated.)
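
To make the floating-point point concrete, a sketch (the duplicated() comparison is my addition; note that whether table() actually collapses these values depends on the R version's as.character() rules, which is exactly the "default may change" caveat — R 4.3 switched doubles to a round-trippable representation):

```r
eps <- .Machine$double.eps
x1 <- c(1, 1 - eps, 1 + 2 * eps)[c(1, 2, 3, 2, 3)]

length(unique(x1))            # 3 distinct doubles...
unique(sprintf("%.15g", x1))  # ...all rendered "1" at 15 significant digits

# Comparing the doubles directly still gets it right:
# only elements 2-5 are duplicated, element 1 is unique
duplicated(x1) | duplicated(x1, fromLast = TRUE)
```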

The ave()-based solution fails when there are NA's or NaN's in the data.
   x2 <- c(1,2,3,NA,10,6,3)
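
Concretely, a sketch of what "fails" looks like here: no error is raised, but the NA element falls outside every group, so ave() leaves an NA that propagates through the comparison:

```r
x2 <- c(1, 2, 3, NA, 10, 6, 3)

# The NA is dropped from the grouping factor inside ave(), so the
# 4th element of the result stays NA rather than becoming FALSE
ave(x2, x2, FUN = length) >= 2
# [1] FALSE FALSE  TRUE    NA FALSE FALSE  TRUE
```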

The ave()-based solution can be slower than necessary on long datasets,
especially ones with few or no duplicates.
   x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]

I think the following function avoids these problems.  It never converts
the data to character, but uses match() on the original data to convert
it to a set of unique integers that tabulate can handle.
 
f2 <- function(x){
   ix<-match(x,x)
   tix<-tabulate(ix)
   retval<-logical(length(x))
   retval[which(tix!=1)]<-TRUE
   retval
}

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
#
Thanks, Bill. I also had some concerns about how reliable numeric values
converted to character might be, so I'm glad to have an authoritative
criticism. Of course, I was really just being cute with R's versatility. 

But Jim Holtman's solution seems like the best way to go, anyway, does it
not?

-- Bert 

Bert Gunter
Genentech Nonclinical Biostatistics


#
I don't think that that is the conclusion.

All the solutions solve the original problem and the additional
"requirements" may or may not be what is wanted in any
particular case.

The ave solution propagates the NA, which seems like the right thing
to do, whereas the f2 solution and the duplicated solutions label it
FALSE, which seems wrong (though it may be right if that were wanted).
Also, the f2 solution does not pick up the 3 at the end, but again
that may or may not be wanted.

> ave(x2, x2, FUN = length) >= 2     # ave solution
[1] FALSE FALSE  TRUE    NA FALSE FALSE  TRUE
> f2(x2)                             # f2 solution
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
> x2 %in% x2[duplicated(x2)]         # duplicated solution
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

so it all depends on what you want.
On Thu, May 14, 2009 at 1:43 PM, William Dunlap <wdunlap at tibco.com> wrote:
#
That was
    f3 <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)
which is equivalent to
           function(x) duplicated(x) | rev(duplicated(rev(x)))
in S+, which doesn't have the fromLast= argument.
It avoids the problems involved in table() and ave(),
but it just seems sneaky to me.
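
The claimed equivalence is easy to spot-check (a quick sketch with an arbitrary vector of my own, with repeats at both ends):

```r
x <- c(2, 7, 1, 7, 3, 2, 2)

a <- duplicated(x) | duplicated(x, fromLast = TRUE)  # f3, using fromLast=
b <- duplicated(x) | rev(duplicated(rev(x)))         # the S+-compatible spelling
identical(a, b)
# [1] TRUE
```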

Linlin Yan's
    f4 <- function(x) x %in% x[duplicated(x)]
seems to me more direct and also avoids those problems.

Mine was wrong.  It fails on
   x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)
My intent was to provide one that would generalize to identifying
all elements that had n or more repetitions in the input vector.
(E.g., you may want to drop from some analysis subjects with
fewer than 5 observations on them.)  The corrected version is
   f2<-function(x, n=2){
       ix<-match(x,x);
       tix<-tabulate(ix);
       ix %in% which(tix>=n)
   }

E.g.,
> rbind(x, f2(x) + 0, f3(x) + 0, f4(x) + 0)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x    1    2    8    2    4    5   10    1    4    16     2
     1    1    0    1    1    0    0    1    1     0     1
     1    1    0    1    1    0    0    1    1     0     1
     1    1    0    1    1    0    0    1    1     0     1
> rbind(x, f2(x, n = 3) + 0)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x    1    2    8    2    4    5   10    1    4    16     2
     0    1    0    1    0    0    0    0    0     0     1
#
Gabor,

My f2 was just wrong.  It should have been
   f2 <- function(x, n=2){ ix<-match(x,x); tix<-tabulate(ix); ix %in% which(tix>=n) }
which would be roughly the same as your
   f1 <- function(x, n=2) ave(x,x,FUN=length)>=n
and flags all elements of x with >= n repetitions.

ave() involves a call to factor, which folks on R-devel have been fiddling
with to change how it works with close-together numbers, so its results
may vary with the version of R.  The ix<-match(x,x) is a way to avoid
the dependency on factor.
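
A small illustration of the difference (a sketch; x1 is the close-spaced vector from earlier in the thread): match() compares the doubles directly, so values that print identically still keep distinct integer codes.

```r
eps <- .Machine$double.eps
x1 <- c(1, 1 - eps, 1 + 2 * eps)[c(1, 2, 3, 2, 3)]

# Each element is coded by the position of its first occurrence,
# with no round trip through character strings
match(x1, x1)
# [1] 1 2 3 2 3
```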

For very long vectors with few duplicates tabulate is faster than the many
calls to length in ave, and I think f2 uses less memory because of the
lists involved in the calls to split and lapply in ave.  E.g., on a pretty
old Linux machine:
> which(f1(x))
[1]      2      5      7 500001 500002 500003 500004 500005 500006
> which(f2(x))
[1]      2      5      7 500001 500002 500003 500004 500005 500006
> system.time(f1(x))
   user  system elapsed
 23.726   0.250  23.999
> system.time(f2(x))
   user  system elapsed
  0.639   0.003   0.642

ave() is certainly easier to understand.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
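
The timing comparison can be reproduced along these lines (a sketch: the test vector is my stand-in for Bill's unspecified data, scaled down so it runs quickly):

```r
f1 <- function(x, n = 2) ave(x, x, FUN = length) >= n  # ave() version
f2 <- function(x, n = 2) {                             # corrected f2
  ix  <- match(x, x)    # integer code = index of each value's first occurrence
  tix <- tabulate(ix)   # how often each first-occurrence index appears
  ix %in% which(tix >= n)
}

set.seed(1)
x <- sample(1e4)             # long vector with no duplicates...
x[17] <- x[length(x) - 17]   # ...then plant a single repeated value

stopifnot(identical(f1(x), f2(x)))  # same answer
system.time(f1(x))  # ave() pays for one group per distinct value
system.time(f2(x))  # tabulate() makes a single pass
```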