Duplicates and duplicated

12 messages · christiaan pauw, Andrej Blejec, Linlin Yan +4 more

#
On Thu, May 14, 2009 at 2:16 PM, christiaan pauw <cjpauw at gmail.com> wrote:
How about

rbind(x, duplicated(x) | duplicated(x, fromLast=TRUE))
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    4    5    6    7    8     9
     0    0    0    1    1    0    0    0    0     0
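
For a self-contained run, here is that one-liner with the example vector implied by the output (the definition of x is my assumption, not part of the original post):

```r
# Example vector implied by the output above (an assumption)
x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)

# duplicated() marks only later occurrences; OR-ing it with the
# fromLast pass marks every occurrence of a repeated value.
# rbind() coerces the logical row to 0/1 alongside x.
rbind(x, duplicated(x) | duplicated(x, fromLast = TRUE))
```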
#
Try this:

> y <- duplicated(x)
> rbind(x, y)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x    1    2    3    4    4    5    6    7    8     9
y    0    0    0    0    1    0    0    0    0     0
> which(y)
[1] 5
> x[which(y)]
[1] 4
> x %in% x[which(y)]
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

Andrej

--
Andrej Blejec
National Institute of Biology
Vecna pot 111 POB 141
SI-1000 Ljubljana
SLOVENIA
e-mail: andrej.blejec at nib.si
URL: http://ablejec.nib.si 
tel: + 386 (0)59 232 789
fax: + 386 1 241 29 80
--------------------------
Organizer of
Applied Statistics 2009 conference
http://conferences.nib.si/AS2009
#
The %in% operator is very good! And it can be made even simpler, like this:
x %in% x[duplicated(x)]
 [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
On Thu, May 14, 2009 at 4:43 PM, Andrej Blejec <Andrej.Blejec at nib.si> wrote:
#
Noting that:

> ave(x, x, FUN = length) > 1
 [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

try this:

> rbind(x, dup = (ave(x, x, FUN = length) > 1) + 0)
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
x      1    2    3    4    4    5    6    7    8     9
dup    0    0    0    1    1    0    0    0    0     0
On Thu, May 14, 2009 at 2:16 AM, christiaan pauw <cjpauw at gmail.com> wrote:
#
... or, similar in character to Gabor's solution:

tbl <- table(x)
(tbl[as.character(sort(x))]>1)+0
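
Run against the thread's example vector (my assumption for x), this table()-based version gives the same 0/1 flags, though the result comes back named by value and in sorted order:

```r
x <- c(1, 2, 3, 4, 4, 5, 6, 7, 8, 9)  # example vector from the thread (assumed)

tbl <- table(x)
# Look up each (sorted) element's count by name; ">1" marks duplicated
# values, and "+0" turns the logical into 0/1
(tbl[as.character(sort(x))] > 1) + 0
```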


Bert Gunter
Nonclinical Biostatistics
467-7374

#
The table()-based solution can have problems when there are
very closely spaced floating point numbers in x, as in
   x1<-c(1, 1-.Machine$double.eps, 1+2*.Machine$double.eps)[c(1,2,3,2,3)]
It also relies on table(x) turning x into a factor with the default
levels=as.character(sort(x)) and that default may change.
It omits NA's from the result. (I think it also ought to put the results in
the original order of the data, so one can, e.g., omit or select values
which are duplicated.)
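
To make the floating-point point concrete, a sketch (the duplicated() comparison is my addition; note that whether table() actually collapses these values depends on the R version's as.character() rules, which is exactly the "default may change" caveat — R 4.3 switched doubles to a round-trippable representation):

```r
eps <- .Machine$double.eps
x1 <- c(1, 1 - eps, 1 + 2 * eps)[c(1, 2, 3, 2, 3)]

length(unique(x1))            # 3 distinct doubles...
unique(sprintf("%.15g", x1))  # ...all rendered "1" at 15 significant digits

# Comparing the doubles directly still gets it right:
# only elements 2-5 are duplicated, element 1 is unique
duplicated(x1) | duplicated(x1, fromLast = TRUE)
```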

The ave()-based solution fails when there are NA's or NaN's in the data.
   x2 <- c(1,2,3,NA,10,6,3)
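
Concretely, a sketch of what "fails" looks like here: no error is raised, but the NA element falls outside every group, so ave() leaves an NA that propagates through the comparison:

```r
x2 <- c(1, 2, 3, NA, 10, 6, 3)

# The NA is dropped from the grouping factor inside ave(), so the
# 4th element of the result stays NA rather than becoming FALSE
ave(x2, x2, FUN = length) >= 2
# [1] FALSE FALSE  TRUE    NA FALSE FALSE  TRUE
```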

The ave()-based solution can be slower than necessary on long datasets,
especially ones with few or no duplicates.
   x3 <- sample(1e5,replace=FALSE) ; x3[17] <- x3[length(x3)-17]

I think the following function avoids these problems.  It never converts
the data to character, but uses match() on the original data to convert
it to a set of unique integers that tabulate can handle.
 
f2 <- function(x){
   ix<-match(x,x)
   tix<-tabulate(ix)
   retval<-logical(length(x))
   retval[which(tix!=1)]<-TRUE
   retval
}

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
#
Thanks, Bill. I also had some concerns about how reliable numeric values
converted to character might be, so I'm glad to have an authoritative
criticism. Of course, I was really just being cute with R's versatility. 

But Jim Holtman's solution seems like the best way to go, anyway, does it
not?

-- Bert 

Bert Gunter
Genentech Nonclinical Biostatistics


#
I don't think that that is the conclusion.

All the solutions solve the original problem and the additional
"requirements" may or may not be what is wanted in any
particular case.

The ave solution propagates the NA, which seems like the right thing
to do, whereas the f2 solution and the duplicated solutions label it
FALSE, which seems wrong (though it may be right if that were wanted).
Also, the f2 solution does not pick up the 3 at the end, but again
that may or may not be wanted.

> ave(x2, x2, FUN = length) >= 2     # ave solution
[1] FALSE FALSE  TRUE    NA FALSE FALSE  TRUE
> f2(x2)                             # f2 solution
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
> x2 %in% x2[duplicated(x2)]         # duplicated solution
[1] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

so it all depends on what you want.
On Thu, May 14, 2009 at 1:43 PM, William Dunlap <wdunlap at tibco.com> wrote:
#
That was
    f3 <- function(x) duplicated(x) | duplicated(x, fromLast=TRUE)
which is equivalent to
           function(x) duplicated(x) | rev(duplicated(rev(x)))
in S+, which doesn't have the fromLast= argument.
It avoids the problems involved in table() and ave(),
but it just seems sneaky to me.
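
The claimed equivalence is easy to spot-check (a quick sketch with an arbitrary vector of my own, with repeats at both ends):

```r
x <- c(2, 7, 1, 7, 3, 2, 2)

a <- duplicated(x) | duplicated(x, fromLast = TRUE)  # f3, using fromLast=
b <- duplicated(x) | rev(duplicated(rev(x)))         # the S+-compatible spelling
identical(a, b)
# [1] TRUE
```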

Linlin Yan's
    f4 <- function(x) x %in% x[duplicated(x)]
seems to me more direct and also avoids those problems.

Mine was wrong.  It fails on
   x <- c(1, 2, 8, 2, 4, 5, 10, 1, 4, 16, 2)
My intent was to provide one that would generalize to identifying
all elements that had n or more repetitions in the input vector.
(E.g., you may want to drop from some analysis subjects with
fewer than 5 observations on them.)  The corrected version is
   f2<-function(x, n=2){
       ix<-match(x,x);
       tix<-tabulate(ix);
       ix %in% which(tix>=n)
   }

E.g.,
> rbind(x, f2(x) + 0, f3(x) + 0, f4(x) + 0)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x    1    2    8    2    4    5   10    1    4    16     2
     1    1    0    1    1    0    0    1    1     0     1
     1    1    0    1    1    0    0    1    1     0     1
     1    1    0    1    1    0    0    1    1     0     1
> rbind(x, f2(x, n = 3) + 0)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
x    1    2    8    2    4    5   10    1    4    16     2
     0    1    0    1    0    0    0    0    0     0     1
#
Gabor,

My f2 was just wrong.  It should have been
   f2 <- function(x, n=2){ ix<-match(x,x); tix<-tabulate(ix); ix %in% which(tix>=n) }
which would be roughly the same as your
   f1 <- function(x, n=2) ave(x,x,FUN=length)>=n
and flags all elements of x with >= n repetitions.

ave() involves a call to factor, which folks on R-devel have been fiddling
with to change how it works with close-together numbers, so its results
may vary with the version of R.  The ix<-match(x,x) is a way to avoid
the dependency on factor.
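
A small illustration of the difference (a sketch; x1 is the close-spaced vector from earlier in the thread): match() compares the doubles directly, so values that print identically still keep distinct integer codes.

```r
eps <- .Machine$double.eps
x1 <- c(1, 1 - eps, 1 + 2 * eps)[c(1, 2, 3, 2, 3)]

# Each element is coded by the position of its first occurrence,
# with no round trip through character strings
match(x1, x1)
# [1] 1 2 3 2 3
```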

For very long vectors with few duplicates tabulate is faster than the many
calls to length in ave, and I think f2 uses less memory because of the
lists involved in the calls to split and lapply in ave.  E.g., on a pretty
old Linux machine:
> which(f1(x))
[1]      2      5      7 500001 500002 500003 500004 500005 500006
> which(f2(x))
[1]      2      5      7 500001 500002 500003 500004 500005 500006
> system.time(f1(x))
   user  system elapsed
 23.726   0.250  23.999
> system.time(f2(x))
   user  system elapsed
  0.639   0.003   0.642

ave() is certainly easier to understand.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
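
The timing comparison can be reproduced along these lines (a sketch: the test vector is my stand-in for Bill's unspecified data, scaled down so it runs quickly):

```r
f1 <- function(x, n = 2) ave(x, x, FUN = length) >= n  # ave() version
f2 <- function(x, n = 2) {                             # corrected f2
  ix  <- match(x, x)    # integer code = index of each value's first occurrence
  tix <- tabulate(ix)   # how often each first-occurrence index appears
  ix %in% which(tix >= n)
}

set.seed(1)
x <- sample(1e4)             # long vector with no duplicates...
x[17] <- x[length(x) - 17]   # ...then plant a single repeated value

stopifnot(identical(f1(x), f2(x)))  # same answer
system.time(f1(x))  # ave() pays for one group per distinct value
system.time(f2(x))  # tabulate() makes a single pass
```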