duplicates() function

12 messages · Hadley Wickham, Kasper Daniel Hansen, Joshua Ulrich +5 more

#
I need a function which is similar to duplicated(), but instead of 
returning TRUE/FALSE, returns indices of which element was duplicated.  
That is,

 > x <- c(9,7,9,3,7)
 > duplicated(x)
[1] FALSE FALSE  TRUE FALSE TRUE

 > duplicates(x)
[1] NA NA  1 NA  2

(so that I know that element 3 is a duplicate of element 1, and element 
5 is a duplicate of element 2, whereas the others were not duplicated 
according to our definition.)

Is there a simple way to write this function?  I have  an ugly 
implementation in R that loops over all the values; it would make more 
sense to redo it in C, if there isn't a simple implementation I missed.

Duncan Murdoch
#
How about:

y <- rep(NA,length(x))
y[duplicated(x)] <- match(x[duplicated(x)], x)
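For the record, running this on the example vector from the original post gives exactly the requested answer:

```r
x <- c(9, 7, 9, 3, 7)
y <- rep(NA, length(x))
# match() finds the first occurrence of each duplicated value
y[duplicated(x)] <- match(x[duplicated(x)], x)
y
# [1] NA NA  1 NA  2
```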

--
Joshua Ulrich  |  FOSS Trading: www.fosstrading.com
#
On Fri, Apr 8, 2011 at 9:59 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
I'd think of making it a lookup table.  The basic idea is

split(seq_along(x), x)

but there are probably much faster ways of doing it, depending on what
you need.  But for efficiency, you probably need a hashtable
somewhere.
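A quick sketch of the lookup-table idea on the same example vector — the resulting list is keyed by value, so all positions of a given value are one lookup away:

```r
x <- c(9, 7, 9, 3, 7)
tab <- split(seq_along(x), x)  # positions of each distinct value, keyed by value
tab
# $`3`
# [1] 4
#
# $`7`
# [1] 2 5
#
# $`9`
# [1] 1 3
```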

Hadley
#
On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
That's a nice solution for vectors.  Unfortunately for me, I have a 
matrix (which duplicated() handles by checking whole rows).  So a better 
example that I should have posted would be

x <- cbind(1, c(9, 7, 9, 3, 7))

and I'd still like the same output
[1] FALSE FALSE  TRUE FALSE TRUE
[1] NA NA  1 NA  2


Duncan Murdoch
#
On Fri, Apr 8, 2011 at 11:08 AM, Joshua Ulrich <josh.m.ulrich at gmail.com> wrote:
I use Joshua's trick all the time.  But it might still be nice with a
C implementation.

While we are discussing duplication, I would also like to see
something like duplicated() but which returns TRUE whenever a value is
later duplicated, so I can easily select the values of a vector that
are never duplicated.  Right now I need to do something like
  y[!y %in% y[duplicated(y)]]
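On the running example, that idiom keeps only the values that occur exactly once:

```r
y <- c(9, 7, 9, 3, 7)
# y[duplicated(y)] lists each repeated value; %in% then drops every
# occurrence of those values, including the first one
y[!y %in% y[duplicated(y)]]
# [1] 3
```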
I am only bringing this up because of Duncan's request.

Kasper
#
On Fri, Apr 8, 2011 at 10:15 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
For a matrix, could you apply the same strategy used in duplicated()?

y <- rep(NA,NROW(x))
temp <- apply(x, 1, function(x) paste(x, collapse="\r"))
y[duplicated(temp)] <- match(temp[duplicated(temp)], temp)
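Putting that together on the matrix example — collapsing each row into a single string key with "\r" is the same concatenation trick duplicated() itself uses, as noted later in this thread:

```r
x <- cbind(1, c(9, 7, 9, 3, 7))
y <- rep(NA, NROW(x))
# one string key per row, so the vector trick applies unchanged
temp <- apply(x, 1, function(row) paste(row, collapse = "\r"))
y[duplicated(temp)] <- match(temp[duplicated(temp)], temp)
y
# [1] NA NA  1 NA  2
```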
#
Does R have a function like match() that treats matrices
and data.frames row-wise, as duplicated() and unique() do?
duplicated() and match() do related things and I've been
annoyed that their methods for non-vectors do not match up
with each other.  (For historical reasons match cannot be
changed, but perhaps a new generic is in order.)

JU's code still would not work on matrices, but a variant could.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
On Fri, Apr 08, 2011 at 10:59:10AM -0400, Duncan Murdoch wrote:
A possible strategy is to use sorting. In a sorted matrix
or data frame, the rows that are duplicates of the same row
form consecutive blocks. These blocks may be identified
using !duplicated(), which marks the first element of each
block. Since the sort is stable, when we map these blocks
back to the original order, the first element of each block
is mapped to the first occurrence of the given row in the
original order.

A possible implementation is as follows.

  duplicates <- function(dat)
  {
      s <- do.call("order", as.data.frame(dat))  # stable sort order of the rows
      non.dup <- !duplicated(dat[s, ])           # first row of each block of equal rows
      orig.ind <- s[non.dup]                     # original index of each first occurrence
      first.occ <- orig.ind[cumsum(non.dup)]     # spread those indices over their blocks
      first.occ[non.dup] <- NA                   # first occurrences themselves get NA
      first.occ[order(s)]                        # map back to the original row order
  }
 
  x <- cbind(1, c(9, 7, 9, 3, 7))
  duplicates(x)
  [1] NA NA  1 NA  2

The line

      orig.ind <- s[non.dup]

creates a vector whose length is the number of non-duplicated
rows in the sorted "dat". Its components are the indices of the
corresponding first occurrences of these rows in the original
order. For this, the stability of the sort is needed.

The lines

      first.occ <- orig.ind[cumsum(non.dup)]
      first.occ[non.dup] <- NA

expands orig.ind to a vector which satisfies: if the i-th row of the
sorted "dat" is duplicated, then first.occ[i] is the index of the
first row in the original "dat" that is equal to this row. So, the
values in first.occ are those required for the output
of duplicates(), but they are in the order of the sorted "dat". The
last line 

  first.occ[order(s)]

reorders the vector to the original order of the rows.
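Tracing the intermediate values on the matrix example makes this concrete; the comments show the value produced at each step:

```r
x <- cbind(1, c(9, 7, 9, 3, 7))
s <- do.call("order", as.data.frame(x))  # 4 2 5 1 3: stable sort order of the rows
non.dup <- !duplicated(x[s, ])           # TRUE TRUE FALSE TRUE FALSE
orig.ind <- s[non.dup]                   # 4 2 1: first occurrences, in original indices
first.occ <- orig.ind[cumsum(non.dup)]   # 4 2 2 1 1: spread over the sorted blocks
first.occ[non.dup] <- NA                 # NA NA 2 NA 1: blank out first occurrences
first.occ[order(s)]                      # NA NA 1 NA 2: back in the original row order
```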

Petr Savicky.
1 day later
#
Thanks for your answer, but:
1. Can you please cite the question? It is hard for mailing list readers
to follow now.
2. I think
    which(duplicated(x))
would be simpler, faster and less confusing, if your code were the
solution - which it is not.
3. Please read the original question carefully and note that your code
(and my optimization of it above) give a different, undesired answer.
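For comparison, which(duplicated(x)) returns the positions of the duplicates themselves, not the positions of the elements they duplicate, which is why it does not answer the original question:

```r
x <- c(9, 7, 9, 3, 7)
which(duplicated(x))  # where the duplicates sit, not what they duplicate
# [1] 3 5
```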

Best,
Uwe Ligges
On 09.04.2011 01:05, B77S wrote:
#
On 08/04/2011 11:39 AM, Joshua Ulrich wrote:
Since this thread hasn't ended, I will say that I think this solution is 
the best I've seen for my specific problem.  I was actually surprised 
that duplicated() did the string concatenation trick, but since it does, 
it makes a lot of sense to do the same in duplicates().

I think a good general purpose solution that worked wherever 
duplicated() works would likely be harder, because we don't really have 
the right primitives to make it work.

Duncan Murdoch
#
On Mon, Apr 11, 2011 at 02:05:11PM -0400, Duncan Murdoch wrote:
Consistency with duplicated() is a good argument.

Let me point out, although it goes beyond the original question, that
sorting may be used to compute duplicated() in a way that is more
efficient than the paste() approach, according to the test below.

  duplicatedSort <- function(df)
  {
      n <- nrow(df)
      if (n == 1) {
          return(FALSE)
      } else {
          s <- do.call(order, as.data.frame(df))
          equal <- df[s[2:n], , drop=FALSE] == df[s[1:(n-1)], , drop=FALSE]
          dup <- c(FALSE, rowSums(equal) == ncol(df))
          return(dup[order(s)])
      }
  }

The following tests efficiency for a character matrix.
 
  m <- 1000
  n <- 4
  a <- matrix(as.character(sample(10, m*n, replace=TRUE)), nrow=m, ncol=n)
  system.time(out1 <- duplicatedSort(a))
  system.time(out2 <- duplicated(a))
  identical(out1, out2)
  table(out1)

I obtained, for example,

     user  system elapsed 
    0.003   0.000   0.003 
  
     user  system elapsed 
    0.012   0.000   0.011 
  
  [1] TRUE

  out1
  FALSE  TRUE 
    942    58 

For a numeric matrix, the ratio of the running times favours the
sorting approach even more strongly.

Petr Savicky.