duplicates() function
On Mon, Apr 11, 2011 at 02:05:11PM -0400, Duncan Murdoch wrote:
On 08/04/2011 11:39 AM, Joshua Ulrich wrote:
On Fri, Apr 8, 2011 at 10:15 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
How about: y<- rep(NA,length(x)) y[duplicated(x)]<- match(x[duplicated(x)] ,x)
That's a nice solution for vectors. Unfortunately for me, I have a
matrix
(which duplicated() handles by checking whole rows). So a better
example
that I should have posted would be x<- cbind(1, c(9,7,9,3,7) ) and I'd still like the same output
For a matrix, could you apply the same strategy used in duplicated()? y<- rep(NA,NROW(x)) temp<- apply(x, 1, function(x) paste(x, collapse="\r")) y[duplicated(temp)]<- match(temp[duplicated(temp)], temp)
Since this thread hasn't ended, I will say that I think this solution is the best I've seen for my specific problem. I was actually surprised that duplicated() did the string concatenation trick, but since it does, it makes a lot of sense to do the same in duplicates().
Consistency with duplicated() is a good argument.
Let me point out, although it goes beyond the original question, that
sorting may be used to compute duplicated() in a way, which is more
efficient than the paste() approach according to the test below.
duplicatedSort <- function(df)
{
n <- nrow(df)
if (n == 1) {
return(FALSE)
} else {
s <- do.call(order, as.data.frame(df))
equal <- df[s[2:n], , drop=FALSE] == df[s[1:(n-1)], , drop=FALSE]
dup <- c(FALSE, rowSums(equal) == ncol(df))
return(dup[order(s)])
}
}
The following tests efficiency for a character matrix.
m <- 1000
n <- 4
a <- matrix(as.character(sample(10, m*n, replace=TRUE)), nrow=m, ncol=n)
system.time(out1 <- duplicatedSort(a))
system.time(out2 <- duplicated(a))
identical(out1, out2)
table(out1)
I obtained, for example,
user system elapsed
0.003 0.000 0.003
user system elapsed
0.012 0.000 0.011
[1] TRUE
out1
FALSE TRUE
942 58
For a numeric matrix, the ratio of the running times is larger in
the same direction.
Petr Savicky.