I need a function which is similar to duplicated(), but instead of returning TRUE/FALSE, returns indices of which element was duplicated. That is, > x <- c(9,7,9,3,7) > duplicated(x) [1] FALSE FALSE TRUE FALSE TRUE > duplicates(x) [1] NA NA 1 NA 2 (so that I know that element 3 is a duplicate of element 1, and element 5 is a duplicate of element 2, whereas the others were not duplicated according to our definition.) Is there a simple way to write this function? I have an ugly implementation in R that loops over all the values; it would make more sense to redo it in C, if there isn't a simple implementation I missed. Duncan Murdoch
duplicates() function
12 messages · Hadley Wickham, Kasper Daniel Hansen, Joshua Ulrich +5 more
How about: y <- rep(NA,length(x)) y[duplicated(x)] <- match(x[duplicated(x)] ,x) -- Joshua Ulrich ?| ?FOSS Trading: www.fosstrading.com
On Fri, Apr 8, 2011 at 9:59 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
I need a function which is similar to duplicated(), but instead of returning TRUE/FALSE, returns indices of which element was duplicated. ?That is,
x <- c(9,7,9,3,7) duplicated(x)
[1] FALSE FALSE ?TRUE FALSE TRUE
duplicates(x)
[1] NA NA ?1 NA ?2 (so that I know that element 3 is a duplicate of element 1, and element 5 is a duplicate of element 2, whereas the others were not duplicated according to our definition.) Is there a simple way to write this function? ?I have ?an ugly implementation in R that loops over all the values; it would make more sense to redo it in C, if there isn't a simple implementation I missed. Duncan Murdoch
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On Fri, Apr 8, 2011 at 9:59 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
I need a function which is similar to duplicated(), but instead of returning TRUE/FALSE, returns indices of which element was duplicated. ?That is,
x <- c(9,7,9,3,7) duplicated(x)
[1] FALSE FALSE ?TRUE FALSE TRUE
duplicates(x)
[1] NA NA ?1 NA ?2 (so that I know that element 3 is a duplicate of element 1, and element 5 is a duplicate of element 2, whereas the others were not duplicated according to our definition.) Is there a simple way to write this function? ?I have ?an ugly implementation in R that loops over all the values; it would make more sense to redo it in C, if there isn't a simple implementation I missed.
I'd think of making it a lookup table. The basic idea is split(seq_along(x), x) but there are probably much faster ways of doing it, depending on what you need. But for efficiency, you probably need a hashtable somewhere. Hadley
Assistant Professor / Dobelman Family Junior Chair Department of Statistics / Rice University http://had.co.nz/
On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
How about: y<- rep(NA,length(x)) y[duplicated(x)]<- match(x[duplicated(x)] ,x)
That's a nice solution for vectors. Unfortunately for me, I have a matrix (which duplicated() handles by checking whole rows). So a better example that I should have posted would be x <- cbind(1, c(9,7,9,3,7) ) and I'd still like the same output
duplicated(x)
[1] FALSE FALSE TRUE FALSE TRUE
duplicates(x)
[1] NA NA 1 NA 2 Duncan Murdoch
-- Joshua Ulrich | FOSS Trading: www.fosstrading.com On Fri, Apr 8, 2011 at 9:59 AM, Duncan Murdoch<murdoch.duncan at gmail.com> wrote:
I need a function which is similar to duplicated(), but instead of returning TRUE/FALSE, returns indices of which element was duplicated. That is,
x<- c(9,7,9,3,7) duplicated(x)
[1] FALSE FALSE TRUE FALSE TRUE
duplicates(x)
[1] NA NA 1 NA 2 (so that I know that element 3 is a duplicate of element 1, and element 5 is a duplicate of element 2, whereas the others were not duplicated according to our definition.) Is there a simple way to write this function? I have an ugly implementation in R that loops over all the values; it would make more sense to redo it in C, if there isn't a simple implementation I missed. Duncan Murdoch
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On Fri, Apr 8, 2011 at 11:08 AM, Joshua Ulrich <josh.m.ulrich at gmail.com> wrote:
How about: y <- rep(NA,length(x)) y[duplicated(x)] <- match(x[duplicated(x)] ,x)
I use Joshua's trick all the time. But it might still be nice with a C implementation. While we are discussing duplication, I would also like to see something like duplicated() but which returns TRUE whenever a value is later duplicated, so I can easily select the values of a vector which has are never duplicated. Right now I need to do something like y [ ! y %in% y[duplicated(y)] ] I am only bringing this up because of Duncan's request. Kasper
-- Joshua Ulrich ?| ?FOSS Trading: www.fosstrading.com On Fri, Apr 8, 2011 at 9:59 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
I need a function which is similar to duplicated(), but instead of returning TRUE/FALSE, returns indices of which element was duplicated. ?That is,
x <- c(9,7,9,3,7) duplicated(x)
[1] FALSE FALSE ?TRUE FALSE TRUE
duplicates(x)
[1] NA NA ?1 NA ?2 (so that I know that element 3 is a duplicate of element 1, and element 5 is a duplicate of element 2, whereas the others were not duplicated according to our definition.) Is there a simple way to write this function? ?I have ?an ugly implementation in R that loops over all the values; it would make more sense to redo it in C, if there isn't a simple implementation I missed. Duncan Murdoch
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On Fri, Apr 8, 2011 at 10:15 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
How about: y<- rep(NA,length(x)) y[duplicated(x)]<- match(x[duplicated(x)] ,x)
That's a nice solution for vectors. ?Unfortunately for me, I have a matrix (which duplicated() handles by checking whole rows). ?So a better example that I should have posted would be x <- ?cbind(1, c(9,7,9,3,7) ) and I'd still like the same output
For a matrix, could you apply the same strategy used in duplicated()? y <- rep(NA,NROW(x)) temp <- apply(x, 1, function(x) paste(x, collapse="\r")) y[duplicated(temp)] <- match(temp[duplicated(temp)], temp)
?duplicated(x)
[1] FALSE FALSE ?TRUE FALSE TRUE
?duplicates(x)
[1] NA NA ?1 NA ?2 Duncan Murdoch
-- Joshua Ulrich ?| ?FOSS Trading: www.fosstrading.com On Fri, Apr 8, 2011 at 9:59 AM, Duncan Murdoch<murdoch.duncan at gmail.com> ?wrote:
?I need a function which is similar to duplicated(), but instead of returning ?TRUE/FALSE, returns indices of which element was duplicated. ?That is,
?x<- c(9,7,9,3,7) ?duplicated(x)
?[1] FALSE FALSE ?TRUE FALSE TRUE
?duplicates(x)
?[1] NA NA ?1 NA ?2 ?(so that I know that element 3 is a duplicate of element 1, and element 5 is ?a duplicate of element 2, whereas the others were not duplicated according ?to our definition.) ?Is there a simple way to write this function? ?I have ?an ugly ?implementation in R that loops over all the values; it would make more sense ?to redo it in C, if there isn't a simple implementation I missed. ?Duncan Murdoch ?______________________________________________ ?R-devel at r-project.org mailing list ?https://stat.ethz.ch/mailman/listinfo/r-devel
-----Original Message----- From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Duncan Murdoch Sent: Friday, April 08, 2011 8:16 AM To: Joshua Ulrich Cc: R-devel at r-project.org Subject: Re: [Rd] duplicates() function On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
How about: y<- rep(NA,length(x)) y[duplicated(x)]<- match(x[duplicated(x)] ,x)
That's a nice solution for vectors. Unfortunately for me, I have a matrix (which duplicated() handles by checking whole rows).
Does R have a function like match() that treats matrices and data.frames row-wise, as duplicated() and unique() do? duplicated() and match() do related things and I've been annoyed that their methods for non-vectors do not match up with each other. (For historical reasons match cannot be changed, but perhaps a new generic is in order.) JU's code still would not work on matrices, but a variant could. Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
So a better example that I should have posted would be x <- cbind(1, c(9,7,9,3,7) ) and I'd still like the same output
duplicated(x)
[1] FALSE FALSE TRUE FALSE TRUE
duplicates(x)
[1] NA NA 1 NA 2 Duncan Murdoch
-- Joshua Ulrich | FOSS Trading: www.fosstrading.com On Fri, Apr 8, 2011 at 9:59 AM, Duncan
Murdoch<murdoch.duncan at gmail.com> wrote:
I need a function which is similar to duplicated(), but
instead of returning
TRUE/FALSE, returns indices of which element was
duplicated. That is,
x<- c(9,7,9,3,7) duplicated(x)
[1] FALSE FALSE TRUE FALSE TRUE
duplicates(x)
[1] NA NA 1 NA 2 (so that I know that element 3 is a duplicate of element
1, and element 5 is
a duplicate of element 2, whereas the others were not
duplicated according
to our definition.) Is there a simple way to write this function? I have an ugly implementation in R that loops over all the values; it
would make more sense
to redo it in C, if there isn't a simple implementation I missed. Duncan Murdoch
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
which(duplicated(x)=="TRUE") -- View this message in context: http://r.789695.n4.nabble.com/duplicates-function-tp3436584p3437614.html Sent from the R devel mailing list archive at Nabble.com.
On Fri, Apr 08, 2011 at 10:59:10AM -0400, Duncan Murdoch wrote:
I need a function which is similar to duplicated(), but instead of returning TRUE/FALSE, returns indices of which element was duplicated. That is,
x <- c(9,7,9,3,7) duplicated(x)
[1] FALSE FALSE TRUE FALSE TRUE
duplicates(x)
[1] NA NA 1 NA 2 (so that I know that element 3 is a duplicate of element 1, and element 5 is a duplicate of element 2, whereas the others were not duplicated according to our definition.) Is there a simple way to write this function?
A possible strategy is to use sorting. In a sorted matrix
or data frame, the elements, which are duplicates of the
same element, form consecutive blocks. These blocks may
be identified using !duplicated(), which determines the
first elements of these blocks. Since sorting is stable,
when we map these blocks back to the original order, the
first element of each block is mapped to the first ocurrence
of the given row in the original order.
An implementation may be done as follows.
duplicates <- function(dat)
{
s <- do.call("order", as.data.frame(dat))
non.dup <- !duplicated(dat[s, ])
orig.ind <- s[non.dup]
first.occ <- orig.ind[cumsum(non.dup)]
first.occ[non.dup] <- NA
first.occ[order(s)]
}
x <- cbind(1, c(9,7,9,3,7) )
duplicates(x)
[1] NA NA 1 NA 2
The line
orig.ind <- s[non.dup]
creates a vector, whose length is the number of non-duplicated
rows in the sorted "dat". Its components are indices of the
corresponding first occurrences of these rows in the original
order. For this, the stability of the order is needed.
The lines
first.occ <- orig.ind[cumsum(non.dup)]
first.occ[non.dup] <- NA
expand orig.ind to a vector, which satisfies: If i-th row of the
sorted "dat" is duplicated, then first.occ[i] is the index of the
first row in the original "dat", which is equal to this row. So, the
values in first.occ are those, which are required for the output
of duplicates(), but they are in the order of the sorted "dat". The
last line
first.occ[order(s)]
reorders the vector to the original order of the rows.
Petr Savicky.
1 day later
Thanks for your answer, but:
1. can you please cite the question? It is hard for mailing list readers
to follow now.
2. I think
which(duplicated(x))
should be simpler, faster and less confusing, if your code would be the
solution - which is not.
3. Please read the original question carefuly and find that your code
and my optimization of it above gives a different undesired answer.
Best,
Uwe Ligges
On 09.04.2011 01:05, B77S wrote:
which(duplicated(x)=="TRUE") -- View this message in context: http://r.789695.n4.nabble.com/duplicates-function-tp3436584p3437614.html Sent from the R devel mailing list archive at Nabble.com.
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On 08/04/2011 11:39 AM, Joshua Ulrich wrote:
On Fri, Apr 8, 2011 at 10:15 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
How about: y<- rep(NA,length(x)) y[duplicated(x)]<- match(x[duplicated(x)] ,x)
That's a nice solution for vectors. Unfortunately for me, I have a matrix (which duplicated() handles by checking whole rows). So a better example that I should have posted would be x<- cbind(1, c(9,7,9,3,7) ) and I'd still like the same output
For a matrix, could you apply the same strategy used in duplicated()? y<- rep(NA,NROW(x)) temp<- apply(x, 1, function(x) paste(x, collapse="\r")) y[duplicated(temp)]<- match(temp[duplicated(temp)], temp)
Since this thread hasn't ended, I will say that I think this solution is the best I've seen for my specific problem. I was actually surprised that duplicated() did the string concatenation trick, but since it does, it makes a lot of sense to do the same in duplicates(). I think a good general purpose solution that worked wherever duplicated() works would likely be harder, because we don't really have the right primitives to make it work. Duncan Murdoch
duplicated(x)
[1] FALSE FALSE TRUE FALSE TRUE
duplicates(x)
[1] NA NA 1 NA 2 Duncan Murdoch
-- Joshua Ulrich | FOSS Trading: www.fosstrading.com On Fri, Apr 8, 2011 at 9:59 AM, Duncan Murdoch<murdoch.duncan at gmail.com> wrote:
> I need a function which is similar to duplicated(), but instead of > returning > TRUE/FALSE, returns indices of which element was duplicated. That is, >
>> x<- c(9,7,9,3,7) >> duplicated(x)
> [1] FALSE FALSE TRUE FALSE TRUE >
>> duplicates(x)
> [1] NA NA 1 NA 2 > > (so that I know that element 3 is a duplicate of element 1, and element > 5 is > a duplicate of element 2, whereas the others were not duplicated > according > to our definition.) > > Is there a simple way to write this function? I have an ugly > implementation in R that loops over all the values; it would make more > sense > to redo it in C, if there isn't a simple implementation I missed. > > Duncan Murdoch > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >
On Mon, Apr 11, 2011 at 02:05:11PM -0400, Duncan Murdoch wrote:
On 08/04/2011 11:39 AM, Joshua Ulrich wrote:
On Fri, Apr 8, 2011 at 10:15 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
How about: y<- rep(NA,length(x)) y[duplicated(x)]<- match(x[duplicated(x)] ,x)
That's a nice solution for vectors. Unfortunately for me, I have a
matrix
(which duplicated() handles by checking whole rows). So a better
example
that I should have posted would be x<- cbind(1, c(9,7,9,3,7) ) and I'd still like the same output
For a matrix, could you apply the same strategy used in duplicated()? y<- rep(NA,NROW(x)) temp<- apply(x, 1, function(x) paste(x, collapse="\r")) y[duplicated(temp)]<- match(temp[duplicated(temp)], temp)
Since this thread hasn't ended, I will say that I think this solution is the best I've seen for my specific problem. I was actually surprised that duplicated() did the string concatenation trick, but since it does, it makes a lot of sense to do the same in duplicates().
Consistency with duplicated() is a good argument.
Let me point out, although it goes beyond the original question, that
sorting may be used to compute duplicated() in a way, which is more
efficient than the paste() approach according to the test below.
duplicatedSort <- function(df)
{
n <- nrow(df)
if (n == 1) {
return(FALSE)
} else {
s <- do.call(order, as.data.frame(df))
equal <- df[s[2:n], , drop=FALSE] == df[s[1:(n-1)], , drop=FALSE]
dup <- c(FALSE, rowSums(equal) == ncol(df))
return(dup[order(s)])
}
}
The following tests efficiency for a character matrix.
m <- 1000
n <- 4
a <- matrix(as.character(sample(10, m*n, replace=TRUE)), nrow=m, ncol=n)
system.time(out1 <- duplicatedSort(a))
system.time(out2 <- duplicated(a))
identical(out1, out2)
table(out1)
I obtained, for example,
user system elapsed
0.003 0.000 0.003
user system elapsed
0.012 0.000 0.011
[1] TRUE
out1
FALSE TRUE
942 58
For a numeric matrix, the ratio of the running times is larger in
the same direction.
Petr Savicky.