which rows are duplicates?

9 messages · Aaron M. Swoboda, Bill Venables, Michael Dewey +2 more

#
I would like to know which rows are duplicates of each other, not
simply that a row is a duplicate of another row. In the following
example rows 1 and 3 are duplicates.

 > x <- c(1,3,1)
 > y <- c(2,4,2)
 > z <- c(3,4,3)
 > data <- data.frame(x,y,z)
 > data
  x y z
1 1 2 3
2 3 4 4
3 1 2 3

I can't figure out how to get R to tell me that observations 1 and 3
are the same.  It seems like the "duplicated" and "unique" functions
should be able to help me out, but I am stumped.

For instance, if I use "duplicated" ...

 > duplicated(data)
[1] FALSE FALSE TRUE

it tells me that row 3 is a duplicate, but not which row it matches.  
How do I figure out WHICH row it matches?

And if I use "unique"...

 > unique(data)
  x y z
1 1 2 3
2 3 4 4

I see that rows 1 and 2 are unique, leaving me to infer that row 3 was  
a duplicate, but again it doesn't tell me which row it was a duplicate  
of (as far as I can tell). Am I missing something?

How can I determine that row 3 is a duplicate OF ROW 1?

Thanks,

Aaron
#
If you sort the data then the duplicated entries will occur in consecutive blocks:
  x y z
1 1 2 3
2 3 4 4
3 1 2 3
  x y z
1 1 2 3
3 1 2 3
2 3 4 4
[1] FALSE  TRUE FALSE
When you identify the blocks, the row names will tell you where they occur in the original data frame.

Bill Venables
http://www.cmis.csiro.au/bill.venables/ 
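
The commands that produced the output in this reply did not survive in the archive. A minimal sketch of what they could look like, assuming the rows are sorted with order() over all columns; the names sdata and dup are illustrative and not taken from the original message:

```r
x <- c(1, 3, 1)
y <- c(2, 4, 2)
z <- c(3, 4, 3)
data <- data.frame(x, y, z)

# sort on every column so identical rows end up next to each other;
# the row names still refer to positions in the original data frame
sdata <- data[do.call(order, data), ]

# TRUE marks a row that repeats the row just before it in the sorted order
dup <- duplicated(sdata)

# row names of the repeated rows, as positions in the original data
rownames(sdata)[dup]
```

Each TRUE closes a block that started at the nearest preceding FALSE, so the row names of that block are the rows that duplicate each other.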


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
At 05:07 30/03/2009, Aaron M. Swoboda wrote:
Does this do what you want?
 > x <- c(1,3,1)
 > y <- c(2,4,2)
 > z <- c(3,4,3)
 > data <- data.frame(x,y,z)
 > data.u <- unique(data)
 > data.u
   x y z
1 1 2 3
2 3 4 4
 > data.u <- cbind(data.u, set = 1:nrow(data.u))
 > merge(data, data.u)
   x y z set
1 1 2 3   1
2 1 2 3   1
3 3 4 4   2

You need to do a bit more work to get them back into the original row 
order if that is essential.
Michael Dewey
http://www.aghmed.fsnet.co.uk
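
The "bit more work" needed to restore the original row order could look like the following. This is only a sketch: the helper column row is a hypothetical addition, not part of the reply above, and it relies on merge() joining on the columns the two data frames have in common.

```r
x <- c(1, 3, 1)
y <- c(2, 4, 2)
z <- c(3, 4, 3)
data <- data.frame(x, y, z)

# one row per distinct combination, numbered 1, 2, ...
data.u <- unique(data)
data.u$set <- seq_len(nrow(data.u))

# 'row' is a hypothetical helper column recording the original position
data$row <- seq_len(nrow(data))

merged <- merge(data, data.u)          # joins on the shared columns x, y, z
merged <- merged[order(merged$row), ]  # back to the original row order
merged$set
```

Rows 1 and 3 end up with the same set number, and the result is in the same order as the input.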
#
Michael Dewey wrote:
i don't have any solution significantly better than what you have
already been given.  but i have a warning instead.

in the below, you use both 'duplicated' and 'unique' on data frames, and
the proposed solution relies on the latter.  you may want to try to
avoid both when working with data frames;  this is because of how they
do (or don't) work.

duplicated (and unique, which calls duplicated) simply pastes the
content of each row into a *string*, and then works on the strings. 
this means that NAs in the data frame are converted to "NA"s, and "NA"
== "NA", obviously, so that rows that include NAs and are otherwise
identical will be considered *identical*.
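
The effect of pasting can be seen without going through duplicated at all. The sketch below builds the row keys by hand, in the way described above; the data frame df is made up for illustration, and newer versions of R may implement duplicated.data.frame differently, so only the paste step is shown.

```r
# hypothetical data: row 1 holds a missing value, row 2 the literal string "NA"
df <- data.frame(a = c(NA, "NA"), b = c(1, 1), stringsAsFactors = FALSE)

# one string per row, built the same way as described above
keys <- do.call("paste", c(df, sep = "\r"))

keys[1] == keys[2]   # the missing value and the string "NA" give the same key
duplicated(keys)     # so the second row is flagged as a duplicate of the first
```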

that's not bad (yet), but you should be aware.  however, duplicated has
a parameter named 'incomparables', explained in ?duplicated as follows:

"
incomparables: a vector of values that cannot be compared. 'FALSE' is a
          special value, meaning that all values can be compared, and
          may be the only value accepted for methods other than the
          default.  It will be coerced internally to the same type as
          'x'.
"

and also

"
     Values in 'incomparables' will never be marked as duplicated. This
     is intended to be used for a fairly small set of values and will
     not be efficient for a very large set.
"

that is, for example:

    vector = c(NA, NA)
    duplicated(vector)
    # [1] FALSE TRUE
    duplicated(vector, incomparables=NA)
    # [1] FALSE FALSE

    list = list(NA, NA)
    duplicated(list)
    # [1] FALSE TRUE
    duplicated(list, incomparables=NA)
    # [1] FALSE FALSE


what the documentation *fails* to tell you is that the parameter
'incomparables' is defunct in duplicated.data.frame, which you can see
in its source code (below), or in the following example:

    # data as above, or any data frame
    duplicated(data, incomparables=NA)
    # Error in if (!is.logical(incomparables) || incomparables) .NotYetUsed("incomparables != FALSE") :
    #   missing value where TRUE/FALSE needed

the error message here is *confusing*.  the error is raised because the
author of the code made a mistake and apparently hasn't carefully
examined and tested his product;  the code goes:

    duplicated.data.frame
    # function (x, incomparables = FALSE, fromLast = FALSE, ...)
    # {
    #    if (!is.logical(incomparables) || incomparables)
    #        .NotYetUsed("incomparables != FALSE")
    #    duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
    # }
    # <environment: namespace:base>

clearly, the intention here is to raise an error with a (still hardly
clear) message as in:

    .NotYetUsed("incomparables != FALSE")
    # Error: argument 'incomparables != FALSE' is not used (yet)

but instead, if(NA) is evaluated (because '!is.logical(NA) || NA'
evaluates, *obviously*, to NA) and hence the uninformative error message.
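
a small sketch of why the condition ends up as NA, and of one possible stricter test.  the variable names are illustrative, and the stricter test is an assumption about how the check could be written, not the code in base R:

```r
incomparables <- NA

# is.logical(NA) is TRUE, so the left side is FALSE; FALSE || NA is NA
cond <- !is.logical(incomparables) || incomparables
is.na(cond)    # TRUE, so if (cond) has no branch to take and stops with an error

# hypothetical stricter test: anything other than a plain FALSE is rejected
rejected <- !identical(incomparables, FALSE)
rejected       # TRUE, so the intended "not used (yet)" message could be raised
```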

take home point:  rtfm, *but* don't believe it.

vQ

  
    
#
Wacek Kusnierczyk wrote:
i now seem to have one:

    # dummy data
    data = data.frame(x=sample(1:2, 5, replace=TRUE),
                      y=sample(1:2, 5, replace=TRUE))
   
    # add a class column; identical rows have the same class id
    data$class = local({
        rows = do.call('paste', c(data, sep='\r'))
        with(
            rle(sort(rows)),
            rep(1:length(values), lengths)[rank(rows)] ) })

    data
    #   x y class
    # 1 2 2     3
    # 2 2 1     2
    # 3 2 1     2
    # 4 1 2     1
    # 5 2 2     3
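
the rle/rank step is compact, so here is a sketch that spells out the intermediate values.  it assumes fixed input that matches the printed x and y columns above (the original used random samples); runs, ids, r and cls are illustrative names:

```r
data <- data.frame(x = c(2, 2, 2, 1, 2), y = c(2, 1, 1, 2, 2))
rows <- do.call("paste", c(data, sep = "\r"))

# one run per distinct row, in sorted order
runs <- rle(sort(rows))

# class id for every position in the sorted vector
ids <- rep(seq_along(runs$values), runs$lengths)

# position of each original row in the sorted vector; tied rows share an
# averaged rank, which can be fractional
r <- rank(rows)

# fractional indices are truncated, which still lands inside the tie block
cls <- ids[r]
cls
```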


this approach seems to be roughly comparable to michael's, depending on
the shape (and size?) of the input:

    # dummy data frame, just integers
    n = 100; m = 100
    data = as.data.frame(
        matrix(nrow=n, ncol=m,
            sample(n, m*n, replace=TRUE)))

    # do a simple benchmarking
    library(rbenchmark)
    benchmark(replications=100, order='elapsed',
        columns=c('test', 'elapsed'),
        waku=local({
            rows = do.call('paste', c(data, sep='\r'))
            data$class = with(
                rle(sort(rows)),
                rep(1:length(values), lengths)[rank(rows)] ) }),
        mide=local({
            unique = unique(data)
            data = merge(data, cbind(unique, class=1:nrow(unique))) }))

    #   test elapsed
    # 1 waku   0.503
    # 2 mide   3.269

and for m = 10 and n = 1000 i get:

    #   test elapsed
    # 1 waku   0.571
    # 2 mide  15.836

while for m = 1000 and n = 10 i get:

    #   test elapsed
    # 1 waku   1.110
    # 2 mide   2.461

the type of the content should not have any impact on the ratio (pure
guess, no testing done). 

whether my approach is more intuitive is arguable.  note that, unlike in
michael's solution, the final result (the data frame with a class column
added) is in the original order.  (and sorting would add a performance
penalty in the other case.)

my previous remarks about the treatment of NAs still apply;  the
do.call('paste', ... is taken from duplicated.data.frame.

regards,
vQ
#
Wacek Kusnierczyk wrote:
another approach (maybe a bit cleaner) seems to be:

data <- data.frame(x = sample(1:2, 5, replace = TRUE),
                   y = sample(1:2, 5, replace = TRUE))

vals <- do.call('paste', c(data, sep = '\r'))
data$class <- match(vals, unique(vals))
data


I have tried benchmarking it.

Best,
Dimitris
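
Applied to the data from the original question, the same key vector can also report which row each duplicate matches. A sketch, assuming the first occurrence counts as the row being duplicated; first and groups are illustrative names:

```r
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))
vals <- do.call("paste", c(data, sep = "\r"))

# for every row, the index of the first row with the same content
first <- match(vals, vals)
first

# row numbers grouped by class id, one list element per distinct row
groups <- split(seq_along(vals), match(vals, unique(vals)))
groups
```

Here first[3] is 1, which says that row 3 is a duplicate of row 1, and the first element of groups lists rows 1 and 3 together.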

  
    
#
Dimitris Rizopoulos wrote:
sorry, I wanted to write: I have *not* tried benchmarking it.

Best,
Dimitris

  
    
#
Dimitris Rizopoulos wrote:
wow, cool!  this seems unbeatable ;)
i guess it can't be slower than any of the others.

vQ
#
Dimitris Rizopoulos wrote:
    # dummy data frame, just integers
    n = 100; m = 100
    data = as.data.frame(
        matrix(nrow=n, ncol=m,
            sample(n, m*n, replace=TRUE)))

    # do a simple benchmarking
    library(rbenchmark)
    benchmark(
        replications=100,
        order='elapsed',
        columns=c('test', 'elapsed'),
        waku=local({
            rows = do.call('paste', c(data, sep='\r'))
            data$class = with(
                rle(sort(rows)),
                rep(1:length(values), lengths)[rank(rows)] ) }),
        diri=local({
            values = do.call('paste', c(data, sep='\r'))
            data$class = match(values, unique(values)) }) )

    #   test elapsed
    # 2 diri    0.43
    # 1 waku    0.52


comparable for m=n=100 (and even better for n >> m), but way cleaner
code, and the class ids are now better sorted.  that's collaborative
problem solving ;)

best,
vQ