select rows with identical columns from a data frame

Here are two related approaches to your problem.  The first uses
a logical vector, "keep", to say which rows to keep.  The second
uses an integer vector of row indices; it can be considerably faster
when the columns are not well correlated with one another (so the
desired rows are a small proportion of the input rows), because each
pass compares only the rows that have survived so far.

f1 <- function (x) 
{
    # sieve with logical 'keep' vector
    stopifnot(is.data.frame(x), ncol(x) > 1)
    keep <- x[[1]] == x[[2]]
    for (i in seq_len(ncol(x))[-(1:2)]) {
        keep <- keep & x[[i - 1]] == x[[i]]
    }
    !is.na(keep) & keep
}

f2 <- function (x) 
{
    # sieve with integer 'keep' vector
    stopifnot(is.data.frame(x), ncol(x) > 1)
    keep <- which(x[[1]] == x[[2]])
    for (i in seq_len(ncol(x))[-(1:2)]) {
        keep <- keep[which(x[[i - 1]][keep] == x[[i]][keep])]
    }
    seq_len(nrow(x)) %in% keep
}
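
To make the return value concrete, here is a small sketch on a toy data frame.
The data and the `ref` computation are illustrative assumptions, not part of the
original post: `ref` compares every column with the first one, which should flag
the same rows as f1 and f2. Rows containing NA are treated as not matching.

```r
# Toy data (hypothetical): only row 2 has identical values in every column;
# row 4 is all NA and should be dropped.
x <- data.frame(a = c(1, 2, 3, NA), b = c(1, 2, 4, NA), c = c(2, 2, 3, NA))

# Independent reference: compare each remaining column with the first column.
ref <- Reduce(`&`, lapply(x[-1], function(col) col == x[[1]]))
ref <- !is.na(ref) & ref          # NA comparisons count as "not equal"

print(ref)
print(x[ref, , drop = FALSE])     # subset the data frame with the logical vector

# If f1 and f2 from the post are defined in this session, check they agree.
if (exists("f1") && exists("f2")) {
  stopifnot(identical(f1(x), ref), identical(f2(x), ref))
}
```

Both functions return a logical vector with one element per row, so the
selected rows are obtained with x[f1(x), ] or x[f2(x), ].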

E.g., for a 10 million by 10 data.frame I get:
   user  system elapsed 
   4.04    0.16    4.19 
   user  system elapsed 
   0.80    0.00    0.79 
[1] TRUE
      a b c d e f g h i j
4811  2 2 2 2 2 2 2 2 2 2
41706 1 1 1 1 1 1 1 1 1 1
56633 1 1 1 1 1 1 1 1 1 1
70859 1 1 1 1 1 1 1 1 1 1
83848 1 1 1 1 1 1 1 1 1 1
84767 1 1 1 1 1 1 1 1 1 1
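
The commands that produced the output above did not survive in this copy of
the message. It looks like two system.time() calls, an identical() check, and
the first few selected rows, but that is a guess. A sketch of how a similar
comparison could be run follows; the data generation (values drawn from 1:3,
the seed, and the smaller n) is entirely an assumption.

```r
# Assumed test data, not the original: n rows, 10 integer columns named a..j.
# n is kept small so this runs quickly; the post used 10 million rows.
set.seed(1)
n <- 1e5
x <- as.data.frame(matrix(sample(1:3, n * 10, replace = TRUE), ncol = 10,
                          dimnames = list(NULL, letters[1:10])))

# Time both functions if they are defined in this session.
if (exists("f1") && exists("f2")) {
  print(system.time(r1 <- f1(x)))   # logical 'keep' vector
  print(system.time(r2 <- f2(x)))   # integer 'keep' vector
  print(identical(r1, r2))          # both should select the same rows
  print(head(x[r1, ]))              # first few rows with identical columns
}
```

Timings will differ by machine and by how often rows match, so the numbers
above should not be expected to reproduce exactly.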


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com