select rows with identical columns from a data frame

Here are two related approaches to your problem.  The first uses
a logical vector, "keep", to say which rows to keep.  The second
uses an integer vector of row indices; it can be considerably faster
when the columns are not well correlated with one another (so the
number of desired rows is a small proportion of the input rows).

f1 <- function (x) 
{
    # sieve with logical 'keep' vector
    stopifnot(is.data.frame(x), ncol(x) > 1)
    keep <- x[[1]] == x[[2]]
    for (i in seq_len(ncol(x))[-(1:2)]) {
        keep <- keep & x[[i - 1]] == x[[i]]
    }
    !is.na(keep) & keep
}
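
The last line of f1 matters when the data contain missing values:
a comparison involving NA gives NA, and indexing a data frame with
a logical vector that contains NA produces an all-NA row.  A small
sketch of that step, using a made-up data frame 'd' (illustrative
only, not the data from the original question):

```r
# Toy data: row 3 has a missing value in column 'a' (hypothetical example data)
d <- data.frame(a = c(1, 2, NA, 4), b = c(1, 3, 3, 4), c = c(1, 2, 3, 4))

# Same chained comparison that f1 builds up in its loop
keep <- d$a == d$b & d$b == d$c   # TRUE FALSE NA TRUE

# Indexing with the raw vector keeps an extra all-NA row
nrow(d[keep, ])                   # 3

# Map NA to FALSE, as the last line of f1 does
keep <- !is.na(keep) & keep
nrow(d[keep, ])                   # 2
```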

f2 <- function (x) 
{
    # sieve with integer 'keep' vector
    stopifnot(is.data.frame(x), ncol(x) > 1)
    keep <- which(x[[1]] == x[[2]])
    for (i in seq_len(ncol(x))[-(1:2)]) {
        keep <- keep[which(x[[i - 1]][keep] == x[[i]][keep])]
    }
    seq_len(nrow(x)) %in% keep
}
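
The integer sieve can win because each later comparison only touches
the rows that survived the earlier ones, and which() drops NA as well
as FALSE, so no separate is.na() step is needed.  The sketch below
records how many candidate rows remain after each column; the data,
the seed and the names 'n', 'x' and 'sizes' are assumptions made for
illustration, not part of the original problem:

```r
# Hypothetical test data: 5 columns drawn independently from 1:3,
# so the columns are uncorrelated and few rows have all values equal
set.seed(1)
n <- 1e5
x <- as.data.frame(matrix(sample(1:3, n * 5, replace = TRUE), ncol = 5))

# Same steps as f2, but also recording the number of surviving rows
keep <- which(x[[1]] == x[[2]])
sizes <- length(keep)
for (i in seq_len(ncol(x))[-(1:2)]) {
    keep <- keep[which(x[[i - 1]][keep] == x[[i]][keep])]
    sizes <- c(sizes, length(keep))
}

# Each entry is the number of rows still compared at the next step
sizes
```

With the logical sieve every step compares all n rows, whatever the
earlier columns ruled out.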

E.g., for a 10 million by 10 data.frame I get:
