
select rows with identical columns from a data frame

11 messages · Sam Steingold, Rui Barradas, William Dunlap +3 more

#
I have a data frame with several columns.
I want to select the rows with no NAs (as with complete.cases)
and all columns identical.
E.g., for

--8<---------------cut here---------------start------------->8---
   a  b  c
1  1  1  1
2 NA NA NA
3 NA  3  5
4  4 40 40
--8<---------------cut here---------------end--------------->8---

I want the vector TRUE,FALSE,FALSE,FALSE, selecting just the first
row, because in that row all 3 columns are equal and none is NA.

thanks!
#
I can do
  Reduce("==",f[complete.cases(f),])
but that creates an intermediate data frame which I would love to avoid
(to save memory).
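For concreteness, here is a sketch of the example data (the name `f` is assumed to match the snippets in this thread), with the Reduce() approach run on it:

```r
# The example data frame from the question; the name `f` is an assumption.
f <- data.frame(a = c(1, NA, NA, 4),
                b = c(1, NA, 3, 40),
                c = c(1, NA, 5, 40))

# Reduce("==", ...) folds "==" across the columns.  Caveat: the running
# result is logical, and "==" coerces it back to numeric (TRUE == 1),
# so a row like c(2, 2, 1) would be wrongly selected.
Reduce("==", f[complete.cases(f), ])
# [1]  TRUE FALSE
```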

  
    
#
Hello,

Try the following.

complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))
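On the example data this evaluates as follows (a quick sketch; `f` is assumed to be the data frame shown in the question):

```r
f <- data.frame(a = c(1, NA, NA, 4),
                b = c(1, NA, 3, 40),
                c = c(1, NA, 5, 40))

# apply() returns NA for rows containing NAs, but FALSE & NA is FALSE,
# so complete.cases() also cleans up those NAs in the final result.
keep <- complete.cases(f) & apply(f, 1, function(x) all(x == x[1]))
unname(keep)
# [1]  TRUE FALSE FALSE FALSE
```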


Hope this helps,

Rui Barradas

On 18-01-2013 20:53, Sam Steingold wrote:
#
On Jan 18, 2013, at 1:02 PM, Rui Barradas wrote:

            
a b c
1 1 1 1
David Winsemius
Alameda, CA, USA
#
Here are two related approaches to your problem.  The first uses
a logical vector, "keep", to say which rows to keep.  The second
uses an integer vector; it can be considerably faster when the columns
are not well correlated with one another (so that the number of desired
rows is a small proportion of the input rows).

f1 <- function (x) 
{
    # sieve with logical 'keep' vector
    stopifnot(is.data.frame(x), ncol(x) > 1)
    keep <- x[[1]] == x[[2]]
    for (i in seq_len(ncol(x))[-(1:2)]) {
        keep <- keep & x[[i - 1]] == x[[i]]
    }
    !is.na(keep) & keep
}

f2 <- function (x) 
{
    # sieve with integer 'keep' vector
    stopifnot(is.data.frame(x), ncol(x) > 1)
    keep <- which(x[[1]] == x[[2]])
    for (i in seq_len(ncol(x))[-(1:2)]) {
        keep <- keep[which(x[[i - 1]][keep] == x[[i]][keep])]
    }
    seq_len(nrow(x)) %in% keep
}
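A quick sanity check of both functions on the small example from the question (the definitions are repeated from above so the snippet runs standalone; `f` is the assumed name of the example data frame):

```r
f1 <- function(x) {
    # sieve with logical 'keep' vector
    stopifnot(is.data.frame(x), ncol(x) > 1)
    keep <- x[[1]] == x[[2]]
    for (i in seq_len(ncol(x))[-(1:2)])
        keep <- keep & x[[i - 1]] == x[[i]]
    !is.na(keep) & keep
}

f2 <- function(x) {
    # sieve with integer 'keep' vector
    stopifnot(is.data.frame(x), ncol(x) > 1)
    keep <- which(x[[1]] == x[[2]])
    for (i in seq_len(ncol(x))[-(1:2)])
        keep <- keep[which(x[[i - 1]][keep] == x[[i]][keep])]
    seq_len(nrow(x)) %in% keep
}

f <- data.frame(a = c(1, NA, NA, 4),
                b = c(1, NA, 3, 40),
                c = c(1, NA, 5, 40))
f1(f)                    # [1]  TRUE FALSE FALSE FALSE
identical(f1(f), f2(f))  # [1] TRUE
```

Note how f1 relies on NA & FALSE being FALSE, while f2 relies on which() silently dropping NAs; both end up excluding the incomplete rows without an explicit complete.cases() call.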

E.g., for a 10 million by 10 data.frame I get:
user  system elapsed 
   4.04    0.16    4.19
user  system elapsed 
   0.80    0.00    0.79
[1] TRUE
a b c d e f g h i j
4811  2 2 2 2 2 2 2 2 2 2
41706 1 1 1 1 1 1 1 1 1 1
56633 1 1 1 1 1 1 1 1 1 1
70859 1 1 1 1 1 1 1 1 1 1
83848 1 1 1 1 1 1 1 1 1 1
84767 1 1 1 1 1 1 1 1 1 1


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
apply(f, 1, function(x) all(duplicated(x) | duplicated(x, fromLast=TRUE) & !is.na(x)))

#[1]  TRUE FALSE FALSE FALSE
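One subtlety worth spelling out: in R, `&` binds tighter than `|`, so the condition above groups as shown below. This sketch (with the implicit parentheses made explicit, and `f` assumed to be the example data frame) is equivalent:

```r
f <- data.frame(a = c(1, NA, NA, 4),
                b = c(1, NA, 3, 40),
                c = c(1, NA, 5, 40))

# Same test as the one-liner above, with the grouping spelled out:
# a value qualifies if it duplicates an earlier value, or duplicates a
# later value and is not NA; all() then requires every column to qualify.
keep <- apply(f, 1, function(x)
    all(duplicated(x) | (duplicated(x, fromLast = TRUE) & !is.na(x))))
unname(keep)
# [1]  TRUE FALSE FALSE FALSE
```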


A.K.



----- Original Message -----
From: Sam Steingold <sds at gnu.org>
To: r-help at r-project.org
Cc: 
Sent: Friday, January 18, 2013 3:53 PM
Subject: [R] select rows with identical columns from a data frame

1 day later
#
thanks, this works, but it is horribly slow (dim(f) is 766,950 x 2)
#
But David W. and Bill Dunlap gave you solutions that also work and are
much faster, no?!

-- Bert
On Sat, Jan 19, 2013 at 9:41 PM, Sam Steingold <sds at gnu.org> wrote:

  
    
#
Yes, indeed, and I am now using David's solution as it is fast
(enough), simple and concise.

Thanks a lot to David, Bill, Rui, and arun for their answers (to this
question, my many previous questions, and, I hope, my future questions
in advance)!

  
    
#
On Jan 20, 2013, at 8:26 AM, Sam Steingold wrote:

            
I am a bit surprised by that. I do agree that it was simple and
concise, two programming virtues that I occasionally achieve. However,
when I tested it against either of Bill Dunlap's suggestions mine was
15-40 times slower. (So I saved Bill's code and made a mental note to
study its superiority.) I could see why the f2 version was superior,
since it progressively shrank the index candidates for further
comparison, but his first function used no such logic and was still 15
times faster.

My test included the creation of the smaller data.frame, which his did
not, but when I modified mine to only return the index vector, that
was the step that consumed all the time. I wondered if it was `which`
that consumed the time, but it appears the inner step of x == x[[1]]
was the culprit.

 > x <- data.frame(lapply(structure(1:10,names=letters[1:10]),  
function(i) sample(c(NA,1,1,1,2,2,2,3), replace=TRUE, size=1e6)))

 > system.time({ keep <- x[[1]] == x[[2]]
+    for (i in seq_len(ncol(x))[-(1:2)]) {
+        keep <- keep & x[[i - 1]] == x[[i]]
+    }
+    z2 <- !is.na(keep) & keep})
    user  system elapsed
   0.179   0.056   0.240

 > system.time({z <- rowSums(x==x[[1]]) })
    user  system elapsed
   3.535   0.535   4.067

 > system.time({z <- x==x[[1]] })
    user  system elapsed
   3.540   0.524   4.061

-- 
David
#
On Jan 20, 2013, at 9:27 AM, David Winsemius wrote:

            
A further note: I was able to recover most of the timing efficiency with
an initial coercion of the data.frame to a matrix before the "=="
operation:

 > system.time({z <- as.matrix(x)==x[[1]] })
    user  system elapsed
   0.181   0.140   0.320

So it's really `==.data.frame` that is the resource hog.
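Putting that observation to work on the original problem, the whole selection can be done with a single matrix coercion plus rowSums() (a sketch assembled from the snippets in this thread, not a solution posted verbatim by anyone; `f` is the example data frame from the question):

```r
f <- data.frame(a = c(1, NA, NA, 4),
                b = c(1, NA, 3, 40),
                c = c(1, NA, 5, 40))

# Compare every column against the first.  A matrix compared with a
# vector recycles the vector down the columns, so row i is compared
# against f[[1]][i].
eq <- as.matrix(f) == f[[1]]

# rowSums() is NA for any row containing an NA, which conveniently
# stands in for complete.cases(); a qualifying row sums to ncol(f).
rs <- rowSums(eq)
keep <- unname(!is.na(rs) & rs == ncol(f))
keep
# [1]  TRUE FALSE FALSE FALSE
```

This only makes sense when all columns share one type (so as.matrix() does not coerce everything to character); for mixed-type data frames the loop-over-columns approach in f1/f2 above is safer.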