select rows with identical columns from a data frame

Sun, Jan 20, 2013 9:27 AM

On Jan 20, 2013, at 8:26 AM, Sam Steingold wrote:

I am a bit surprised by that. I do agree that it was simple and
concise, two programming virtues that I occasionally achieve. However,
when I tested it against either of Bill Dunlap's suggestions, mine was
15-40 times slower. (So I saved Bill's code and made a mental note to
study its superiority.) I could see why the f2 version was superior,
since it progressively shrank the index candidates for further
comparison, but his first function used no such logic and was still 15
times faster.
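
A minimal sketch of that progressive-shrinking idea, for illustration only. This is not Bill Dunlap's actual f2; the function name `same_rows_shrinking` and the toy data frame `small` are hypothetical, and the sketch assumes every column is an atomic vector of a comparable type.

```r
# Hypothetical helper: return the indices of rows whose columns are all equal,
# narrowing the set of candidate rows one column at a time.
same_rows_shrinking <- function(df) {
  first <- df[[1]]
  idx <- which(!is.na(first))        # a row with NA in column 1 cannot match
  for (j in seq_len(ncol(df))[-1]) {
    if (length(idx) == 0L) break     # no candidates left, stop early
    col <- df[[j]]
    # which() drops NA comparisons, so rows with NA in column j are discarded
    idx <- idx[which(col[idx] == first[idx])]
  }
  idx
}

# Toy data (made up for this example)
small <- data.frame(a = c(1, 2, NA, 3), b = c(1, 2, 1, 4), c = c(1, 3, 1, 3))
res_shrinking <- same_rows_shrinking(small)
```

Each pass compares only the rows that survived the previous column, so later comparisons touch fewer elements when mismatches are common.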

My test included the creation of the smaller data.frame, which his did
not, but when I modified mine to return only the index vector, that
step turned out to be the one that consumed all the time. I wondered
whether it was `which` that consumed the time, but it appears that the
inner step of x==x[[1]] was the culprit.

 > x <- data.frame(lapply(structure(1:10,names=letters[1:10]),
 +    function(i) sample(c(NA,1,1,1,2,2,2,3), replace=TRUE, size=1e6)))

 > system.time({ keep <- x[[1]] == x[[2]]
+    for (i in seq_len(ncol(x))[-(1:2)]) {
+        keep <- keep & x[[i - 1]] == x[[i]]
+    }
+    z2 <- !is.na(keep) & keep})
    user  system elapsed
   0.179   0.056   0.240

 > system.time({z <- rowSums(x==x[[1]]) })
    user  system elapsed
   3.535   0.535   4.067

 > system.time({z <- x==x[[1]] })
    user  system elapsed
   3.540   0.524   4.061
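
One plausible reading of these timings is that `==` applied to a whole data frame goes through the data-frame method for comparison operators, which does more work than comparing plain vectors. The following is a rough sketch of a column-wise alternative, assuming all columns are atomic vectors of the same type and the data frame has more than one row; the helper name `equal_to_first` and the toy data are hypothetical, and whether it is faster on a given machine needs to be timed.

```r
# Hypothetical helper: compare every column to the first one, column by column,
# without applying `==` to the data frame as a whole.
equal_to_first <- function(df) {
  first <- df[[1]]
  # vapply() returns a logical matrix with one column per data-frame column
  vapply(df, function(col) col == first, logical(nrow(df)))
}

# Toy data (made up for this example)
toy <- data.frame(a = c(1, 2, NA, 3), b = c(1, 2, 1, 4), c = c(1, 3, 1, 3))
m_cols <- equal_to_first(toy)

# Rows where every column equals the first; which() drops rows containing NA
all_same <- which(rowSums(m_cols) == ncol(toy))
```

On the toy data the result has the same values as `toy == toy[[1]]`, apart from the row names.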

-- 
David

David Winsemius, MD
Alameda, CA, USA
