
which() vs. just logical selection in df

3 messages · Bert Gunter, kMan

#
Hi Dr. Snow, & R-helpers,

Thank you for your reply! I hadn't heard of the {microbenchmark}
package & was excited to try it! Thank you for the suggestion! I did
check the reference source for which() beforehand, which included the
statement to remove NAs, and I didn't have any missing values or NAs:

sum(is.na(dat$gender2))
sum(is.na(dat$gender))
sum(is.na(dat$y))

[1] 0
[1] 0
[1] 0
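
For readers wondering why the NA check matters here, a minimal sketch follows. The data frame `toy` and its values are hypothetical and not from the thread; only the column names `gender2` and `y` are borrowed from it. It illustrates that a logical index containing NA produces an all-NA row, while which() keeps only the TRUE positions.

```r
# Hypothetical toy data with one NA in the grouping column
toy <- data.frame(gender2 = c("other", NA, "female", "other"),
                  y = c(1.5, 2.5, 3.5, 4.5))

lgl <- toy$gender2 == "other"     # TRUE, NA, FALSE, TRUE

by_logical <- toy[lgl, ]          # the NA in the index yields an all-NA row
by_which   <- toy[which(lgl), ]   # which() keeps only the TRUE positions

print(by_logical)
print(by_which)
```

With no NAs in the column, as in the checks above, both forms should select the same rows, so any difference left over is one of speed rather than of result.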

I still had a 10ms difference in the value returned by microbenchmark
between the following methods: one with and one without using which().
The difference is reversed from what I expected, since which() is an
extra step.

microbenchmark(
  head(
    dat[which(dat$gender2=="other"),],), times=100L)
microbenchmark(
  head(
    dat[dat$gender2=="other",],), times=100L)

                                                   min       lq      mean
head(dat[which(dat$gender2 == "other"), ], )  62.93803 74.25939   88.4704
head(dat[dat$gender2 == "other", ], )          71.8914 87.95844  103.7231
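
The data set `dat` is not included in the thread, so the comparison cannot be rerun exactly as posted. A minimal sketch for reproducing it on synthetic data follows. The data frame `sim`, its size, and its contents are assumptions made for illustration, and the code falls back to base R's system.time() if {microbenchmark} is not installed.

```r
set.seed(1)
n <- 2e5  # assumed size, chosen to keep the run short

# Synthetic stand-in for the poster's data (an assumption, not the real data)
sim <- data.frame(
  gender2 = sample(c("female", "male", "other"), n, replace = TRUE),
  y = rnorm(n)
)

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  res <- microbenchmark::microbenchmark(
    with_which = head(sim[which(sim$gender2 == "other"), ]),
    logical    = head(sim[sim$gender2 == "other", ]),
    times = 20L
  )
  print(res)
} else {
  # Base-R fallback: coarser timing, but needs no extra package
  print(system.time(for (i in 1:20) head(sim[which(sim$gender2 == "other"), ])))
  print(system.time(for (i in 1:20) head(sim[sim$gender2 == "other", ])))
}
```

Timings will vary with hardware, R version, and data size, so the numbers from a run like this are only indicative.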

Is which() invoking c-level code by chance, making it slightly faster
on average? The difference likely becomes important on terabytes of
data. The addition of which() still seems superfluous to me, and I'd
like to know whether it's considered best practice to keep it. What is
R invoking when which() isn't called explicitly? Is R invoking which()
eventually anyway?

Cheers!
Keith
#
Inline.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Wed, Oct 14, 2020 at 3:23 PM 1/k^c <kchamberln at gmail.com> wrote:
> Is which() invoking c-level code by chance, making it slightly faster
You do not need to ask such questions. R is open source, so just look!
function (x, arr.ind = FALSE, useNames = TRUE)
{
    wh <- .Internal(which(x))   ## C code
    if (arr.ind && !is.null(d <- dim(x)))
        arrayInd(wh, d, dimnames(x), useNames = useNames)
    else wh
}
<bytecode: 0x7fcdba0b8e80>
<environment: namespace:base>
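
The listing above is what R prints when the function name is typed without parentheses. A short sketch of doing the same check programmatically follows. The variable name `calls_internal` is made up for illustration, and the check assumes the body of which() still contains an .Internal() call, as it does in the listing.

```r
# Print the R-level definition, as in the listing above
print(which)

# which() is an ordinary closure rather than a primitive ...
print(is.primitive(which))

# ... whose body hands the work to compiled code through .Internal()
calls_internal <- any(grepl(".Internal", deparse(body(which)), fixed = TRUE))
print(calls_internal)
```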
#
Hi Bert,

Thank you very much! I was unaware that .Internal() referred to C code.

I figured out the difference. With which(), the index is reduced to
only the relevant records before the data frame is subset. With
logical indexing, the full-length index is passed in and the reduction
happens last.
[1] 2000000
[1] 666667
length(dat[index1,])
[1] 666667
length(dat[index2,])
[1] 666667

# 2e6 records, ~13 ms.
microbenchmark(index1 <- dat$gender2 == "other", times = 100L)
# Extra time for which(): ~5 ms.
microbenchmark(index2 <- which(index1), times = 100L)
# Time to return just the TRUE records using the whole 2e6 index: ~99 ms.
microbenchmark(dat[index1, ], times = 100L)
# Time to return all records from the shorter index: ~64 ms.
microbenchmark(dat[index2, ], times = 100L)
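
A minimal sketch of the same step-by-step comparison on synthetic data follows, for anyone who wants to check it without the original `dat`. The data frame `sim` and its size are assumptions. It uses base R's system.time() rather than {microbenchmark}, so the timings are coarser than the ones quoted above.

```r
set.seed(42)
n <- 3e5  # assumed size; the thread used 2e6 records

# Synthetic stand-in for the poster's data (an assumption, not the real data)
sim <- data.frame(
  gender2 = sample(c("female", "male", "other"), n, replace = TRUE),
  y = rnorm(n)
)

index1 <- sim$gender2 == "other"  # logical vector, one element per row
index2 <- which(index1)           # integer positions of the TRUE elements only

print(length(index1))
print(length(index2))

# With no NAs present, both indices should select exactly the same rows
same_rows <- identical(sim[index1, ], sim[index2, ])
print(same_rows)

# Rough elapsed time for 10 repetitions of each subsetting style
t_logical <- system.time(for (i in 1:10) sim[index1, ])[["elapsed"]]
t_integer <- system.time(for (i in 1:10) sim[index2, ])[["elapsed"]]
print(c(logical = t_logical, integer = t_integer))
```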

Cheers,
Keith
On Wed, Oct 14, 2020 at 4:42 PM Bert Gunter <bgunter.4567 at gmail.com> wrote: