Suggestion: 20% speed up of which() with two-character mod

3 messages · Henrik Bengtsson, Martin Maechler

Original

1

3

Henrik Bengtsson

Mon, Aug 4, 2008 9:14 PM #

Hi,

I just want to do a follow up this very simple
fix/correction/speedup/cleanup of the base::which() function.  Here is
a diff:

diff src/library/base/R/which.R which.R
21c21
<     wh <- seq_along(x)[ll <- x & !is.na(x)]
---

25c25
<         names(wh) <- names(x)[ll]
---

FYI, the 'll' variable is not used elsewhere.  I've been going through
this modifications several times and I cannot see any side effects.

Could someone of R core please commit this?

BTW, when one report diff:s, do you prefer to get it with or without
context information, e.g. -C 3?

/Henrik

On Fri, Jul 11, 2008 at 8:57 AM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Martin Maechler

Tue, Aug 5, 2008 5:54 AM #

HenrikB> Hi,

    HenrikB> I just want to do a follow up this very simple
    HenrikB> fix/correction/speedup/cleanup of the base::which() function.  Here is
    HenrikB> a diff:

    HenrikB> diff src/library/base/R/which.R which.R
    HenrikB> 21c21
    HenrikB> <     wh <- seq_along(x)[ll <- x & !is.na(x)]
    HenrikB> ---
    >> wh <- seq_along(x)[x & !is.na(x)]
    HenrikB> 25c25
    HenrikB> <    names(wh) <- names(x)[ll]
    HenrikB> ---
              >> names(wh) <- names(x)[wh]

    HenrikB> FYI, the 'll' variable is not used elsewhere.  I've been going through
    HenrikB> this modifications several times and I cannot see any side effects.

    HenrikB> Could someone of R core please commit this?

I had added your proposition to my version of R-devel in order
to commit it, and had wanted to do my own performance tests
under different scenarios, but I had forgotten / postponed it.
{I have more such things , notably the "help.request() from Kate
 Mullen  -- with quite a few of my own changes, not quite
 finished ... that will have to wait for after useR!2008 ..}

In fact, it seems is pretty obvious that the version with [wh]
instead of [ll] should be faster in most cases, and never
slower,
and so I do commit it now.

Thank you Henrik, for the reminder.

Martin

    HenrikB> BTW, when one report diff:s, do you prefer to get it with or without
    HenrikB> context information, e.g. -C 3?

{My exact preference would depend on the size / style of the
 patch itself. It does not really matter, and as a general rule,
 I'd personally prefer '-u' (unified diffs which include context)}

    HenrikB> /Henrik

HenrikB> On Fri, Jul 11, 2008 at 8:57 AM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:

>> On Thu, 10 Jul 2008, Henrik Bengtsson wrote:

>> 
    >>> Hi,
    >>> 
    >>> by replacing 'll' with 'wh' in the source code for base::which() one
    >>> gets ~20% speed up for *named logical vectors*.
    >> 
    >> 
    >> The amount of speedup depends on how sparse the TRUE values are.
    >> 
    >> When the proportion of TRUEs gets small the speedup is more than twofold on
    >> my macbook. For high proportions of TRUE, the speedup is more like the 20%
    >> you cite.
    >> 
    >> HTH,
    >> 
    >> Chuck
    >> 
    >>> 
    >>> CURRENT CODE:
    >>> 
    >>> which <- function(x, arr.ind = FALSE)
    >>> {
    >>> if(!is.logical(x))
    >>> stop("argument to 'which' is not logical")
    >>> wh <- seq_along(x)[ll <- x & !is.na(x)]
    >>> m <- length(wh)
    >>> dl <- dim(x)
    >>> if (is.null(dl) || !arr.ind) {
    >>> names(wh) <- names(x)[ll]
    >>> }
    >>> ...
    >>> wh;
    >>> }
    >>> 
    >>> SUGGESTED CODE: (Remove 'll' and use 'wh')
    >>> 
    >>> which2 <- function(x, arr.ind = FALSE)
    >>> {
    >>> if(!is.logical(x))
    >>> stop("argument to 'which' is not logical")
    >>> wh <- seq_along(x)[x & !is.na(x)]
    >>> m <- length(wh)
    >>> dl <- dim(x)
    >>> if (is.null(dl) || !arr.ind) {
    >>> names(wh) <- names(x)[wh]
    >>> }
    >>> ...
    >>> wh;
    >>> }
    >>> 
    >>> That's all.
    >>> 
    >>> BENCHMARKING:
    >>> 
    >>> # To measure both in same environment
    >>> which1 <- base::which;
    >>> environment(which1) <- globalenv();  # Needed?
    >>> 
    >>> N <- 1e6;
    >>> set.seed(0xbeef);
    >>> x <- sample(c(TRUE, FALSE), size=N, replace=TRUE);
    >>> names(x) <- seq_along(x);
    >>> B <- 10;
    >>> t1 <- system.time({ for (bb in 1:B) idxs1 <- which1(x); });
    >>> t2 <- system.time({ for (bb in 1:B) idxs2 <- which2(x); });
    >>> stopifnot(identical(idxs1, idxs2));
    >>> print(t1/t2);
    >>> # Fair benchmarking
    >>> t2 <- system.time({ for (bb in 1:B) idxs2 <- which2(x); });
    >>> t1 <- system.time({ for (bb in 1:B) idxs1 <- which1(x); });
    >>> print(t1/t2);
    >>> ##      user    system   elapsed
    >>> ##   1.283186   1.052632   1.250000
    >>> 
    >>> You get similar results if you put for loop outside the system.time()
    >>> call (and sum up the timings).
    >>> 
    >>> Cheers
    >>> 
    >>> Henrik
    >>> 
    >>> ______________________________________________
    >>> R-devel at r-project.org mailing list
    >>> https://stat.ethz.ch/mailman/listinfo/r-devel
    >>> 
    >> 
    >> Charles C. Berry                            (858) 534-2098
    >> Dept of Family/Preventive
    >> Medicine
    >> E mailto:cberry at tajo.ucsd.edu               UC San Diego
    >> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
    >> 
    >> 
    >> 

    HenrikB> ______________________________________________
    HenrikB> R-devel at r-project.org mailing list
    HenrikB> https://stat.ethz.ch/mailman/listinfo/r-devel

Henrik Bengtsson

Tue, Aug 5, 2008 8:45 AM #

Hi,

thanks for this.  I'll use "unified" diff next time, i.e.

  diff -u <current>.R <new>.R

/Henrik

On Tue, Aug 5, 2008 at 5:54 AM, Martin Maechler

<maechler at stat.math.ethz.ch> wrote:

"HenrikB" == Henrik Bengtsson <hb at stat.berkeley.edu>
    on Mon, 4 Aug 2008 21:14:12 -0700 writes:

   HenrikB> Hi,

   HenrikB> I just want to do a follow up this very simple
   HenrikB> fix/correction/speedup/cleanup of the base::which() function.  Here is
   HenrikB> a diff:

   HenrikB> diff src/library/base/R/which.R which.R
   HenrikB> 21c21
   HenrikB> <     wh <- seq_along(x)[ll <- x & !is.na(x)]
   HenrikB> ---

   >> wh <- seq_along(x)[x & !is.na(x)]

   HenrikB> 25c25
   HenrikB> <    names(wh) <- names(x)[ll]
   HenrikB> ---

             >> names(wh) <- names(x)[wh]

   HenrikB> FYI, the 'll' variable is not used elsewhere.  I've been going through
   HenrikB> this modifications several times and I cannot see any side effects.

   HenrikB> Could someone of R core please commit this?

I had added your proposition to my version of R-devel in order
to commit it, and had wanted to do my own performance tests
under different scenarios, but I had forgotten / postponed it.
{I have more such things , notably the "help.request() from Kate
 Mullen  -- with quite a few of my own changes, not quite
 finished ... that will have to wait for after useR!2008 ..}

In fact, it seems is pretty obvious that the version with [wh]
instead of [ll] should be faster in most cases, and never
slower,
and so I do commit it now.

Thank you Henrik, for the reminder.

Martin

   HenrikB> BTW, when one report diff:s, do you prefer to get it with or without
   HenrikB> context information, e.g. -C 3?

{My exact preference would depend on the size / style of the
 patch itself. It does not really matter, and as a general rule,
 I'd personally prefer '-u' (unified diffs which include context)}

   HenrikB> /Henrik

   HenrikB> On Fri, Jul 11, 2008 at 8:57 AM, Charles C. Berry <cberry at tajo.ucsd.edu> wrote:

   >> On Thu, 10 Jul 2008, Henrik Bengtsson wrote:
   >>

   >>> Hi,
   >>>
   >>> by replacing 'll' with 'wh' in the source code for base::which() one
   >>> gets ~20% speed up for *named logical vectors*.

   >>
   >>
   >> The amount of speedup depends on how sparse the TRUE values are.
   >>
   >> When the proportion of TRUEs gets small the speedup is more than twofold on
   >> my macbook. For high proportions of TRUE, the speedup is more like the 20%
   >> you cite.
   >>
   >> HTH,
   >>
   >> Chuck
   >>

   >>>
   >>> CURRENT CODE:
   >>>
   >>> which <- function(x, arr.ind = FALSE)
   >>> {
   >>> if(!is.logical(x))
   >>> stop("argument to 'which' is not logical")
   >>> wh <- seq_along(x)[ll <- x & !is.na(x)]
   >>> m <- length(wh)
   >>> dl <- dim(x)
   >>> if (is.null(dl) || !arr.ind) {
   >>> names(wh) <- names(x)[ll]
   >>> }
   >>> ...
   >>> wh;
   >>> }
   >>>
   >>> SUGGESTED CODE: (Remove 'll' and use 'wh')
   >>>
   >>> which2 <- function(x, arr.ind = FALSE)
   >>> {
   >>> if(!is.logical(x))
   >>> stop("argument to 'which' is not logical")
   >>> wh <- seq_along(x)[x & !is.na(x)]
   >>> m <- length(wh)
   >>> dl <- dim(x)
   >>> if (is.null(dl) || !arr.ind) {
   >>> names(wh) <- names(x)[wh]
   >>> }
   >>> ...
   >>> wh;
   >>> }
   >>>
   >>> That's all.
   >>>
   >>> BENCHMARKING:
   >>>
   >>> # To measure both in same environment
   >>> which1 <- base::which;
   >>> environment(which1) <- globalenv();  # Needed?
   >>>
   >>> N <- 1e6;
   >>> set.seed(0xbeef);
   >>> x <- sample(c(TRUE, FALSE), size=N, replace=TRUE);
   >>> names(x) <- seq_along(x);
   >>> B <- 10;
   >>> t1 <- system.time({ for (bb in 1:B) idxs1 <- which1(x); });
   >>> t2 <- system.time({ for (bb in 1:B) idxs2 <- which2(x); });
   >>> stopifnot(identical(idxs1, idxs2));
   >>> print(t1/t2);
   >>> # Fair benchmarking
   >>> t2 <- system.time({ for (bb in 1:B) idxs2 <- which2(x); });
   >>> t1 <- system.time({ for (bb in 1:B) idxs1 <- which1(x); });
   >>> print(t1/t2);
   >>> ##      user    system   elapsed
   >>> ##   1.283186   1.052632   1.250000
   >>>
   >>> You get similar results if you put for loop outside the system.time()
   >>> call (and sum up the timings).
   >>>
   >>> Cheers
   >>>
   >>> Henrik
   >>>
   >>> ______________________________________________
   >>> R-devel at r-project.org mailing list
   >>> https://stat.ethz.ch/mailman/listinfo/r-devel
   >>>

   >>
   >> Charles C. Berry                            (858) 534-2098
   >> Dept of Family/Preventive
   >> Medicine
   >> E mailto:cberry at tajo.ucsd.edu               UC San Diego
   >> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
   >>
   >>
   >>

   HenrikB> ______________________________________________
   HenrikB> R-devel at r-project.org mailing list
   HenrikB> https://stat.ethz.ch/mailman/listinfo/r-devel