Suggestions to speed up median() and has.na()

8 messages · Henrik Bengtsson, Thomas Lumley, Bill Dunlap +1 more

Original

1

8

Henrik Bengtsson

Mon, Apr 10, 2006 10:37 AM #

Hi,

I've got two suggestions how to speed up median() about 50%.  For all
iterative methods calling median() in the loops this has a major
impact.  The second suggestion will apply to other methods too.

This is what the functions look like today:

function (x, na.rm = FALSE)
{
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")
    if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)
    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        sort(x, partial = half)[half]
    }
    else {
        sum(sort(x, partial = c(half, half + 1))[c(half, half +
            1)])/2
    }
}
<environment: namespace:stats>

Suggestion 1:
Replace the sort() calls with the .Internal(psort(x, partial)).   This
will avoid unnecessary overhead, especially an expensive second check
for NAs using any(is.na(x)).  Simple benchmarking with

x <- rnorm(10e6)
system.time(median(x))/system.time(median2(x))

where median2() is the function with the above replacements, gives
about 20-25% speed up.

Suggestion 2:
Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
as soon as a NA value is detected.  In the best case it returns after
the first index with TRUE, in the worst case it returns after the last
index N with FALSE.  The cost for is.na(x) is always O(N), and any()
in the best case O(1) and in the worst case O(N) (if any() is
implemented as I hope).  An has.na() function would be very useful
elsewhere too.

An poor mans alternative to (2), is to have a third alternative to
'na.rm', say, NA, which indicates that we know that there are no NAs
in 'x'.

The original median() is approx 50% slower (naive benchmarking) than a
version with the above two improvements, if passing a large 'x' with
no NAs;

median2 <- function (x, na.rm = FALSE) {
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")

    if (is.na(na.rm)) {
    } else if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)

    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        .Internal(psort(x, half))[half]
    }
    else {
        sum(.Internal(psort(x, c(half, half + 1)))[c(half, half + 1)])/2
    }
}

x <- rnorm(10e5)
K <- 10
t0 <- system.time({
  for (kk in 1:K)
    y <- median(x);
})
print(t0)  # [1] 1.82 0.14 1.98   NA   NA
t1 <- system.time({
  for (kk in 1:K)
    y <- median2(x, na.rm=NA);
})
print(t1)  # [1] 1.25 0.06 1.34   NA   NA
print(t0/t1)  # [1] 1.456000 2.333333 1.477612       NA       NA

BTW, without having checked the source code, it looks like is.na() is
unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
a vector without NAs.  On the other hand, is.na(sum(x)) becomes
awfully slow if 'x' contains NAs.

/Henrik

Mon, Apr 10, 2006 4:22 PM #

On Mon, 10 Apr 2006, Henrik Bengtsson wrote:

I'm surprised this has a major impact -- in your example it takes much 
longer to generate the ten million numbers than to find the median.

There's something that seems a bit undesirable about having median() call 
the .Internal function for sort().

This sounds useful (though it has missed the deadline for 2.3.0).

It won't help if the typical case is no missing values, as you suggest, 
but it will be faster when there are missing values.

I don't think  it is unnecessarily slow.  It has to dispatch methods and 
it has to make sure that matrix structure is preserved.  After that the 
code is just

     case REALSXP:
         for (i = 0; i < n; i++)
             LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
         break;

and it's hard to see how that can be improved. It does suggest that a 
faster anyNA() function would have to not be generic.


 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle

Bill Dunlap

Mon, Apr 10, 2006 4:42 PM #

On Mon, 10 Apr 2006, Thomas Lumley wrote:

Splus has such a function, but it is called anyMissing().  In the
interests of interoperability it would be nice if R used that name.
(I did not choose the name, but that is what it is.)

The following experiment using Splus seems to indicate the speedup has
less to do with stopping at the first NA than it does with not
making/filling/copying/whatever the big vector of logicals that is.na
returns.

   > # NA near start of list of 10 million integers
   > { z<-replace(1:1e7,2,NA); unix.time(anyMissing(z)) }
   [1] 0 0 0 0 0
   > { z<-replace(1:1e7,2,NA); unix.time(any(is.na(z)))}
   [1] 0.62 0.13 0.75 0.00 0.00

   > # NA at end of list
   > { z<-replace(1:1e7,1e7,NA); unix.time(anyMissing(z)) }
   [1] 0.07 0.00 0.07 0.00 0.00
   > { z<-replace(1:1e7,1e7,NA); unix.time(any(is.na(z)))}
   [1] 0.64 0.11 0.75 0.00 0.00

The Splus anyMissing is an s3 generic (i.e., it calls UseMethod()).
The Splus is.na is an s4 generic and its default method may invoke
an s3 generic.

----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."

Mon, Apr 10, 2006 4:48 PM #

On 4/10/2006 7:22 PM, Thomas Lumley wrote:

I think it would help even in that case if the vector is large, because 
it avoids allocating and disposing of the logical vector of the same 
length as x.

If it's necessary to make it not generic to achieve the speedup, I don't 
think it's worth doing.  If anyNA is written not to be generic I'd guess 
a very common error will be to apply it to a dataframe and get a 
misleading "FALSE" answer.  If we do that, I predict that the total 
amount of r-help time wasted on it will exceed the CPU time saved by 
orders of magnitude.

Duncan Murdoch

Mon, Apr 10, 2006 5:08 PM #

On Mon, 10 Apr 2006, Duncan Murdoch wrote:

That makes sense. I have just tried, and for vectors of length ten 
million it does make a measurable difference.

I wasn't proposing that it should be stupid, just not generic.  It could 
support data frames (sum(), does, for example). If it didn't support data 
frames it should certainly give an error rather than the wrong answer, but 
if we are seriously trying to avoid delays around 0.1 seconds then going 
through the generic function mechanism may be a problem.


 	-thomas

Mon, Apr 10, 2006 5:54 PM #

On 4/10/2006 8:08 PM, Thomas Lumley wrote:

On Mon, 10 Apr 2006, Duncan Murdoch wrote:

On 4/10/2006 7:22 PM, Thomas Lumley wrote:

On Mon, 10 Apr 2006, Henrik Bengtsson wrote:

Suggestion 2:
Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
as soon as a NA value is detected.  In the best case it returns after
the first index with TRUE, in the worst case it returns after the last
index N with FALSE.  The cost for is.na(x) is always O(N), and any()
in the best case O(1) and in the worst case O(N) (if any() is
implemented as I hope).  An has.na() function would be very useful
elsewhere too.

This sounds useful (though it has missed the deadline for 2.3.0).

It won't help if the typical case is no missing values, as you suggest, but 
it will be faster when there are missing values.

I think it would help even in that case if the vector is large, because it 
avoids allocating and disposing of the logical vector of the same length as 
x.

That makes sense. I have just tried, and for vectors of length ten 
million it does make a measurable difference.

BTW, without having checked the source code, it looks like is.na() is
unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
a vector without NAs.  On the other hand, is.na(sum(x)) becomes
awfully slow if 'x' contains NAs.

I don't think  it is unnecessarily slow.  It has to dispatch methods and it 
has to make sure that matrix structure is preserved.  After that the code 
is just

     case REALSXP:
         for (i = 0; i < n; i++)
             LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
         break;

and it's hard to see how that can be improved. It does suggest that a 
faster anyNA() function would have to not be generic.

If it's necessary to make it not generic to achieve the speedup, I don't 
think it's worth doing.  If anyNA is written not to be generic I'd guess a 
very common error will be to apply it to a dataframe and get a misleading 
"FALSE" answer.  If we do that, I predict that the total amount of r-help 
time wasted on it will exceed the CPU time saved by orders of magnitude.

I wasn't proposing that it should be stupid, just not generic.  It could 
support data frames (sum(), does, for example). If it didn't support data 
frames it should certainly give an error rather than the wrong answer, but 
if we are seriously trying to avoid delays around 0.1 seconds then going 
through the generic function mechanism may be a problem.

If it's not dataframes, it will be something else.  I think it's highly 
desirable that any(is.na(x)) == anyNA(x) within base packages, and we 
should make it straightforward to maintain this identity in contributed 
packages.

By the way, I think Bill's suggestion of calling it anyMissing makes a 
lot of sense.

Duncan

Henrik Bengtsson

Tue, Apr 11, 2006 4:56 AM #

On 4/11/06, Thomas Lumley <tlumley at u.washington.edu> wrote:

I would still argue that is.na(x) does N comparisons and any() does at
least one and in the worst case N, in total [N+1,2N], but optimally
you can get away with [1,N].  In my case (see below) N is 2,000, then
with 100,000 such vectors and an iteration algorithm with maxIter=20,
I will save up to 4,000,000,000 comparisons!

Yes.  It makes a huge difference for long vectors or many short
vectors.  Take for instance rlm() is MASS.  With the default
arguments, it uses median(abs(x)) to estimate the scale in each
iteration.  In my case (Affymetrix SNP data), I know for sure that 'x'
never contains NAs, or even if it did, I could move that check outside
the iteration loop.  Even if the length of each 'x' is only 20*100,
with a dataset of >100,000 such vectors, in one step, I managed to cut
down the analysis from 14 hours to 7 hours!  This is probably, as you
say, due to allocation/copying/deallocation.

Can we do both and let certain functions such as median() call the
default/internal method explicitly?  Or a low-level and a high-level
function?  Possibly introducing an 'hasNA' argument to many of our
functions, or let 'na.rm=NA' indicate that we no that there are no
NAs; when we know for sure that there is no NAs, even with an optimal
anyNA() function we will waste CPU time in iterative methods.

BTW, I did a quick scan in the R source code and I found 106
occurances of "any(is.na(" not looking at any CRAN or Bioinductor
packages.

Best,

Henrik

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Tue, Apr 11, 2006 7:40 AM #

On Mon, 10 Apr 2006, Duncan Murdoch wrote:

Here we agree.

 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle