variance/mean

Oops, I was thinking backwards.  This sort of
hack could avoid the Fortran aliasing rules, not
run afoul of them.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
-----Original Message-----
From: r-devel-bounces at r-project.org 
[mailto:r-devel-bounces at r-project.org] On Behalf Of William Dunlap
Sent: Monday, March 23, 2009 6:18 PM
To: Wacek Kusnierczyk; r-devel at r-project.org
Subject: Re: [Rd] [R] variance/mean

Doesn't Fortran still require that the arguments to
a function not alias each other (in whole or in part)?
I could imagine that var() might call into Fortran code
(BLAS or LAPACK).  Wouldn you want to chance erroneous
results  at a high optimization level to save a bit of
time in an unusual situation?

(I could also imagine someone changing the R interpreter
so that x and x[-length(x)] could share the same memory
block and that could cause Fortran aliasing problems as
well.)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

-----Original Message-----
From: r-devel-bounces at r-project.org 
[mailto:r-devel-bounces at r-project.org] On Behalf Of Wacek 
Kusnierczyk
Sent: Monday, March 23, 2009 4:40 PM
To: r-devel at r-project.org
Cc: r-help at r-project.org; rkevinburton at charter.net; Bert Gunter
Subject: Re: [Rd] [R] variance/mean

(this post suggests a patch to the sources, so i allow myself 
to divert
it to r-devel)

Bert Gunter wrote:
x a numeric vector, matrix or data frame. 
y NULL (default) or a vector, matrix or data frame with compatible
dimensions to x. The default is equivalent to y = x (but 
more efficient). 

bert points to an interesting fragment of ?var:  it suggests that
computing var(x) is more efficient than computing var(x,x), 
for any x
valid as input to var.  indeed:

    set.seed(0)
    x = matrix(rnorm(10000), 100, 100)

    library(rbenchmark)
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       var(x),
       var(x, x))
    #        test elapsed
    # 1    var(x)   1.091
    # 2 var(x, x)   2.051

that's of course, so to speak, unreasonable:  for what 
var(x) does is
actually computing the covariance of x and x, which should 
be the same
as var(x,x). 

the hack is that if y is given, there's an overhead of memory 
allocation
for *both* x and y when y is given, as seen in src/main/cov.c:720+.
incidentally, it seems that the problem can be solved with a 
trivial fix
(see the attached patch), so that

    set.seed(0)
    x = matrix(rnorm(10000), 100, 100)

    library(rbenchmark)
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       var(x),
       var(x, x))
    #        test elapsed
    # 1    var(x)   1.121
    # 2 var(x, x)   1.107

with the quick checks

    all.equal(var(x), var(x, x))
    # TRUE

    all(var(x) == var(x, x))
    # TRUE

and for cor it seems to make cor(x,x) slightly faster than 
cor(x), while
originally it was twice slower:

    # original
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       cor(x),
       cor(x, x))
    #        test elapsed
    # 1    cor(x)   1.196
    # 2 cor(x, x)   2.253

    # patched
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       cor(x),
       cor(x, x))
    #        test elapsed
    # 1    cor(x)   1.207
    # 2 cor(x, x)   1.204

(there is a visible penalty due to an additional pointer 
test, but it's
10ms on 1000 replications with 10000 data points, which i think is
negligible.)

This is as clear as I would know how to state. 
i believe bert is right.

however, with the above fix, this can now be rewritten as:

"
x: a numeric vector, matrix or data frame. 
y: a vector, matrix or data frame with dimensions compatible 
to those of x. 
By default, y = x. 
"

which, to my simple mind, is even more clear than what bert 
would know
how to state, and less likely to cause the sort of confusion that
originated this thread.

the attached patch suggests modifications to src/main/cov.c and
src/library/stats/man/cor.Rd.
it has been prepared and checked as follows:

    svn co https://svn.r-project.org/R/trunk trunk
    cd trunk
    # edited the sources
    svn diff > cov.diff
    svn revert -R src
    patch -p0 < cov.diff

    tools/rsync-recommended
    ./configure
    make
    make check
    bin/R
    # subsequent testing within R

if you happen to consider this patch for a commit, please be sure to
examine and test it carefully first.

vQ

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

variance/mean

Thread (4 messages)