-----Original Message-----
From: r-devel-bounces at r-project.org
[mailto:r-devel-bounces at r-project.org] On Behalf Of William Dunlap
Sent: Monday, March 23, 2009 6:18 PM
To: Wacek Kusnierczyk; r-devel at r-project.org
Subject: Re: [Rd] [R] variance/mean
Doesn't Fortran still require that the arguments to
a function not alias each other (in whole or in part)?
I could imagine that var() might call into Fortran code
(BLAS or LAPACK). Wouldn you want to chance erroneous
results at a high optimization level to save a bit of
time in an unusual situation?
(I could also imagine someone changing the R interpreter
so that x and x[-length(x)] could share the same memory
block and that could cause Fortran aliasing problems as
well.)
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
-----Original Message-----
From: r-devel-bounces at r-project.org
[mailto:r-devel-bounces at r-project.org] On Behalf Of Wacek
Sent: Monday, March 23, 2009 4:40 PM
To: r-devel at r-project.org
Cc: r-help at r-project.org; rkevinburton at charter.net; Bert Gunter
Subject: Re: [Rd] [R] variance/mean
(this post suggests a patch to the sources, so i allow myself
to divert
it to r-devel)
Bert Gunter wrote:
x a numeric vector, matrix or data frame.
y NULL (default) or a vector, matrix or data frame with compatible
dimensions to x. The default is equivalent to y = x (but
bert points to an interesting fragment of ?var: it suggests that
computing var(x) is more efficient than computing var(x,x),
valid as input to var. indeed:
set.seed(0)
x = matrix(rnorm(10000), 100, 100)
library(rbenchmark)
benchmark(replications=1000, columns=c('test', 'elapsed'),
var(x),
var(x, x))
# test elapsed
# 1 var(x) 1.091
# 2 var(x, x) 2.051
that's of course, so to speak, unreasonable: for what
actually computing the covariance of x and x, which should
as var(x,x).
the hack is that if y is given, there's an overhead of memory
allocation
for *both* x and y when y is given, as seen in src/main/cov.c:720+.
incidentally, it seems that the problem can be solved with a
trivial fix
(see the attached patch), so that
set.seed(0)
x = matrix(rnorm(10000), 100, 100)
library(rbenchmark)
benchmark(replications=1000, columns=c('test', 'elapsed'),
var(x),
var(x, x))
# test elapsed
# 1 var(x) 1.121
# 2 var(x, x) 1.107
with the quick checks
all.equal(var(x), var(x, x))
# TRUE
all(var(x) == var(x, x))
# TRUE
and for cor it seems to make cor(x,x) slightly faster than
cor(x), while
originally it was twice slower:
# original
benchmark(replications=1000, columns=c('test', 'elapsed'),
cor(x),
cor(x, x))
# test elapsed
# 1 cor(x) 1.196
# 2 cor(x, x) 2.253
# patched
benchmark(replications=1000, columns=c('test', 'elapsed'),
cor(x),
cor(x, x))
# test elapsed
# 1 cor(x) 1.207
# 2 cor(x, x) 1.204
(there is a visible penalty due to an additional pointer
test, but it's
10ms on 1000 replications with 10000 data points, which i think is
negligible.)
This is as clear as I would know how to state.
i believe bert is right.
however, with the above fix, this can now be rewritten as:
"
x: a numeric vector, matrix or data frame.
y: a vector, matrix or data frame with dimensions compatible
to those of x.
By default, y = x.
"
which, to my simple mind, is even more clear than what bert
how to state, and less likely to cause the sort of confusion that
originated this thread.
the attached patch suggests modifications to src/main/cov.c and
src/library/stats/man/cor.Rd.
it has been prepared and checked as follows:
svn co https://svn.r-project.org/R/trunk trunk
cd trunk
# edited the sources
svn diff > cov.diff
svn revert -R src
patch -p0 < cov.diff
tools/rsync-recommended
./configure
make
make check
bin/R
# subsequent testing within R
if you happen to consider this patch for a commit, please be sure to
examine and test it carefully first.
vQ