At the risk of appearing ignorant, why is the following true?
o <- cbind(rep(1,3),rep(2,3),rep(3,3))
var(o)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
and
mean(o)
[1] 2
How do I get mean to return an array similar to var? In the above example I would expect a vector of length 3, {1,2,3}.
Thank you for your help.
Kevin
variance/mean
On 22-Mar-09 08:17:29, rkevinburton at charter.net wrote:
[...]
This is a consequence of (understandable) confusion about how var()
and mean() operate! It is not explicit, in "?var", that if you apply
var() to a matrix, as in your "var(o)" you get the covariance matrix
between the columns of 'o' -- except where it says (almost as an
aside) that "'var' is just another interface to 'cov'". Hence in
your example "var(o)" is equivalent to "cov(o)". Looked at in this
way, it is now straightforward to expect what you got.
This is, of course, different from what you would expect if you apply
var() to a vector, namely the variance of that series of numbers
(a single value).
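A minimal sketch of the two behaviours just described, using the 'o' from the question (the names v_cols and v_all are illustrative only, not from the original post):

```r
o <- cbind(rep(1, 3), rep(2, 3), rep(3, 3))

# On a matrix, var() returns the covariance matrix of the columns,
# which is the same result that cov() gives.
all.equal(var(o), cov(o))

# Each column of 'o' is constant, so every column variance is 0.
v_cols <- apply(o, 2, var)

# On a plain vector, var() returns a single number. Dropping the
# dimensions first gives the variance of all nine values together.
v_all <- var(as.vector(o))
```

Here v_all is a single number, whereas var(o) is a 3 x 3 matrix.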
On the other hand, mean() works differently. According to "?mean":
  Arguments:
    x: An R object. Currently there are methods for numeric
       data frames, numeric vectors and dates.
  [...]
  Value:
    For a data frame, a named vector with the appropriate method
    being applied column by column.
which may have been what you expected. But a matrix is not a data
frame. Instead, it is an array, which (in effect) is a vector with
an attached "dimensions" attribute which tells R how to chop it up
into columns etc. -- whereas a data frame has its "by-column"
structure built in to it.
Now: "?mean" says nothing about matrices. Nothing whatever.
So you have to find out the hard way that mean(o) treats the array
'o' as a vector, ignoring its "dimensions" attribute. Hence you
get a single number, which is the mean of all the values in the
matrix.
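This can be checked directly. The following is only a sketch (m_all and m_flat are names made up for the illustration):

```r
o <- cbind(rep(1, 3), rep(2, 3), rep(3, 3))

# A matrix is a vector carrying a "dim" attribute; it is not a data frame.
dim(o)
is.data.frame(o)

# mean() ignores the dimensions and averages all nine values,
# exactly as if the matrix had been flattened first.
m_all  <- mean(o)
m_flat <- mean(as.vector(o))
m_all == m_flat
```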
In order to get what you are apparently looking for (the means of
the columns of 'o'), you could:
a) (the smooth way) use the apply() function, causing mean() to be
   applied to the second dimension (columns) of 'o':
     apply(o,2,mean)
     # [1] 1 2 3
b) (the heavy way) take a hint from "?mean" and feed it a data frame:
     mean(as.data.frame(o))
     # V1 V2 V3
     #  1  2  3
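A note for readers on current versions of R: mean() no longer has a data-frame method (it was removed in R 3.0.0, to the best of my knowledge), so option (b) as written is unlikely to return column means there. A sketch of alternatives that should work in both older and current versions (cm, am and sm are illustrative names):

```r
o <- cbind(rep(1, 3), rep(2, 3), rep(3, 3))

# colMeans() is the direct route for a numeric matrix.
cm <- colMeans(o)

# The same values via apply() over columns ...
am <- apply(o, 2, mean)

# ... and via a data frame, applying mean() to each column in turn.
# as.data.frame() names the columns V1, V2, V3.
sm <- sapply(as.data.frame(o), mean)
```

All three give the column means 1, 2, 3; only sm carries column names.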
Hoping this helps to clarify things!
Ted.
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Date: 22-Mar-09 Time: 09:01:40
rkevinburton at charter.net wrote:
[...]
you may well be ignorant about how var works with matrices, but this
does not mean it's your fault. the documentation is typically cryptical.
when you apply var to a single matrix, it will compute covariances
between its columns rather than the overall variance:
set.seed(0)
x = matrix(rnorm(4), 2, 2)
var(x)
# [,1] [,2]
# [1,] 1.2629543 1.329799
# [2,] -0.3262334 1.272429
matrix(nrow=2, ncol=2, byrow=TRUE, c(
cov(x[,1], x[,1]), cov(x[,1], x[,2]),
cov(x[,2], x[,1]), cov(x[,2], x[,2])))
vQ
Wacek Kusnierczyk wrote:
when you apply var to a single matrix, it will compute covariances
between its columns rather than the overall variance:
set.seed(0)
x = matrix(rnorm(4), 2, 2)
var(x)
# [,1] [,2]
# [1,] 1.2629543 1.329799
# [2,] -0.3262334 1.272429
except for that i seem to have pasted wrong output.
set.seed(0)
x = matrix(rnorm(4), 2, 2)
var(x)
#           [,1]        [,2]
# [1,] 1.2627587 0.045585801
# [2,] 0.0455858 0.001645655
matrix(nrow=2, ncol=2, byrow=TRUE, c(
    cov(x[,1], x[,1]), cov(x[,1], x[,2]),
    cov(x[,2], x[,1]), cov(x[,2], x[,2])))
#           [,1]        [,2]
# [1,] 1.2627587 0.045585801
# [2,] 0.0455858 0.001645655
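a small sketch of how the pieces relate, for anyone who wants the per-column or the overall variance instead (col_var and overall are names invented for this illustration):

```r
set.seed(0)
x <- matrix(rnorm(4), 2, 2)

# The diagonal of var(x) holds the variance of each column.
col_var <- apply(x, 2, var)
all.equal(diag(var(x)), col_var)

# The off-diagonal entries are the covariances between columns.
all.equal(var(x)[1, 2], cov(x[, 1], x[, 2]))

# The overall variance of all entries needs the matrix flattened first.
overall <- var(as.vector(x))
```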
vQ
1 day later
Inline Below. -- Bert
Bert Gunter
Genentech Nonclinical Biostatistics
650-467-7374
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Wacek Kusnierczyk
Sent: Sunday, March 22, 2009 2:16 AM
To: rkevinburton at charter.net
Cc: r-help at r-project.org
Subject: Re: [R] variance/mean
rkevinburton at charter.net wrote:
[...]
You said: "you may well be ignorant about how var works with matrices, but this does not mean it's your fault. the documentation is typically cryptical."
-- How so? ?var clearly states:
" ... If x and y are matrices then the covariances (or correlations) between the columns of x and the columns of y are computed. "
and the Arguments section says:
  x  a numeric vector, matrix or data frame.
  y  NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient).
This is as clear as I would know how to state. I think "...typically cryptical" is a canard and most unfair.
-- Bert
(this post suggests a patch to the sources, so i allow myself to divert it to r-devel)
Bert Gunter wrote:
  x  a numeric vector, matrix or data frame.
  y  NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient).
bert points to an interesting fragment of ?var: it suggests that
computing var(x) is more efficient than computing var(x,x), for any x
valid as input to var. indeed:
set.seed(0)
x = matrix(rnorm(10000), 100, 100)
library(rbenchmark)
benchmark(replications=1000, columns=c('test', 'elapsed'),
    var(x),
    var(x, x))
#        test elapsed
# 1    var(x)   1.091
# 2 var(x, x)   2.051
that's of course, so to speak, unreasonable: what var(x) actually does is
compute the covariance of x and x, which should be the same as var(x,x).
the catch is that when y is given, memory is allocated for *both* x and y,
as seen in src/main/cov.c:720+.
incidentally, it seems that the problem can be solved with a trivial fix
(see the attached patch), so that
set.seed(0)
x = matrix(rnorm(10000), 100, 100)
library(rbenchmark)
benchmark(replications=1000, columns=c('test', 'elapsed'),
    var(x),
    var(x, x))
#        test elapsed
# 1    var(x)   1.121
# 2 var(x, x)   1.107
with the quick checks
all.equal(var(x), var(x, x))
# TRUE
all(var(x) == var(x, x))
# TRUE
and for cor it seems to make cor(x,x) slightly faster than cor(x), while
originally it was about twice as slow:
# original
benchmark(replications=1000, columns=c('test', 'elapsed'),
    cor(x),
    cor(x, x))
#        test elapsed
# 1    cor(x)   1.196
# 2 cor(x, x)   2.253
# patched
benchmark(replications=1000, columns=c('test', 'elapsed'),
    cor(x),
    cor(x, x))
#        test elapsed
# 1    cor(x)   1.207
# 2 cor(x, x)   1.204
(there is a visible penalty due to an additional pointer test, but it's
10ms on 1000 replications with 10000 data points, which i think is
negligible.)
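for readers without the rbenchmark package, the var() comparison above can be roughly repeated with base R only. this is a sketch, not the benchmark used for the figures in this post: reps, t_one and t_two are made-up names, and the timings depend on the machine and the R version, so the gap reported above may or may not show up.

```r
set.seed(0)
x <- matrix(rnorm(10000), 100, 100)

# Both calls should give the same covariance matrix, up to numerical tolerance.
same <- isTRUE(all.equal(var(x), var(x, x)))

# Rough timing with system.time(); fewer replications than above to keep it quick.
reps  <- 200
t_one <- system.time(for (i in seq_len(reps)) var(x))[["elapsed"]]
t_two <- system.time(for (i in seq_len(reps)) var(x, x))[["elapsed"]]
c(one_arg = t_one, two_args = t_two)
```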
This is as clear as I would know how to state.
i believe bert is right.
however, with the above fix, this can now be rewritten as:
"
x: a numeric vector, matrix or data frame.
y: a vector, matrix or data frame with dimensions compatible to those of x.
By default, y = x.
"
which, to my simple mind, is even more clear than what bert would know
how to state, and less likely to cause the sort of confusion that
originated this thread.
the attached patch suggests modifications to src/main/cov.c and
src/library/stats/man/cor.Rd.
it has been prepared and checked as follows:
svn co https://svn.r-project.org/R/trunk trunk
cd trunk
# edited the sources
svn diff > cov.diff
svn revert -R src
patch -p0 < cov.diff
tools/rsync-recommended
./configure
make
make check
bin/R
# subsequent testing within R
if you happen to consider this patch for a commit, please be sure to
examine and test it carefully first.
vQ