variance/mean - R-help | R Mailing Lists

Sun, Mar 22, 2009 1:17 AM #

At the risk of appearing ignorant, why is the following true?

o <- cbind(rep(1,3),rep(2,3),rep(3,3))
var(o)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0

and

mean(o)
[1] 2

How do I get mean to return an array similar to var? I would expect in the above example a vector of length 3 {1,2,3}.

Thank you for your help.

Kevin

(Ted Harding)

Sun, Mar 22, 2009 2:01 AM #

On 22-Mar-09 08:17:29, rkevinburton at charter.net wrote:

This is a consequence of (understandable) confusion about how var()
and mean() operate! It is not explicit, in "?var", that if you apply
var() to a matrix, as in your "var(o)" you get the covariance matrix
between the columns of 'o' -- except where it says (almost as an
aside) that "'var' is just another interface to 'cov'". Hence in
your example "var(o)" is equivalent to "cov(o)". Looked at in this
way, it is now straightforward to expect what you got.

This is, of course, different from what you would expect if you apply
var() to a vector, namely the variance of that series of numbers
(a single value).
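
A small sketch of that distinction, reusing the 'o' from the question (the comments state what each call should return; 'm' is just a name chosen for this illustration):

```r
o <- cbind(rep(1, 3), rep(2, 3), rep(3, 3))

# On a matrix, var() gives the covariance matrix of the columns, as cov() does.
m <- var(o)
all.equal(m, cov(o))      # TRUE
dim(m)                    # 3 3

# Each column of 'o' is constant, hence the all-zero matrix.
all(m == 0)               # TRUE

# On a plain vector, var() gives a single number.
var(as.vector(o))         # 0.75, the variance of all nine values
```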

On the other hand, mean() works differently. According to "?mean":
  Arguments:
     x: An R object.  Currently there are methods for numeric
        data frames, numeric vectors and dates.
  [...]
  Value:
     For a data frame, a named vector with the appropriate method
     being applied column by column.

which may have been what you expected. But a matrix is not a data
frame. Instead, it is an array, which (in effect) is a vector with
an attached "dimensions" attribute which tells R how to chop it up
into columns etc. -- whereas a data frame has its "by-column"
structure built in to it.

Now: "?mean" says nothing about matrices. Nothing whatever.
So you have to find out the hard way that mean(o) treats the array
'o' as a vector, ignoring its "dimensions" attribute. Hence you
get a single number, which is the mean of all the values in the
matrix.
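
This can be checked directly. The sketch below uses only base R; 'v' is a name introduced for the illustration:

```r
o <- cbind(rep(1, 3), rep(2, 3), rep(3, 3))

attributes(o)             # only $dim: 3 3
is.data.frame(o)          # FALSE

# Dropping the dimensions leaves the nine underlying values...
v <- as.vector(o)
length(v)                 # 9

# ...and mean() gives the same answer either way.
mean(o) == mean(v)        # TRUE
```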

In order to get what you are apparently looking for (the means of
the columns of 'o'), you could:

a) (the smooth way) use the apply() function, causing mean() to be
   applied to the second dimension (columns) of 'o':

   apply(o,2,mean)
   # [1] 1 2 3

b) (the heavy way) take a hint from "?mean" and feed it a data frame:

   mean(as.data.frame(o))
   # V1 V2 V3
   #  1  2  3 
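
Two further options, sketched here for readers on other R versions: colMeans() works on the matrix directly, and sapply() applies mean() to each column of a data frame. The second one matters because later R releases (3.0.0 onwards, it seems) dropped the data frame method for mean(), so there (b) returns NA with a warning instead of the column means.

```r
o <- cbind(rep(1, 3), rep(2, 3), rep(3, 3))

# Column and row means straight from the matrix.
cm <- colMeans(o)                      # 1 2 3
rm_ <- rowMeans(o)                     # 2 2 2

# Column by column on a data frame, without relying on a
# data frame method for mean().
dm <- sapply(as.data.frame(o), mean)   # named V1 V2 V3: 1 2 3
```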

Hoping this helps to clarify things!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 22-Mar-09                                       Time: 09:01:40
------------------------------ XFMail ------------------------------

Wacek Kusnierczyk

Sun, Mar 22, 2009 2:15 AM #

rkevinburton at charter.net wrote:

you may well be ignorant about how var works with matrices, but this
does not mean it's your fault.  the documentation is typically cryptical.

when you apply var to a single matrix, it will compute covariances
between its columns rather than the overall variance:

    set.seed(0)
    x = matrix(rnorm(4), 2, 2)
   
    var(x)
    #                [,1]     [,2]
    # [1,]  1.2629543 1.329799
    # [2,] -0.3262334 1.272429

    matrix(nrow=2, ncol=2, byrow=TRUE, c(
       cov(x[,1], x[,1]), cov(x[,1], x[,2]),
       cov(x[,2], x[,1]), cov(x[,2], x[,2])))
      
vQ

Wacek Kusnierczyk

Sun, Mar 22, 2009 2:28 AM #

Wacek Kusnierczyk wrote:

except for that i seem to have pasted wrong output.

    set.seed(0)
    x = matrix(rnorm(4), 2, 2)

    var(x)
    #           [,1]        [,2]
    # [1,] 1.2627587 0.045585801
    # [2,] 0.0455858 0.001645655

    matrix(nrow=2, ncol=2, byrow=TRUE, c(
        cov(x[,1], x[,1]), cov(x[,1], x[,2]),
        cov(x[,2], x[,1]), cov(x[,2], x[,2])))
    #           [,1]        [,2]
    # [1,] 1.2627587 0.045585801
    # [2,] 0.0455858 0.001645655
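
the same pairwise construction for any number of columns, as a rough sketch (the 4x3 size and the names 'p' and 'pairwise' are arbitrary choices for the illustration). it also shows that the diagonal of var(x) holds the per-column variances:

```r
set.seed(0)
x <- matrix(rnorm(12), 4, 3)
p <- ncol(x)

# Entry [i, j] is the covariance of column i with column j.
pairwise <- sapply(seq_len(p), function(j)
    sapply(seq_len(p), function(i) cov(x[, i], x[, j])))

all.equal(var(x), pairwise)                  # TRUE

# The diagonal holds the variance of each column.
all.equal(diag(var(x)), apply(x, 2, var))    # TRUE
```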

vQ

Bert Gunter

Mon, Mar 23, 2009 9:06 AM #

Inline Below.

-- Bert 


Bert Gunter
Genentech Nonclinical Biostatistics
650-467-7374

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Wacek Kusnierczyk
Sent: Sunday, March 22, 2009 2:16 AM
To: rkevinburton at charter.net
Cc: r-help at r-project.org
Subject: Re: [R] variance/mean

rkevinburton at charter.net wrote:

You said:

"you may well be ignorant about how var works with matrices, but this
does not mean it's your fault.  the documentation is typically cryptical."


-- How so? ?var clearly states:

" ... If x and y are matrices then the covariances (or correlations) between
the columns of x and the columns of y are computed. "

and the Arguments section says:

x a numeric vector, matrix or data frame. 
y NULL (default) or a vector, matrix or data frame with compatible
dimensions to x. The default is equivalent to y = x (but more efficient). 


This is as clear as I would know how to state. I think "...typically
cryptical" is a canard and most unfair.
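
A short sketch of the two-matrix case that the quoted passage describes. The sizes and the names 'x', 'y' and 'cxy' are arbitrary choices for the illustration:

```r
set.seed(1)
x <- matrix(rnorm(20), 10, 2)   # 10 observations, 2 columns
y <- matrix(rnorm(30), 10, 3)   # 10 observations, 3 columns

# One row per column of x, one column per column of y.
cxy <- var(x, y)
dim(cxy)                                       # 2 3

# Each entry is the covariance of one column of x with one column of y.
all.equal(cxy[1, 3], cov(x[, 1], y[, 3]))      # TRUE

# Leaving y out is the same as passing x twice.
all.equal(var(x), var(x, x))                   # TRUE
```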

-- Bert

Wacek Kusnierczyk

Mon, Mar 23, 2009 4:39 PM #

(this post suggests a patch to the sources, so i allow myself to divert
it to r-devel)

Bert Gunter wrote:

bert points to an interesting fragment of ?var:  it suggests that
computing var(x) is more efficient than computing var(x,x), for any x
valid as input to var.  indeed:

    set.seed(0)
    x = matrix(rnorm(10000), 100, 100)

    library(rbenchmark)
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       var(x),
       var(x, x))
    #        test elapsed
    # 1    var(x)   1.091
    # 2 var(x, x)   2.051

that's of course, so to speak, unreasonable:  for what var(x) does is
actually computing the covariance of x and x, which should be the same
as var(x,x). 
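
for those without the rbenchmark package, roughly the same comparison can be sketched with system.time() from base R. the replication count 'n' and the names 't_one' and 't_two' are arbitrary choices here, and timings depend on machine and R version, so none are quoted:

```r
set.seed(0)
x <- matrix(rnorm(10000), 100, 100)

n <- 200  # replications; an arbitrary choice for this sketch

# Elapsed wall-clock time for n calls of each form.
t_one <- system.time(for (i in seq_len(n)) var(x))["elapsed"]
t_two <- system.time(for (i in seq_len(n)) var(x, x))["elapsed"]

print(c(one_arg = unname(t_one), two_args = unname(t_two)))
```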

the hack is that if y is given, there's an overhead of memory allocation
for *both* x and y, as seen in src/main/cov.c:720+.
incidentally, it seems that the problem can be solved with a trivial fix
(see the attached patch), so that

    set.seed(0)
    x = matrix(rnorm(10000), 100, 100)

    library(rbenchmark)
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       var(x),
       var(x, x))
    #        test elapsed
    # 1    var(x)   1.121
    # 2 var(x, x)   1.107

with the quick checks

    all.equal(var(x), var(x, x))
    # TRUE
   
    all(var(x) == var(x, x))
    # TRUE
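
the two checks are not redundant: all.equal() compares within a numeric tolerance, while == is exact and element-wise. a minimal illustration with ordinary floating-point arithmetic (the values are arbitrary):

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                     # FALSE: the two doubles differ in the last bits
isTRUE(all.equal(a, b))    # TRUE: equal within the default tolerance
identical(a, b)            # FALSE
```

so all(var(x) == var(x, x)) being TRUE is the stronger of the two statements.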

and for cor it seems to make cor(x,x) slightly faster than cor(x), while
originally it was twice as slow:

    # original
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       cor(x),
       cor(x, x))
    #        test elapsed
    # 1    cor(x)   1.196
    # 2 cor(x, x)   2.253
   
    # patched
    benchmark(replications=1000, columns=c('test', 'elapsed'),
       cor(x),
       cor(x, x))
    #        test elapsed
    # 1    cor(x)   1.207
    # 2 cor(x, x)   1.204

(there is a visible penalty due to an additional pointer test, but it's
10ms on 1000 replications with 10000 data points, which i think is
negligible.)

i believe bert is right.

however, with the above fix, the description of x and y in ?var can now
be rewritten as:

"
x: a numeric vector, matrix or data frame. 
y: a vector, matrix or data frame with dimensions compatible to those of x. 
By default, y = x. 
"

which, to my simple mind, is even more clear than what bert would know
how to state, and less likely to cause the sort of confusion that
originated this thread.

the attached patch suggests modifications to src/main/cov.c and
src/library/stats/man/cor.Rd.
it has been prepared and checked as follows:

    svn co https://svn.r-project.org/R/trunk trunk
    cd trunk
    # edited the sources
    svn diff > cov.diff
    svn revert -R src
    patch -p0 < cov.diff

    tools/rsync-recommended
    ./configure
    make
    make check
    bin/R
    # subsequent testing within R

if you happen to consider this patch for a commit, please be sure to
examine and test it carefully first.

vQ