Skip to content
Prev 343096 / 398506 Next

A basic statistics question

On 12-Aug-2014 22:22:13 Ted Harding wrote:
After some hesitation (not wanting to prolong a thread whose original
query has been effectively settled), nevertheless I think it may be
worth stating the general picture, for the sake of users who might
be confused or misled regarding the use of  the functions var(), cov(),
cor(), sd() etc.

The distinction to become aware of is between "summary" statistics
and statistics which will be used for estimation/inference.

Given a set of numbers, x[1], x[2], ... , x[N], they have a mean

  MEAN = sum(x)/N

and their variance VAR is the mean of the deviations between the x[i]
and MEAN:

  VAR = (sum(x - MEAN)^2)/N

If a random value X is drawn from {x[1], x[2], ... , x[N]}, with
uniform probability, then the expectation of X is again

  E(X) = MEAN

and the variance of X is again

  Var(X) = E((X - MEAN)^2) = VAR

with MEAN and VAR as given above. And the R function mean(x) will
return MEAN is its value. However, the R function var(x) will not
return VAR -- it will return (N/(N-1))*VAR

The above definitions of MEAN and VAR use division by N, i.e.
"denominator = N". But the R functions var(x), sd(x) etc. divide by
(N-1), i.e. use "denominator = N-1", as explained in
  ?var
  ?sd
etc.

The basic reason for this is that, given a random sample X[1], ... ,X[n]
of size n from {x[1], x[2], ... , x[N]}, the expectation of

  sum((X - mean(X))^2)

is (n-1)*VAR, so to obtain an unbiased estimate of VAR from the X
sample one must use the "bias-corrected" sample variance

  var(X) = sum((X - mean(X))^2)/(n-1)

i.e. "denominator = (n-1)" as described in the help pages.

So the function var(), with denominator = (n-1), is "correct" for
obtaining an unbiased estimator. But it will not be correct for the
variance sum((X - mean(X))^2)/n of the numbers X[1], ... ,X[n].

Since sd() also uses denominator = (n-1), sd(X) is the square root
of var(X). But, while var(X) is unbiased for VAR, sd(X) is not
unbiased for SD = sqrt(VAR), since, for a positive ransom variable Y,
in general

  E(sqrt(Y)) < sqrt(E(Y))

i.e. E(sd(X)) = E(sqrt(var(X))) < sqrt(E(var(X))) = sqrt(VAR) = SD.

The R functions var() etc., which use denominator = (n-1), do not
have an option which allows the user to choose denominator = n.

Therefore, in particular, a user who has a set of numbers
  {x[1], x[2], ... , x[N]}
(e.g. a record of a popuation) and wants the SD of these (e.g. for
use in summary statistics Mean and SD), could inadvertently use
R's sd(x), expecting SD, without being alerted to the fact that it
will give the wrong answer. And the only way round it is to explicitly
write one's own correction, e.g.

  SD <- function(x){n<-length(x); sd(x)*sqrt((n-1)/n)}

Indeed, this topic has got me wondering how many times I may have
blindly used sd(x) in the past, as if it were going to give me the
standard (sum(x - mean(x))^2)/length(x) result!

As I wrote earlier, when there is more than one definition which
might be used, the important thing is to know which one is being used,
and to correct accordingly if it is not the one you want.

Best wishes to all,
Ted.

-------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at wlandres.net>
Date: 13-Aug-2014  Time: 19:49:02
This message was sent by XFMail