Skip to content

regression towards the mean, AS paper November 2007

7 messages · Troels Ring, Peter Dalgaard, Duncan Murdoch +3 more

#
Dear friends, regression towards the mean is interesting in medical 
circles, and a very recent paper (The American Statistician November 
2007;61:302-307 by Krause and Pinheiro) treats it at length. An initial 
example specifies (p 303):
"Consider the following example: we draw 100 samples from a bivariate 
Normal distribution with X0~N(0,1), X1~N(0,1) and cov(X0,X1)=0.7, We 
then calculate the p value for the null hypothesis that the means of X0 
and X1 are equal, using a paired Student's t test. The procedure is 
repeated 1000 times, producing 1000 simulated p values. Because X0 and 
X1 have identical marginal distributions, the simulated p values behave 
like independent Uniform(0,1) random variables." This I did not 
understand, and simulating like shown below produced far from uniform 
(0,1) p values - but I fail to see how it is wrong. I contacted the 
authors of the paper but they did not answer. So, please, doesn?t the 
code below specify a bivariate N(0,1) with covariance 0.7? I get p 
values = 1 all over - not interesting, but how wrong?
Best wishes
Troels

library(MASS)
Sigma <- matrix(c(1,0.7,0.7,1),2,2)
Sigma
res <- NULL
for (i in 1:1000){
ff <-(mvrnorm(n=100, rep(0, 2), Sigma, empirical = TRUE))
res[i] <- t.test(ff[,1],ff[,2],paired=TRUE)$p.value}
#
Troels Ring wrote:
You do not want empirical=TRUE in the mvrnorm call. This pegs the
empirical means to exactly (0,0) which are obviously never significantly
different.

BTW, there's a function called replicate()...
#
On 12/17/2007 1:21 PM, Troels Ring wrote:
Specifying empirical=TRUE means that your sampled values are not 
independent, the means are guaranteed to match exactly, and the mean 
difference is exactly zero.  Thus all of the t statistics are exactly 
zero, and the p-values are exactly 1.

Set empirical=FALSE (the default), and you'll see more reasonable results.

Duncan Murdoch
#
On 18/12/2007, at 7:32 AM, Duncan Murdoch wrote:

            
This has nothing to do really with the question that Troels asked,
	but the exposition quoted from the AA paper is unnecessarily confusing.
	The phrase ``Because X0 and X1 have identical marginal  
distributions ...''
	throws the reader off the track.  The identical marginal distributions
	are irrelevant.  All one needs is that the ***means*** of X0 and X1
	be the same, and then the null hypothesis tested by a paired t-test
	is true and so the p-values are (asymptotically) Uniform[0,1].  With
	a sample size of 100, the ``asymptotically'' bit can be safely ignored
	for any ``decent'' joint distribution of X0 and X1.  If one further
	assumes that X0 - X1 is Gaussian (which has nothing to do with X0 and
	X1 having identical marginal distributions) then ``asymptotically''
	turns into ``exactly''.

				cheers,

					Rolf Turner
######################################################################
Attention: 
This e-mail message is privileged and confidential. If you are not the 
intended recipient please delete the message and notify the sender. 
Any views or opinions presented are solely those of the author.

This e-mail has been scanned and cleared by MailMarshal 
www.marshalsoftware.com
######################################################################
#
Another related issue is that uniform distributions don't look very uniform:

hist(runif(100))
hist(runif(1000))
hist(runif(10000))

Be sure to calibrate your eyes (and your bin width) before rejecting
the hypothesis that the distribution is uniform.

Hadley
#
On Dec 17, 2007 3:10 PM, hadley wickham <h.wickham at gmail.com> wrote:
Thanks for the example, Hadley.  To me, this suggests we should stop
teaching histograms in Stat 101 and instead use quantile plots, which
give excellent results for n=100 and even surprisingly good results
for n=10:

par(mfrow=c(2,2))
for(i in c(10, 100, 1000, 10000)) {
  qqplot(runif(i), qunif(seq(1/i, 1, length=i)), main=i,
         xlim=c(0,1), ylim=c(0,1),
         xlab="runif", ylab="Uniform distribution quantiles")
  abline(0,1,col="lightgray")
}

Kevin (drifting even further off topic)
#
It all depends on what you're trying to do - I don't think histograms
are particularly good as density estimators, but that's not what
you're using them for most of the time! You're using them as an
exploratory tool to try and understand what's going on in your data -
often you'll need to use very small bin widths which help find
unexpected gaps and patterns in your data.   It's helpful to have some
feel for what common distributions look like.

Hadley