If the observations are normally distributed and the 2xk design is
balanced, theory requires that the tests for interaction and row effects be
independent. In my program, appended below, this would translate to cntT
(approx)= cntR*cntI/N if all R routines were functioning correctly. They
aren't.
sim2 = function(size, N, p) {
  cntR = 0
  cntC = 0
  cntI = 0
  cntT = 0
  cntP = 0
  for (i in 1:N) {
    # generate data
    v = gendata(size)
    # analyze after build(ing) design containing data
    lm.out = lm(yield ~ c*r, build(size, v))
    av.out = anova(lm.out)
    # if column effect is significant, increment cntC
    if (av.out[[5]][1] <= p) cntC = cntC + 1
    # if row effect is significant, increment cntR
    if (av.out[[5]][2] <= p) {
      cntR = cntR + 1
      tmp = 1
    }
    else tmp = 0
    if (av.out[[5]][3] <= p) {
      # if interaction is significant, increment cntI
      cntI = cntI + 1
      # if both interaction and row effect are significant, increment cntT
      cntT = cntT + tmp
    }
  }
  list(cntC = cntC, cntR = cntR, cntI = cntI, cntT = cntT)
}
build = function(size, v) {
  # size is a vector containing the sample sizes
  col = c(rep(0, size[1]), rep(1, size[2]), rep(2, size[3]), rep(3, size[4]),
          rep(0, size[5]), rep(1, size[6]), rep(2, size[7]), rep(3, size[8]))
  row = c(rep(0, size[1] + size[2] + size[3] + size[4]),
          rep(1, size[5] + size[6] + size[7] + size[8]))
  return(data.frame(c = factor(col), r = factor(row), yield = v))
}
gendata = function(size) {
  ssize = sum(size)
  return(rnorm(ssize))
}
#Example
size=c(3,3,3,0,3,3,3,0)
sim2(size,10000,10,.16)
Phillip Good
Huntington Beach CA
Lack of independence in anova()
27 messages · Phillip Good, Douglas Bates, Thomas Lumley +6 more
I have already had email exchanges off-list with Phillip Good pointing out that the independence property that he wishes to establish by simulation is a consequence of orthogonality of the column span of the row contrasts and the interaction contrasts. If you set the contrasts option to a set of orthogonal contrasts such as contr.helmert or contr.sum, which has no effect on the results of the anova, this is easily established.
build
function(size, v = rnorm(sum(size))) {
  col = c(rep(0, size[1]), rep(1, size[2]), rep(2, size[3]), rep(3, size[4]),
          rep(0, size[5]), rep(1, size[6]), rep(2, size[7]), rep(3, size[8]))
  row = c(rep(0, size[1] + size[2] + size[3] + size[4]),
          rep(1, size[5] + size[6] + size[7] + size[8]))
  return(data.frame(c = factor(col), r = factor(row), yield = v))
}
size
[1] 3 3 3 0 3 3 3 0
set.seed(1234321)
vv <- build(size)
vv
   c r      yield
1  0 0  1.2369081
2  0 0  1.5616230
3  0 0  1.8396185
4  1 0  0.3173245
5  1 0  1.0715115
6  1 0 -1.1459955
7  2 0  0.2488894
8  2 0  0.1158625
9  2 0  2.6200816
10 0 1  1.2624048
11 0 1 -0.9862578
12 0 1 -0.3235653
13 1 1  0.2039706
14 1 1 -1.4574576
15 1 1  1.9158713
16 2 1 -2.0333909
17 2 1  1.0050272
18 2 1  0.6789184
options(contrasts = c('contr.helmert', 'contr.poly'))
crossprod(model.matrix(~c*r, vv))
            (Intercept) c1 c2 r1 c1:r1 c2:r1
(Intercept)          18  0  0  0     0     0
c1                    0 12  0  0     0     0
c2                    0  0 36  0     0     0
r1                    0  0  0 18     0     0
c1:r1                 0  0  0  0    12     0
c2:r1                 0  0  0  0     0    36
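A quick way to check the aside that the choice of contrasts "has no effect on the results of the anova" is to fit the same data under two contrast settings and compare the tables. This snippet is mine, not from the thread; the seed and design are illustrative.

```r
# Sketch (not from the thread): the anova() table is invariant to the
# contrasts option, even though the individual coefficients change.
set.seed(42)
vv <- data.frame(c = gl(3, 3, 18), r = gl(2, 9, 18), yield = rnorm(18))

options(contrasts = c("contr.treatment", "contr.poly"))
a1 <- anova(lm(yield ~ c * r, vv))

options(contrasts = c("contr.helmert", "contr.poly"))
a2 <- anova(lm(yield ~ c * r, vv))

# sums of squares (and hence F statistics and p-values) agree
all.equal(a1$"Sum Sq", a2$"Sum Sq")
```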
On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
A couple more comments on this simulation. Notice that the sim2 function is defined with arguments size, N and p but is being called with four arguments. It appears as if the value of p will be 10 in that call. If you decide to do such a simulation yourself you can save a lot of time by just building the model matrix once and using lm.fit in subsequent calls. Also, there is no need to do the counting in the body of the sim2 function. Just save the 3 p-values from each replication. The test of independence is equivalent to showing that the distribution of the p-values is uniform over the unit cube.
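A minimal sketch of the speed-up being described (my code, not Bates's; the 3x2 design and 1000 replicates are illustrative): build the model matrix once, refit each simulated response with lm.fit, and save all three p-values per replicate.

```r
# Sketch: one model matrix, a cheap lm.fit() per replicate, and the three
# p-values stored rather than counted against a single threshold.
d <- data.frame(c = gl(3, 3, 18), r = gl(2, 9, 18))
X <- model.matrix(~ c * r, d)

one.rep <- function() {
  fit <- lm.fit(X, rnorm(18))
  # term sums of squares: squared effects grouped by the assign vector
  ss  <- tapply(fit$effects[seq_len(fit$rank)]^2, fit$assign, sum)
  rss <- sum(fit$residuals^2)
  df  <- c(2, 1, 2)                      # df for c, r, and c:r
  Fst <- (ss[c("1", "2", "3")] / df) / (rss / fit$df.residual)
  pf(Fst, df, fit$df.residual, lower.tail = FALSE)
}

pvals <- t(replicate(1000, one.rep()))
colnames(pvals) <- c("pc", "pr", "pint")
# under the null, each column should look uniform on (0, 1)
```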
To my surprise, the R functions I employed did NOT verify the independence property. I've no quarrel with the theory--it's the R functions that worry me.
pG
My bad. The example of a call should read, sim2(size,10000,.16). Originally, the intent of the program was to compare ANOV (the gold standard) with synchronized permutation tests when data is from contaminated normal distributions or the design is unbalanced. The tests for main effects and interactions are always independent with synchronized permutations and ought to be for ANOV with normal data and balanced designs.
I wrote up a note on how to do a more efficient simulation of the p-values from anova then discovered to my surprise that the chi-square test for independence of the significance of the F-tests indicated that they were not independent. I was stumped by this but fortunately Thomas Lumley came to my rescue with an explanation.

There is no reason why the results of the F tests should be independent. The numerators are independent but the denominator is the same for both tests. When, due to random variation, the denominator is small, then the p-values for both tests will tend to be small. If, instead of F-tests, you use chi-square tests then you do see independence.

Here is the note on the simulation. There are several things that could be done to speed the simulation of the p-values of an anova like this under the null distribution. If you examine the structure of a fitted lm object (use the function str()) you will see that there are components called `qr', `effects' and `assign'. You can verify by experimentation that `qr' and `assign' depend only on the experimental design. Furthermore, the `effects' vector can be reproduced as qr.qty(qrstr, y). The sums of squares for the different terms in the model are the sums of squares of elements of the effects vector as indexed by the assign vector. The residual sum of squares is the sum of squares of the remaining elements of the assign vector. You can generate 10000 replications of the calculations of the relevant sums of squares as
set.seed(1234321)
vv <- data.frame(c = gl(3,3,18), r = gl(2,9,18))
vv
   c r
1  1 1
2  1 1
3  1 1
4  2 1
5  2 1
6  2 1
7  3 1
8  3 1
9  3 1
10 1 2
11 1 2
12 1 2
13 2 2
14 2 2
15 2 2
16 3 2
17 3 2
18 3 2
fm1 <- lm(rnorm(18) ~ c*r, vv)
fm1$assign
[1] 0 1 1 2 3 3
asgn <- c(fm1$assign, rep(4, 12))
system.time(res <- replicate(10000,
    tapply(qr.qty(fm1$qr, rnorm(18))^2, asgn, sum)))
[1] 20.61 0.01 20.61 0.00 0.00
res[,1:6]
       [,1]      [,2]     [,3]      [,4]       [,5]        [,6]
0 0.4783121 0.3048634 0.713689 0.6937838 0.03649023  2.63392426
1 0.5825213 1.4756395 1.127018 0.5209751 1.18697199  3.32972093
2 0.2612723 3.6396106 0.547506 1.1641910 0.37843963  0.03411672
3 2.6259806 3.5504584 1.645215 0.1197238 0.85361018  4.53895212
4 9.1942755 8.2122693 4.863392 5.4413571 2.03715439 22.94815118
The rows of that array correspond to the sum of squares for the Intercept (which we generally ignore), the factor 'c', the factor 'r', their interaction and the residuals. As you can see that took about 21 seconds on my system, which I expect is a bit faster than your simulation ran. Because I set the seed to a known value I can reproduce the results for the first few simulations to check that the sums of squares are correct. Remember that the original fit (fm1) is not included in the table.
set.seed(1234321)
fm1 <- lm(rnorm(18) ~ c*r, vv)
anova(fm2 <- lm(rnorm(18) ~ c*r, vv))
Analysis of Variance Table
Response: rnorm(18)
Df Sum Sq Mean Sq F value Pr(>F)
c 2 0.5825 0.2913 0.3801 0.6917
r 1 0.2613 0.2613 0.3410 0.5701
c:r 2 2.6260 1.3130 1.7137 0.2215
Residuals 12 9.1943 0.7662
You can continue this process if you wish further verification that
the results correspond to the fitted models.
You can get the p-values for the F-tests as
pvalsF <- data.frame(pc = pf((res[2,]/2)/(res[5,]/12), 2, 12, low = FALSE),
                     pr = pf((res[3,]/1)/(res[5,]/12), 1, 12, low = FALSE),
                     pint = pf((res[4,]/2)/(res[5,]/12), 2, 12, low = FALSE))
Again you can check this for the first few by hand.
pvalsF[1:5,]
          pc         pr      pint
1 0.69171238 0.57006574 0.2214847
2 0.37102129 0.03975286 0.1158059
3 0.28634939 0.26771167 0.1740633
4 0.57775850 0.13506561 0.8775828
5 0.06363138 0.16124100 0.1224806
If you wish you could then check marginal distributions using techniques like an empirical density plot.
library(lattice)
densityplot(~ pc, pvalsF)
At this point I would recommend checking the joint distribution but if you want to choose a specific level and check the contingency table that could be done as
xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF)
I(pint < 0.16)
I(pr < 0.16) FALSE TRUE
FALSE 7204 1240
TRUE 1215 341
The summary method for an xtabs object provides a test of independence
summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF))
Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsF)
Number of cases in table: 10000
Number of factors: 2
Test for independence of all factors:
        Chisq = 51.6, df = 1, p-value = 6.798e-13
for which you can see the puzzling result. However, if you use the chi-squared test based on the known residual variance of 1, you get
pvalsC <- data.frame(pc = pchisq(res[2,], 2, low = FALSE),
                     pr = pchisq(res[3,], 1, low = FALSE),
                     pint = pchisq(res[4,], 2, low = FALSE))
pvalsC[1:5,]
         pc         pr      pint
1 0.7473209 0.60924741 0.2690144
2 0.4781553 0.05642013 0.1694446
3 0.5692081 0.45933855 0.4392846
4 0.7706757 0.28059805 0.9418946
5 0.5523983 0.53843951 0.6525907
xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC)
I(pint < 0.16)
I(pr < 0.16) FALSE TRUE
FALSE 7121 1319
TRUE 1324 236
summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC))
Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsC)
Number of cases in table: 10000
Number of factors: 2
Test for independence of all factors:
        Chisq = 0.25041, df = 1, p-value = 0.6168
On 7/4/05, Douglas Bates <dmbates at gmail.com> wrote:
...
The sums of squares for the different terms in the model are the sums of squares of elements of the effects vector as indexed by the assign vector. The residual sum of squares is the sum of squares of the remaining elements of the assign vector.
I meant to write "effects vector" not "assign vector" in that last sentence.
1 day later
Do you or Lumley have a citation for this conclusion? Most people go forward with the ANOV on the basis that the various tests are independent.
Phillip Good
P.S. Tests based on the method of synchronized permutations are independent.
On Wed, 6 Jul 2005, Phillip Good wrote:
Do you or Lumley have a citation for this conclusion? Most people go forward with the ANOV on the basis that the various tests are independent.
I don't have a citation. I would have thought that the claim that they were independent was more in need of a citation. I'd expect that books on the classical theory of linear models would show that the row, interaction, and residual mean squares are independent, and the lack of independence of the tests is then basic probability. If X, Y, and Z are independent and Z takes on more than one value then X/Z and Y/Z can't be independent.
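Lumley's probability argument is easy to confirm by simulation (this snippet is mine, not his; the degrees of freedom are illustrative):

```r
# Independent numerators over a shared random denominator are dependent:
# a randomly small Z inflates X/Z and Y/Z together.
set.seed(1)
X <- rchisq(1e5, df = 2)   # stands in for the row sum of squares
Y <- rchisq(1e5, df = 2)   # stands in for the interaction sum of squares
Z <- rchisq(1e5, df = 12)  # the shared residual sum of squares
cor(X/Z, Y/Z)              # clearly positive; independence would give ~0
```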
P.S. Tests based on the method of synchronized permutations are independent.
Quite possibly. If so, they presumably condition on the observed residual mean square. -thomas
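Lumley's explanation is easy to check directly with a small simulation (a sketch, not part of the original exchange; the degrees of freedom mirror the 3x2 design discussed below): two independent chi-squared numerators divided by one shared chi-squared denominator give positively dependent rejection indicators.

```r
# Independent numerator mean squares over a shared residual mean square:
# the resulting F ratios, and hence their rejections, are dependent.
set.seed(1)
n <- 100000
num1 <- rchisq(n, 2)/2    # mean square for a 2-d.f. effect
num2 <- rchisq(n, 1)/1    # mean square for a 1-d.f. effect
den  <- rchisq(n, 12)/12  # shared residual mean square, 12 d.f.
f1 <- num1/den
f2 <- num2/den
# rejections at the 16% level used in the thread:
r1 <- f1 > qf(0.84, 2, 12)
r2 <- f2 > qf(0.84, 1, 12)
cor(as.numeric(r1), as.numeric(r2))  # noticeably positive
```

A small residual mean square inflates both F ratios at once, which is exactly the mechanism described above.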
-----Original Message-----
From: Douglas Bates [mailto:dmbates at gmail.com]
Sent: Monday, July 04, 2005 1:44 PM
To: pigood at verizon.net
Cc: r-help
Subject: Re: [R] Lack of independence in anova()

I wrote up a note on how to do a more efficient simulation of the p-values from anova, then discovered to my surprise that the chi-square test for independence of the significance of the F-tests indicated that they were not independent. I was stumped by this, but fortunately Thomas Lumley came to my rescue with an explanation. There is no reason why the results of the F tests should be independent. The numerators are independent, but the denominator is the same for both tests. When, due to random variation, the denominator is small, the p-values for both tests will tend to be small. If, instead of F-tests, you use chi-square tests then you do see independence.

Here is the note on the simulation. There are several things that could be done to speed the simulation of the p-values of an anova like this under the null distribution. If you examine the structure of a fitted lm object (use the function str()) you will see that there are components called `qr', `effects' and `assign'. You can verify by experimentation that `qr' and `assign' depend only on the experimental design. Furthermore, the `effects' vector can be reproduced as qr.qty(qrstr, y). The sums of squares for the different terms in the model are the sums of squares of elements of the effects vector as indexed by the assign vector. The residual sum of squares is the sum of squares of the remaining elements of the effects vector. You can generate 10000 replications of the calculations of the relevant sums of squares as
set.seed(1234321)
vv <- data.frame(c = gl(3,3,18), r = gl(2,9,18))
vv
   c r
1  1 1
2  1 1
3  1 1
4  2 1
5  2 1
6  2 1
7  3 1
8  3 1
9  3 1
10 1 2
11 1 2
12 1 2
13 2 2
14 2 2
15 2 2
16 3 2
17 3 2
18 3 2
fm1 <- lm(rnorm(18) ~ c*r, vv)
fm1$assign
[1] 0 1 1 2 3 3
asgn <- c(fm1$assign, rep(4, 12))
system.time(res <- replicate(10000, tapply(qr.qty(fm1$qr, rnorm(18))^2,
+ asgn, sum)))
[1] 20.61 0.01 20.61 0.00 0.00
res[,1:6]
       [,1]      [,2]     [,3]      [,4]       [,5]        [,6]
0 0.4783121 0.3048634 0.713689 0.6937838 0.03649023  2.63392426
1 0.5825213 1.4756395 1.127018 0.5209751 1.18697199  3.32972093
2 0.2612723 3.6396106 0.547506 1.1641910 0.37843963  0.03411672
3 2.6259806 3.5504584 1.645215 0.1197238 0.85361018  4.53895212
4 9.1942755 8.2122693 4.863392 5.4413571 2.03715439 22.94815118

The rows of that array correspond to the sums of squares for the Intercept (which we generally ignore), the factor 'c', the factor 'r', their interaction, and the residuals. As you can see, that took about 21 seconds on my system, which I expect is a bit faster than your simulation ran. Because I set the seed to a known value I can reproduce the results for the first few simulations to check that the sums of squares are correct. Remember that the original fit (fm1) is not included in the table.
set.seed(1234321)
fm1 <- lm(rnorm(18) ~ c*r, vv)
anova(fm2 <- lm(rnorm(18) ~ c*r, vv))
Analysis of Variance Table
Response: rnorm(18)
Df Sum Sq Mean Sq F value Pr(>F)
c 2 0.5825 0.2913 0.3801 0.6917
r 1 0.2613 0.2613 0.3410 0.5701
c:r 2 2.6260 1.3130 1.7137 0.2215
Residuals 12 9.1943 0.7662
You can continue this process if you wish further verification that
the results correspond to the fitted models.
You can get the p-values for the F-tests as
pvalsF <- data.frame(pc = pf((res[2,]/2)/(res[5,]/12), 2, 12, low = FALSE),
+ pr = pf((res[3,]/1)/(res[5,]/12), 1, 12, low = FALSE),
+ pint = pf((res[4,]/2)/(res[5,]/12), 2, 12, low = FALSE))

Again you can check this for the first few by hand.
pvalsF[1:5,]
          pc         pr      pint
1 0.69171238 0.57006574 0.2214847
2 0.37102129 0.03975286 0.1158059
3 0.28634939 0.26771167 0.1740633
4 0.57775850 0.13506561 0.8775828
5 0.06363138 0.16124100 0.1224806

If you wish you could then check marginal distributions using techniques like an empirical density plot.
library(lattice)
densityplot(~ pc, pvalsF)
At this point I would recommend checking the joint distribution but if you want to choose a specific level and check the contingency table that could be done as
xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF)
I(pint < 0.16)
I(pr < 0.16) FALSE TRUE
FALSE 7204 1240
TRUE 1215 341
The summary method for an xtabs object provides a test of independence
summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF))
Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsF)
Number of cases in table: 10000
Number of factors: 2
Test for independence of all factors:
    Chisq = 51.6, df = 1, p-value = 6.798e-13

for which you can see the puzzling result. However, if you use the chi-squared test based on the known residual variance of 1, you get
pvalsC <- data.frame(pc = pchisq(res[2,], 2, low = FALSE),
+ pr = pchisq(res[3,], 1, low = FALSE),
+ pint = pchisq(res[4,], 2, low = FALSE))
pvalsC[1:5,]
         pc         pr      pint
1 0.7473209 0.60924741 0.2690144
2 0.4781553 0.05642013 0.1694446
3 0.5692081 0.45933855 0.4392846
4 0.7706757 0.28059805 0.9418946
5 0.5523983 0.53843951 0.6525907
xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC)
I(pint < 0.16)
I(pr < 0.16) FALSE TRUE
FALSE 7121 1319
TRUE 1324 236
summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC))
Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsC)
Number of cases in table: 10000
Number of factors: 2
Test for independence of all factors:
    Chisq = 0.25041, df = 1, p-value = 0.6168

On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
To my surprise, the R functions I employed did NOT verify the independence property. I've no quarrel with the theory--it's the R functions that worry me. pG

-----Original Message-----
From: Douglas Bates [mailto:dmbates at gmail.com]
Sent: Monday, July 04, 2005 9:13 AM
To: pigood at verizon.net; r-help
Subject: Re: [R] Lack of independence in anova()

I have already had email exchanges off-list with Phillip Good pointing out that the independence property that he wishes to establish by simulation is a consequence of orthogonality of the column span of the row contrasts and the interaction contrasts. If you set the contrasts option to a set of orthogonal contrasts such as contr.helmert or contr.sum, which has no effect on the results of the anova, this is easily established.
build
function(size, v = rnorm(sum(size))) {
col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
row=c(rep(0,size[1]+size[2]+size[3]+size[4]),rep(1,size[5]+size[6]
+size[7]+size[8]))
return(data.frame(c=factor(col), r=factor(row),yield=v))
}
size
[1] 3 3 3 0 3 3 3 0
set.seed(1234321)
vv <- build(size)
vv
   c r      yield
1  0 0  1.2369081
2  0 0  1.5616230
3  0 0  1.8396185
4  1 0  0.3173245
5  1 0  1.0715115
6  1 0 -1.1459955
7  2 0  0.2488894
8  2 0  0.1158625
9  2 0  2.6200816
10 0 1  1.2624048
11 0 1 -0.9862578
12 0 1 -0.3235653
13 1 1  0.2039706
14 1 1 -1.4574576
15 1 1  1.9158713
16 2 1 -2.0333909
17 2 1  1.0050272
18 2 1  0.6789184
options(contrasts = c('contr.helmert', 'contr.poly'))
crossprod(model.matrix(~c*r, vv))
            (Intercept) c1 c2 r1 c1:r1 c2:r1
(Intercept)          18  0  0  0     0     0
c1                    0 12  0  0     0     0
c2                    0  0 36  0     0     0
r1                    0  0  0 18     0     0
c1:r1                 0  0  0  0    12     0
c2:r1                 0  0  0  0     0    36

On 7/4/05, Phillip Good <pigood at verizon.net> wrote:
If the observations are normally distributed and the 2xk design is balanced, theory requires that the tests for interaction and row effects be independent. In my program, appended below, this would translate to cntT (approx) = cntR*cntI/N if all R routines were functioning correctly. They aren't.
(...)

Phillip Good
Huntington Beach CA
Thomas Lumley Assoc. Professor, Biostatistics tlumley at u.washington.edu University of Washington, Seattle
I'm confused: I understood Doug to be describing a traditional, normal theory ANOVA with k rows in the table for different effects plus a (k+1)st for residuals, where the mean squares (MS) column are all independent chi-squares scaled by the same unknown sigma^2. The k statistics in the F column are ratios of those independent chi-squares to the same denominator chi-square. How can they be independent? spencer graves p.s. How is the method of synchronized permutations relevant to a traditional, normal theory ANOVA?
Phillip Good wrote:
(...)
Spencer Graves, PhD Senior Development Engineer PDF Solutions, Inc. 333 West San Carlos Street Suite 700 San Jose, CA 95110, USA spencer.graves at pdf.com www.pdf.com <http://www.pdf.com> Tel: 408-938-4420 Fax: 408-280-7915
On 7/6/05, Phillip Good <pigood at verizon.net> wrote:
Do you or Lumley have a citation for this conclusion? Most people go forward with the ANOV on the basis that the various tests are independent. Phillip Good P.S. Tests based on the method of synchronized permutations are independent.
Perhaps we could review the sequence of events here. This exchange began with your sending me a message claiming that there is a bug in lm or anova in R because the results of your simulation were what you expected. I responded saying that it is unlikely that a serious bug in such a fundamental part of R would remain undetected for such a long time, and then went further, took apart your simulation, showed how it could be done much more effectively, and also showed, with help from Thomas Lumley and Peter Dalgaard, that your original assumption is incorrect.

You have now made another blanket statement, that "Most people go forward with the ANOV on the basis that the various tests are independent", and indicated that Thomas or I should provide a citation to validate our claim. Perhaps instead of claiming that it is necessary for us to produce evidence in support of our claim that they are not independent because they are based on the same denominator mean square, you could produce a citation to back up your claim. To date none of your claims of the independence of the tests or the supposed bugs in R have been substantiated. If indeed "most people" do as you claim then "most people" are suffering from a misconception, a not-uncommon situation in statistics.
On 7/6/05, Douglas Bates <dmbates at gmail.com> wrote:
...
Perhaps we could review the sequence of events here. This exchange began with your sending me a message claiming that there is a bug in lm or anova in R because the results of your simulation were what you expected.
I meant to write "were not what you expected". I've got to learn to read the email messages before posting them.
Spencer Graves asks: How is the method of synchronized permutations relevant to a traditional, normal theory ANOVA? One ought now to ask, as I am doing currently, whether traditional, normal theory ANOVA is applicable to the analysis of experimental designs. If, in fact, the results of the various ANOV tests are not independent, then: i) shouldn't we be teaching this in courses on the analysis of experimental designs? (Can anyone point to an FDA submission or applied statistics publication which concedes this result?) ii) shouldn't we be using tests whose results are independent?
On 7/6/2005 1:43 PM, Phillip Good wrote:
(...)
I think it's relatively infrequent that we make inferences where independence matters, so it makes sense to use tests that are marginally well-behaved. When do you care about the simultaneous behaviour of two tests?

The nice thing about ANOVA tests is that each one is valid regardless of the presence or absence of effects in other rows. This doesn't require independence. Where you might use independence is to try to construct a combined p-value looking for a violation of any of several hypotheses, but if that's the test you're interested in, you should just have pooled the terms in the numerator.

Duncan Murdoch
On 06-Jul-05 Phillip Good wrote:
Do you or Lumley have a citation for this conclusion? Most people go forward with the ANOV on the basis that the various tests are independent.
This hardly needs a citation -- Thomas Lumley's explanation (excerpted below from Doug Bates's mail) is justification enough, and it is succinct and elementary. However, a place one naturally looks for caveats of this kind is in Kendall & Stuart, and I duly found it (Vol 3, section 35.46 of my 1983 edition). It is essentially exactly the same explanation:

"However, the tests in the AV tables which we have considered are never independent tests, for although the various SS in a table may be independent of each other, all the tests we have derived use the Residual SS as denominator of the test statistic, and the various tests must therefore be statistically dependent, since, e.g., a Residual SS which is (by chance) large will depress all the values of the test statistics simultaneously."

(And K&S, thorough as they are with citations, do not cite any primary reference for this either!) However, if the "degrees of freedom" for the Residual SS is large, then the amount of random variation in the denominator will be small and it will be effectively constant. Then, of course, with independent numerators, the tests will be effectively independent (and equivalent to chi-squared) and also, therefore, the p-values. The fact that "most people go forward with the ANOV on the basis that the various tests are independent" possibly reflects the wide-spread act of faith that one has "a large sample", whatever the value of n may really be. One wonders how often people check the p-value for their F on (n1:n2) d.f. against the p-value for (n1:Inf) d.f.?
The 5% point decreases quite perceptibly as n2 increases up to about 20, and more slowly thereafter; but still the difference between F(n1:20) and F(n1:Inf) is substantial for any n1 (being about 0.5 for n1 up to about 10, increasing thereafter up to 0.84):

  n1 <- c(1+2*(0:50), 5000); cbind(n1, qf(0.95, n1, 20) - qf(0.95, n1, Inf))

F(Inf:Inf) = 1; F(Inf:20) = qf(0.95, Inf, 20) = 1.841. Conversely (e.g.):

  > 1-pf(qf(0.95,20,20),20,20)
  [1] 0.05
  > 1-pf(qf(0.95,20,20),20,Inf)
  [1] 0.002391189

Such differences are related to the degree of non-independence of several tests on the same data.
[Douglas Bates]: Thomas Lumley came to my rescue with an explanation. There is no reason why the results of the F tests should be independent. The numerators are independent but the denominator is the same for both tests. When, due to random variation, the denominator is small, then the p-values for both tests will tend to be small. If, instead of F-tests you use chi-square tests then you do see independence.
But surely this amounts to assuming n2 = Inf? If that's an adequate approximation, then fine; but if not (see e.g. above) then not! Best wishes to all, Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 06-Jul-05 Time: 19:29:18 ------------------------------ XFMail ------------------------------
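Ted's large-n2 point is also easy to see numerically (a sketch, not from the original thread; `dep_at` is an ad hoc helper, not a standard R function): as the residual degrees of freedom grow, the shared denominator becomes nearly constant and the dependence between the two tests' rejection indicators fades away.

```r
# Correlation between the two tests' rejection indicators as a function
# of the residual (denominator) degrees of freedom df2.
set.seed(3)
dep_at <- function(df2, n = 50000) {
  den <- rchisq(n, df2)/df2                      # shared residual mean square
  r1 <- (rchisq(n, 2)/2)/den > qf(0.84, 2, df2)  # 2-d.f. effect rejects
  r2 <- (rchisq(n, 1)/1)/den > qf(0.84, 1, df2)  # 1-d.f. effect rejects
  cor(as.numeric(r1), as.numeric(r2))
}
sapply(c(6, 12, 60, 600), dep_at)  # shrinks toward 0 as df2 grows
```

At df2 = Inf the denominator is exactly 1 and the F-tests reduce to the independent chi-squared tests from Bates's note.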
On 6 Jul 2005 at 12:30, Douglas Bates wrote:
On 7/6/05, Douglas Bates <dmbates at gmail.com> wrote: ...
Perhaps we could review the sequence of events here. This exchange began with your sending me a message claiming that there is a bug in lm or anova in R because the results of your simulation were what you expected.
At the risk of further roiling the waters ... As several have already pointed out, the "usual" F-tests in a balanced ANOVA have independent numerators but a common denominator, and hence the F-statistics cannot be independent. Is this not the basis of Kimball's Inequality, which states that the effect of the common denominator is that the simultaneous error rate cannot exceed what it would be if the tests *were* in fact independent? In other words, you should get a simultaneous error rate for the F-tests that is lower than that under independence of test statistics. Are you? ---JRG John R. Gleason
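JRG's question can be answered by direct simulation in the thread's own setup (a sketch; the 3x2 design and the 0.16 level come from the earlier messages): under the null, the observed familywise rejection rate comes out at or below the rate independence would predict, consistent with Kimball's inequality.

```r
# Familywise error rate of the row and interaction F-tests under H0,
# compared with the rate expected if the tests were independent.
set.seed(42)
vv <- data.frame(c = gl(3, 3, 18), r = gl(2, 9, 18))
alpha <- 0.16
hits <- replicate(2000, {
  p <- anova(lm(rnorm(18) ~ c * r, vv))[["Pr(>F)"]]
  c(row = p[2] <= alpha, int = p[3] <= alpha)
})
fw <- mean(hits["row", ] | hits["int", ])
fw                   # observed familywise rate
1 - (1 - alpha)^2    # rate under independence
```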
As Duncan Murdoch indicated, the primary issue is not that the different tests are (not) statistically independent but that they are sensitive to different alternative hypotheses. spencer graves
JRG wrote:
(...)
On Wed, Jul 06, 2005 at 10:06:45AM -0700, Thomas Lumley wrote:
(...)
If X, Y, and Z are independent and Z takes on more than one value then X/Z and Y/Z can't be independent.
Not really true. I can produce a counterexample on request (admittedly quite trivial though). Göran Broström
On 06-Jul-05 Göran Broström wrote:
On Wed, Jul 06, 2005 at 10:06:45AM -0700, Thomas Lumley wrote: (...)
If X, Y, and Z are independent and Z takes on more than one value then X/Z and Y/Z can't be independent.
Not really true. I can produce a counterexample on request (admittedly quite trivial though). Göran Broström
But true if both X and Y have positive probability of being non-zero, n'est-ce pas? Tut, tut, Göran! Ted.
Hi, Göran: I'll bite:
(a) I'd like to see your counterexample.
(b) I'd like to know what is wrong with the following, apparently defective, proof that they can't be independent: first consider indicator functions of independent events A, B, and C.
P{(AC)&(BC)} = P{ABC} = PA*PB*PC.
But P(AC)*P(BC) = PA*PB*(PC)^2. Thus, AC and BC can be independent
only if PC = 0 or 1, i.e., the indicator of C is constant almost surely.
Is there a flaw in this? If not, is there some reason this case cannot be extended to the product of arbitrary random variables X, Y, and W = 1/Z?
Thanks,
spencer graves
Göran Broström wrote:
(...)
My thanks to Douglas Bates and others for pointing out anova()'s limitations. Now the question is how to put my newly won knowledge into practice. Before my eyes were opened, if the row effect and interaction were both statistically significant, I would report the row effects separately for each column. Now if the row effect and interaction are both statistically significant, I'll realize this may simply be a matter of the test statistics being interdependent (though I might still report the column effects separately if, say, one column was different--in a practical sense--from the rest). If neither row nor column nor interaction effects are significant, I'll attribute the result to too small a sample. (As far as higher-order interactions go, I never really had a clue as to what they were about anyway.) Phillip Good
(Ted Harding) wrote:
On 06-Jul-05 Göran Broström wrote:
On Wed, Jul 06, 2005 at 10:06:45AM -0700, Thomas Lumley wrote: (...)
If X, Y, and Z are independent and Z takes on more than one value then X/Z and Y/Z can't be independent.
Not really true. I can produce a counterexample on request (admittedly quite trivial though). Göran Broström
But true if both X and Y have positive probability of being non-zero, n'est-ce pas? Tut, tut, Göran!
If X and Y are independent with symmetric distributions about zero, and Z is supported on +/-A for some non-zero constant A, then X/Z and Y/Z are still independent. There are probably other special cases too. Duncan Murdoch
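Duncan's special case can be verified by exhaustive enumeration over finite supports. The sketch below uses Python with exact `Fraction` arithmetic (the particular supports and probabilities are invented for illustration) and checks independence of X/Z and Y/Z directly from the joint pmf:

```python
from fractions import Fraction
from itertools import product

def ratios_independent(X, Y, Z):
    """Given finite pmfs (dicts value -> probability) of independent
    X, Y, Z, check exactly whether X/Z and Y/Z are independent."""
    joint = {}
    for (xv, px), (yv, py), (zv, pz) in product(X.items(), Y.items(), Z.items()):
        u, v = Fraction(xv, zv), Fraction(yv, zv)
        joint[(u, v)] = joint.get((u, v), 0) + px * py * pz
    # Marginals of U = X/Z and V = Y/Z, summed from the joint pmf.
    mu, mv = {}, {}
    for (u, v), p in joint.items():
        mu[u] = mu.get(u, 0) + p
        mv[v] = mv.get(v, 0) + p
    return all(joint.get((u, v), 0) == pu * pv
               for u, pu in mu.items() for v, pv in mv.items())

# X, Y symmetric about zero; Z supported on +/-1 only (it need not be symmetric).
X = {v: Fraction(1, 4) for v in (-2, -1, 1, 2)}
Z = {1: Fraction(1, 3), -1: Fraction(2, 3)}
print(ratios_independent(X, dict(X), Z))    # True: Duncan's special case

# Break the symmetry of X and independence fails.
X2 = {1: Fraction(1, 2), 2: Fraction(1, 2)}
print(ratios_independent(X2, dict(X2), Z))  # False
```

The symmetric case works because conditioning on Z = -1 merely flips the signs of both ratios, which leaves the joint distribution of (X/Z, Y/Z) unchanged.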
Spencer Graves wrote:
Hi, Göran: I'll bite:
(a) I'd like to see your counterexample.
(b) I'd like to know what is wrong with the following, apparently
defective, proof that they can't be independent: First consider
indicator functions of independent events A, B, and C.
P{(AC)&(BC)} = P{ABC} = PA*PB*PC.
But P(AC)*P(BC) = PA*PB*(PC)^2. Thus, AC and BC can be independent
only if PC = 0 or 1, i.e., the indicator of C is constant almost surely.
Is there a flaw in this?
I don't see one.
If not, is there some reason this case cannot be extended to the product of arbitrary random variables X, Y, and W=1/Z?
Because you can't? The situations are different? If C is a non-trivial event independent of A, then AC is strictly a subset of A. However, as the example I just posted shows (with constant 1), you can have a non-trivial random variable W where XW has exactly the same distribution as X. Duncan Murdoch
Hi, Duncan & Göran:
Consider the following: X, Y, Z have symmetric distributions with
the following restrictions:
P(X=1)=P(X=-1)=x with P(|X|<1)=0 so P(|X|>1)=1-2x.
P(Y=1)=P(Y=-1)=y with P(|Y|<1)=0 so P(|Y|>1)=1-2y.
P(Z=1)=P(Z=-1)=z with P(|Z|>1)=0 so P(|Z|<1)=1-2z.
Then
P(X/Z=1)=2xz, P(Y/Z=1)=2yz, and
P{(X/Z=1)&(Y/Z=1)}=2xyz.
Independence requires that this last probability be 4xyz^2. This is
true only if z=0.5. If z<0.5, then X/Z and Y/Z are clearly dependent.
How's this?
spencer graves
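Spencer's arithmetic can be confirmed by brute-force enumeration. A sketch with one concrete choice of distributions obeying his restrictions (the supports {+/-1, +/-3} for X, Y and {+/-1, +/-1/2} for Z are invented for illustration; any supports satisfying the stated constraints would do), taking x = y = z = 1/4:

```python
from fractions import Fraction

# P(X = +/-1) = x with no other mass inside [-1, 1]; likewise for Y.
# P(Z = +/-1) = z with the remaining mass strictly inside (-1, 1).
x = y = z = Fraction(1, 4)
X = {Fraction(1): x, Fraction(-1): x,
     Fraction(3): (1 - 2*x)/2, Fraction(-3): (1 - 2*x)/2}
Y = dict(X)
Z = {Fraction(1): z, Fraction(-1): z,
     Fraction(1, 2): (1 - 2*z)/2, Fraction(-1, 2): (1 - 2*z)/2}

# X/Z = 1 exactly when X = Z, which with these supports forces X = Z = +/-1.
p_u1 = sum(px * pz for xv, px in X.items() for zv, pz in Z.items() if xv == zv)
p_v1 = sum(py * pz for yv, py in Y.items() for zv, pz in Z.items() if yv == zv)
p_both = sum(px * py * pz
             for xv, px in X.items()
             for yv, py in Y.items()
             for zv, pz in Z.items()
             if xv == zv and yv == zv)

assert p_u1 == 2*x*z and p_v1 == 2*y*z  # P(X/Z=1) = 2xz, P(Y/Z=1) = 2yz
assert p_both == 2*x*y*z                # P(X/Z=1 and Y/Z=1) = 2xyz
print(p_both, p_u1 * p_v1)              # 1/32 vs 1/64: dependent, since z != 1/2
```

With z = 1/4 the joint probability 2xyz = 1/32 differs from the product 4xyz^2 = 1/64, exhibiting the dependence Spencer describes.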
Duncan Murdoch wrote:
(Ted Harding) wrote:
On 06-Jul-05 Göran Broström wrote:
On Wed, Jul 06, 2005 at 10:06:45AM -0700, Thomas Lumley wrote: (...)
If X, Y, and Z are independent and Z takes on more than one value then X/Z and Y/Z can't be independent.
Not really true. I can produce a counterexample on request (admittedly quite trivial though). Göran Broström
But true if both X and Y have positive probability of being non-zero, n'est-ce pas? Tut, tut, Göran!
If X and Y are independent with symmetric distributions about zero, and Z is supported on +/-A for some non-zero constant A, then X/Z and Y/Z are still independent. There are probably other special cases too. Duncan Murdoch
Spencer Graves wrote:
Hi, Göran: I'll bite:
(a) I'd like to see your counterexample.
(b) I'd like to know what is wrong with the following, apparently
defective, proof that they can't be independent: First consider
indicator functions of independent events A, B, and C.
P{(AC)&(BC)} = P{ABC} = PA*PB*PC.
But P(AC)*P(BC) = PA*PB*(PC)^2. Thus, AC and BC can be independent
only if PC = 0 or 1, i.e., the indicator of C is constant almost surely.
Is there a flaw in this?
As far as I can see, this is correct.
If not, is there some reason this case cannot be extended to the product of arbitrary random variables X, Y, and W=1/Z?
Yes. Random variables are independent if all events which can be defined in terms of them are independent. If Z is non-constant, there must be some event defined by Z with probability strictly between 0 and 1, and the above argument cannot be used. Kjetil
Thanks, spencer graves
Göran Broström wrote:
On Wed, Jul 06, 2005 at 10:06:45AM -0700, Thomas Lumley wrote: (...)
If X, Y, and Z are
independent and Z takes on more than one value then X/Z and Y/Z can't be
independent.
Not really true. I can produce a counterexample on request (admittedly quite trivial though). Göran Broström
--
Kjetil Halvorsen.
Peace is the most effective weapon of mass construction.
-- Mahdi Elmandjra