Replicating type III anova tests for glmer/GLMM

14 messages · Francesco Romano, Phillip Alday, Emmanuel Curis +1 more

#
Yes. An ANOVA with my final bglmer model yields:
Analysis of Variance Table

                   Df Sum Sq Mean Sq F value
syntax12            1 1.7670  1.7670  1.7670
animacy12           1 3.4036  3.4036  3.4036
group123            2 5.7213  2.8607  2.8607
animacy12:group123  2 4.5546  2.2773  2.2773
syntax12:group123   2 8.1732  4.0866  4.0866

which is counterintuitively not what the authors of the papers
apparently used to generate coefficients to report their main effects
and interactions. It looks to me more like ML fitting. Elsewhere,
and more typically, main effects and interactions are obtained by
comparing a model with the main fixed effect to a model without the
main fixed effect in terms of log-likelihood ratio tests
(Raffray et al., 2013, http://dx.doi.org/10.1016/j.jml.2013.09.004, p.6).


I understand obtaining p-values from a summary
of linear mixed models fit by lmer is a contentious issue

https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html

but I guess I might be missing something here.
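As a concrete sketch of the likelihood-ratio comparison described above, here it is with a plain binomial glm so the snippet is self-contained (all data and variable names are invented for illustration); with a glmer model the same anova(reduced, full) call applies, provided both models are fit by ML:

```r
## Hedged sketch: likelihood-ratio test by comparing nested models.
## Simulated data; x1 has a real effect, x2 does not.
set.seed(42)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- rbinom(100, 1, plogis(0.5 * d$x1))

reduced <- glm(y ~ x1,      family = binomial, data = d)
full    <- glm(y ~ x1 + x2, family = binomial, data = d)

## LRT: the deviance difference is referred to a chi-square on 1 df
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)
```

The reported Deviance for the comparison is exactly deviance(reduced) - deviance(full), which is the log-likelihood-ratio statistic.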






On Tue, Feb 23, 2016 at 2:21 AM, Phillip Alday <Phillip.Alday at unisa.edu.au> wrote:
#
lme4::anova() is not the same thing as car::Anova()!

A quick R note that might have avoided the confusion:
The :: syntax in R refers to scope, so you can specify a function
unambiguously via package::function.name(). Moreover, R is case
sensitive, so Anova() and anova() are generally different things.

Henrik's message (posted to the list so if you don't subscribe, you need
to look here:
https://mailman.stat.ethz.ch/pipermail/r-sig-mixed-models/2016q1/024465.html
) describes how to do this with either his afex package (for
likelihood-ratio tests) or John Fox's car package (for analysis of
deviance / Wald tests).

If you just want to perform likelihood-ratio tests in lme4, then you
should look at the drop1() function or you can use anova(reduced.model,
full.model). Henrik also does a nice job summarizing some of the issues
here, so I won't repeat them.
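A minimal sketch of drop1() with likelihood-ratio tests, shown with a plain glm for self-containedness (data and names invented; the call is the same for a merMod fit):

```r
## Hedged sketch: single-term deletions with LRTs via drop1().
## Invented data: 'a' has a real effect, 'b' does not.
set.seed(1)
d <- data.frame(a = rnorm(50), b = rnorm(50))
d$y <- rbinom(50, 1, plogis(d$a))

m <- glm(y ~ a + b, family = binomial, data = d)

## Drop each term in turn; each row is a reduced-vs-full LRT
dr <- drop1(m, test = "Chisq")
print(dr)
```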

One final note: not everything that holds for normal LMM holds for GLMM
-- GLMM tends to be much more complicated. :-(

Best,
Phillip
On 23/02/16 20:03, Francesco Romano wrote:
#
Thanks to Henrik and Phillip for the quick reply.
Your suggestions have been helpful in making progress.

On the one hand Henrik is right about
reporting coefficients and standard errors when
there are only two levels for each predictor. This is
consistent with two of the sources I mentioned so far.
I infer that the authors reported directly from the summary(m1)
after use of the mixed function (not car::Anova which yields chi
square tests).

On the other hand, I don't understand how Cai et al. (2012) p.842,
"combined analysis experiments 1 and 2", reported the main effect
of a factor with 4 levels via a single estimate, SE, z, p coefficient.
How did they obtain this and is this the right way?

Finally, after running analysis both ways, I get slightly different
p-values, with the car::Anova method being more conservative
(it yields less significant predictors). Is this normal?

Frank



On Tue, Feb 23, 2016 at 10:51 AM, Phillip Alday <Phillip.Alday at unisa.edu.au> wrote:
#
In my experience, car::Anova is slightly less conservative (as Wald
tests are known to be somewhat anti-conservative).

Are you using Type-III tests for everything? The differences between
Type-II and Type-III can actually make a big difference in terms of
which predictors are significant.

Speaking of Type-III -- although it's the default in some popular
commercial packages, Type-II (marginal tests) is actually the type that
makes the most sense in terms of statistical interpretation and
hypotheses tested. But that's a topic for another time ....
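A quick base-R illustration of how much these choices can matter (data and factor names invented): on an unbalanced design, the "drop1 trick" for type-III-style marginal tests gives a different sum of squares for the same main effect depending on the contrast coding, which is why contr.sum() is usually recommended when type-III tests are wanted:

```r
## Hedged sketch: type-III-style tests via drop1(fit, . ~ ., test = "F")
## on an invented, unbalanced 2x2 design.
set.seed(1)
d <- data.frame(
  A = factor(rep(c("a1", "a1", "a2", "a2"), c(5, 3, 2, 6))),
  B = factor(rep(c("b1", "b2", "b1", "b2"), c(5, 3, 2, 6)))
)
d$y <- rnorm(16)

f_trt <- lm(y ~ A * B, data = d)  # default contr.treatment
f_sum <- lm(y ~ A * B, data = d,
            contrasts = list(A = "contr.sum", B = "contr.sum"))

## marginal ("type-III"-style) SS for A under each coding
ss_trt <- drop1(f_trt, . ~ ., test = "F")["A", "Sum of Sq"]
ss_sum <- drop1(f_sum, . ~ ., test = "F")["A", "Sum of Sq"]
c(treatment = ss_trt, sum = ss_sum)  # differ on unbalanced data
```

Under treatment coding the "main effect" of A is tested at the baseline level of B; under sum coding it is tested averaging over B, hence the different answers.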

Best,
Phillip
On 23/02/16 22:41, Francesco Romano wrote:
#
On Tue, Feb 23, 2016 at 01:06:18PM +0100, Francesco Romano wrote:
> On the other hand, I don't understand how Cai et al. (2012) p.842,
> "combined analysis experiments 1 and 2", reported the main effect
> of a factor with 4 levels via a single estimate, SE, z, p coefficient.
> How did they obtain this and is this the right way?

It's just a guess, but any sum of squares can be seen as a particular
contrast, that is, a particular combination of the coefficients in the
model (or of the different means, expressed another way) that is
tested against 0. So I guess this single estimate is the value of the
contrast associated with the corresponding sum of squares, and SE/z/p
are derived similarly.

You can play with multcomp::glht to test this, but knowing which
contrast is tested by which sum of squares in a specific design may be
tricky: it depends on the coding, on the (un)balance...

Knowing if this is the "right" way is, I think, the same debate as
knowing which kind of sum of squares should be used, and the answer is
very application-dependent. Just, if you don't know what this single
estimate really estimates, interpretation is at best difficult...
#
Dear Emmanuel,

With proper contrast coding (i.e., a coding that's orthogonal in the *basis* of the design, such as provided by contr.sum()), a "type-III" test is just a test that the corresponding parameters are 0. The models in question are generalized linear (mixed) models, so sums of squares aren't really involved, but one could do the corresponding Wald (like car::Anova) or LR test. The Wald test is what you'd get with multcomp::glht or car::linearHypothesis. BTW, I don't think that it would be hard to extend car::Anova to provide LR tests in this case.

Best,
 John
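The coding distinction John draws can be seen directly in base R: contr.sum() generates columns that sum to zero (traditional contrasts), while the default contr.treatment() generates 0/1 dummy regressors that do not:

```r
## Effect coding: each column sums to zero (traditional contrasts)
contr.sum(4)

## Default dummy coding: 0/1 indicators, columns sum to 1
contr.treatment(4)
```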
#
Dear Prof. Fox,

Thanks for the clarification. But to summarize this test of, let's say,
3 parameters being 0 for a 4-level factor by a single value with its SE,
as mentioned in Francesco's mail, the linear combination of these
parameters that is actually tested by this sum of squares is needed,
isn't it?

I mean, if really the parameters are all 0, whatever linear
combination could do the job, but a type III sum of squares just tests
one of all possible linear combinations, right?

By the way, I was always very annoyed by the fact that Type III sums of
squares are so dependent on coding, but that's another debate...

Best regards,
#
Dear Emmanuel,

First, the relevant linear hypothesis is for several coefficients simultaneously -- for example, all 3 coefficients for the contrasts representing a 4-level factor -- not for a single contrast. Although it's true that any linear combination of parameters that are 0 is 0, the converse isn't true. Second, for a GLMM, we really should be talking about type-III tests not type-III sums of squares.

Type-III tests are dependent on coding in the full-rank parametrization of linear (and similar) models used in R, to make the tests correspond to reasonable hypotheses. The invariance of type-II tests with respect to coding is attractive, but shouldn't distract from the fundamental issues, which are the hypotheses that are tested and the power of the tests. 

Best,
 John
#
John,

I tried the Anova() function in the car package implemented with
contr.sum() but it doesn't produce beta, SE, z, and p.
To be more precise, R requires that either the F or Chi sq statistic be
used. The model I used was termed "mod", here is the error:
+     test.statistic=c("LR"))
Error in match.arg(test.statistic) : 'arg' should be one of “Chisq”, “F”

Chi square produces the following output:
+     test.statistic=c("Chisq"))
Analysis of Deviance Table (Type III Wald chisquare tests)

Response: Correct
                              Chisq Df Pr(>Chisq)
(Intercept)                 67.7409  1  < 2.2e-16 ***
Syntax                       0.2856  1   0.593083
Animacy                      6.2575  1   0.012367 *
Prof.group.2                 2.9888  2   0.224379
Syntax:Animacy               0.0970  1   0.755521
Syntax:Prof.group.2          9.3054  2   0.009536 **
Animacy:Prof.group.2         4.7633  2   0.092399 .
Syntax:Animacy:Prof.group.2  1.3704  2   0.503997

So I still don't know how Raffray et al. reported beta, SE, z, and p for a
main effect of factor with 4 levels.
If reviewers ask me to do this, I will argue that reporting chi square
tests with corresponding p-values is
a more accurate way of reporting main effects and interactions.

If I haven't taken up too much of your time already, it would be
beneficial to understand which of the two methods suggested by Henrik
I should adopt. I attach my data.

The predictors of interest are Syntax (2 levels), Animacy (2 levels),
Prof.group.2 (3 levels),
and the outcome 'correct', while the random effects are 'Part.name' and
'Item'. The best model fit is a
bglmer with glmerControl(optimizer = "bobyqa") and nAGQ=1
Cov prior  : Part.name ~ wishart(df = 3.5, scale = Inf, posterior.scale =
cov, common.scale = TRUE)
           : Item ~ wishart(df = 3.5, scale = Inf, posterior.scale = cov,
common.scale = TRUE)
Prior dev  : 1.3565

Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) ['bglmerMod']
 Family: binomial  ( logit )
Formula: Correct ~ Syntax * Animacy * Prof.group.2 + (1 | Part.name) +
 (1 | Item)
   Data: recall
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid
   313.3    372.9   -142.6    285.3      509

Scaled residuals:
    Min      1Q  Median      3Q     Max
-1.3517 -0.2926 -0.1802 -0.1137  9.3666

Random effects:
 Groups    Name        Variance Std.Dev.
 Part.name (Intercept) 0.8046   0.8970
 Item      (Intercept) 0.5031   0.7093
Number of obs: 523, groups:  Part.name, 42; Item, 16

Fixed effects:
                                        Estimate Std. Error z value Pr(>|z|)
(Intercept)                             -0.8960     0.6317  -1.418 0.156071
Syntaxs                                 -2.0713     0.9447  -2.193 0.028342 *
Animacy+AN -AN                          -3.0539     1.2548  -2.434 0.014941 *
Prof.group.2int                         -2.5594     0.9473  -2.702 0.006898 **
Prof.group.2ns                          -1.8673     0.7634  -2.446 0.014442 *
Syntaxs:Animacy+AN -AN                   1.8642     1.8202   1.024 0.305750
Syntaxs:Prof.group.2int                  4.1704     1.1676   3.572 0.000355 ***
Syntaxs:Prof.group.2ns                   2.4244     1.0483   2.313 0.020736 *
Animacy+AN -AN:Prof.group.2int           3.0067     1.5528   1.936 0.052824 .
Animacy+AN -AN:Prof.group.2ns            1.3245     1.6071   0.824 0.409848
Syntaxs:Animacy+AN -AN:Prof.group.2int  -2.2056     2.0550  -1.073 0.283162
Syntaxs:Animacy+AN -AN:Prof.group.2ns   -2.3249     2.3108  -1.006 0.314360
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Henrik's first method via afex::mixed leads to:
"LRT")
Formula (the first argument) converted to formula.
Fitting 8 (g)lmer() models:
(8 warnings omitted)
Mixed Model Anova Table (Type 3 tests)

Model: Correct ~ Syntax * Animacy * Prof.group.2 + (1 | Part.name) +
Model:     (1 | Item)
Data: recall
Df full model: 14
                            Df   Chisq Chi Df Pr(>Chisq)
Syntax                      13  5.5659      1   0.018313 *
Animacy                     13  8.4710      1   0.003609 **
Prof.group.2                12 10.5099      2   0.005222 **
Syntax:Animacy              13  0.9832      1   0.321400
Syntax:Prof.group.2         12 15.8094      2   0.000369 ***
Animacy:Prof.group.2        12  3.9188      2   0.140945
Syntax:Animacy:Prof.group.2 12  1.2240      2   0.542272
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The result is a main effect of Syntax, Animacy, Prof.group.2, and
interaction
between Syntax and Prof.Group.2. The summary(m4) is perfectly interpretable.

Henrik's second method yields:

*set contrasts*
2*(recall$Animacy=="+AN -AN"))
2*(recall$Prof.group.2=="int") + 3*(recall$Prof.group.2=="ns"))
*try second method*
(1 | Item), data = recall, control = glmerControl(optimizer =
"bobyqa"), nAGQ=1, family=binomial, expand_re= T)
Warning message:
extra argument(s) ‘expand_re’ disregarded
Analysis of Deviance Table (Type III Wald chisquare tests)

Response: Correct
                              Chisq Df Pr(>Chisq)
(Intercept)                 67.7409  1  < 2.2e-16 ***
Syntax01                     0.2856  1   0.593083
Animacy01                    6.2575  1   0.012367 *
Group012                     2.9888  2   0.224379
Syntax01:Animacy01           0.0970  1   0.755521
Syntax01:Group012            9.3054  2   0.009536 **
Animacy01:Group012           4.7633  2   0.092399 .
Syntax01:Animacy01:Group012  1.3704  2   0.503997

The result this time is a main effect of what was Animacy and interaction
between what was Syntax and Prof.Group.2 ?!

The summary(m5) is perfectly interpretable.
On Tue, Feb 23, 2016 at 6:17 PM, Fox, John <jfox at mcmaster.ca> wrote:
#
Dear Francesco,

For a 1-df test, the Wald chi-square is just Z^2, but the chi-square is more general. When a term in the model has more than 1 df, there is more than one beta (hat) and one SE (and covariances) for the coefficients in the term. If you want to see the individual coefficient estimates, then summary(mod) will show you each coefficient estimate, the SE for each estimate, Z, and p. Why one would want to look at the individual effect-coded coefficients and tests in this context escapes me. 

Best,
 John
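John's first point is a numerical identity that is easy to check: for 1 df, the two-sided p-value from the Wald z test equals the upper-tail p-value of z^2 against a chi-square on 1 df. Using the Syntax z value from the summary above:

```r
z <- 2.193  # the Syntaxs |z| value from the bglmer summary above

p_z    <- 2 * pnorm(abs(z), lower.tail = FALSE)    # two-sided Wald z test
p_chi2 <- pchisq(z^2, df = 1, lower.tail = FALSE)  # 1-df Wald chi-square

c(p_z = p_z, p_chi2 = p_chi2)  # identical: P(|Z| > z) = P(Z^2 > z^2)
```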
#
Dear Prof. Fox,

Thanks for taking the time for this discussion. I think I made a few
shortcuts that are wrong, and I still have some unresolved issues
about these kinds of tests, even in the simplest case of linear models...

First, I think I mixed up contrasts and quadratic-form expectations in my
answer, I apologize for that; what I had in mind when answering
Francesco was in fact the expectation of the quadratic form, and I too
quickly deduced that there was an equivalent linear combination of the
parameters as its "square root", but this was obviously wrong since
the L matrix in an Lt W L quadratic form does not have to be a column
matrix. Am I wrong in thinking that typically in such tests, the L matrix
is precisely a multi-column matrix (hence the several degrees of
freedom associated), and that several contrasts are tested
simultaneously?

I should clarify that I call a "contrast" a linear combination of the model
parameters with the constraint that the coefficients of this
combination sum to 0; this is the definition in French ("contraste"),
but I may be using it wrongly in English?

Second, I may have wrongly understood the definitions of the various
tests, and especially how they generalize from linear model to
GLM/GLMM...

I thought type I was obtained by taking the squared distance of the
successive orthogonal projections on the subspaces generated by the
various terms, in the order given in the model; type II, by ensuring
that the term tested was the last amongst terms of the same order, after
terms of lower order but before terms of higher order; and type III, by
projecting on the subspace after removal of the basis vectors for the
term tested, hence its strong dependency on the coding scheme, and
the "drop1" trick to get them.

Is this definition correct? Does it generalize to other kinds of models,
or is another definition required? Is it unambiguous? The SAS doc
itself suggests that various procedures call "type II" different kinds
of things.

However, I cannot see clearly which hypothesis is indeed tested in
each case, especially in terms of cell means or marginal means (and,
when I really need it, I start from them and select the contrasts I
need). Is there any package/software that can print the hypotheses
tested in terms of means, starting from the model formula?
Or is there any good reference that makes the link between the two?
For instance, a demonstration that the comparison of marginal means
"always" leads to a type XXX sum of squares?

Best regards,
#
Dear Emmanuel,

The questions you raise are sufficiently complicated that it's difficult to address them adequately in an email. My Applied Regression and Generalized Linear Models text, for example, takes about 15 pages to explain the relationships among regressor codings, hypotheses, and tests in 2-way ANOVA, working with the full-rank parametrization of the model, and it's possible (as Russell Lenth indicated) to work things out even more generally. 

I'll try to answer briefly, however.
No need to apologize. I don't think that these are simple ideas.
Thinking in terms of the full-rank parametrization, as used in R, each type-III hypothesis is that several coefficients are simultaneously 0, which can be simply formulated as a linear hypothesis assuming an appropriate coding of the regressors for a factor. Type-II hypotheses can also be formulated as linear hypotheses, but doing so is more complicated. The Anova() function uses a kind of projection, in effect defining a type-II test as the most powerful test of a conditional hypothesis such as no A main effect given that the A:B interaction is absent in the model y ~ A*B. This works both for linear models, where (unless there is a complication like missing cells), the resulting test corresponds to the test produced by comparing the models y ~ A and y ~ A + B, using Y ~ A*B for the estimate of error variance (i.e., the denominator MS), and more generally for models with linear predictors, where it's in general possible to formulate the (Wald) tests in terms of the coefficient estimates and their covariance matrix.
I'd define a "contrast" as the weights associated with the levels of a factor for formulating a hypothesis, where the weights traditionally are constrained to sum to 0, and to differentiate this from a column of the model matrix, which I'd more generally term a "regressor." Often, a traditional set of contrasts for a factor, one less than the number of levels, are defined not only to sum to 0  but also to be orthogonal in the basis of the design. The usage in R is more general, where "contrasts" mean the set of regressors used to represent a factor. Thus, contr.sum() generates regressors that satisfy the traditional definition of contrasts, as do contr.poly() and contr.helmert(), but the default contr.treatment() generates 0/1 dummy-coded regressors that don't satisfy the traditional definition of contrasts.
Yes, if I've followed this correctly, it's correct, and it explains why it's possible to formulate the different types of tests in linear models independently of the contrasts (regressors) used to code the factors -- because fundamentally what's important is the subspace spanned by the regressors in each model, which is independent of coding. This approach, however, doesn't generalize easily beyond linear models fit by least squares. The approach taken in Anova() corresponds to this approach in linear models fit by least squares as long as the models remain full-rank and for type-III tests as long as the contrasts are properly formulated, and generalizes to other models with linear predictors.
This is where a complete explanation gets too lengthy for an email, but a shorthand formulation, e.g., for the model y ~ A*B, is that type-I tests correspond to the hypotheses A|(B = 0, AB = 0), B | AB = 0, AB = 0; type-II tests to A | AB = 0, B | AB = 0, AB = 0; and type-III tests to A = 0, B = 0, AB = 0. Here, e.g., | AB = 0 means assuming no AB interactions, so, e.g., the hypothesis A | AB = 0 means no A main effects assuming no AB interactions. A hypothesis like A = 0 is indeed formulated in terms of marginal means, understood as cell means for A averaging over the levels of B (not level means of A ignoring B).

I realize that this is far from a complete explanation.

Best,
 John
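The order dependence of the type-I (sequential) hypotheses in this shorthand is easy to demonstrate in base R with an invented, unbalanced design: the sequential sum of squares for A differs depending on whether A is entered before or after B.

```r
## Hedged sketch: type-I (sequential) SS depend on term order
## when the design is unbalanced. Data and names are invented.
set.seed(7)
d <- data.frame(
  A = factor(rep(c("a1", "a1", "a2", "a2"), c(6, 2, 3, 5))),
  B = factor(rep(c("b1", "b2", "b1", "b2"), c(6, 2, 3, 5)))
)
d$y <- rnorm(16)

ss_A_first <- anova(lm(y ~ A * B, d))["A", "Sum Sq"]  # A | nothing
ss_A_last  <- anova(lm(y ~ B * A, d))["A", "Sum Sq"]  # A | B

c(A_entered_first = ss_A_first, A_entered_last = ss_A_last)  # differ
```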
2 days later
#
Dear Prof. Fox,

Thanks for the time taken clarifying things. I'll take time to read
your text and think things over, but I think that until then I'll
stick with writing the comparisons in terms of means and deducing
the linear hypothesis to test, to be sure of what I'm doing.

I don't understand well, in your answer, the part saying "it explains
why it's possible to formulate the different types of tests in linear
models independently of the contrasts (regressors) used to code the
factors -- because fundamentally what's important is the subspace
spanned by the regressors in each model, which is independent of
coding".

As I understood the model, if we have a 2×2 design (A×B) for instance,
the subspace spanned by all predictors is a 4-dimensional space. In
this space, each dimension can be assigned to A, B, their interaction,
and a constant. That means each predictor is associated with a
different basis vector of this 4-dimensional space. But there are
several ways of defining the basis, defining different subspaces
associated with A, B, and A×B, and this corresponds to the different
codings. For instance, I can say (with 4 points)

1  A   B  A×B       or   1  A  B  A×B
1 -1  -1  +1             1  0  0  0
1 -1  +1  -1             0  0  1  0
1 +1  -1  -1             0  1  0  0
1 +1  +1  +1             0  1  1  1

and the subspaces associated with the constant, A, B, and A×B are
different in these two codings (but as a whole, the 4-dimensional
space is the same). I may be missing something trivial, but I would
say that the coding instead defines the subspace spanned by the
regressor, and not that they are independent.

Am I too stuck on coding? But then, how is the subspace associated
with a regressor defined "absolutely"?
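For what it's worth, one can check in base R that the two codings do span the same overall 4-dimensional space, even though the per-term columns differ: the two model matrices are related by an invertible change of basis. The factor names here are invented:

```r
## One observation per cell of a 2x2 design (invented factor names)
d <- expand.grid(A = factor(c("a1", "a2")), B = factor(c("b1", "b2")))

X_sum <- model.matrix(~ A * B, d,
                      contrasts.arg = list(A = "contr.sum", B = "contr.sum"))
X_trt <- model.matrix(~ A * B, d)  # default treatment (dummy) coding

## X_trt = X_sum %*% M with M invertible => identical column spaces
M <- solve(X_sum, X_trt)
max(abs(X_sum %*% M - X_trt))  # ~ 0: same overall space, different bases
```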
1 day later
#
Dear Emmanuel,

Again, I'll respond briefly, and not in the detail that your questions really require:
The subspace spanned by the regressors in a model like y ~ A*B or y ~ A + B is independent of the coding of the regressors.
In both cases, models like y ~ A*B and y ~ A + B produce the same y-hat vectors and hence the same SSs. The situation is a bit more complicated for models that violate marginality, but that situation can be handled by more general approaches, like estimable functions or close attention to the hypotheses tested. All tests can be formulated as linear hypotheses in the parameters of the full, full-rank model, but different parametrizations make the tests simpler or more difficult.

You've shown the row-basis for the model matrix in the cases of effect ("contr.sum") coding and dummy ("contr.treatment") coding. Call the basis matrix X_B. Then, because these are full-rank parametrizations, as long as no cells are empty, you can solve for the cell means in terms of the model parameters. Call the parameter vector corresponding to the basis beta_B and the ravelled vector of cell means mu. Then mu = X_B beta_B and (because X_B is nonsingular), beta_B = X_B^-1 mu. This allows you to see the composition of each parameter in terms of cell means and thus the hypothesis tested by the (type-III) test that the parameter is 0. In the case of effect coding, the columns of X_B are orthogonal and so its inverse is particularly simple, with each row equal to a column of X_B up to a constant factor.
It's not, but because the model matrix spans the same subspace, it's possible to test the same hypotheses in full-rank formulations of the same model. One way to see that is to work backwards from beta_B = X_B^-1 mu (that is, define X_B^-1 as the contrasts that you want to test) to mu = X_B beta_B. As mentioned, this is particularly simple when the *rows* of X_B^-1 are orthogonal contrasts.

John
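John's recipe can be carried out numerically for the 2×2 effect-coded row basis Emmanuel wrote out earlier: inverting X_B expresses each parameter as a combination of cell means, and because the columns are orthogonal with squared norm 4, the inverse is simply t(X_B)/4.

```r
## Row-basis X_B of the model matrix for a 2x2 design, effect coding
## (the same matrix from Emmanuel's message: columns 1, A, B, A:B)
XB <- cbind(c( 1,  1,  1, 1),
            c(-1, -1,  1, 1),
            c(-1,  1, -1, 1),
            c( 1, -1, -1, 1))
colnames(XB) <- c("1", "A", "B", "A:B")

## beta_B = solve(XB) %*% mu gives each parameter in terms of cell means
XBinv <- solve(XB)

## Orthogonal columns of squared norm 4, so the inverse is t(XB)/4:
max(abs(XBinv - t(XB) / 4))  # 0 (up to rounding)
```

Each row of the inverse is thus the contrast of cell means tested when the corresponding parameter is tested against 0, which is exactly why a type-III test behaves sensibly under this coding.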