Simple question about formulae in R!?

On Fri, Aug 10, 2012 at 7:39 AM, Henrik Singmann
Dear Michael and list,

R in general tries hard to prohibit this behavior (i.e., including an
interaction but not the main effect). When removing a main effect and
leaving the interaction, the number of parameters is not reduced by one (as
would be expected) but stays the same, at least when using model.matrix:
This is because for factor or character variables, R automatically
dummy codes them (or whatever contrasts were set, dummies are the
default, but others could be set globally which would then propogate
to the specific model.matrix() call).
d <- data.frame(A = rep(c("a1", "a2"), each = 50), B = c("b1", "b2"), value
= rnorm(10))
ncol(model.matrix(~ A*B, data = d))
#  [1] 4
ncol(model.matrix(~ A*B - A, data = d))
#  [1] 4

I actually don't know understand how R parametrizes the model in the second
case, but I am pretty sure someone here might do so and be able to explain.
One way to think about this is the regression versus ANOVA style of
parameterization.  Consider these three simple models

Here we have a traditional regression.  am is a two level factor, so
it gets one dummy code with the intercept.  The model matrix looks
something like:

i a
1 0
1 0
1 0
1 1
1 1
1 1

what does that mean for the expected values?  For a = 0,

yhat = b0 * 1 + b1 * 0 = b0 * 1, the intercept

for a = 1

yhat = b0 * 1 + b1 * 1

what is the difference between these two?  Merely b1.  Hence under
this parameterization, b0 is the expected value for the group a = 0,
and b1 is the difference in expected values between the two groups a =
0 and a = 1.
coef(lm(mpg ~ 1 + factor(am), data = mtcars))
(Intercept) factor(am)1
  17.147368    7.244939

We can find the expected values for each group.  The first is just b0,
the intercept term.  The secondn is the intercept plus b1, the single
other coefficient.  We can get this just by adding (or summing all the
coefficients) like so:
sum(coef(lm(mpg ~ 1 + factor(am), data = mtcars)))
[1] 24.39231

Now, if we explicitly force out the intercept, R uses a different
coding scheme.  The intercept normally captures some reference or
baseline group, which every other group is compared to, but if we
force it out, the default design matrix is something like:

1 0
1 0
1 0
0 1
0 1
0 1

this looks very similar to what we had before, only now the two
columns are opposites.  Let's think again what our expected values
are.  For a = 0

yhat = b0 * 1 + b1 * 0 = b0 * 1

for a = 1

yhat = b0 * 0 + b1 * 1 = b1 * 1

now b0 encodes the expected value of the group a = 0 and b1 encodes
the expected value of the group a = 1.  By default these coefficients
are tested against 0.  There is no more explicit test if the two
groups are different from each other because instead of creating
differences, we have created the overall group expectation.  In lm():
coef(lm(mpg ~ 0 + factor(am), data = mtcars))
factor(am)0 factor(am)1
   17.14737    24.39231

both those numbers should be familiar from before.  In this case, both
models included two parameters, the only difference is whether they
encode group expectations or an expectation and expected difference.
These are equivalenet models, and their likelihood etc. will be
identical.   Note however that if you suppress the intercept, R will
use a different formula for the R squared so those will not look
identical (although if you got predicted values from both models,
yhat, you would find those identical).

This concept generalizes to the interaction of categorical variables
without their main effects.  Normally, the interaction terms are
expected differences from the reference group, but if the "main
effect" is dropped, then instead, the interactions become the expected
values of the various cells (if you think of a 2 x 2 table of group
membership).

For example here it is as usual:
coef(lm(mpg ~ 1 + factor(vs) * factor(am), data = mtcars))
(Intercept)             factor(vs)1             factor(am)1
              15.050000                5.692857                4.700000
factor(vs)1:factor(am)1
               2.928571
15.050000 + 5.692857
[1] 20.74286
15.050000 + 4.7
[1] 19.75
15.050000 + 5.692857 + 4.7 + 2.928571
[1] 28.37143

the intercept is the expected value for the group vs = 0 & am = 0,
then you get the difference between expectations for am = 0 & vs = 0
and am = 0 & vs = 1, then likewise for am, and finally, the additional
bit of being am = 1 and vs = 1.  I show the by hand calculations to
get the expected group values.  Right now, we get significance tests
of whether these differences are statistically significantly different
from zero.

If we drop the main effects and intercept, we get:
coef(lm(mpg ~ 0 + factor(vs) : factor(am), data = mtcars))
factor(vs)0:factor(am)0 factor(vs)1:factor(am)0 factor(vs)0:factor(am)1
               15.05000                20.74286                19.75000
factor(vs)1:factor(am)1
               28.37143

which you can see directly gets all the group expectations.

Cheers,

Josh
I have asked a question on how to get around this "limitation" on
stackoverflow with helpful answers by Ben Bolker and Joshua Wiley:
http://stackoverflow.com/q/11335923/289572
(this functionality is now used in function mixed() in my new package afex
for obtaining "type 3" p-values for mixed models)
With few exceptions, I'm in the habit of not giving people type 3
p-values.  People like, but generally do not actually understand them.
 I would also like to point out that I am in psychology, so yes I know
psychology likes them.  Further, I am a doctoral student so it is not
like I am so established that people bow to my every whim, it is work
to explain why I am not sometimes and I have had discussions in
various lab meetings and with advisors about this, but my point is
that it is possible.  I would urge you to do the same and take heart
that there are others working for change---you are not the only one
even if it feels like it at your university.
Cheers,
Henrik

Am 10.08.2012 15:48, schrieb R. Michael Weylandt:

On Fri, Aug 10, 2012 at 7:36 AM, ONKELINX, Thierry
<Thierry.ONKELINX at inbo.be> wrote:
Dear Johan,

Why should it be complicated? You have a very simple model, thus a very
simple formula. Isn't that great?

Your formula matches the model. Though Trust~Culture + Structure *
Speed_of_Integration is another option. The model fit is the same, the only
difference is the parameterization of the model.

Note quite, I don't think: A*B is shorthand for A + B + A:B where A:B
is the 2nd order (interaction) term. The OP only had the 2nd order
term without the main effects.

OP: You almost certainly want A*B -- it's strange (though certainly
not impossible) to have good models where you only have interactions
but not main effects.

Cheers,
Michael

Best regards,

Thierry

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature
and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium
+ 32 2 525 02 51
+ 32 54 43 61 85
Thierry.Onkelinx at inbo.be
www.inbo.be

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

-----Oorspronkelijk bericht-----
Van: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
Namens Johan Haasnoot
Verzonden: vrijdag 10 augustus 2012 9:15
Aan: r-help at r-project.org
Onderwerp: [R] Simple question about formulae in R!?

Good morning reader,

I have encountered a, probably, simple issue with respect to the
*formulae* of a *regression model* I want to use in my research. I'm
researching alliances as part of my study Business Economics (focus
Strategy) at the Vrije Universiteit in Amsterdam. In the research model I
use a moderating variable, I'm looking for confirmation or help on the
formulation of the model.

The research model consist of 2 explanatory variables, a moderating
variable and 1 response variable. The first explanatory variable is Culture,
measured on a nominal scale and the second is Structure of the alliance,
also measured on a nominal scale. The moderating variable in the relation
towards Trust is Speed of Integration, measured as an interval.
The response variable is Trust, measured on a nominal scale (highly
likely a 5-point Likert scale). Given the research model and the measurement
scales, I intent to use a ordered probit model, often used in Marketing
models,  to execute the regression modelling. I can't seem to find
confirmation on how to model the formulae. I have been reading and studying
R! for a couple of weeks now, read a lot of books from the R! series in the
past, but I can't get a grasp on this seemingly simple formulae. I think I
understand how to model multinomial regression (using the R-package MNP),
how to execute a Principal Components Analysis and an Explanatory Factor
analysis (obviously I'm using a questionnaire to collect my data), but the
formulae itself seems to be to simple.

I expect to use the interaction symbol: "Trust~Culture + Structure :
Speed_of_Integration" for the formulae, but is it really this simple? Can
anybody confirm this or help me, advise me on this issue?

Kind regards,

--

Met vriendelijke groet,

Johan Haasnoot

De Haan & Martojo
Kerklaan 5
3645 ES Vinkeveen
Telefoon: 0297-266354
Mobiel: 06-81827665
Email: johan.haasnoot at dh-m.nl
Website: www.dehaanenmartojo.nl

         [[alternative HTML version deleted]]

* * * * * * * * * * * * * D I S C L A I M E R * * * * * * * * * * * * *
Dit bericht en eventuele bijlagen geven enkel de visie van de schrijver
weer en binden het INBO onder geen enkel beding, zolang dit bericht niet
bevestigd is door een geldig ondertekend document.
The views expressed in this message and any annex are purely those of the
writer and may not be regarded as stating an official position of INBO, as
long as the message is not confirmed by a duly signed document.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

--
Dipl. Psych. Henrik Singmann
PhD Student
Albert-Ludwigs-Universit?t Freiburg, Germany
http://www.psychologie.uni-freiburg.de/Members/singmann

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

Simple question about formulae in R!?

Thread (10 messages)