
Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing

7 messages · Tyler, Arie ten Cate

Hello Tyler,

Thank you for searching for, and finding, the basic description of the
behavior of R in this matter.

I think your example is in agreement with the book.

But let me first note the following. You write: "F_j refers to a
factor (variable) in a model and not a categorical factor". However,
"a factor is a vector object used to specify a discrete
classification" (start of chapter 4 of "An Introduction to R"). See
also the description of the R function factor().

You note that the book says about a factor F_j:
  "... F_j is coded by contrasts if T_{i(j)} has appeared in the
formula and by dummy variables if it has not"

You find:
   "However, the example I gave demonstrated that this dummy variable
encoding only occurs for the model where the missing term is the
numeric-numeric interaction, ~(X1+X2+X3)^3-X1:X2."

We have here T_i = X1:X2:X3. Also: F_j = X3 (the only factor). Then
T_{i(j)} = X1:X2, which is dropped from the model. Hence the X3 in T_i
must be encoded by dummy variables, as indeed it is.
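
This can be checked directly in R (a minimal sketch using the
variables from the example; the appearance of all three levels X3A,
X3B, X3C among the three-way columns indicates dummy coding):

```r
# X3 is a factor; X1 and X2 are numeric.
m <- expand.grid(X1 = c(1, -1), X2 = c(1, -1), X3 = c("A", "B", "C"))
# Dropping X1:X2 removes T_{i(j)}, so the X3 inside X1:X2:X3
# is expected to be coded by dummy variables:
colnames(model.matrix(~ (X1 + X2 + X3)^3 - X1:X2, data = m))
```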

  Arie
On Tue, Oct 31, 2017 at 4:01 PM, Tyler <tylermw at gmail.com> wrote:
Hi Arie,

The book on which this behavior is based does not use "factor" (in
this section) to refer to a categorical factor. I will again point to
this sentence from page 40, in the same section and referring to the
behavior in question, which shows that F_j is not limited to
categorical factors:
"Numeric variables appear in the computations as themselves, uncoded.
Therefore, the rule does not do anything special for them, and it remains
valid, in a trivial sense, whenever any of the F_j is numeric rather than
categorical."

Note the "... whenever any of the F_j is numeric rather than categorical."
"Factor" here is used in the more general sense of the word, not referring
to the R type "factor". The behavior of R does not match the heuristic it
cites.

Best regards,
Tyler
On Thu, Nov 2, 2017 at 2:51 AM, Arie ten Cate <arietencate at gmail.com> wrote:

1 day later
Hello Tyler,

I rephrase my previous mail, as follows:

In your example, T_i = X1:X2:X3. Let F_j = X3. (The numerical
variables X1 and X2 are not encoded at all.) Then T_{i(j)} = X1:X2,
which in the example is dropped from the model. Hence the X3 in T_i
must be encoded by dummy variables, as indeed it is.

  Arie
On Thu, Nov 2, 2017 at 4:11 PM, Tyler <tylermw at gmail.com> wrote:
Hi Arie,

I understand what you're saying. The following excerpt from the book
shows that F_j does not refer exclusively to categorical factors: "...the
rule does not do anything special for them, and it remains valid, in a
trivial sense, whenever any of the F_j is numeric rather than categorical."
Since F_j refers to both categorical and numeric variables, the behavior
of model.matrix() is not consistent with the heuristic.

Best regards,
Tyler
On Sat, Nov 4, 2017 at 6:50 AM, Arie ten Cate <arietencate at gmail.com> wrote:

1 day later
Hello Tyler,

You write that you understand what I am saying. However, I am now at a
loss about what exactly the problem is with the behavior of R. Here is
a script which reproduces your experiments with three variables
(excluding the full model):

m <- expand.grid(X1 = c(1, -1), X2 = c(1, -1), X3 = c("A", "B", "C"))
model.matrix(~ (X1 + X2 + X3)^3 - X1:X3, data = m)
model.matrix(~ (X1 + X2 + X3)^3 - X2:X3, data = m)
model.matrix(~ (X1 + X2 + X3)^3 - X1:X2, data = m)

Below are the three results, similar to your first mail. (The first
two are basically the same, of course.) Please pick one result which
you think is not consistent with the heuristic and please give what
you think is the correct result:

model.matrix(~(X1+X2+X3)^3-X1:X3)
  (Intercept)
  X1 X2 X3B X3C
  X1:X2 X2:X3B X2:X3C
  X1:X2:X3B X1:X2:X3C

model.matrix(~(X1+X2+X3)^3-X2:X3)
  (Intercept)
  X1 X2 X3B X3C
  X1:X2 X1:X3B X1:X3C
  X1:X2:X3B X1:X2:X3C

model.matrix(~(X1+X2+X3)^3-X1:X2)
  (Intercept)
  X1 X2 X3B X3C
  X1:X3B X1:X3C X2:X3B X2:X3C
  X1:X2:X3A X1:X2:X3B X1:X2:X3C

(I take it that the combination of X3A and X3B and X3C implies dummy
encoding, and the combination of only X3B and X3C implies contrasts
encoding, with respect to X3A.)

Thanks in advance,

Arie
On Sat, Nov 4, 2017 at 5:33 PM, Tyler <tylermw at gmail.com> wrote:
Hi Arie,

Given the heuristic, in all of my examples with a missing two-factor
interaction the three-factor interaction should be coded with dummy
variables. In reality, it is encoded by dummy variables only when the
numeric:numeric interaction is missing, and by contrasts in the other two
cases. The heuristic does not specify separate behavior for numeric and
categorical factors: when a marginal term is missing from the formula, the
higher-order interaction should be coded by dummy variables, regardless of
type. (When the author of Statistical Models in S refers to F_j as a
"factor", it is a more general usage than the R type "factor" and includes
numeric variables; the language used later in the chapter, on page 40,
confirms this.) Thus, the terms() function is only following the cited
behavior 1/3rd of the time.
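
That tally can be checked directly (a sketch: count the three-way
columns produced in each case, since dummy coding yields three and
contrasts coding yields two):

```r
m <- expand.grid(X1 = c(1, -1), X2 = c(1, -1), X3 = c("A", "B", "C"))
forms <- list(~ (X1 + X2 + X3)^3 - X1:X3,
              ~ (X1 + X2 + X3)^3 - X2:X3,
              ~ (X1 + X2 + X3)^3 - X1:X2)
# Number of columns belonging to the X1:X2:X3 term in each model:
sapply(forms, function(f)
  sum(grepl("X1:X2:X3", colnames(model.matrix(f, data = m)))))
```

Per the results quoted earlier in the thread, this should give 2, 2, 3:
only the last model gets dummy coding for the three-way term.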

Best regards,
Tyler
On Mon, Nov 6, 2017 at 6:45 AM, Arie ten Cate <arietencate at gmail.com> wrote:

Hello Tyler,

model.matrix(~(X1+X2+X3)^3-X1:X3)

T_i = X1:X2:X3. Let F_j = X3. (The numerical variables X1 and X2 are
not encoded at all.) Then, again, T_{i(j)} = X1:X2, which in this
example is NOT dropped from the model. Hence the X3 in T_i must be
encoded by contrasts, as indeed it is.

  Arie
On Mon, Nov 6, 2017 at 5:09 PM, Tyler <tylermw at gmail.com> wrote: