
Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing

7 messages · Tyler, Arie ten Cate

Hello Tyler,

Thank you for searching for, and finding, the basic description of the
behavior of R in this matter.

I think your example is in agreement with the book.

But let me first note the following. You write: "F_j refers to a
factor (variable) in a model and not a categorical factor". However,
"a factor is a vector object used to specify a discrete
classification" (start of chapter 4 of "An Introduction to R"). See
also the description of the R function factor().

You note that the book says about a factor F_j:
  "... F_j is coded by contrasts if T_{i(j)} has appeared in the
formula and by dummy variables if it has not"

You find:
   "However, the example I gave demonstrated that this dummy variable
encoding only occurs for the model where the missing term is the
numeric-numeric interaction, ~(X1+X2+X3)^3-X1:X2."

We have here T_i = X1:X2:X3. Also: F_j = X3 (the only factor). Then
T_{i(j)} = X1:X2, which is dropped from the model. Hence the X3 in T_i
must be encoded by dummy variables, as indeed it is.
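
This can be checked directly in R (a minimal sketch using the
variables from the example; the appearance of all three levels X3A,
X3B, X3C among the three-way columns indicates dummy coding):

```r
# X3 is a factor; X1 and X2 are numeric.
m <- expand.grid(X1 = c(1, -1), X2 = c(1, -1), X3 = c("A", "B", "C"))
# Dropping X1:X2 removes T_{i(j)}, so the X3 inside X1:X2:X3
# is expected to be coded by dummy variables:
colnames(model.matrix(~ (X1 + X2 + X3)^3 - X1:X2, data = m))
```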

  Arie
On Tue, Oct 31, 2017 at 4:01 PM, Tyler <tylermw at gmail.com> wrote:
Hi Arie,

The book on which this behavior is based does not use "factor" (in
this section) to refer to a categorical factor. I will again point to
this sentence from page 40, in the same section and referring to the
behavior in question, which shows that F_j is not limited to
categorical factors:
"Numeric variables appear in the computations as themselves, uncoded.
Therefore, the rule does not do anything special for them, and it remains
valid, in a trivial sense, whenever any of the F_j is numeric rather than
categorical."

Note the "... whenever any of the F_j is numeric rather than categorical."
"Factor" here is used in the more general sense of the word, not referring
to the R type "factor". The behavior of R does not match the heuristic it
cites.

Best regards,
Tyler
On Thu, Nov 2, 2017 at 2:51 AM, Arie ten Cate <arietencate at gmail.com> wrote:

1 day later
Hello Tyler,

I rephrase my previous mail, as follows:

In your example, T_i = X1:X2:X3. Let F_j = X3. (The numerical
variables X1 and X2 are not encoded at all.) Then T_{i(j)} = X1:X2,
which in the example is dropped from the model. Hence the X3 in T_i
must be encoded by dummy variables, as indeed it is.

  Arie
On Thu, Nov 2, 2017 at 4:11 PM, Tyler <tylermw at gmail.com> wrote:
Hi Arie,

I understand what you're saying. The following excerpt from the book
shows that F_j does not refer exclusively to categorical factors: "...the
rule does not do anything special for them, and it remains valid, in a
trivial sense, whenever any of the F_j is numeric rather than categorical."
Since F_j refers to both categorical and numeric variables, the behavior
of model.matrix() is not consistent with the heuristic.

Best regards,
Tyler
On Sat, Nov 4, 2017 at 6:50 AM, Arie ten Cate <arietencate at gmail.com> wrote:

1 day later
Hello Tyler,

You write that you understand what I am saying. However, I am now at a
loss about what exactly the problem is with the behavior of R. Here is
a script which reproduces your experiments with three variables
(excluding the full model):

m <- expand.grid(X1 = c(1, -1), X2 = c(1, -1), X3 = c("A", "B", "C"))
model.matrix(~ (X1 + X2 + X3)^3 - X1:X3, data = m)
model.matrix(~ (X1 + X2 + X3)^3 - X2:X3, data = m)
model.matrix(~ (X1 + X2 + X3)^3 - X1:X2, data = m)

Below are the three results, similar to your first mail. (The first
two are basically the same, of course.) Please pick one result which
you think is not consistent with the heuristic and please give what
you think is the correct result:

model.matrix(~(X1+X2+X3)^3-X1:X3)
  (Intercept)
  X1 X2 X3B X3C
  X1:X2 X2:X3B X2:X3C
  X1:X2:X3B X1:X2:X3C

model.matrix(~(X1+X2+X3)^3-X2:X3)
  (Intercept)
  X1 X2 X3B X3C
  X1:X2 X1:X3B X1:X3C
  X1:X2:X3B X1:X2:X3C

model.matrix(~(X1+X2+X3)^3-X1:X2)
  (Intercept)
  X1 X2 X3B X3C
  X1:X3B X1:X3C X2:X3B X2:X3C
  X1:X2:X3A X1:X2:X3B X1:X2:X3C

(I take it that the combination of X3A and X3B and X3C implies dummy
encoding, and the combination of only X3B and X3C implies contrasts
encoding, with respect to X3A.)

Thanks in advance,

Arie
On Sat, Nov 4, 2017 at 5:33 PM, Tyler <tylermw at gmail.com> wrote:
Hi Arie,

Given the heuristic, in all of my examples with a missing two-factor
interaction the three-factor interaction should be coded with dummy
variables. In reality, it is encoded by dummy variables only when the
numeric:numeric interaction is missing, and by contrasts in the other two
cases. The heuristic does not specify separate behavior for numeric and
categorical factors: when a marginal term is missing from the formula, the
higher-order interaction should be coded by dummy variables, regardless of
type. (When the author of Statistical Models in S refers to F_j as a
"factor", it is a more general usage than the R type "factor" and includes
numeric variables; the language used later in the chapter, on page 40,
confirms this.) Thus, the terms() function is only following the cited
behavior 1/3rd of the time.
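
That tally can be checked directly (a sketch: count the three-way
columns produced in each case, since dummy coding yields three and
contrasts coding yields two):

```r
m <- expand.grid(X1 = c(1, -1), X2 = c(1, -1), X3 = c("A", "B", "C"))
forms <- list(~ (X1 + X2 + X3)^3 - X1:X3,
              ~ (X1 + X2 + X3)^3 - X2:X3,
              ~ (X1 + X2 + X3)^3 - X1:X2)
# Number of columns belonging to the X1:X2:X3 term in each model:
sapply(forms, function(f)
  sum(grepl("X1:X2:X3", colnames(model.matrix(f, data = m)))))
```

Per the results quoted earlier in the thread, this should give 2, 2, 3:
only the last model gets dummy coding for the three-way term.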

Best regards,
Tyler
On Mon, Nov 6, 2017 at 6:45 AM, Arie ten Cate <arietencate at gmail.com> wrote:

Hello Tyler,

model.matrix(~(X1+X2+X3)^3-X1:X3)

T_i = X1:X2:X3. Let F_j = X3. (The numerical variables X1 and X2 are
not encoded at all.) Then, again, T_{i(j)} = X1:X2, which in this
example is NOT dropped from the model. Hence the X3 in T_i must be
encoded by contrasts, as indeed it is.

  Arie
On Mon, Nov 6, 2017 at 5:09 PM, Tyler <tylermw at gmail.com> wrote: