reference category for factor in regression

9 messages · Jos Elkink, ONKELINX, Thierry, Berwin A Turlach +2 more

#
Hi all,

I am struggling with a strange issue in R that I have not encountered
before and I am not sure how to resolve this.

The model looks like this, with all irrelevant variables left out:

LABOUR - a dummy variable
NONLABOUR = 1 - LABOUR
AGE - a categorical variable / factor
VOTE - a dummy variable

glm(VOTE ~ 0 + LABOUR + NONLABOUR + LABOUR : AGE + NONLABOUR : AGE,
family=binomial(link="logit"))

In other words, a standard interaction model, but I want to know the
intercepts and coefficients for each of the two cases (LABOUR and
NONLABOUR), instead of getting coefficients for the differences as in
a normal interaction model.

But the strange thing is, for the two occurrences of the AGE variable,
R makes a different choice as to which AGE category to leave out of
the regression. The cross-table of AGE with LABOUR has no empty
cells.

Does anyone have an idea what might be going wrong, or what I could do about it?

Thanks in advance for any help!

Regards,

Jos
#
Dear Jos,

In R you don't need to create your own dummy variables. Just create a
factor variable LABOUR (with two levels) and rerun your model. Then you
should be able to calculate all coefficients.
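A minimal sketch of this suggestion, with made-up data standing in for the poster's variables (the values here are entirely hypothetical):

```r
# Hypothetical stand-ins for the poster's variables.
set.seed(1)
LABOUR <- factor(sample(c("NO", "YES"), 200, replace = TRUE))
AGE    <- factor(sample(c("65+", "18-24", "25-34"), 200, replace = TRUE))
VOTE   <- rbinom(200, 1, 0.5)

# With LABOUR as a factor, R builds the dummies itself; "0 +" yields a
# separate intercept per LABOUR level instead of a baseline plus difference.
fit <- glm(VOTE ~ 0 + LABOUR + LABOUR:AGE, family = binomial(link = "logit"))
coef(fit)
```

With two LABOUR levels and three AGE levels this parameterization gives six coefficients: one intercept per LABOUR group plus group-specific AGE effects.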

HTH,

Thierry

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature
and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium 
tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be 
www.inbo.be 

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey

#
Hi Thierry,

Thanks for your quick answer. The problem is not so much the LABOUR
variable, however, but the AGE variable, which consists of five
categories and for which I indeed do not create separate dummy variables.
But R does not behave as expected when deciding which category to use
as the reference ...

Jos

On Mon, Jan 19, 2009 at 2:37 PM, ONKELINX, Thierry
<Thierry.ONKELINX at inbo.be> wrote:

#
G'day Jos,

On Mon, 19 Jan 2009 15:52:00 +0000
Jos Elkink <jos.elkink at ucd.ie> wrote:

I guess in this case we need more information.

What is the output of "str(AGE)"?  That should tell you which category
is the reference category.  And which category do you expect to be the
reference category?

Cheers,

	Berwin

=========================== Full address =============================
Berwin A Turlach                            Tel.: +65 6515 4416 (secr)
Dept of Statistics and Applied Probability        +65 6515 6650 (self)
Faculty of Science                          FAX : +65 6872 3919       
National University of Singapore
6 Science Drive 2, Blk S16, Level 7          e-mail: statba at nus.edu.sg
Singapore 117546                    http://www.stat.nus.edu.sg/~statba
#
Hi Jos,

you can force R to set contrasts for factors the way you like with
contrasts(). You seem to be thinking of treatment contrasts, which are
the most easily interpreted, but there are others as well.
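For illustration (a hypothetical three-level factor mirroring the poster's AGE), the contrasts can be inspected and replaced:

```r
# A toy factor; "65+" is the first level and thus the default base.
AGE <- factor(c("65+", "18-24", "25-34", "65+", "18-24"),
              levels = c("65+", "18-24", "25-34"))
contrasts(AGE)                                  # default treatment contrasts

# Swap the base category to the second level ("18-24").
contrasts(AGE) <- contr.treatment(3, base = 2)
contrasts(AGE)
```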

However: are you sure you want to bin an age variable into categories? 
You will lose power, along with a lot of other unpleasant things:
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/CatContinuous

With five categories, you are giving up 4 df. I'd recommend looking into 
splines, where you should be able to get more bang for the buck. Look at 
rcs() in the Design package, and at Frank Harrell's excellent book 
"Regression Modeling Strategies".

Of course, if you only have the binned data, all this is irrelevant...

HTH,
Stephan


Jos Elkink wrote:
#
Jos,

See ?relevel for information on how to reorder the levels of a factor,
while being able to specify the reference level.

Basically, the first level of the factor is taken as the reference. If
you want to utilize a different ordering, as an alternative to the
above, simply use:

  AGE <- factor(AGE, levels = c(FirstLevel, SecondLevel, ...))
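As a short sketch (hypothetical levels), relevel() changes only the reference level, without retyping the full ordering:

```r
# Levels default to alphabetical order, so "65+" sorts last here.
AGE <- factor(c("18-24", "65+", "25-34"))
levels(AGE)                        # "18-24" "25-34" "65+"

# Move "65+" to the front; the remaining levels keep their order.
AGE <- relevel(AGE, ref = "65+")
levels(AGE)                        # "65+" "18-24" "25-34"
```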

BTW, you might want to review Frank Harrell's page on why categorizing a
continuous variable is not a good idea:

  http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/CatContinuous

HTH,

Marc Schwartz
on 01/19/2009 09:52 AM Jos Elkink wrote:
#
Hi all,

Thanks for the advice.
Yes, that is how I have always used it. But the problem is that in this
particular regression R does *not* take the first level as the reference.
In fact, AGE appears twice in the same regression (in two different
interactions), and in one case it selects the first category and in the
other case a different one.

I most certainly agree, but the categorisation was imposed in the
survey itself, so this is all the data I have. I did not design the
questions :-) ... Thanks for this reference, though, as it is
certainly interesting to inform my teaching.

str(AGE) gives:

Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...

So I expect 65+ to be the reference category, but it is not.
Here is a little more R output to show the problem:
Factor w/ 5 levels "65+","18-24",..: 5 5 1 4 5 5 2 4 1 3 ...
LABOUR
   0    1
 692 1409
Call:  glm(formula = NOVOTE ~ 0 + LABOUR + NONLABOUR + AGE:LABOUR +
  AGE:NONLABOUR, family = binomial)

Coefficients:
            LABOUR           NONLABOUR       LABOUR:AGE65+     LABOUR:AGE18-24
          -0.35110            -0.30486            -0.11890            -0.66444
   LABOUR:AGE25-34     LABOUR:AGE35-49     LABOUR:AGE50-64  NONLABOUR:AGE18-24
          -0.23893            -0.15860                  NA            -0.65655
NONLABOUR:AGE25-34  NONLABOUR:AGE35-49  NONLABOUR:AGE50-64
          -0.72815             0.04951             0.17481

As you can see, 65+ is taken as reference category in the interaction
with NONLABOUR, but not in the interaction with LABOUR.

I know glm(NOVOTE ~ LABOUR * AGE, family=binomial) would be a more
conventional specification, but the above should be equivalent and
should give me the coefficients and standard errors for the two groups
(LABOUR and NONLABOUR) separately, rather than for the difference /
interaction term.

Perhaps the NA in the above output (which I only noticed now) is a hint
at the problem, but I am not sure why it occurs.

For reference, here is the cross-table (it has no empty cells):
, ,  = 0


          0   1
  65+   137  24
  18-24  68 127
  25-34  59 267
  35-49  71 298
  50-64  82 179

, ,  = 1


          0   1
  65+   101  15
  18-24  26  46
  25-34  21 148
  35-49  55 179
  50-64  72 126

There must be a reason R decides *not* to use 65+ as the reference in
that particular scenario, and I am missing what it is. Does anyone have
any idea?

Jos
#
Hi Jos,

does explicitly recoding AGE help?

AGE <- factor(c("65+","18-24","18-24","25-34"))
str(AGE)
AGE <- factor(c("65+","18-24","18-24","25-34"), levels=c("65+","18-24","25-34"))
str(AGE)

Best,
Stephan


Jos Elkink wrote:
#
G'day Jos,

On Mon, 19 Jan 2009 20:22:10 +0000
Jos Elkink <jos.elkink at ucd.ie> wrote:

Thanks for that, all becomes clear now. :)
The way R translates model formulae into design matrices can appear to
be a somewhat mysterious process.  The best account of how this is done
is said to be found in the MASS book.

Essentially, if A is a factor with r levels, then if A appears as a
main term in the model it will contribute X_A (the matrix with r
columns that contains the dummy variables corresponding to the levels
of A) to the design matrix.  If A appears in a two-way interaction term,
then the contribution to the design matrix is the matrix obtained by
taking each column of X_A and multiplying it elementwise with each
column of the matrix contributed by the other variable.  And so forth
for higher order interactions.

Now, in general adding X_A to the design matrix will lead to a design
matrix that does not have full rank.  Thus, typically X_A is first
multiplied by a contrast matrix, say, C_A and X_A C_A is actually added
to the design matrix.  The contrast matrix used depends on your
settings of options()$contrasts.  Likewise X_A C_A is typically used in
the construction of the interaction terms and not X_A.

But if you play around with "+0" (or "-1"), then things seem to become
more complicated.  E.g. if +0 is in the model, then for the first
factor that is added as a main term, X_A is put into the design matrix,
not X_A C_A.  But a second factor, say, B would contribute X_B C_B to
the design matrix since adding X_B would lead again to a design matrix
that does not have full column rank; so some reduction is done
immediately.
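This asymmetry is easy to see with model.matrix() on toy factors (the data here are hypothetical): with "+0", the first factor contributes all of its indicator columns (X_A), while a second factor still enters via its contrasts (X_B C_B):

```r
# Two toy factors with three and two levels respectively.
A <- factor(c("a", "b", "c", "a"))
B <- factor(c("x", "y", "x", "y"))

# Without the intercept, A contributes X_A (all three indicators);
# B is still reduced via its contrast matrix (only "By" appears).
colnames(model.matrix(~ 0 + A + B))   # "Aa" "Ab" "Ac" "By"

# With the intercept, both factors enter via contrasts.
colnames(model.matrix(~ A + B))       # "(Intercept)" "Ab" "Ac" "By"
```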

MASS does not really discuss the situation in which you have two
(apparently) quantitative variables and one factor in your model, with
the factor appearing only in the interaction terms (well, at least not
in the edition of MASS that I currently have in front of me, which is an
older one; my copy of the most recent edition is in a colleague's
office). But it appears plausible that the logic by which the design
matrix is constructed uses X_A in the AGE:LABOUR interaction and
X_A C_A in the AGE:NONLABOUR interaction.  Or there is a subtle error
of logic in how the design matrix is constructed.

Of course, due to the columns contributed by LABOUR and AGE:LABOUR, the
design matrix no longer has full rank.  This is noticed while the model
is fitted and, hence, the LABOUR:AGE50-64 parameter is not estimable
and an NA is returned.

BTW, if you use

glm(NOVOTE ~ LABOUR + NONLABOUR + AGE : LABOUR + AGE : NONLABOUR, family=binomial)

then the interaction terms are as you expect (if I understand you
correctly).  But now, of course, the NONLABOUR parameter is not
estimable.
If I understand you correctly, I would do the following to get a fitted
model that directly gives you the estimates and standard errors that you
are interested in:

1) Turn LABOUR into a factor to have R do all the work with creating
the design matrix:

R> FLABOUR <- factor(LABOUR, labels=c("NO", "YES"))

2) Use any of the following commands:

R> glm(NOVOTE ~ FLABOUR/AGE - 1, family=binomial)
R> glm(NOVOTE ~ FLABOUR/AGE + 0, family=binomial)
R> glm(NOVOTE ~ FLABOUR + FLABOUR:AGE - 1, family=binomial)
R> glm(NOVOTE ~ FLABOUR + FLABOUR:AGE + 0, family=binomial)

Somehow, I have a preference for using "-1" instead of "+0", but that
does not really matter; nor does it matter whether these appear at the
beginning, the middle, or the end of the formula.  The first two forms
are, of course, equivalent to the latter two, due to the way "/" is
interpreted.  I prefer the first version (less typing), but the
third or fourth version might be clearer to others.
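On toy data (hypothetical factors), the "/" form and the explicit form can be verified to expand to the same design matrix:

```r
# Toy factors covering all six cells of the cross-table exactly once.
f <- factor(rep(c("NO", "YES"), 3))
g <- factor(rep(c("a", "b", "c"), 2))

# f/g expands to f + f:g, so both formulas produce identical columns.
m1 <- model.matrix(~ f/g - 1)
m2 <- model.matrix(~ f + f:g - 1)
identical(colnames(m1), colnames(m2))   # TRUE

colnames(m1)   # one "intercept" per f level, plus f-specific g effects
```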

HTH.

Cheers,

	Berwin
