How to interpret a factor response model? - R-help

Maciej Bliziński

Wed, May 4, 2005 12:23 AM

Hello,

I'd like to create a model with a factor-type response variable. This is
an example:

factor_var     real_var        
 one  :100   Min.   :-2.742877  
 three:100   1st Qu.:-0.009493  
 two  :100   Median : 2.361669  
             Mean   : 2.490411  
             3rd Qu.: 4.822394  
             Max.   : 6.924588

Call:
glm(formula = factor_var ~ real_var, family = "binomial", data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7442  -0.6774   0.1849   0.3133   2.1187  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.6798     0.1882  -3.613 0.000303 ***
real_var      0.8971     0.1066   8.417  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 381.91  on 299  degrees of freedom
Residual deviance: 213.31  on 298  degrees of freedom
AIC: 217.31

Number of Fisher Scoring iterations: 6

---------------------------------------------------------------------

For models with a real-valued response variable, it's easy to work out
the equation for the response variable in the model. But here, how do
I interpret the model?

God made the world in six days, and was arrested on the seventh.

Ted Harding

Wed, May 4, 2005 1:21 AM

On 04-May-05 Maciej Bliziński wrote:

Have you noticed that you get identical results with

set.seed(214354)
mydata <- data.frame(factor.var = as.factor(c(rep('one', 100),
   rep('two',100), rep('three', 100))),
   real.var = c(rnorm(150), rnorm(150) + 5))

mymodel <- glm(factor.var ~ real.var, family='binomial', data=mydata)
summary(mymodel)

and

set.seed(214354)
mydata <- data.frame(factor.var = as.factor(c(rep('one', 100),
   rep('two',200))),real.var = c(rnorm(150),rnorm(150) + 5))

mymodel <- glm(factor.var ~ real.var, family='binomial', data=mydata)
summary(mymodel)

(I've left out the "summary(mydata)" since these do naturally
differ, and I've replaced "factor_var" with "factor.var" and
"real_var" with "real.var" because of potential complications
with "_"; also "mymodel =" to "mymodel <-").

So I think the interpretation of the results from your first
model is that, because of the "family='binomial'", glm is
treating "factor.var='one'" as binomial response "0", say,
and "factor.var='two'" or "factor.var='three'" as binomial
response "1".
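
A minimal sketch of how one might check this reading, assuming the same simulated data as in the snippets above. The column `collapsed` and the object names `fit_factor` and `fit_collapsed` are introduced here for illustration only; they are not part of the original post.

```r
# Sketch: with family = binomial, a factor response is read as
# "first level" (failure) versus "any other level" (success).
set.seed(214354)
mydata <- data.frame(
  factor.var = factor(rep(c("one", "two", "three"), each = 100)),
  real.var   = c(rnorm(150), rnorm(150) + 5)
)

levels(mydata$factor.var)   # alphabetical order, so "one" is the first level

# 'collapsed' is a hypothetical helper column: TRUE for anything but "one"
mydata$collapsed <- mydata$factor.var != "one"

fit_factor    <- glm(factor.var ~ real.var, family = binomial, data = mydata)
fit_collapsed <- glm(collapsed  ~ real.var, family = binomial, data = mydata)

# Both fits should give the same coefficients
all.equal(coef(fit_factor), coef(fit_collapsed))
```

If the two sets of coefficients agree, the three-level fit is really a fit of "not 'one'" against 'one'.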

You're trying to fit a multinomial response, but you've
specified a binomial family to 'glm'. 'glm' does not have
a multinomial response family.

You could try 'multinom' from package 'nnet' which fits
loglinear models to factor responses with more than 2 levels.

E.g.

  library(nnet)
  mymodel <- multinom(factor.var ~ real.var,data=mydata)
   ### weights:  9 (4 variable)
   ##  initial  value 329.583687 
   ##  iter  10 value 209.780666
   ##  final  value 209.779951 
   ##  converged
  summary(mymodel)
   ## Re-fitting to get Hessian
   ## Call:
   ## multinom(formula = factor.var ~ real.var, data = mydata)
   ##  Coefficients:
   ##        (Intercept)  real.var
   ##  three  -3.4262565 1.3838231
   ##  two    -0.6754253 0.7116955
   ##
   ## Std. Errors:
   ##   (Intercept)  real.var
   ## three   0.5028541 0.1480138
   ## two     0.1846827 0.1068821
   ##
   ## Residual Deviance: 419.5599 
   ## AIC: 427.5599 
   ##
   ## Correlation of Coefficients:
   ##                 three:(Intercept) three:real.var two:(Intercept)
   ## three:real.var  -0.7286258                                      
   ## two:(Intercept)  0.1986995        -0.1261034                    
   ## two:real.var    -0.1411377         0.7012481     -0.3285741

This output does suggest a fairly clear interpretation!
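
One way to spell that interpretation out, as a sketch only: each coefficient row gives the log-odds of that level against the reference level 'one' as a linear function of real.var, and the class probabilities follow by normalising the exponentiated values. The helper `class_probs` and the matrix `coefs` are hypothetical names; the numbers are copied from the summary above.

```r
# Coefficients copied from the multinom() summary above;
# "one" is the reference level, so its linear predictor is fixed at 0.
coefs <- rbind(
  one   = c(intercept =  0,         slope = 0),
  three = c(intercept = -3.4262565, slope = 1.3838231),
  two   = c(intercept = -0.6754253, slope = 0.7116955)
)

# class_probs() is a hypothetical helper, not part of nnet
class_probs <- function(x) {
  eta <- coefs[, "intercept"] + coefs[, "slope"] * x   # log-odds vs "one"
  exp(eta) / sum(exp(eta))                             # normalise to probabilities
}

class_probs(0)   # probabilities of each level at real.var = 0
class_probs(5)   # probabilities of each level at real.var = 5
```

With the fitted object itself, predict(mymodel, newdata, type = "probs") should return the same kind of probabilities without copying coefficients by hand.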

Hoping this helps,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-May-05                                       Time: 09:18:03
------------------------------ XFMail ------------------------------

Brian Ripley

Wed, May 4, 2005 1:37 AM

On Wed, 4 May 2005, Maciej Bliziński wrote:

What you have done here is to fit a logistic regression.  The 
interpretation of that is covered in many good books: for example there 
are plots of the predicted values in MASS4.

I do wonder if that is what you intended, though.  You have fitted a model 
of 'two or three' vs 'one'.  You may have intended a multinomial logistic 
model: again MASS4 has details of such models.
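
As a rough sketch of what the fitted logistic regression says, using the coefficients from the summary in the original post: the model gives the probability that factor_var is not 'one' as a logistic function of real_var. The names `b0`, `b1` and `p_not_one` are made up for this illustration.

```r
# Coefficients copied from the glm() summary in the original post
b0 <- -0.6798   # (Intercept)
b1 <-  0.8971   # real_var

# p_not_one() is a hypothetical helper name.
# Fitted probability that factor_var is NOT "one", i.e. 'two' or 'three':
#   p = 1 / (1 + exp(-(b0 + b1 * x)))
p_not_one <- function(x) plogis(b0 + b1 * x)

p_not_one(c(0, 2.5, 5))
```

Each unit increase in real_var multiplies the odds of "not 'one'" by exp(b1).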

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Maciej Bliziński

Wed, May 4, 2005 3:02 AM

Thanks a lot for the answers (Prof. Ripley and Ted).

I'm trying to analyze a survey. Most of the variables are of factor
type, with values such as {"no_at_all", "a_little", "mostly",
"a_lot"}.

I thought about mapping those answers to numbers, but I didn't know
which numbers I should assign to them: {1, 2, 3, 4} (linear) or maybe
{1, 2, 4, 8} (exponential)? So I tried instead to analyze the original
factor survey data.
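
For illustration, a small sketch of the encodings being weighed here. The answer vector is invented, and treating the answers as an ordered factor is an assumption about what might suit such a survey, not something settled in this thread.

```r
# Hypothetical survey answers, for illustration only
answers <- c("a_little", "a_lot", "no_at_all", "mostly", "a_little")

# Keep the ordering of the categories without committing to numeric scores
degree <- factor(answers,
                 levels  = c("no_at_all", "a_little", "mostly", "a_lot"),
                 ordered = TRUE)

linear_score      <- as.integer(degree)           # the {1, 2, 3, 4} coding
exponential_score <- 2^(as.integer(degree) - 1)   # the {1, 2, 4, 8} coding

data.frame(degree, linear_score, exponential_score)
```

The ordered factor records only the ranking of the answers, whereas either numeric coding also asserts how far apart they are.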

Multinomial factor response wasn't covered in the lectures in my school
so I'm trying to use my intuition and trial/error technique (please
forgive me :-) ).

Prof Brian Ripley wrote:

I'd like to find possible correlations between factors in my survey. The
survey is about allergies, and I'd like to find out whether there is a
correlation between the degree of allergic problems and whether the
person was fed breast milk (or artificial milk) as a child.

I'll go on reading the "fullrefman.pdf" file.

Regards,
Maciej Blizinski
Danmarks Tekniske Universitet