Dependent Variable in Logistic Regression

17 messages · Rich Shepard, Paul Bernal, Patrick (Malone Quantitative) +5 more

#
Dear friends,

Hope you are doing great. I want to fit a logistic regression in R, where
the dependent variable is the covid status (I used 1 for covid positives,
and 0 for covid negatives), but when I ran the glm, R complains that I
should make the dependent variable a factor.

What would be more advisable, to keep the dependent variable with 1s and
0s, or code it as yes/no and then make it a factor?

Any guidance will be greatly appreciated,

Best regards,

Paul
#
x <- factor(0:1)
x <- factor(c("no", "yes"))

will produce identical results up to labeling.
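
As a rough check of that claim, here is a minimal sketch on simulated data. The variable names (age, covid) and the simulated effect sizes are made up for illustration and are not from the original post. It fits the same logistic regression once with a numeric 0/1 response and once with a two-level factor response, then compares the coefficients.

```r
# Simulated data; names and parameter values are illustrative assumptions.
set.seed(1)
n <- 200
age <- rnorm(n, mean = 50, sd = 10)
covid <- rbinom(n, size = 1, prob = plogis(-5 + 0.1 * age))

# Response kept as numeric 0/1
fit_num <- glm(covid ~ age, family = binomial)

# Response recoded as a factor; "no" is the first level, so it counts as failure
covid_f <- factor(covid, levels = c(0, 1), labels = c("no", "yes"))
fit_fac <- glm(covid_f ~ age, family = binomial)

# Compare the two sets of coefficient estimates
same_fit <- isTRUE(all.equal(coef(fit_num), coef(fit_fac)))
print(same_fit)
```

The order of the factor levels matters: the model describes the probability of any level other than the first.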


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sat, Aug 1, 2020 at 10:40 AM Paul Bernal <paulbernal07 at gmail.com> wrote:

            

  
  
#
On Sat, 1 Aug 2020, Paul Bernal wrote:

            
Paul,

1 or 0 are equivalent to yes or no, success or failure. All are nominal
variables so all should be factors, regardless of the coding.

Rich
#
Hi Bert,

Thank you for the kind reply.

But what if I don't turn the variable into a factor. Let's say that in
excel I just coded the variable as 1s and 0s and just imported the dataset
into R and fitted the logistic regression without turning any categorical
variable or dummy variable into a factor?

Does R require every dummy variable to be treated as a factor?

Best regards,

Paul

On Sat, Aug 1, 2020 at 12:59 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:

  
  
#
You appear to be confusing a binomial **response** with categorical
"dependent variables." glm() of course fits continuous or categorical
dependent variables. If a continuous dependent variable has only 2 values,
the results for glm() will be the same whether or not it is considered to
be continuous or categorical, though you may not recognize it as such.

This discussion has already wandered off topic to statistical issues. I
will not comment further on or off list. I suggest you consult a good
reference on linear/generalized linear models or talk with a local
statistician.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sat, Aug 1, 2020 at 11:04 AM Paul Bernal <paulbernal07 at gmail.com> wrote:

            

  
  
#
Sorry, typo. My first sentences should read:

"You appear to be confusing a binomial **response** with categorical
"independent variables." glm() of course fits continuous or categorical
independent variables."

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sat, Aug 1, 2020 at 11:11 AM Bert Gunter <bgunter.4567 at gmail.com> wrote:

            

  
  
#
... and further:
" If a continuous independent variable has only 2 values,..."

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sat, Aug 1, 2020 at 11:11 AM Bert Gunter <bgunter.4567 at gmail.com> wrote:

            

  
  
#
No, R does not. glm() does in order to do logistic regression.
On Sat, Aug 1, 2020 at 2:11 PM Paul Bernal <paulbernal07 at gmail.com> wrote:

            

  
    
#
... yes, but so does lm() for a categorical **INdependent** variable with
more than 2 numerically labeled levels. n levels  = (n-1) df for a
categorical covariate, but 1 for a continuous one (unless more complex
models are explicitly specified of course). As I said, the OP seems
confused about whether he is referring to the response or covariates. Or
maybe he just made the same typo I did.
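
A small sketch of the degrees-of-freedom point, using simulated data with a hypothetical three-level covariate called dose (not from the thread): coded as a number it gets one slope, wrapped in factor() it gets n - 1 = 2 coefficients.

```r
# Simulated data; 'dose' is a hypothetical covariate with levels 1, 2, 3.
set.seed(2)
dose <- rep(1:3, each = 20)
y <- rnorm(60, mean = dose)

fit_cont <- lm(y ~ dose)          # treated as continuous: a single slope
fit_cat  <- lm(y ~ factor(dose))  # treated as categorical: one dummy per non-reference level

# Number of coefficients apart from the intercept
df_cont <- length(coef(fit_cont)) - 1
df_cat  <- length(coef(fit_cat)) - 1
print(c(continuous = df_cont, categorical = df_cat))
```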

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sat, Aug 1, 2020 at 11:15 AM Patrick (Malone Quantitative) <
malone at malonequantitative.com> wrote:

            

  
  
#
Dear friend,

I am aware that I have a binomial dependent variable, which is covid status
(1 if covid positive, and 0 otherwise).

My question was whether R requires a binomial response variable to be
turned into a factor or not, that's all.

Cheers,

Paul

On Sat, Aug 1, 2020 at 1:22 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:

  
  
#
I didn't mean to imply that was the only time that it was required, only
that it's not universal in R.
On Sat, Aug 1, 2020 at 2:22 PM Bert Gunter <bgunter.4567 at gmail.com> wrote:

            

  
    
#
Hello,

 From the documentation, help('glm'):


      Details

A typical predictor has the form response ~ terms where response is the
(numeric) response vector and terms is a series of terms which specifies
a linear predictor for response.
For binomial and quasibinomial families the response can also be
specified as a factor (when the first level denotes failure and all
others success) or as a two-column matrix with the columns giving the
numbers of successes and failures. A terms specification of the form
first + second indicates all the terms in first together with all the
terms in second with any duplicates removed.


There is no need for the response to be a factor, it is optional, the 
wording is very clear,

"For binomial and quasibinomial families the response *can* also be
specified as a factor"

And with binary, numeric responses I cannot reproduce the warning 
message, the models fit silently.
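
The "first level denotes failure" part of the documentation can be illustrated with a short sketch on simulated data (the labels "neg" and "pos" are made up for the example). Changing which level comes first changes which outcome is modelled, so the coefficients should change sign.

```r
# Simulated data; labels "neg"/"pos" are illustrative assumptions.
set.seed(3)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(0.5 * x))

y_fac <- factor(y, levels = c(0, 1), labels = c("neg", "pos"))

# First level is "neg", so the model describes P(pos)
fit_pos <- glm(y_fac ~ x, family = binomial)

# Make "pos" the first level, so the model now describes P(neg)
y_rev <- relevel(y_fac, ref = "pos")
fit_neg <- glm(y_rev ~ x, family = binomial)

# The two fits should mirror each other
mirrored <- isTRUE(all.equal(coef(fit_pos), -coef(fit_neg), tolerance = 1e-6))
print(mirrored)
```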


Hope this helps,

Rui Barradas




At 18:39 on 01/08/2020, Paul Bernal wrote:

  
    
#
On 2/08/20 5:39 am, Paul Bernal wrote:

            
There have been many responses to this post, the majority of them being 
confusing and off the point.

BOTTOM LINE:  R/glm() does *NOT* complain that one "should make the 
dependent variable a factor".   This is bovine faecal output.

As Rui Barradas has pointed out (alternatively: RTFM!) when you fit a 
Bernoulli model using glm(), your response/dependent variable is allowed 
to be

     * a numeric variable with values 0 or 1
     * a logical variable
     * a factor with two levels

The OP presumably fed glm() a *character* vector with values "0" and 
"1".  Doing *this* will cause glm() to whinge.

I reiterate:  RTFM!!!  (And perhaps learn to distinguish between 
character vectors and factors.)
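
A sketch of the three accepted response types next to the character case, on simulated data with made-up variable names. The character fit is wrapped in try() so the script keeps running; the exact error text is not shown here because it may differ between inputs and R versions.

```r
# Simulated data; all names are illustrative assumptions.
set.seed(4)
x <- rnorm(50)
y_num <- rbinom(50, size = 1, prob = 0.5)  # numeric 0/1
y_lgl <- y_num == 1                        # logical
y_fac <- factor(y_num)                     # factor with levels "0" and "1"
y_chr <- as.character(y_num)               # character "0"/"1"

fits <- list(
  numeric = glm(y_num ~ x, family = binomial),
  logical = glm(y_lgl ~ x, family = binomial),
  factor  = glm(y_fac ~ x, family = binomial)
)
coefs <- sapply(fits, coef)  # one column of estimates per response type
print(coefs)

# The character response is expected to fail; try() captures the error
res_chr <- try(glm(y_chr ~ x, family = binomial), silent = TRUE)
chr_failed <- inherits(res_chr, "try-error")
print(chr_failed)

# Checking the class of the response before fitting helps spot this
print(class(y_chr))
```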

cheers,

Rolf Turner
#
That's a bit harsh.
Isn't the best advice here, to post a reproducible example...
Which I believe has been mentioned.

Also, I'd strongly encourage people to use package+function name, for
this sort of thing.

    stats::glm

As there are many R functions for GLMs...
On Sun, Aug 2, 2020 at 12:47 PM Rolf Turner <r.turner at auckland.ac.nz> wrote:
1 day later
#
> That's a bit harsh. Isn't the best advice here, to post a
> reproducible example... Which I believe has been mentioned.
>
> Also, I'd strongly encourage people to use package+function name,
> for this sort of thing.
>
>     stats::glm
>
> As there are many R functions for GLMs...

Sorry, Abby, I do disagree here (strongly enough to warrant
this reply):

We're talking about doing "basic" statistics with R, and these
functions in the stats package have been part of R since before
R even got a version number.

So, no,  glm()  {and the stats package} are the default and I still
think everybody should know and assume that.

Martin
#
Which part are you disagreeing with?
That unambiguous names/references should be used, or that there are
many R functions for GLMs?
The wording of your post, suggests (kind of), that there is only one R
function for GLMs.
Remember, not everyone is using the same R packages, as you.
And some people have done university courses, or done online courses,
etc, in R, without ever using one function from the stats package.

I'm reluctant to assume that all R users will have a common understanding.
And what may seem obvious to you or me, may seem quite foreign to some
users, or vice versa.
But perhaps most importantly, the OP said "the glm".
He never said "glm()", but rather the subsequent posters did.

Rolf suggested his post was bullshit, after removing the lexical peroxide.
How does anyone know that it wasn't a genuine post, but in reference
to something other than stats::glm?

Shouldn't people be innocent until proven guilty?
Otherwise (something I have been guilty of in the past), the mailing
list turns into statistical propaganda...

Even if the OP was referring to stats::glm, I'm still inclined to feel
the post was legitimate, just a bit short on technical details...
#
All: Kindly take this offline please.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Mon, Aug 3, 2020 at 12:39 PM Abby Spurdle <spurdle.a at gmail.com> wrote: