
logistic regression

19 messages · Darin Brooks, Kevin E. Thorpe, Dieter Menne +8 more

#
Sorry.

Let me try again then.

I am trying to find "significant" predictors from a list of about 44
independent variables.  So I started with all 44 variables, ran
drop1(sep22lr, test="Chisq"), and then dropped the variable with the highest
p-value from the run.  Then I reran drop1.

Model:
MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_MST_1 + 
    SOIL_NUTR + cE + cN + cELEV + cDIAM_125 + cCRCLS + cCULM_125 + 
    cSPH + cAGE + cVRI_NONPINE + cVRI_nonpineCFR + cVRI_BLEAF + 
    cvol_125 + cstrDST_SW + cwaterDST_SW + cSEEDSRCE_SW + cMAT + 
    cMWMT + cMCMT + cTD + cMAP + cMSP + cAHM + cSHM + cMATMAP + 
    cddless0 + cddless18 + cddgrtr0 + cddgrtr18 + cNFFD + cbFFP + 
    ceFFP + cPAS + cDD5_100 + cEXT_Cold + cS_INDX
                Df Deviance    AIC    LRT   Pr(Chi)    
<none>               814.21 938.21                     
ORG_CODE         4   824.97 940.97  10.76 0.0294100 *  
BECLBL08         9   845.61 951.61  31.41 0.0002519 ***
PEM_SScat       10   829.11 933.11  14.90 0.1357580    
SOIL_MST_1       1   814.63 936.63   0.43 0.5135094    
SOIL_NUTR        2   818.49 938.49   4.28 0.1175411    
cE               1   814.37 936.37   0.16 0.6886085    
cN               1   814.40 936.40   0.20 0.6566765    
cELEV            1   814.35 936.35   0.14 0.7044864    
cDIAM_125        1   817.98 939.98   3.78 0.0519554 .  
cCRCLS           1   819.32 941.32   5.11 0.0237598 *  
cCULM_125        1   816.17 938.17   1.97 0.1606846    
cSPH             1   816.62 938.62   2.41 0.1204141    
cAGE             1   815.92 937.92   1.72 0.1902314    
cVRI_NONPINE     1   818.04 940.04   3.84 0.0501149 .  
cVRI_nonpineCFR  1   821.17 943.17   6.96 0.0083197 ** 
cVRI_BLEAF       1   818.78 940.78   4.58 0.0324286 *  
cvol_125         1   814.67 936.67   0.47 0.4949495    
cstrDST_SW       1   814.63 936.63   0.42 0.5169757    
cwaterDST_SW     1   814.75 936.75   0.55 0.4592643    
cSEEDSRCE_SW     1   817.73 939.73   3.53 0.0604234 .  
cMAT             1   814.27 936.27   0.06 0.8002333    
cMWMT            1   814.49 936.49   0.28 0.5942246    
cMCMT            1   819.39 941.39   5.18 0.0228425 *  
cTD              1   816.20 938.20   1.99 0.1580332    
cMAP             1   814.25 936.25   0.04 0.8386626    
cMSP             1   818.41 940.41   4.20 0.0404411 *  
cAHM             1   815.66 937.66   1.46 0.2276311    
cSHM             1   819.95 941.95   5.75 0.0165227 *  
cMATMAP          1   814.91 936.91   0.71 0.4001878    
cddless0         1   818.04 940.04   3.83 0.0502153 .  
cddless18        1   817.81 939.81   3.60 0.0576931 .  
cddgrtr0         1   816.64 938.64   2.44 0.1184235    
cddgrtr18        1   815.77 937.77   1.57 0.2104958    
cNFFD            1   815.38 937.38   1.18 0.2782582    
cbFFP            1   814.39 936.39   0.18 0.6677481    
ceFFP            1   820.22 942.22   6.01 0.0141863 *  
cPAS             1   814.21 936.21   0.01 0.9347654    
cDD5_100         1   814.79 936.79   0.58 0.4447531    
cEXT_Cold        1   816.99 938.99   2.78 0.0954512 .  
cS_INDX          1   815.21 937.21   1.01 0.3157208    


And then I systematically reran drop1, removing the variable with the HIGHEST
p-value (least significant) from each resulting model, until only variables
significant at the 0.10 level remained.
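For reference, the drop-the-worst-term loop described here can be sketched in a few lines of R. The data frame and variable names below are simulated stand-ins, not the poster's dataset:

```r
## Hypothetical sketch of one backward-elimination step; repeat until all
## remaining terms are "significant".  Simulated data, not the poster's.
set.seed(1)
dat <- data.frame(y  = rbinom(60, 1, 0.5),
                  x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60))
fit <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)
d1  <- drop1(fit, test = "Chisq")
pcol  <- grep("^Pr", names(d1))                       # "Pr(Chi)" or "Pr(>Chi)", by R version
worst <- rownames(d1)[-1][which.max(d1[[pcol]][-1])]  # skip the <none> row
fit2  <- update(fit, as.formula(paste(". ~ . -", worst)))  # drop it and refit
```

(As the rest of this thread makes clear, automating this loop does not make the resulting p-values valid.)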

Model:
MIN_Mstocked ~ ORG_CODE + BECLBL08 + PEM_SScat + SOIL_NUTR + 
    cSEEDSRCE_SW + cMSP + ceFFP + cEXT_Cold
             Df Deviance    AIC    LRT   Pr(Chi)    
<none>            884.20 946.20                     
ORG_CODE      4   916.38 970.38  32.18 1.757e-06 ***
BECLBL08      9   940.66 984.66  56.46 6.418e-09 ***
PEM_SScat    11   906.20 946.20  22.00 0.0243795 *  
SOIL_NUTR     2   894.19 952.19   9.99 0.0067557 ** 
cSEEDSRCE_SW  1   894.41 954.41  10.21 0.0013983 ** 
cMSP          1   896.97 956.97  12.77 0.0003516 ***
ceFFP         1   928.50 988.50  44.30 2.812e-11 ***
cEXT_Cold     1   923.35 983.35  39.15 3.921e-10 ***


I didn't create any kind of dummy or factor variables for my categorical
data (at least, not on purpose).

With the remaining 8 variables, I ran a logistic regression (glm) against my
dependent variable (MIN_Mstocked).  When I do a summary() of the glm, I am
provided with the usual table of estimate, std. error, z value, and
Pr(>|z|)... BUT there are some coefficients missing from the list.  None of
the categorical variables is complete.  Some are missing only one category,
while others are missing 4 or 5 categories.

e.g.

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -1.324e+02  1.363e+03  -0.097 0.922611    
ORG_CODE[T.DLA]      -1.504e+01  1.363e+03  -0.011 0.991192    
ORG_CODE[T.DMO]      -1.494e+01  1.363e+03  -0.011 0.991253    
ORG_CODE[T.DPG]      -1.766e+01  1.363e+03  -0.013 0.989658    
ORG_CODE[T.DVA]      -1.841e+01  1.363e+03  -0.014 0.989220    
BECLBL08[T.SBS dw 2] -6.733e-01  5.903e-01  -1.141 0.254033    
BECLBL08[T.SBS dw 3] -1.094e+00  5.714e-01  -1.914 0.055586 .  
BECLBL08[T.SBS mc 2]  1.573e-01  5.004e-01   0.314 0.753211    
BECLBL08[T.SBS mc 3]  1.402e+00  5.824e-01   2.408 0.016043 *  
BECLBL08[T.SBS mk 1] -2.388e+00  7.529e-01  -3.172 0.001514 ** 
BECLBL08[T.SBS mw]   -1.672e+01  1.393e+03  -0.012 0.990425    
BECLBL08[T.SBS vk]   -1.614e+01  1.243e+03  -0.013 0.989640    
BECLBL08[T.SBS wk 1] -3.640e+00  8.174e-01  -4.453 8.48e-06 ***
BECLBL08[T.SBS wk 3] -1.838e+01  1.363e+03  -0.013 0.989240    
PEM_SScat[T.B]       -1.815e+01  3.956e+03  -0.005 0.996339    
PEM_SScat[T.C]        1.998e-01  3.925e-01   0.509 0.610792    
PEM_SScat[T.D]       -2.314e-01  3.215e-01  -0.720 0.471621    
PEM_SScat[T.E]        5.581e-01  3.433e-01   1.626 0.104020    
PEM_SScat[T.F]       -1.113e+00  5.782e-01  -1.926 0.054153 .  
PEM_SScat[T.G]        1.780e-01  4.420e-01   0.403 0.687150    
PEM_SScat[T.H]        1.670e+01  3.956e+03   0.004 0.996633    
PEM_SScat[T.I]        2.751e-01  9.313e-01   0.295 0.767705    
PEM_SScat[T.J]       -2.623e-01  9.693e-01  -0.271 0.786649    
PEM_SScat[T.K]       -1.862e+01  3.956e+03  -0.005 0.996244    
PEM_SScat[T.L]       -1.661e+01  1.211e+03  -0.014 0.989056    
SOIL_NUTR[T.C]       -1.119e+00  3.781e-01  -2.960 0.003073 ** 
SOIL_NUTR[T.D]       -7.912e-02  9.049e-01  -0.087 0.930320    
cSEEDSRCE_SW         -1.512e-03  4.930e-04  -3.066 0.002170 ** 
cMSP                  1.808e-02  5.304e-03   3.409 0.000652 ***
ceFFP                 2.889e-01  4.662e-02   6.196 5.80e-10 ***
cEXT_Cold            -1.880e+00  3.330e-01  -5.647 1.63e-08 ***

There should be a PEM_SScat[T.A].  It is the most prevalent level of that
factor.

ORG_CODE is missing more than 6 categories in the list.

SOIL_NUTR should have a [T.B].

Does that help? 

-----Original Message-----
From: Kevin E. Thorpe [mailto:kevin.thorpe at utoronto.ca] 
Sent: Saturday, September 27, 2008 6:21 AM
To: Darin Brooks
Cc: r-help at r-project.org
Subject: Re: [R] logistic regression
Darin Brooks wrote:
I'm not sure I fully understand your question.  It sounds like you created
your own dummy variables for your categorical variables. Did you?  Or did
you use factor variables for your categorical variables?
If the latter, then I REALLY don't understand your question.

Kevin

--
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program Assistant Professor,
Dalla Lana School of Public Health University of Toronto
email: kevin.thorpe at utoronto.ca  Tel: 416.864.5776  Fax: 416.864.6057

#
Darin Brooks wrote:
Yes.  I don't see a problem, however.  First, your variables are
"factors", which means there will be one fewer coefficient than
categories.  One level is a reference group, which probably explains
PEM_SScat and SOIL_NUTR each "missing" one coefficient.  For ORG_CODE,
there were 4 DF in the starting model and 4 DF in the final model, with 4
coefficients.  So the 6 missing categories appear to have been missing
from the start.

What do you expect for ORG_CODE?  What does, say, summary(ORG_CODE) give you?
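A toy example makes the "missing" coefficients concrete. With R's default treatment contrasts, the first factor level becomes the reference and is absorbed into the intercept, so a k-level factor yields k - 1 coefficients:

```r
## Toy data: a 3-level factor produces only 2 coefficients in a glm,
## because level "A" serves as the reference group.
f <- factor(c("A", "A", "B", "C", "B", "C"))
y <- c(0, 1, 0, 1, 1, 0)
fit <- glm(y ~ f, family = binomial)
names(coef(fit))   # "(Intercept)" "fB" "fC" -- no separate coefficient for "A"
```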

Are you aware of the dangers of stepwise model fitting?  It is a
commonly recurring theme on this list.

Kevin

#
Darin Brooks wrote:
Why?  What is wrong with insignificant predictors?
Estimates from this model (and especially standard errors and P-values) 
will be invalid because they do not take into account the stepwise 
procedure above that was used to torture the data until they confessed.

Frank

#
Frank E Harrell Jr <f.harrell <at> vanderbilt.edu> writes:
Please book this as a fortune.

Dieter
#
On 27-Sep-08 21:45:23, Dieter Menne wrote:
Seconded!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 27-Sep-08                                       Time: 23:30:19
------------------------------ XFMail ------------------------------
#
Glad you were amused.

I assume that "booking this as a fortune" means that this was an idiotic way
to model the data?

MARS?  Boosted Regression Trees?  Any of these a better choice to extract
significant predictors (from a list of about 44) for a measured dependent
variable?

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Ted Harding
Sent: Saturday, September 27, 2008 4:30 PM
To: r-help at stat.math.ethz.ch
Subject: Re: [R] FW: logistic regression
<snip />

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
It's more a statement that it expresses a statistical perspective very
succinctly, somewhat like a Zen koan.  Frank's book, "Regression
Modeling Strategies", has entire chapters on reasoned approaches to
your question.  His website also has quite a bit of material free for
the taking.
#
Darin Brooks wrote:
Dieter was nominating this for the "fortunes" package in R.  (Thanks Dieter)
Or use a data reduction method (principal components, variable 
clustering, etc.) or redundancy analysis (to remove individual 
predictors before examining associations with Y), or fit the full model 
using penalized maximum likelihood estimation.  lasso and lasso-like 
methods are also worth pursuing.

Cheers
Frank
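One way to pursue the penalized/lasso route Frank mentions is the glmnet package (one common implementation among several; the data below are simulated for illustration):

```r
## L1-penalized (lasso) logistic regression: the penalty is chosen by
## cross-validation, and many coefficients are shrunk exactly to zero.
## Simulated data; glmnet is an add-on package, not part of base R.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- rbinom(100, 1, plogis(x[, 1] - x[, 2]))
cvfit <- cv.glmnet(x, y, family = "binomial")
coef(cvfit, s = "lambda.1se")   # sparse coefficient vector
```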

#
--- On Sat, 9/27/08, Dieter Menne <dieter.menne at menne-biomed.de> wrote:

Hear, hear! I vote yes.


#
The Inferno awaits me -- but I cannot resist a comment (but DO look at
Frank's website).

There is a deep and disconcerting dissonance here. Scientists are
(naturally) interested in getting at mechanisms, and so want to know which
of the variables "count" and which do not. But statistical analysis --
**any** statistical analysis -- cannot tell you that. All statistical
analysis can do is build models that give good predictions (and only over
the range of the data). The models you get depend **both** on the way Nature
works **and** the peculiarities of your data (which is what Frank referred
to in his comment on data reduction). In fact, it is highly likely that with
your data there are many alternative prediction equations built from
different collections of covariates that perform essentially equally well.
Sometimes it is otherwise, typically when prospective, carefully designed
studies are performed -- there is a reason that the FDA insists on clinical
trials, after all (and reasons why such studies are difficult and expensive
to do!).

The belief that "data mining" (as it is known in the polite circles that
Frank obviously eschews) is an effective (and even automated!) tool for
discovering how Nature works is a misconception, but one that for many
reasons is enthusiastically promoted.  If you are looking only to predict,
it may do; but you are deceived if you hope for Truth. Can you get hints? --
well maybe, maybe not. Chaos beckons.

I think many -- maybe even most -- statisticians rue the day that stepwise
regression was invented and certainly that it has been marketed as a tool
for winnowing out the "important" few variables from the blizzard of
"irrelevant" background noise. Pogo was right: " We have seen the enemy --
and it is us."

(As I said, the Inferno awaits...)

Cheers to all,
Bert Gunter

DEFINITELY MY OWN OPINIONS HERE!



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of David Winsemius
Sent: Saturday, September 27, 2008 5:34 PM
To: Darin Brooks
Cc: r-help at stat.math.ethz.ch; ted.harding at manchester.ac.uk
Subject: Re: [R] FW: logistic regression

<snip />
#
I certainly appreciate your comments, Bert.  It is abundantly clear that I
won't be invited to any of the cocktail parties hosted by the "polite
circles".  I am not a statistician.  I am merely a geographer (in the field
of ecology) trying to develop a predictor to assist in a forestry-based
decision making process.  My work in the natural world has taught me that
NOTHING is predictable ... and the very idea of a bullet-proof ecological
predictive model is doomed to fail.  
That said, there ARE some basic predictors that assist foresters in their
salvage decisions.  They use these on a daily basis.  The problem is that
most of the evidence and modeling is anecdotal.  There really are no models
in the field that I am working in.  And for good reason ... The natural
world isn't interested in being modeled.  I think we can all agree on this -
guru or not.
But even the most basic predictive model (using only the GIS/mappable data
that is readily available to most users) is a starting point.  The resulting
dataset(s) of this potential model will be followed up and field-verified.
Providing this simple starting point (or catalyst, if you will) could
potentially save A LOT of time and money.
What I need to do is to isolate the best available variables into a model
and assign a confidence to it.  It doesn't have to change everyone's world
... it just has to change the way of thinking in my small little world.
These past few days have been an education for me in the subject of stepwise
regression.  I approach it with much more apprehension now.  So if nothing
else good comes of this discussion/exercise/experience ... I've learned
something.

Darin Brooks           

-----Original Message-----
From: Bert Gunter [mailto:gunter.berton at gene.com] 
Sent: Sunday, September 28, 2008 6:26 PM
To: 'David Winsemius'; 'Darin Brooks'
Cc: r-help at stat.math.ethz.ch; ted.harding at manchester.ac.uk
Subject: RE: [R] FW: logistic regression


<snip />



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of David Winsemius
Sent: Saturday, September 27, 2008 5:34 PM
To: Darin Brooks
Cc: r-help at stat.math.ethz.ch; ted.harding at manchester.ac.uk
Subject: Re: [R] FW: logistic regression

<snip />

--
David Winsemius
Heritage Laboratories
On Sep 27, 2008, at 7:24 PM, Darin Brooks wrote:

#
Darin Brooks wrote:
Darin,

I think the point is that the confidence you can assign to the "best 
available variables" is zero.  That is the probability that stepwise 
variable selection will select the correct variables.

It is probably better to build a model based on the knowledge in the 
field you alluded to, rather than to use P-values to decide.

Frank Harrell

#
Wow.  I had no idea.  I was told to be wary ... But nothing this bold.

I appreciate your straightforward advice.

I will be exploring the R packages rpart, earth, and gbm.  Dr. Elith has
generously provided me with literature and R support in the boosted
regression tree arena.  I will leave stepwise logistic regression alone.

Any parting advice regarding narrowing down the variables from the unruly 44
to about 8 or 10?  (In addition to your advice regarding redundancy analysis
and penalized maximum likelihood estimation). 

And I visited your website, Dr. Harrell.  A LOT of help there.  I will also
be purchasing your book this week.  Wish I had stumbled on this forum a
year ago.

Thanks again.      

-----Original Message-----
From: Frank E Harrell Jr [mailto:f.harrell at vanderbilt.edu] 
Sent: Sunday, September 28, 2008 8:23 PM
To: Darin Brooks
Cc: 'Bert Gunter'; r-help at r-project.org
Subject: Re: [R] FW: logistic regression
Darin Brooks wrote:
<snip />

#
At the risk of my also spending time in the "Inferno", I would 
suggest your problem resembles principal components analysis or 
factor analysis. In this, you would look for a set of linear 
transforms of your variables that have a smaller dimensionality, but 
nearly the same spanned subspace.

Before you embark on any of this, you should ask what you are 
interested in: 1) A physical model that can be interpreted, and may 
hold true in future experiments; or 2) A numerical representation of 
your data that interpolates it. For the former, there is no 
substitute for expert knowledge in formulating models, and then you 
can see if they are in discord with your data. For the latter, the 
PCA approach can condense your predictor set and avoid collinearity.
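The condensation described here can be sketched in base R: replace many correlated predictors with their first few principal components (a simulated matrix stands in for the 44 real variables):

```r
## Principal components as a data-reduction step: the leading components
## span most of the predictor variability and are mutually uncorrelated.
set.seed(1)
X  <- matrix(rnorm(100 * 8), 100, 8)   # stand-in for the real predictors
pc <- prcomp(X, scale. = TRUE)
summary(pc)        # proportion of variance explained per component
Z  <- pc$x[, 1:3]  # use the first few components as model inputs
```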
At 10:49 PM 9/28/2008, Darin Brooks wrote:
================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral at lcfltd.com
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947

"Vere scire est per causas scire"
#
On Sun, 2008-09-28 at 19:26 -0600, Darin Brooks wrote:
Hi Darin,

As an ecologist myself, I think you overstate things a bit here. Clearly
there are features of the "ecological" world out there that follow
"rules" --- otherwise we might as well consign the whole branch of
theoretical ecology to the bin. These things can be modelled, but we are
often looking for a relatively small signal in a whole load of noise.

You really do need to "model" your system in order to make predictions
about it. How you go about the "modelling" is another matter.

I think you may be better off with some of the more algorithm-centric
data mining methods that are currently the rage in some quarters of
ecology (predicting climate change effects on species +/-, change in
range etc); things like regression/classification trees and
randomForest, boosting etc. Names to look out for in this literature are
JR Leathwick, Antoine Guisan, Miguel B Araujo and J Elith. You'll find a
lot of work looking at these modern methods in these authors' work, and
that of others. These methods have less statistical theoretical
underpinnings, but can be evaluated on how well they make predictions.
Which is often the whole point of doing the analysis.
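For a first taste of the tree-based route Gavin describes, rpart ships with R (randomForest and gbm are add-on packages); again on simulated data rather than anything from this thread:

```r
## A classification tree fit with rpart; predictions are evaluated on how
## well they generalize, rather than on per-term p-values.
library(rpart)
set.seed(1)
dat  <- data.frame(y  = factor(rbinom(200, 1, 0.5)),
                   x1 = rnorm(200), x2 = rnorm(200))
tree <- rpart(y ~ x1 + x2, data = dat, method = "class")
pred <- predict(tree, dat, type = "class")
```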
I too would like to thank the contributors to this thread --- very
informative!

All the best,

G
#
On Sun, 2008-09-28 at 21:23 -0500, Frank E Harrell Jr wrote:
<snip />
Hi Frank, et al

I don't have Darin's original email to hand just now, but IIRC he turned
on the testing by p-values, something that add1 and drop1 do not do by
default.

Venables and Ripley's MASS contains stepAIC and there they make use of
drop1 in the regression chapters (Apologies if I have made sweeping
statements that are just plain wrong here - I'm at home this morning and
don't seem to have either of my two MASS copies here with me).
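For comparison, the AIC-based route via MASS::stepAIC (MASS ships with R) looks like this on simulated data; as Frank notes below, it inherits the same selection problems:

```r
## Backward elimination by AIC rather than by per-term p-values.
## Simulated data; the selected subset reflects this particular sample.
library(MASS)
set.seed(1)
dat  <- data.frame(y  = rbinom(80, 1, 0.5),
                   x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
full <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)
sel  <- stepAIC(full, direction = "backward", trace = FALSE)
formula(sel)
```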

Would the same criticisms made by yourself and Bert, amongst others, in
this thread be levelled at simplifying models using AIC rather than via
p-values? Part of the issue with stepwise procedures is that they don't
correct the overall Type I error rate (even if you use 0.05 as your
cut-off for each test, overall your error rate can be much larger). Does
AIC allow one to get out of this bit of the problem with stepwise
methods?

I'd appreciate any thoughts you or others on the list may have on this.

All the best, and thanks for an interesting discussion thus far.

G
#
Gavin Simpson wrote:
AIC is just a restatement of P-values, so using AIC one variable at a 
time is just like using a different alpha.  Both methods have problems.

Frank

#
Frank (and any others who want to share an opinion):

What are your thoughts on model averaging as part of the above list?


--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
#
Greg Snow wrote:
Model averaging has good performance but no advantage over fitting a 
single complex model using penalized maximum likelihood estimation.

Frank