Aitor,
Thanks very much for this, I am very grateful.
I have generated a ROC plot and a calibration curve (attached).
I also calculated the AUC: 0.7201762.
However, if I am honest, I am unsure where to go from here.
1. How does this tell me how effective the model is at predicting the
response?
2. How can I use this information to predict a response from my test data
set, i.e. if I only have the factors and I want to know whether a species is
threatened or not?
Thanks very much
Chris
On 27 Aug 2010, at 00:07, Aitor Gastón González wrote:
Chris,
The predicted probabilities of a binomial GLM (i.e., logistic regression)
should not be interpreted as absolute values, because they depend largely on
the prevalence in the training sample (the proportion of threatened
species in your case).
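To make the dependence on prevalence concrete, here is a minimal sketch on simulated data (the objects n, x, y, d, fit and fit_sub are made up for illustration and have nothing to do with your traits data). For a logistic GLM with an intercept, the mean fitted probability equals the prevalence in the training sample, so refitting on a sample with fewer positive cases pulls the predictions down.

```r
# Simulated data only; all names here are illustrative.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1 + x))  # fairly high prevalence
d <- data.frame(x = x, y = y)

fit   <- glm(y ~ x, data = d, family = binomial)
p_hat <- predict(fit, type = "response")

# Mean fitted probability matches the observed prevalence
print(c(prevalence = mean(d$y), mean_pred = mean(p_hat)))

# Discard roughly half of the y == 1 cases and refit: the sample
# prevalence drops, and the fitted probabilities drop with it.
keep      <- d$y == 0 | runif(n) < 0.5
d_sub     <- d[keep, ]
fit_sub   <- glm(y ~ x, data = d_sub, family = binomial)
p_hat_sub <- predict(fit_sub, type = "response")
print(c(prevalence = mean(d_sub$y), mean_pred = mean(p_hat_sub)))
```

The relationship between x and y is the same in both fits; only the share of positive cases in the sample changes.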
I understand that you are interested in evaluating the predictive
performance of the model. There are many statistics to evaluate the
predictive performance of a logistic regression model. If you want to
use the predictions to rank species according to extinction risk you may
focus on discrimination, e.g. AUC (area under ROC curve). AUC may be
interpreted as the probability that the prediction for a threatened
species chosen at random is larger than the prediction for a
non-threatened species chosen at random. If you are concerned with the
reliability of the predictions (i.e., level of agreement between
predicted and actual probabilities) you may evaluate calibration (e.g.
calibration slope). If your model is well calibrated, you should find
approximately 50% of threatened species among those that yielded a
predicted probability of 0.5, 30% among those that yielded 0.3 and so on.
You can try the val.prob function of the Design package to calculate
discrimination and calibration measures. You will find useful advice on
predictive performance evaluation of logistic regression models in any
of these books:
Harrell, F.E., 2001. Regression Modeling Strategies: With Applications to
Linear Models, Logistic Regression, and Survival Analysis. Springer, New
York, NY, USA, 568 pp.
Steyerberg, E.W., 2009. Clinical Prediction Models: A Practical Approach
to Development, Validation, and Updating. Springer, New York, NY, USA,
497 pp.
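If it helps to see what the discrimination and calibration measures mentioned above actually compute, here is a rough sketch in base R on simulated data. It is not the val.prob implementation, and all object names are illustrative. The AUC is computed directly from its pairwise definition, the calibration slope by regressing the outcome on the logit of the predictions, and a grouped calibration table compares mean predictions with observed proportions.

```r
# Simulated data only; all names here are illustrative.
set.seed(2)
n <- 400
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + x))
fit   <- glm(y ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

# Discrimination: AUC as the proportion of (y == 1, y == 0) pairs in
# which the y == 1 case has the larger prediction; ties count one half.
pos <- p_hat[y == 1]
neg <- p_hat[y == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# The same quantity from the rank-sum (Mann-Whitney) formula
r <- rank(p_hat)
auc_rank <- (sum(r[y == 1]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

# Calibration slope: logistic regression of the outcome on the logit of
# the predictions. On the training data it is 1 by construction; it only
# becomes informative on new (validation) data.
lp <- qlogis(p_hat)
cal_slope <- unname(coef(glm(y ~ lp, family = binomial))["lp"])

# Grouped calibration: mean prediction vs. observed proportion in five
# equal-sized groups of predicted probability.
bins <- cut(p_hat, quantile(p_hat, 0:5 / 5), include.lowest = TRUE)
cal_table <- data.frame(mean_pred = tapply(p_hat, bins, mean),
                        observed  = tapply(y, bins, mean))

print(c(auc = auc, auc_rank = auc_rank, cal_slope = cal_slope))
print(cal_table)
```

For an honest assessment, p_hat and y should come from data not used to fit the model (or from resampling); computed on the training data, these measures tend to be optimistic.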
In case your sample is not very large, you may consider a simpler
model. If the factors used as predictors have several levels and the
training sample size is limited, your model may be overfitted. At least 10
events (number of threatened species, or of non-threatened species if those
are less frequent) per estimated parameter are recommended (note that each
factor with k levels will "spend" k-1 parameters).
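As a quick way to check this rule of thumb, the sketch below counts parameters and events for a fitted model. The data are simulated, and the factors f1 and f2 are hypothetical stand-ins for your predictors. I am assuming the intercept is not counted as a parameter, which is a common convention but still an assumption. Note how an interaction between a 4-level and a 3-level factor spends (4-1) + (3-1) + (4-1)*(3-1) = 11 parameters.

```r
# Simulated data only; f1, f2 and y are hypothetical.
set.seed(3)
n <- 120
d <- data.frame(
  f1 = factor(sample(c("a", "b", "c", "d"), n, replace = TRUE)),
  f2 = factor(sample(c("lo", "mid", "hi"), n, replace = TRUE))
)
d$y <- rbinom(n, size = 1, prob = 0.3)

fit <- glm(y ~ f1 * f2, data = d, family = binomial)

# Estimated parameters, excluding the intercept (assumed convention)
n_par <- length(coef(fit)) - 1

# "Events" = size of the less frequent outcome class
n_events <- min(table(d$y))

events_per_par <- n_events / n_par
print(c(n_par = n_par, n_events = n_events, events_per_par = events_per_par))
```

With about 120 observations this model falls well short of 10 events per parameter, which is the situation where dropping the interaction or merging factor levels is worth considering.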
Hope this helps,
Aitor
Dear List,
I am trying to predict the extinction risk of a species based on its life
history. I will detail my method below and would welcome comments as to
why the results are not as I expected.
First I fit my model:
model1 <- glm(THREAT~ HAB*BS + FR + WO + SEA + PD, data=traits,
family="binomial")
Where THREAT is TRUE (1) / FALSE (0).
Where BS, FR etc are factors with multiple levels.
I then predicted the probability of a species being threatened or not
using
print(predict(model1, type = "response"))
Example output:
         1          2          3          4          5          6          7
0.44659200 0.65221495 0.71357243 0.71357243 0.71357243 0.71357243 0.71357243
         8          9         10         11         12         13         14
0.71357243 0.65221495 0.65221495 0.65221495 0.65221495 0.65221495 0.65221495
I interpret this as species 1 having a 45% chance (probability) of being
threatened, and so on.
I then wanted to see how this relates to the "true" threat level, so I
looked at species 1 and it was classed as threatened, which disagrees
with the predict results, although marginally. In fact most of the
predict results do not agree with the "real" threat level: some species
have a probability of 0.17, which to me says they are non-threatened, but
in reality they are classed as threatened.
This is important because if these do not match, at least most of the
time, then how can I confidently predict the response of a species when I
don't know its "real" response?
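For reference, the comparison described above, and a prediction for species whose status is unknown, could look roughly like the sketch below. The data are simulated, the factor levels and the new_species rows are invented, and the 0.5 cut-off is an arbitrary choice rather than a recommended one. A single threatened species with a predicted probability of 0.17 is not by itself a contradiction: in a well calibrated model, about 17% of such species would be threatened.

```r
# Simulated stand-in for the traits data; levels and values are invented.
set.seed(4)
n <- 300
traits_sim <- data.frame(
  HAB = factor(sample(c("forest", "wetland", "grassland"), n, replace = TRUE)),
  PD  = rnorm(n)
)
traits_sim$THREAT <- rbinom(
  n, size = 1,
  prob = plogis(-0.3 + 0.8 * traits_sim$PD + 0.7 * (traits_sim$HAB == "wetland"))
)

fit <- glm(THREAT ~ HAB + PD, data = traits_sim, family = binomial)

# Cross-tabulate observed status against predictions cut at 0.5
p_train    <- predict(fit, type = "response")
pred_class <- factor(p_train > 0.5, levels = c(FALSE, TRUE))
tab <- table(observed = traits_sim$THREAT, predicted = pred_class)
print(tab)

# Predict for species with known traits but unknown status: supply a
# data frame with the same predictor names and factor levels.
new_species <- data.frame(
  HAB = factor(c("forest", "wetland"), levels = levels(traits_sim$HAB)),
  PD  = c(-1, 1.5)
)
p_new <- predict(fit, newdata = new_species, type = "response")
print(p_new)
```

The cross-table depends entirely on the cut-off chosen, so it is a cruder summary than a ranking or calibration measure computed from the probabilities themselves.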
I hope this makes sense.
Chris