
basehaz() in package 'Survival' and warnings() with coxph

4 messages · David Winsemius, hazbro

#
Hello,

I have a couple of questions with regards to fitting a coxph model to a data
set in R:

I have a very large dataset and wanted to get the baseline hazard using the
basehaz() function in the 'survival' package.
If I use all the covariates, then the output from basehaz(fit), where fit is
a model fit using coxph(), gives 507 unique values for the time and the
corresponding cumulative hazard. However, if I use a subset of the
variables, basehaz() gives 611 values for the time and cumulative hazard.

The latter makes more sense, as out of my 73000 observations there are 611
unique times. However, I wish to use all the variables to get the baseline
hazard.
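To illustrate what may be going on (a sketch on the built-in lung data, not
my real data set): coxph() silently drops any row with an NA in a covariate,
so adding covariates can shrink the set of complete cases, and with it the
set of unique times that basehaz() reports.

```r
library(survival)

# Model with a covariate that has no missing values in 'lung'
fit_small <- coxph(Surv(time, status) ~ age, data = lung)

# Adding covariates that contain NAs silently drops those rows
fit_big <- coxph(Surv(time, status) ~ age + meal.cal + wt.loss, data = lung)

c(fit_small$n, fit_big$n)   # rows actually used by each fit; fit_big uses fewer

# Fewer complete cases can only leave the same or fewer unique times
c(nrow(basehaz(fit_small)), nrow(basehaz(fit_big)))
```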

Also I get a couple of warnings when I fit the coxph() model:

1) In fitter(X, Y, strats, offset, init, control, weights = weights,  :
  Loglik converged before variable   ; beta may be infinite.

2)  X is deemed to be singular.

I am aware that the second one is because of multicollinearity, and since
none of the coefficients are infinite I thought I could ignore both warnings.
Removing the variables that cause these problems does not solve my problem
with basehaz().
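For reference, the "beta may be infinite" warning is easy to reproduce with
a covariate that perfectly orders the failures (a toy sketch, nothing to do
with my actual data):

```r
library(survival)

set.seed(1)
tt  <- rexp(40)                      # event times, all observed
ev  <- rep(1, 40)
sep <- as.numeric(tt < median(tt))   # 1 for exactly the earliest failures

# Monotone partial likelihood: coxph() typically warns
# "Loglik converged before variable 1; beta may be infinite."
fit <- coxph(Surv(tt, ev) ~ sep)
coef(fit)                            # a huge coefficient, not a usable estimate
```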

The only reason I can think of is that maybe the baseline hazard is
undefined at some time points.

Thanks for the help.





--
View this message in context: http://r.789695.n4.nabble.com/basehaz-in-package-Survival-and-warnings-with-coxph-tp4639687.html
Sent from the R help mailing list archive at Nabble.com.
1 day later
#
My sessionInfo is as follows:

R version 2.15.1 (2012-06-22)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] mi_0.09-16       arm_1.5-05       foreign_0.8-50   abind_1.4-0
 [5] R2WinBUGS_2.1-18 coda_0.15-2      lme4_0.999999-0  Matrix_1.0-6
 [9] lattice_0.20-6   car_2.0-12       nnet_7.3-4       MASS_7.3-20
[13] MuMIn_1.7.11     survival_2.36-14

loaded via a namespace (and not attached):
[1] grid_2.15.1   nlme_3.1-104  stats4_2.15.1
It will be difficult to reproduce an example here as the data set I am using
is very large. I can give you an example:

fit3.1 <- coxph(formula = y ~ sex + ns(ageyrs, df = 2) + AdmissionSource +
    X1 + X2 + X3 + X5 + X6 + X7 + X11 + X12 + X13 + X14 + X15 +
    X16 + X17 + X18 + X19 + X20 + X22 + X24 + X25 + X26 + X27 +
    X28 + X29 + X32 + X33 + X35 + X38 + X39 + X40 + X41 + X42 +
    X43 + X44 + X47 + X49 + X53 + X54 + X55 + X58 + X59 + X62 +
    X68 + X69 + X78 + X80 + X81 + X84 + X85 + X86 + X93 + X95 +
    X98 + X100 + X101 + X102 + X105 + X107 + X108 + X109 + X110 +
    X112 + X113 + X114 + X115 + X116 + X117 + X121 + X122 + X125 +
    X127 + X128 + X129 + X131 + X132 + X133 + X134 + X138 + X140 +
    X143 + X145 + X146 + X148 + X150 + X151 + X153 + X157 + X158 +
    X159 + X164 + X197 + X200 + X202 + X203 + X204 + X205 + X211 +
    X214 + X217 + X224 + X228 + X233 + X237 + X244 + X249 + X254 +
    X258 + X259 + X260 + CharlsonIndex + ethnic + day + season +
    ln, data = dat2)

haz <- basehaz(fit3.1)  # gives 507 unique haz$time values

fit2 <- coxph(y ~ ns(ageyrs, df = 2) + day + ln + sex + AdmissionSource +
    season + CharlsonIndex, data = dat1)

haz <- basehaz(fit2)  # gives 611 unique haz$time values


I get the following warnings() with fit3.1:
Warning message:
In fitter(X, Y, strats, offset, init, control, weights = weights,  :
  Loglik converged before variable   ; beta may be infinite.

Also, the coefficients of the variables for which the warning occurs are very
high. The Wald test suggests dropping these terms, whereas the LRT suggests
keeping them. What should I do in terms of model selection?
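For concreteness, this is the kind of comparison I mean, sketched on the
built-in lung data rather than on my own models (small and big are just
stand-ins for fit2 and fit3.1):

```r
library(survival)

small <- coxph(Surv(time, status) ~ sex, data = lung)
big   <- coxph(Surv(time, status) ~ sex + age, data = lung)

anova(small, big)            # likelihood-ratio test for the added term
summary(big)$coefficients    # Wald z and p for each coefficient
```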



#
On Aug 9, 2012, at 5:53 PM, hazbro wrote:

[snip]

Regardless of the discrepancy, it appears you have over 100-200 variables
with only 500-600 events. That suggests the warning should be heeded,
because you probably have numerical stability problems, possibly highly
collinear variables or complete separation on various strata.
I worry that you have already committed many modeling sins. If you started
out with 260 variables and dropped a bunch of them with a step-down
procedure, then you are currently underestimating the number of degrees of
freedom that you should be using. My guess is that if you used the proper
degrees of freedom, the LRT would not support keeping them. You have too few
data points to support that many variables. As Bert Gunter often
recommends: get thee to a statistician.
#
Okay, the data sets dat1 and dat2 are the same; dat1 just has fewer
covariates.

David, I understand your concern about the number of events and the number
of variables I am using. However, 611 is only the number of unique times at
which events occur, whereas there are 6987 events in my data of 77272
observations.
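The distinction is easy to see with toy numbers (made-up data, just
mimicking the counts above):

```r
set.seed(2)
event_time <- sample(1:611, 6987, replace = TRUE)  # heavily tied event days
status     <- rep(1, 6987)                         # every row is an event

sum(status == 1)                         # 6987 events in total
length(unique(event_time[status == 1]))  # at most 611 unique event times
```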

As for the model selection procedure, I am using stepwise forward selection
with add1() and checking the Wald test. The model I start with is fit2, and
I add the X variables, which are all factors with two levels. There are
three variables that have very high coefficients, so do you think something
like the Durbin-Watson test should be used here?




