Cox regression model for matched data with replacement

1 message · Terry Therneau

Ok, I will try to do a short tutorial answer.

1. The score statistic for a Cox model is a sum, over the event times, of (x - xbar), where 
"x" is the covariate vector of the subject who had the event and xbar is the mean covariate 
vector for the population at that event time (a risk-weighted mean once the coefficients 
are nonzero).
   - the usual Cox model uses the mean of {everyone still at risk} as xbar
   - matched Cox models use a mean of {some subset of those at risk}, and work fine as 
long as that subset gives an honest estimate of xbar.  You do, of course, have to sample 
from those still at risk at that time point, since that is the xbar you are trying to 
estimate.  Someone who dies or is censored at time 10 can't be a control at time 20.
   - in an ordinary Cox model the program figures out who belongs in each xbar average all 
on its own, using the time variable.  In a matched model you need to supply the "who 
dances with who" information.  The usual way is to assign each set {subject who died + 
their controls} to a separate stratum.  (If there is only one death in each stratum then 
the time variable is not needed and you can plug in a dummy value; this is what clogit 
does.)  You can have more than one control per case, by the way.
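To make point 1 concrete, here is a minimal Python sketch of the score for a single scalar covariate.  It is only an illustration, not how any package computes it: the function names (full_risk_sets, score), the toy data, and the hand-picked matched sets are all made up for the example, and tied event times are ignored.  The one thing it is meant to show is that the ordinary and the matched score are the same sum; they differ only in who goes into each xbar.

```python
import math

def full_risk_sets(times, events):
    """One set per event: (case index, everyone else still at risk at that time)."""
    sets = []
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i:
            others = [j for j, t_j in enumerate(times) if t_j >= t_i and j != i]
            sets.append((i, others))
    return sets

def score(sets, x, beta=0.0):
    """Sum over sets of (x_case - xbar); xbar is weighted by exp(beta * x)."""
    total = 0.0
    for case, others in sets:
        members = [case] + others            # the case belongs to its own set
        w = [math.exp(beta * x[j]) for j in members]
        xbar = sum(wj * x[j] for wj, j in zip(w, members)) / sum(w)
        total += x[case] - xbar
    return total

# Toy data (hypothetical): four subjects, subject 2 is censored.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
x = [1, 0, 1, 0]

# Ordinary Cox model: the time variable alone decides who is in each average.
full_score = score(full_risk_sets(times, events), x)

# Matched model: we supply the sets ourselves, one control per case here,
# each control still at risk at its case's event time.
matched_sets = [(0, [1]), (1, [2])]
matched_score = score(matched_sets, x)
```

With beta = 0 the weights are all 1 and xbar is the plain mean, which is the version described above.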

2. Variance.  In the matched model you run the risk, a quite small risk, that the same 
person is picked again and again as a control.  If this unfortunate thing were to happen 
then the usual model-based variance would be too optimistic: because the fit depends so 
heavily on one single subject, it is more unstable than it looks.  Three solutions: 
a) don't worry about it (my usual approach),  b) when selecting controls, ensure that this 
doesn't happen (classic matched case-control),  c) use a robust variance.  For (c), make 
sure that each subject in the data set has a unique value for some variable "id", repeated 
on every row where that subject appears, and add "+ cluster(id)" to the model statement.
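As a rough illustration of the data layout behind option (c), the Python sketch below turns a list of matched sets into one row per (stratum, subject) and counts how often each id was reused as a control.  The helper names, the dictionary row format, and the subject ids are invented for the example; the fit itself would still be done in whatever software you use for the stratified model.

```python
from collections import Counter

def long_format(sets):
    """One row per subject per stratum; the same id may appear in many strata."""
    rows = []
    for stratum, (case, controls) in enumerate(sets, start=1):
        rows.append({"stratum": stratum, "id": case, "case": 1})
        for c in controls:
            rows.append({"stratum": stratum, "id": c, "case": 0})
    return rows

def control_reuse(rows):
    """Number of strata in which each id serves as a control."""
    return Counter(r["id"] for r in rows if r["case"] == 0)

# Hypothetical matched sets: (case id, [control ids]), sampled with replacement.
sets = [(10, [42, 17]), (11, [42, 23]), (12, [42, 17])]
rows = long_format(sets)
reuse = control_reuse(rows)

# Subjects used as a control more than once: these are the rows that a
# robust variance clustered on id would treat as correlated.
repeated = {subject: k for subject, k in reuse.items() if k > 1}
```

If the reuse counts are all 1 or 2, option (a) is probably fine; a subject who shows up in a large share of the strata is the situation the robust variance is there for.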

3. The most common mistake in matching is to exclude, at a given death time t, any subject 
with a future event from the list of potential controls at time t.  Someone who dies later 
is still at risk at time t and is a perfectly legitimate control there.  Excluding them 
does not lead to an unbiased estimate of xbar, and the resulting numerical bias in the 
coefficients is shockingly large.
   There are cleverer ways to pick the subset at each event time, e.g., if you have some 
prior information on all the subjects that can classify them into high/medium/low risk. 
Survey sampling principles then come into play for the selection, and the xbar at each time 
is replaced with an appropriate weighted survey estimate.  See various papers by Bryan 
Langholz.
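The size of the bias from the mistake in point 3 is easy to see in a small simulation.  The Python sketch below is only that, a sketch under stated assumptions: one binary covariate, a true hazard ratio of 2, exponential event times, everyone censored at time 1, and one control per case.  All names and settings are made up for the illustration.  With 1:1 matching and a binary covariate the conditional likelihood estimate of the log hazard ratio reduces to log(n10 / n01), the ratio of the two kinds of discordant pairs, so no model-fitting code is needed.

```python
import bisect
import math
import random

def simulate(n=5000, beta=math.log(2.0), cutoff=1.0, seed=1):
    """Return (time, event, x) tuples sorted by time; hazard is exp(beta * x)."""
    rng = random.Random(seed)
    subjects = []
    for _ in range(n):
        x = rng.randint(0, 1)
        t = rng.expovariate(math.exp(beta * x))
        subjects.append((min(t, cutoff), t < cutoff, x))
    subjects.sort(key=lambda s: s[0])
    return subjects

def matched_estimates(subjects, seed=2):
    """Log hazard ratio from 1:1 matching under two control-selection rules."""
    rng = random.Random(seed)
    n = len(subjects)
    # Sorted positions of subjects who never have an event.
    never_event = [k for k, s in enumerate(subjects) if not s[1]]
    # counts[rule] = [case exposed / control not, case not / control exposed]
    counts = {"risk_set": [0, 0], "never_event": [0, 0]}
    for i, (_, event, x_case) in enumerate(subjects):
        if not event or i == n - 1:
            continue
        # Correct rule: anyone still at risk, future cases included.
        picks = {"risk_set": rng.randrange(i + 1, n)}
        # Mistaken rule: only those at risk who never have an event.
        k = bisect.bisect_right(never_event, i)
        if k < len(never_event):
            picks["never_event"] = never_event[rng.randrange(k, len(never_event))]
        for rule, j in picks.items():
            x_ctrl = subjects[j][2]
            if x_case == 1 and x_ctrl == 0:
                counts[rule][0] += 1
            elif x_case == 0 and x_ctrl == 1:
                counts[rule][1] += 1
    return {rule: math.log(c[0] / c[1]) for rule, c in counts.items()}

estimates = matched_estimates(simulate())
print("true log hazard ratio:", math.log(2.0))
print(estimates)
```

In this setup the "risk_set" estimate should land close to log(2), while the "never_event" estimate is expected to come out well above it, since the pool of never-event controls is depleted of high-risk subjects.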

Terry T
On 08/13/2014 07:26 AM, John Pura wrote: