Cox regression model for matched data with replacement - R-help

Wed, Aug 13, 2014 6:24 AM

Ok, I will try to do a short tutorial answer.

1. The score statistic for a Cox model is a sum, over the event times, of (x - xbar),
where "x" is the covariate vector of the subject who had the event and xbar is the
risk-weighted mean covariate vector (weights exp(x*beta)) for the population at that
event time.
- the usual Cox model uses the mean of {everyone still at risk} as xbar
- matched Cox models use a mean of {some subset of those at risk}, and work fine as
long as that subset is an honest estimate of xbar. You do, of course, have to sample from
those still at risk at the time point, since that is the xbar you are trying to estimate.
Someone who dies or is censored at time 10 can't be a control at time 20.
- in an ordinary Cox model the program figures out who belongs in each xbar average all
on its own, using the time variable. In a matched model you need to supply the "who
dances with who" information. The usual way is to assign each of the sets {subject who
died + their controls} to a separate stratum. (If there is only one death in each stratum
then the time variable will not be needed and you can plug in a dummy value; this is what
clogit does.) You can have more than one control per case, by the way.
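
As a rough illustration of point 1, here is a minimal Python sketch (standard library only) of the two versions of the score. The toy data and all function names (weighted_mean, full_score, build_matched_sets, matched_score) are made up for this example and are not part of any package. Controls are drawn without replacement inside a set, but the same subject may turn up again in a later set.

```python
import math
import random

def weighted_mean(xs, beta):
    """Risk-weighted mean: sum(x * exp(beta*x)) / sum(exp(beta*x))."""
    weights = [math.exp(beta * x) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)

def full_score(subjects, beta):
    """Usual Cox score: xbar is taken over everyone still at risk at each death."""
    score = 0.0
    for case in subjects:
        if case["status"] == 1:
            at_risk = [s["x"] for s in subjects if s["time"] >= case["time"]]
            score += case["x"] - weighted_mean(at_risk, beta)
    return score

def build_matched_sets(subjects, n_controls, rng):
    """One stratum per death: the case plus controls sampled from those at risk."""
    sets = []
    for case in subjects:
        if case["status"] != 1:
            continue
        # Eligible controls are still at risk at the death time; this
        # deliberately includes subjects who will themselves die later.
        pool = [s for s in subjects
                if s["time"] >= case["time"] and s is not case]
        controls = rng.sample(pool, min(n_controls, len(pool)))
        sets.append({"case": case, "controls": controls})
    return sets

def matched_score(sets, beta):
    """Matched score: xbar is taken over the case and its sampled controls only."""
    score = 0.0
    for st in sets:
        xs = [st["case"]["x"]] + [c["x"] for c in st["controls"]]
        score += st["case"]["x"] - weighted_mean(xs, beta)
    return score

# Toy data (hypothetical): status 1 = died, 0 = censored.
subjects = [
    {"id": 1, "time": 2.0, "status": 1, "x": 1.0},
    {"id": 2, "time": 3.0, "status": 0, "x": 0.0},
    {"id": 3, "time": 5.0, "status": 1, "x": 1.0},
    {"id": 4, "time": 7.0, "status": 0, "x": 0.0},
    {"id": 5, "time": 8.0, "status": 1, "x": 0.0},
]

sets = build_matched_sets(subjects, n_controls=1, rng=random.Random(0))
print(full_score(subjects, 0.0), matched_score(sets, 0.0))
```

If n_controls is large enough that every subject at risk is taken, the matched score reproduces the full one; with fewer controls it is a noisier estimate of the same quantity.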

2. Variance. In the matched model you run the risk, a quite small risk, that the same
person will be picked again and again as a control. If this unfortunate thing were to
happen, the usual model-based variance would be too optimistic: because the fit depends
so heavily on one single subject, it is more unstable than it looks. Three
solutions: a) don't worry about it (my usual approach), b) when selecting controls,
ensure that this doesn't happen (classic matched case control), c) use a robust variance.
For the last of these, make sure that each subject in the data set has a unique value for
some variable "id", repeated on every row where that subject appears, and add
"+ cluster(id)" to the model statement.
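
To see whether the situation in point 2 has actually arisen, one can simply count how often each subject serves as a control. The following is only a sketch: the matched sets, the field names and the helper functions (control_reuse, to_long_rows) are invented for illustration. It does not compute a robust variance; it only lays the data out one row per subject per set, with the same "id" carried along each time, which is the layout a clustered variance needs.

```python
from collections import Counter

def control_reuse(matched_sets):
    """Count the number of sets in which each subject id appears as a control."""
    return Counter(cid for st in matched_sets for cid in st["control_ids"])

def to_long_rows(matched_sets):
    """One row per subject per set; a reused control keeps the same id."""
    rows = []
    for set_no, st in enumerate(matched_sets, start=1):
        rows.append({"set": set_no, "id": st["case_id"], "case": 1})
        for cid in st["control_ids"]:
            rows.append({"set": set_no, "id": cid, "case": 0})
    return rows

# Hypothetical matched sets: one case id plus the ids of its sampled controls.
matched_sets = [
    {"case_id": 1, "control_ids": [4, 7]},
    {"case_id": 2, "control_ids": [4, 9]},
    {"case_id": 3, "control_ids": [4, 7]},
]

reuse = control_reuse(matched_sets)
repeated = {sid: n for sid, n in reuse.items() if n > 1}
rows = to_long_rows(matched_sets)
print(repeated)
print(len(rows))
```

If a handful of ids dominate the counts, that is the case where option c) is worth the trouble.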

3. The most common mistake in matching is to exclude, at a given death time t, any subject
with a future event from the list of potential controls at time t. This does not lead to
an unbiased estimate of xbar, and the resulting numerical bias in the coefficients is
shockingly large.
There are more clever ways to pick the subset at each event time, e.g., if you had some
prior information on all the subjects that can classify them into high/medium/low risk.
Survey sampling principles come into play for selection and the xbar at each time is
replaced with an appropriate weighted survey estimate. See various papers by Bryan Langholz.
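
The mistake in point 3 is easy to see in a small simulation. This is a sketch under assumptions that are mine, not from any real data set: a binary covariate, exponential survival with hazard 0.1*exp(beta*x), and everyone censored at a fixed time. It compares the average covariate among candidate controls when future cases are kept (correct) and when they are thrown out (the mistake). It looks only at the make-up of the control pool and does not fit a model.

```python
import math
import random

def simulate(n, beta, censor_time, rng):
    """Binary x, exponential event times, administrative censoring."""
    subjects = []
    for _ in range(n):
        x = rng.choice([0, 1])
        t = rng.expovariate(0.1 * math.exp(beta * x))
        subjects.append({
            "x": x,
            "time": min(t, censor_time),
            "status": int(t <= censor_time),
        })
    return subjects

def mean_xbar(subjects, exclude_future_cases):
    """Average, over death times, of the mean of x among candidate controls."""
    total, n_events = 0.0, 0
    for case in subjects:
        if case["status"] != 1:
            continue
        pool = [s for s in subjects
                if s["time"] >= case["time"] and s is not case]
        if exclude_future_cases:
            # The mistake: drop anyone who is going to have an event later.
            pool = [s for s in pool if s["status"] == 0]
        if pool:
            total += sum(s["x"] for s in pool) / len(pool)
            n_events += 1
    return total / n_events

rng = random.Random(2014)
subjects = simulate(n=1000, beta=1.0, censor_time=5.0, rng=rng)
proper = mean_xbar(subjects, exclude_future_cases=False)
wrong = mean_xbar(subjects, exclude_future_cases=True)
print(proper, wrong)
```

With a positive beta the subjects with x = 1 are the ones most likely to die later, so removing future cases strips them out of the pool and pulls xbar down.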

Terry T

On 08/13/2014 07:26 AM, John Pura wrote: