
Simulating Case Control Data

2 messages · R. W., Thomas Lumley

#
Dear R-Help-List,

I was wondering if anyone had experience simulating
case-control data in R?  I've been looking through
literature, and found that the main examples make
heavy parametric assumptions on the distributions of
the exposure (E), covariates (Z), and disease status
(D).  I would appreciate any guidance toward
resources/examples/literature that simulate
case-control data with fewer assumptions about the
underlying distributions of E, Z and D.

Thank you,
-R


#
On Mon, 10 Dec 2007, R. W. wrote:

            
I think the only simple method that lets you specify an arbitrary 
population distribution of predictors, and that does not rely on the 
logistic regression model being true, is to simulate cohorts and then 
take a case-control sample from each one.

E.g., for a case-control sample of 500 cases and 1000 controls where there 
is about a 1% cumulative incidence:
1. Generate all your predictor variables for a cohort of 50,000 people, 
from any distributions you want
2. Specify the disease model. This could be logistic
     logit(p(Y=1))=eta = b0+b1x1+b2x2+...
     p = exp(eta)/(1+exp(eta))
   or it could be anything else.
3. Now sum(p) gives the expected number of cases. Adjust b0 so that this 
is a bit bigger than your desired number, e.g. 550, so that the cohort is 
unlikely to end up with fewer than the 500 cases you need.
4. Generate Y for the population by rbinom(50000,1,p)
5. Choose 500 cases and 1000 controls using sample().
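
The steps above could be sketched in R roughly as follows. This is only an
illustration under assumptions not in the original message: the predictor
distributions (x1 exponential, x2 binary), the coefficients b1 and b2, the
use of uniroot() to tune b0, and the redraw loop that guards against a
cohort with too few cases are all hypothetical choices.

```r
set.seed(1)

N          <- 50000   # cohort size
n_cases    <- 500     # cases wanted in the case-control sample
n_controls <- 1000    # controls wanted
target     <- 550     # expected cases in the cohort, a bit above n_cases

# Step 1: predictors from any distributions you like (illustrative choices).
x1 <- rexp(N)              # skewed continuous exposure
x2 <- rbinom(N, 1, 0.3)    # binary covariate

# Step 2: disease model; logistic here, but p could come from anything.
b1 <- 0.5                  # hypothetical log odds ratios
b2 <- 0.7
risk <- function(b0) plogis(b0 + b1 * x1 + b2 * x2)

# Step 3: adjust b0 so that sum(p) matches the target expected case count.
expected_cases <- function(b0) sum(risk(b0))
b0 <- uniroot(function(b) expected_cases(b) - target,
              interval = c(-20, 5))$root
p <- risk(b0)

# Step 4: generate disease status for the whole cohort.
# The redraw is a safeguard added for this sketch, not part of the recipe.
repeat {
  y <- rbinom(N, 1, p)
  if (sum(y) >= n_cases) break
}

# Step 5: sample cases and controls from the cohort.
case_idx    <- sample(which(y == 1), n_cases)
control_idx <- sample(which(y == 0), n_controls)
cohort <- data.frame(y = y, x1 = x1, x2 = x2)
cc <- cohort[c(case_idx, control_idx), ]
```

The resulting data frame cc could then be passed to whatever analysis is
being evaluated, and the whole procedure repeated for each simulated
cohort.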

 	-thomas