Hi all,
I am working with a patient database of 70K HIV-infected individuals
followed over time since treatment initiation, with 500K total observations
that include a laboratory measurement (CD4 cell count, an indicator of
immunocompetence). I'm trying to use a GAMM to model the CD4 trajectory as a
function of CD4 at treatment initiation (i.e. the y-intercept) and other
covariate classes (sex, age, etc.). So far I've struggled to fit GAMMs to
the entire data set.
I'm modelling log(CD4 + 1) with a gaussian family (identity link) for now.
With gamm, the call looks like this:
form <- as.formula('log(cd4 + 1) ~ sex + s(ayfu, by = CD4_cat_init, bs = "tp")')
print(system.time(tg1 <- gamm(form, data = nd, order.groups = FALSE,
                              family = gaussian, random = list(PatientID = ~1))))
where ayfu is time since treatment initiation and CD4_cat_init is the CD4
count at treatment initiation broken into 5 categories.
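For concreteness, a 5-level baseline factor like that can be built with
cut(); the variable name and breakpoints below are illustrative placeholders,
not my actual cutoffs:

# Illustrative only: cd4_init and these breakpoints are placeholders
nd$CD4_cat_init <- cut(nd$cd4_init,
                       breaks = c(0, 50, 100, 200, 350, Inf),
                       include.lowest = TRUE)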
I ran that on a large-memory (1 TB) node on our HPC cluster and, after 12
hours using 300-500 GB of memory, it crashed:
Error in print(system.time(tg1 <- gamm(form, data = nd, order.groups =
error in evaluating the argument 'x' in selecting a method for
function 'print': Error in cbind(X1, X[[i]][, j] * X0) :
long vectors not supported yet: bind.c:1301
Calls: system.time ... extract.lme.cov2 -> cbind ->
tensor.prod.model.matrix -> cbind
Google tells me this has to do with the 2^31 - 1 element limit on ordinary
(non-long) R vectors and matrices. But I don't totally follow how that limit
is being hit inside the gamm call.
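Back-of-the-envelope (and this is just my guess at the mechanism, assuming
the failing cbind() is assembling a dense matrix with one row per
observation), the sizes do get there quickly:

n_obs <- 5e5            # total observations
(2^31 - 1) / n_obs      # ~4295 columns before a dense matrix needs a long vector
n_obs * 4295 * 8 / 1e9  # ~17 GB for one such double-precision matrix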
I'm now trying out cubic regression splines (bs = 'cs' instead of 'tp') with
gamm and also with gamm4 (the gamm4 call is sketched below). Running the
code on subsets of the data (1K individuals) suggests only a mild
improvement from 'cs' with either package, and a *decrease* in speed using
gamm4 instead of gamm. The latter surprises me, since I had thought gamm4
was meant to be faster when the number of random effects is large.
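For reference, the gamm4 version I'm testing looks roughly like this (gamm4
takes an lme4-style random-effects formula rather than gamm's list):

library(gamm4)
tg4 <- gamm4(log(cd4 + 1) ~ sex + s(ayfu, by = CD4_cat_init, bs = "cs"),
             random = ~ (1 | PatientID),
             family = gaussian, data = nd)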
Eventually I'd like to use smoother-by-group interactions other than
CD4_cat_init (i.e. sex, age, etc.) and test whether trajectories are
significantly different between covariate classes using AIC (a sketch of the
kind of comparison I have in mind is below). It would also be nice to
characterize how variable individuals' trends are within a covariate class,
though I'm not exactly sure of the best way to do that.
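To make the AIC idea concrete, I'm imagining something like the following
once a full fit is feasible (hypothetical model names, comparing the lme
components of the two gamm fits):

# Shared smooth vs. sex-specific smooths (hypothetical sketch)
m_shared <- gamm(log(cd4 + 1) ~ sex + s(ayfu, bs = "cs"),
                 family = gaussian, random = list(PatientID = ~1), data = nd)
m_by_sex <- gamm(log(cd4 + 1) ~ sex + s(ayfu, by = sex, bs = "cs"),
                 family = gaussian, random = list(PatientID = ~1), data = nd)
AIC(m_shared$lme, m_by_sex$lme)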
But until I can get just one of these models to fit, these goals seem like
a long shot. I've struggled to find much documentation online regarding
fitting GAMMs to such large data sets, particularly ones with so many random
effects. Hence the trial-and-error exploration of different splines &
packages. Does anyone have more concrete guidance on how to approach this
problem or helpful documentation? Help much appreciated!
Thanks,
Steve
Steve Bellan, PhD, MPH
Post-doctoral Researcher
Lauren Ancel Meyers Research Group
Center for Computational Biology and Bioinformatics
University of Texas at Austin
http://www.bio.utexas.edu/research/meyers/steve_bellan/