
GAMM big data (70K rand effects) guidance

Ken Beath

Sat, May 9, 2015 3:43 AM

Long vectors, i.e. vectors of length 2^31 or more, have been
progressively added to R, but not everywhere. In this case I think it is
because an array can't yet have bounds of this size, although a vector
can.

Error in cbind(x, y) :
  long vectors not supported yet:
../../../../R-3.1.3/src/include/Rinlinedfuns.h:137

Even if this is fixed, the C and Fortran code in many packages would need to
be modified as well.
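
To give a feel for the size limit mentioned above, here is a rough back-of-envelope sketch in R. The observation and patient counts come from the original post; the idea of a dense matrix with one column per patient is purely an illustrative assumption, not a claim about what gamm actually builds internally.

```r
# Largest element count a non-long vector can hold: 2^31 - 1
max_elems <- .Machine$integer.max

n_obs      <- 500e3  # total observations, from the original post
n_patients <- 70e3   # patients (random-effect levels), from the original post

# Hypothetical: a dense matrix with one row per observation and
# one column per patient (an assumption made only for illustration)
dense_elems <- n_obs * n_patients
exceeds     <- dense_elems > max_elems

# Widest dense matrix with n_obs rows that still fits under the limit
max_cols <- floor(max_elems / n_obs)

cat("elements needed:", format(dense_elems, big.mark = ","), "\n")
cat("exceeds limit:  ", exceeds, "\n")
cat("max columns:    ", max_cols, "\n")
```

With 500K rows, any intermediate matrix wider than a few thousand columns would cross the limit, which is consistent with cbind failing somewhere inside the fit.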

On 8 May 2015 at 01:31, Steve Bellan <steve.bellan at gmail.com> wrote:

Hi all,

I am working with a patient database of 70K HIV-infected individuals
followed over time since treatment initiation, with 500K total observations
that include a laboratory measurement (CD4 cell count, an indicator of
immunocompetence). I'm trying to use GAMM to model the CD4 trajectory as a
function of CD4 at treatment initiation (i.e. y-intercept) and other
covariate classes (sex, age, etc.). Thus far, I've struggled to fit GAMMs to
the entire data set.

I'm using a Gaussian family with log(CD4+1) as the response for now. With
gamm, this gives the following:

form <- as.formula('log(cd4 + 1) ~ sex + s(ayfu, by = CD4_cat_init, bs = "tp")')

print(system.time(tg1 <- gamm(form, data = nd, order.groups = F,
                              family = gaussian, random = list(PatientID = ~1))))

where ayfu is time since treatment initiation and CD4_cat_init is the CD4
count at treatment initiation broken into 5 categories.
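
For anyone wanting to try the call without the real data, the following is a minimal sketch on simulated data. The variable names (cd4, sex, ayfu, CD4_cat_init, PatientID) follow the post; the sample sizes, effect sizes and the simulated trajectory are all made-up assumptions, and the order.groups argument is left out for simplicity.

```r
library(mgcv)  # provides gamm(); nlme is loaded as a dependency

set.seed(1)
n_id  <- 50  # hypothetical number of patients (the real data has 70K)
n_obs <- 8   # hypothetical number of visits per patient

nd <- data.frame(
  PatientID = factor(rep(seq_len(n_id), each = n_obs)),
  ayfu      = runif(n_id * n_obs, 0, 5)  # years since treatment initiation
)
# Patient-level covariates, repeated across each patient's visits
nd$sex          <- factor(rep(rep(c("F", "M"), length.out = n_id), each = n_obs))
nd$CD4_cat_init <- factor(rep(rep(1:5, length.out = n_id), each = n_obs))

# Simulated counts: rising trend plus a patient-level random intercept
b_id   <- rnorm(n_id, sd = 0.3)[as.integer(nd$PatientID)]
nd$cd4 <- round(exp(5 + 0.5 * log1p(nd$ayfu) + b_id + rnorm(nrow(nd), sd = 0.2)))

form <- log(cd4 + 1) ~ sex + s(ayfu, by = CD4_cat_init, bs = "tp")
timing <- system.time(
  tg1 <- gamm(form, data = nd, family = gaussian,
              random = list(PatientID = ~1))
)
print(timing)
summary(tg1$gam)
```

Increasing n_id in steps gives a rough picture of how run time and memory grow before committing to the full data set.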

I ran that on a large-memory (1TB) node on our HPC cluster and, after 12
hours using between 300 and 500 GB of memory, it crashed:

Error in print(system.time(tg1 <- gamm(form, data = nd, order.groups = F,  :
  error in evaluating the argument 'x' in selecting a method for function 'print': Error in cbind(X1, X[[i]][, j] * X0) :
  long vectors not supported yet: bind.c:1301
Calls: system.time ... extract.lme.cov2 -> cbind -> tensor.prod.model.matrix -> cbind

Google tells me that this has to do with limits on R's array size. But I
don't totally follow how that is interacting with the gamm call.

I'm now trying out cubic regression splines (bs="cs" instead of "tp") with
gamm and also with gamm4. Running the code on subsets of the data (1K
individuals) suggests only a mild improvement from using "cs" for both
packages, and a *decrease* in speed using gamm4 instead of gamm. The latter
surprises me since I had thought that gamm4 was meant to be faster when the
number of random effects was large.

Eventually I'd like to use smoother-by-group interactions other than
CD4_cat_init (e.g. sex, age, etc.) and test whether trajectories are
significantly different between covariate classes using AIC. It would also
be nice to somehow characterize how variable individuals' trends are within
a covariate class, though I'm not exactly sure what's the best way to do
that.
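
One possible way to set up the AIC comparison described above is sketched below on simulated data. This is only an assumption about how such a comparison might be done: it fits a common smooth and a by-sex smooth with gamm using method = "ML" (so that fits with different fixed effects are comparable) and compares the AIC of the lme components. The data, effect sizes and the response name y are made up, and whether this marginal AIC is the right criterion for the real problem is a separate question.

```r
library(mgcv)

set.seed(2)
n_id  <- 60  # hypothetical number of patients
n_obs <- 8   # hypothetical number of visits per patient

nd <- data.frame(
  PatientID = factor(rep(seq_len(n_id), each = n_obs)),
  ayfu      = runif(n_id * n_obs, 0, 5)
)
nd$sex <- factor(rep(rep(c("F", "M"), length.out = n_id), each = n_obs))

# Simulated response on the log scale, with a sex-specific trajectory
b_id  <- rnorm(n_id, sd = 0.3)[as.integer(nd$PatientID)]
slope <- ifelse(nd$sex == "F", 0.6, 0.2)
nd$y  <- 5 + slope * log1p(nd$ayfu) + b_id + rnorm(nrow(nd), sd = 0.2)

# One shared trajectory versus one trajectory per sex
m_common <- gamm(y ~ sex + s(ayfu, bs = "cs"), data = nd,
                 random = list(PatientID = ~1), method = "ML")
m_by     <- gamm(y ~ sex + s(ayfu, by = sex, bs = "cs"), data = nd,
                 random = list(PatientID = ~1), method = "ML")

aic_tab <- AIC(m_common$lme, m_by$lme)
print(aic_tab)
```

A lower AIC for the by-sex model would point towards trajectories that differ between the sexes.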

But until I can get just one of these models to fit, these goals seem like
a long shot. I've struggled to find much documentation online regarding
fitting GAMMs to such large data sets, particularly one with so many random
effects. Hence the trial and error exploration of different splines &
packages. Does anyone have more concrete guidance on how to approach this
problem or helpful documentation? Help much appreciated!

Thanks,

Steve

Steve Bellan, PhD, MPH
Post-doctoral Researcher
Lauren Ancel Meyers Research Group
Center for Computational Biology and Bioinformatics
University of Texas at Austin
http://www.bio.utexas.edu/research/meyers/steve_bellan/


_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

*Ken Beath*
Lecturer
Statistics Department
MACQUARIE UNIVERSITY NSW 2109, Australia

Phone: +61 (0)2 9850 8516

Building E4A, room 526
http://stat.mq.edu.au/our_staff/staff_-_alphabetical/staff/beath,_ken/

