GAMM big data (70K rand effects) guidance
Long vectors i.e. vectors of about length 2^31 or longer have been progressively added to R, but not everywhere. In this case I think it is because an array can't yet have a bounds of this size, although a vector can.
x <- 1 x[2^31] <- 1 y <- 1 y[2^31] <- 1 z <- cbind(x,y)
Error in cbind(x, y) : long vectors not supported yet: ../../../../R-3.1.3/src/include/Rinlinedfuns.h:137 Even if this is fixed the C and Fortran code in many packages would need to be modified as well.
On 8 May 2015 at 01:31, Steve Bellan <steve.bellan at gmail.com> wrote:
Hi all, I am working with an patient data base of 70K HIV-infected individuals followed over time since treatment initiation, with 500K total observations that include a laboratory measurement (CD4 cell count?an indicator of immunocompetence). I?m trying to use GAMM to model the CD4 trajectory as a function of CD4 at treatment initiation (i.e. y-intercept) and other covariate classes (sex, age, etc). Thus, far I?ve struggled to fit GAMMs to the entire data set. I?m using a gaussian link function to log(CD4+1) for now. With gamm, this gives the following:
form <- as.formula('log(cd4 + 1) ~ sex + s(ayfu, by = CD4_cat_init,
bs=?tp")')
print(system.time(tg1 <- gamm(form, data = nd, order.groups=F,
family=gaussian, random=list(PatientID=~1)))) where ayfu is time since treatment initiation and CD4_cat_init is the CD4 count at treatment initiation broken into 5 categories. I ran that on a large memory (1TB) node on our HPC cluster and, after 12 hours using between 300-500 GB of memory, it crashed:
Error in print(system.time(tg1 <- gamm(form, data = nd, order.groups =
F, :
error in evaluating the argument 'x' in selecting a method for
function 'print': Error in cbind(X1, X[[i]][, j] * X0) :
long vectors not supported yet: bind.c:1301 Calls: system.time ... extract.lme.cov2 -> cbind ->
tensor.prod.model.matrix -> cbind Google tells me that this has to do with limits on R?s array size. But I don?t totally follow how that is interacting with the gamm call. I?m now trying out cubic regression splines (bs=?cs? instead of ?tp?) with gamm and also with gamm4. Running the code on subsets of the data (1K individuals) suggest only a mild improvement by using ?cs? for both packages, and a *decrease* in speed using gamm4 instead of gamm. The latter surprises me since I had thought that gamm4 was meant to be faster when the # of random effects was large. Eventually I?d like to use smoother-by-group interactions other than the CD4_cat_init (i.e. sex, age etc) and test whether trajectories are significantly different between covariate classes using AIC. It would also be nice to somehow characterize how variable individuals? trends are within a covariate class, though I?m not exactly sure what?s the best way to do that. But until I can get just one of these models to fit, these goals seem like a long shot. I?ve struggled to find much documentation online regarding fitting GAMMs to such large data sets, particularly one with so many random effects. Hence the trial and error exploration of different splines & packages. Does anyone have more concrete guidance on how to approach this problem or helpful documentation? Help much appreciated! Thanks, Steve Steve Bellan, PhD, MPH Post-doctoral Researcher Lauren Ancel Meyers Research Group Center for Computational Biology and Bioinformatics University of Texas at Austin http://www.bio.utexas.edu/research/meyers/steve_bellan/ < http://www.bio.utexas.edu/research/meyers/steve_bellan/> [[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
*Ken Beath* Lecturer Statistics Department MACQUARIE UNIVERSITY NSW 2109, Australia Phone: +61 (0)2 9850 8516 Building E4A, room 526 http://stat.mq.edu.au/our_staff/staff_-_alphabetical/staff/beath,_ken/ CRICOS Provider No 00002J This message is intended for the addressee named and may...{{dropped:9}}