gamm4 error with large dataset
On 14-04-30 12:03 PM, Daniel Hocking wrote:
I am trying to predict daily water temperature from air temperature primarily but ideally would include other factors such as precipitation and landscape characteristics. I have paired air and water temperatures from 600+ sites over a ~10 year period. Some sites have daily temperatures for just a couple months and others for years, and some for a couple months sporadically in different years. I am trying to use a mixed effects gamm so I can include random effects of site and year and smooth over day of the year (dOY). My dataframe is and I get the following error when I run this code system.time(gamm4Full <- gamm4(temp ~ airTemp + airTempLagged1 + airTempLagged2 + prcp + prcpLagged1 + prcpLagged2 + Latitude + Longitude + Forest + Agriculture + BasinElevationM + ReachSlopePCNT + CONUSWetland + SurficialCoarseC + s(dOY) + prcp*airTemp, random = ~(1| site) + (1 | year), data = etS)) # Error in crossprod(root.phi %*% Zt) : # Cholmod error 'problem too large' at file ../Core/cholmod_aat.c, line 173 # In addition: Warning message: # In optwrap(optimizer, devfun, getStart(start, rho$lower, rho$pp), : # convergence code 1 from bobyqa: bobyqa -- maximum number of function evaluations exceeded My plan was to try gamm4 and if there was autocorrelation issues to switch to gamm within mgcv. I know bam is designed for large data but I?m not sure how I would code the random effects using bam. I know in general it?s s(dOY, bs = ?re?) but I?m not sure how to relate this to site and year. Ideally I would have random slopes for airTemp effects for each site because of things like ground water inputs that we don?t measure.
I can imagine that this problem is caused by the size of the fixed-effect matrix. A couple of thoughts (none of them practical, I'm afraid): * I was going to say that it's too bad that we haven't yet managed to implement a sparse model matrix structure; * then I was going to say that a potential trick/workaround for this (for many-level _categorical_ variables) is to treat the factor as a random effect, then use devFunOnly/modular structure to fix the theta parameter for that variable at a large value, making it a pseudo-fixed effect and getting the benefits of (1) a little bit of regularization and (2) model matrix sparsity -- but doing this within gamm4 would be harder/require more hacking * then I realized that your fixed-effect model matrix probably isn't sparse, because it looks like it's made up entirely of continuous covariates * that got me thinking about the fact that some of your continuous covariates only vary at higher levels (i.e. Lat/Long and presumably Forest, Agriculture, elevation, etc.), and wondering whether there would be any way to save space by going back to the underlying model formulation and writing this out in terms of another multiplication of higher-level covariates times an indicator matrix ... ... all of which is fascinating (to me at least) but none of which actually gets you any farther with your specific problem. Sorry. Ben Bolker