Problems with model (assumptions)
Dear Philipp,

I'm missing the graphs for the data exploration step in the notebook. They would give you an idea of whether the relations with the explanatory variables are (log-)linear.

The residual plots from the Gaussian model are typical of what you get when modelling count data, so you need a Poisson or negative binomial distribution. Normal QQ-plots are irrelevant for GLMs, and plots of residuals versus fitted values are difficult to interpret. You should focus on residuals versus the explanatory variables (both fixed and random).

You could consider using length as an offset term. That seems to make more sense than using it as a random effect. Since length is the maximum body length per author, you would model the relative body length per author.

There are other R packages that can fit GLMMs, e.g. glmmADMB or INLA. You can try them and see what happens.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey

2015-11-20 14:17 GMT+01:00 Philipp Singer <killver at gmail.com>:
Dear all,

I am currently trying to investigate the effect of time (in the sense of an index) on the length of a text that people write (body_length). My hypothesis is, e.g., that the later someone writes a text, the shorter it is. Not all authors write the same number of texts, so I have an additional variable that captures the maximum index (length). One further thing to note is that authors can have several "sessions" on different days.

I started with a linear mixed-effects model. However, the basic assumptions of linear regression do not seem to hold (e.g., normality of residuals), which is to be expected for count data (text length). I have therefore tried several other GLMs and adaptations. For most of them, however, the assumptions do not hold either, and I receive several odd errors for some models. The best results are achieved when I simply log-transform the outcome and use linear regression. However, as suggested in the literature, this is not the proper way of treating count data.

One thing to note is that my data set is enormous (50 million data points). I have worked with a sample of 1 million data points here; the results for the whole data set are similar, though.

Instead of highlighting all the results individually in this mail, I have prepared an IPython notebook (using R and lme4) that should convey the main procedure I have followed so far. It can be found here:
https://nbviewer.jupyter.org/gist/anonymous/2897dd277a35a0df52ea

I am hoping for some advice on how to proceed.

Thanks in advance!
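As a hedged illustration of the suggestion made in the reply above (a count distribution with length as an offset, checked with residuals versus an explanatory variable), here is a minimal R sketch using lme4. The data are simulated; the column name `author` and all coefficient values are assumptions made for the example, and `length` is taken to be the number of texts per author, as described in the question. This is not the model from the notebook.

```r
# Minimal sketch: Poisson GLMM with log(length) as an offset, fitted with lme4.
# `author` and every simulated value below are assumptions for illustration;
# `body_length`, `index` and `length` follow the variable names in the thread.
library(lme4)

set.seed(1)
n_authors <- 200
texts_per_author <- sample(5:30, n_authors, replace = TRUE)

d <- data.frame(
  author = factor(rep(seq_len(n_authors), times = texts_per_author)),
  index  = sequence(texts_per_author),                 # 1..length within each author
  length = rep(texts_per_author, times = texts_per_author)
)

# Simulate counts whose expected value shrinks with index (true slope -0.02),
# with a random intercept per author and log(length) entering as an offset.
author_effect <- rnorm(n_authors, sd = 0.3)
eta <- log(d$length) + 3 - 0.02 * d$index + author_effect[as.integer(d$author)]
d$body_length <- rpois(nrow(d), lambda = exp(eta))

# offset(log(length)) fixes the coefficient of log(length) at 1, so the
# model describes body_length relative to length.
fit <- glmer(body_length ~ index + offset(log(length)) + (1 | author),
             family = poisson, data = d)
print(fixef(fit))

# Residuals versus an explanatory variable rather than versus fitted values.
plot(d$index, residuals(fit, type = "pearson"),
     xlab = "index", ylab = "Pearson residuals")
```

If the Pearson residuals show clear overdispersion, lme4's `glmer.nb()` accepts the same kind of formula and fits a negative binomial GLMM instead.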
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models