Advice regarding model choice
Dear David,

Thanks a lot for your response. I have already tried the first solution you suggested, adding an observation-level random effect. Unfortunately, it ends with an error that I have not yet found a solution for:

Error: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate

My overall data set is also really large, around 35 million samples, although I could work with subsamples.

In general, though, I am not sure whether the effort is even worth it, based on the Q-Q plots I attached to my previous mail. The data do not seem to be fit well by either a Poisson or a negative binomial model.

Cheers,
Philipp

P.S.: David accidentally replied only to my address, which is why I am continuing the discussion here.
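One way to "work with samples" is to draw whole subjects rather than individual rows, so that every retained subject keeps its full sequence of texts. The following is a minimal sketch, assuming a data frame with the columns from the original post (subject, text_length, index, total_amount). The simulated data, the helper name sample_subjects, the rescaled predictors, and the use of nAGQ = 0 for a quick first fit are illustrative assumptions, not something tested on the real data. The second fit reuses the first fit's estimates as starting values, in the spirit of David's suggestion quoted below.

```r
set.seed(1)

# Simulated stand-in for the real data (hypothetical values):
# 200 subjects, each writing between 2 and 20 texts.
n_texts <- sample(2:20, 200, replace = TRUE)
dat <- data.frame(
  subject      = rep(seq_along(n_texts), times = n_texts),
  index        = sequence(n_texts),
  total_amount = rep(n_texts, times = n_texts)
)
dat$text_length <- rnbinom(nrow(dat), mu = exp(4.5 - 0.02 * dat$index), size = 2)

# Hypothetical helper: keep every row of a random subset of subjects.
sample_subjects <- function(d, n_subjects) {
  keep <- sample(unique(d$subject), n_subjects)
  d[d$subject %in% keep, , drop = FALSE]
}

sub <- sample_subjects(dat, 50)
sub$obs     <- factor(seq_len(nrow(sub)))          # observation-level id
sub$index_c <- as.numeric(scale(sub$index))        # centred and scaled
sub$total_c <- as.numeric(scale(sub$total_amount)) # centred and scaled

# The model fits are only attempted if lme4 is installed.
if (requireNamespace("lme4", quietly = TRUE)) {
  form <- text_length ~ index_c + total_c + (1 | subject) + (1 | obs)

  # Quick first pass: nAGQ = 0 is faster but less accurate.
  fit0 <- lme4::glmer(form, data = sub, family = poisson, nAGQ = 0)

  # Second pass: default Laplace fit started from the first-pass estimates,
  # timed and with optimizer progress printed.
  timing <- system.time(
    fit1 <- lme4::glmer(
      form, data = sub, family = poisson, verbose = 1,
      start = list(theta = lme4::getME(fit0, "theta"),
                   fixef = lme4::fixef(fit0))
    )
  )
  print(timing)
  print(lme4::fixef(fit1))
}
```

Sampling by subject keeps the random-effect structure intact, which row-wise sampling would not. Whether this avoids the PIRLS error on the real data is an open question.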
On 10/26/2015 04:55 PM, David Jones wrote:
Dear Philipp - I recently had a similar situation, and here is my 2c.

Regarding the model, two common ways to account for overdispersion are negative binomial and Poisson-lognormal models (for an easily implemented Poisson-lognormal model in lme4, see the code that accompanies a Ben Bolker manuscript at https://blogs.umass.edu/nrc697sa-finnj/2012/11/08/bolkers-reanalysis-of-owl-data/). They will probably run very slowly with that sample size: recently, with a dataset of N=600k, a negative binomial model took me about 2 hours even on Amazon EC2. You may want to use the verbose argument and the system.time() function to see that progress is being made and to know how long the model took once it has completed. These models took much longer to run than a regular Poisson model on my machines.

For the convergence issues, I found the following site very helpful: https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html

In particular, recoding predictors was helpful (mine were categorical, and changing the coding scheme helped somewhat). Using the estimates from a prior model run as starting values in a second run was also enough to eliminate the warnings (and at times, I used the second run's estimates as starting values in a third run). Admittedly, these issues may disappear or change when you change modeling approaches.

On Mon, Oct 26, 2015 at 7:15 PM, Philipp Singer <killver at gmail.com> wrote:

My current data to study looks like the following: Suppose that we repeatedly let subjects write a piece of text. We are now mainly interested in whether the consecutive writing has an effect on features of the written text. For example, we can hypothesize that the fifth text is shorter than the first.
To that end, the data looks like the following (based on only the text length feature):

subject | text_length (characters) | index | total_amount

I have found that total_amount is an important feature to consider because, e.g., the text length differs between people who write the text 100 times and those who write it only 5 times; we have no balanced setting here.

Sample data for one subject could look like:

subject | text_length | index | total_amount
1       | 100         | 1     | 3
1       | 78          | 2     | 3
1       | 80          | 3     | 3

A reasonable model my experiments have suggested is the following:

text_length ~ 1 + index + total_amount + (1|subject)

Alternatively, it might also be reasonable to add (1|total_amount) instead of incorporating it as a fixed effect.

In this model, as hypothesized, index has a negative coefficient.

My main reason for this post, though, is that I am unsure whether I can justify using a linear model here. The data is not normally distributed, and neither are the residuals. Below, I have plotted some Q-Q plots with different fits (based on a large sample).

http://imgur.com/a/jinav

Usually, I would handle such "count" data with a Poisson GLM; however, it does not converge. Also, as the plots suggest, a Poisson distribution does not seem to be a good fit here. Additionally, the Poisson fit indicates strong overdispersion.

An important thing to note here is that my real data is very, very large (imagine multiple millions of data points).

Do you guys have any suggestions on how to proceed?

Thanks!
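The overdispersion mentioned in the post can be quantified as the Pearson chi-square statistic divided by the residual degrees of freedom, which should be close to 1 for a well-fitting Poisson model. The following is a rough sketch in base R on simulated data. The data values and the helper name dispersion_ratio are assumptions for illustration, and a plain glm() ignores the subject grouping, so this is only a quick screen and not a substitute for checking the mixed model itself.

```r
set.seed(42)

# Simulated counts with the column layout from the post (hypothetical values).
n_texts <- sample(2:20, 300, replace = TRUE)
dat <- data.frame(
  subject      = rep(seq_along(n_texts), times = n_texts),
  index        = sequence(n_texts),
  total_amount = rep(n_texts, times = n_texts)
)
# Negative binomial draws, so the variance exceeds the mean by construction.
dat$text_length <- rnbinom(nrow(dat), mu = exp(4.5 - 0.02 * dat$index), size = 2)

# Hypothetical helper: Pearson chi-square divided by residual df.
dispersion_ratio <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

pois_fit <- glm(text_length ~ index + total_amount, family = poisson, data = dat)
ratio <- dispersion_ratio(pois_fit)
print(ratio)  # values far above 1 point to overdispersion
```

If the ratio stays far above 1 on a subsample of the real data, a plain Poisson model is unlikely to be adequate, and the negative binomial or Poisson-lognormal alternatives discussed above are the natural next candidates.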
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models