Hello,

I have in hand a quite large and unbalanced dataset, in which a continuous dependent variable Y was measured under 3 different conditions (C) for about 3000 subjects (ID) (although not all subjects have Y values for all 3 conditions). Additionally, there is a continuous measure W which was measured for all subjects.

I am interested in testing the following:
- Is the effect of W significant overall?
- Is the effect of W significant at each level of C?
- Is the effect of C significant?

In order to answer this, I have specified the following model with lmer:

    lmer(Y ~ W * C + (1 | ID), data = df)

which seems to properly reflect the structure of the data (I might be wrong here; any suggestions would be welcome). However, when running the diagnostic plots I noticed a slope in the residuals plot, and a slope different from y = x in the observed vs. fitted plot (as shown below), which made me question the validity of the model for inference.

Could I still use this model for inference? Should I specify a different formula? Should I turn to lme and try to include different variances for each level of condition (C)? Any ideas?

I would really appreciate it if anyone could help me with this.

Thanks in advance,
Carlos Família
Model diagnostics show a slope in the residuals plot, and the slope of the observed vs. fitted plot differs from y = x
18 messages · Carlos Familia, Thierry Onkelinx, Paul Johnson +1 more
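[Editor's note: for concreteness, the model and the three tests described in the original post could be sketched as below. This is only one possible route, not something established in the thread; `df` is the poster's (unavailable) data frame, and the use of the `car` and `emmeans` packages is an assumption.]

```r
library(lme4)
library(car)      # Anova(): Wald chi-square tests for the fixed effects
library(emmeans)  # emtrends(): slope of W within each level of C

fit <- lmer(Y ~ W * C + (1 | ID), data = df)

# Overall tests for W, C, and their interaction (Type II Wald tests)
Anova(fit)

# Slope of W at each level of C, with confidence intervals and tests
summary(emtrends(fit, ~ C, var = "W"), infer = TRUE)

# Diagnostics discussed in the thread: residuals vs. fitted,
# and observed vs. fitted with the y = x reference line
plot(fit)
plot(fitted(fit), model.frame(fit)$Y, xlab = "Fitted", ylab = "Observed")
abline(0, 1)
```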
Dear Carlos,

Your plot got stripped from your mail. Try sending it as a PDF, or put it somewhere online and send us the URL.

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
Hello,

The image can be found here: https://s18.postimg.org/rbx2vh2ex/Pasted_Graphic_4.png

Best regards,
Carlos Família
Dear Carlos,

Can you show us a plot of the residuals versus W for each level of C? It looks like either the relationship between Y and W is not linear, or you are missing an important covariate.

Best regards,
Thierry
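[Editor's note: a per-condition residual plot of the kind requested here could be produced along these lines. A sketch only: it assumes the fitted `lmer` object `fit` and data frame `df` from the original post, and uses ggplot2 as one convenient option.]

```r
library(ggplot2)

# Build residuals against W within each condition; model.frame() keeps
# the rows actually used in the fit (important if some Y are missing)
df_diag <- cbind(model.frame(fit), resid = resid(fit))

ggplot(df_diag, aes(x = W, y = resid)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +  # curvature here suggests a non-linear Y~W relation
  facet_wrap(~ C) +                # one panel per level of C
  labs(x = "W", y = "Residuals")
```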
Dear Thierry,

The image can be found here: https://s4.postimg.org/lj5xf0rpp/Screen_Shot_2016_10_03_at_09_44_28.png

Let me add another thing to the discussion. While trying different models, I fitted the following:

    lmer(Y ~ X + (1 | C), data = df)

for which the residuals are distributed in the form I was expecting. However, I am then missing the part where the same individual is measured under different conditions. Those plots can be found here: https://s25.postimg.org/oupckrapr/Screen_Shot_2016_10_03_at_09_49_20.png

Thank you,
Carlos Família
Dear Carlos,

Is X another variable, or did you mean W? The graphs give me a strong indication of a missing covariate.

Best regards,
Thierry
Dear Thierry,

I am sorry, I meant W:

    lmer(Y ~ W + (1 | C), data = df)

Thank you,
Carlos
Dear Thierry,

If that is the case, would the initial model be of any use for inference, given that I have no other data or covariates and most likely won't be able to get any?

Many thanks,
Carlos Família
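[Editor's note: the other option raised in the original post, moving to lme with a separate residual variance per condition, could be sketched as below. `varIdent` is the standard nlme variance-function for this; comparing the two REML fits with a likelihood-ratio test is one common way to judge whether the extra variance parameters help.]

```r
library(nlme)

# Same mean structure as the lmer fit, but a separate residual
# variance for each level of C
fit_hetero <- lme(Y ~ W * C, random = ~ 1 | ID,
                  weights = varIdent(form = ~ 1 | C),
                  data = df, na.action = na.omit, method = "REML")

# Homoscedastic counterpart for comparison
fit_homo <- lme(Y ~ W * C, random = ~ 1 | ID,
                data = df, na.action = na.omit, method = "REML")

# Likelihood-ratio test of the extra variance parameters
# (valid here: same fixed effects, nested variance structures)
anova(fit_homo, fit_hetero)
```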
Dear Carlos,

Can you send us the dataset? I have some more questions about the data, and having the dataset would make it easier to look into this problem.

Best regards,
Thierry
Dear Thierry, If that is the case, would the initial model be of any use for inference given that I have no other data or covariate and most likely I won?t be able to get it? Many thanks, Carlos Fam?lia On 3 Oct 2016, at 09:57, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote: Dear Carlos, Is X an other variable? Or did you ment W? The graphs give me a strong indication for a missing covariate. Best regards, ir. Thierry Onkelinx Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance Kliniekstraat 25 1070 Anderlecht Belgium To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey 2016-10-03 10:51 GMT+02:00 Carlos Familia <carlosfamilia at gmail.com>:
Dear Thierry, The image can be found here https://s4.postimg.org/lj 5xf0rpp/Screen_Shot_2016_10_03_at_09_44_28.png Let me add another thing to the discussion, I was trying different models, and I tried the following lmer( Y ~ X + (1 | C), data = df) For which the residuals are distributed in a form I was expecting, however I am missing the part of the same individual being measured for different conditions, the plots can be found here, https://s25.postimg.org/oupckrapr/Screen_Shot_2016_10_03_at_09_49_20.png Thank you, Carlos Fam?lia On 3 Oct 2016, at 09:40, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote: Dear Carlos, Can you show us a plot of the residuals versus W for each level of C? It looks like either the relation of Y and W is not linear, or you are missing an important covariate. Best regards, ir. Thierry Onkelinx Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance Kliniekstraat 25 1070 Anderlecht Belgium To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey 2016-10-03 10:34 GMT+02:00 Carlos Familia <carlosfamilia at gmail.com>:
Hello, The image can be found here: https://s18.postimg.org/rbx2vh2ex/Pasted_Graphic_4.png Best regards, Carlos Família On 3 Oct 2016, at 08:50, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote: Dear Carlos, Your plot got stripped from your mail. Try sending it as a pdf, or put it somewhere online and send us the URL. Best regards, ir. Thierry Onkelinx 2016-10-02 17:57 GMT+02:00 Carlos Familia <carlosfamilia at gmail.com>:
Hello, I have in hand a quite large and unbalanced dataset, in which a continuous dependent variable Y was measured under 3 different conditions (C) for about 3000 subjects (ID) (although not all subjects have Y values for all 3 conditions). Additionally, there is a continuous measure W which was measured for all subjects. I am interested in testing the following: - Is the effect of W significant overall? - Is the effect of W significant at each level of C? - Is the effect of C significant? In order to answer this, I have specified the following model with lmer: lmer( Y ~ W * C + (1 | ID), data = df) which seems to properly reflect the structure of the data (I might be wrong here, any suggestions would be welcome). However, when running the diagnostic plots I noticed a slope in the residuals plot, and a slope different from y = x in the observed vs fitted plot (as shown below), which made me question the validity of the model for inference. Could I still use this model for inference? Should I specify a different formula? Should I turn to lme and try to include different variances for each level of condition (C)? Any ideas? I would really appreciate it if anyone could help me with this. Thanks in advance, Carlos Família
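The three questions above can be framed as likelihood-ratio tests on nested lmer fits. The sketch below is only illustrative and is not from the thread: the simulated data frame is a toy stand-in for the real df (with the Y, W, C, ID variables described above), and the true effects are chosen arbitrarily.

```r
# Sketch: nested-model likelihood-ratio tests for the three questions.
# The simulated df below is a toy stand-in for the real data.
library(lme4)

set.seed(1)
df <- data.frame(
  ID = rep(1:300, each = 3),
  C  = factor(rep(c("c1", "c2", "c3"), 300)),
  W  = rnorm(900)
)
df$Y <- 0.4 * df$W + rnorm(300)[df$ID] + rnorm(900, sd = 0.5)

# Fit with ML (REML = FALSE) so that likelihood-ratio tests of fixed effects are valid
m.full <- lmer(Y ~ W * C + (1 | ID), data = df, REML = FALSE)
m.add  <- lmer(Y ~ W + C + (1 | ID), data = df, REML = FALSE)
m.noW  <- lmer(Y ~ C     + (1 | ID), data = df, REML = FALSE)
m.noC  <- lmer(Y ~ W     + (1 | ID), data = df, REML = FALSE)

anova(m.add, m.full)  # does the slope of W differ between levels of C?
anova(m.noW, m.add)   # overall effect of W
anova(m.noC, m.add)   # overall effect of C

# One slope of W per level of C via nested coding (C/W expands to C + C:W)
m.slopes <- lmer(Y ~ C / W + (1 | ID), data = df, REML = FALSE)
fixef(m.slopes)       # the C:W terms are the per-level slopes of W
```

Whether p-values from such comparisons are trustworthy here still depends on the diagnostic issues discussed below.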
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
Hi Carlos,
Your plot didn't come through, as Thierry noted. However, it's expected that, unlike in a standard linear regression model, an LMM with a nested structure such as yours will show a positive linear relationship between the residuals and the fitted values (this might also be true for other structures), provided the residual and random-effect variances are > 0. Somebody will hopefully chip in with a formal explanation, but it's basically a phenomenon similar to regression to the mean.
Imagine a group of students each taking an aptitude test three times. There are two random factors: the difference in underlying aptitude between the students, modelled by the student ID random effect; and random variation between time points within each student (e.g. how good a particular student is feeling on a particular day). I'm ignoring variation between tests: let's unrealistically assume they're all the same and that students completely forget about them between tests.
The students with the best mean scores will be a mixture of excellent students having three so-so (some good, some bad) days, and moderately good students having the good luck to have three good days, and the very best scores will come from students who were both excellent and lucky (although this category will be small). An important point is that there is no way of using the data to separate the moderate-student-lucky-days high scores from the excellent-student-average-days scores. If we simply took the mean of the scores, we would be overestimating the performance of the students on average (we'd have good estimates of the excellent students and overestimates of the moderate ones), so the best estimate is achieved by shrinking the scores towards the mean.
This is what happens when the model is fitted. Each student is given a residual (random effect) at the student level (how good the student is relative to the value predicted by the fixed effects) and three residuals at the observation (between-test-within-student) level. For students with good scores, this will be a compromise between the inseparable excellent-student-average-days scenario and the moderate-student-lucky-days scenario. As a result, students with high student-level residuals (the student random effects) will also tend to have high inter-test residuals. The same is also true in reverse for poor students and students having three bad days. So the student random effects (which are part of the fitted values) and the residuals will be positively correlated.
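The shrinkage argument above can be demonstrated numerically. The base-R sketch below is a toy illustration (it is not the sim.residplot function mentioned later): it assumes a balanced design with known variance components, so it can shrink the per-student means directly with the classic BLUP factor instead of fitting a model.

```r
# Toy numeric illustration of the shrinkage argument (assumed setup: balanced
# design, known variance components; not a fitted lmer model).
set.seed(1)
n.id <- 500; n.rep <- 3
sigma.id <- 2; sigma.res <- 1           # between-student and within-student SDs
id <- rep(seq_len(n.id), each = n.rep)
b  <- rnorm(n.id, 0, sigma.id)[id]      # true student effects
y  <- b + rnorm(n.id * n.rep, 0, sigma.res)

# With known variances, the estimated (BLUP) student effect is the per-student
# mean shrunk by lambda = sigma.id^2 / (sigma.id^2 + sigma.res^2 / n.rep)
lambda   <- sigma.id^2 / (sigma.id^2 + sigma.res^2 / n.rep)
fit.vals <- lambda * ave(y, id)         # fitted values (no fixed effects here)
res.vals <- y - fit.vals

cor(fit.vals, res.vals)                 # positive, even though the "model" is correct
</imports>
```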
You can check this by simulating new data from the fitted model, re-fitting the model, and comparing the residuals-vs-fitted plot (which will be "perfect") to the one from your data. Here's a function that does this for lme4 fits:
devtools::install_github("pcdjohnson/GLMMmisc")
library(GLMMmisc)
sim.residplot(fit) # repeat this a few times to account for sampling error
If all is well, you should see a similar slope in the real and the simulated plots; in fact, the general pattern of the residuals should be similar.
(The new package DHARMa (https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-package-for-residual-diagnostics-of-glmms/) takes a similar approach to assessing residuals, but in a less quick-and-dirty, more formally justified way.)
All the best,
Paul
Dear Paul, Thank you for your reply. The image can be found here: https://s18.postimg.org/rbx2vh2ex/Pasted_Graphic_4.png Do you think it would be acceptable (for publication) if the analysis was performed separately for each group? Thanks, Carlos
On 3 Oct 2016, at 10:17, Paul Johnson <paul.johnson at glasgow.ac.uk> wrote:
Dear Carlos, I concur with Paul. After playing a bit with the data you sent me privately, I see a few things which cause problems: 1) the number of measurements per ID is low: 1/3 of IDs have one measurement in each level of C, 1/3 in two out of three levels of C, and 1/3 in only one level of C; 2) the variance of ID is larger than the residual variance; 3) the effect of W is small compared to the variance of ID. If possible, try to add extra covariates. If not, I'd fall back on a simple lm, either ignoring the repeated measurements or sampling the data so you have only one observation per ID. Best regards, ir. Thierry Onkelinx 2016-10-03 11:17 GMT+02:00 Paul Johnson <paul.johnson at glasgow.ac.uk>:
Dear Thierry, When you say sampling the data to have only one observation per ID, do you mean reducing the dataset to cases where Y was measured in only one condition? I have thought about going with lm for this, but it just didn't feel right... Many thanks, Carlos
On 3 Oct 2016, at 13:02, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:
In case you have more than one measurement per ID, select one of them at random. Something like: library(dplyr); df %>% group_by(ID) %>% sample_n(1) Best regards, ir. Thierry Onkelinx 2016-10-03 14:06 GMT+02:00 Carlos Familia <carlosfamilia at gmail.com>:
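The same subsampling idea can be sketched in base R without dplyr. The data frame below is a toy stand-in for the real df (the true slope of 0.5 is an arbitrary assumption); repeating the subsampling several times shows how stable the lm estimate is across random draws, which guards against conclusions resting on one unlucky subsample.

```r
# Base-R sketch of one-observation-per-ID subsampling; df below is a toy
# stand-in for the real data from the thread.
set.seed(42)
df <- data.frame(
  ID = rep(1:300, each = 2),
  C  = factor(sample(c("c1", "c2", "c3"), 600, replace = TRUE)),
  W  = rnorm(600)
)
df$Y <- 0.5 * df$W + rnorm(300)[df$ID] + rnorm(600, sd = 0.5)

one.per.id <- function(d) {
  # pick one random row per ID (guarding against sample()'s scalar behaviour)
  pick <- tapply(seq_len(nrow(d)), d$ID,
                 function(i) if (length(i) == 1) i else sample(i, 1))
  d[unlist(pick), ]
}

# slope of W in 20 independent random subsamples
slopes <- replicate(20, coef(lm(Y ~ W + C, data = one.per.id(df)))[["W"]])
range(slopes)  # should hover around the true slope
```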
Do you think this approach is sound and easily justifiable in a paper? Many thanks, Carlos
On 3 Oct 2016, at 13:10, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:

In case you have more than one measurement per ID, select one of them at random. Something like:

library(dplyr)
df %>% group_by(id) %>% sample_n(1)

2016-10-03 14:06 GMT+02:00 Carlos Familia <carlosfamilia at gmail.com>:

Dear Thierry,

When you say sampling the data to have only one observation per ID, do you mean reducing the dataset to cases where the Y variable was measured in only one condition?

I have thought about going with lm for this, but it just didn't feel right...

Many thanks, Carlos
On 3 Oct 2016, at 13:02, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:
Dear Carlos,
I concur with Paul. After playing a bit with the data you sent me privately, I see a few things that cause problems:
1) the number of measurements per ID is low: about 1/3 of subjects have a measurement in all three levels of C, 1/3 in two of the three levels, and 1/3 in only one level.
2) the variance of ID is larger than the residual variance
3) the effect of W is small compared to the variance of ID
If possible, try to add extra covariates. If not, I'd fall back on a simple lm, either ignoring the repeated measurements or sampling the data so you have only one observation per ID.
Best regards,
ir. Thierry Onkelinx
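Thierry's two fallbacks can be sketched as follows (variable names Y, W, C, and ID are taken from the thread; the sampling step uses his dplyr snippet):

```r
library(dplyr)

# Option 1: plain lm, ignoring the repeated measurements.
# Standard errors will be anti-conservative if observations within ID correlate.
fit_all <- lm(Y ~ W * C, data = df)

# Option 2: plain lm on one randomly selected observation per subject,
# so the remaining observations are independent.
df_one <- df %>%
  group_by(ID) %>%
  sample_n(1) %>%
  ungroup()
fit_one <- lm(Y ~ W * C, data = df_one)
```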
Dear Carlos,

If I understand properly what's troubling you and the nature of the suggestion, I wouldn't do it -- that is, randomly discard data to get one observation per case. Paul Johnson explained clearly why the pattern you noticed in the residuals vs. fitted values plot occurred, as a consequence of shrinkage.

One way of thinking about this is that using a mixed model is *more* important when you have few observations per case, where shrinkage will be greater, than when you have many observations per case, where the "estimates" of the random effects will nearly coincide with the case means (or, in a more complex model, the within-case regressions).

Best, John
-----------------------------
John Fox, Professor
McMaster University
Hamilton, Ontario
Canada L8S 4M4
Web: socserv.mcmaster.ca/jfox
Dear John, Do you suggest then that I should fit a different linear regression model per condition and, instead of analysing the overall effect of W across conditions, analyse it for each condition separately? Do you think this would be fine for a referee? Please note that I have no problem with this. Many thanks, Carlos Família
Dear Carlos,
No, I'm not recommending that, or anything in particular beyond the observation that randomly discarding data in order to fit a fixed effect model to the remaining independent observations is probably not a good idea. I also don't see what's to be gained by fitting separate models to the conditions; if you expect interactions with condition, why not model them? I don't think that it would be responsible for me to give you statistical advice by email knowing next to nothing about your research.
Do you think this would be fine for a referee?
I have no idea. I probably wouldn't feel that I could predict an unknown referee's response to your work even if I were in a position to recommend what you should do. I'm sorry that I can't be more helpful. Best, John
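John's suggestion to model the interactions with condition rather than fit separate per-condition models can be sketched within the single mixed model already specified in the thread (the function choices below are mine, not John's):

```r
library(lme4)

# Keep one model and test the W-by-C interaction with a likelihood-ratio test.
# Both models are fitted with ML (REML = FALSE) so the fixed effects are comparable.
fit_full <- lmer(Y ~ W * C + (1 | ID), data = df, REML = FALSE)
fit_add  <- lmer(Y ~ W + C + (1 | ID), data = df, REML = FALSE)

anova(fit_add, fit_full)  # does the slope of W differ across levels of C?

# The slope of W within each level of C could then be examined with,
# e.g., emmeans::emtrends(fit_full, ~ C, var = "W")
```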
Please note that I have no problem with this.

Many thanks,
Carlos Família

On 3 Oct 2016, at 14:19, Fox, John <jfox at mcmaster.ca> wrote:

Dear Carlos,

If I understand properly what's troubling you and the nature of the suggestion, I wouldn't do it -- that is, randomly discard data to get one observation per case. Paul Johnson explained clearly why the pattern you noticed in the residuals vs. fitted values plot occurred, as a consequence of shrinkage. One way of thinking about this is that using a mixed model is *more* important when you have few observations per case, where shrinkage will be greater, than when you have many observations per case, where the "estimates" of the random effects will nearly coincide with the case means (or, in a more complex model, the within-case regressions).

Best,
John

-----------------------------
John Fox, Professor
McMaster University
Hamilton, Ontario
Canada L8S 4M4
Web: socserv.mcmaster.ca/jfox

-----Original Message-----
From: R-sig-mixed-models [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of Carlos Familia
Sent: October 3, 2016 8:20 AM
To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
Cc: r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] Model diagnostics show slope in residuals plot and slope on the observed vs fitted plot is different than y = x

Do you think this approach is sound and easily justifiable in a paper?

Many thanks,
Carlos

On 3 Oct 2016, at 13:10, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:

In case you have more than one measurement per ID, select one of them at random. Something like:

library(dplyr)
df %>% group_by(id) %>% sample_n(1)

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest

2016-10-03 14:06 GMT+02:00 Carlos Familia <carlosfamilia at gmail.com>:

Dear Thierry,

When you say sampling the data to have only one observation per ID, do you mean reducing the dataset to cases where the Y variable was measured in only one condition? I have thought about going with lm for this, but it just didn't feel right...

Many thanks,
Carlos

On 3 Oct 2016, at 13:02, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:

Dear Carlos,

I concur with Paul. After playing a bit with the data you sent me privately, I see a few things which cause problems:

1) the number of measurements per ID is low: 1/3 of the subjects have one measurement in each level of C, 1/3 in two of the three levels, and 1/3 in only one level;
2) the variance of ID is larger than the residual variance;
3) the effect of W is small compared to the variance of ID.

If possible, try to add extra covariates. If not, I'd fall back on a simple lm, either ignoring the repeated measurements or sampling the data so you have only one observation per ID.

Best regards,
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest

2016-10-03 11:17 GMT+02:00 Paul Johnson <paul.johnson at glasgow.ac.uk>:

Hi Carlos,

Your plot didn't come through, as Thierry noted. However, it's expected that, unlike a standard linear regression model, an LMM with a nested structure such as yours will show a positive linear relationship between the residuals and the fitted values (this might also be true for other structures), provided the residual and random-effect variances are both > 0. Somebody will hopefully chip in with a formal explanation, but it's basically a phenomenon similar to regression to the mean.

Imagine a group of students each taking an aptitude test three times. There are two random factors: the difference in underlying aptitude between the students, modelled by the student ID random effect; and random variation between time points within each student (e.g. how good a particular student is feeling on a particular day). I'm ignoring variation between tests; let's unrealistically assume they're all the same and students completely forget about them between tests.
The students with the best mean scores will be a mixture of excellent students having three so-so (some good, some bad) days and moderately good students having the good luck to have three good days, and the very best scores will come from students who were both excellent and lucky (although this category will be small). An important point is that there is no way of using the data to separate the moderate-student-lucky-days high scores from the excellent-student-average-days scores. If we simply took the mean of the scores, we would be overestimating the performance of the students on average (we'd have good estimates of the excellent students and overestimates of the moderate ones), so the best estimate is achieved by shrinking the scores towards the mean.

This is what happens when the model is fitted. Each student is given a residual (random effect) at the student level (how good the student is relative to the value predicted by the fixed effects) and three residuals at the observation (between-test-within-student) level. For students with good scores, this will be a compromise between the inseparable excellent-student-average-days and moderate-student-lucky-days scenarios. As a result, students with high student-level residuals (the student random effects) will also tend to have high inter-test residuals. The same holds with the signs flipped for poor students and for students having three bad days. So the student random effects (which are part of the fitted values) and the residuals will be positively correlated.

You can check this by simulating new data from the fitted model, re-fitting the model, and comparing the residuals-vs-fitted plot (which will be "perfect") to the one from your data.
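[Editor's note: the simulate-and-refit check described above can be sketched with lme4's built-in simulate() and refit(). Everything below is invented stand-in data for illustration; `fit` is a toy model, not Carlos's actual fit.]

```r
## Toy illustration of the simulate-and-refit residual check.
library(lme4)

set.seed(1)
sim.df <- data.frame(ID = factor(rep(1:100, each = 2)), W = rnorm(200))
sim.df$Y <- 1 + 0.5 * sim.df$W +        # fixed effects
  rnorm(100)[as.integer(sim.df$ID)] +   # between-ID random intercepts
  rnorm(200, sd = 0.5)                  # observation-level noise
fit <- lmer(Y ~ W + (1 | ID), data = sim.df)

## Slope of residuals on fitted values for the observed fit ...
obs.slope <- unname(coef(lm(resid(fit) ~ fitted(fit)))[2])

## ... and the same slope for refits to responses simulated from the
## fitted model itself. If the observed slope sits comfortably within
## the simulated range, the pattern is shrinkage, not misspecification.
sim.slopes <- replicate(10, {
  refit.mod <- refit(fit, simulate(fit)[[1]])
  unname(coef(lm(resid(refit.mod) ~ fitted(refit.mod)))[2])
})
obs.slope
range(sim.slopes)
```

Because the random-effect variance here exceeds the residual variance, both the observed and simulated slopes come out positive, which is exactly the benign pattern Paul describes.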
Here's a function that does this for lme4 fits:

devtools::install_github("pcdjohnson/GLMMmisc")
library(GLMMmisc)
sim.residplot(fit) # repeat this a few times to account for sampling error

If all is well, you should see a similar slope in the real and the simulated plots; in fact, the general pattern of the residuals should be similar. (The new package DHARMa -- https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-package-for-residual-diagnostics-of-glmms/ -- takes a similar approach to assessing residuals, but in a less quick-and-dirty, more formally justified way.)

All the best,
Paul

On 2 Oct 2016, at 16:57, Carlos Familia <carlosfamilia at gmail.com> wrote:

Hello,

I have in hand a quite large and unbalanced dataset, in which a continuous dependent variable Y was measured in 3 different conditions (C) for about 3000 subjects (ID), although not all subjects have Y values for all 3 conditions. Additionally, there is a continuous measure W which was measured for all subjects. I am interested in testing the following:

- Is the effect of W significant overall?
- Is the effect of W significant at each level of C?
- Is the effect of C significant?

In order to answer this, I have specified the following model with lmer:

lmer(Y ~ W * C + (1 | ID), data = df)

which seems to properly reflect the structure of the data (I might be wrong here; any suggestions would be welcome). However, when running the diagnostic plots I noticed a slope in the residuals plot, and a slope different from y = x in the observed vs fitted plot (as shown below), which made me question the validity of the model for inference.

Could I still use this model for inference? Should I specify a different formula? Should I turn to lme and try to include different variances for each level of condition (C)?

Any ideas? I would really appreciate it if anyone could help me with this.

Thanks in advance,
Carlos Família
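[Editor's note: the per-level effect of W that Carlos asks about can be read off directly from a nested-slope parameterisation of his model. A minimal sketch with invented data; the names Y, W, C and ID mirror his description, and the toy slopes below are arbitrary.]

```r
## Hedged sketch: estimating a separate W slope for each level of C.
## The data are simulated stand-ins, not Carlos's dataset.
library(lme4)

set.seed(2)
toy <- expand.grid(ID = factor(1:60), C = factor(c("c1", "c2", "c3")))
toy$W <- rnorm(nrow(toy))
## True per-condition slopes of W: 0.2, 0.5, 0.8.
toy$Y <- c(0, 1, 2)[as.integer(toy$C)] +
  c(0.2, 0.5, 0.8)[as.integer(toy$C)] * toy$W +
  rnorm(60)[as.integer(toy$ID)] +
  rnorm(nrow(toy), sd = 0.5)

## Y ~ C + C:W fits the same model as Y ~ W * C, but replaces the
## overall-W-slope-plus-interaction terms with one W slope per level
## of C, i.e. exactly the "effect of W at each level of C" question.
fit.nested <- lmer(Y ~ C + C:W + (1 | ID), data = toy)
fixef(fit.nested) # the coefficients Cc1:W, Cc2:W, Cc3:W are the per-level slopes
```

The two parameterisations give identical fits; only the labelling of the fixed effects changes, so the per-level slopes and their standard errors come straight out of summary(fit.nested).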
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models