Dear all,

Lately I came upon a very interesting project, which also made me think, since it was the first time I have worked on such data.

I have 2-level data: 60 participants with 2-3 measurements each, allocated (almost balanced) to two groups, say the Y variable. This Y is also my outcome. There are also about 350 features. The goal is therefore to predict the Y class from the 350 features.

Problem: I have around 180 (not independent) observations and 350 variables. Obviously this will not work, so somehow the variables have to be reduced.

Possible solution: These 350 features are highly correlated in groups, meaning that they form clusters that give similar information. If we were talking about independent data, a possible solution would be, say, PCA, followed by building the prediction model with a GLM on the PCA features (I have never tried something like that, but I see it is common).

However, now that the ultimate goal is to use a GLMM, how can this be done? Can you do PCA (or any variable reduction technique) on 2-level data? If yes, can you point me to where I can learn about it? If this is not possible, can you suggest something that you would do in this case?

P.S. Since we are talking about a prediction model, is it still valid to assess prediction accuracy with the AUC under a GLMM?

Thank you,
John Zavrakidis
Junior Researcher - Statistician
Department of Epidemiology and Biostatistics
GLMM with many and highly correlated features
5 messages · j.zavrakidis at nki.nl, Thierry Onkelinx, Dimitris Rizopoulos
Dear John,

It looks like you have a binomial response variable, and each participant has either always 0 or always 1 as outcome. Adding participant as a random effect will induce complete separation. Aggregating the data to one observation per participant leaves you with 60 observations: in a balanced design, 30 with the outcome and 30 without. Hence you have about 30 effective observations, which leaves room for at most 3 (three) parameters to be estimated. So you'll need a way to reduce your 350 variables down to 3 without looking at the response variable.

IMHO Tukey's quote in my signature and fortunes::fortune(119) apply.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician
Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////

On Mon 3 Dec 2018 at 08:56, <j.zavrakidis at nki.nl> wrote:
Dear all, Lately I came upon a very interesting project [...]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
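The arithmetic in the reply above can be sketched in a few lines. This is a minimal illustration in Python (not R), under stated assumptions: the helper names `aggregate_by_participant` and `parameter_budget` are hypothetical, repeated measurements are collapsed by a simple mean (the reply does not say how to aggregate), and the budget uses the common rule of thumb of roughly 10 effective observations per estimated parameter, which seems to be what the "30 observations, 3 parameters" figure relies on.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_participant(rows):
    """Average repeated measurements into one feature vector per participant.

    rows: iterable of (participant_id, outcome, feature_list) tuples, where
    the outcome (0 or 1) is assumed constant within a participant.
    """
    outcomes = {}
    measurements = defaultdict(list)
    for pid, y, x in rows:
        if outcomes.setdefault(pid, y) != y:
            raise ValueError(f"outcome varies within participant {pid}")
        measurements[pid].append(x)
    # zip(*visits) walks the features column by column across visits
    return {
        pid: (outcomes[pid], [mean(col) for col in zip(*visits)])
        for pid, visits in measurements.items()
    }


def parameter_budget(outcomes, per_parameter=10):
    """Return (effective observations, max parameters) for a binary outcome.

    per_parameter=10 is an assumed rule of thumb, not a hard limit.
    """
    n_events = sum(outcomes)
    effective = min(n_events, len(outcomes) - n_events)
    return effective, effective // per_parameter


# Toy data (made up): 60 participants, 3 visits each, 2 features, balanced outcome
rows = [
    (pid, pid % 2, [float(pid + visit), float(visit)])
    for pid in range(60)
    for visit in range(3)
]
per_participant = aggregate_by_participant(rows)
budget = parameter_budget([y for y, _ in per_participant.values()])
```

With a balanced outcome over 60 participants, the smaller class has 30 members, so the budget under this rule of thumb is 3 parameters, matching the figure in the reply.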
Dear Thierry,

Thanks for your reply! Actually, this is exactly my question: how can I do that? Is there a way to combine a variable-reduction technique with a GLMM (in R)? Does PCA also work in the context of a GLMM?

Kind regards,
John

-----Original Message-----
From: Thierry Onkelinx [mailto:thierry.onkelinx at inbo.be]
Sent: Monday, 3 December 2018 10:39
To: John Zavrakidis
Cc: r-sig-mixed-models
Subject: Re: [R-sig-ME] GLMM with many and highly correlated features

Dear John, It looks like you have a binomial response variable. [...]
Dear John,

As I said before, a GLMM is out of the question due to complete separation. Hence a simple GLM would be sufficient. IMHO you first need to reduce the dimensionality of the covariates from 350 down to 3(!), and then fit the GLM. Using the response in this selection is cheating. You could use the first 3 PCA axes to reduce the dimensionality, but the interpretation of those axes would be hard.

Best regards,

ir. Thierry Onkelinx

On Mon 3 Dec 2018 at 11:42, <j.zavrakidis at nki.nl> wrote:
Dear Thierry, Thanks for your reply! [...]
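The two-step recipe in the reply above (reduce the covariates to 3 PCA axes without touching the response, then fit a plain GLM) might look roughly like the following. This is only a sketch in Python with NumPy rather than R, under several assumptions: the feature matrix already holds one row per participant, the features are standardized before the PCA, the data are simulated noise, and the function names `pca_scores` and `fit_logistic_glm` are hypothetical.

```python
import numpy as np


def pca_scores(X, n_components=3):
    """Scores on the first principal components; the response is never used."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature
    U, s, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def fit_logistic_glm(Z, y, n_iter=25, tol=1e-8):
    """Unpenalized logistic regression fitted by Newton-Raphson (IRLS)."""
    A = np.column_stack([np.ones(len(Z)), Z])  # intercept + predictors
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        gradient = A.T @ (y - p)
        hessian = A.T @ (A * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, gradient)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated stand-in for the real data: 60 participants, 350 features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 350))
y = np.repeat([0.0, 1.0], 30)  # balanced binary outcome

scores = pca_scores(X)              # 60 x 3 matrix of PCA scores
beta = fit_logistic_glm(scores, y)  # intercept + 3 slopes
```

Because the component scores are computed from the features alone, the outcome plays no role in the reduction step, which is the point of the "without looking at the response variable" requirement. The slopes refer to linear combinations of all 350 features, which is why the axes are hard to interpret.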
Indeed, the data do not seem to contain a lot of information, but this is a problem that is nowadays often encountered in practice and for which methodology has been developed. For example, glmnet (https://cran.r-project.org/package=glmnet) could be of use here.

Best,
Dimitris
On 12/3/2018 9:26 PM, Thierry Onkelinx via R-sig-mixed-models wrote:
Dear John, As I said before, a GLMM is out of the question due to complete separation. [...]
Dimitris Rizopoulos
Professor of Biostatistics
Department of Biostatistics
Erasmus University Medical Center
Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
Web (personal): http://www.drizopoulos.com/
Web (work): http://www.erasmusmc.nl/biostatistiek/
Blog: http://iprogn.blogspot.nl/
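glmnet fits lasso and elastic-net penalized GLMs, so the suggestion above amounts to shrinking most of the 350 coefficients to zero instead of reducing the features beforehand. To show the idea, here is a rough Python/NumPy sketch of an L1-penalized logistic regression. It is not glmnet and does not use its algorithm (glmnet uses coordinate descent; this sketch uses proximal gradient descent), the data are simulated, the penalty values are arbitrary, and the names `soft_threshold` and `lasso_logistic` are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_logistic(X, y, lam, n_iter=2000):
    """Minimize mean logistic loss + lam * sum(|beta|) by proximal gradient.

    The intercept is not penalized. Features are standardized internally,
    so the returned coefficients are on the standardized scale.
    """
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    A = np.column_stack([np.ones(n), Xs])
    # 1 / (upper bound on the Lipschitz constant of the loss gradient)
    step = 4.0 * n / np.linalg.norm(A, 2) ** 2
    intercept, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(intercept + Xs @ beta)))
        residual = prob - y
        intercept -= step * residual.mean()
        beta = soft_threshold(beta - step * (Xs.T @ residual) / n, step * lam)
    return intercept, beta


# Simulated stand-in: 60 participants, 350 features, outcome driven by feature 0
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 350))
y = (X[:, 0] > 0).astype(float)

intercept, beta = lasso_logistic(X, y, lam=0.1)
_, beta_large_penalty = lasso_logistic(X, y, lam=10.0)  # penalty large enough to zero every slope
```

Unlike the PCA route, this approach does use the response to pick features, so the penalty would have to be tuned and the predictive accuracy assessed by cross-validation, presumably with all measurements of a participant kept in the same fold.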