Hello,
I'm trying to fit a mixed-effects model to my corpus data. The data has a hierarchical structure. I need to make sure that the final model reflects this nested structure.
My final model looks like this:
theMdl <- lmer(dis.norm.j$transformed ~ disciplinaryGroup + genreGroup + level + (1|student_id) + (1|levelA) + (1|levelB) + (1|levelC), data = thedata, control = lmerControl(optimizer = "bobyqa"))
where
levelA is genreGroup:genreFamily:student_id
levelB is disciplinaryGroup:discipline:student_id
levelC is level:student_id
Here is a link to my data and R script: https://www.dropbox.com/sh/46r6lv6n89bromk/AABMc8MQmAYhRC3ubJ0Ii7Wma?dl=0
Thanks
Taha
LMER-CorpusData
4 messages · Taha Omidian, Phillip Alday
1 day later
I don't think this is the model you're looking for...

1. It's really weird to have your predictors in one data frame and your dependent variable in a different one. Are you really sure that the rows line up the way you think they do? If so, why not join the data frames earlier (with merge(), plyr::join() or one of the dplyr joins such as dplyr::left_join())? I'm quite nervous overall about namespaces, scope, etc. in your code -- using attach() isn't recommended practice, especially when you mix and match things (e.g. your levelX variables aren't in your data frame, but the other predictors are). You can do it the way you have it, but you have to be really careful to make sure you're using the data you think you're using.

2. Your levelX variables include the same predictor both in the fixed effects and as part of a grouping variable (the part of the random effect after the |). This generally doesn't make sense -- there are a number of posts on this mailing list to that effect (see also https://rpubs.com/INBOstats/both_fixed_random and https://www.muscardinus.be/2017/08/fixed-and-random/) -- but it depends on your data.

In other words, seeing your model specification isn't quite enough -- we also need to know something about your data, more than your variable names alone reveal. Even though I work a lot with language data, I still can't tell from your variable names and code what your data actually represent.

Best,
Phillip
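As a minimal sketch of point 1, the two data frames could be joined on an explicit key before fitting, so that row order no longer matters and attach() is not needed. The data below are invented, and the key column text_id is a hypothetical name; the real data would need some shared identifier in both frames.

```r
# Toy stand-ins for the two data frames in the thread. Only student_id, level
# and transformed are names from the thread; text_id is a hypothetical key.
predictors <- data.frame(
  text_id    = c("t1", "t2", "t3", "t4"),
  student_id = c("s1", "s1", "s2", "s2"),
  level      = c("1", "1", "2", "2")
)
response <- data.frame(
  text_id     = c("t3", "t1", "t4", "t2"),  # deliberately in a different row order
  transformed = c(0.30, 0.10, 0.40, 0.20)
)

# Join on the explicit key instead of relying on row order or attach().
thedata <- merge(predictors, response, by = "text_id")

# Every text should match exactly once; stop early if it does not.
stopifnot(nrow(thedata) == nrow(predictors), !anyNA(thedata$transformed))

print(thedata)
```

With everything in one data frame, the model formula can refer to plain column names (e.g. transformed ~ ...) and data = thedata is the only place the model looks for variables.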
In reply to Taha Omidian's message of 10/08/2018, 12:46 AM.
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
1 day later
Hi Phillip,

Thanks so much for your reply. I think the best way to describe the data is to start with the aim of our study. The purpose of our study is to investigate the effect of discipline, genre, and level of study on the use of certain word combinations in learner writing.

To represent learner writing, we compiled a corpus of texts collected from students in 30 different disciplines and at four levels of study. Texts in the corpus were then categorised by genre (13 genres). Following this, we classified the disciplines into four major disciplinary groupings. Genres were also grouped under five broad categories based on their social purposes. We then searched the corpus for occurrences of 278 word combinations (e.g., "on the other hand") and recorded their normalised frequency of occurrence for each text (labelled ref.norm in our data).

To me, our data is structured in a hierarchical fashion (for each predictor). So here is what we have in our data:

- Students (student_id col) contributed multiple texts (id col)
- Each text is nested within a discipline (discipline col), and disciplines are clustered within four disciplinary groupings (disciplinaryGroup col)
- Each text is nested within a genre (genreFamily col), and genres are grouped into five genre groups (genreGroup col)
- Each text is nested within one of four levels of study (level col)

Predictors (based on the labels in our data): disciplinaryGroup, genreGroup, level
Dependent variable (based on its label in our data): ref.norm

So I need to know how this nested structure can be reflected in an LME model.

As always, thanks for your help.
T
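The nesting described above can be checked directly from the data before any model is written down. The following is a rough sketch in base R on invented toy data; the column names follow the message, the values do not come from the corpus, and is_nested() is a hypothetical helper defined here (lme4 provides a similar isNested() function).

```r
# Toy data using the column names from the message; all values are invented.
d <- data.frame(
  student_id        = c("s1", "s1", "s2", "s2", "s3", "s3"),
  discipline        = c("Biology", "Biology", "History", "History", "Physics", "Physics"),
  disciplinaryGroup = c("LifeSci", "LifeSci", "ArtsHum", "ArtsHum", "PhysSci", "PhysSci"),
  genreFamily       = c("Essay", "Report", "Essay", "Critique", "Report", "Report"),
  genreGroup        = c("Argue", "Inform", "Argue", "Argue", "Inform", "Inform")
)

# Hypothetical helper: TRUE if every level of `inner` occurs together with
# exactly one level of `outer`, i.e. `inner` is nested within `outer`.
is_nested <- function(inner, outer) {
  co_occurs <- table(inner, outer) > 0
  all(rowSums(co_occurs) == 1)
}

print(is_nested(d$discipline, d$disciplinaryGroup))  # disciplines within groups
print(is_nested(d$genreFamily, d$genreGroup))        # genre families within groups
print(is_nested(d$student_id, d$discipline))         # students within disciplines
print(is_nested(d$student_id, d$genreFamily))        # FALSE here: student s1 wrote in two genres
```

A FALSE result means the two factors are at least partially crossed rather than nested, which changes how they should enter the random effects.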
In reply to Phillip Alday's message of Oct 9, 2018, 11:10 PM.
7 days later
Hi Taha,

You can use the term "collocation" with me -- it's more precise than "word combination". ;)

What seems to be missing from your model are your particular collocations -- are you fitting a separate model for each collocation? Or are you looking at the combined frequency of all the collocations? Assuming the answer to one of these questions is yes (and each has its own implications and potential pitfalls for your inferences), I would massively reduce your random effects structure.

I propose the following basic structure for the model (under the assumption that each student is only in one discipline):

    ref.norm ~ 1 + disciplinaryGroup + genreGroup + level + (1|student_id)

I would seriously consider using a model with interactions among disciplinaryGroup, genreGroup and level, if you have enough data to do so. Depending on which combinations of disciplinaryGroup, genreGroup and level are present in the data, this may give you warnings about a rank-deficient model matrix and dropped columns, but that's okay. lme4 is just telling you that it can't estimate interactions for combinations that didn't occur and so it won't try.

If each student also produced texts in multiple genre groups, then I would see if changing (1|student_id) to (1+genreGroup|student_id) improved the fit. Is each student measured at different levels? If so, then you can consider doing the same thing for level as for genreGroup, i.e. (1+level|student_id).

I'm not sure I would include text id in the model, because it's not "repeated" in any meaningful sense and would thus be an observation-level random effect. Text id is essentially just a way of distinguishing between repetitions within each student.
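The additive model and an interaction model of the kind discussed above can be sketched on simulated data as follows. This is only an illustration: the sample sizes, effect sizes and the assumption of one level per student are invented, and the three-way interaction formula is one possible form, not a formula quoted from the thread.

```r
library(lme4)

set.seed(1)

# Simulated stand-in for the corpus: 100 students, 8 texts each. Sizes and
# effect values are invented; only the column names follow the thread.
n_students <- 100
n_texts    <- 8
student    <- rep(seq_len(n_students), each = n_texts)

discipline_of_student <- (seq_len(n_students) - 1) %% 30 + 1  # one discipline per student
level_of_student      <- (seq_len(n_students) - 1) %% 4 + 1   # assumption: one level per student
genre_family          <- sample(1:13, n_students * n_texts, replace = TRUE)

d <- data.frame(
  student_id        = factor(student),
  disciplinaryGroup = factor((discipline_of_student[student] - 1) %/% 8 + 1),
  level             = factor(level_of_student[student]),
  genreGroup        = factor((genre_family - 1) %/% 3 + 1)
)

# Response: a genre-group trend, a per-student offset and residual noise.
student_effect <- rnorm(n_students, sd = 1)
d$ref.norm <- 10 + 0.5 * as.integer(d$genreGroup) +
  student_effect[student] + rnorm(nrow(d), sd = 1)

# Basic additive model from the message above.
m_basic <- lmer(ref.norm ~ 1 + disciplinaryGroup + genreGroup + level +
                  (1 | student_id), data = d, REML = FALSE)

# One possible interaction model (an assumed form, not quoted from the thread).
# If some factor combinations never occur, lme4 reports a rank-deficient
# model matrix and drops the corresponding columns.
m_inter <- lmer(ref.norm ~ 1 + disciplinaryGroup * genreGroup * level +
                  (1 | student_id), data = d, REML = FALSE)

# Information criteria for both fits; lower is better.
print(AIC(m_basic, m_inter))
print(BIC(m_basic, m_inter))
```

Both models are fitted with REML = FALSE because they differ in their fixed effects. A random-slope variant such as (1 + genreGroup | student_id) could be swapped in the same way, given enough texts per student.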
Now, assuming that you don't care about particular disciplines or genres, but rather just want to see if they account for any additional variance beyond the coarser disciplinaryGroup and genreGroup categorizations, you could include them as random effects:

    ref.norm ~ 1 + disciplinaryGroup + genreGroup + level + (1|student_id) + (1|discipline) + (1|genreFamily)

You don't have to explicitly nest student_id within discipline -- lme4 already picks up on that. genreFamily is (at least partially) crossed with student_id and discipline, and lme4 also picks up on that. (More precisely, the mathematical formulation that lme4 uses deals with such structures without any extra work.)

This formulation assumes that the effects of student/discipline and genre are additive; you could potentially add in a (1|student_id:genreFamily) or (1|discipline:genreFamily), but (1) I don't think this would explain that much more variation and (2) you would need a *lot* of data for this to actually be meaningful and not just overfitting.

Overfitting is actually a potential problem for all of these more complicated models: make sure that AIC and BIC aren't getting worse! (The likelihood-ratio test is invalid for non-nested models and tricky for nested models that only differ in their variance components. Rejecting a variance component is the same thing as saying it's equal to zero, which is at the edge of the parameter space for variance, which means the p-values from the LRT aren't right.)

Assuming that each discipline only occurs within one discipline group, disciplinaryGroup:discipline is the same thing as discipline. Same thing for genreGroup:genreFamily.

Finally, please note that depending on your exact normalization procedure, a standard Gaussian model with identity link (i.e. "linear") might not be the right model for the job. I'm thinking in particular about issues that can arise when your normalization procedure results in a measure that's bounded on [0,1].

Best,
Phillip
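As a rough sketch of the comparison suggested above, the student-only model and the model with extra random intercepts for discipline and genreFamily can be fitted to simulated data and compared by AIC and BIC. All sizes and variance values below are invented, and one level per student is an assumption; only the column names and the two model formulas follow the thread.

```r
library(lme4)

set.seed(2)

# Simulated stand-in data (all sizes and effect values are invented).
n_students <- 120
n_texts    <- 8
student    <- rep(seq_len(n_students), each = n_texts)
discipline <- ((seq_len(n_students) - 1) %% 30 + 1)[student]  # one discipline per student
level      <- ((seq_len(n_students) - 1) %% 4 + 1)[student]   # assumption: one level per student
genre      <- sample(1:13, length(student), replace = TRUE)   # genres vary within student

d <- data.frame(
  student_id        = factor(student),
  discipline        = factor(discipline),
  disciplinaryGroup = factor((discipline - 1) %/% 8 + 1),
  genreFamily       = factor(genre),
  genreGroup        = factor((genre - 1) %/% 3 + 1),
  level             = factor(level)
)

# When discipline is nested in disciplinaryGroup, the interaction has no more
# observed levels than discipline itself, so it defines the same grouping.
stopifnot(nlevels(droplevels(interaction(d$disciplinaryGroup, d$discipline))) ==
            nlevels(d$discipline))

d$ref.norm <- 10 +
  rnorm(n_students, sd = 1.0)[student] +  # student-level variation
  rnorm(30, sd = 0.7)[discipline] +       # discipline-level variation
  rnorm(13, sd = 0.7)[genre] +            # genre-family variation
  rnorm(nrow(d), sd = 1.0)                # residual noise

m_student <- lmer(ref.norm ~ 1 + disciplinaryGroup + genreGroup + level +
                    (1 | student_id), data = d, REML = FALSE)
m_crossed <- lmer(ref.norm ~ 1 + disciplinaryGroup + genreGroup + level +
                    (1 | student_id) + (1 | discipline) + (1 | genreFamily),
                  data = d, REML = FALSE)

# Compare by information criteria rather than by a likelihood-ratio p-value.
print(AIC(m_student, m_crossed))
print(BIC(m_student, m_crossed))
```

The grouping factors are written as plain (1 | discipline) and (1 | genreFamily) terms; no explicit nesting syntax is used, in line with the point that lme4 handles nested and crossed structures from the data itself.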
In reply to Taha Omidian's message of 10/10/2018, 12:52 PM.