To answer your specific questions:
1. Changing the reference level should not change the overall model fit
itself, but it will change the magnitude and possibly the sign of the
coefficient estimates, because with a different reference level the
coefficients represent different comparisons.
2. You should always code factor variables in a way that makes sense given
the scientific question you are trying to answer. A very good discussion
of factor coding is here: http://talklab.psy.gla.ac.uk/tvw/catpred/
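As a quick check of point 1, here is a minimal R sketch. It uses simulated data rather than your 'qaaf' dataset, so the data frame `d`, the probabilities in `p`, and the column `residence2` are all made up for illustration, and it uses a plain `glm` with no random effects for speaker or item.

```r
# Simulated stand-in for the real data; all names and values here are made up.
set.seed(1)
n <- 300
d <- data.frame(
  residence = factor(sample(c("migrant", "urbanite", "villager"), n, replace = TRUE))
)
# Assumed probabilities of using the CA form, for illustration only
p <- c(migrant = 0.5, urbanite = 0.7, villager = 0.3)[as.character(d$residence)]
d$convergence <- factor(ifelse(runif(n) < p, "CA", "MA"), levels = c("MA", "CA"))

# Fit 1: default (alphabetical) levels, so "migrant" is the baseline
m1 <- glm(convergence ~ residence, data = d, family = binomial)

# Fit 2: same data, but with "villager" as the baseline
d$residence2 <- relevel(d$residence, ref = "villager")
m2 <- glm(convergence ~ residence2, data = d, family = binomial)

# Same fit: the log-likelihoods and fitted probabilities agree
logLik(m1)
logLik(m2)
all.equal(fitted(m1), fitted(m2), tolerance = 1e-6)

# Different coefficients: each is a comparison against that model's baseline
coef(m1)
coef(m2)
```

In this sketch the villager-vs-migrant coefficient in `m1` and the migrant-vs-villager coefficient in `m2` are the same comparison seen from opposite sides, so they have the same size and opposite sign.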
General comments: Listing the levels of an ordinal variable (like age
cohort) in their proper order does not in fact capture their ordered-ness:
the factor is still treated as unordered, and the model will not
necessarily enforce the assumption that the change from middle-aged to
old should be the same magnitude and direction as the change from young to
middle-aged. In all of the examples you've given, what you will get is the
first level of the factor treated as baseline, and the coefficient
estimates for the other levels as differences between them and the baseline.
For example, a coefficient "residence:migrant" (R will typically label it
"residencemigrant") will tell you how much higher (or lower) the log-odds
of using the "CA" form rather than the "MA" form are for migrants than for
villagers (the baseline level).
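To make the point about ordered-ness concrete, here is a small R sketch. The vector `age` is a made-up stand-in for your age.group column, and the object names are only for illustration. It shows the coding R uses by default for an unordered factor, what changes if the factor is declared ordered, and one way (a numeric predictor) to actually impose equal steps between cohorts if that assumption suits your question.

```r
# Hypothetical stand-in for the age.group column; values are made up.
age <- c("young", "old", "middle-aged", "young", "old", "middle-aged")

# Unordered factor with levels listed in their natural order
age_f <- factor(age, levels = c("young", "middle-aged", "old"))
contrasts(age_f)   # treatment (dummy) coding: each level vs. "young"

# Ordered factor: by default R switches to polynomial contrasts (.L, .Q)
age_o <- factor(age, levels = c("young", "middle-aged", "old"), ordered = TRUE)
contrasts(age_o)

# Equal steps are only imposed if age enters as a single numeric predictor
age_num <- as.integer(age_f) - 1   # young = 0, middle-aged = 1, old = 2
age_num
```

Note that with three levels the linear and quadratic terms together still allow unequal steps, so declaring the factor ordered changes what the coefficients mean but not the fit.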
-- dan
Daniel McCloy
http://dan.mccloy.info/
Postdoctoral Research Associate
Institute for Learning and Brain Sciences
University of Washington
On Wed, Apr 6, 2016 at 6:22 AM, Saudi Sadiq <ss1272 at york.ac.uk> wrote:
Hi all,
I am analysing a dataset 'qaaf' (attached) using logistic regression.
The dataset includes:
1. speaker: participants in my study
2. item: words as used by my participants
3. gender: independent variable (2 levels: 'female' and 'male')
4. age.group: independent variable (3 levels: 'middle-aged', 'old' and
'young')
5. education: independent variable (3 levels: 'postgraduate', 'secondary
or below' and 'university')
6. residence: independent variable (3 levels: 'migrant', 'urbanite' and
'villager')
7. convergence: the dependent variable (whether a speaker uses a CA or MA
form). Here, I am testing whether my participants use the CA form or not.
This is the form of the prestigious dialect in Egypt. If they use MA, this
means that they use their traditional dialect. I am trying to find out
which factor (independent variable) is responsible or more responsible for
using the CA form.
As the target is CA and this (alphabetically) takes the 0 value,
I re-levelled the dependent variable (convergence) to change the value of
CA from 0 to 1, as follows:
(a) attach(qaaf)
(b) qaaf$convergence = factor(convergence, levels=c('MA', 'CA'))
I also re-levelled these variables:
(c) qaaf$education=factor(education, levels=c("secondary or below",
"university", "postgraduate"))
(d) qaaf$residence = factor(residence, levels=c('villager', 'migrant',
'urbanite'))
(e) qaaf$age.group = factor(age.group, levels=c('young', 'middle-aged',
'old'))
I re-levelled the variables in (c), (d) and (e) because these are ordinal
variables (e.g. old people were middle-aged one day and before that had
been young). My question may be general:
Q: Does changing the reference level cause any difference in results?
or
Q: Is leaving the variable levels alphabetically arranged good or bad? Put
another way, when should levels be left alphabetically arranged and when
should they be re-levelled?
Best
--
Saudi Sadiq,
Assistant Lecturer, English Department,
Faculty of Al-Alsun, Minia University,
Minia City, Egypt &
PhD Student, Language and Linguistic Science Department,
University of York, York, North Yorkshire, UK,
YO10 5DD
http://york.academia.edu/SaudiSadiq
https://www.researchgate.net/profile/Saudi_Sadiq