Mixed-model polytomous ordered logistic regression?
On Monday, 03 January 2011 at 19:54 +0100, Rune Haubo wrote:
Hi Emmanuel, You can use the clmm() function in package ordinal to fit cumulative link mixed models, which include the proportional odds mixed model that you mention. The non-mixed equivalent, clm(), gives the same answers as polr() and glm() on your example data. clmm() currently only allows for one random intercept, but Ken Knoblauch proposed a polmer() function (http://finzi.psych.upenn.edu/R-sig-mixed-models/2010q2/003778.html) that seems to be an extension of glmer() similar to your glm() hack, but for mixed models.
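(For reference, a minimal sketch of the corresponding calls; clm() and clmm() are from the ordinal package as described above, but the grouping factor Id in the commented-out clmm() call is purely hypothetical, since the example data below contain no grouping variable:)

library(MASS)     ## polr()
library(ordinal)  ## clm(), clmm()

fit.polr <- polr(Cat ~ X, data = Data)  ## proportional odds, fixed effects only
fit.clm  <- clm(Cat ~ X, data = Data)   ## should give the same estimates as polr()

## Cumulative link mixed model with one random intercept for a hypothetical
## grouping factor Id (lme4-style random-effects syntax):
## fit.clmm <- clmm(Cat ~ X + (1 | Id), data = Data)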
That was my point, and Ken Knoblauch has been kind enough to send me his code, which seems quite good. I have not yet worked on his code (and probably won't for the next month at least), but it's probably a good start. The "800 degrees of freedom" problem still bothers me, but that's another problem (see below).
This allows for several random effects in a glmer() fashion. On the other hand, with clmm() you get a variety of link functions and extensions to scale and nominal effects for modelling non-proportional effects, structured thresholds etc.
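(Again just a sketch of what those extensions look like as calls, assuming a reasonably recent version of the ordinal package:)

clm(Cat ~ X, data = Data, link = "cloglog")           ## alternative link function
clm(Cat ~ X, scale = ~ X, data = Data)                ## scale (dispersion) effects
clm(Cat ~ 1, nominal = ~ X, data = Data)              ## non-proportional (nominal) effects
clm(Cat ~ X, data = Data, threshold = "equidistant")  ## structured thresholds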
Thank you for the hint. I'll have a look at that. I have trouble remembering which of the 2500+ R packages available on CRAN will solve my problem-of-the-moment. Furthermore, I tend to trust more the packages doing something I've personally tinkered with. Hence my current infatuation with BUGS (i.e. JAGS for the time being)...

But my point wasn't to solve a specific problem, but to point out that ordinal data are a sufficiently frequent case to deserve "first-class treatment". I am old enough to have been taught what was then called "the general linear model", which was about the only case where a multivariate model could be realistically fitted. Things tended to expand a bit with censored data and the introduction of the Cox model, then with Poisson and logistic regression. I tend to think of ordered data as another kind of data that should be analysable the same way. That's what polr() partially does (with a limited but wisely chosen set of options). Similarly, mixed models became *practically* fittable first for the dreaded Gaussian case, then for the Poisson, binomial and negative-binomial cases. Ordered data should be analysable the same way. Another case (not directly in my current interests, but one never knows...) is categorical data. Log-linear models allow for a similar treatment, and should also be analysable in the mixed-model case.

A lot is to be learned from the Bayesian models for this kind of problem. I'm currently exploring the (now somewhat outdated) Congdon textbooks on Bayesian modelling, and they raise interesting ideas about what should be doable from a frequentist point of view... On the other hand, I'm more and more inclined to think that the Bayesian point of view is extremely valid. The first difficulty is to convince journals' editors that "p-value is crack" and that a posterior joint distribution says all that can be learned from an experiment/observation, provided the priors are correctly chosen. The second (and harder) is to find ways to express Bayesian probability models more concisely than what is currently allowed by BUGS: one should not have to type dozens of almost-identical snippets of code for each and every intermediate parameter appearing in such a model. My current idea is that some sort of macro facility should help, but it is hard to design *correctly*. The third (and hardest) difficulty is to implement that efficiently. I'm a bit struck to see all "public" BUGS implementations using only one CPU of a 4- or 8-CPU machine, even for parts (model updating of parallel chains) that are "embarrassingly parallel" ((C) D. Spiegelhalter & al.).

In our present case, the Bayesian answer "feels" more reasonable than the answer given by polr/glm, in particular because the variability estimation of the thresholds seems to avoid the artefact induced by a non-centered independent variable...
The idea of reformulating the cumulative link model as a binomial generalized linear model is good, but I haven't seen it described before - can I ask you which exercise, in what edition of Gelman & Hill (2007), mentions this?
Ex 15.1, section 15.5 of Gelman & Hill (2007), p. 342. The printer's info at the beginning of the book (just after the full title page, but I don't know what this page is called in American usage) says it's "First published 2007//Reprinted with corrections 2007//10th printing 2009". I am not aware of any other edition. HTH, Emmanuel Charpentier
Cheers, Rune
On 30 December 2010 23:27, Emmanuel Charpentier <emm.charpentier at free.fr> wrote:
Dear list, Following a hint from an exercise in Gelman & Hill's (2007) regression textbook, I tried to understand how to build a function generalizing to mixed models what polr() (MASS) does for fixed-effects models. I started with a very simple artificial dataset (see at end). Using the simplest polr() call:
summary(polr(Cat~X, data=Data))
Re-fitting to get Hessian
Call:
polr(formula = Cat ~ X, data = Data)
Coefficients:
Value Std. Error t value
X 9.557 1.821 5.247
Intercepts:
Value Std. Error t value
C1|C2 14.8668 3.0447 4.8828
C2|C3 24.2772 4.7119 5.1523
C3|C4 34.2367 6.5861 5.1984
C4|C5 43.3543 8.2390 5.2621
C5|C6 53.8174 10.3240 5.2128
C6|C7 63.5247 12.2352 5.1920
C7|C8 72.5850 13.7899 5.2636
C8|C9 82.2256 15.8144 5.1994
Residual Deviance: 55.7395
AIC: 73.7395
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
The following snippet:

Glm.polr <- local({
    ## Expand the data: one binary row per (observation, threshold) pair,
    ## with Cat recoded as the indicator of being above the threshold.
    ll <- length(lev <- levels(Data$Cat))
    thr <- paste(lev[-ll], "|", lev[-1], sep = "")
    Dataset <- do.call(rbind,
                       lapply(2:ll,
                              function(i) {
                                  data.frame(Thr = thr[i - 1],
                                             Cat = Data$Cat >= lev[i],
                                             Data[, names(Data)[!(names(Data) %in% c("Thr", "Cat"))]])
                              }))
    Dataset$Thr <- factor(Dataset$Thr)
    ## One intercept per threshold, a common slope for X (proportional odds):
    glm(Cat ~ 0 + Thr + X, data = Dataset, family = binomial(link = logit))
})
allowed me to reproduce my reference polr() results (up to numerical precision, with which I didn't tinker, and a couple of cosmetic details: names and sign of the intercepts), both for central values and variability:
summary(Glm.polr)
Call:
glm(formula = Cat ~ 0 + Thr + X, family = binomial(link = logit),
data = Dataset)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.09822 -0.00001 0.00000 0.00000 1.92014
Coefficients:
Estimate Std. Error z value Pr(>|z|)
ThrC1|C2 -14.858 3.050 -4.872 1.11e-06 ***
ThrC2|C3 -24.263 4.722 -5.138 2.78e-07 ***
ThrC3|C4 -34.217 6.600 -5.184 2.17e-07 ***
ThrC4|C5 -43.329 8.257 -5.247 1.54e-07 ***
ThrC5|C6 -53.786 10.346 -5.199 2.01e-07 ***
ThrC6|C7 -63.489 12.260 -5.179 2.24e-07 ***
ThrC7|C8 -72.542 13.820 -5.249 1.53e-07 ***
ThrC8|C9 -82.174 15.852 -5.184 2.18e-07 ***
X 9.551 1.825 5.233 1.67e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1109.035 on 800 degrees of freedom
Residual deviance: 55.737 on 791 degrees of freedom
AIC: 73.737
Number of Fisher Scoring iterations: 12
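(A quick way to check the correspondence between the two parameterisations, as a sketch: polr() reports thresholds zeta_j such that logit P(Y <= j) = zeta_j - beta*X, while the glm() reformulation estimates logit P(Y > j) = Thr_j + beta*X, so each Thr_j should be approximately -zeta_j:)

fit.polr <- polr(Cat ~ X, data = Data)
cbind(polr = c(-fit.polr$zeta, coef(fit.polr)),  ## minus the thresholds, then the slope
      glm  = coef(Glm.polr))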
One should note that "large" thresholds (in absolute value) have larger standard errors, which is probably an artifact that I don't know how to get rid of. The glm claim of "800 degrees of freedom" is ridiculous: I have only 100 observations (the expanded dataset has 8 binary rows per original observation, hence 800 rows, but these rows are obviously not independent). More on this below...
Now, extending this to mixed models (using lme4's notation as a model) is not absolutely trivial:

The "fixed effect" part of the formula must *not* have an intercept (that's what the thresholds become): how to ensure this?

In the "random effect" part, the intercepts can be replaced by the thresholds (one can think of legitimate reasons for which thresholds should depend on a random factor...). Should that also be done for implicit intercepts? E.g. (X | Id) implicitly means, in a glmer() formula, "variation of the slope of X AND of the intercept according to Id"; should that become "variation of the slope AND of the thresholds according to Id"?

How should the tinkering with the original data be handled? Systematically copying the original data can be heavy, but is safe. Can the built-in "copy-on-write" mechanisms of R be sufficient?

What "other" models would be useful in such a function? Cloglog? "Continuation"? Others?

More generally, shouldn't such a function (polrer?) be part of some hypothetical (Platonic?) lme4? Ordered data are important in many cases where numerical expressions cannot be trusted, and the ability to fit mixed models to these data would be important. But doing this on a case-by-case basis is somewhat error-prone (especially when done by me :).
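For what it's worth, here is a hedged sketch of what the glmer()-based extension could look like, assuming the lme4 package, a purely hypothetical grouping factor Id (the example data below have none), and a data frame expanded as in the Glm.polr snippet above (here called Dataset, with an Id column added):

library(lme4)
## Random intercept per Id, common thresholds and slope:
glmer(Cat ~ 0 + Thr + X + (1 | Id),
      data = Dataset, family = binomial(link = "logit"))
## Letting the thresholds themselves vary between levels of Id would use
## (0 + Thr | Id) instead of (1 | Id), as discussed above.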
For now, I cope with BUGS and the Bayesian interpretation. And herein lies a possibly important remark.
My BUGS code (JAGS dialect, inspired by the "Bones" example of Classic BUGS vol. I) and the R call (through rjags) are at the end.
The results call for some comments. Both graphical examination and some
formal inspection show good convergence (the predictions are somewhat
harder to stabilize...):
gelman.diag(Mod$Mod.coda[,-(1:100)]) ## First 100 columns are
## predicted values for the
## original data.
Potential scale reduction factors:
Point est. 97.5% quantile
beta.1 1.00 1.00
deviance 1.00 1.00
gamma[1] 1.00 1.00
gamma[2] 1.00 1.00
gamma[3] 1.00 1.01
gamma[4] 1.00 1.00
gamma[5] 1.00 1.00
gamma[6] 1.00 1.00
gamma[7] 1.00 1.00
gamma[8] 1.00 1.00
Multivariate psrf
1.00
The fitted values and their variability are cosmetically quite different (the Bayesian model's gamma[j] is the equivalent of the ratio threshold/(X coefficient) of the polr/glm frequentist model), but their central values are quite consistent with the previous ones:
summary(Mod$Mod.coda[,-(1:100)])
Iterations = 11010:21000
Thinning interval = 10
Number of chains = 3
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
beta.1 8.927 1.68312 0.030729 0.027433
deviance 64.674 4.26684 0.077902 0.087455
gamma[1] 1.544 0.08815 0.001609 0.001649
gamma[2] 2.544 0.10957 0.002000 0.002065
gamma[3] 3.580 0.09930 0.001813 0.001708
gamma[4] 4.544 0.08445 0.001542 0.001607
gamma[5] 5.630 0.11998 0.002191 0.002137
gamma[6] 6.627 0.12406 0.002265 0.002463
gamma[7] 7.606 0.11175 0.002040 0.002019
gamma[8] 8.608 0.18558 0.003388 0.003343
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
beta.1 6.180 7.741 8.799 9.878 12.582
deviance 58.271 61.458 64.116 67.112 74.836
gamma[1] 1.363 1.489 1.548 1.604 1.706
gamma[2] 2.333 2.469 2.541 2.616 2.765
gamma[3] 3.386 3.512 3.582 3.647 3.779
gamma[4] 4.385 4.487 4.543 4.600 4.713
gamma[5] 5.406 5.545 5.630 5.713 5.870
gamma[6] 6.358 6.550 6.633 6.715 6.851
gamma[7] 7.392 7.529 7.603 7.682 7.821
gamma[8] 8.266 8.470 8.608 8.738 8.967
One should note, however, that, while the central value and sampling SD of beta are consistent with the polr/glm results, the SD/mean ratios of the thresholds are much smaller than with polr/glm, and this is reflected in the credible intervals. Furthermore, the SDs do not seem to increase as much with the absolute value of the central values. (Yes, I am aware that centering my X values would reduce this phenomenon, but I kept them uncentered to illustrate my point.)
For example, the 0.95 confidence interval of the C1|C2 threshold is about 11.9 wide (about 80% of its point estimate), whereas the 0.95 credible interval of gamma[1] is (1.36, 1.71), 0.34 wide, which is about 22% of its mean. For the C8|C9 threshold, the CI95/mean ratio is about 75%, while the CrI95/mean ratio of gamma[8] is about 8%.
Given that the two models are respectively a frequentist and a Bayesian fit of the very same probability model, this is perplexing.
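(A quick way to tabulate these relative widths directly; a sketch that assumes the Glm.polr and Mod objects created by the snippets in this message:)

## 95% Wald CIs from the glm reformulation, relative to the point estimates:
ci <- confint.default(Glm.polr)
rel.ci <- (ci[, 2] - ci[, 1]) / abs(coef(Glm.polr))
## 95% credible intervals from the JAGS fit, relative to the posterior medians:
q <- summary(Mod$Mod.coda[, -(1:100)])$quantiles
g <- grep("gamma", rownames(q))
rel.cri <- (q[g, "97.5%"] - q[g, "2.5%"]) / abs(q[g, "50%"])
round(cbind(rel.ci = rel.ci[1:8], rel.cri = rel.cri), 2)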
Are there reasons to suspect that the SDs reported by glm and/or polr are inflated (for example, these models do not seem to account for the re-use of the same data for the estimation of the various thresholds)? Or is there a (maybe not so subtle) error in the Bayesian interpretation of the model in BUGS?
Or am I barking up the wrong tree ?
Any idea would be welcome ...
Sincerely yours,
Emmanuel Charpentier
The artificial data:
Data <-
structure(list(X.real = c(3.49764437862872, 6.90659633160501,
2.49335390384994, 7.24293843863257, 0.803964727576124, 4.04213878877333,
1.06214939381180, 0.554054259416088, 4.08729099177371, 2.52629057633986,
3.24127380430347, 1.11891797458651, 0.530158326675324, 9.13663197362062,
4.50887509479721, 8.28526961997477, 4.88346597634514, 7.1687905770683,
8.30770739436844, 4.10970688126136, 0.926142793485848, 1.60153659527237,
7.9980129347903, 7.09517767824342, 3.13154316634710, 3.73947230927693,
2.80221854118421, 6.59984440492743, 2.61928305396566, 7.01599863310761,
2.24968126253923, 1.58829945203618, 1.90902984485251, 5.12920046748922,
9.2118070532461, 9.08208307216233, 4.43550132482185, 8.05729871351512,
1.18297174176127, 8.46910981252808, 4.43676303030525, 6.4418454794069,
4.53446029962874, 0.699769024165557, 8.46097165332483, 4.53799448354545,
5.27431575051651, 7.54335240307173, 2.17309956012568, 5.88721264347949,
9.4071993099453, 1.88089249201845, 5.3721556593878, 0.771888320064126,
8.1072438324685, 6.31264339382069, 2.00734973388480, 1.55494366180340,
7.02181982951623, 2.30599238507652, 0.527109325928194, 3.66541005952039,
5.5545317236498, 1.09950173374848, 0.589535171670769, 3.53540683843517,
4.45889129379049, 6.80170522713363, 7.37081649369139, 9.4889221652781,
7.19282466718394, 6.13643707306052, 1.35088176999453, 3.16405426710045,
3.32833977017296, 6.08074276163767, 1.63873091713528, 1.46557732912316,
4.91770214161358, 0.938513984685997, 1.48740958479596, 2.01597792723018,
4.21291073575364, 5.43887605123115, 4.70221174192061, 1.67668472528311,
5.24846045233372, 9.0111473276549, 7.23671006856042, 3.72080675801418,
4.39171014756293, 4.81283564404103, 2.41342165250038, 3.82321065255632,
7.669782577978, 9.2234752201292, 0.641411897606775, 1.13470612170815,
1.02966988829139, 1.24689684186557), Cat = structure(c(3L, 7L,
2L, 7L, 1L, 4L, 1L, 1L, 4L, 3L, 3L, 1L, 1L, 9L, 5L, 8L, 5L, 7L,
8L, 4L, 1L, 2L, 8L, 7L, 3L, 4L, 3L, 7L, 3L, 7L, 2L, 2L, 2L, 5L,
9L, 9L, 4L, 8L, 1L, 8L, 4L, 6L, 5L, 1L, 8L, 5L, 5L, 8L, 2L, 6L,
9L, 2L, 5L, 1L, 8L, 6L, 2L, 2L, 7L, 2L, 1L, 4L, 6L, 1L, 1L, 4L,
4L, 7L, 7L, 9L, 7L, 6L, 1L, 3L, 3L, 6L, 2L, 1L, 5L, 1L, 1L, 2L,
4L, 5L, 5L, 2L, 5L, 9L, 7L, 4L, 4L, 5L, 2L, 4L, 8L, 9L, 1L, 1L,
1L, 1L), .Label = c("C1", "C2", "C3", "C4", "C5", "C6", "C7",
"C8", "C9"), class = c("ordered", "factor")), X = c(3.78512255820666,
7.03710649735623, 2.37579547880536, 7.01687069036889, 0.572774665466921,
4.53333128338774, 0.92564311992818, 0.682074690286214, 4.17197303371935,
2.62951212025962, 3.44249595134830, 1.02688214798057, 0.440959503725193,
9.05785834195666, 4.52616407157106, 8.07448237191856, 4.95371782896727,
7.35730414103575, 8.07638469674326, 4.26407495806574, 1.06385684713162,
1.63333280184271, 7.9246178454581, 7.41541196525404, 3.12459209709868,
3.64101167776984, 2.51917629411974, 6.8118629850261, 2.77786400733997,
7.36566898280035, 2.03009770691198, 1.61723070290973, 1.96657910701987,
5.22600151472831, 9.21386030948403, 9.18902347324618, 4.39493834305473,
8.27266371630105, 1.15477068640133, 8.15280513569419, 4.6734050583675,
6.51664088906166, 4.40955269905174, 0.54646886424704, 8.03605092676131,
4.36779957410513, 5.44860695752003, 7.83429549417553, 1.84916430642593,
5.81109382387844, 9.32133297147632, 1.73276033539495, 5.3490390105937,
0.695781216499173, 8.05603266764859, 6.67741469458807, 2.03017543225991,
1.60754713124494, 6.63481861913185, 2.20004584719421, 0.675994410762504,
3.98953568167253, 5.45631429591027, 0.675175153754915, 0.305522085250977,
3.58440905189674, 4.58557708778191, 6.99911313460102, 7.21102458968007,
9.44650050070282, 7.33311596756764, 5.87485729872331, 1.11824133536435,
3.40701638829918, 3.36818213997361, 5.92742193489156, 2.08701427718401,
1.73709762230381, 4.71589786549998, 0.980248052739725, 1.58957599464140,
2.17701413093404, 4.22999685414555, 5.84947271423299, 4.94396458250251,
1.65133703600731, 5.17336780283038, 8.8993318875765, 7.29238547003421,
3.62020801752199, 4.28550112286315, 4.95543088867053, 2.61555021108721,
3.85711807608966, 7.67091954093986, 9.24609859359625, 0.436967673925767,
1.32063426377087, 0.794911931890139, 1.16453130413352)), .Names = c
("X.real",
"Cat", "X"), row.names = c(NA, -100L), class = "data.frame")
The R/BUGS Bayesian model:

library(rjags)  ## jags.model(), coda.samples()

system.time(Mod <- local({
    Mod <- function() { ## BUGS model written as an R function. Easy to manage.
        ## Thresholds: vague priors, sorted to enforce the ordering constraint.
        for (j in 1:ncut) {
            gamma.raw[j] ~ dnorm(0, 1.0E-4)
        }
        gamma <- sort(gamma.raw)
        ## p[i,j] = Pr(Cat[i] > j)
        ## q[i,j] = Pr(Cat[i] = j)
        ##        = Pr(Cat[i] > j-1) - Pr(Cat[i] > j)
        ##        = p[i,j-1] - p[i,j]
        for (i in 1:nobs) {
            Cat[i] ~ dcat(q[i, 1:(ncut+1)])
            ## Predictions
            Cat.new[i] ~ dcat(q[i, 1:(ncut+1)])
            q[i, 1] <- 1 - p[i, 1]
            for (j in 2:ncut) {
                q[i, j] <- p[i, j-1] - p[i, j]
            }
            q[i, ncut+1] <- p[i, ncut]
            for (j in 1:ncut) {
                logit(p[i, j]) <- beta.1 * (X[i] - gamma[j])
            }
        }
        beta.1 ~ dnorm(0, 1.0E-4)
    }
    tmpf <- tempfile()
    write.model(Mod, tmpf) ## Auxiliary function (e.g. from R2WinBUGS) writing the function body to a file
    Mod.jags <- jags.model(tmpf,
                           data = with(Data, {
                               Y <- outer(Cat, levels(Cat)[-1], ">=")
                               list(X = X,
                                    Cat = unclass(Cat),
                                    nobs = nrow(Y),
                                    ncut = ncol(Y))
                           }),
                           ## "Reasonable" (not too overdispersed) initial
                           ## values are *CRUCIAL* to model initialisation
                           inits = function() {
                               list(gamma.raw = sort(runif(8, min = -1, max = 1)),
                                    beta.1 = runif(1, min = 0, max = 1))
                           },
                           n.chains = 3)
    update(Mod.jags, n.iter = 10000)
    unlink(tmpf)
    Mod.coda <- coda.samples(Mod.jags,
                             variable.names = c("gamma",
                                                ## "beta.0",
                                                "beta.1",
                                                ## "gamma.adj",
                                                ## "lzeta",
                                                ## "pi.sgn",
                                                "deviance",
                                                "Cat.new"),
                             thin = 10,
                             n.iter = 10000)
    list(Mod.jags = Mod.jags, Mod.coda = Mod.coda)
}))
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models