
Hierarchical Psychometric Function in BRMS

Hey James,

thank you for these details. Step by step:

"1) Yes, essentially. So there are 7 tasks, some have two conditions. One
has four conditions. This is the "condition" in the model. "Norm" is the
normalized response window."
R1) I am sorry, I do not understand this. Does "condition" indicate the 14
task-condition combinations (i.e., a factor with 14 levels), or the "some
have two, some have four conditions" part? If it is the latter, why did you
not include 7 "tasks" instead? In any case, I would actually suggest using
the 14 combinations as "condition", because the design matrix is not fully
crossed (i.e., no factorial design, just all tasks; you can still perform
post-hoc comparisons).

2) The term "response window" is not self-explanatory, but I assume you
mean "time pressure" by this (how long do I have to give a response), and I
will refer to it as such from now on.
2b) Given that "norm" is "time", I can finally see where you want to go.
(Please correct me if I am wrong.)

3. No offense intended; my choice of words was a bit clumsy. I mean that a
clarification of the research question, or of the psychological hypothesis
about which measure should predict which other measure, is always helpful
for judging a model's appropriateness. As noted, I get a grip on it now: it
seems you want to predict decision accuracy ("response") from the task
("condition") and the time provided to solve the task ("norm"), where
"norm" is a time window that changes dynamically depending on accuracy
(tailored testing). Spelling this out reveals a circular causation:
accuracy -> time window -> accuracy? It would be good to search for a
reference paper which used an equivalent design (not just a psychometric
function). But to put it this way: accuracy ("response") is not really
informative, because the tasks (if they are tailored) are -specifically
designed- so that each participant ends up at about 75% accuracy. That is,
nobody's accuracy can tell you much about ability, because everybody will
be at roughly 75% regardless of which threshold (e.g., 70% or 80%) you
check against. What IS informative is how much time they need to achieve
this. The underlying assumption is that there is a level of "processing
speed" just before I become perfectly accurate, and the goal is to find
this point: if I WOULD (otherwise) be perfectly accurate in every task, my
ability is unidentifiable (because the tasks were not difficult enough, or
statistically speaking: no variance), but if I was only guessing, then any
model about me is uninformative (guessing model).

3b. In other words, if you are searching for a latent ability that you want
to describe continuously in your sample, then the "response window" (time
needed) is the indicator: slow participants = low ability; quick
participants = high ability.
In Item Response Theory you usually estimate the ability while presenting
the same tasks to all participants (fully crossed), which allows you to
estimate task difficulty (instead of manipulating it), and I would suggest
searching for related model solutions in this area. (I am not experienced
in tailored testing.)

4. If you standardize the measurements within each of the four sessions,
then I would say there is no reason to further include the term in the
model. This, however, is a matter of theoretical rather than statistical
debate. One theoretical counter-argument: if you do not standardize the
measures, but simply include the time points as fixed effects in the model,
then you gain information (i.e., about the time effect) without altering
the content of your model (although you change a fixed assumption into a
freely estimable one). You could then also take into account that some
participants improve more quickly than others, which would be a reasonable
thing to do if you think that this is the case.
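As a hedged sketch of that counter-argument (the data frame `dat` and the
column names `norm`, `condition`, `session`, `subjectID` are my
assumptions, not your actual variables), including the time points instead
of standardizing could look like:

```r
library(brms)

# Session (time point) as a fixed effect, plus a by-participant
# random slope so that some participants may improve faster than others.
fit_time <- brm(
  norm ~ condition + session + (1 + session | subjectID),
  data   = dat,
  family = gaussian()
)
```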

5. What Treutwein and Strasburger write is, first, mainly about logistic
functions, which have the most basic form of a one-parameter Rasch model.
Make a two-parameter Rasch model out of it, and you have the functional
form of standard logistic regression, as also performed by "lmer" and
"brms" if you write something like:
DV ~ Interceptvariable * Continuousvariable + (1 | subjectID) + (1 | trialID),
family = binomial(link = "logit")
with two differences: 1) the R packages use a different parameterization
(e.g., dummy coding); 2) in Rasch models (or Item Response Theory) you
estimate the model terms based on items and individuals, rather than
predicting the DV based on conditions and measurements (here is a paper
that investigates the relation between logistic models of accuracy and item
response theory: Dixon, 2008, Models of accuracy in repeated-measures
designs). This should help you get a "feeling" for the logistic function.
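For concreteness, a minimal brms sketch of such a logistic regression (the
data frame `dat` and the column names are assumptions on my side; brms uses
`bernoulli()` rather than `binomial()` for 0/1 responses):

```r
library(brms)

# Accuracy predicted by condition, time pressure, and their interaction,
# with crossed random intercepts for participants and trials.
fit_logistic <- brm(
  response ~ condition * norm + (1 | subjectID) + (1 | trialID),
  data   = dat,
  family = bernoulli(link = "logit")
)
```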

Then, what Treutwein and Strasburger introduce can also be found in every
textbook, namely gamma, which is a guessing parameter
(gamma + 1/(1 + exp(...))). It says the model cannot predict 0 accuracy
unless gamma = 0, because something will always be 'correct' by chance.
Secondly, however, adding gamma alone would lead the model to predictions
larger than 1, which is why the scaling factor (1 - gamma) is involved.
Third, the model assumes that 100% accuracy might not be reached (for
whatever reason), and lambda is introduced to scale the model down again,
giving gamma + (1 - gamma - lambda) * (...), which means the output of the
logistic function 1/(1 + exp(-beta * (x - theta))) is squashed between
gamma and 1 - lambda.

Unfortunately, if you tried to estimate one value each for gamma, lambda,
and beta (or 1/sigma) for a single participant, the model would simply be
unidentifiable, because predicting a participant's average behavior (or
deviation from something else) of, say, 70% can be achieved by lambda = .3
(with gamma = 0 and the logistic part near 1), or by gamma = .7 (with
lambda = 0 and the logistic part near 0), or by beta * (x - theta) = .847
(with gamma = 0 and lambda = 0) -- you see where this is going, right? I
agree that it might be reasonable to assume that participants "guess"
sometimes, but this is not a matter of estimation but a matter of your
task: in a binary task gamma = .5 (lowest probability of being correct); in
a task with three response options gamma = 1/3. Measurement not required,
just statistics.

And the lambda parameter, finally, is not necessary, because on the
individual level it is (almost) redundant with beta (or 1/sigma) -- coming
back to my initial argument. On average it might sometimes "look like" you
can draw a horizontal line at p = .8 which the logistic function (on
average) approaches, and one could argue this justifies assuming a ceiling
of 1 - lambda = .8. However, simply assuming hierarchical variation in beta
(or 1/sigma), either within a participant across trials and/or tasks, or
within a task across participants, will on average never predict p = 1
without lambda being required, and thus provides a "natural" performance
cap, measured in terms of variation, not in terms of lambda. Having both,
again, is not identifiable (in addition to the issues above). Also, IF
"guessing" varied between participants, then, I would argue, one should
think about the number of trials (or which trials) in which they guess, not
about the percentage correct while guessing (which is defined by the task
at hand).
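To make the squashing concrete, here is a small sketch of the
four-parameter psychometric function in the parameterization discussed
above (the function and variable names are mine):

```r
# Four-parameter psychometric function: guessing rate gamma,
# lapse rate lambda, slope beta, threshold theta.
psychometric <- function(x, gamma, lambda, beta, theta) {
  gamma + (1 - gamma - lambda) / (1 + exp(-beta * (x - theta)))
}

x <- seq(-10, 10, by = 0.1)
p <- psychometric(x, gamma = 0.5, lambda = 0.02, beta = 1, theta = 0)
range(p)  # bounded between gamma = .5 and 1 - lambda = .98
```

Setting gamma = .5 and lambda = 0 recovers a plain two-alternative logistic
function without a lapse parameter.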

6. Finally, all that being said, I would suggest you use this model:

thresholds <- bf(
  norms ~ 0 + ability + task,
  ability ~ 0 + (1 | subjectID),
  nl = TRUE
)

## the time taken to reach 75% accuracy (i.e., "norms") is predicted by the
participant's 'constant' ability, while including variation over tasks
(depending on the task).
# task estimates task difficulty -- it should be a factor coding all 14
tasks (you can still compare them directly afterwards)
# ability is a "linear" predictor, freely estimated, one per participant
# without intercepts (i.e., the 0 in front of the formulas), the task
effects will be interpretable as task-specific intercepts (like grand
thetas) and the abilities are centered around 0. If you "scale" norms
beforehand (i.e., across tasks, not within) to SD = 1, then the prior for
"ability" should be Gaussian(0, 1) as well. Voila, a very simple
measurement model :). You could include more terms, like time point, to
control/test for training effects.

Afterwards you can get the task and participant posterior estimates for
ability (I think) like this:
posterior_samples(modeloutput)
with different indices for the participants in the matrix. You can then
also directly compare single task estimates with each other (and get Bayes
factors to check whether their difficulties differ, using a "slab-only"
approach instead of "spike-and-slab"; check the recent work of Rouder).
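Put together, a hedged sketch of the whole fit (the data frame `dat` and
its columns are assumptions on my side; note that brms nonlinear formulas
expect numeric data or declared parameters, so in this rewriting the
14-level task factor enters via its own parameter formula, here called
"difficulty"):

```r
library(brms)

# Nonlinear formula: per-participant ability plus task-specific intercepts.
thresholds <- bf(
  norms ~ ability + difficulty,
  ability    ~ 0 + (1 | subjectID),
  difficulty ~ 0 + task,
  nl = TRUE
)

fit <- brm(
  thresholds,
  data   = dat,
  family = gaussian(),
  # With norms scaled to SD = 1 across tasks, a standard-normal prior
  # on ability makes this a simple measurement model.
  prior  = prior(normal(0, 1), nlpar = "ability") +
           prior(normal(0, 5), nlpar = "difficulty")
)

posterior_samples(fit)  # task and participant posterior estimates
```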

I cannot see right now why this should be any more complicated :), as it
provides you with the information you want -- "how much ability the
participant has" -- based on reaching the tailored-testing performance of
75% accuracy under a specific amount of time pressure, while controlling
for task difficulty. This should also lower the computational
requirements :)


Otherwise, if you can provide a paper which estimated:
item difficulty (i.e., trial-wise), based on time pressure...
task difficulty (the 14 ones)
participant ability (unknown)
based on binary responses
in a tailored testing design

then please let me know. Sounds interesting in any case.

At least this is what I would say 'spontaneously' :))

Hope this helps,
Best, René


On Mon., 16 March 2020 at 22:47, Ades, James <
jades at health.ucsd.edu> wrote:
