Modeling precision and recall with GLMMs
Hi Jake and Daniel, I was being rather obtuse in my earlier reply: what you both suggest certainly makes a lot of sense.
On Thu, 01-01-1970, at 01:00, Jake Westfall <jake987722 at hotmail.com> wrote:
Hi Ramon,
I am not sure that would work. For each data set, each method returns a bunch of "P"s and "N"s. But what I want to do is model not the relationship between truth and prediction, but rather how good or bad each method is (at trying to reconstruct the truth).
I'm not sure I see the problem. Surely the question of "how good or bad" each method is can be answered by examining which method yields the strongest correspondence between truth and prediction. That is the idea behind what I suggest.
Yes, I see it now.
As for the fact that each method returns many data points, again I do not see the problem. You are using a multilevel model after all, right? So it seems to me that within that framework, you have classification decisions (the data) nested in algorithms, which are crossed with datasets. You could in principle use a crossed random effects model, but I think it would make more sense to treat algorithms as fixed.
I agree.
Here's an example of what this might look like. The outcome variable is "decision" (numeric: 0 or 1), the predictors are "truth" (numeric: -1 or 1) and "algorithm" (factor denoting the algorithm). The model could look like: glmer(decision ~ 0 + algorithm/truth + (1|dataset))
In the fixed effects, this syntax estimates separate intercepts and slopes for each algorithm. The intercepts get at response bias, while the slopes get at accuracy. As noted previously, these two estimates can be transformed to precision and recall. You could also reverse decision and truth, so that we have: glmer(truth ~ 0 + algorithm/decision + (1|dataset))
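To make the first specification concrete, here is a minimal sketch of fitting it with lme4. The simulated data and column names simply follow the description above (decision 0/1, truth -1/1, a factor for algorithm, datasets as random intercepts); the effect sizes and the number of datasets/edges are arbitrary illustration values, not anything from the thread:

```r
## Sketch: fit decision ~ 0 + algorithm/truth + (1|dataset), assuming lme4.
library(lme4)

set.seed(1)
n_data <- 20  # number of datasets (arbitrary for the sketch)
d <- expand.grid(dataset   = factor(1:n_data),
                 algorithm = factor(c("A", "B")),
                 edge      = 1:50)
d$truth <- sample(c(-1, 1), nrow(d), replace = TRUE)

## dataset-specific intercepts plus a larger "accuracy" slope for algorithm B
u <- rnorm(n_data, 0, 0.5)
eta <- u[d$dataset] + ifelse(d$algorithm == "A", 0.8, 1.5) * d$truth
d$decision <- rbinom(nrow(d), 1, plogis(eta))

m <- glmer(decision ~ 0 + algorithm/truth + (1 | dataset),
           data = d, family = binomial)
fixef(m)  # per-algorithm intercepts (bias) and slopes (accuracy)
```

The `algorithm/truth` nesting expands to `algorithm + algorithm:truth`, so with the `0 +` each algorithm gets its own intercept and its own truth slope, which is what allows the bias/accuracy reading above.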
This might make more sense given the random effects for datasets, which in this second case allow for different datasets having different base rates of the two classes.
This seems reasonable, in particular since the base rates of the two classes can be very different between datasets. But datasets also affect the "quality" of the signal, the d' in SDT.
In the former case the random intercepts allowed for different datasets to lead to different rates of response bias, which is not crazy but isn't as intuitive to me as the second interpretation.
I also do not find having different response biases by datasets intuitive. However, placing truth as the dependent variable does not seem intuitive to me either. In the first model we have P(1|Signal) or P(1|Noise), but reversing that is awkward to me; then I am not sure what a "residual" would mean, and I am not sure whether the coefficients retain the same meaning (the intercept capturing response bias by algorithm, the easy mapping to recall and precision, or other features explained in, say, "Signal detection theory and generalized linear models" by DeCarlo ---I've googled a little since Daniel's and your last emails :-).

Regardless, this is certainly a really nice way to approach the problem I originally posted. Moreover, I could easily add edge-specific covariates related to how hard it is to correctly infer those edges (i.e., to how small d' is); this would be really neat. I am intrigued because the literature I am familiar with that compares the performance of these types of algorithms (or classification algorithms in general) often ranks them using metrics such as recall, precision, or area under the ROC curve, without directly attempting to model the original responses. So I wonder whether I am missing something obvious.

A more general concern I have (which might explain the previous paragraph) is that I am not sure whether SDT (or what I've been able to speed-read about SDT in the last couple of hours :-) is a good model for the problem. In particular, even though for each data set and algorithm we have hits, misses, false alarms, etc., the yes-or-no decisions are not individual decisions on each single edge of the network, but are based instead on, e.g., minimizing some error function over all edges for a given dataset.

Thanks again for your detailed explanation. Best, R.
Let me know if this makes some sense. Jake
From: Daniel.Wright at act.org
To: rdiaz02 at gmail.com; jake987722 at hotmail.com
CC: r-sig-mixed-models at r-project.org
Subject: RE: [R-sig-ME] Modeling precision and recall with GLMMs
Date: Wed, 12 Mar 2014 14:26:28 +0000

The "how good or bad" each method is, is what will come out of the method Jake is suggesting. Using multilevel models for this has been common in the memory recognition literature in psychology for the last decade or so, but it is also relevant in lots of other areas, like medical diagnostics. If the variable IS_ij is whether person i saw stimulus j (0 not seen, 1 seen), and SAY_ij is whether the person says she saw the stimulus, then a multilevel probit or logit regression, with careful coding of the variables, can mimic the standard SDT models. The critical variable for saying whether people are accurate is the coefficient in front of SAY. If you have different conditions, COND_j, then interactions between COND_j (or COND_ij if varied within subject) and SAY_ij examine whether accuracy varies among them. An important plus of the multilevel models is that the coefficients can vary by person and/or stimulus.
Hi Ramon,
I'm not sure that I fully understand the details of what you want to accomplish. But I do want to ask: you jump right into your email assuming that of course you want to model precision and recall, but what about modelling the data directly (i.e., individual classification decisions) rather than summaries of the data? Then you could work backward (forward?) from the model results to compute what the implied precision and recall would be.
Sorry, I did not provide enough details. I am comparing some methods for reconstructing networks; the true positives and false positives, for instance, refer to the number of correctly inferred edges and to the number of edges that a procedure recovers that are not in the original network, respectively. So the network reconstruction methods model the data directly, and what I want to model is how good or bad what they return is, as a function of several other variables (related to several dimensions of the toughness of the problem, etc.).
If you decided that modelling the data directly would work for your purposes, then one way of doing this would be to regress classification decisions ("P" or "N") on actual classifications ("P" or "N").
I am not sure that would work. For each data set, each method returns a bunch of "P"s and "N"s. But what I want to do is model not the relationship between truth and prediction, but rather how good or bad each method is (at trying to reconstruct the truth).
If this is done in a probit model, it is equivalent to the equal-variance signal detection model studied at length in psychology, with the intercept being the "criterion" in signal detection language (denoted c), and the slope being "sensitivity" (denoted d' or d-prime). It should definitely be possible to compute precision and recall from c and d'.
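To make that last point concrete, here is a small base-R sketch of how one might map c and d' to precision and recall under the equal-variance Gaussian model. The base rate of true signal cases (`p_signal` below, a name I am introducing) is a needed extra input, since precision depends on it; the formulas are the standard ones, z(H) = d'/2 - c and z(F) = -d'/2 - c:

```r
## Equal-variance Gaussian SDT: recover hit and false-alarm rates from
## (d', c), then precision and recall given the base rate of signal cases.
sdt_to_pr <- function(dprime, crit, p_signal) {
  H <- pnorm( dprime / 2 - crit)   # hit rate = recall
  F <- pnorm(-dprime / 2 - crit)   # false-alarm rate
  precision <- (p_signal * H) / (p_signal * H + (1 - p_signal) * F)
  c(precision = precision, recall = H)
}

sdt_to_pr(dprime = 2, crit = 0, p_signal = 0.5)
## with no bias (c = 0) and d' = 2, recall is pnorm(1), about 0.84
```

With c = 0 and equal base rates the hit and false-alarm rates are symmetric, so precision and recall coincide; shifting the criterion trades one off against the other in the usual way.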
I am not familiar with this approach in psychology. As I said above, I am not sure this addresses the problem I want to address, but do you have a pointer to the literature where I can read more about the approach? Best, R.
This might be simpler with a logit rather than a probit link function. Let me know if I have misunderstood what you are trying to accomplish.
Jake
From: rdiaz02 at gmail.com
To: r-sig-mixed-models at r-project.org
Date: Tue, 11 Mar 2014 11:48:57 +0100
CC: ramon.diaz at iib.uam.es
Subject: [R-sig-ME] Modeling precision and recall with GLMMs
Dear All,
I am examining the performance of a couple of classification-like methods under different scenarios. Two of the metrics I am using are precision and recall (TP/(TP + FP) and TP/(TP + FN), where TP, FP, and FN are "true positives", "false positives", and "false negatives" in a simple two-way confusion matrix). Some of the combinations of methods have been used on exactly the same data sets. So it is easy to set up a binomial model (or multinomial2 if using MCMCglmm) such as
cbind(TP, FP) ~ fixed effects + (1|dataset)
However, the left-hand side sounds questionable, especially with precision: the expression TP/(TP + FP) has, in the denominator, a (TP + FP) [the number of results returned, or retrieved instances, etc.] that can itself be highly method-dependent (i.e., affected by the fixed effects). So rather than a true proportion, this seems more like a ratio, where TP and FP each have their own variance, a covariance, etc., and thus the error distribution is a mess (not the tidy thing of a binomial).
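For reference, the binomial model above would be coded along these lines with lme4. This is only a sketch on simulated data: the data frame, column names, and effect sizes are made up for illustration, and the toy data fixes TP + FP at 100 per cell, which is exactly the constant-denominator assumption questioned in the previous paragraph:

```r
## Sketch: precision modelled as a binomial proportion, cbind(TP, FP),
## with method as the fixed effect and a random intercept per dataset.
library(lme4)

set.seed(1)
results <- expand.grid(method  = factor(c("A", "B")),
                       dataset = factor(1:15))
n_ret <- 100  # retrieved instances per cell, i.e. TP + FP (held fixed here)
p <- plogis(ifelse(results$method == "A", 0.5, 1.0) +
            rnorm(15, 0, 0.4)[results$dataset])
results$TP <- rbinom(nrow(results), n_ret, p)
results$FP <- n_ret - results$TP

m_prec <- glmer(cbind(TP, FP) ~ method + (1 | dataset),
                data = results, family = binomial)
fixef(m_prec)
```

In real runs the denominator would vary by method, which is what makes the binomial reading of precision doubtful.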
I've looked around in the literature and have not found much (maybe the problem is my searching skills :-). Most people use rankings of methods, without directly modeling precision or recall on the left-hand side of a (generalized) linear model. A couple of papers use a linear model on the log-transformed response (which I think is even worse than the above binomial model, especially with lots of 0s or 1s). Some other people use a single measure, such as the F-measure or Matthews correlation coefficient, and I am using something similar too, but I specifically wanted to also model precision and recall.
An option would be a multi-response model with MCMCglmm, but I am not sure if this is appropriate either (the dependence of the sum of FP and TP on the fixed effects).
Best,
--
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina
Universidad Autónoma de Madrid
Arzobispo Morcillo, 4
28029 Madrid
Spain
Phone: +34-91-497-2412
Email: rdiaz02 at gmail.com
ramon.diaz at iib.uam.es
http://ligarto.org/rdiaz
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models