caret: Error when using rpart and CV != LOOCV
Dominik, There are a number of formulations of this statistic (see the Kvålseth[*] reference below). I tend to think of R^2 as the proportion of variance explained by the model[**]. With the "traditional" formula, it is possible to get negative proportions (if there are extreme outliers in the predictions, the negative proportion can be very large). I used the correlation-based formulation because it is always on [0, 1]. It is called "R^2" after all! Here is an example:
> set.seed(1)
> simObserved <- rnorm(100)
> simPredicted <- simObserved + rnorm(100)*.1
> cor(simObserved, simPredicted)^2
[1] 0.9887525
> customSummary(data.frame(obs = simObserved,
+                          pred = simPredicted))
      RMSE   Rsquared
0.09538273 0.98860908
> simPredicted[1]
[1] -0.6884905
> simPredicted[1] <- 10
> cor(simObserved, simPredicted)^2
[1] 0.3669257
> customSummary(data.frame(obs = simObserved,
+                          pred = simPredicted))
     RMSE  Rsquared
 1.066900 -0.425169
It is somewhat extreme, but it does happen.
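One way to see why the correlation form stays on [0, 1]: squaring cor(obs, pred) gives the same value as the R^2 from a simple linear regression of the observed data on the predictions, and a regression R^2 cannot be negative. A quick check, rebuilding the simulated data from above:

```r
set.seed(1)
simObserved <- rnorm(100)
simPredicted <- simObserved + rnorm(100) * .1
simPredicted[1] <- 10  # inject the extreme outlier from the example

# cor()^2 is identical to the R^2 of lm(obs ~ pred)
corBased <- cor(simObserved, simPredicted)^2
lmBased <- summary(lm(simObserved ~ simPredicted))$r.squared
c(corBased, lmBased)  # both 0.3669257, never negative
```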
Max
* Kvålseth, T. (1985). Cautionary note about R^2. The American Statistician, 39(4), 279-285.
** This is a very controversial statement when non-linear models are
used. I'd rather use RMSE, but many scientists I work with still think
in terms of R^2 regardless of the model. The randomForest function
also computes this statistic, but calls it "% Var explained" instead
of explicitly labeling it as "R^2". This statistic has generated
heated debates and I hope that I will not have to wear a scarlet R in
Nashville in a few weeks.
On Thu, May 17, 2012 at 1:35 PM, Dominik Bruhn <dominik at dbruhn.de> wrote:
Hi Max,
thanks again for the answer.
I checked the caret implementation and you were right. If the
predictions of the model are constant (i.e. sd(pred) == 0), then the
implementation returns NA for the R^2 (in postResample). This is
mainly because the caret implementation uses `cor` (from the
stats package), which fails for values with sd(pred) == 0.
Do you know why it is implemented this way? I wrote my own
summaryFunction which calculates R^2 by hand, and it works fine. It
nevertheless does NOT(!) generate the same values as the original
implementation, so the two calculations of R^2 are not
consistent. I took mine from Wikipedia [1].
Here is my code:
---
customSummary <- function (data, lev = NULL, model = NULL) {
    # Calculate R^2
    ssTot <- sum((data$obs - mean(data$obs))^2)
    ssErr <- sum((data$obs - data$pred)^2)
    rSquare <- 1 - (ssErr/ssTot)
    # Calculate MSE
    mse <- mean((data$pred - data$obs)^2)
    # Aggregate
    out <- c(sqrt(mse), rSquare)
    names(out) <- c("RMSE", "Rsquared")
    return(out)
}
---
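A quick sanity check of this function (a standalone sketch with made-up numbers, repeating the definition so the snippet runs on its own): when the predictions are constant, cor() yields NA, while the sum-of-squares form still returns a defined value:

```r
customSummary <- function (data, lev = NULL, model = NULL) {
    ssTot <- sum((data$obs - mean(data$obs))^2)
    ssErr <- sum((data$obs - data$pred)^2)
    mse <- mean((data$pred - data$obs)^2)
    out <- c(sqrt(mse), 1 - (ssErr/ssTot))
    names(out) <- c("RMSE", "Rsquared")
    out
}

obs <- c(1, 3, 2, 5, 4)
pred <- rep(3, 5)  # constant predictions, sd(pred) == 0

suppressWarnings(cor(obs, pred)^2)           # NA: zero standard deviation
res <- customSummary(data.frame(obs = obs, pred = pred))
res                                          # RMSE = sqrt(2), Rsquared = 0
```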
[1]: http://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions
Thanks!
Dominik
On 17/05/12 04:10, Max Kuhn wrote:
Dominik, See this line:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37
The variance of the predictions is zero. caret computes R^2 as the squared correlation between the observed data and the predictions, which uses sd(pred), and that is zero here. I believe that the same would occur with other formulas for R^2. Max
On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn <dominik at dbruhn.de> wrote:
Thanks Max for your answer.
First, I do not understand your post. Why is it a problem if two of the
predictions match? From the formula for calculating R^2, I can see that
there will be a division by zero iff the total sum of squares is 0. This is
only true if the predictions for all points from the
test set are equal to the mean of the test set. Why should this happen?
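For the record, rpart itself can produce exactly this situation: when no viable split is found, the fitted tree is just the root node and every prediction is the overall mean. A sketch that forces this (the minsplit value is arbitrary, chosen larger than the data set):

```r
library(rpart)
data(trees)

# minsplit = 100 exceeds the 31 rows of 'trees', so no split is viable
# and the fitted tree is only the root node
fit <- rpart(Volume ~ Girth + Height, data = trees,
             control = rpart.control(minsplit = 100))
unique(predict(fit))  # a single value: every prediction is mean(Volume)
```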
Anyway, I wrote the following code to check what you described:
--
library(caret)
data(trees)
formula <- Volume ~ Girth + Height
customSummary <- function (data, lev = NULL, model = NULL) {
    print(summary(data$pred))
    return(defaultSummary(data, lev, model))
}
tc <- trainControl(method = 'cv', summaryFunction = customSummary)
train(formula, data = trees, method = 'rpart', trControl = tc)
--
This outputs:
---
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.45   18.45   18.45   30.12   35.95   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  22.69   22.69   22.69   32.94   38.06   53.44
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  30.37   30.37   30.37   30.37   30.37   30.37
[cut many values like this]
Warning message:
In nominalTrainWorkflow(dat = trainData, info = trainInfo,
  method = method, ... :
  There were missing values in resampled performance measures.
-----
As I didn't understand your post, I don't know if this confirms your
assumption.
Thanks anyway,
Dominik
On 16/05/12 17:30, Max Kuhn wrote:
More information is needed to be sure, but it is most likely that some of the resampled rpart models produce the same prediction for the hold-out samples (likely the result of no viable split being found). Almost every incarnation of R^2 requires the variance of the prediction. This particular failure mode would result in a divide by zero. Try using your own summary function (see ?trainControl) and put a print(summary(data$pred)) in there to verify my claim. Max On Wed, May 16, 2012 at 11:30 AM, Max Kuhn <mxkuhn at gmail.com> wrote:
On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn <dominik at dbruhn.de> wrote:
Hi, I ran into the following problem when trying to build an rpart model using anything but LOOCV. Originally, I wanted to use k-fold partitioning, but every partitioning except LOOCV throws the following warning:
----
Warning message:
In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, :
  There were missing values in resampled performance measures.
-----
Below are some simplified testcases which reproduce the warning on my system.
Question: What does this error mean? How can I avoid it?
System-Information:
-----
sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rpart_3.1-52   caret_5.15-023 foreach_1.4.0  cluster_1.14.2 reshape_0.8.4
[6] plyr_1.7.1     lattice_0.20-6

loaded via a namespace (and not attached):
[1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0     iterators_1.0.6
[5] tools_2.15.0
-------
Simplified Testcase I: Throws warning
---
library(caret)
data(trees)
formula <- Volume ~ Girth + Height
train(formula, data = trees, method = 'rpart')
---
Simplified Testcase II: Every other CV-method also throws the warning, for example using 'cv':
---
library(caret)
data(trees)
formula <- Volume ~ Girth + Height
tc <- trainControl(method = 'cv')
train(formula, data = trees, method = 'rpart', trControl = tc)
---
Simplified Testcase III: The only CV-method which works is 'LOOCV':
---
library(caret)
data(trees)
formula <- Volume ~ Girth + Height
tc <- trainControl(method = 'LOOCV')
train(formula, data = trees, method = 'rpart', trControl = tc)
---
Thanks!
--
Dominik Bruhn
mailto: dominik at dbruhn.de
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Max
-- Dominik Bruhn mailto: dominik at dbruhn.de