Dear list,
I'm trying to standardize a procedure for comparing the performance of
competing spatial prediction methods. I know that this has been
discussed in the literature and on various mailing lists, but I would
be interested in any opinion I can get.
I am comparing (see below) two spatial prediction methods
(regression-kriging and inverse distance interpolation) using 5-fold
cross-validation and then testing whether the difference between the
two is significant. What I concluded is that there are two possible
tests for the final residuals:
1. F-test to compare variances (of the cross-validation residuals),
2. t-test to compare mean values.
Both tests might be important; nevertheless, the F-test ("var.test")
seems more relevant for actually answering "is method B significantly
more accurate than method A?". The second test ("t.test") appears to
matter only if it fails, which would mean that one of the methods
systematically over- or under-estimates the mean value (which should
be 0). Have I missed some important test?
Thank you!
R> library(GSIF)
R> library(gstat)
R> library(sp)
R> set.seed(2419)
R> demo(meuse, echo=FALSE)
R> omm1 <- fit.gstatModel(meuse, log1p(om)~dist+soil, meuse.grid)
Fitting a linear model...
Fitting a 2D variogram...
Saving an object of class 'gstatModel'...
R> rk1 <- predict(omm1, meuse.grid)
R> meuse.s <- meuse[!is.na(meuse$om),]
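R> ## without a variogram model, krige.cv() performs (cross-validated) inverse distance interpolation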
R> ok1 <- krige.cv(log1p(om)~1, meuse.s, nfold=5)
R> var.test(ok1$residual, rk1@validation$residual, alternative = "greater")
F test to compare two variances
data: ok1$residual and rk1@validation$residual
F = 1.2283, num df = 152, denom df = 152, p-value = 0.103
alternative hypothesis: true ratio of variances is greater than 1
95 percent confidence interval:
0.9398662 Inf
sample estimates:
ratio of variances
1.228322
R> ## No significant difference
R> t.test(ok1$residual, rk1@validation$residual)
Welch Two Sample t-test
data: ok1$residual and rk1@validation$residual
t = -0.0204, df = 300.842, p-value = 0.9837
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.07084667 0.06939220
sample estimates:
mean of x mean of y
0.0004766718 0.0012039089
R> ## Again, no significant difference
R> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
other attached packages:
[1] randomForest_4.6-7 nortest_1.0-2
[3] gstat_1.0-19 GSIF_0.4-2
[5] sp_1.0-15 gap_1.1-12
Comparison of prediction performance (mapping accuracy) - how to test if a method B is significantly more accurate than method A?
3 messages · Tomislav Hengl, Tim Appelhans, Jean-Daniel Sylvain
On 08/28/2014 05:10 PM, Tomislav Hengl wrote:
> What I concluded is that there are two possible tests for the final
> residuals: 1. F-test to compare variances (cross-validation residuals),
> 2. t-test to compare mean values.
If you think in terms of accuracy vs. precision, I'd say both tests are equally important. Ideally you want your method to be both precise (low residual variance) and accurate (mean residual close to zero). What I usually tend to do is repeated random sub-sampling with 100+ runs.
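A minimal sketch of what such repeated random sub-sampling could look like, using plain gstat calls on the meuse data from the thread (the 70/30 split, the exponential variogram with rough initial values, and RMSE as the accuracy score are all just illustrative choices; krige() without a variogram model falls back to inverse distance interpolation):

library(gstat)
library(sp)
demo(meuse, echo = FALSE)
meuse.s <- meuse[!is.na(meuse$om),]   # points with organic matter observed
set.seed(2419)
n <- length(meuse.s)
runs <- 100
rmse <- data.frame(rk = numeric(runs), id = numeric(runs))
for(i in 1:runs){
  ## random 70/30 split into calibration and validation points
  cal <- sample(n, round(0.7 * n))
  m.cal <- meuse.s[cal,]
  m.val <- meuse.s[-cal,]
  ## regression-kriging: residual variogram + kriging with external drift
  v <- fit.variogram(variogram(log1p(om) ~ dist + soil, m.cal),
                     vgm(0.2, "Exp", 500, 0.05))   # rough initial values
  rk <- krige(log1p(om) ~ dist + soil, m.cal, m.val, model = v,
              debug.level = 0)
  ## inverse distance interpolation: krige() without a variogram model
  id <- krige(log1p(om) ~ 1, m.cal, m.val, debug.level = 0)
  rmse$rk[i] <- sqrt(mean((rk$var1.pred - log1p(m.val$om))^2))
  rmse$id[i] <- sqrt(mean((id$var1.pred - log1p(m.val$om))^2))
}
## distribution of the accuracy difference over the runs
summary(rmse$rk - rmse$id)
t.test(rmse$rk, rmse$id, paired = TRUE)

The paired t-test at the end compares mean RMSE over the runs; since the runs re-use the same observations, it should be read as indicative rather than exact.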
#####################################
Tim Appelhans
Department of Geography
Environmental Informatics
Philipps Universität Marburg
Deutschhausstraße 12
35032 Marburg (Paketpost: 35037 Marburg)
Germany
Tel +49 (0) 6421 28-25957
http://environmentalinformatics-marburg.de/
1 day later
Dear Tom/list,

The subject could also be looked at as the same problem encountered in ensemble forecasting (e.g. in meteorology). If you had more folds in your analysis (each fold can be seen as a member of an ensemble), you could compare the two methods the way it is done for ensemble forecasts in meteorology and hydrology. Both disciplines provide tools that help study the accuracy, uncertainty and bias of a forecast. Based on this methodological framework, it would be possible to compare the two methods on several criteria.

For those interested in the subject: Brochero (2013) provides a good review and describes several indicators in Chapter 1 of his Ph.D. thesis, "Hydroinformatics and diversity in hydrological ensemble prediction systems", http://theses.ulaval.ca/archimede/meta/29908. The site below also provides a quick and simple review of the typical indicators used in meteorological forecasting: http://www.eumetcal.org/resources/ukmeteocal/verification/www/english/courses/msgcrs/index.htm

However, in your case this approach seems limited by the number of folds used. This is just an idea that is worth exploring; in my research I intend to explore this approach. Any comments/suggestions?
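For illustration, a minimal sketch of such fold-wise indicators, computed here from the ok1 cross-validation object of the original post; bias, MAE and RMSE are just example criteria, the helper name fold.stats is made up for this sketch, and the fold column is part of the standard krige.cv() output:

## per-fold verification statistics from a krige.cv() result,
## treating each fold as one "ensemble member"
fold.stats <- function(cv) {
  do.call(rbind, lapply(split(cv$residual, cv$fold), function(r)
    data.frame(bias = mean(r),           # mean error, should be close to 0
               MAE  = mean(abs(r)),      # mean absolute error
               RMSE = sqrt(mean(r^2))))) # root mean squared error
}
fold.stats(ok1)   # one row per fold

With many more folds (or repeated runs), such fold-wise scores for the two methods could then be compared using the kind of verification criteria referenced above.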