
Comparison of prediction performance (mapping accuracy) - how to test if a method B is significantly more accurate than method A?

3 messages · Tomislav Hengl, Tim Appelhans, Jean-Daniel Sylvain

Dear list,

I'm trying to standardize a procedure for comparing the performance of 
competing spatial prediction methods. I know that this has been 
discussed in the literature and on various mailing lists, but I would 
be interested in any opinions I could get.

I am comparing (see below) two spatial prediction methods 
(regression-kriging and inverse distance interpolation) using 5-fold 
cross-validation and then testing whether the difference between the two 
is significant. My conclusion is that there are two possible tests for 
the final residuals:
1. an F-test to compare variances (of the cross-validation residuals),
2. a t-test to compare mean values.

Both tests might be important; nevertheless, the F-test ("var.test") 
seems more interesting for actually answering "is method B significantly 
more accurate than method A?". The second test ("t.test") appears to 
matter only if it fails, which would mean that one of the methods 
systematically over- or under-estimates the mean value (which should 
be 0). Did I maybe miss some important test?

Thank you!

R> library(GSIF)
R> library(gstat)
R> library(sp)
R> set.seed(2419)
R> demo(meuse, echo=FALSE)
R> omm1 <- fit.gstatModel(meuse, log1p(om)~dist+soil, meuse.grid)
Fitting a linear model...
Fitting a 2D variogram...
Saving an object of class 'gstatModel'...
R> rk1 <- predict(omm1, meuse.grid)
R> meuse.s <- meuse[!is.na(meuse$om),]
R> ok1 <- krige.cv(log1p(om)~1, meuse.s, nfold=5)
R> var.test(ok1$residual, rk1@validation$residual, alternative = "greater")

         F test to compare two variances

data:  ok1$residual and rk1@validation$residual
F = 1.2283, num df = 152, denom df = 152, p-value = 0.103
alternative hypothesis: true ratio of variances is greater than 1
95 percent confidence interval:
  0.9398662       Inf
sample estimates:
ratio of variances
           1.228322
R> ## No significant difference
R> t.test(ok1$residual, rk1@validation$residual)

         Welch Two Sample t-test

data:  ok1$residual and rk1@validation$residual
t = -0.0204, df = 300.842, p-value = 0.9837
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -0.07084667  0.06939220
sample estimates:
    mean of x    mean of y
0.0004766718 0.0012039089
R> ## Again, no significant difference

R> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
other attached packages:
[1] randomForest_4.6-7 nortest_1.0-2
[3] gstat_1.0-19       GSIF_0.4-2
[5] sp_1.0-15          gap_1.1-12
On 08/28/2014 05:10 PM, Tomislav Hengl wrote:
If you think in terms of accuracy vs. precision, I'd say both tests are 
equally important. Ideally you want your method to be precise (low 
variance) and accurate (mean error close to zero, i.e. low bias). What I 
usually tend to do is repeated random sub-sampling with 100+ runs.
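
A minimal sketch of what such repeated random sub-sampling could look 
like for the meuse example above. The 70/30 split, the 100 runs, the 
initial variogram values and the use of kriging with an external drift 
as a stand-in for the regression-kriging step are illustrative 
assumptions, not the exact GSIF workflow from the first message:

library(sp)
library(gstat)
demo(meuse, echo = FALSE)
meuse.s <- meuse[!is.na(meuse$om), ]

n.runs <- 100   # 100+ repeated random sub-sampling runs
rmse <- matrix(NA, nrow = n.runs, ncol = 2,
               dimnames = list(NULL, c("ked", "idw")))
set.seed(2419)
for (i in 1:n.runs) {
  ## random 70/30 train/test split (split ratio is an arbitrary choice)
  idx   <- sample(seq_len(nrow(meuse.s)), size = round(0.7 * nrow(meuse.s)))
  train <- meuse.s[idx, ]
  test  <- meuse.s[-idx, ]
  ## method A: kriging with external drift; the variogram is refitted on
  ## each training sample from rough, assumed initial values
  v   <- fit.variogram(variogram(log1p(om) ~ dist + soil, train),
                       vgm(0.2, "Exp", 500, 0.05))
  ked <- krige(log1p(om) ~ dist + soil, train, test, model = v)
  ## method B: inverse distance interpolation
  idw.p <- idw(log1p(om) ~ 1, train, test)
  rmse[i, "ked"] <- sqrt(mean((ked$var1.pred - log1p(test$om))^2))
  rmse[i, "idw"] <- sqrt(mean((idw.p$var1.pred - log1p(test$om))^2))
}
## paired comparison of the per-run RMSEs
t.test(rmse[, "ked"], rmse[, "idw"], paired = TRUE)

Note that the runs share observations, so the paired p-value should be 
read with caution; the distribution of per-run RMSE differences is often 
more informative than any single test.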

1 day later
Dear Tom/list,

The problem could also be looked at as the same one encountered in 
ensemble forecasting (e.g. in meteorology).

If you had more folds in your analysis (you can see each fold as a 
member of an ensemble), you could compare the two methods as is done for 
ensemble forecasts in meteorology and hydrology. Both disciplines provide 
tools that help study the accuracy, uncertainty and bias of a forecast. 
Based on this methodological framework, it would be possible to compare 
both methods on several criteria.

For those who might be interested in the subject:

Brochero (2013) provides a good review of this subject in his Ph.D. 
thesis, "Hydroinformatics and diversity in hydrological ensemble 
prediction systems", and describes several indicators (Chapter 1).
  http://theses.ulaval.ca/archimede/meta/29908

The site below also provides a quick and simple review of the typical 
indicators used in meteorological forecasting:
http://www.eumetcal.org/resources/ukmeteocal/verification/www/english/courses/msgcrs/index.htm

However, in your case this approach seems limited by the small number 
of folds used.
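
For the meuse example from the first message, a minimal sketch of such 
fold-wise indicators (bias and RMSE per fold, with each fold treated as 
an "ensemble member") could look like the following. The variogram for 
the drift model is an assumed illustration, not the one fitted by 
fit.gstatModel() above:

library(sp)
library(gstat)
demo(meuse, echo = FALSE)
meuse.s <- meuse[!is.na(meuse$om), ]

## one common fold assignment so both methods are scored on the same folds
set.seed(2419)
folds <- sample(rep(1:5, length.out = nrow(meuse.s)))
## method A: same call as ok1 in the first message
cv.a <- krige.cv(log1p(om) ~ 1, meuse.s, nfold = folds)
## method B: cross-validating the drift formula with an assumed variogram
cv.b <- krige.cv(log1p(om) ~ dist + soil, meuse.s,
                 model = vgm(0.2, "Exp", 500, 0.05), nfold = folds)

## bias and RMSE per fold
fold.stats <- function(cv) {
  data.frame(bias = tapply(cv$residual, cv$fold, mean),
             rmse = tapply(cv$residual, cv$fold, function(r) sqrt(mean(r^2))))
}
fold.stats(cv.a)
fold.stats(cv.b)

## with only 5 folds a formal test has little power, but a paired
## comparison of the per-fold RMSEs is one option
t.test(fold.stats(cv.a)$rmse, fold.stats(cv.b)$rmse, paired = TRUE)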

This is just an idea that is worth exploring. In my research, I intend 
to pursue this approach. Any comments/suggestions?

On 8/28/2014 11:28 AM, Tim Appelhans wrote: