
cross-validation

8 messages · Paul Hiemstra, Marta Rufino, Edzer Pebesma

#
Dear list members,

Is it possible to do cross-validation on multivariate kriging?
i.e. apply krige.cv to multivariate kriging in gstat?

thank you very much,
Best wishes,
Marta
#
Yes, Marta; the function gstat.cv works on multivariable objects; do read 
its documentation.

Best regards,
--
Edzer
Marta Rufino wrote:
#
Hello,

yes, I know it is supposed to, but I could not find out how, because it 
gives me an error... For example:

require(gstat); require(lattice)
data(meuse)
coordinates(meuse) = ~x + y
data(meuse.grid)
gridded(meuse.grid) = ~x + y

meuse.g <- gstat(id = "zn", formula = log(zinc) ~ 1, data = meuse)
meuse.g <- gstat(meuse.g, "cu", log(copper) ~ 1, meuse)

meuse.g <- gstat(meuse.g, model = vgm(1, "Sph", 900, 1), fill.all = T)
x <- variogram(meuse.g, cutoff = 1000)
meuse.fit = fit.lmc(x, meuse.g)
plot(x, model = meuse.fit)
z <- predict(meuse.fit, newdata = meuse.grid)
spplot(z) #map
gstat.cv(meuse.g) #does not work...
gstat.cv(meuse.g, remove.all=T) #either
gstat.cv(meuse.g, all.residuals=T) #either
gstat.cv(object=meuse.g, formula = log(zinc) ~ 1, data = meuse, model = 
vgm(1, "Sph", 900, 1), nmax=40, verbose=F) #either :-(

#                             
# Intrinsic Correlation found. Good.
# [using ordinary cokriging]

# "chfactor.c", line 130: singular matrix in function LDLfactor()
# Error in predict.gstat(object, newdata = data[sel, ], ...) :
#         LDLfactor

Maybe an example in the help file would be nice (eheheh)...
What am I missing?


Thank you very much in advance,
Marta
#
Hi,

You should check whether you have duplicate observations; duplicate 
observations lead to a singular matrix. Use the function zerodist() to 
find observations at identical locations and remove.duplicates() to remove them.
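The idea behind those two functions can be sketched in base R (a toy illustration, not the sp implementation; the data frame and its x/y columns here are made up):

```r
# Toy stand-in for zerodist()/remove.duplicates(): flag observations
# that share exactly the same coordinates (hypothetical x/y columns).
pts <- data.frame(x = c(1, 2, 2, 3),
                  y = c(1, 5, 5, 9),
                  zinc = c(10, 20, 21, 30))

dup <- duplicated(pts[, c("x", "y")])  # TRUE for later copies of a location
pts_clean <- pts[!dup, ]               # analogous to remove.duplicates()

nrow(pts_clean)  # 3: one of the two collocated points is dropped
```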

cheers,
Paul

Marta Rufino wrote:

#
That was not the problem. The problem was that you passed meuse.g instead 
of meuse.fit to gstat.cv. For meuse.g, you have perfect correlation 
between Cu and Zn, so collocated observations (meaning a Zn and a Cu 
observation at each observation location) act as duplicates in 
univariate kriging.
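Why such duplicates break the factorisation can be seen with a toy matrix in base R (just an illustration of the mechanism, not gstat's internals): two coincident observations give the kriging system two identical rows, and factorising a singular matrix is exactly what the LDLfactor() error above complains about.

```r
# Toy covariance matrix: observations 2 and 3 sit at the same location,
# so their rows (and columns) are identical and the matrix is singular.
C <- matrix(c(1.0, 0.6, 0.6,
              0.6, 1.0, 1.0,
              0.6, 1.0, 1.0), nrow = 3, byrow = TRUE)

det(C)      # 0: singular, so solve(C) / a Cholesky-type factorisation fails
```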

Try:

out = gstat.cv(object=meuse.fit, nmax=40, verbose=F, nfold=5)
--
Edzer
Paul Hiemstra wrote:
#
Hi Marta,

I now have in the example section of gstat.cv help:

# multivariable; thanks to M. Rufino:
meuse.g <- gstat(id = "zn", formula = log(zinc) ~ 1, data = meuse)
meuse.g <- gstat(meuse.g, "cu", log(copper) ~ 1, meuse)
meuse.g <- gstat(meuse.g, model = vgm(1, "Sph", 900, 1), fill.all = T)
x <- variogram(meuse.g, cutoff = 1000)
meuse.fit = fit.lmc(x, meuse.g)
out = gstat.cv(meuse.fit, nmax = 40, nfold = 5)
summary(out)
# mean error, ideally 0:
mean(out$residual)
# MSPE, ideally small
mean(out$residual^2)
# Mean square normalized error, ideally close to 1
mean(out$zscore^2)
# correlation observed and predicted, ideally 1
cor(out$observed, out$observed - out$residual)
# correlation predicted and residual, ideally 0
cor(out$observed - out$residual, out$residual)
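The same arithmetic can be checked on made-up numbers (synthetic values, not meuse output; the MSNE term needs the kriging z-scores and is omitted here):

```r
# Toy illustration of the cross-validation diagnostics above
observed  <- c(5.0, 5.5, 6.0, 6.5)
predicted <- c(5.2, 5.4, 6.3, 6.3)
residual  <- observed - predicted        # c(-0.2, 0.1, -0.3, 0.2)

mean(residual)            # mean error: -0.05, ideally 0
mean(residual^2)          # MSPE: 0.045, ideally small
cor(observed, predicted)  # ideally close to 1
cor(predicted, residual)  # ideally close to 0
```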

I'm a bit hesitant to write a dedicated function for such simple 
calculations. In your function below, I object to the use of "expected" 
where you mean "desired"; it is definitely not expected in the 
statistical sense.

The difficult question about cross-validation results in geostatistics 
is usually: "can we attribute the deviation of an MSNE of 1.1 from the 
idealized value of 1 to sampling error, or is it an indication of a 
misfitting model?" And in the latter case: "should we worry?" (only if 
kriging errors are of importance; many people ignore them).

Also, if the data are evenly spread, say with a shortest inter-point 
distance of 100 m (to stay with the example above), then CV will never 
tell you anything about the variogram at distances below 100 m.

Leave-one-out vs. n-fold? I believe Hastie, Tibshirani and Friedman 
promote 10-fold. The idea is that leave-one-out may be too optimistic 
and fail to reveal model error, because for moderately sized or large 
data sets the refitted models in each fold are almost identical. 
Leave-one-out will have a smaller RMSE than, say, 5-fold, but this is 
not an indication of a better model, nor of a substantially better 
procedure, IMO.
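The partitioning behind n-fold CV can be sketched in base R (a hypothetical illustration of the principle behind the nfold argument, not gstat.cv's actual code): each observation is assigned to one of nfold groups and predicted once, from a model fitted to the other groups.

```r
set.seed(1)
n <- 155                                      # e.g. the meuse sample size
nfold <- 10
fold <- sample(rep(1:nfold, length.out = n))  # random fold membership

# every observation is held out exactly once; folds have 15 or 16 members
table(fold)
```

With 10 folds each refit uses roughly 90% of the data, so a misfitting model has a better chance of showing up than with leave-one-out, where each refit differs from the full fit by a single point.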

Best wishes,
--
Edzer
#
Excellent!

That is exactly what I meant.
I also agree that a function is not needed to calculate such simple 
statistics, but it would certainly be desirable (and effective) to have 
them in the help file of krige.cv (to clarify things for the less 
statistically minded, like myself :-)).
I totally agree with 'desired' instead of 'expected' ;-)

About the 10-fold issue (thank you for the reference; it was exactly 
what I had heard about): maybe it would be nice if the default were 
10-fold, because it is also substantially quicker to compute. hihihi

In relation to the use of CV, I found an article that drew my attention 
to the subject again, in which the authors used cross-validation to 
compare the mean estimated by kriging with the sample mean, as a way to 
show how kriging improves the estimates (Sales MH, Souza JCM, 
Kyriakidis PC, Roberts DA, Vidal E (2007) Improving spatial distribution 
estimation of forest biomass with geostatistics: A case study for 
Rondonia, Brazil. Ecol Model 205:221-230). So this is one of the issues 
(for many users, I believe): how do we quantify/prove that kriging is 
actually doing better than other methods?
Does anyone have any further alternatives?


Best wishes and thank you very much once again,
Marta
Edzer Pebesma wrote: