spautolm - sanity check - R-SIG-Geo

Brian
Wed, Feb 9, 2011 12:03 PM #
Hello All,

I have gone through ASDAR and hope that I have properly applied it's 
content.
I have water color data sampled from polygons, which lie at the mouths 
of a rivers around an island. Due to ocean currents, bathymetry, and 
wind etc., I have the need to account for spatial autocorrelation for 
those polygons which are near to each other. I am trying to build a 
statistical model to explain the correlations between land cover and 
water quality.

Therefore:
### Coordinates of the polygons
coords <- coordinates(outlets)
### k=2 - assume that each outlet point could be/is possibly only 
affecting the two nearest outlets
outlets.nb <- knn2nb(knearneigh(coords,k=2),row.names=outlets$BASIN_ID)
### We assume that SA is related to distance between outlets
### Calculate distance (in meters, the units of our projection)
distance.nb <- nbdists(outlets.nb, coords)
### According to ASDAR (Page 254) this shouldn't be a problem because 
these distances are not on a grid but are asymmetrically placed in space,
### but we do ignore the dynamics physical processes that might 
transport water (currents, wind, bathymetry etc.) and assume equal 
diffusion/dispersal in all directions.
### Invert the distances: the bigger the distance the smaller the 
weight. Weights are proportional to the inverse distance
invert.dist <- lapply(distance.nb, function(x) 1/x)
(outlets.lw <- nb2listw(outlets.nb, glist=invert.dist, style='B'))

OK, is this a correct structure?
It may be better to just base it on distance (dnearneigh) i.e. assume 
that beyond 50km distance a nearest neighbor has no influence.
Any comments, suggestions, insults?

## Give it a whirl.
moran.test(response, listw=outlets.lw, randomisation=F)

Moran's I test under normality

data: response
weights: outlets.lw

Moran I statistic standard deviate = 2.8826, p-value = 0.001972
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.48914375 -0.02325581 0.03159666
### Tweaking suggests that heteroscedasticity is present as well the 
data not being normally distributed. Then a boxcox transformation 
suggests a 1/sqrt(response) transformation but it is difficult to 
interpret so I opt for a log transformation.

 > moran.test(log(response), listw=outlets.lw, randomisation=F)

Moran's I test under normality

data: log(response)
weights: outlets.lw

Moran I statistic standard deviate = 2.0702, p-value = 0.01922
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.34473683 -0.02325581 0.03159666

Moran's I changed quite a bit here. This is due to non-normality, right? 
It appears the log takes most of the heteroscedasticity out and 
transforms the distribution to almost normally ditributed (seen via 
plot(model)).

### Continuing
### after narrowing down the explanatory variables while removing the 
'worst' of those that are collinear (rho>0.5).
sp.model <- spautolm(log(response) ~., data=data, listw=outlets.lw)

I used the SAR default.

Also, without narrowing my explanatory variables down to a bare minimum, 
I usually run into the error:
Error in solve.default(crossprod(X, as.matrix(IlW %*% X)), tol = 
tol.solve) :
system is computationally singular: reciprocal condition number = 
2.85163e-019

Which still exists even when the same non-spatial (just lm) model 
contains no singularities. I get this often when adding an interaction 
term to the model.
I am worried that the spatial structure has been misspecified, when 
looking at the other post which addresses this same error:
http://r-sig-geo.2731867.n2.nabble.com/error-in-spautolm-spdep-and-model-selection-td2766159.html

 > sp.model <- spautolm(form, data=borneo.data,listw=outlets.lw)
 > summary(sp.model,Nagelkerke=T)

Call: spautolm(formula = form, data = borneo.data, listw = outlets.lw)

Residuals:
Min 1Q Median 3Q Max
-1.1164557 -0.3936249 0.0074938 0.3652388 1.4859714

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.2321e+00 1.6446e+00 -5.6135 1.983e-08
p_bare 1.2761e+00 2.5148e-01 5.0743 3.889e-07
mean_rfact 1.0769e-03 2.1112e-04 5.1010 3.378e-07
km_roads 8.0991e-05 3.3864e-05 2.3917 0.016772
p_bare:mean_rfact -1.5911e-04 3.1192e-05 -5.1009 3.381e-07
p_bare:km_roads -1.5490e-05 5.9758e-06 -2.5922 0.009537

Lambda: 3040.2 LR test value: 0.94862 p-value: 0.33007

Log likelihood: -44.52488
ML residual variance (sigma squared): 0.43704, (sigma: 0.66109)
Number of observations: 44
Number of parameters estimated: 8
AIC: 105.05
Nagelkerke pseudo-R-squared: 0.54572


So there seems to be little spatial autocorrelation in the residuals.

### This is how it looks in non-spatial manner.
summary(model);AIC(model)

Call:
lm(formula = response ~ p_bare + mean_rfact + km_roads + 
p_bare:mean_rfact +
p_bare:km_roads, data = borneo.data)

Residuals:
Min 1Q Median 3Q Max
-0.7111 -0.3342 -0.1216 0.3065 2.0646

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.067e+00 1.378e+00 -2.951 0.00539 **
p_bare 6.878e-01 2.142e-01 3.210 0.00270 **
mean_rfact 6.132e-04 1.766e-04 3.472 0.00130 **
km_roads 6.282e-05 2.863e-05 2.194 0.03439 *
p_bare:mean_rfact -8.611e-05 2.660e-05 -3.237 0.00251 **
p_bare:km_roads -1.091e-05 5.067e-06 -2.154 0.03767 *
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

Residual standard error: 0.559 on 38 degrees of freedom
Multiple R-squared: 0.4248, Adjusted R-squared: 0.3491
F-statistic: 5.613 on 5 and 38 DF, p-value: 0.0005723

## and then...

lm.morantest(model.log,listw=outlets.lw)### Moran's I is big

Global Moran's I for regression residuals

data:
model: lm(formula = log(response) ~ p_bare + mean_rfact + km_roads +
p_bare:mean_rfact + p_bare:km_roads, data = borneo.data)
weights: outlets.lw

Moran I statistic standard deviate = 1.1192, p-value = 0.1315
alternative hypothesis: greater
sample estimates:
Observed Moran's I Expectation Variance
0.15521190 -0.04488877 0.03196742

So SA may or may not exist. It doesn't appear significant here, however, 
the R2 values increase in the spatial model and the AIC drops. The 
spatial model seems better, but is it overkill?

Thanks for the comments, suggestions, or insults!
Regards,
Brian