Hello:
At first sight, this is about read.gal[2,3] and read.gwt2nb, but in the
long run it is about strategies for working with very large datasets.
Here is the background. With Robert Hijmans' support, we generated a
comprehensive database of world-wide greenhouse gas (GHG) emissions
and a wide range of explanatory variables. The point file now contains
(depending on source) between 1.4 and 2.1 million locations, all on a
0.1 degree grid. We would like to run a bunch of spatial regression
models on this very large dataset. In the end, we would like to
determine which (set of) variable(s) have what kind of effect on GHGs in
what part of the world. The variables are physical, economic,
demographic, and geographic (e.g. distance from ocean) in nature.
This procedure usually starts with creating a spatial weights matrix,
which we tried in R, but this led to an endless process (we tried it
repeatedly on machines with 4 GB of RAM and Xeon processors; it did not
bail, just kept running at about 50% CPU time using between 300 and 2100
MB of memory for more than a week until we killed the process).
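For concreteness, here is a minimal sketch of the kind of neighbour-generation step we have been attempting (assuming spdep is installed); the coordinates below are only a small random stand-in for our real 1.4-2.1 million point matrix:

```r
## Minimal sketch (spdep required); the random coordinates are only a
## stand-in for the real multi-million point matrix.
library(spdep)
set.seed(1)
coords <- cbind(lon = runif(1000, -180, 180),
                lat = runif(1000, -60, 75))
## k-nearest neighbours guarantees that no observation ends up with
## zero neighbours, unlike a distance band that is set too small
nb <- knn2nb(knearneigh(coords, k = 4, longlat = TRUE))
summary(card(nb))   # card() returns the neighbour count per point
## distance-band alternative, just over one 0.1 degree grid step:
## nb_d <- dnearneigh(coords, 0, 0.15)
```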
GeoDa ran for about six hours and then produced a file with a good 5
million records, 99% of which contained zero neighbors. This is where
the immediate question comes into play. The read.gal function did not
like the file produced by GeoDa. There is some GeoDa documentation that
suggests that we should use read.gal2 or read.gal3 but these are not
part of the spdep distribution, nor could I find them anywhere. As it
happens, the file generated had a .gwt extension, so I tried
read.gwt2nb. It seemed to accept the input but then completely killed
the whole R process (I kept screen shots just for Roger). My guess is
that (a) the matrix was too big, or (b) it was too sparse, or (c) it was
a corrupt product of GeoDa in the first place.
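For reference, the call we tried looked roughly like the following; the file name and ID vector are placeholders, and read.gwt2nb() expects a region.id vector matching the ID variable GeoDa wrote into the .gwt file:

```r
## Placeholder file name and IDs; requires spdep. The region.id
## argument must match the ID variable GeoDa used when writing the file.
library(spdep)
ids <- as.character(pts$ID)   # hypothetical ID column of the point data
nb  <- read.gwt2nb("weights.gwt", region.id = ids)
table(card(nb) == 0)          # how many units have zero neighbours?
```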
Which brings me back to the bigger picture and the following questions:
1) Is there something inherently wrong with our approach?
2) Can anybody think of alternative ways to create a spatial regression
model for the above-mentioned questions?
3) Would it be worthwhile to move onto a Linux machine and recompile all
the different packages?
Cheers,
Jochen
spdep neighbor generation and subsequent regression analysis
2 messages · Jochen Albrecht, Roger Bivand
On Sun, 15 Nov 2009, Jochen Albrecht wrote:
> Hello: At first sight, this is about read.gal[2,3] and read.gwt2nb, but in the long run it is about strategies for working with very large datasets. Here is the background. With Robert Hijmans' support, we generated a comprehensive database of world-wide greenhouse gas (GHG) emissions and a wide range of explanatory variables. The point file now contains (depending on source) between 1.4 and 2.1 million locations, all on a 0.1 degree grid. We would like to run a bunch of spatial regression models on this very large dataset. In the end, we would like to determine which (set of) variable(s) have what kind of effect on GHGs in what part of the world. The variables are physical, economic, demographic, and geographic (e.g. distance from ocean) in nature.
If this is like machine learning, why not use such techniques? Do you have a realistic spatial process model? I think that very many of the input variables are interpolated too, so probably spatial dependence at any scale will be induced by the changes in support prior to analysis. The results of such analysis would (or should) have large standard errors, so perhaps would not take you where you want to go. If you cannot handle the varying impacts of spatial scales in the data generating processes on both left and right hand sides, any observed residual dependence will certainly be spurious (a red herring).

Could you try a small subsample across a natural experiment (a clear difference in treatment)? Then the difficulty of generating a large weights object would go away. It would also let you examine the error propagation/change of support problem, which would be intractable with many "observations", and which you need to do if your results are to be taken seriously. If you need to generate neighbours for very large n, please do describe the functions used, as there are many ways of doing this:
> This procedure usually starts with creating a spatial weights matrix, which we tried in R, but this led to an endless process (we tried it repeatedly on machines with 4 GB of RAM and Xeon processors; it did not bail, just kept running at about 50% CPU time using between 300 and 2100 MB of memory for more than a week until we killed the process).
actually tells us nothing, as you haven't said how exactly you were doing this - presumably using point support and a distance criterion? Is the object a SpatialPixels object? Was the distance criterion sensible (see the GeoDa failure reported below - perhaps not)?
> GeoDa ran for about six hours and then produced a file with a good 5 million records, 99% of which contained zero neighbors. This is where the immediate question comes into play. The read.gal function did not like the file produced by GeoDa. There is some GeoDa documentation that suggests that we should use read.gal2 or read.gal3, but these are not part of the spdep distribution, nor could I find them anywhere. As it happens, the file generated had a .gwt extension, so I tried read.gwt2nb. It seemed to accept the input but then completely killed the whole R process (I kept screen shots just for Roger). My guess is that (a) the matrix was too big, or (b) it was too sparse, or (c) it was a corrupt product of GeoDa in the first place.
Most likely that.
> Which brings me back to the bigger picture and the following questions: 1) Is there something inherently wrong with our approach?
See above.
> 2) Can anybody think of alternative ways to create a spatial regression model for the above-mentioned questions?
It can be done, but once you have control of the scale and process issues, there is nothing to stop you subsampling. If you go with a 1 degree grid, you shouldn't have trouble fitting a model (about 15000 on-land cells), but it may be largish for applying say Bayesian Model Averaging, which might give you a feel for which variables are in play.
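An untested sketch of that route; the data frame and variable names (df, ghg, gdp, dist_ocean) are invented for illustration only:

```r
## Untested sketch: aggregate the 0.1 degree grid to 1 degree cells,
## then fit a spatial error model; all data/variable names are invented.
library(spdep)
agg <- aggregate(df[, c("ghg", "gdp", "dist_ocean")],
                 by = list(lon = floor(df$lon) + 0.5,
                           lat = floor(df$lat) + 0.5),
                 FUN = mean)
nb  <- knn2nb(knearneigh(as.matrix(agg[, c("lon", "lat")]),
                         k = 4, longlat = TRUE))
fit <- errorsarlm(ghg ~ gdp + dist_ocean, data = agg,
                  listw = nb2listw(nb, style = "W"))
summary(fit)
```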
> 3) Would it be worthwhile to move onto a Linux machine and recompile all the different packages?
When working with larger data sets, 64-bit Linux or OS X are still currently more viable than Windows, I believe.

Hope this helps,

Roger
> Cheers, Jochen
_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Roger Bivand
Economic Geography Section, Department of Economics,
Norwegian School of Economics and Business Administration,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no