Making GIS and R play nicely - some tips?
On Wed, 2 Mar 2005, abunn wrote:
Great. Thanks Roger and other others that replied.
But isn't the real challenge going to be shoehorning 250 by 1000 by 1000 by 8 into R at once - or are the data layers needed just one after the other? If I could grasp the workflow better, the advice above could become more focussed, I think.
I'm not positive on the workflow yet! That's one of the problems. I'm
finding it tough to wrap my head around the data since it is in both time
and space.
The data are not as huge as I thought at first. Each layer has 567 rows and
409 columns (231903 elements) but 137113 are no data. So, that makes the
actual amount of data to crunch much more manageable at 94790 elements per
layer.
The goal is to predict satellite reflectance as a function of climate. The
response data are monthly satellite reflectance over 19 years (So, that's
228 images). The predictor data are interpolated climate data and there
should be a minimum of three predictors and I'd like to use two more if they
are reliable. So, assuming three predictors the data would only run to 500mb
or so.
R > object.size(rep(runif(94790), times = 19*12*3)) / 2^20
[1] 494.6622
And if I can get all five then I'd be under a gig still.
R > object.size(rep(runif(94790), times = 19*12*5)) / 2^20
[1] 824.437
That's cumbersome but possible.
I've managed to read in a test example using erdas img files using rgdal
(thanks Tim Keitt):
R > library(rgdal)
R > junk <- list()
R > try1 <- GDAL.open("test.img")
R > getDriverLongName(getDriver(try1))
[1] "Erdas Imagine Images (.img)"
R > junk[[1]] <- c(getRasterData(try1))
R > GDAL.close(try1)
Closing GDAL dataset handle 00CACBF0... destroyed ... done.
R > try2 <- GDAL.open("test2.img")
R > getDriverLongName(getDriver(try2))
[1] "Erdas Imagine Images (.img)"
R > junk[[2]] <- c(getRasterData(try2))
R > GDAL.close(try2)
Closing GDAL dataset handle 00C7F180... destroyed ... done.
R > str(junk)
List of 2
$ : int [1:231903] 0 0 0 0 0 0 0 0 0 0 ...
$ : int [1:231903] 0 0 0 0 0 0 0 0 0 0 ...
So, to read all this in I can pipe a script to Arc to convert the grids to
images and loop through all the data. Does that sound like a reasonable
plan?
What are the volumes going back? Another itch to scratch is how much more you will get in terms of precision in the coefficient point estimates by actually forcing all this data down lm() or whatever's throat? Would it be possible to reduce the load by using subsetting in rgdal, given that you had the *.img and model based files lined up (or just use locations for which you have observed climate data)? Won't the interpolation errors in the predictors feed through? I think the Furlanello et al. paper may be very helpful (they use randomForest on regression trees of climate/topography drivers). Interesting problem. Roger
Thanks for all the help, Andy
Roger Bivand Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Breiviksveien 40, N-5045 Bergen, Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93 e-mail: Roger.Bivand at nhh.no