Hi R Spatial People: I am an experienced R user and an experienced GIS user - however, that GIS is Arc. In the past when I've worked with spatial data in R, I've done so by exporting the GIS data (always grids) from Arc as ASCII files. Then I read them into R and go through some gymnastics to dump the no-data values (-9999) while I do my analysis (say a regression tree) and then write the ASCII back out and suck it into Arc for display. This loose coupling seems inelegant. I'm starting on a new project where I have a time series of about 250 grids that are roughly 1000 by 1000. There is a response variable and up to six or so predictors. So, it's enough data that I feel it's time to learn how to do it right. Question is, what's right? Is the state of the art approach to use GRASS a la Bivand and Neteler's paper (http://agec144.agecon.uiuc.edu/csiss/Rgeo/)? Is it better to dump the grids to an Imagine file (or bil) and read them with rgdal? I work mostly in Windows (b/c of ESRI - damn them!) but switch gears onto Linux when the task requires it. I have never used GRASS. Show me the way! Thanks in advance, Andy
Making GIS and R play nicely - some tips?
7 messages · Roger Bivand, Tim Keitt, Andy Bunn +1 more
On Wed, 2 Mar 2005, abunn wrote:
Well, there are plenty of possibilities. The first question (abstracting from the size of the data) is whether to go the way suggested in: http://www.gisvet.org/Documents/GisVet04/RegularPapers/Tait.pdf which leaves Arc in front, using R just as the compute engine, and passing data through StatConnector. I don't think that the code used has been published, but a very simple VBA example is at: http://perso.univ-lr.fr/csaintje/Recherche/RArcgis/index.html That permits the Arc user to avoid knowing about R if you need later to package the procedure for others to use. My guess is that this hasn't been checked on ArcGIS 9 (I'll try to check soon).
Another is, as you suggest, to use the rgdal package to move the data into R, and once we get rgdal better checked for writing, back out too. That puts R in front of the loose-coupled data.
A third is to use GRASS 5.4 and the R GRASS interface package, which reads and writes GRASS raster layers; GRASS can read from many formats and write to many formats, and can run (as can the interface) under Cygwin. This runs R on top of GRASS, which can be accessed through system(). The state of the art is certainly: M. Neteler, H. Mitasova, 2004. Open Source GIS: A GRASS GIS Approach. Second Edition. 424 pages, Kluwer Academic Publishers, Boston, Dordrecht, ISBN 1-4020-8064-6 (also published as an eBook, ISBN 1-4020-8065-4), chapter 13.
You might also enjoy: http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/FurlanelloEtAl.pdf showing GRASS, R and PostgreSQL working together.
But isn't the real challenge going to be shoehorning 250 by 1000 by 1000 by 8 into R at once - or are the data layers needed just one after the other? If I could grasp the workflow better, the advice above could become more focussed, I think.
Best wishes, Roger
_______________________________________________ R-sig-Geo mailing list R-sig-Geo at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Roger Bivand Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Breiviksveien 40, N-5045 Bergen, Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93 e-mail: Roger.Bivand at nhh.no
Great. Thanks Roger and the others that replied.
But isn't the real challenge going to be shoehorning 250 by 1000 by 1000 by 8 into R at once - or are the data layers needed just one after the other? If I could grasp the workflow better, the advice above could become more focussed, I think.
I'm not positive on the workflow yet! That's one of the problems. I'm
finding it tough to wrap my head around the data since it is in both time
and space.
The data are not as huge as I thought at first. Each layer has 567 rows and
409 columns (231903 elements) but 137113 are no data. So, that makes the
actual amount of data to crunch much more manageable at 94790 elements per
layer.
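For what it's worth, the no-data gymnastics mentioned in the original post can be confined to one mask per layer, built once and reused for the whole stack. A minimal base-R sketch - the -9999 sentinel and the layer dimensions come from the messages above; the values themselves are simulated here:

```r
# Simulate one 567 x 409 layer with -9999 as the no-data sentinel
nr <- 567; nc <- 409
layer <- matrix(runif(nr * nc), nrow = nr, ncol = nc)
layer[sample(length(layer), 137113)] <- -9999

# Build the mask once and reuse it for every layer in the series
valid  <- layer != -9999      # TRUE for the ~94790 real cells
values <- layer[valid]        # compact vector for modelling

# Going back out: refill the full grid so Arc sees -9999 again
out <- matrix(-9999, nrow = nr, ncol = nc)
out[valid] <- values
```

Because all 228 layers share the same grid, the same `valid` mask can index every one of them, so only the 94790 live cells per layer need to be carried through the regression.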
The goal is to predict satellite reflectance as a function of climate. The
response data are monthly satellite reflectance over 19 years (So, that's
228 images). The predictor data are interpolated climate data and there
should be a minimum of three predictors and I'd like to use two more if they
are reliable. So, assuming three predictors, the data would only run to 500 MB
or so.
R > object.size(rep(runif(94790), times = 19*12*3)) / 2^20
[1] 494.6622
And if I can get all five then I'd be under a gig still.
R > object.size(rep(runif(94790), times = 19*12*5)) / 2^20
[1] 824.437
That's cumbersome but possible.
I've managed to read in a test example from ERDAS .img files using rgdal
(thanks Tim Keitt):
R > library(rgdal)
R > junk <- list()
R > try1 <- GDAL.open("test.img")
R > getDriverLongName(getDriver(try1))
[1] "Erdas Imagine Images (.img)"
R > junk[[1]] <- c(getRasterData(try1))
R > GDAL.close(try1)
Closing GDAL dataset handle 00CACBF0... destroyed ... done.
R > try2 <- GDAL.open("test2.img")
R > getDriverLongName(getDriver(try2))
[1] "Erdas Imagine Images (.img)"
R > junk[[2]] <- c(getRasterData(try2))
R > GDAL.close(try2)
Closing GDAL dataset handle 00C7F180... destroyed ... done.
R > str(junk)
List of 2
$ : int [1:231903] 0 0 0 0 0 0 0 0 0 0 ...
$ : int [1:231903] 0 0 0 0 0 0 0 0 0 0 ...
So, to read all this in I can pipe a script to Arc to convert the grids to
images and loop through all the data. Does that sound like a reasonable
plan?
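If it helps, the two-file example above wraps naturally into a loop over the whole series. A sketch, using only the rgdal calls already shown - the file-naming pattern (refl_YYYY_MM.img) and the year range are placeholders, not the real names:

```r
library(rgdal)

# Hypothetical file names -- one .img per month over 19 years
yrs <- 1982:2000
files <- as.vector(outer(yrs, 1:12, function(y, m)
    sprintf("refl_%d_%02d.img", y, m)))

# Preallocate: one row per cell (567 * 409 = 231903), one column per layer
stack <- matrix(NA_integer_, nrow = 567 * 409, ncol = length(files))

for (i in seq_along(files)) {
    ds <- GDAL.open(files[i])
    stack[, i] <- c(getRasterData(ds))
    GDAL.close(ds)               # release the handle promptly
}
```

Applying the no-data mask to each column as it arrives, rather than storing the full grids, would keep the in-memory matrix at 94790 rows instead of 231903.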
Thanks for all the help, Andy
What are the volumes going back? Another itch to scratch is how much more you will get, in terms of precision in the coefficient point estimates, by actually forcing all this data down the throat of lm() or whatever. Would it be possible to reduce the load by using subsetting in rgdal, given that you had the *.img and model-based files lined up (or just use locations for which you have observed climate data)? Won't the interpolation errors in the predictors feed through? I think the Furlanello et al. paper may be very helpful (they use randomForest on regression trees of climate/topography drivers). Interesting problem. Roger
Hi Andy, I've had pretty good luck reading large images line-by-line in rgdal. It's kind of slow, but not as bad as one might imagine. ArcGIS went down in flames trying to do any sort of processing on these files (10K x 10K). A better solution, I think, will be to write space-time points into a PostGIS database (given lots of disk space) and use queries to subset the data as needed. I've found the database approach incredibly useful for working with BBS + climate data (~1e7 records). Ciao, THK
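To make the line-by-line suggestion concrete: as I recall, getRasterData() takes offset and region.dim arguments (rows and columns), so a big image can be pulled in a window at a time rather than whole. A sketch under that assumption - the file name is hypothetical, and the argument names are worth checking against your rgdal version:

```r
library(rgdal)

ds <- GDAL.open("big.img")       # hypothetical 10K x 10K image
nr <- dim(ds)[1]                 # rows in the raster
nc <- dim(ds)[2]                 # columns
chunk <- 500                     # rows to read per pass

for (start in seq(0, nr - 1, by = chunk)) {
    n <- min(chunk, nr - start)
    block <- getRasterData(ds, offset = c(start, 0),
                           region.dim = c(n, nc))
    # ... process block here: accumulate sums, or write rows
    #     out to a PostGIS/PostgreSQL table for later queries ...
}
GDAL.close(ds)
```

Only one chunk is resident at a time, which is what lets this scale past what a whole-image read (or ArcGIS) can handle.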
Hey Tim:
I'm writing the database as we speak, which will take a while but will then be
done. It is going faster than I thought it would. I figured out how to call
Arc from the prompt and pass it an AML and variables from R:
aml2pass <- paste('arc "&r make.an.image.aml', which.var, year,
                  formatC(month, width = 2, flag = 0), '"', sep = " ")
system(aml2pass)
This means that I can write temporary files from arc, process them and clean
up without having a lot of loose ends. So, with that and rgdal, everything
is smooth. I'll leave GRASS for another day (I've been saying that for about
seven years now).
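The call above generalises to the whole series with a couple of nested loops. A sketch - the AML name and argument order come from the snippet above, while the variable name and year range are placeholders, and the system() call is guarded so it only fires on a machine that actually has Arc on the path:

```r
which.var <- "ndvi"              # placeholder variable name
for (year in 1982:2000) {
    for (month in 1:12) {
        aml2pass <- paste('arc "&r make.an.image.aml', which.var, year,
                          formatC(month, width = 2, flag = 0), '"',
                          sep = " ")
        if (nzchar(Sys.which("arc")))
            system(aml2pass)
        # read the temporary .img with rgdal here, then delete it
    }
}
```

formatC(month, width = 2, flag = 0) zero-pads single-digit months ("01" ... "09"), which keeps the AML's file names sorting correctly.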
Thanks, A
Is the state of the art approach to use GRASS a la Bivand and Neteler's paper (http://agec144.agecon.uiuc.edu/csiss/Rgeo/)? Is it better to dump the grids to an imagine file (or bil) and read them with rgdal? I work mostly in Windows (b/c of ESRI - damn them!) but switch gears onto Linux when the task requires it. I have never used GRASS. Show me the way!
Hello,
Disclaimer - I've never used GRASS, and I haven't used ESRI much. You might consider using Manifold GIS: http://www.manifold.net/ Much more Windows (and user) friendly IMO, and it has many options for input and output for most formats, as well as generic text and binary. (Give it a few years and Manifold will have taken over much of the GIS market.)
You could output your grids to binary files (or ASCII) and, with some modification of my simple script, write the projection metadata to XML - then script-import these to Manifold. This scripting could be done entirely within R, if that was desirable, using RDCOMClient.
I have some R routines that export with Manifold-friendly metadata in XML - very rudimentary, but they might give you some ideas.
Passing grid (matrix) data from R to Manifold: http://www.georeference.org/Forums/forums/thread-view.asp?tid=789&posts=9
Running R from Manifold: http://www.georeference.org/Forums/forums/thread-view.asp?tid=650&posts=8
Running Manifold from R: http://www.georeference.org/Forums/forums/thread-view.asp?tid=876&posts=9
Cheers, Mike.
###############################################
Michael Sumner - PhD. candidate
Maths and Physics (ACE CRC & IASOS) and Zoology (AWRU)
University of Tasmania
Private Bag 77, Hobart, Tas 7001, Australia
Phone: (03) 6226 1752
http://www.acecrc.org.au/