Sampling polygons for analysis in CART/randomForests

Dear R-sig-Geo community:

This is perhaps a bit off topic (really a geostatistics question), but we've been struggling with it for about six months now, and it reared its head again when I posed a question on one of ESRI's discussion groups. I realized this community is much more experienced with the issue and might be willing to offer advice, and it may be something others are struggling with as well.

Background: We are creating predictive habitat distribution models (a good synthesis paper is Guisan and Zimmermann 2000, Ecological Modelling 135:147-186), using known locations of a species to build a model (and a subsequent raster) that predicts the locations of additional similar habitat. Many different algorithms have been used for the predictive modeling itself (e.g. logistic regression, classification and regression trees [CART], ordination, maximum entropy), but in all cases the user needs full coverages of all the environmental variables plus known locations of the element (species) being modeled, and those known locations must be attributed with values from all the environmental layers. We've been using two different algorithms, one parametric (MaxEnt, a separate package) and one non-parametric (the random forests implementation of CART). I'm using the R implementation, randomForest, but (embarrassingly) I'm currently doing it by dumping the data to ASCII and then reading the tables back into R (looping through them to keep the size manageable). I saw the recent post on this listserv (https://stat.ethz.ch/pipermail/r-sig-geo/2005-March/000338.html) that mentions Furlanello et al. (wow!), and someday we'll strive to integrate in a similar way.

The Furlanello paper is a good example of where we diverge, and of our resulting dilemma. Furlanello et al. (and most others, as far as I can tell, but please enlighten me if I'm wrong) begin with known POINT locations. Point locations are easy to attribute with data from your environmental layers. We have good-quality POLYGON data, and we want to be sure the environmental variability captured by each polygon is represented in the 'presence' data we pass on for analysis. How do we best capture that full variability? Note again that one of our algorithms is parametric (MaxEnt) and one non-parametric (randomForest), and I have no problem deviating from traditional parametric sampling methodologies and leaving many formerly required parametric assumptions behind (or is that a big mistake?).

We've thought long and hard about random sampling, regular sampling, and grabbing all the points. We currently have about 36 environmental variables, all resampled to a single cell size (30 m). That meant bringing the rasters derived from a 10 m DEM up to 30 m and the coarser rasters down to 30 m; I'm being transparent about breaking the 'rules' here just so you know we are aware of the issues surrounding scale. Our rasters are pretty big (22578 rows x 17160 columns; a single ASCII grid is about 2 GB).

Again, our goal is to represent, as fully as possible, the environment within each polygon. For random (or regular) sampling, if you take a consistent number of samples from every polygon, small polygons are overweighted; if you sample on a per-area basis, larger polygons get more weight. So we settled on grabbing ALL the raster cells in each polygon, sticking a point in the center of each cell, and attributing those points.
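In rough outline, the per-polygon step looks something like the sketch below. It is a minimal sketch, not our production code, and the names are illustrative: 'env' stands for a named list of aligned 30 m layers held as numeric matrices, 'poly' for a two-column matrix of polygon vertex coordinates, and the grid is described by its lower-left corner and cell size. The center-in-polygon test uses point.in.polygon() from the sp package.

library(sp)   # for point.in.polygon()

## Indices of all cells whose centers fall inside one polygon.
## The grid is described by its lower-left corner (xll, yll),
## its cell size, and its dimensions; 'poly' is a two-column
## matrix of vertex coordinates (a hypothetical layout).
cells.in.polygon <- function(poly, xll, yll, cellsize, nrows, ncols) {
  ## cell-center coordinates (matrix row 1 = top row of the raster)
  xc <- xll + (seq_len(ncols) - 0.5) * cellsize
  yc <- yll + (seq(nrows, 1) - 0.5) * cellsize
  ## only test centers that fall inside the polygon's bounding box
  cix <- which(xc >= min(poly[, 1]) & xc <= max(poly[, 1]))
  rix <- which(yc >= min(poly[, 2]) & yc <= max(poly[, 2]))
  cand <- expand.grid(row = rix, col = cix)
  keep <- point.in.polygon(xc[cand$col], yc[cand$row],
                           poly[, 1], poly[, 2]) > 0
  cand[keep, ]
}

## Attribute those cell centers with every environmental layer;
## 'env' is a named list of aligned numeric matrices, one per layer.
attribute.cells <- function(cells, env) {
  as.data.frame(lapply(env, function(m) m[cbind(cells$row, cells$col)]))
}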
That all-cells approach obviously blew up with larger polygons (far too many cells), and we realized we'd need to go back to subsampling. I thought the best step would be to sample randomly without replacement, so that I'd get as full a representation as possible (e.g. if a sample of 90 cells/points is requested, return 90 different cells/points). One problem that's been pointed out is that I'm treating rasters as discrete data when most are really representations of continuous data, and interpolating between cell centers would be more appropriate. Unfortunately I don't have an easy way of doing this, but I'm trying to look into it.

My questions:

1. Has anyone ever used *all* the cells in a polygon representation when attributing data for later statistical analyses?
2. Would sampling without replacement be an appropriate approach for capturing as much within-polygon variability as possible? If so, how would you do it?
3. If sampling randomly, has anyone sampled relatively more cells in smaller polygons and relatively fewer cells in larger polygons, so that the number sampled per polygon reaches an asymptote? (See the sketch below for the kind of scheme I mean.)
4. Related: if sampling randomly, do you tend to sample a static number of cells per polygon, or a per-area number of cells?
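To make question 3 concrete, here is the kind of saturating allocation I have in mind. Again, this is only a sketch; n.max and scale are made-up placeholders that would need tuning, not recommendations.

## Saturating allocation: the target sample size rises with the
## number of cells in a polygon but levels off at an asymptote,
## so small polygons are sampled almost exhaustively and large
## ones are capped.
n.target <- function(n.cells, n.max = 500, scale = n.max) {
  ceiling(pmin(n.cells, n.max * (1 - exp(-n.cells / scale))))
}

## e.g. n.target(c(50, 500, 5000, 50000)) gives 48, 317, 500, 500

## Draw that many rows, without replacement, from one polygon's
## attributed cell table (e.g. the output of attribute.cells() above).
sample.cells <- function(cell.data, ...) {
  n <- n.target(nrow(cell.data), ...)
  cell.data[sample(nrow(cell.data), n, replace = FALSE), , drop = FALSE]
}

With scale equal to n.max, the per-polygon sample tracks the cell count closely for small polygons and levels off near n.max for big ones; adjusting scale moves where the curve bends.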
Thank you for your time.

Tim Howard