
Species Distribution by Random Forest using a multi-core PC

Hi,
First of all, sorry for this late answer: I was struggling to compile
all the software on my new PC (16 processors, 16 GB of memory) and to
set up the whole environment for a large computation.

Herein I will try to reply to the input that I received more than a
month ago.
I will insert my questions and the replies in order to track back the history.
Moreover, I will also insert code/observations and how I solved the problems.

1) Issues:

I'm trying to model forest species distribution at the European level
(1 km resolution) by means of randomForest, running it on a cluster
computer. I'm using several predictors from different data sources. All
of them are rasters in GRASS format, so I was using spgrass6
to import the data into R and apply the randomForest prediction to the
layers. At the same time, reading the help pages of the raster
package carefully, it seems to me that its "row by row" feature handles
the memory limitation better than spgrass6 does. Is this
the case?
If the raster package is more efficient, how can I use it to import GRASS
data? I suppose by reading the raster under the cellhd folder.

On 10 January 2011 20:25, Roger Bivand <Roger.Bivand at nhh.no> wrote:

After several tests I decided to use the raster package to read the
GRASS data and write it back. My decision was driven by its faster
performance in reading the data and making the prediction. The drawbacks
were:
1) the raster package cannot read the GRASS region; anyway, I easily
solved this by "cutting" all the GRASS data to the GRASS region
beforehand;
2) the geo-output is a TIFF, which I re-import into GRASS for further analysis.
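For completeness, a hedged sketch of the reading side: the raster package can open a GRASS raster map through GDAL's GRASS driver by pointing at the map's file under the cellhd folder, as guessed in the question above. The location path and map name below are invented examples, and this assumes a GDAL build with GRASS support:

```r
# Hypothetical example: open a GRASS raster map directly with raster()
# via the cellhd file inside the GRASS mapset. The path and the map
# name "elevation" are illustrative only, not from my actual setup.
library(raster)
r <- raster("/home/user/grassdata/europe/forest/cellhd/elevation")
```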

I implemented the prediction function in this way:

library(raster)        # raster(), getValues(), blockSize(), writeStart/writeValues/writeStop
library(randomForest)  # predict() method for the fitted forest
library(gdata)         # unknownToNA() and NAToUnknown()

rastRFpred <- function(x, rf, filename='') {
        out <- raster(x)
        dataType(out) <- 'FLT4S'
        small <- canProcessInMemory(out, 1)
        filename <- trim(filename)
        if (!small & filename == '') {
                filename <- rasterTmpFile()
        }
        if (filename != '') {
                out <- writeStart(out, filename, overwrite=TRUE)
                todisk <- TRUE
        } else {
                vv <- matrix(ncol=nrow(out), nrow=ncol(out))
                todisk <- FALSE
        }
        bs <- blockSize(out)
        for (i in 1:bs$n) {
                # read one block of the stack and deal with the no-data cells
                block_rasters <- getValues(x, row=bs$row[i], nrows=bs$nrows[i])
                block_rasters <- unknownToNA(block_rasters, NaN)
                # recode NA to 0 before predict(); I will fix it in a more clever way
                block_rasters <- NAToUnknown(block_rasters, 0, force=TRUE)
                print(paste("predict random forest for block", i))
                pred <- predict(rf, newdata=block_rasters)
                if (todisk) {
                        writeValues(out, pred, bs$row[i])
                } else {
                        cols <- bs$row[i]:(bs$row[i] + bs$nrows[i] - 1)
                        vv[, cols] <- matrix(pred, nrow=out@ncols)
                }
        }
        if (todisk) {
                out <- writeStop(out)
        } else {
                out <- setValues(out, as.vector(vv))
        }
        return(out)
}

This function takes a raster stack of 30 layers (Europe, 1x1 km) and applies
the random forest prediction. It is very fast: just 10 minutes.
In order to use the multi-core power and run several predictions for
different species, I embedded the R script in a bash script using the
heredoc (<<EOF) syntax, and I run it in this way:

for sp in /forest/sp_* ; do echo $sp ; done | xargs -n 1 -P 4 sh ./randomForest_prediction.sh

xargs runs up to 4 processes in parallel (-P 4), passing 1 argument to each invocation (-n 1).

Of course, the bash variables need to be exported from bash and read
back in R with Sys.getenv().
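As a minimal sketch of this hand-over (the variable name SP and the path are made-up examples, not the names used in my scripts):

```r
# In the bash wrapper (hypothetical), the species path would be exported
# before starting R:
#   export SP="/forest/sp_abies"
#   R --vanilla << EOF ... EOF
# Inside the R script, the value comes back as a plain string:
sp_file <- Sys.getenv("SP")
```

Note that Sys.getenv() returns "" when the variable is unset, so it is worth checking for an empty string before going on with the prediction.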

I always try to do all the GIS and data preprocessing in bash using
GRASS, awk and sed, and to enter R with the data completely clean and
ready for the real spatial statistics. This saves memory and
computation time, and gets the best out of each piece of software.

Now that the whole processing chain is working, I can start to play
around with the random forest tuning parameters and use Robert's
suggestions:
On 11 January 2011 06:51, Robert Hijmans <r.hijmans at gmail.com> wrote:
Indeed, I also prefer to use randomForest in regression mode. The only
problem is the rather low value of the explained variance, due to the
binary values of the response vs the continuous values of the prediction.
So my questions to the community are:
How can I calculate the performance of the model in this particular
case?
Can I consider, also in this case, the maximization of the explained
variance as the driving factor for the tuning?
Yes, I will test it and let you know the performance.

Thanks again for the suggestions.
Best regards,
Giuseppe Amatulli