
slow computation progress for calc function

3 messages · Sara Shaeri, Roger Bivand, Stephen Stewart

#
Hi Barry,

Yes, all of them are running at near 100% usage.

Sara
On Tuesday, June 25, 2019, 9:17:05 PM GMT+10, Barry Rowlingson <b.rowlingson at lancaster.ac.uk> wrote:

On Tue, Jun 25, 2019 at 2:32 AM Sara Shaeri via R-sig-Geo <r-sig-geo at r-project.org> wrote:
interflood <- clusterR(all_predictions, calc,
                       args = list(fun = function(x) {
                         y <- rle(as.numeric(x))
                         max(y$lengths[y$values == 0])
                       }))



If I understand this correctly, you are trying to find the length of the longest run of zeroes in each pixel stack?
You need to find out where the bottleneck is - are all your beginCluster(30) CPU cores running at near 100% usage? If not, then there's a memory or disk bottleneck, which would need a different optimisation strategy than trying to reduce CPU work.
Barry
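To make that reading concrete, here is a minimal standalone sketch of the rle() idea (not the original poster's exact code; it also guards against the no-zeroes case, where max() of an empty vector would otherwise return -Inf with a warning):

```r
# Longest run of zeroes in one pixel's time series, via rle().
longest_zero_run <- function(x) {
  y <- rle(as.numeric(x))
  runs <- y$lengths[y$values == 0]   # lengths of the zero runs only
  if (length(runs)) max(runs) else 0L
}

longest_zero_run(c(1, 0, 0, 0, 1, 0, 0))  # 3
longest_zero_run(c(1, 1, 1))              # 0 (no zeroes at all)
```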




_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
#
On Tue, 25 Jun 2019, Sara Shaeri via R-sig-Geo wrote:

This is how I read this too: finding the longest run of zeroes (no flood?) 
among 8,000 layers. This means that each of the raster cells is 
independent. I assume that all_predictions is not trying to fit into 
memory (how many copies across the cluster?). I believe GRASS reads and 
writes by raster row by default, so it would just iterate across this row 
by row.

I suspect that the clusterR() framework is not what you need; this should 
be feasible on a laptop (data: 2M cells x 8K layers x INT4 ~ 64 GB) by 
stepping through in blocks, shouldn't it? One row is at most 64 MB? Read a 
row for the whole stack, updating the rle's on read, store one row until 
all layers are processed, write the row as INT4? Try GRASS?
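That block-stepping idea might look roughly like the following sketch, using the raster package's low-level row I/O (a toy 4 x 4 stack with 5 layers stands in for the real 8,000-layer stack, and longest_zero_run is the cell-wise function described above; this is an illustration, not the poster's code):

```r
library(raster)

set.seed(1)
# Toy stand-in for all_predictions: 5 binary layers, 4 x 4 cells each.
s <- stack(replicate(5, raster(matrix(sample(0:1, 16, replace = TRUE), 4, 4))))

longest_zero_run <- function(x) {
  y <- rle(as.numeric(x))
  runs <- y$lengths[y$values == 0]
  if (length(runs)) max(runs) else 0L
}

out <- raster(s)  # single-layer template with the same extent/resolution
out <- writeStart(out, rasterTmpFile(), overwrite = TRUE)
for (i in 1:nrow(s)) {
  vals <- getValues(s, row = i)   # cells-in-row x layers matrix for one row
  out  <- writeValues(out, apply(vals, 1, longest_zero_run), i)
}
out <- writeStop(out)
```

Only one row of the stack is held in memory at a time, which is the point of stepping through in blocks.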

Just thinking aloud, the underlying problem is needing to slice cell-wise 
through the array.

Roger

#
I have used the slice approach with success, using complex functions and
multiple outputs on 12,000+ layers (approx. 3,000 x 3,000 cells) loaded into
chunked NetCDF files on a desktop machine, so this should work.

This was all done using the ncdf4 and raster packages. There is some work
involved in setting up the input / output NetCDF files though. The trick
was to select a chunking strategy that minimises row-wise read times
through the time series, then extract slices for each row into a matrix
using the ncdf4 package and use apply() with your custom functions.

The majority of the overhead will be in read/write if you're using rle with
one output. I suspect clusterR / calc will be a lot faster on a chunked
NetCDF as well... I've seen some huge speed improvements before, but it was
a special case with fewer layers and more computationally expensive
functions.

In any case, stacking 8,000 separate rasters is going to be super slow for
processing in R unless you use something like NetCDF.
On Tue., 25 Jun. 2019, 9:45 pm Roger Bivand, <Roger.Bivand at nhh.no> wrote: