
slow computation progress for calc function

3 messages · Sara Shaeri, Roger Bivand, Stephen Stewart

#
Hi Barry,

Yes, all of them are running at near 100% usage.

Sara
On Tuesday, June 25, 2019, 9:17:05 PM GMT+10, Barry Rowlingson <b.rowlingson at lancaster.ac.uk> wrote:

On Tue, Jun 25, 2019 at 2:32 AM Sara Shaeri via R-sig-Geo <r-sig-geo at r-project.org> wrote:
interflood <- clusterR(all_predictions, calc,
                       args = list(fun = function(x) {
                         y <- rle(as.numeric(x))
                         max(y$lengths[y$values == 0])
                       }))



If I understand this correctly, you are trying to find the length of the longest run of zeroes in each pixel stack?
You need to find out where the bottleneck is - are all your beginCluster(30) CPU cores running at near 100% usage? If not, then there's a memory or disk bottleneck, which would need a different optimisation strategy than trying to reduce CPU work.
Barry
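To make that reading concrete, here is a minimal standalone sketch of the rle() idea (not the original poster's exact code; it also guards against the no-zeroes case, where max() of an empty vector would otherwise return -Inf with a warning):

```r
# Longest run of zeroes in one pixel's time series, via rle().
longest_zero_run <- function(x) {
  y <- rle(as.numeric(x))
  runs <- y$lengths[y$values == 0]   # lengths of the zero runs only
  if (length(runs)) max(runs) else 0L
}

longest_zero_run(c(1, 0, 0, 0, 1, 0, 0))  # 3
longest_zero_run(c(1, 1, 1))              # 0 (no zeroes at all)
```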




_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
#
On Tue, 25 Jun 2019, Sara Shaeri via R-sig-Geo wrote:

This is how I read this too: finding the longest run of zeroes (no flood?) 
among 8,000 layers. This means that each of the raster cells is 
independent. I assume that all_predictions is not trying to fit into 
memory (how many copies across the cluster?). I believe GRASS reads and 
writes by raster row by default, so it would just iterate across this row 
by row.

I suspect that the clusterR() framework is not what you need; this should 
be feasible on a laptop (data: 2M cells x 8K layers x INT4 ~ 64 GB) by 
stepping through in blocks, shouldn't it? One row is at most 64 MB? Read a 
row for the whole stack, updating the rle's on read, store one row until 
all layers are processed, write the row as INT4? Try GRASS?
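That block-stepping idea might look roughly like the following sketch, using the raster package's low-level row I/O (a toy 4 x 4 stack with 5 layers stands in for the real 8,000-layer stack, and longest_zero_run is the cell-wise function described above; this is an illustration, not the poster's code):

```r
library(raster)

set.seed(1)
# Toy stand-in for all_predictions: 5 binary layers, 4 x 4 cells each.
s <- stack(replicate(5, raster(matrix(sample(0:1, 16, replace = TRUE), 4, 4))))

longest_zero_run <- function(x) {
  y <- rle(as.numeric(x))
  runs <- y$lengths[y$values == 0]
  if (length(runs)) max(runs) else 0L
}

out <- raster(s)  # single-layer template with the same extent/resolution
out <- writeStart(out, rasterTmpFile(), overwrite = TRUE)
for (i in 1:nrow(s)) {
  vals <- getValues(s, row = i)   # cells-in-row x layers matrix for one row
  out  <- writeValues(out, apply(vals, 1, longest_zero_run), i)
}
out <- writeStop(out)
```

Only one row of the stack is held in memory at a time, which is the point of stepping through in blocks.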

Just thinking aloud, the underlying problem is needing to slice cell-wise 
through the array.

Roger

#
I have used the slice approach with success, using complex functions and
multiple outputs on 12,000+ layers (approx. 3,000 x 3,000 cells) loaded into
chunked NetCDF files on a desktop machine, so this should work.

This was all done using the ncdf4 and raster packages. There is some work
involved in setting up the input / output NetCDF files though. The trick
was to select a chunking strategy that minimises row-wise read times
through the time series, then extract slices for each row into a matrix
using the ncdf4 package and use apply() with your custom functions.

The majority of the overhead will be in read/write if you're using rle with
one output. I suspect clusterR / calc will be a lot faster on a chunked
NetCDF as well... I've seen some huge speed improvements before, but it was
a special case with fewer layers and more computationally expensive
functions.

In any case, stacking 8,000 separate rasters is going to be super slow for
processing in R unless you use something like NetCDF.
On Tue., 25 Jun. 2019, 9:45 pm Roger Bivand, <Roger.Bivand at nhh.no> wrote: