slow computation progress for calc function
I have used the slice approach with success, using complex functions and multiple outputs on 12,000+ layers (approx. 3,000 x 3,000 cells) loaded into chunked NetCDF files on a desktop machine, so this should work. This was all done using the ncdf4 and raster packages. There is some work involved in setting up the input / output NetCDF files, though. The trick was to select a chunking strategy that minimises row-wise read times through the time series, then extract slices for each row into a matrix using the ncdf4 package and use apply() with your custom functions. The majority of the overhead will be in read / write if you're using rle with one output. I suspect clusterR / calc will be a lot faster on a chunked NetCDF as well... I've seen some huge speed improvements before, but that was a special case with fewer layers and more computationally expensive functions. In any case, stacking 8,000 separate rasters is going to be super slow to process in R unless you use something like NetCDF.
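A minimal sketch of that row-slice pattern, assuming the stack has already been written into a single NetCDF file; the file name ("predictions.nc"), variable name ("prediction"), and lon x lat x time dimension order are placeholder assumptions, and writing the result back to an output NetCDF is omitted for brevity:

    library(ncdf4)

    # Longest run of zeroes in one cell's series; 0L if the cell has no zeroes
    longest_zero_run <- function(x) {
      y <- rle(as.numeric(x))
      len <- y$lengths[y$values == 0]
      if (length(len) > 0) max(len) else 0L
    }

    nc    <- nc_open("predictions.nc")  # hypothetical file, dims: lon x lat x time
    ncol_ <- nc$dim$lon$len
    nrow_ <- nc$dim$lat$len
    nlyr  <- nc$dim$time$len

    result <- matrix(NA_integer_, nrow = nrow_, ncol = ncol_)
    for (r in seq_len(nrow_)) {
      # read one row across the whole time series: (all lons, 1 lat, all times)
      slab <- ncvar_get(nc, "prediction",
                        start = c(1, r, 1), count = c(ncol_, 1, nlyr))
      # the singleton lat dimension is dropped, leaving an ncol_ x nlyr matrix
      result[r, ] <- apply(slab, 1, longest_zero_run)
    }
    nc_close(nc)

If the chunking keeps a full row of the time series in a small number of chunks, each ncvar_get() call becomes one mostly sequential read, which is where the speed-up comes from.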
On Tue., 25 Jun. 2019, 9:45 pm Roger Bivand, <Roger.Bivand at nhh.no> wrote:
On Tue, 25 Jun 2019, Sara Shaeri via R-sig-Geo wrote:
Hi Barry,

Yes, all of them are running at near 100% usage.
Sara
On Tuesday, June 25, 2019, 9:17:05 PM GMT+10, Barry Rowlingson
<b.rowlingson at lancaster.ac.uk> wrote:
On Tue, Jun 25, 2019 at 2:32 AM Sara Shaeri via R-sig-Geo
<r-sig-geo at r-project.org> wrote:
interflood <- clusterR(all_predictions, calc,
                       args = list(fun = function(x) {
                           y <- rle(as.numeric(x))
                           return(max(y$lengths[y$values == 0]))
                       }))
If I understand this correctly, you are trying to find the length of the longest run of zeroes in each pixel stack?
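As a minimal illustration of that rle() pattern on a single pixel's series (taking 0 to mean no flood, which is an assumption here); note that max() of an empty vector returns -Inf with a warning, so cells containing no zeroes at all would need guarding:

    x <- c(1, 0, 0, 0, 1, 0, 0, 1)   # one pixel's stack; 0 = no flood (assumed)
    y <- rle(as.numeric(x))          # runs of 1,0,1,0,1 with lengths 1,3,1,2,1
    max(y$lengths[y$values == 0])    # longest run of zeroes: 3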
This is how I read it too: finding the longest run of zeroes (no flood?) among the 8,000 layers. This means that each of the raster cells is independent. I assume that all_predictions does not fit into memory (how many copies across the cluster?). I believe GRASS reads and writes by raster row by default, so it would just iterate across the data row by row. I suspect that the clusterR() framework is not what you need; this should be feasible on a laptop (data: 2M x 8K x INT4 ~ 64G) by stepping through in blocks, shouldn't it? One row is at most 64M. Read a row for the whole stack, updating the rle's on read; store one output row until all layers are processed; then write the row as INT4. Try GRASS? Just thinking aloud - the underlying problem is needing to slice cell-wise through the array.

Roger
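A rough sketch of that block-stepping idea using the raster package's own block I/O; the output file name is a placeholder, and the guard for cells with no zeroes (where max() would return -Inf) is an addition:

    library(raster)

    longest_zero_run <- function(x) {
      y <- rle(as.numeric(x))
      len <- y$lengths[y$values == 0]
      if (length(len) > 0) max(len) else 0L
    }

    out <- raster(all_predictions)   # single-layer template, same geometry
    bs  <- blockSize(out)            # row blocks sized to fit in memory
    out <- writeStart(out, "interflood.tif", datatype = "INT4S")
    for (i in seq_len(bs$n)) {
      # one block of rows across all layers: a (cells-in-block) x (layers) matrix
      v   <- getValues(all_predictions, row = bs$row[i], nrows = bs$nrows[i])
      out <- writeValues(out, apply(v, 1, longest_zero_run), bs$row[i])
    }
    out <- writeStop(out)

Each getValues() call reads a block of rows across all 8,000 layers, so the cell-wise slicing happens in memory one block at a time rather than via the cluster.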
You need to find out where the bottleneck is: are all of your beginCluster(30) CPU cores running at near 100% usage? If not, then there is a memory or disk bottleneck, which would need a different optimisation strategy than trying to optimise CPU usage.

Barry
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand at nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en