
issue in using projectRaster in raster library

10 messages · Alex Zvoleff, Ping Yang, Jonathan Greenberg

Ping:

Re: maxmemory -- you really shouldn't adjust that, and definitely not
to the value you set it to -- by my quick calculations you've asked
for around 7500 GB of memory (assuming each cell takes 64 bits of
space), which is probably why you are crashing.  Chunksize is also
usually set correctly by default; raster processing generally shows
quickly diminishing returns as you increase the chunk size.
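To see where that figure comes from (assuming, hypothetically, a
maxmemory setting of about 1e+12 cells -- the exact value Ping used
isn't shown in this thread), the arithmetic in R is:

```r
# Hypothetical maxmemory setting of 1e+12 cells (the actual value is
# not quoted in the thread); each cell stored as a 64-bit (8-byte)
# double.
cells <- 1e12
bytes_per_cell <- 8
gb_needed <- cells * bytes_per_cell / 2^30  # bytes -> GiB
gb_needed  # roughly 7450 GiB, i.e. on the order of 7500 GB
```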

You might want to try gdalwarp in my gdalUtils package -- it will be
faster to reproject, and you can even enable parallelized warping.

install.packages("gdalUtils",
                 repos="http://R-Forge.R-project.org", type="source")
# To get the latest version; otherwise just install from CRAN
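A minimal sketch of reprojecting with gdalUtils::gdalwarp -- the file
names and target projection here are hypothetical placeholders, just
for illustration:

```r
library(gdalUtils)

# Hypothetical input/output paths and target CRS -- substitute your own.
gdalwarp(srcfile="input.tif", dstfile="output_utm.tif",
         t_srs="+proj=utm +zone=50 +datum=WGS84",
         overwrite=TRUE, verbose=TRUE)
```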

--j
On Mon, Apr 7, 2014 at 3:30 PM, Alex Zvoleff <azvoleff at conservation.org> wrote:

  
    
For gdalUtils, you don't need to upgrade to 3.2 -- the latest CRAN
version only requires R 2.14 or later.  It should find your cygwin
version, but let me know if it doesn't.

The key with the rasterOptions you used is that, previously, you were
forcing the ENTIRE RASTER to load into memory with the max memory and
chunk size you used -- this is really not a good idea for 99.9% of
raster calculations.  I would try two things: 1) reduce the max memory
back to its default settings (1e+08 cells on my machine -- about 750 MB
of RAM usage if each cell is 64 bits), and 2) adjust the chunk size to
maybe 250 MB (3e+07 cells).  If you leave your settings as they are,
particularly on Windows, you'll end up with out-of-memory crashes
fairly often (since I'm guessing you don't have 8 TB of RAM on a
Windows box).
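Those two suggestions could be applied like this (in the raster
version discussed here, both options are counted in cells, not bytes):

```r
library(raster)

# Restore a sane memory ceiling and a moderate chunk size.
# At 8 bytes per cell: 1e+08 cells ~ 750 MB, 3e+07 cells ~ 250 MB.
rasterOptions(maxmemory=1e+08, chunksize=3e+07)
```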

--j
On Mon, Apr 7, 2014 at 4:47 PM, ping yang <pingyang.whu at gmail.com> wrote:

  
    
Ping:

Re: raster::projectRaster -- I'm not sure how much you can speed that
up (Robert?) -- I think it is native R code, which is generally a lot
slower than compiled code like the GDAL utilities.  It has to use
"chunking", but for reasonable chunk sizes this is not likely to cause
any noticeable overhead.

Re: gdalUtils vs. cygwin -- you got basically the same execution time
-- 21.15 vs. 20.7 seconds is not worth worrying about (~ a 2% increase
using GDAL from R).  gdalUtils has a (small) overhead from R, but it
is actually calling your installed cygwin gdalwarp anyway; it is just
an R wrapper for it if you are more comfortable using R functions vs.
the command line.  The first time you launch gdalwarp in an R session
it takes a few seconds to find your local installation of GDAL -- the
second time it runs, there should be almost no overhead.

Re: parallelization -- if you have GDAL 1.10.1 or later (perhaps a
slightly earlier version, not sure when it was implemented), you can
use -multi -wo NUM_THREADS=ALL_CPUS to parallelize the operation -- in
gdalUtils, this would be:

gdalwarp(...,multi=TRUE,wo="NUM_THREADS=ALL_CPUS")

# See: http://lists.osgeo.org/pipermail/gdal-dev/2012-June/033084.html

If you are trying to warp a LOT of images, you can also use e.g.
foreach() to warp each image on a different worker.  You'd have to
experiment with which approach (looping through the images one at a
time while parallel processing each individual image, vs. processing
multiple images at the same time) is more efficient.
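A sketch of the multiple-images-at-once approach (file names and
target projection are hypothetical; this assumes the doParallel and
gdalUtils packages are installed):

```r
library(doParallel)  # attaches foreach and parallel as well
library(gdalUtils)

# Hypothetical list of input rasters to reproject.
in_files <- c("img1.tif", "img2.tif", "img3.tif")

cl <- makeCluster(4)
registerDoParallel(cl)

# Each worker warps one image at a time.
out_files <- foreach(f=in_files, .combine=c,
                     .packages="gdalUtils") %dopar% {
    out <- sub("\\.tif$", "_utm.tif", f)
    gdalwarp(srcfile=f, dstfile=out,
             t_srs="+proj=utm +zone=50 +datum=WGS84", overwrite=TRUE)
    out
}

stopCluster(cl)
```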

--j
On Tue, Apr 8, 2014 at 4:50 PM, ping yang <pingyang.whu at gmail.com> wrote:

  
    
1 day later
Responses below:
On Thu, Apr 10, 2014 at 4:49 PM, ping yang <pingyang.whu at gmail.com> wrote:
GDAL 1.9.2 is pretty old (October 2012) -- I'd recommend upgrading to
1.10.1 -- I think they added the parallel support around v. 1.10.0.
OSGeo4W is my install of choice for Windows, but you can see the
various versions at:
http://trac.osgeo.org/gdal/wiki/DownloadingGdalBinaries  Note that you
need to use the -multi parameter as well; otherwise gdalwarp will
ignore -wo NUM_THREADS=ALL_CPUS.
This looks like a potential firewall problem -- but you also need to
register it with foreach -- this is a good question for r-sig-hpc, by
the way, since that is where the foreach people tend to hang out.  try
something like:

library(doParallel)  # attaches foreach and parallel as well

cl <- makeCluster(spec=4, type="PSOCK")
registerDoParallel(cl)

Keep in mind, if you don't see your CPUs being pegged it is because
you are (likely) I/O limited.

Incidentally, the "cleaner" way to use packages with foreach is the
.packages= parameter rather than calling require() inside the loop:

foreach(..., .packages=c("raster","rgdal","gdalUtils"))
# You don't need to list foreach itself -- it is loaded on the
# workers automatically
The idea is that, for a lot of files, you can approach it in two ways:
1) Sequentially loop through each file, but parallelize the single
file processing (e.g. using -multi -wo NUM_THREADS=ALL_CPUS in GDAL)
-- when you see all the processors light up, this is processing ONE
file.
2) Parallel loop through your files -- each CPU ("worker"), then, is
processing a SINGLE file, but you are processing multiple files at the
same time.  This would be more like what you are trying to accomplish
above (using foreach).
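Approach #1 can be sketched as a sequential loop in which GDAL itself
parallelizes each warp (file names and target projection hypothetical):

```r
library(gdalUtils)

# Hypothetical inputs -- substitute your own file list.
in_files <- c("img1.tif", "img2.tif")

# Sequential loop; GDAL uses all CPUs on each individual warp.
for (f in in_files) {
    gdalwarp(srcfile=f, dstfile=sub("\\.tif$", "_utm.tif", f),
             t_srs="+proj=utm +zone=50 +datum=WGS84",
             multi=TRUE, wo="NUM_THREADS=ALL_CPUS", overwrite=TRUE)
}
```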

Slope calculations are a form of focal-window analysis which can be
CPU limited, so I think #2 is the right way to go.  If you are trying
to cut down the time, you should look for bottlenecks -- are you
reading and writing to the same drive?  Are you doing it over a
network?  Things like that can really slow the process down.

--j

  
    