Skip to content

parallelize projectRaster()

7 messages · Yerguner, Jonathan Greenberg, Michael Sumner

#
Hi All,

Below sample script do the job with a small multi-layered netcdf file (200M)
but it takes ~8 hrs to complete. I need to do the same process with a large
netcdf file(5.5 GB -that I cut this sample file from). I am wondering if
there is way to parallelize this process using beginCluster()-counting on
the build-in capacity of projectRaster()- OR using doParallel().

Thanks,
Yasemin

library(raster)
library(ncdf4)
library(rgdal)
rasterOptions(tmpdir=?/my_tmp_path/?)
ifile <- "cold2002_s1.nc"
colds1 <- brick(ifile)
[1] -9999
colds1[colds1 == -9999] <- NA
[1] -1.7e+308

newproj <- "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997
+b=6370997 +units=m +no_defs"
prjs1<- projectExtent(colds1, newproj) 
res(prjs1) <- c(231.6564, 231.6563)
prjs1 <- projectRaster(colds1, prjs1, method='ngb')
ifile2 <- "/mod2002.tif"
mod02 <- raster(ifile2)
submod02 <- crop(mod02,prj1)
bb <- extent(581811.7, 791924.1, 347931.5, 454493.4)
extent(submod02) <- bb
submod02f <- setExtent(submod02, bb, keepres=FALSE)
stackSelect(prj1,submod02f,filename="modcold02.tif",options="INTERLEAVE=BAND")



--
View this message in context: http://r-sig-geo.2731867.n2.nabble.com/parallelize-projectRaster-tp7586598.html
Sent from the R-sig-geo mailing list archive at Nabble.com.
#
Yerguner:

You might want to consider using gdalUtils
(http://cran.r-project.org/web/packages/gdalUtils/index.html) --
reprojecting via gdal warp is really fast and you can even get it
cranking on multiple processors.

--j
On Wed, Jun 18, 2014 at 12:24 AM, Yerguner <yasebaytok at gmail.com> wrote:

  
    
#
Hi Jonathan,

Thanks a lot for your reply. I've already tried your gdalUtils() package but
it creates a empty .tif file (all zero values) and continues to run several
(>14) hrs. May be I'm not using the function correctly:

library(raster)
rasterOptions(tmpdir="/pegasus/yb3/modvar2000/")
library(ncdf4)
library(rgdal)
library(gdalUtils)
gdal_setInstallation()
ifile2 <- "cold00_s2.nc"
newproj <- "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997
+b=6370997 +units=m +no_defs"
gdalwarp(srcfile= ifile2, dstfile= "ohey_file.tif", t_srs= newproj,
output_Raster=TRUE, multi=TRUE, tr=c(231.6564, 231.6563),r="near")
class       : RasterBrick 
dimensions  : 2049, 4417, 9050433, 365  (nrow, ncol, ncell, nlayers)
resolution  : 231.6564, 231.6563  (x, y)
extent      : -1075057, -51831.17, 241865.9, 716529.7  (xmin, xmax, ymin,
ymax)
coord. ref. : +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997
+b=6370997 +units=m +no_defs 
data source : /pegasus/yb3/daymet2000/ohey_file.tif 
names       : ohey_file.1, ohey_file.2, ohey_file.3, ohey_file.4,
ohey_file.5, ohey_file.6, ohey_file.7, ohey_file.8, ohey_file.9,
ohey_file.10, ohey_file.11, ohey_file.12, ohey_file.13, ohey_file.14,
ohey_file.15, ... 

Thanks,
Yasemin





--
View this message in context: http://r-sig-geo.2731867.n2.nabble.com/parallelize-projectRaster-tp7586598p7586604.html
Sent from the R-sig-geo mailing list archive at Nabble.com.
#
Yerguner:

Have you tried first converting the NetCDF to another "friendlier"
raster format (netcdf isn't really a raster format, keep in mind)--j
first via gdal_translate?  Then you can follow up with a gdalwarp.
Keep in mind gdalUtils is merely a wrapper for the GDAL Utilities --
from their documentation on NetCDF
(http://www.gdal.org/frmt_netcdf.html):

"If the file contains only one netCDF array which appears to be an
image, it may be accessed directly, but if the file contains multiple
images it may be necessary to import the file via a two step process.

The first step is to get a report of the components images (dataset)
in the file using gdalinfo, and then to import the desired images
using gdal_translate. The gdalinfo utility lists all multidimensional
subdatasets from the input netCDF file.

The name of individual images are assigned to the SUBDATASET_n_NAME
metadata item. The description for each image is found in the
SUBDATASET_n_DESC metadata item. For netCDF images will follow this
format: NETCDF:filename:variable_name"
On Wed, Jun 18, 2014 at 10:44 AM, Yerguner <yasebaytok at gmail.com> wrote:

  
    
#
Hi Jonathan,

Believe me I tried that and all other different combinations but starting
with the gdal_translate step only adds extra hrs to finish the whole
process. Actually, when I run the below gdalwarp commanline outside R, it
completes the same small netcdf file in ~ 2hrs but I just do the
reprojection and then do the resampling step in R in ~25 min. So, this was
the my first approach alternative to projectRaster() approach and 2,5 hrs is
better than 8 hrs. But again, unless there is a solution to speed up the
process (at least the reprojection part), both approaches do not work my
case. That's why I'm looking for way to do this all process in R to be able
to parallelize the functions as for the first approach. And I have also many
large files (each about the same size 5.5 GB) to repeat this process. So,
both the file sizes are large and many, in addition to doParallel(),
foreach() package might also be a solution?

gdalwarp and resample() approach:

gdalwarp -t_srs "+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +no_defs
+a=6370997 +b=6370997 +to_meter=1.0" NETCDF:"colds1.nc":tmin colds1.tif     
## ~2 hrs

library(raster)
library(rgdal)
rasterOptions(tmpdir="some_tmp_path")
source_file1 <- "mod2000.tif"
source_file2 <- "colds1.tif"
mod00 <- raster(source_file1)
cold00s1 <- brick(source_file2)
submod00 <- crop(mod00,cold00s1)
rsmp1 <- resample(cold00s1,submod00,method="ngb")     ## 25 min
stackSelect
(rsmp1,submod00,filename="modcold00.tif",options="INTERLEAVE=BAND")

Thank you so much and looking forward to hearing from you,
Yasemin




--
View this message in context: http://r-sig-geo.2731867.n2.nabble.com/parallelize-projectRaster-tp7586598p7586608.html
Sent from the R-sig-geo mailing list archive at Nabble.com.
#
Can you make your test file available via google drive or dropbox or
something?  I'm really surprised GDAL is taking more time than the R
version.  It is hard to replicate your example without the test file.

As an FYI, to enable parallel processing of a single gdalwarp, use
(from: http://lists.osgeo.org/pipermail/gdal-dev/2012-June/033084.html):

gdalwarp -multi src.tif dst.tif -wo NUM_THREADS=ALL_CPUS

You can use these options from gdalUtils if you'd like.

--j
On Wed, Jun 18, 2014 at 2:32 PM, Yerguner <yasebaytok at gmail.com> wrote: