model predict in subset: compare terra and stars

Fri, Jan 21, 2022 4:41 AM

Robert, 
Thank you for the reply and the tips. That's a big improvement! I'll have cases where the masked raster stack (mras) won't fit in RAM so that's why I was explicitly writing out the intermediate products. I'll explore this in that context. 

I've actually wondered if that was the issue on the stars side - too much RAM overhead somehow. 

Thanks again,
Best, 
Tim

From: Robert J. Hijmans <r.hijmans at gmail.com> 
Sent: Friday, January 21, 2022 1:21 AM
To: Howard, Tim G (DEC) <tim.howard at dec.ny.gov>
Cc: r-sig-geo at r-project.org
Subject: Re: [R-sig-Geo] model predict in subset: compare terra and stars


Hello everybody,? 
I cannot answer Tim's questions, but here is a way to do the terra bit much faster, by not writing intermediate steps to disk.
Cheers, Robert

library(terra)
library(stars)
library(sf)
library(microbenchmark)

## setup
tif = system.file("tif/L7_ETMs.tif", package = "stars")
L7_orig = read_stars(tif, proxy = TRUE) %>% split()
# create a shape within which we will want to make the prediction
circle = st_sfc(st_buffer(st_point(c(293749.5, 9115745)), 600), crs = st_crs(L7_orig))
# simple set of points for training the model
trainCoords <- data.frame(
? X = c(290050, 290100, 291700, 293000, 294500, 294600, 298000,
? ? ? ? 289500, 290100, 292500, 294000, 296000, 293600, 294800),
? Y = c(9117000, 9118500, 9117100, 9118000, 9112890, 9111700, 9120000,
? ? ? ? 9112700, 9114000, 9115700, 9118100, 9120000, 9111650, 9116000))

# model for stars
trainPts <- st_as_sf(x = trainCoords, coords = c("X","Y"), crs = st_crs(L7_orig))
trainAttr <- st_drop_geometry(st_as_sf(st_extract(L7_orig, trainPts)))
names(trainAttr) <- c("X1","X2","X3","X4","X5","X6")
trainAttr$pa <- c(rep(1,7), rep(0,7)) #add presence-absence attribute
st_model <- glm(formula=pa~.,data = trainAttr)

# model for terra 
ras = terra::rast(tif) 
trainAttr2 <- extract(ras, trainCoords)[,-1]
trainAttr2$pa <- c(rep(1,7), rep(0,7))
model <- glm(formula=pa~.,data = trainAttr2)
vcircle <- vect(circle)

# time both approaches
mbm <- microbenchmark("starsMethod" = {
? L7_orig_inside = read_stars(tif, proxy = TRUE) %>% split()
? L7_crop <- L7_orig_inside[circle]
? st_out = predict(L7_crop, st_model)
?# st_out is still a proxy with processing steps, writing it out forces those steps to be applied
? write_stars(st_out, file.path(tempdir(),"stars_prediction.tif"),overwrite = TRUE)
},
"terraMethod" = {
? ras = terra::rast(tif)
? cras <- terra::crop(ras, vcircle)
? mras <- terra::mask(cras, vcircle)
? terra_out <- predict(mras, model, filename = file.path(tempdir(),"terra_prediction.tif"), overwrite = TRUE)
}, times = 10)

mbm
#Unit: milliseconds
# ? ? ? ?expr ? ? ?min ? ? ? lq ? ? ?mean ? median ? ? ? uq ? ? ?max neval
# starsMethod 225.4070 227.4722 230.86188 231.4848 233.9374 237.5687 ? ?10
# terraMethod ?93.9507 ?96.0912 ?99.59111 100.0375 103.5670 104.1225 ? ?10

With terra you can use a window ("lazy crop")
? window(ras) <- ext(vcircle)
? mras <- terra::mask(ras, vcircle)
? # but this seems faster
? cras <- terra::crop(ras, vcircle)
? mras <- terra::mask(cras, vcircle)

On Thu, Jan 20, 2022 at 8:43 AM Howard, Tim G (DEC) via R-sig-Geo <mailto:r-sig-geo at r-project.org> wrote:

Dear all - 
Classically, if I want to create a spatial model and make a prediction for an area within the extent of my raster stack my approach would be to crop all predictor rasters to the area of interest and run the predict on that cropped set. The stars package opens up another option: to apply the crop/mask to the proxy object, which then will extract only the relevant cells from the full rasters when writing (or plotting) the result. At least that's how I understand it. 

The following example has both approaches and their timings. I borrowed heavily from stars article 7, other stars articles, and from the terra help pages! Questions below the example.

```
library(terra)
library(stars)
library(sf)
library(microbenchmark)

## setup
tif = system.file("tif/L7_ETMs.tif", package = "stars")
L7_orig = read_stars(tif, proxy = TRUE) %>% split()
# create a shape within which we will want to make the prediction
circle = st_sfc(st_buffer(st_point(c(293749.5, 9115745)), 600), crs = st_crs(L7_orig))
# simple set of points for training the model 
trainCoords <- data.frame(
? X = c(290050, 290100, 291700, 293000, 294500, 294600, 298000,
??????? 289500, 290100, 292500, 294000, 296000, 293600, 294800),
? Y = c(9117000, 9118500, 9117100, 9118000, 9112890, 9111700, 9120000,
??????? 9112700, 9114000, 9115700, 9118100, 9120000, 9111650, 9116000))

trainPts <- st_as_sf(x = trainCoords, coords = c("X","Y"), crs = st_crs(L7_orig))
trainAttr <- st_drop_geometry(st_as_sf(st_extract(L7_orig, trainPts)))
names(trainAttr) <- c("X1","X2","X3","X4","X5","X6")
trainAttr$pa <- c(rep(1,7), rep(0,7)) #add presence-absence attribute

# time both approaches
mbm <- microbenchmark("starsMethod" = {
? L7_orig_inside = read_stars(tif, proxy = TRUE) %>% split()
? st_model <- glm(formula=pa~.,data = trainAttr)
? L7_crop <- L7_orig_inside[circle]
? st_out = predict(L7_crop, st_model)
?# st_out is still a proxy with processing steps, writing it out forces those steps to be applied
? write_stars(st_out, file.path(tempdir(),"stars_prediction.tif"),overwrite = TRUE)
},
"terraMethod" = {
? ras = terra::rast(tif)
? # crop, mask, write out to temp dir
? c_ras <- terra::crop(ras, circle)
? msk <- rasterize(vect(circle), c_ras)
? maskRas <- file.path(tempdir(),"L7_ETMs_crop.tif")
? m_ras <- terra::mask(c_ras, msk, filename = maskRas, overwrite = TRUE)
? model <- glm(formula=pa~.,data = trainAttr)
? # read the cropped raster for predict (already in RAM, but this would be protocol)
? mras <- rast(maskRas)
? names(mras) <- c("X1","X2","X3","X4","X5","X6")
? terra_out <- predict(mras, model, filename = file.path(tempdir(),"terra_prediction.tif"), overwrite = TRUE)
}, times = 10)

# > mbm
# Unit: milliseconds
# expr????? min?????? lq???? mean?? median?????? uq????? max neval
# starsMethod 411.8035 442.4342 701.2864 505.4522 630.8252 1699.737??? 10
# terraMethod? 94.0480? 99.4800 560.6102 106.9317 132.1853 4207.067??? 1
```

As run above, using stars proxies takes a fair amount longer than cropping, saving, and predicting on the cropped rasters. I have found this to be the case for larger (more real-life) examples as well, to the level that for larger data sets and more predictors, I run out of RAM using stars proxies and that approach fails. 

Questions:
1. Am I missing something? Is there a more efficient way for running a model and predicting within a smaller area than the full extent of your predictors?
2. On the surface of things, I would have expected the proxy approach to be faster, especially if using COG rasters (which allow read by chunk, not line). I'm not seeing that here, nor with larger datasets. Any thoughts on the potential bottlenecks?

Thanks in advance, 
Tim

model predict in subset: compare terra and stars

Thread (3 messages)