I have a response variable which is the nighttime light luminosity (called
*ntl* in the data.frame I am using). As you can see in the image attached
(link below), it has some bright spots (highlighted in red) which are areas
with high brightness (outliers in the data.frame). Also, I shared the
histogram of the response. It's clear that the distribution is right-skewed
(and possibly there is bimodality?).
*Purpose of my analysis*
My goal is to predict the *ntl* from the coarse spatial resolution to a
higher. This means that, I need to maintain these bright spots (the
outliers) in the predicted image. Because at a later stage, I will
downscale the residuals of the regression using area-to-point Kriging, I
need the residuals to be random (no spatial structure).
*Analysis*
Following this <https://juliasilge.com/blog/ikea-prices/>approach, the
resulting RF residuals show clearly a spatial structure (i.e., the
residuals do not show a random pattern), especially it seems that the model
cannot capture the areas with high luminosity. This can also be seen if I
plot the fitted vs residuals (please see the GDrive link below). Again, for
a "good" model I would like to see a symmetric scatter of points around the
horizontal like at zero, indicating random deviations of predictions from
the observed values, but from the plot above I see that this isn't the case.
My thought is that in the NTL image one can see where the bright areas are
and were not, while in the covariates there is no such "variation" in these
areas. This means that, the RFR model I use cannot capture this variability
(I'm just thinking out loud).
What are your thoughts on that? How can I improve the model in order to
eliminate the spatial structure in residuals while mainting the bright
areas in the predicted high resolution image? Here is the complete code:
wd <- "path/"
# this a projected CRS
provoliko <- "EPSG:24313"
# the data.frame of the variables
df <- read.csv(paste0(wd, 'block.data.csv'))
# here I extract the coordinates
crds <- df[, 1:2]
# here I remove the coordinates from the data.frame
df <- df[, 3:12]
# random forest regression
library(tidymodels)
set.seed(456)
# splitting the data.frame into training and test set using stratified
random sampling
ikea_split <- initial_split(df, strata = ntl, prop = 0.80)
ikea_train <- training(ikea_split)
ikea_test <- testing(ikea_split)
set.seed(567)
ikea_folds <- bootstraps(ikea_train, strata = ntl)
ikea_folds
library(usemodels)
eq1 <- ntl ~ .
use_ranger(eq1, data = ikea_train)
library(textrecipes)
ranger_recipe <-
recipe(formula = eq1, data = ikea_train)
ranger_spec <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1001) %>%
set_mode("regression") %>%
set_engine("ranger")
ranger_workflow <-
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(ranger_spec)
set.seed(678)
doParallel::registerDoParallel()
ranger_tune <-
tune_grid(ranger_workflow,
resamples = ikea_folds,
grid = 10
)
show_best(ranger_tune, metric = "rsq")
autoplot(ranger_tune)
final_rf <- ranger_workflow %>%
finalize_workflow(select_best(ranger_tune, metric = "rsq"))
final_rf
ikea_fit <- last_fit(final_rf, ikea_split)
ikea_fit
collect_metrics(ikea_fit)
collect_predictions(ikea_fit) %>%
ggplot(aes(ntl, .pred)) +
geom_abline(lty = 2, color = "gray50") +
geom_point(alpha = 0.5, color = "midnightblue") +
coord_fixed()
predict(ikea_fit$.workflow[[1]], ikea_test[15, ])
ikea_all <- predict(ikea_fit$.workflow[[1]], df[, 2:11])
ikea_all <- as.data.frame(cbind(df$ntl, ikea_all))
ikea_all$rsds <- ikea_all$`df$ntl` - ikea_all$.pred
ikea_all <- as.data.frame(cbind(crds, ikea_all))
ikea_all <- ikea_all[,-3:-4]
# export the residuals as a raster
library(terra)
rsds <- rast(ikea_all, type = "xyz")
crs(rsds) <- provoliko
writeRaster(rsds, paste0(wd, "rsds.tif"), overwrite = TRUE)
library(vip)
imp_spec <- ranger_spec %>%
finalize_model(select_best(ranger_tune, metric = "rsq")) %>%
set_engine("ranger", importance = "permutation")
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(imp_spec) %>%
fit(ikea_train) %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))
Because the csv I'm using has several thousands of rows, I can share it via
a link, from here
<https://drive.google.com/drive/folders/1V115zpdU2-5fXssI6iWv_F6aNu4E5qA7?usp=sharing>
(also in the link you can see the spatial structure of residuals and
histogram of the response variable). Just so you know, running and tuning a
model takes ~ 2-3 minutes on my laptop (8 gigs of RAM, 4 cores processor
(CPU: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz)). i am using R 4.3.1 and
RStudio 2023.09.0 Build 463, Windows 11.
Looking at the plots you sent, I have no idea if you have **spatial**
structure in your model residuals. You definitely have
heteroskedasticity and huge residuals on the high end of fitted
values, but these aren't necessarily due to spatial structure.
My guess would be the core cause is that you're using an
interpolation-ish method, random forests, and training it on data that
effectively stops at ~75, looking at your histogram. Regular random
forests are only going to predict within the range of values that they
were trained with, and their predictions are going to be weighted
towards the mean outcome value in the training set. The random forest
then uses your input variables to predict values further away from
that mean, but it's only able to identify relationships between the
levels of predictors and outcomes in the training data. Given the high
number of low NTL observations in your data, the random forest is able
to build a decent model for the lower end of NTL; given that there are
very few high NTL observations, the random forest mostly doesn't
bother trying to predict them.
You could attempt to add more predictors like land cover
classification or to assign your high NTL observations higher case
weights, in order to make the model more incentivized and more able to
predict high NTL values. But probably, given the sheer rarity of these
values, doing so will get you a model that's confidently wrong in
predicting high values, with abrupt jumps in predicted NTL values in
the map. I think the main way to improve this model would be to add
more high NTL observations to the training set.
As an aside, perhaps my favorite part of Edzer and Roger's new book is
their discussion of spatial autocorrelation
(https://r-spatial.org/book/15-Measures.html). The quote I use is "the
important point [is] that spatial autocorrelation is not inherent in
spatial phenomena, but often, is engendered by inappropriate
entitation, by omitted variables and/or inappropriate functional
form." My guess is that you're going to be best served here by
focusing on the model, and not the spatial side of things -- any
spatial structure you do see in your residuals can likely be addressed
by considering what variables you're using in your model.
Thanks,
Michael Mahoney
Michael Mahoney
781-812-8842 | mike.mahoney.218 at gmail.com
Website | LinkedIn | GitHub
On Sun, Oct 1, 2023 at 5:21?AM Nikolaos Tziokas <nikos.tziokas at gmail.com> wrote:
I have a response variable which is the nighttime light luminosity (called
*ntl* in the data.frame I am using). As you can see in the image attached
(link below), it has some bright spots (highlighted in red) which are areas
with high brightness (outliers in the data.frame). Also, I shared the
histogram of the response. It's clear that the distribution is right-skewed
(and possibly there is bimodality?).
*Purpose of my analysis*
My goal is to predict the *ntl* from the coarse spatial resolution to a
higher. This means that, I need to maintain these bright spots (the
outliers) in the predicted image. Because at a later stage, I will
downscale the residuals of the regression using area-to-point Kriging, I
need the residuals to be random (no spatial structure).
*Analysis*
Following this <https://juliasilge.com/blog/ikea-prices/>approach, the
resulting RF residuals show clearly a spatial structure (i.e., the
residuals do not show a random pattern), especially it seems that the model
cannot capture the areas with high luminosity. This can also be seen if I
plot the fitted vs residuals (please see the GDrive link below). Again, for
a "good" model I would like to see a symmetric scatter of points around the
horizontal like at zero, indicating random deviations of predictions from
the observed values, but from the plot above I see that this isn't the case.
My thought is that in the NTL image one can see where the bright areas are
and were not, while in the covariates there is no such "variation" in these
areas. This means that, the RFR model I use cannot capture this variability
(I'm just thinking out loud).
What are your thoughts on that? How can I improve the model in order to
eliminate the spatial structure in residuals while mainting the bright
areas in the predicted high resolution image? Here is the complete code:
wd <- "path/"
# this a projected CRS
provoliko <- "EPSG:24313"
# the data.frame of the variables
df <- read.csv(paste0(wd, 'block.data.csv'))
# here I extract the coordinates
crds <- df[, 1:2]
# here I remove the coordinates from the data.frame
df <- df[, 3:12]
# random forest regression
library(tidymodels)
set.seed(456)
# splitting the data.frame into training and test set using stratified
random sampling
ikea_split <- initial_split(df, strata = ntl, prop = 0.80)
ikea_train <- training(ikea_split)
ikea_test <- testing(ikea_split)
set.seed(567)
ikea_folds <- bootstraps(ikea_train, strata = ntl)
ikea_folds
library(usemodels)
eq1 <- ntl ~ .
use_ranger(eq1, data = ikea_train)
library(textrecipes)
ranger_recipe <-
recipe(formula = eq1, data = ikea_train)
ranger_spec <-
rand_forest(mtry = tune(), min_n = tune(), trees = 1001) %>%
set_mode("regression") %>%
set_engine("ranger")
ranger_workflow <-
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(ranger_spec)
set.seed(678)
doParallel::registerDoParallel()
ranger_tune <-
tune_grid(ranger_workflow,
resamples = ikea_folds,
grid = 10
)
show_best(ranger_tune, metric = "rsq")
autoplot(ranger_tune)
final_rf <- ranger_workflow %>%
finalize_workflow(select_best(ranger_tune, metric = "rsq"))
final_rf
ikea_fit <- last_fit(final_rf, ikea_split)
ikea_fit
collect_metrics(ikea_fit)
collect_predictions(ikea_fit) %>%
ggplot(aes(ntl, .pred)) +
geom_abline(lty = 2, color = "gray50") +
geom_point(alpha = 0.5, color = "midnightblue") +
coord_fixed()
predict(ikea_fit$.workflow[[1]], ikea_test[15, ])
ikea_all <- predict(ikea_fit$.workflow[[1]], df[, 2:11])
ikea_all <- as.data.frame(cbind(df$ntl, ikea_all))
ikea_all$rsds <- ikea_all$`df$ntl` - ikea_all$.pred
ikea_all <- as.data.frame(cbind(crds, ikea_all))
ikea_all <- ikea_all[,-3:-4]
# export the residuals as a raster
library(terra)
rsds <- rast(ikea_all, type = "xyz")
crs(rsds) <- provoliko
writeRaster(rsds, paste0(wd, "rsds.tif"), overwrite = TRUE)
library(vip)
imp_spec <- ranger_spec %>%
finalize_model(select_best(ranger_tune, metric = "rsq")) %>%
set_engine("ranger", importance = "permutation")
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(imp_spec) %>%
fit(ikea_train) %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(alpha = 0.8, fill = "midnightblue"))
Because the csv I'm using has several thousands of rows, I can share it via
a link, from here
<https://drive.google.com/drive/folders/1V115zpdU2-5fXssI6iWv_F6aNu4E5qA7?usp=sharing>
(also in the link you can see the spatial structure of residuals and
histogram of the response variable). Just so you know, running and tuning a
model takes ~ 2-3 minutes on my laptop (8 gigs of RAM, 4 cores processor
(CPU: Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz)). i am using R 4.3.1 and
RStudio 2023.09.0 Build 463, Windows 11.
--
Tziokas Nikolaos
Cartographer
Tel:(+44)07561120302
LinkedIn <http://linkedin.com/in/nikolaos-tziokas-896081130>
[[alternative HTML version deleted]]