Ranger could not work with caret
@Rui Barradas <ruipbarradas at sapo.pt> Thank you again for the useful explanation. Best regards
On Fri, Jul 1, 2022 at 8:26 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,

The error doesn't arise in randomForest because that package has a function, tuneRF, that looks for the best mtry (best relative to the OOB error estimate), and that is the value it uses.

The question's code gives Ranger errors, but it also gives R warnings:

Warning messages:
1: model fit failed for Fold01: mtry=48, min.node.size=5, splitrule=variance Error in ranger::ranger(dependent.variable.name = ".outcome", data = x, :
  User interrupt or internal error.

As you can see, mtry=48 is double ncol(tr), when it should *never* be greater than the number of variables in the data set. Why it is using this value, I don't know. A function bug? Ask the package maintainer?

And, by the way, package caret does, or can do, a grid search for optimal parameter values. If that is giving errors and you are calling rf directly, why bother with caret's error? Use the original function.

Here is an example with tuneRF. Setting argument doBest to TRUE, you'll have both the optimal value for mtry and the fitted random forest. 2 in 1.

library(randomForest)
# randomForest 4.7-1.1
# Type rfNews() to see new features/changes/bug fixes.

c2 <- tuneRF(
  x = tr[-ncol(tr)],
  y = tr$act_effort,
  mtryStart = ncol(tr)/2,
  doBest = TRUE
)
# mtry = 12  OOB error = 139920.7
# Searching left ...
# mtry = 6  OOB error = 170909.3
# -0.2214729 0.05
# Searching right ...
# mtry = 23  OOB error = 128566.7
# 0.08114586 0.05

c2
#
# Call:
#  randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
#                Type of random forest: regression
#                      Number of trees: 500
# No. of variables tried at each split: 23
#
#           Mean of squared residuals: 129734.8
#                     % Var explained: 39.98

Hope this helps,

Rui Barradas

At 17:18 on 01/07/2022, Neha gupta wrote:
Thank you so much for your help. I hope it will work.
However, why doesn't the same error arise when I am using rf? They both
have the same parameters and default values.
Best regards
On Friday, July 1, 2022, Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
The error comes from ranger's mtry parameter becoming greater than the
number of variables (columns).
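A minimal sketch of why the number of columns a model receives can differ from ncol(tr). It assumes the formula interface expands factors into dummy columns, which would be consistent with the caret output printed later in this message (23 predictors, but 48 columns centered and 58 removed). The toy data frame and its column names are hypothetical, not the poster's data, and this is offered as something worth checking rather than a confirmed diagnosis.

```r
# Hypothetical toy data: two factors, one numeric predictor, one outcome
toy <- data.frame(
  f1  = factor(c("a", "b", "c", "a", "b", "c")),
  f2  = factor(c("x", "y", "x", "y", "x", "y")),
  num = c(1.5, 2.0, 3.5, 4.0, 5.5, 6.0),
  y   = c(10, 20, 30, 40, 50, 60)
)

# Predictors as counted in the raw data frame
n_raw <- ncol(toy) - 1

# Predictors after formula-style dummy expansion (intercept column dropped)
mm <- model.matrix(y ~ ., data = toy)[, -1, drop = FALSE]
n_expanded <- ncol(mm)

n_raw
n_expanded
colnames(mm)
```

Any mtry value has to respect the column count at the point where the model is actually fitted, so a filter such as "nzv" that drops columns lowers the upper limit again.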
mtry can be set manually in the tuneGrid argument of caret::train. But with
method = "ranger" the grid must also set the split rule and the minimum node size.
library(caret)
library(farff)
boot <- trainControl(method = "cv", number = 10)
# set the maximum mtry manually to ncol(tr)
# this creates a sequence of mtry values
mtry <- var_seq(ncol(tr), len = 3) # 3 is the default value
mtry
# [1] 2 13 24
splitrule <- c("variance", "extratrees")
min.node.size <- 1:10
mtrygrid <- expand.grid(mtry, splitrule, min.node.size)
names(mtrygrid) <- c("mtry", "splitrule", "min.node.size")
c1 <- train(act_effort ~ ., data = tr,
            method = "ranger",
            tuneLength = 5,  # ignored when tuneGrid is supplied
            metric = "MAE",
            preProc = c("center", "scale", "nzv"),
            tuneGrid = mtrygrid,
            trControl = boot)
c1
# Random Forest
#
# 30 samples
# 23 predictors
#
# Pre-processing: centered (48), scaled (48), remove (58)
# Resampling: Cross-Validated (10 fold)
# Summary of sample sizes: 28, 27, 27, 28, 27, 27, ...
# Resampling results across tuning parameters:
#
# mtry splitrule min.node.size RMSE Rsquared MAE
# 2 variance 1 256.6391 0.8103759 186.3609
# 2 variance 2 249.7120 0.8628109 183.6696
# 2 variance 3 258.8240 0.8284449 189.0712
#
# [...omit...]
#
# 13 extratrees 10 254.9569 0.8918014 191.2524
# 24 variance 1 177.7188 0.9458652 112.2800
# 24 variance 2 172.6826 0.9204287 108.5943
# 24 variance 3 172.9954 0.9271006 109.2554
# 24 variance 4 172.2467 0.9523067 110.0776
# 24 variance 5 175.2485 0.9283317 112.8798
# 24 variance 6 177.9285 0.9369881 115.8970
# 24 variance 7 180.5959 0.9485035 117.5816
# 24 variance 8 178.8037 0.9358033 117.8725
# 24 variance 9 176.5849 0.9210959 117.0055
# 24 variance 10 178.6439 0.9257969 119.8035
# 24 extratrees 1 219.1368 0.8801770 141.0720
# 24 extratrees 2 216.1900 0.8550002 140.9263
# 24 extratrees 3 212.4138 0.8979379 141.4282
# 24 extratrees 4 218.2631 0.9121471 146.2908
# 24 extratrees 5 212.5679 0.9279598 144.2715
# 24 extratrees 6 218.9856 0.9141754 152.2099
# 24 extratrees 7 222.8540 0.9412682 152.4614
# 24 extratrees 8 228.1156 0.9423414 161.8456
# 24 extratrees 9 226.6182 0.9408306 160.5264
# 24 extratrees 10 226.9280 0.9429413 165.6878
#
# MAE was used to select the optimal model using the smallest value.
# The final values used for the model were mtry = 24, splitrule = variance
#  and min.node.size = 2.
plot(c1)
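A small sketch of guarding a tuning grid so that no mtry candidate exceeds the number of columns the model is expected to receive. The limit of 48 is taken from the "centered (48)" line in the output above; the candidate value 60 is hypothetical and is there only to show the filter doing something. Within individual resampling folds the usable column count may be lower still, so this is a sanity check, not a guarantee.

```r
# Assumed upper limit: columns remaining after pre-processing (from the output above)
n_cols <- 48

# Build the grid with named arguments so no renaming step is needed
grid <- expand.grid(
  mtry = c(2, 13, 24, 60),            # 60 is a deliberately invalid candidate
  splitrule = c("variance", "extratrees"),
  min.node.size = 1:10,
  stringsAsFactors = FALSE
)

# Keep only rows whose mtry fits within the available columns
safe_grid <- grid[grid$mtry <= n_cols, ]

nrow(grid)
nrow(safe_grid)
range(safe_grid$mtry)
```

The filtered data frame has the three columns caret expects for method = "ranger" and could be passed as tuneGrid.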
Hope this helps,
Rui Barradas
At 23:03 on 30/06/2022, Neha gupta wrote:
Ok, the data is pasted below
But on the same data (everything the same) and with other models
like RF, SVM etc, it works fine.
> dput(head(tr, 30))
structure(list(recordnumber = c(0, 0.02, 0.04, 0.06, 0.07, 0.08,
0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.16, 0.17, 0.18, 0.23, 0.24,
0.25, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.35, 0.36, 0.37, 0.38,
0.4, 0.41), projectname = structure(c(1L, 1L, 1L, 1L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 5L, 6L), levels = c("de", "erb", "gal",
"X", "hst", "slp", "spl", "Y"), class = "factor"), cat2 =
structure(c(3L,
3L, 3L, 3L, 3L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L,
9L, 11L, 5L, 4L, 6L, 8L, 3L, 9L, 9L, 9L, 9L, 6L, 7L), levels =
c("Avionics",
"application_ground", "avionicsmonitoring",
"batchdataprocessing",
"communications", "datacapture", "launchprocessing",
"missionplanning",
"monitor_control", "operatingsystem", "realdataprocessing",
"science",
"simulation", "utility"), class = "factor"), forg =
structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels =
c("f",
"g"), class = "factor"), center = structure(c(2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 6L), levels = c("1", "2",
"3", "4", "5", "6"), class = "factor"), year = c(0.5, 0.5, 0.5,
0.5, 0.6875, 0.5625, 0.5625, 0.8125, 0.5625, 0.875, 0.5625, 0.75,
0.5625, 0.8125, 0.75, 0.9375, 0.9375, 0.9375, 0.6875, 0.6875,
0.6875, 0.6875, 0.875, 1, 0.9375, 0.9375, 0.9375, 0.9375, 0.5625,
0.25), mode = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), levels = c("embedded", "organic",
"semidetached"
), class = "factor"), rely = structure(c(4L, 4L, 4L, 4L, 4L,
4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 4L), levels = c("vl", "l", "n",
"h", "vh", "xh"), class = "factor"), data = structure(c(2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
5L, 5L, 5L, 5L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 2L), levels = c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), cplx =
structure(c(4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), time =
structure(c(3L,
3L, 3L, 3L, 3L, 6L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L,
3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 3L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), stor =
structure(c(3L,
3L, 3L, 3L, 3L, 6L, 3L, 3L, 3L, 3L, 3L, 3L, 6L, 3L, 3L, 3L, 3L,
3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), virt =
structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), turn =
structure(c(2L,
2L, 2L, 2L, 2L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 2L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), acap =
structure(c(3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), aexp =
structure(c(3L,
3L, 3L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 4L, 5L, 5L, 4L, 5L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), pcap =
structure(c(3L,
3L, 3L, 3L, 3L, 4L, 5L, 4L, 5L, 3L, 4L, 4L, 5L, 4L, 4L, 4L, 4L,
4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 4L, 4L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), vexp =
structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), lexp =
structure(c(4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 1L, 4L, 4L, 4L, 4L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 4L, 3L, 4L, 3L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), modp =
structure(c(4L,
4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 5L, 5L, 5L, 5L, 4L, 4L, 3L, 3L, 4L, 3L, 4L, 4L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), tool =
structure(c(3L,
3L, 3L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 1L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), sced =
structure(c(2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 3L), levels =
c("vl",
"l", "n", "h", "vh", "xh"), class = "factor"), equivphyskloc =
c(0.025534,
0.006945, 0.008988, 0.002655, 0.067102, 0.006741, 0.019508,
0.005209,
0.101215, 0.010622, 0.101215, 0.019508, 0.152283, 0.031253,
0.014401,
0.014401, 0.037892, 0.009294, 0.015729, 0.012154, 0.032377,
0.035339,
0.004698, 0.009703, 0.00572, 0.012358, 0.091002, 0.007252,
0.180778,
0.307527), act_effort = c(117.6, 31.2, 25.2, 10.8, 352.8, 72,
72, 24, 360, 36, 215, 48, 324, 60, 48, 90, 210, 48, 82, 62, 170,
192, 18, 50, 42, 60, 444, 42, 1248, 2400)), row.names = c(1L,
3L, 5L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 17L, 18L, 19L,
24L, 25L, 26L, 29L, 30L, 31L, 32L, 33L, 34L, 36L, 37L, 38L, 39L,
41L, 42L), class = "data.frame")
On Thu, Jun 30, 2022 at 11:28 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
Hello,
Please post data in dput format; without it, it's difficult
to tell.
If I substitute
mpg for act_effort
mtcars for tr
keeping everything else, I don't get any errors.
And the error message says clearly that the error is in tr
(data).
Can you post the output of dput(head(tr, 30))?
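A brief sketch of why dput output is useful on a mailing list: the printed text can be pasted back into R to rebuild the same object, factor levels included. The toy data frame here is hypothetical and stands in for tr.

```r
# Hypothetical stand-in for the poster's data
toy <- data.frame(
  g = factor(c("a", "b", "a")),
  v = c(1.5, 2, 3.25)
)

# Capture what dput() would print, as lines of text
txt <- capture.output(dput(toy))

# Rebuild the object from that text, as a reader of the post would
rebuilt <- eval(parse(text = txt))

# Compare the original and the rebuilt object
same <- isTRUE(all.equal(toy, rebuilt))
same
```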
Rui Barradas
At 19:32 on 30/06/2022, Neha gupta wrote:
> I posted it for the second time as I didn't get any response from group
> members. I am not sure if there is some problem with the question.
>
>
>
> I cannot run the "ranger" model with caret. I am only using the farff and
> caret libraries and the following code:
>
> boot <- trainControl(method = "cv", number=10)
>
> c1 <-train(act_effort ~ ., data = tr,
> method = "ranger",
> tuneLength = 5,
> metric = "MAE",
> preProc = c("center", "scale", "nzv"),
> trControl = boot)
>
> The error I get is the following message, repeated until I interrupt
> it.
>
> Error: mtry can not be larger than number of variables in data. Ranger will
> EXIT now.
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.