Hello,
As decision trees require sorting the variable used for splitting a given
node, I'm trying to avoid having this recurrent sorting by only sorting all
numeric variable first (and only once).
My attempt in doing this is shown in "Solution 2" below, but although I get
the desired result I think the %in% operation may be a costly one (and may
even offset the benefits of pre-sorting).
Any alternative solutions would be highly appreciated.
### Sample data
set.seed(1)
df <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
w <- rep(1L, nrow(df))
# w == 1L denote observation present in the current node
w[c(1, 8, 10)] <- 0L
### The problem: sort x1 within observations present in the current node
### Solution 1: slow for repeated sorting
nodeObsInd <- which(w == 1L)
sol1 <- df[nodeObsInd, ]
sol1 <- sol1[order(sol1$x1), ]$x1
### Solution 2: sort all variables initially only.
sort_fun <- function(x)
{
index <- order(x)
x <- x[index]
data.frame(x, index) # the index gives original position of the obs
}
s_df <- lapply(df, function(x) sort_fun(x))
sol2 <- s_df[[1]][s_df$x1$index %in% nodeObsInd, ]
### check same result
all.equal(sol1, sol2$x)
Regards,
Axel.
Sorting in trees problem
2 messages · Axel Urbiz, Bert Gunter
"%in%" is a wrapper for match(), which uses hashing I believe (correction welcome!),and so is generally very fast. See ?Rprof for profiling R code to get timings. (Haven't used it myself, so not sure how useful it would be for this situation). -- Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Wed, Feb 24, 2016 at 8:45 AM, Axel Urbiz <axel.urbiz at gmail.com> wrote:
Hello,
As decision trees require sorting the variable used for splitting a given
node, I'm trying to avoid having this recurrent sorting by only sorting all
numeric variable first (and only once).
My attempt in doing this is shown in "Solution 2" below, but although I get
the desired result I think the %in% operation may be a costly one (and may
even offset the benefits of pre-sorting).
Any alternative solutions would be highly appreciated.
### Sample data
set.seed(1)
df <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
w <- rep(1L, nrow(df))
# w == 1L denote observation present in the current node
w[c(1, 8, 10)] <- 0L
### The problem: sort x1 within observations present in the current node
### Solution 1: slow for repeated sorting
nodeObsInd <- which(w == 1L)
sol1 <- df[nodeObsInd, ]
sol1 <- sol1[order(sol1$x1), ]$x1
### Solution 2: sort all variables initially only.
sort_fun <- function(x)
{
index <- order(x)
x <- x[index]
data.frame(x, index) # the index gives original position of the obs
}
s_df <- lapply(df, function(x) sort_fun(x))
sol2 <- s_df[[1]][s_df$x1$index %in% nodeObsInd, ]
### check same result
all.equal(sol1, sol2$x)
Regards,
Axel.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.