anyone know why package "RandomForest" na.roughfix is so slow??

Thu, Jul 1, 2010 5:07 PM

Here's another version that's a bit easier to read:

na.roughfix2 <- function (object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
}

roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)

  if (is.numeric(x)) {
    x[missing] <- median.default(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}

I'm cheating a bit because as.data.frame is so slow.
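
For anyone who wants to try this quickly, here is a minimal sketch on a made-up data frame. The toy data and the name `toy` are hypothetical and not from the thread. The two functions are restated from the message above so the block runs on its own, with `median()` used in place of `median.default()`. As in the original, `structure()` builds the result directly instead of calling `as.data.frame()`.

```r
# Helper restated from the message above (median() instead of median.default()).
roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)

  if (is.numeric(x)) {
    # Numeric column: replace NAs with the median of the observed values
    x[missing] <- median(x[!missing])
  } else if (is.factor(x)) {
    # Factor column: replace NAs with the most frequent level
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}

na.roughfix2 <- function(object, ...) {
  res <- lapply(object, roughfix)
  # Set the class and row names directly rather than going through as.data.frame()
  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
}

# Hypothetical toy data: one numeric and one factor column, each with one NA
toy <- data.frame(
  num = c(1, NA, 3, 10),
  fac = factor(c("a", "b", NA, "b"))
)

fixed <- na.roughfix2(toy)
print(fixed)
```

Note that the result gets fresh row names `1..n`, so any custom row names on the input are not carried over.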

Hadley

On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson <this.is.mvw at gmail.com> wrote:

Jim, Andy,

Thanks for your suggestions!

I found some time today to futz around with it, and I found a "home
made" script to fill in NA values to be much quicker.  For those who are
interested, instead of using:

    dataSet <- na.roughfix(dataSet)

I used:

    origCols <- names(dataSet)
    ## Fix numeric values...
    dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
        if(!is.numeric(x)) { x } else {
            ifelse(is.na(x), median(x, na.rm=TRUE), x) } } ),
        row.names=row.names(dataSet) )
    ## Fix factors...
    dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
        if(!is.factor(x)) { x } else {
            ## which.max(table(x)) is the index of the most frequent level;
            ## ifelse() returns the integer level codes, which index levels(x)
            factor(levels(x)[ifelse(!is.na(x), x, which.max(table(x)))],
                   levels=levels(x)) } } ),
        row.names=row.names(dataSet) )
    names(dataSet) <- origCols



In one case study that I ran, the na.roughfix() algo took 296 seconds
whereas the homemade one above took 16 seconds.

Regards,
Mike



"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
 -- xkcd

--
Help protect Wikipedia. Donate now:
http://wikimediafoundation.org/wiki/Support_Wikipedia/en


On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy <andy_liaw at merck.com> wrote:

You need to isolate the problem further, or give more detail about your
data.  This is what I get:

R> nr <- 2134
R> nc <- 14037
R> x <- matrix(runif(nr*nc), nr, nc)
R> n.na <- round(nr*nc/10)
R> x[sample(nr*nc, n.na)] <- NA
R> system.time(x.fixed <- na.roughfix(x))
   user  system elapsed
   8.44    0.39    8.85

R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB
RAM.

Andy
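
One thing visible in the thread is that the benchmark above fills a numeric matrix, while the slow calls reported elsewhere are made on a data frame. A rough way to check whether the container matters is to time the same column-wise median fill on both. The sketch below uses base R only, with dimensions scaled down from the thread. The helper `fill_median` and the seed are assumptions made for illustration, not anything from randomForest. If randomForest is installed, the two timed calls could be swapped for `na.roughfix(x)` and `na.roughfix(df)`.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
nr <- 200                        # scaled down from the 2134 x 14037 case
nc <- 500
x <- matrix(runif(nr * nc), nr, nc)
x[sample(nr * nc, round(nr * nc / 10))] <- NA   # about 10% missing, as above
df <- as.data.frame(x)           # same values, stored as a data frame

# Hypothetical helper: replace NAs in one column with that column's median
fill_median <- function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}

# Time the fill on the matrix (column by column via apply)
t_matrix <- system.time(m_fixed <- apply(x, 2, fill_median))

# Time the same fill on the data frame (column by column via lapply)
t_df <- system.time(d_fixed <- as.data.frame(lapply(df, fill_median)))

print(rbind(matrix = t_matrix, data.frame = t_df))
```

Timings will vary by machine and R version, so no expected numbers are given here. The point is only to compare the two calls on the same data.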

------------------------------
*From:* Mike Williamson [mailto:this.is.mvw at gmail.com]
*Sent:* Thursday, July 01, 2010 12:48 PM
*To:* Liaw, Andy
*Cc:* r-help
*Subject:* Re: [R] anyone know why package "RandomForest" na.roughfix is
so slow??

Andy,

You're right, I didn't supply any code, because my call was very simple
and it was the call itself in question.  However, here is the associated
code I am using:


    naFixTime <- system.time( {
        if (fltrResponse) {
            ## TRUE: there are no NA's in the response... cleared via earlier steps
            message(paste(iAm, ": Missing values will now be imputed...\n", sep=""))
            try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet), response)],
                                     dataSet[,response]) )
        } else {
            ## In this case, there is no "response" column in the data set
            message(paste(iAm, ": Missing values will now be filled in with median",
                          " values or most frequent levels", sep=""))
            try( dataSet <- na.roughfix(dataSet) )
        }
    } )



As you can see, the "na.roughfix" call is made as simply as possible:
I supply the entire dataSet (only parameters, no responses).  I am not doing
the prediction here (that is done later, and the prediction itself is not
taking very long).

Here are some calculation times that I experienced:

# rows    # cols    time to run na.roughfix
======    ======    =======================
  2046      2833    ~ 2 minutes
  2066      5626    ~ 6 minutes
  2134     14037    ~ 30 minutes

These numbers are on a Windows server using the 64-bit version of 'R'.

Regards,
Mike




On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_liaw at merck.com> wrote:

You have not shown any code on exactly how you use na.roughfix(), so I
can only guess.

If you are doing something like:

    randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)

I would not be surprised that it's taking very long on large datasets.
Most likely it's caused by the formula interface, not na.roughfix()
itself.

If that is your case, try doing the imputation beforehand and run
randomForest() afterward; e.g.,

myroughfixed <- na.roughfix(mybigdata)
randomForest(myroughfixed[list.of.predictor.columns],
myroughfixed[[myresponse]],...)

HTH,
Andy

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On Behalf Of Mike Williamson
Sent: Wednesday, June 30, 2010 7:53 PM
To: r-help
Subject: [R] anyone know why package "RandomForest" na.roughfix is so
slow??

Hi all,

I am using the package "random forest" for random forest predictions.  I
like the package.  However, I have fairly large data sets, and it can often
take *hours* just to go through the "na.roughfix" call, which simply goes
through and cleans up any NA values to either the median (numerical data) or
the most frequent occurrence (factors).

I am going to start doing some comparisons between na.roughfix() and
some apply() functions which, it seems, are able to do the same job more
quickly.  But I hesitate to duplicate a function that is already in the
package, since I presume the na.roughfix should be as quick as possible and
it should also be well "tailored" to the requirements of random forest.

Has anyone else seen that this is really slow?  (I haven't noticed
rfImpute to be nearly as slow, but I cannot say for sure:  my "predict" data
sets are MUCH larger than my model data sets, so cleaning the prediction
data set simply takes much longer.)

If so, any ideas how to speed this up?

Thanks!
Mike




______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
