anyone know why package "RandomForest" na.roughfix is so slow??
Here's another version that's a bit easier to read:
na.roughfix2 <- function (object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame", row.names = seq_len(nrow(object)))
}
roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)
  if (is.numeric(x)) {
    x[missing] <- median.default(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}
I'm cheating a bit because as.data.frame is so slow.
Hadley
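A quick way to sanity-check the per-column approach is to run it on a tiny data frame where the answers are obvious. The sketch below is illustrative only: `fill_column` and the toy data frame `df` are hypothetical names, and `fill_column` simply restates the median / most-frequent-level rule so the snippet runs on its own.

```r
# Hypothetical helper restating the rule from the message above:
# numeric NAs get the column median, factor NAs get the most frequent level.
fill_column <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)
  if (is.numeric(x)) {
    x[missing] <- median(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("only numeric or factor columns are handled")
  }
  x
}

# Toy data, made up for illustration.
df <- data.frame(
  num = c(1, NA, 3, 10),
  fac = factor(c("a", "b", NA, "b"))
)

# Same trick as above: build the data frame with structure() instead of
# going through as.data.frame().
res <- lapply(df, fill_column)
fixed <- structure(res, class = "data.frame", row.names = seq_len(nrow(df)))
print(fixed)
```

Here the missing numeric value should become the median of 1, 3 and 10, and the missing factor value should become "b".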
On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson <this.is.mvw at gmail.com> wrote:
Jim, Andy,
    Thanks for your suggestions!
    I found some time today to futz around with it, and I found a "home
made" script to fill in NA values to be much quicker.  For those who are
interested, instead of using:
        dataSet <- na.roughfix(dataSet)
    I used:
        origCols <- names(dataSet)
        ## Fix numeric values: replace NAs with the column median...
        dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
            if(!is.numeric(x)) { x } else {
                ifelse(is.na(x), median(x, na.rm=TRUE), x) } } ),
            row.names=row.names(dataSet) )
        ## Fix factors: replace NAs with the most frequent level...
        dataSet <- as.data.frame(lapply(dataSet, FUN=function(x) {
            if(!is.factor(x)) { x } else {
                ## which.max(table(x)) is the index of the most frequent level
                factor(levels(x)[ifelse(!is.na(x), x, which.max(table(x)))],
                       levels=levels(x)) } } ),
            row.names=row.names(dataSet) )
        names(dataSet) <- origCols
    In one case study that I ran, the na.roughfix() algo took 296 seconds
whereas the homemade one above took 16 seconds.
                                      Regards,
                                            Mike
"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
 -- xkcd
--
Help protect Wikipedia. Donate now:
http://wikimediafoundation.org/wiki/Support_Wikipedia/en
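For the factor step, the "most frequent level" lookup can be checked in isolation. The following is a minimal sketch on a made-up factor `f` (all names here are hypothetical); it relies only on base R, where `table()` on a factor drops NAs by default and reports counts in the order of `levels()`.

```r
# Toy factor, made up for illustration; "b" is the most frequent level.
f <- factor(c("a", "b", NA, "b", "c", NA))

counts <- table(f)                   # NAs excluded; order follows levels(f)
mode_index <- which.max(counts)      # position of the most frequent level
mode_level <- names(counts)[mode_index]

# Work on the integer codes, fill the gaps with the mode's index, then
# map the codes back to labels while keeping the original level set.
codes <- as.integer(f)
codes[is.na(codes)] <- mode_index
f_fixed <- factor(levels(f)[codes], levels = levels(f))
print(f_fixed)
```

Rebuilding the result with `factor(..., levels = levels(f))` keeps the column a factor regardless of how a later `as.data.frame()` call treats character vectors.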
On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy <andy_liaw at merck.com> wrote:
You need to isolate the problem further, or give more detail about your
data.  This is what I get:
R> nr <- 2134
R> nc <- 14037
R> x <- matrix(runif(nr*nc), nr, nc)
R> n.na <- round(nr*nc/10)
R> x[sample(nr*nc, n.na)] <- NA
R> system.time(x.fixed <- na.roughfix(x))
   user  system elapsed
   8.44    0.39    8.85
R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB
ram.
Andy
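The benchmark above uses a numeric matrix, whereas the code elsewhere in this thread (`names(dataSet)`, `as.data.frame(...)`) suggests the slow case involved a data frame. A hedged sketch for timing both inputs side by side follows. The sizes `nr` and `nc` are arbitrary and much smaller than in the thread, `fill_median` is a hypothetical helper, and the `randomForest` calls run only if that package happens to be installed. No particular timings are implied.

```r
# Smaller than the sizes in the thread so the sketch runs quickly;
# nr and nc are arbitrary choices.
set.seed(1)
nr <- 200
nc <- 500
x <- matrix(runif(nr * nc), nr, nc)
x[sample(nr * nc, round(nr * nc / 10))] <- NA
df <- as.data.frame(x)

# Hypothetical helper: per-column median fill in base R (numeric only).
fill_median <- function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}
t_base <- system.time(df_base <- as.data.frame(lapply(df, fill_median)))
print(t_base)

# Only attempted if the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  t_mat <- system.time(m_rf <- randomForest::na.roughfix(x))
  t_df <- system.time(df_rf <- randomForest::na.roughfix(df))
  print(t_mat)  # matrix input
  print(t_df)   # data frame input
}
```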
------------------------------
*From:* Mike Williamson [mailto:this.is.mvw at gmail.com]
*Sent:* Thursday, July 01, 2010 12:48 PM
*To:* Liaw, Andy
*Cc:* r-help
*Subject:* Re: [R] anyone know why package "RandomForest" na.roughfix is so slow??
Andy,
    You're right, I didn't supply any code, because my call was very simple
and it was the call itself at question.  However, here is the associated
code I am using:
        naFixTime <- system.time( {
            if (fltrResponse) {  ## TRUE: there are no NA's in the response... cleared via earlier steps
                message(paste(iAm, ": Missing values will now be imputed...\n", sep=""))
                try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet), response)],
                                         dataSet[,response]) )
            } else {  ## In this case, there is no "response" column in the data set
                message(paste(iAm, ": Missing values will now be filled in with median",
                              " values or most frequent levels", sep=""))
                try( dataSet <- na.roughfix(dataSet) )
            }
        } )
    As you can see, the "na.roughfix" call is made as simply as possible:
I supply the entire dataSet (only parameters, no responses).  I am not doing
the prediction here (that is done later, and the prediction itself is not
taking very long).
    Here are some calculation times that I experienced:
# rows       # cols       time to run na.roughfix
=======     =======     =======================
  2046         2833             ~ 2 minutes
  2066         5626             ~ 6 minutes
  2134        14037             ~ 30 minutes
    These numbers are on a Windows server using the 64-bit version of 'R'.
                                           Regards,
                                                    Mike
On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_liaw at merck.com> wrote:
You have not shown any code on exactly how you use na.roughfix(), so I
can only guess.  If you are doing something like:

    randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)

I would not be surprised that it's taking very long on large datasets.
Most likely it's caused by the formula interface, not na.roughfix()
itself.  If that is your case, try doing the imputation beforehand and
run randomForest() afterward; e.g.,

    myroughfixed <- na.roughfix(mybigdata)
    randomForest(myroughfixed[list.of.predictor.columns],
                 myroughfixed[[myresponse]],...)

HTH,
Andy

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Mike Williamson
Sent: Wednesday, June 30, 2010 7:53 PM
To: r-help
Subject: [R] anyone know why package "RandomForest" na.roughfix is so slow??

Hi all,

    I am using the package "random forest" for random forest predictions.
I like the package.  However, I have fairly large data sets, and it can
often take *hours* just to go through the "na.roughfix" call, which simply
goes through and cleans up any NA values to either the median (numerical
data) or the most frequent occurrence (factors).

    I am going to start doing some comparisons between na.roughfix() and
some apply() functions which, it seems, are able to do the same job more
quickly.  But I hesitate to duplicate a function that is already in the
package, since I presume the na.roughfix should be as quick as possible and
it should also be well "tailored" to the requirements of random forest.

    Has anyone else seen that this is really slow?  (I haven't noticed
rfImpute to be nearly as slow, but I cannot say for sure:  my "predict" data
sets are MUCH larger than my model data sets, so cleaning the prediction
data set simply takes much longer.)

    If so, any ideas how to speed this up?

                                Thanks!
                                    Mike
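Andy's suggestion in the reply above (impute first, then call randomForest() through the x/y interface rather than the formula interface) might look like the sketch below. It is an illustration under assumptions, not code from the thread: it uses the built-in `iris` data with a few NAs injected by hand, the names `dat`, `response` and `predictors` are hypothetical, `ntree = 50` is an arbitrary small value, and everything is skipped unless the `randomForest` package is installed.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  set.seed(42)

  # Toy data: iris with a handful of NAs injected in two predictors.
  dat <- iris
  dat$Sepal.Length[c(3, 50, 120)] <- NA
  dat$Petal.Width[c(10, 75)] <- NA

  response <- "Species"
  predictors <- setdiff(names(dat), response)

  # Step 1: impute once, up front.
  fixed <- na.roughfix(dat)

  # Step 2: fit through the x/y interface, avoiding the formula interface.
  fit <- randomForest(x = fixed[predictors], y = fixed[[response]], ntree = 50)
  print(fit)
}
```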
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/