RandomForest, Party and Memory Management
Dear Dennis and dear All,

It was probably not my best post. I am running R on a Debian box (amd64 architecture), which is why I was surprised to see memory issues when dealing with a vector larger than 1 GB. The memory is there, but it is probably not contiguous. I will look into the matter and post again, generating an artificial data frame if needed (a sketch of what I have in mind is below).

Many thanks
Lorenzo
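Something along these lines should do for the artificial data. This is only a sketch: the variable types below are guesses (the real ones are not shown in the thread), and n should be set to the real number of rows.

set.seed(1)
n <- 1e6  # adjust to the real data size

## Hypothetical stand-in for the real trainRF; column names taken
## from the model formulas below, types and ranges are assumptions.
trainRF <- data.frame(
  SalePrice        = rlnorm(n, meanlog = 10),                        # guessed distribution
  ModelID          = factor(sample(1:50, n, replace = TRUE)),        # <= 53 levels for randomForest
  ProductGroup     = factor(sample(LETTERS[1:6], n, replace = TRUE)),
  ProductGroupDesc = factor(sample(paste0("desc", 1:6), n, replace = TRUE)),
  MfgYear          = sample(1990:2012, n, replace = TRUE),
  saledate3        = sample(1:365, n, replace = TRUE),               # guessed encoding
  saleday          = sample(1:31, n, replace = TRUE),
  salemonth        = factor(sample(month.abb, n, replace = TRUE))
)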
On 4 February 2013 00:50, Dennis Murphy <djmuser at gmail.com> wrote:
Hi Lorenzo:

On Sun, Feb 3, 2013 at 11:47 AM, Lorenzo Isella <lorenzo.isella at gmail.com> wrote:
Dear All,

For a data mining project, I am relying heavily on the randomForest and party packages. Due to the large size of the data set, I often run into memory problems (in particular with the party package; randomForest seems to use less memory). I really have two questions at this point.

1) Please see below how I am using the party and randomForest packages. Any comment is welcome and useful.
As noted elsewhere, the example is not reproducible, so I can't help you there.
library(party)  # provides cforest() and cforest_unbiased()

myparty <- cforest(SalePrice ~ ModelID + ProductGroup + ProductGroupDesc +
                     MfgYear + saledate3 + saleday + salemonth,
                   data = trainRF,
                   control = cforest_unbiased(mtry = 3, ntree = 300, trace = TRUE))
library(randomForest)  # provides randomForest()

rf_model <- randomForest(SalePrice ~ ModelID + ProductGroup + ProductGroupDesc +
                           MfgYear + saledate3 + saleday + salemonth,
                         data = trainRF, na.action = na.omit,
                         importance = TRUE, do.trace = 100,
                         mtry = 3, ntree = 300)
2) Sometimes R crashes after telling me that it is unable to allocate, e.g., an array of 1.5 GB. However, I have 4 GB of RAM on my box, so technically the memory is there. Is there a way to enable R to use more of it?
4 GB is not a lot of RAM for data mining projects. I have twice that and run into memory limits on some fairly simple tasks (e.g., 2D tables) in large simulations with 1M or 10M runs. Part of the problem is that data are often copied, sometimes more than once: if you have a 1 GB input data frame, three copies and you're out of space. Moreover, copied objects need contiguous memory, which becomes very difficult to obtain with large objects and limited RAM. With 4 GB of RAM, you need to be more clever:

* eliminate as many other processes that use RAM as possible (e.g., no active browser);
* think of ways to process your data in chunks (harder to do when the objective is model fitting; one possible workaround is sketched after this message);
* type ?"Memory-limits" (including the quotes) at the console for an explanation of R's memory limits and a few places to look for potential solutions;
* look into 'big data' packages like ff or bigmemory, among others;
* if you're at an (American?) academic institution, you can get a free license for Revolution R, which is supposed to handle big data problems better than vanilla R.

It's hard to be specific about potential solutions, but the above should broaden your perspective on the big data problem and possible avenues for solving it.

Dennis
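P.S. On the chunking point: one possibility (a rough sketch under my assumptions, not something I have run on your data) is to drop the unused columns and then grow several smaller forests on disjoint row chunks, merging them with randomForest's combine(). The combined forest is not identical to one grown on the full data, and combine() drops the out-of-bag error components (err.rate/mse), but peak memory use is lower.

library(randomForest)

## Keep only the formula variables; the full data frame (and its
## copies) otherwise stays in memory during fitting.
vars <- c("SalePrice", "ModelID", "ProductGroup", "ProductGroupDesc",
          "MfgYear", "saledate3", "saleday", "salemonth")
dat  <- trainRF[, vars]

## Fit three forests of 100 trees each on disjoint row chunks,
## then merge them into one 300-tree forest.
chunks  <- split(seq_len(nrow(dat)), rep(1:3, length.out = nrow(dat)))
forests <- lapply(chunks, function(idx)
  randomForest(SalePrice ~ ., data = dat[idx, ],
               na.action = na.omit, mtry = 3, ntree = 100))
rf_chunked <- do.call(combine, unname(forests))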
Many thanks
Lorenzo