2 questions

2 messages · Saket Joshi

#
Hi all,

I am using R 1.5.0 under Unix.

I have a couple of questions here.

1. My program is running out of memory. I am writing a program to grow a
list of trees using rpart() on a subset of a large dataset (5807 x 693), with
a different response for every tree. I saw that after each tree was
constructed, 116 MB of data was added to the Vcells. I have no idea
what this data is. My dataset is 30 MB and each tree is 1.6 MB.
Could someone tell me how to monitor what data is being stored in the
Vcells?

2. This is related to the same program as above. When growing a tree I
used the expression:

fit <- rpart(formula= x[[34]] ~ ., data = x)

This does not give an error, but it gives an obviously wrong answer. But
when I rearranged the data.frame, x, so that the response variable comes
first and all the other variables follow in the remaining columns, and
tried using

fit <- rpart(x)

it worked perfectly, i.e., it gave the correct tree.
Could someone tell me what to do if I want the 34th column of the
data.frame to be the response variable but don't want to use the column
names in the formula for growing the tree?

Thanks in advance.
-Saket.
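
On question 1, a minimal sketch of how memory use might be tracked between fitting steps, using only `gc()` and `object.size()` from base R. The helper names `workspace_sizes` and `vcells_used` are hypothetical, and the `numeric(1e6)` allocation merely stands in for one call to rpart(). Note that `object.size()` does not include the environment attached to a function, so an object can keep more memory alive than its reported size suggests.

```r
# Size in bytes of every object in an environment, largest first.
# workspace_sizes is a hypothetical helper name, not part of base R.
workspace_sizes <- function(env = globalenv()) {
  nms <- ls(envir = env)
  sizes <- vapply(
    nms,
    function(nm) as.numeric(object.size(get(nm, envir = env))),
    numeric(1)
  )
  sort(sizes, decreasing = TRUE)
}

# Vcells currently in use, as reported by gc() after a collection.
vcells_used <- function() gc()["Vcells", "used"]

before <- vcells_used()
big <- numeric(1e6)   # stand-in for one tree-fitting step
after <- vcells_used()

cat("Vcells added by this step:", after - before, "\n")
print(head(workspace_sizes(), 5))
```

Comparing the growth reported by `gc()` with the sizes reported by `workspace_sizes()` gives a rough idea of how much memory is held outside the visible objects.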


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
1 day later
#
Hi all,

My sincere apologies to all those who could not understand my previous
question and so could not answer it. I am not a statistician, nor
have I worked with R for long, so please excuse my naive language. I hope I
can explain my question better this time.

I have a data.frame named 'temp'. The following is the series of commands
that I ran after obtaining this data.frame:
$names
 [1] "frame"     "where"     "call"      "terms"     "cptable"   "splits"
 [7] "method"    "parms"     "control"   "functions" "y"         "ordered"

$class
[1] "rpart"
$summary
function (yval, dev, wt, ylevel, digits)
{
    paste("  mean=", formatg(yval, digits), ", MSE=", formatg(dev/wt,
        digits), sep = "")
}
<environment: 4494214>

$text
function (yval, dev, wt, ylevel, digits, n, use.n)
{
    if (use.n) {
        paste(formatg(yval, digits), "\nn=", n, sep = "")
    }
    else {
        paste(formatg(yval, digits))
    }
}
<environment: 4494214>
           used  (Mb) gc trigger  (Mb)
Ncells   330122   8.9    1162530  31.1
Vcells 46072722 351.6   64233246 490.1

           used  (Mb) gc trigger  (Mb)
Ncells   326469   8.8    1162530  31.1
Vcells 34321042 261.9   64233246 490.1


When the "functions" component of x was set to NULL, the Vcells usage
dropped from 351.6 Mb to 261.9 Mb, as can be seen from the two gc()
outputs above.

I imagine that the rpart object 'x' stores a pointer named 'functions'
to a large amount of data in the Vcells, and that this data was garbage
collected when the pointer 'functions' was set to NULL. However, I am not
sure that I am right on this count.

My question is: Is there a way in which the options to rpart or otherwise
can be set so as to never create the pointer 'functions' while fitting the
rpart model in the first place instead of having to delete it later in
order to save memory?

Thanks in advance,
Saket.
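
A minimal sketch of the workaround described above, wrapped so that the 'functions' component is dropped as soon as each tree is fitted. The `<environment: ...>` lines in the output above show that these functions are closures, so one plausible reading is that they keep the environment they were created in alive. The helper `fit_and_strip`, the data set `mtcars`, and the two response names are assumptions made for illustration, and no rpart() option that skips creating the component is claimed here. Methods that format node labels may rely on this component, so the sketch is aimed at trees kept for prediction.

```r
library(rpart)  # recommended package, shipped with most R installations

# Hypothetical stand-in data; the poster's data frame is not available.
dat <- mtcars

# fit_and_strip is a hypothetical helper: fit one tree, then drop the
# closures stored in the "functions" component before returning it.
fit_and_strip <- function(formula, data) {
  fit <- rpart(formula, data = data)
  fit$functions <- NULL
  fit
}

# Grow one tree per response, as in the original program.
responses <- c("mpg", "hp")
trees <- lapply(responses, function(resp) {
  fit_and_strip(as.formula(paste(resp, "~ .")), dat)
})
names(trees) <- responses

# Prediction works from the remaining components of the object.
print(head(predict(trees[["mpg"]], newdata = dat)))
```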




