
party for prediction [REPOST]

8 messages · David Winsemius, Ed, Achim Zeileis

Ed
#
Apologies for re-posting, my original message seems to have been
overlooked by the moderators.

---------- Forwarded message ----------
From: Ed <icelus2k5 at gmail.com>
Date: 11 October 2012 19:03
Subject: party for prediction
To: R-help at r-project.org


Hi there

I'm experiencing some problems using the party package (specifically
mob) for prediction. I have a real scalar y I want to predict from a
real valued vector x and an integral vector z. mob seemed the ideal
choice from the documentation.

The first problem I had was that at some nodes in the partitioning
tree, the components of x may be extremely highly correlated or
effectively constant (that is, the components of x are not independent
for all choices of z). When the resulting fit is fed into predict()
the result is NA - this is not the same behaviour as models returned
by, say, lm, which ignore missing coefficients. I have fixed this by
defining my own statsModel (myLinearModel - imaginative) which also
ignores such coefficients when predicting.

The second problem I have is that I get "Cholesky not positive
definite" errors at some nodes. I guess this is because of numerical
error and degeneracy in the covariance matrix? Any thoughts on how to
avoid having this happen would be welcome; it is ignorable though for
now.

The third and really big problem I have is that when I apply mob to
large datasets (say hundreds of thousands of elements) I get a
"logical subscript too long" error inside mob_fit_fluctests. It's
caught in a try(), and mob just gives up and treats the node as
terminal. This is really hurting me though; with 1% of my data I can
get a good fit and a worthwhile tree, but with the whole dataset I get
a very stunted tree with a pretty useless prediction ability.

I guess what I really want to know is:
(a) has anyone else had this problem, and if so how did they overcome it?
(b) is there any way to get a line or stack trace out of a try()
without source modification?
(c) failing all of that, does anyone know of an alternative to mob
that does the same thing; for better or worse I'm now committed to
recursive partitioning over linear models, as per mob?
(d) failing all of this, does anyone have a link to a way to rebuild,
or locally modify, an R package (preferably windows, but anything
would do)?

Sorry for the length of this post. If I should RTFM, please point me
at any relevant manual by all means. I've spent a few days on this as
you can maybe tell, but I'm far from being an R expert.

Thanks for any help you can give.

Best wishes,

Ed
#
On Oct 12, 2012, at 1:37 AM, Ed wrote:

No. Your original post _was_ forwarded to the list. On my machine it appeared at October 11, 2012 11:03:08 AM PDT.  No one responded. It seems possible that its lack of data or code is the reason for that state of affairs.
Ed
#
Sorry, my mistake, I didn't get a notification or see it send. Thanks
for clearing that up.

Best wishes

Ed
On 12 October 2012 16:58, David Winsemius <dwinsemius at comcast.net> wrote:
1 day later
#
Ed:
I'm not sure what you mean by "integral vector". If you want to apply the 
approach to hundreds of thousands of observations, I guess that these are 
categorical (maybe even binary?) but maybe not...
If I recall correctly, we kept linearModel as simple as we did to save as 
much time as possible. This can be particularly important when one of the 
partitioning variables has many possible splits and the linearModel has to 
be fitted thousands of times.

Also, mob() assesses the stability of all coefficients of the model in all 
nodes during partitioning. If any of the coefficients is not identified, 
this would have to be excluded from all subsequent parameter stability 
tests in that node (and its child nodes). This is currently not provided 
for in mob().
This comes from the parameter stability tests and might be a result of an 
unidentified (or close to unidentified) model fit.
With hundreds of thousands of observations, you would need some additional 
pruning strategy anyway. Significance test-based splitting will probably 
overfit because tiny differences in the coefficients will be picked up at 
such large sample sizes.

Furthermore, computationally the extensive search over all possible splits 
might be too burdensome with this many observations.

Hence, using some subsampling strategy might not be the worst thing.
We have had non-identified model fits in binary GLMs (with quasi-complete 
separation) where we then set estfun() to all zero so that partitioning 
stops. But I don't think that such a strategy helps here.
Not sure, I don't know any off the top of my head.
If your partitioning variables are particularly simple (e.g., all binary) 
you could exploit that and it may be easier to write a custom function for 
your particular data. Then likelihood-ratio tests (rather than LM-type 
tests) would also be easier to apply in case of unidentified parameters.

But if there are partitioning variables with different measurement scales, 
then this will not be that simple...
Have a look at the "Writing R Extensions" manual and the R for Windows 
FAQ.
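For completeness, the basic recipe is short (the path below is
illustrative; on Windows this additionally requires the Rtools
toolchain):

```r
## Sketch: install a locally modified copy of a package from its
## unpacked source directory (path is illustrative).
install.packages("C:/src/party", repos = NULL, type = "source")

## Equivalently, from a shell:  R CMD INSTALL party
```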

Best,
Z
Ed
#
First up, thanks hugely for your response. I've been beating my head
against this!
On 14 October 2012 16:51, Achim Zeileis <Achim.Zeileis at uibk.ac.at> wrote:
I'm sorry I can't go into the details of the data, I would if I could.
z are categorical variables represented as integers, mostly ordered,
but not all. I've tried fitting them as integers, as well as ordered,
but I don't think it made a huge difference.
I can appreciate that, but maybe having an alternative linearModel
which will predict when the fit is degenerate would be worth
including? I'm happy to contribute what I have, although it's pretty
obvious stuff (and probably done suboptimally since I'm not much of an
R coder at this point). For me at least, even with huge datasets, the
speed of party is quite good; it's getting a better result that's the
problem.
Would pretending the coefficients were fit at 0 fool mob into doing
something moderately meaningful here?

If not, I would try to hack the code, but I'm honestly at something of
a loss as to how to modify it and feed the results back into my
interpreter. I have bytecode installed; I downloaded the source, but I
haven't squared the circle of modifying the source and installing the
result. I will check out the docs on writing extensions you suggest.
This is a great help to know. I improved my results quite considerably
with aggressive scaling of everything (scaling the response and all
the predictors to lie between 0 and 1). That deepened my tree by a
factor of two or so (say depth 3 to 7) and improved the quality of fit
substantially. Is there any way I can engage a more numerically robust
Cholesky in mob?
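The rescaling described above amounts to something like this (the data
frame and column handling are illustrative):

```r
## Sketch of the aggressive scaling: map every numeric column, response
## included, onto [0, 1] before handing the data to mob().
range01 <- function(v) (v - min(v)) / (max(v) - min(v))

dat <- data.frame(y = rnorm(100, 50, 10),      # toy stand-in data
                  x = runif(100, 0, 1000),
                  z = factor(sample(0:3, 100, replace = TRUE)))
num  <- sapply(dat, is.numeric)                # leave factors untouched
dat[num] <- lapply(dat[num], range01)
```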
I'm okay with overfitting, honestly. At the moment it is underfitting
by quite a large amount I think (the quality of the predictions on the
training set is not very high). The problem really is there is so much
going on in the data, but the "noise" level is probably very low. I
wouldn't be surprised if my data was accurate to 5 or 6 s.f.
I have plenty of compute time/power and RAM, though R seems to be
running single threaded. But even on a few million observations, it's
still pretty fast and doesn't use more than 30 or so gig of memory. If
it takes a day and requires 150 gig of RAM, that is absolutely fine,
even over that would be viable though less optimal.
I've tried this at various degrees, but the data is really very
complicated with not a lot of error. I'm trying to encourage party to
fit more closely, which I thought more data might encourage. At the
moment I'm a long way from a clean fit. I have subsampled at various
levels down to 1%, and although that increases the depth of the tree
and quality of fit, it still doesn't give a very good quality fit and
can encourage it to overlook obvious aspects of the training set.
I've considered using rpart() to partition into cells of constant
gradient, then fitting linear models myself to the cells. This is my
next thought. I'm pretty sure partitioning over linear regression is
the way forward for the data we have. I tried mars and glm but there
are good reasons to think they're less reasonable, even though the fit
wasn't particularly poor. I'm not particularly wedded to party's
approach except that it looked like it immediately returned what we
needed, and with some degree of "optimality" into the bargain.
I guess I really will have to bite the bullet and try to figure out
how to install modified libraries. Thanks.
Unfortunately each partitioning variable is essentially a state
indicator, taking values say 0,...,R where R is different for each
component. I'm not a stats expert either; I've spent some time with
the party manuals and papers, but I wouldn't be confident of
implementing something like it in the time available to me (though if
I have to I will, but that wouldn't be a good situation to be in).
Will do.

Thank you very much for your responses, I really appreciate it.

Best wishes,

Ed
#
On Sun, 14 Oct 2012, Ed wrote:

The tests performed for categorical partitioning variables are rather 
different from the tests for numerical partitioning variables. If all of 
the variables are categorical, this may not be immediately obvious, but 
the factor coding should be more appropriate (especially if the number of 
levels is small or moderate).
As I explained in my last e-mail, in your situation this does not solve 
the problem completely because the subsequent tests are also not adapted 
to this. Setting the empirical estimating functions to zero for 
non-identified coefficients might alleviate the problem but is not really 
a clean solution.
The coefficients are not looked at during fitting, only the estfun(). This 
would have to be set to 0.
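For the record, a sketch of that zero-out idea (using the estfun()
generic from the sandwich package; untested against mob()'s internals,
so purely illustrative):

```r
library(sandwich)  # provides the estfun() generic

## If a node's fit is unidentified (any NA coefficient), return an
## all-zero score matrix so the parameter stability tests detect no
## instability and partitioning stops at that node.
safe_estfun <- function(fit) {
  ef <- estfun(fit)
  if (any(is.na(coef(fit)))) ef[] <- 0
  ef
}
```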
Writing (or modifying) R packages and installing them under Windows is 
pretty standard and well documented. The pointers I gave you should 
hopefully get you started.
No, I don't think that this is conceivable with the way this is 
implemented at the moment. Instead of the currently implemented tests, one 
could in principle use the likelihood ratio test which is invariant 
against parameter transformations and doesn't need the covariance matrix.
OK. Maybe with so little noise some other splitting strategy (not based on 
significance tests) would be better?
With this setup, you may consider writing your own partitioning algorithm 
using the same type of ideas as MOB. Instead of using the parameter 
stability tests, you could use plain likelihood ratio tests or ANOVAs to 
decide about splitting further.

If d is the data in the current node, y and x are response/regressors, and 
f1 to fn are your categorical partitioning variables, you could do

m0 <- lm(y ~ x, data = d)
m1 <- lm(y ~ f1 * x, data = d)
...
mn <- lm(y ~ fn * x, data = d)

and then

anova(m0, m1)
...
anova(m0, mn)

and then choose the lowest p-value for splitting. You could then also 
parallelize the computations in the daughter nodes.

None of this is readily coded in party though...
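A rough, self-contained version of that split-selection step (purely
illustrative, not mob() code; column names y, x, f1, f2 are
placeholders as above):

```r
## Illustrative split selection for one node: fit the null model and one
## interaction model per partitioning factor, then pick the factor whose
## ANOVA comparison against the null model gives the smallest p-value.
choose_split <- function(d, factors) {
  m0 <- lm(y ~ x, data = d)
  pvals <- sapply(factors, function(f) {
    m1 <- lm(reformulate(c(f, "x", paste(f, "x", sep = ":")), "y"),
             data = d)
    anova(m0, m1)[2, "Pr(>F)"]
  })
  names(which.min(pvals))
}

## Toy data: the effect of x on y changes with f1 but not with f2.
set.seed(1)
d <- data.frame(x  = runif(200),
                f1 = factor(rep(0:1, 100)),
                f2 = factor(rep(0:1, each = 100)))
d$y <- ifelse(d$f1 == "1", 2 * d$x, -2 * d$x) + rnorm(200, sd = 0.1)
choose_split(d, c("f1", "f2"))   # expected to pick "f1"
```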
Possibly boosting linear models may be another route to go.
I think that's easier compared to solving the conceptual problems 
(thinking about how to best partition degenerate linear models etc.).
;-)

Good luck!
Best wishes,
Z
Ed
#
This was an exceptionally helpful answer, I can only thank you again.
I have plenty of avenues ahead where I was worried before I was
getting trapped in a dead end. If all else fails, the idea of using
anova is brilliant. Thank you!

Ed
On 14 October 2012 18:36, Achim Zeileis <Achim.Zeileis at uibk.ac.at> wrote:
#
On Sun, 14 Oct 2012, Ed wrote:

No prob. Just for the record, in case anybody else wonders: MOB is based 
on score-type tests (or Lagrange multiplier type tests) to decide which 
variable should be used for partitioning in the next step. In principle, 
likelihood-ratio-type tests could also be used but require fitting the 
model under all conceivable splits. Especially for all two-way splits 
in numerical partitioning variables, this can be very costly and hence the 
computationally cheaper score tests are used (which only require model 
fitting once per node). In your case with only categorical partitioning 
variables, computing the likelihood ratio test is possible and only a 
little bit more computationally expensive. (And you indicated that you do 
not worry about that.)

To build your own new tree function, it may be useful to employ the 
"partykit" infrastructure. This has very useful base infrastructure for 
representing the tree part but has not yet been customized for conveniently 
accessing model fits in the leaves. It's on our to-do list though...

Best,
Z