problem with predict() - R-help

Fri, Jun 28, 2002 10:42 AM #

I will try.

I tried to use lm() to have basic linear "reference".

If this is the reason (i.e. train data set is rank-deficient) and lm
cannot handle such situation, what are the other packages in R
you could recommend which would handle this type of data ?
Also in such case: should not lm() report a problem in a model building
phase ?

So far I have used SVM approach to do regression with this data set and I am
getting
rather poor r^2 (~0.25 on test set), but I do not have any numerical
problems with SVM.
I am also planning to try randomForest() to do classification. This was my
immediate
motivation to turn to R.

All the best,

R

Ryszard Czerminski   phone: (781)994-0479
ArQule, Inc.         email:ryszard at arqule.com
19 Presidential Way  http://www.arqule.com
Woburn, MA 01801     fax: (781)994-0679


-----Original Message-----
From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
Sent: Friday, June 28, 2002 12:39 PM
To: Czerminski, Ryszard
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()


Have you tried the R debugging tools?  If not, please make use of them.
My guess is that you have a rank-deficient problem.

?debugger
?recover
?dump.frames
...

On Fri, 28 Jun 2002, Czerminski, Ryszard wrote:

That's what the debugging tools are for.

"scaled:scale")

mask <- apply(xr, 2, function(x) any(is.na(x)))
xr <- xr[,!mask] # rm NA's
ys <- test.data[,1]; xs <- test.data[,-1]
xs <- scale(xs, center = x.center, scale = x.scale)
xs <- xs[,!mask]
train <- data.frame(y = yr, x = xr)
test <- data.frame(y = ys, x = xs)
model <- lm(y~., train)
cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")

dim(train) = 164 119 ; dim(test) = 164 119

names(test) <- names(train)
length(predict(model, test))

Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
        subscript out of bounds

Ryszard Czerminski   phone: (781)994-0479
ArQule, Inc.         email:ryszard at arqule.com
19 Presidential Way  http://www.arqule.com
Woburn, MA 01801     fax: (781)994-0679

-----Original Message-----
From: Liaw, Andy [mailto:andy_liaw at merck.com]
Sent: Friday, June 28, 2002 8:46 AM
To: 'Czerminski, Ryszard'
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()

You can try:

  names(test) <- names(train)

before calling predict() to make sure that the variable names match.
Without your data files, it's hard to tell why your first example worked.

Andy

-----Original Message-----
From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
Sent: Thursday, June 27, 2002 3:29 PM
To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()

# Yes. You are *still* using a matrix in a data frame.
Please do read more
# carefully.

I have read some more R documentation trying to understand difference
between
matrices and data frames etc... and I repeat my example this time
executing EXACTLY the same code with only difference being
that in one case
I use smaller data sets ({train,test}-small.csv) and in the
second case I
use larger
data sets ({train,test}.csv) - and I got different behaviour.

Small case (10*4) runs fine, larger case (164*119) fails.

Any ideas why this happens ?

R

rm(list=ls())
train.data <- read.csv("train-small.csv", header = TRUE, row.names =

"mol", comment.char="")

test.data <- read.csv("test-small.csv", header = TRUE,

row.names = "mol",
comment.char="")

yr <- train.data[,1]; xr <- train.data[,-1]
xr <- scale(xr)
x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr,

"scaled:scale")

mask <- apply(xr, 2, function(x) any(is.na(x)))
xr <- xr[,!mask] # rm NA's
ys <- test.data[,1]; xs <- test.data[,-1]
xs <- scale(xs, center = x.center, scale = x.scale)
xs <- xs[,!mask]
train <- data.frame(y = yr, x = xr)
test <- data.frame(y = ys, x = xs)
model <- lm(y~., train)
cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")

dim(train) = 10 4 ; dim(test) = 10 4

length(predict(model, test))

[1] 10

train.data <- read.csv("train.csv", header = TRUE,

row.names = "mol",
comment.char="")

test.data <- read.csv("test.csv", header = TRUE, row.names = "mol",

comment.char="")
[snip...]

cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")

dim(train) = 164 119 ; dim(test) = 35 119

length(predict(model, test))

Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
        subscript out of bounds

Ryszard Czerminski   phone: (781)994-0479
ArQule, Inc.         email:ryszard at arqule.com
19 Presidential Way  http://www.arqule.com
Woburn, MA 01801     fax: (781)994-0679

-----Original Message-----
From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk]
Sent: Friday, June 21, 2002 3:41 PM
To: Czerminski, Ryszard
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()

On Fri, 21 Jun 2002, Czerminski, Ryszard wrote:

--- first problem

If I store 'simulated' data in data frames:
# train.data <- data.frame(matrix(rnorm(164*119), nrow = 164))
# test.data <- data.frame(matrix(rnorm(35*119), nrow = 35))
it still works the same way i.e. the code below works fine
for simulated data and fails for 'real' data the only
difference being in actual numeric values stored in data
structures of the same shape and type.

Any suggestions why this happens ?

Yes. You are *still* using a matrix in a data frame.  Please
do read more
carefully.

--- second problem

As Andy Liaw pointed out, xr is a matrix.  Take a look at

the names of

train.  Hint: they do not contain `x'.

Following your hint I am guessing that the fact that names

do not contain

'x'
explains why lm(y~., train) form works and lm(y~x, train) fails
and "lm(y~., train)" means roughly: correlate column "y" to

all other
colums

No, it means regress y on all the remaining colums in the
data argument.

Where I can find more detail specification of this syntax ?
In help(lm) I find this paragraph:

     Models for `lm' are specified symbolically.  A typical

model has

     the form `response ~ terms' where `response' is the

(numeric)...

which does not quite cover this case.

In any good book on the subject.

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
-.-.-.-.-.-.-.
-.-
r-help mailing list -- Read
http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To:
r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
_._._._._._._.
_._
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
-.-.-.-.-.-.-.-.-
r-help mailing list -- Read
http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To:
r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
_._._._._._._._._

----------------------------------------------------------------------------

that

and

============================================================================

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.
-.-

http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html

_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._.
_._

Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._