I will try. I tried to use lm() to have basic linear "reference". If this is the reason (i.e. train data set is rank-deficient) and lm cannot handle such situation, what are the other packages in R you could recommend which would handle this type of data ? Also in such case: should not lm() report a problem in a model building phase ? So far I have used SVM approach to do regression with this data set and I am getting rather poor r^2 (~0.25 on test set), but I do not have any numerical problems with SVM. I am also planning to try randomForest() to do classification. This was my immediate motivation to turn to R. All the best, R Ryszard Czerminski phone: (781)994-0479 ArQule, Inc. email:ryszard at arqule.com 19 Presidential Way http://www.arqule.com Woburn, MA 01801 fax: (781)994-0679 -----Original Message----- From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk] Sent: Friday, June 28, 2002 12:39 PM To: Czerminski, Ryszard Cc: r-help at stat.math.ethz.ch Subject: RE: [R] problem with predict() Have you tried the R debugging tools? If not, please make use of them. My guess is that you have a rank-deficient problem. ?debugger ?recover ?dump.frames ...
On Fri, 28 Jun 2002, Czerminski, Ryszard wrote:
This time I use the same file for train.data and test.data throwing in "names(test) <- names(train)" before predict() for double protection (:-) and it still fails... Is it some weird problem with this particular data set ? or a bug ? (why "subscript out of bounds" ?)
That's what the debugging tools are for.
rm(list=ls())
train.data <- read.csv("train.csv", header = TRUE, row.names = "mol",
comment.char="")
test.data <- read.csv("train.csv", header = TRUE, row.names = "mol",
comment.char="")
yr <- train.data[,1]; xr <- train.data[,-1] xr <- scale(xr) # matrix <- scale(data.frame) x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr,
"scaled:scale")
mask <- apply(xr, 2, function(x) any(is.na(x)))
xr <- xr[,!mask] # rm NA's
ys <- test.data[,1]; xs <- test.data[,-1]
xs <- scale(xs, center = x.center, scale = x.scale)
xs <- xs[,!mask]
train <- data.frame(y = yr, x = xr)
test <- data.frame(y = ys, x = xs)
model <- lm(y~., train)
cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
dim(train) = 164 119 ; dim(test) = 164 119
names(test) <- names(train) length(predict(model, test))
Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
subscript out of bounds
Ryszard Czerminski phone: (781)994-0479 ArQule, Inc. email:ryszard at arqule.com 19 Presidential Way http://www.arqule.com Woburn, MA 01801 fax: (781)994-0679 -----Original Message----- From: Liaw, Andy [mailto:andy_liaw at merck.com] Sent: Friday, June 28, 2002 8:46 AM To: 'Czerminski, Ryszard' Cc: r-help at stat.math.ethz.ch Subject: RE: [R] problem with predict() You can try: names(test) <- names(train) before calling predict() to make sure that the variable names match. Without your data files, it's hard to tell why your first example worked. Andy
-----Original Message-----
From: Czerminski, Ryszard [mailto:ryszard at arqule.com]
Sent: Thursday, June 27, 2002 3:29 PM
To: 'ripley at stats.ox.ac.uk'; Czerminski, Ryszard
Cc: r-help at stat.math.ethz.ch
Subject: RE: [R] problem with predict()
# Yes. You are *still* using a matrix in a data frame.
Please do read more
# carefully.
I have read some more R documentation trying to understand difference
between
matrices and data frames etc... and I repeat my example this time
executing EXACTLY the same code with only difference being
that in one case
I use smaller data sets ({train,test}-small.csv) and in the
second case I
use larger
data sets ({train,test}.csv) - and I got different behaviour.
Small case (10*4) runs fine, larger case (164*119) fails.
Any ideas why this happens ?
R
rm(list=ls())
train.data <- read.csv("train-small.csv", header = TRUE, row.names =
"mol", comment.char="")
test.data <- read.csv("test-small.csv", header = TRUE,
row.names = "mol", comment.char="")
yr <- train.data[,1]; xr <- train.data[,-1] xr <- scale(xr) x.center <- attr(xr, "scaled:center"); x.scale <- attr(xr,
"scaled:scale")
mask <- apply(xr, 2, function(x) any(is.na(x)))
xr <- xr[,!mask] # rm NA's
ys <- test.data[,1]; xs <- test.data[,-1]
xs <- scale(xs, center = x.center, scale = x.scale)
xs <- xs[,!mask]
train <- data.frame(y = yr, x = xr)
test <- data.frame(y = ys, x = xs)
model <- lm(y~., train)
cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
dim(train) = 10 4 ; dim(test) = 10 4
length(predict(model, test))
[1] 10
train.data <- read.csv("train.csv", header = TRUE,
row.names = "mol", comment.char="")
test.data <- read.csv("test.csv", header = TRUE, row.names = "mol",
comment.char="") [snip...]
cat("dim(train) =", dim(train), "; dim(test) =", dim(test), "\n")
dim(train) = 164 119 ; dim(test) = 35 119
length(predict(model, test))
Error in drop(X[, piv, drop = FALSE] %*% beta[piv]) :
subscript out of bounds
Ryszard Czerminski phone: (781)994-0479 ArQule, Inc. email:ryszard at arqule.com 19 Presidential Way http://www.arqule.com Woburn, MA 01801 fax: (781)994-0679 -----Original Message----- From: ripley at stats.ox.ac.uk [mailto:ripley at stats.ox.ac.uk] Sent: Friday, June 21, 2002 3:41 PM To: Czerminski, Ryszard Cc: r-help at stat.math.ethz.ch Subject: RE: [R] problem with predict() On Fri, 21 Jun 2002, Czerminski, Ryszard wrote:
--- first problem If I store 'simulated' data in data frames: # train.data <- data.frame(matrix(rnorm(164*119), nrow = 164)) # test.data <- data.frame(matrix(rnorm(35*119), nrow = 35)) it still works the same way i.e. the code below works fine for simulated data and fails for 'real' data the only difference being in actual numeric values stored in data structures of the same shape and type. Any suggestions why this happens ?
Yes. You are *still* using a matrix in a data frame. Please do read more carefully.
--- second problem
As Andy Liaw pointed out, xr is a matrix. Take a look at
the names of
train. Hint: they do not contain `x'.
Following your hint I am guessing that the fact that names
do not contain
'x' explains why lm(y~., train) form works and lm(y~x, train) fails and "lm(y~., train)" means roughly: correlate column "y" to
all other colums No, it means regress y on all the remaining colums in the data argument.
Where I can find more detail specification of this syntax ?
In help(lm) I find this paragraph:
Models for `lm' are specified symbolically. A typical
model has
the form `response ~ terms' where `response' is the
(numeric)...
which does not quite cover this case.
In any good book on the subject. -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-. -.-.-.-.-.-.-. -.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._. _._._._._._._. _._ -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-. -.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._. _._._._._._._._._
----------------------------------------------------------------------------
-- Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA)
that
may be confidential, proprietary copyrighted and/or legally privileged,
and
is intended solely for the use of the individual or entity named on this message. If you are not the intended recipient, and have received this message in error, please immediately return this by e-mail and then delete it.
============================================================================
==
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-. -.-
r-help mailing list -- Read
http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._. _._
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272860 (secr) Oxford OX1 3TG, UK Fax: +44 1865 272595 -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._