Maximum number of variables allowed in a multiple linear regression model
4 messages · Michelle Chu, Bert Gunter, Tony Plate +1 more
I strongly suggest you collaborate with a local statistician. I can think of no circumstance where multiple regression on "hundreds of thousands of variables" is anything more than a fancy random number generator.
-- Bert Gunter
Genentech Nonclinical Statistics
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Michelle Chu
Sent: Tuesday, February 05, 2008 9:00 AM
To: R-help at r-project.org
Subject: [R] Maximum number of variables allowed in a multiple linear regression model
Hi,
I would appreciate it if someone could confirm the maximum number of variables allowed in a multiple linear regression model. Currently, I am looking for software with the capacity to handle approximately 3,000 variables. I am using Excel to process the results. Any information on processing a matrix from Excel with hundreds to thousands of variables would be helpful.
Best Regards,
Michelle
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Bert Gunter wrote:
I strongly suggest you collaborate with a local statistician. I can think of no circumstance where multiple regression on "hundreds of thousands of variables" is anything more than a fancy random number generator.
That sounds like a challenge! What is the largest regression problem (in
terms of numbers of variables) that people have encountered where it made
sense to do some sort of linear regression (and gave useful results)?
(Including multilevel and Bayesian techniques.)
However, the original poster did say "hundreds to thousands", which is
smaller than "hundreds of thousands". When I try a regression problem with
3,000 coefficients in R running under Windows XP 64 bit with 8Gb of memory
on the machine and the /3Gb option active (i.e., R can get up to 3Gb), R
2.6.1 runs out of memory (apparently trying to duplicate the model matrix):
R version 2.6.1 (2007-11-26)
Copyright (C) 2007 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
> m <- 3000
> n <- m * 10
> x <- matrix(rnorm(n*m), ncol=m, nrow=n,
+             dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> dim(x)
[1] 30000 3000
> k <- sample(m, 10)
> y <- rowSums(x[,k]) + 10 * rnorm(n)
> fit <- lm.fit(y=y, x=x)
Error: cannot allocate vector of size 686.6 Mb
> object.size(x)/2^20
[1] 687.7787
> memory.size()
[1] -2022.552
>
and the Windows process monitor shows the peak memory usage for Rgui.exe at
2,137,923K. But in a 64 bit version of R, I would be surprised if it was
not possible to run this (given sufficient memory).
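The size in the error message can be checked with simple arithmetic. The following is a minimal sketch, assuming only that R stores each numeric value as an 8-byte double; the variable names are illustrative and the "copies" figure is a rough upper bound that ignores fragmentation and every other object in the session.

```r
# Back-of-the-envelope memory estimate for a dense numeric model matrix.
# R stores doubles in 8 bytes each; dimnames add a small overhead on top.
n <- 30000                 # rows, as in the session above
m <- 3000                  # columns (one per coefficient)
bytes_per_double <- 8

size_mb <- n * m * bytes_per_double / 2^20
round(size_mb, 1)          # 686.6, the size named in the allocation error

# Rough upper bound on how many such matrices fit under a 3 GB cap
copies_in_3gb <- floor(3 * 1024 / size_mb)
copies_in_3gb              # 4
```

Since the original matrix already takes one of those slots, a fit that needs even a couple of working copies plausibly runs into the limit.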
However, R easily handles a slightly smaller problem:
> m <- 1000 # of variables
> n <- m * 10 # of rows
> k <- sample(m, 10)
> x <- matrix(rnorm(n*m), ncol=m, nrow=n,
+             dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> y <- rowSums(x[,k]) + 10 * rnorm(n)
> fit <- lm.fit(y=y, x=x)
> # distribution of coefs that should be one vs zero
> round(rbind(one=quantile(fit$coefficients[k]),
+             zero=quantile(fit$coefficients[-k])), digits=2)
        0%   25%   50%  75% 100%
one   0.94  0.98  1.04 1.10 1.18
zero -0.30 -0.08 -0.01 0.06 0.29
>
To echo Bert Gunter's cautions, one must be careful doing ordinary linear
regression with large numbers of coefficients. It does seem a little
unlikely that there is sufficient data to get useful estimates of three
thousand coefficients using linear regression in data managed in Excel
(though I guess it could be possible using Excel 12.0, which can handle up
to 1 million rows - recent versions prior to 2008 could handle only 64K rows
- see http://en.wikipedia.org/wiki/Microsoft_Excel#Versions ). So, the
suggestion to consult a local statistician is good advice - there may be
other more suitable approaches, and if some form of linear regression is an
appropriate approach, there are things to do to gain confidence that the
results of the linear regression convey useful information.
-- Tony Plate
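One possible way to "gain confidence" in such a fit is to hold out part of the data and compare in-sample and out-of-sample fit. The sketch below is only an illustration under assumptions not in the post: the problem is shrunk to 200 variables so it runs quickly, the seed and the 50/50 split are arbitrary, and the helper `r2` is a hypothetical name. It is not a method proposed by anyone in the thread.

```r
set.seed(1)                        # assumed seed, for reproducibility only
m <- 200                           # number of variables (smaller than in the post)
n <- m * 10                        # number of rows
k <- sample(m, 10)                 # the ten variables that truly matter
x <- matrix(rnorm(n * m), nrow = n, ncol = m)
y <- rowSums(x[, k]) + 10 * rnorm(n)

train <- sample(n, n / 2)          # random half of the rows used for fitting
fit <- lm.fit(x = x[train, ], y = y[train])

# R^2 relative to a zero baseline (the model has no intercept)
r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum(obs^2)

pred_in  <- drop(x[train, ]  %*% fit$coefficients)
pred_out <- drop(x[-train, ] %*% fit$coefficients)
res <- c(in_sample = r2(y[train], pred_in),
         held_out  = r2(y[-train], pred_out))
res
```

A held-out value well below the in-sample value suggests that much of the apparent fit comes from the many noise coefficients rather than from the few real effects.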
On Feb 6, 2008 11:28 AM, Tony Plate <tplate at acm.org> wrote:
Bert Gunter wrote:
I strongly suggest you collaborate with a local statistician. I can think of no circumstance where multiple regression on "hundreds of thousands of variables" is anything more than a fancy random number generator.
That sounds like a challenge! What is the largest regression problem (in terms of numbers of variables) that people have encountered where it made sense to do some sort of linear regression (and gave useful results)? (Including multilevel and Bayesian techniques.)
I have fit linear and generalized linear models with hundreds of thousands of coefficients but, of course, with a highly structured model matrix and using sparse matrix techniques. What is called the Rasch model for analysis of item response data (e.g. correct/incorrect responses by students to the items on a multiple-choice test) is a generalized linear model with the students and the items as factors. However, like Bert I would be very dubious of any attempt to fit a linear regression model to 3000 variables that were not generated in a systematic way. Sounds like a massive, computer-fueled fishing expedition (a.k.a. "data mining").
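To illustrate why a structured model matrix makes thousands of coefficients tractable, here is a minimal sketch using the Matrix package. It is a linear-model analogue with students and items as factors, not the Rasch (logistic) fit itself and not necessarily how the models above were fitted; the sizes, the seed, and the simulated `ability` and `difficulty` values are all assumptions made for the example.

```r
library(Matrix)                    # recommended package shipped with R

set.seed(1)                        # assumed seed, for reproducibility only
n_students <- 1000                 # hypothetical sizes
n_items    <- 40
d <- expand.grid(student = factor(seq_len(n_students)),
                 item    = factor(seq_len(n_items)))
ability    <- rnorm(n_students)    # simulated student effects
difficulty <- rnorm(n_items)       # simulated item effects
d$y <- ability[as.integer(d$student)] +
       difficulty[as.integer(d$item)] + rnorm(nrow(d))

# Sparse model matrix: intercept + 999 student dummies + 39 item dummies
X <- sparse.model.matrix(~ student + item, data = d)
dim(X)
nnzero(X) / prod(dim(X))           # fraction of nonzero entries

# Least squares via the sparse normal equations X'X b = X'y
b <- solve(crossprod(X), crossprod(X, d$y))

# Item coefficients estimate each item's effect relative to item 1
b_item <- as.vector(b)[grep("^item", colnames(X))]
max(abs(b_item - (difficulty[-1] - difficulty[1])))
```

Each row of `X` has at most three nonzero entries, so storage and computation scale with the number of observations rather than with rows times columns.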