Question about multiple regression

15 messages · Brian Ripley, Bert Gunter, Gabor Grothendieck +4 more

#
Dear R-list,
maybe some of you could point me in the right direction:

Are you aware of any FREE Fortran or Java libraries/actual pieces of
code that are VERY efficient (time-wise) in running the regular linear
least-squares multiple regression?
More specifically, I have to run small regression models (between 1
and 15 predictors) on samples of up to N=700 but thousands and
thousands of them.

I am designing a simulation in R and running those regressions and R
itself is way too slow. So, I am thinking of compiling the regression
run itself in Fortran and Java and then calling it from R.

Thank you very much for any advice!

Dimitri Liakhovitski
MarketTools, Inc.
Dimitri.Liakhovitski at markettools.com
#
I would test the speed before making such an assumption.  Note that
lm.fit is faster than lm and if they have the same x matrix then
you can do many in one call by having y be a matrix.
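
A minimal sketch of the matrix-response idea, on simulated data. The sizes n, p and k and all variable names are assumptions for illustration, not taken from the original problem. lm.fit() wants the design matrix itself, so the intercept column is added by hand.

```r
set.seed(1)
n <- 700; p <- 15; k <- 1000                  # k response vectors (assumed sizes)
x <- cbind(1, matrix(rnorm(n * p), n, p))     # design matrix with intercept column
beta <- matrix(rnorm((p + 1) * k), p + 1, k)
y <- x %*% beta + matrix(rnorm(n * k), n, k)  # n x k matrix: one column per regression

# One call, one QR decomposition of x, k sets of coefficients
fit <- lm.fit(x, y)
dim(fit$coefficients)                         # (p + 1) rows, k columns

# Sanity check against lm() for the first response
ref <- lm(y[, 1] ~ x - 1)
same <- all.equal(unname(coef(ref)), unname(fit$coefficients[, 1]))

# Rough timing comparison; the numbers depend on the machine
system.time(for (j in 1:50) lm(y[, j] ~ x - 1))
system.time(lm.fit(x, y))
```

This only helps when the regressions really do share the same x; if every regression has a different set of predictors, each one needs its own call.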
On Mon, Sep 8, 2008 at 12:05 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
#
Are you sure R's ways are not fast enough (there are many layers 
underneath lm)?  For an example of how you might do this at C/Fortran 
level, see the function lqs() in MASS.
On Mon, 8 Sep 2008, Dimitri Liakhovitski wrote:

            
A lot of the effort is in getting the right answer fast, including for 
e.g. collinear inputs.
I think Java is unlikely to be fast compared to the Fortran R itself uses.

Have you profiled to find where the time is really being spent (both R and 
C/Fortran profiling if necessary)?

  
    
#
Thank you for reminding me, Gabor. I forgot to mention: So far, I have
run one test set of regressions using lm. It took R 270 sec. I need to
run 1,800,000 of those, which would imply 15.4 years of computing time
:)
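
The arithmetic behind that estimate, as a quick check in R (the variable names are only for illustration):

```r
secs_per_set <- 270        # measured time for one test set, in seconds
n_sets       <- 1800000    # number of sets to run
total_secs   <- secs_per_set * n_sets
years        <- total_secs / (60 * 60 * 24 * 365)
round(years, 1)
```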

I have not done the same for lm.fit because I am not sure how to get
model R squared from lm.fit.

Dimitri

On Mon, Sep 8, 2008 at 12:17 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:

  
    
#
Yes, see my previous e-mail on how long R takes (270 seconds for one
of the 1,800,000 sets I need) - using system.time.
Not sure how to test the same for Fortran...

On Mon, Sep 8, 2008 at 12:51 PM, Prof Brian Ripley
<ripley at stats.ox.ac.uk> wrote:

  
    
#
Disclaimer: I have **NO IDEA** of the details of what you want to do or why
-- but I am willing to bet that there are better ways of doing it than 1.8
million multiple regressions that take 270 secs each!! (which I find difficult
to believe in itself -- are you sure you are doing things right? Something
sounds very fishy here: R's regression code is typically very fast).

-- Bert Gunter

#
Try:

sum(lm.fit(x, y)$residuals^2)
On Mon, Sep 8, 2008 at 12:52 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
#
Thank you everyone for your responses. I'll answer several questions.

1. >  Disclaimer: I have **NO IDEA** of the details of what you want
to do or why
I probably should not bore everyone, but just to explain where the
large number is coming from. I have an experimental design with 7
factors. Each factor has between 3 and 5 levels. Once you cross them
all, you end up with 18,000 cells. For each cell, I want to generate a
sample of N=100. For each sample I have to analyze the data using 3
different statistical methods of analysis (the goal of the
Monte-Carlo is to compare those methods). One of the methods requires
running up to ~32,000 simple multiple regressions - yes, just for
one sample and it's not a mistake. I test-ran one such analysis for a
sample with N=800 and 15 predictors and it took 270 seconds. R was
actually very fast - it ran each of the individual regressions in
about 0.008 seconds. Still I need something faster.

2. Sorry - what was the formula sum(lm.fit(x,y)$residuals^2) for? For
example, using it on my data, I got a value of 36,644...

3. I know that for similarly challenging situations people have used
Fortran compilers. So, anyone heard of a free Fortran library or an
efficient piece of code?

Thank you!
Dimitri

  
    
#
On Mon, Sep 8, 2008 at 1:47 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
It's the sum of the squares of the residuals.
#
I could get an r squared from lm.fit by correlating fitted.values and
my response variable.
But could I do it somehow using Sums of Squares? I am clear on SS for
residuals. But where is SS for the model or the total SS in lm.fit
output?
Thank you!
Dimitri

On Mon, Sep 8, 2008 at 1:57 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:

  
    
#
R squared is: 1 - sum(residuals^2)/crossprod(y - mean(y))
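
A small sketch of that formula on simulated data (the data and variable names are made up for illustration). It assumes the model contains an intercept, which is when this definition matches what summary(lm()) reports. sum() is used for the total sum of squares instead of crossprod() only so that the result is a plain number rather than a 1 x 1 matrix.

```r
set.seed(2)
n <- 100
x <- cbind(1, matrix(rnorm(n * 3), n, 3))     # intercept + 3 predictors
y <- drop(x %*% c(1, 0.5, -0.3, 0.2)) + rnorm(n)

fit <- lm.fit(x, y)
rss <- sum(fit$residuals^2)                   # residual sum of squares
tss <- sum((y - mean(y))^2)                   # total sum of squares
r2  <- 1 - rss / tss                          # R squared; model SS is tss - rss

# Two cross-checks: summary(lm()) and the squared correlation
# between fitted values and the response
r2_lm  <- summary(lm(y ~ x[, -1]))$r.squared
r2_cor <- cor(fit$fitted.values, y)^2
c(r2, r2_lm, r2_cor)
```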
On Mon, Sep 8, 2008 at 2:27 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
#
Although I, along with the others, believe there probably is an efficient R
solution, the answer to your direct question can perhaps be found at
http://www.fortran.com/. The free GNU G95 Fortran compiler is at
http://www.g95.org/
Joe

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
Thanks a lot, everybody!

On Mon, Sep 8, 2008 at 3:11 PM, Lucke, Joseph F
<Joseph.F.Lucke at uth.tmc.edu> wrote:

  
    
#
Hi Dimitri,
On Mon, 8 Sep 2008, Dimitri Liakhovitski wrote:

            
You almost certainly want the LAPACK Fortran libraries, available at
http://www.netlib.org/lapack/

...the function of interest to you is probably called
"dgels":

http://www.netlib.org/lapack/explore-html/dgels.f.html

...of course, this runs faster if you have a fast BLAS library installed.
These exist in many forms, and may already be installed on your system.
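
Before writing any Fortran, it may be worth trying a LAPACK-backed QR solve from within R. The sketch below does not call dgels itself: it uses qr(x, LAPACK = TRUE) followed by qr.coef(), which is a different LAPACK route to the same least-squares solution, and it assumes x has full column rank. The data are simulated and the names are only for illustration.

```r
set.seed(3)
n <- 700; p <- 15
x <- cbind(1, matrix(rnorm(n * p), n, p))     # intercept + 15 predictors
y <- drop(x %*% rnorm(p + 1)) + rnorm(n)

# QR decomposition computed by LAPACK (with column pivoting)
qx <- qr(x, LAPACK = TRUE)
b_lapack <- unname(qr.coef(qx, y))            # least-squares coefficients

# Reference: R's default LINPACK-based fit
b_lmfit <- unname(lm.fit(x, y)$coefficients)
same <- all.equal(b_lapack, b_lmfit)
```

Whether this is actually faster than lm.fit() depends on the BLAS/LAPACK build R is linked against, so it needs timing on the target machine.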

--Adam
#
On Mon, Sep 8, 2008 at 7:47 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
Have you considered the fact that 32,000 regressions simply take a lot of time?
I don't really have anything to go by, but it sounds unlikely that you
will be able to cut computing time by more than, say, ten times, to 27
seconds. That would still leave you with about a year and a half of
running a computer.

Perhaps an alternative approach would be to get access to stronger
(super)computers, either at a university, or buying access. A quick
googling turns up http://www.clusterondemand.com/ for example.

Anyhow, good luck with your project! I'm sure the R list would be very
interested to hear of how you solved your problem.

Regards,

Gustaf