Dear R-list, maybe some of you could point me in the right direction: Are you aware of any FREE Fortran or Java libraries/actual pieces of code that are VERY efficient (time-wise) in running the regular linear least-squares multiple regression? More specifically, I have to run small regression models (between 1 and 15 predictors) on samples of up to N=700 but thousands and thousands of them. I am designing a simulation in R and running those regressions and R itself is way too slow. So, I am thinking of compiling the regression run itself in Fortran and Java and then calling it from R. Thank you very much for any advice! Dimitri Liakhovitski MarketTools, Inc. Dimitri.Liakhovitski at markettools.com
Question about multiple regression
15 messages · Brian Ripley, Bert Gunter, Gabor Grothendieck +4 more
I would test the speed before making such an assumption. Note that lm.fit is faster than lm, and if the regressions share the same x matrix then you can do many in one call by having y be a matrix.
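A minimal sketch of this suggestion, assuming simulated data (the sizes n, p, k and all variable names are made up for illustration): lm.fit takes a design matrix directly and, when the responses share that design matrix, accepts a matrix y and fits one regression per column in a single call.

```r
set.seed(1)
n <- 700   # observations per sample (illustrative)
p <- 15    # predictors (illustrative)
k <- 200   # number of response vectors sharing the same predictors (illustrative)

Z <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", seq_len(p))))
X <- cbind(Intercept = 1, Z)       # lm.fit does not add an intercept, so add the column by hand
Y <- matrix(rnorm(n * k), n, k)    # one response per column

# One call fits all k regressions; coefficients come back as a (p + 1) x k matrix.
t_fit <- system.time(fit_all <- lm.fit(X, Y))

# For comparison: the same regressions one at a time through lm().
d <- as.data.frame(Z)
t_lm <- system.time(
  coef_loop <- sapply(seq_len(k), function(j) coef(lm(Y[, j] ~ ., data = d)))
)

dim(fit_all$coefficients)
max(abs(fit_all$coefficients - coef_loop))  # both routes should agree up to rounding
rbind(lm.fit = t_fit, lm = t_lm)            # timings depend on the machine
```

The single-call form only applies when every regression uses the same predictors; if the predictor set changes from run to run, lm.fit still has to be called once per design matrix.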
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Are you sure R's ways are not fast enough (there are many layers underneath lm)? For an example of how you might do this at C/Fortran level, see the function lqs() in MASS.
On Mon, 8 Sep 2008, Dimitri Liakhovitski wrote:
Dear R-list, maybe some of you could point me in the right direction: Are you aware of any FREE Fortran or Java libraries/actual pieces of code that are VERY efficient (time-wise) in running the regular linear least-squares multiple regression?
A lot of the effort is in getting the right answer fast, including for e.g. collinear inputs.
More specifically, I have to run small regression models (between 1 and 15 predictors) on samples of up to N=700 but thousands and thousands of them. I am designing a simulation in R and running those regressions and R itself is way too slow. So, I am thinking of compiling the regression run itself in Fortran and Java and then calling it from R.
I think Java is unlikely to be fast compared to the Fortran R itself uses. Have you profiled to find where the time is really being spent (both R and C/Fortran profiling if necessary)?
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Thank you for reminding me, Gabor. I forgot to mention: So far, I have run one test set of regressions using lm. It took R 270 sec. I need to run 1,800,000 of those, which would imply 15.4 years of computing time :) I have not done the same for lm.fit because I am not sure how to get model R squared from lm.fit. Dimitri
Dimitri Liakhovitski MarketTools, Inc. Dimitri.Liakhovitski at markettools.com
Yes, see my previous e-mail on how long R takes (270 seconds for one of the 1,800,000 sets I need) - using system.time. Not sure how to test the same for Fortran...
Dimitri Liakhovitski MarketTools, Inc. Dimitri.Liakhovitski at markettools.com
Disclaimer: I have **NO IDEA** of the details of what you want to do or why -- but I am willing to bet that there are better ways of doing it than 1.8 million multiple regressions that take 270 secs each!! (which I find difficult to believe in itself -- are you sure you are doing things right? Something sounds very fishy here: R's regression code is typically very fast). -- Bert Gunter
Try: sum(lm.fit(x, y)$residuals^2)
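As a small illustration of the expression above, on simulated data with made-up names: x must be a numeric design matrix that already includes an intercept column, since lm.fit does not add one. The cross-check against lm() is only there to show that the two routes give the same number.

```r
set.seed(2)
n <- 100
x <- cbind(1, rnorm(n), rnorm(n))          # intercept column added by hand
y <- drop(x %*% c(2, 0.5, -1)) + rnorm(n)  # simulated response

rss <- sum(lm.fit(x, y)$residuals^2)       # residual sum of squares

# Cross-check: deviance() of a linear model is its residual sum of squares.
rss_lm <- deviance(lm(y ~ x - 1))
all.equal(rss, rss_lm)
```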
Thank you everyone for your responses. I'll answer several questions.
1. > Disclaimer: I have **NO IDEA** of the details of what you want to do or why -- but I am willing to bet that there are better ways of doing it than 1.8 million multiple regressions that take 270 secs each!! (which I find difficult to believe in itself -- are you sure you are doing things right? Something sounds very fishy here: R's regression code is typically very fast).
I probably should not bore everyone, but just to explain where the large number comes from: I have an experimental design with 7 factors. Each factor has between 3 and 5 levels. Once you cross them all, you end up with 18,000 cells. For each cell, I want to generate a sample of N=100. For each sample I have to analyze the data using 3 different statistical methods of analysis (the goal of the Monte Carlo is to compare those methods). One of the methods requires running up to ~32,000 simple multiple regressions - yes, just for one sample, and it's not a mistake. I test-ran one such analysis for a sample with N=800 and 15 predictors and it took 270 seconds. R was actually very fast - it ran each of the individual regressions in about 0.008 seconds. Still, I need something faster.
2. Sorry - what was the formula sum(lm.fit(x, y)$residuals^2) for? For example, using it on my data, I got a value of 36,644...
3. I know that for similarly challenging situations people have used Fortran compilers. So, has anyone heard of a free Fortran library or an efficient piece of code?
Thank you! Dimitri
Dimitri Liakhovitski MarketTools, Inc. Dimitri.Liakhovitski at markettools.com
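The figures quoted in this message can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the value of 100 samples per cell is an assumption inferred from the 18,000 cells and the 1,800,000 runs mentioned earlier in the thread, and the variable names are made up.

```r
cells            <- 18000   # crossed design cells (from the message)
samples_per_cell <- 100     # assumption: 100 samples per cell
regs_per_sample  <- 32000   # upper bound quoted for one of the three methods
secs_per_reg     <- 0.008   # reported time per individual regression
secs_measured    <- 270     # measured time for one full analysis

total_samples   <- cells * samples_per_cell        # number of analyses to run
secs_per_sample <- regs_per_sample * secs_per_reg  # about 256 s, close to the measured 270 s
years <- total_samples * secs_measured / (365.25 * 24 * 3600)

total_samples
secs_per_sample
round(years, 1)
```

So the per-regression time is consistent with the measured 270 seconds, and the total is dominated by the number of regressions rather than by any single slow call.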
On Mon, Sep 8, 2008 at 1:47 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
> 2. Sorry - what was the formula sum(lm.fit(x, y)$residuals^2) for? For example, using it on my data, I got a value of 36,644...
It's the sum of the squares of the residuals.
I could get an r squared from lm.fit by correlating fitted.values and my response variable. But could I do it somehow using Sums of Squares? I am clear on SS for residuals. But where is SS for the model or the total SS in lm.fit output? Thank you! Dimitri
Dimitri Liakhovitski MarketTools, Inc. Dimitri.Liakhovitski at markettools.com
R squared is: 1 - sum(residuals^2)/crossprod(y - mean(y))
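A minimal sketch of that formula on simulated data (names are illustrative). lm.fit does not return the total or model sums of squares, so they are computed from y here; the formula assumes the model contains an intercept. The comparison with the squared correlation and with summary(lm()) is included only as a sanity check.

```r
set.seed(3)
n <- 100
x <- cbind(1, rnorm(n), rnorm(n))          # design matrix with intercept column
y <- drop(x %*% c(1, 2, -0.5)) + rnorm(n)  # simulated response

fit    <- lm.fit(x, y)
ss_res <- sum(fit$residuals^2)   # residual SS
ss_tot <- sum((y - mean(y))^2)   # total SS, same value as crossprod(y - mean(y))
ss_mod <- ss_tot - ss_res        # model SS
r2     <- 1 - ss_res / ss_tot

r2_cor <- cor(fit$fitted.values, y)^2        # squared correlation of fitted values and response
r2_lm  <- summary(lm(y ~ x[, -1]))$r.squared # what lm() reports
c(r2 = r2, r2_cor = r2_cor, r2_lm = r2_lm)
```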
Although I, along with the others, believe there probably is an efficient R solution, the answer to your direct question can perhaps be found at http://www.fortran.com/. The free GNU G95 Fortran compiler is at http://www.g95.org/ Joe
Thanks a lot, everybody! On Mon, Sep 8, 2008 at 3:11 PM, Lucke, Joseph F
<Joseph.F.Lucke at uth.tmc.edu> wrote:
Although I, along with the others, believe there probably is an efficient R solution, the answer to your direct question can perhaps be found at http://www.fortran.com/. The free G95 Fortran compiler is at http://www.g95.org/ Joe
Dimitri Liakhovitski MarketTools, Inc. Dimitri.Liakhovitski at markettools.com
Hi Dimitri,
On Mon, 8 Sep 2008, Dimitri Liakhovitski wrote:
Dear R-list, maybe some of you could point me in the right direction: Are you aware of any FREE Fortran or Java libraries/actual pieces of code that are VERY efficient (time-wise) in running the regular linear least-squares multiple regression?
You almost certainly want the LAPACK Fortran libraries, available at http://www.netlib.org/lapack/ ...the function of interest to you is probably "dgels": http://www.netlib.org/lapack/explore-html/dgels.f.html ...of course, this runs faster if you have a fast BLAS library installed. These exist in many forms, and one may already be installed on your system. --Adam
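As a point of comparison before linking against LAPACK by hand: in current versions of R, the basic matrix routines (`crossprod`, `chol`, `solve`) already call compiled BLAS/LAPACK code, so a small solver written directly in R may come close to what a Fortran call would give. The following is only a minimal sketch on simulated data; the names `n`, `p`, `X`, `y`, `beta_true`, and `b_chol` are illustrative assumptions, not taken from this thread. It solves the normal equations via a Cholesky factorization and checks the result against `lm.fit`. Note that this is the normal-equations approach, not the QR-based least squares that `dgels` performs, and it is less robust when predictors are nearly collinear.

```r
set.seed(1)
n <- 700; p <- 15                            # sizes taken from the original question
X <- cbind(1, matrix(rnorm(n * p), n, p))    # intercept column plus 15 simulated predictors
beta_true <- seq_len(p + 1) / 10             # arbitrary coefficients for the simulation
y <- drop(X %*% beta_true) + rnorm(n)

# Normal equations: (X'X) b = X'y
XtX <- crossprod(X)                          # X'X
Xty <- crossprod(X, y)                       # X'y
R   <- chol(XtX)                             # upper triangular, XtX = R'R

# Solve R'z = X'y, then R b = z
z      <- forwardsolve(t(R), Xty)
b_chol <- drop(backsolve(R, z))

# Reference fit using R's QR-based routine
b_qr <- lm.fit(X, y)$coefficients
max_diff <- max(abs(b_chol - b_qr))          # should be tiny for a well-conditioned X
```

Timing both versions with `system.time()` on your own data would show whether the difference is large enough to justify moving to compiled code at all.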
On Mon, Sep 8, 2008 at 7:47 PM, Dimitri Liakhovitski <ld7631 at gmail.com> wrote:
Thank you everyone for your responses. I'll answer several questions.

1. > Disclaimer: I have **NO IDEA** of the details of what you want to do or why -- but I am willing to bet that there are better ways of doing it than 1.8 million multiple regressions that take 270 secs each!! (which I find difficult to believe in itself -- are you sure you are doing things right? Something sounds very fishy here: R's regression code is typically very fast).
I probably should not bore everyone, but just to explain where the large number is coming from: I have an experimental design with 7 factors. Each factor has between 3 and 5 levels. Once you cross them all, you end up with 18,000 cells. For each cell, I want to generate a sample of N=100. For each sample I have to analyze the data using 3 different statistical methods of analysis (the goal of the Monte Carlo is to compare those methods). One of the methods requires running up to ~32,000 simple multiple regressions - yes, just for one sample, and it's not a mistake. I test-ran one such analysis for a sample with N=800 and 15 predictors and it took 270 seconds. R was actually very fast - it ran each of the individual regressions in about 0.008 seconds. Still, I need something faster.

2. Sorry - what was the formula sum(lm.fit(x,y)$residuals^2) for? For example, using it on my data, I got a value of 36,644...

3. I know that for similarly challenging situations people have used Fortran compilers. So, has anyone heard of a free Fortran library or an efficient piece of code?

Thank you! Dimitri
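On question 2: `sum(lm.fit(x, y)$residuals^2)` is the residual sum of squares (RSS) of the fit, that is, the sum of squared differences between the observed and fitted `y`. A value such as 36,644 is therefore on the squared scale of the response. The sketch below uses simulated data and hypothetical names (`X`, `Y`, `n_resp`, `rss_each`); it also illustrates the suggestion made earlier in this thread that regressions sharing the same `x` matrix can be fitted in a single `lm.fit` call by passing a matrix of responses.

```r
set.seed(42)
n <- 100; p <- 5; n_resp <- 1000             # sizes chosen for illustration only
X <- cbind(1, matrix(rnorm(n * p), n, p))    # shared design matrix with an intercept column
Y <- matrix(rnorm(n * n_resp), n, n_resp)    # one column per response vector

# A single regression and its residual sum of squares
fit1 <- lm.fit(X, Y[, 1])
rss1 <- sum(fit1$residuals^2)

# All regressions in one call: residuals come back as an n x n_resp matrix
fit_all  <- lm.fit(X, Y)
rss_each <- colSums(fit_all$residuals^2)     # one RSS per response column
```

This batching only helps when the design matrix really is identical across fits. If each of the ~32,000 regressions uses a different subset of predictors, it may not apply directly.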
Have you considered the fact that 32,000 regressions simply take a lot of time? I don't really have anything to go by, but it sounds unlikely that you will be able to cut computing time by more than, say, a factor of ten, to 27 seconds. That would still leave you with 4 months of running a computer. Perhaps an alternative approach would be to get access to more powerful (super)computers, either at a university or by buying access. A quick googling turns up http://www.clusterondemand.com/ for example. Anyhow, good luck with your project! I'm sure the R list would be very interested to hear how you solved your problem. Regards, Gustaf
Gustaf Rydevik, M.Sci. tel: +46(0)703 051 451 address:Essingetorget 40,112 66 Stockholm, SE skype:gustaf_rydevik