
foreach loop, stata equivalent

5 messages · Jamora, Nelissa, Milan Bouchet-Valat, Joshua Wiley

#
On Monday 18 February 2013 at 13:48 +0100, Jamora, Nelissa wrote:
You do not need the foreach package, which is aimed at a completely
different problem.

R does not really have the syntactic equivalent of "varlist", but you
can easily do something like:
for(var in paste0("p", 1:14)) {
    for(y in paste0("p", 15:269))
        lm(yourData[[var]] ~ yourData[[y]])
}

provided that yourData is the data frame in which the p* variables are
stored.

There are probably more direct ways of doing the same thing and storing
the resulting lm objects in a list, but you did not state what you
intend to do with this enormous set of regressions...
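One such more direct way, sketched here with a toy data frame standing in for yourData (the dimensions and names are illustrative, not from the thread), is to collect the fits in a nested list with lapply():

```r
## Toy stand-in for yourData: 20 columns named p1..p20
yourData <- as.data.frame(matrix(rnorm(100 * 20), ncol = 20))
names(yourData) <- paste0("p", 1:20)

## Nested lapply() keeps every lm object instead of discarding it
fits <- lapply(paste0("p", 1:2), function(var) {
    lapply(paste0("p", 3:5), function(y) {
        lm(yourData[[var]] ~ yourData[[y]])
    })
})
## fits[[1]][[1]] is the model for p1 ~ p3
```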


Regards
#
Hi Milan, 

Thanks for responding to my question. I'm actually not interested in lm();
that was just an example.

You are right, I'm trying an enormous set of model runs. My var1 has n = 14
and my var2 has n = 255, so 14 * 255 = 3,570 runs!
But first, I need to be able to set up two variables in each model run. Those
two variables are different in each case. I can set this up one by one, but
that would be tedious and inefficient.

To describe in more detail:
I have a data frame with 269 variables.
1. individual columns 1-14 can be my first variable
2. individual columns 15-269 can be my second variable.

Variable 1 and variable 2 are different in each case. For example:
Model 1: var1 and var15
Model 2: var1 and var16
Model 3: var1 and var17...
....
Model 3570: var14 and var269
So I need to write a loop command that calls for different sets of variable1
and variable2 in each run.
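The variable pairing described above can be enumerated up front, which turns the two nested loops into one flat loop. A minimal sketch (not from the thread), using expand.grid() and the p1..p269 naming from Milan's example; the actual model call, TVECM() or otherwise, would go inside the loop body:

```r
## Enumerate every (var1, var2) pair: 14 first variables x 255 second variables
pairs <- expand.grid(var1 = paste0("p", 1:14),
                     var2 = paste0("p", 15:269),
                     stringsAsFactors = FALSE)
nrow(pairs)  # 14 * 255 = 3570

## Model run i then uses the pair pairs$var1[i], pairs$var2[i]
```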

What do I intend to do with this? I'm running threshold vecm (package tsDyn),
and I need to summarize threshold estimates in each model run (or market
pairs, var1 and var2). The goal is to extract N=3,570 threshold estimates. 
I did similar linear VECM estimates in Stata using my foreach loop, but now I
need to do the parallel run in R, using the threshold model.

Hope this clears things.
Nelissa








#
On Monday 18 February 2013 at 17:09 +0100, Jamora, Nelissa wrote:
Ah, if you need parallelism, you can likely try something like this
(untested):

# Create cluster cl beforehand and export yourData to its workers
parLapply(cl, paste0("p", 1:14), function(var) {
    lapply(paste0("p", 15:269), function(y) {
         lm(yourData[[var]] ~ yourData[[y]])
    })
})

This will only be optimal if you have at most 14 cores, since only the 14
iterations of the outer loop run in parallel.
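To use more than 14 cores, one option (a sketch under assumed toy dimensions, not from the thread) is to flatten the two loops into a single list of variable pairs and parallelize over that, so every pair is an independent task:

```r
library(parallel)

## Toy stand-in for yourData, shrunk for illustration
yourData <- as.data.frame(matrix(rnorm(50 * 20), ncol = 20))
names(yourData) <- paste0("p", 1:20)

## All (var1, var2) pairs as one flat table: 2 * 4 = 8 tasks here
pairs <- expand.grid(var1 = paste0("p", 1:2),
                     var2 = paste0("p", 3:6),
                     stringsAsFactors = FALSE)

cl <- makeCluster(2)
clusterExport(cl, "yourData")          # make the data visible to workers
fits <- parLapply(cl, seq_len(nrow(pairs)), function(i, pairs) {
    lm(yourData[[pairs$var1[i]]] ~ yourData[[pairs$var2[i]]])
}, pairs = pairs)
stopCluster(cl)
```

With the real 14 x 255 grid this gives 3,570 independent tasks, so parallelism scales up to however many cores are available.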


Regards
#
If you have the same set of predictors for multiple outcomes, you can
speed up the process considerably by taking advantage of the fact that
the design matrix is the same for multiple outcomes.  As an example:

set.seed(10)
y <- matrix(rnorm(10000 * 14), ncol = 14)
x <- matrix(rnorm(10000 * 2), ncol = 2)
system.time(res <- lapply(1:14, function(i) lm(y[, i] ~ x)))
##   user  system elapsed
##   0.34    0.00    0.34
system.time(res2 <- lm(y ~ x))
##   user  system elapsed
##   0.05    0.02    0.06

lm() can accept a matrix as the dependent variable.  So if various
combinations of variables predict all 14 outcomes, do not fit
14 * (number of predictor combinations) models separately; fit them in
chunks for substantial speedups.
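Applied to the original problem, that means one lm() call per second variable can fit all 14 first variables at once. A toy sketch with shrunk dimensions (the names Y and X are illustrative):

```r
set.seed(1)
Y <- matrix(rnorm(100 * 14), ncol = 14)  # the 14 "first" variables as responses
X <- matrix(rnorm(100 * 5), ncol = 5)    # stand-in for the 255 "second" variables

## One fit per column of X, each covering all 14 responses at once
fits <- lapply(seq_len(ncol(X)), function(j) lm(Y ~ X[, j]))
coef(fits[[1]])  # a 2 x 14 coefficient matrix: intercept and slope per response
```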

Finally, as long as you are not using factors or any other fancy
things, and can work with just raw data matrices, instead of using
lm(), which is a high-level function, use lm.fit().  It is not
especially clever; it just expects the design matrix and response
matrix.  It will not add an intercept by default, so column-bind a
vector of 1s onto your data.

system.time(res3 <- lm.fit(cbind(1, x), y))
##   user  system elapsed
##   0.02    0.00    0.01

These three methods produce identical results:

res[[1]]
## Call:
## lm(formula = y[, i] ~ x)
## Coefficients:
## (Intercept)           x1           x2
##  0.0014401   -0.0198232   -0.0005721
res2[[1]][, 1]
##   (Intercept)            x1            x2
##  0.0014401149 -0.0198232209 -0.0005720764
res3[[1]][, 1]
##            x1            x2            x3
##  0.0014401149 -0.0198232209 -0.0005720764

In short, fitting multiple responses at once (so the design matrix can
be reused) and using lower-level functions when you have a simple,
specific, repetitive task, rather than fitting each response one at a
time, takes the estimation time from .34 down to .01.  You would need
34 cores to achieve that by simply throwing more hardware at the
problem as opposed to using more optimized code.  Of course you can do
both, and probably get the results pretty quickly.

Cheers,

Josh
On Mon, Feb 18, 2013 at 8:25 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/