Multi-thread R process performance on an 8-core Mac Pro
7 messages · Marra, David, Sean Davis, elijah wright +2 more
On Wednesday 18 April 2007 02:48, Marra, David wrote:
Everyone,

I am running R 2.4.1 on the new 8-core Mac Pro with the parSapply function from the snow package. Tests using 2, 4, and 8 workers with makeCluster() yield somewhat disappointing results. The 4-worker run is fastest. With 8 workers, all the cores max out at about 70% utilization, and even then the run is slower than the 4-worker run, which maxes out 4 cores at about 90-95%. This suggests the additional 4 cores on the Mac Pro do not improve performance in an embarrassingly parallel R/snow environment...

The function I tested in parSapply runs a regression (a call to "lm"). I am using the snow/rpvm package combination. Limited memory is not the issue; there are over 3 GB of free RAM during the tests.

Does anyone have suggestions as to how to get more power out of a multi-core Mac Pro machine, and/or 4-core/8-core R parallel-computing performance benchmarks/experiences to share?

Thanks in advance,
David

David Marra
Principal
A.T. Kearney (Tokyo)
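A minimal harness for the comparison described above might look like the following sketch. The workload and worker counts are hypothetical stand-ins for the real lm() job, and a SOCK cluster is used so the code runs without a PVM installation (David's runs used rpvm):

```r
# Sketch: time the same embarrassingly parallel job on 2, 4, and 8 workers.
# The workload (sums of random draws) is a hypothetical stand-in for the
# real regression calls; type = "SOCK" avoids needing rpvm/PVM.
library(snow)

timings <- sapply(c(2, 4, 8), function(n) {
  cl <- makeCluster(n, type = "SOCK")
  elapsed <- system.time(
    parSapply(cl, 1:200, function(i) sum(rnorm(1e5)))
  )["elapsed"]
  stopCluster(cl)
  elapsed
})
names(timings) <- paste(c(2, 4, 8), "workers")
print(timings)
```

If elapsed time stops shrinking (or grows) between 4 and 8 workers while per-core utilization drops, that is consistent with a shared-resource bottleneck rather than a lack of parallelism in the code.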
David,

Under snow/rpvm, raw CPU usage is not the only variable that goes into efficiency. Interprocess communication, disk I/O, memory utilization, and perhaps other factors also determine how well parallel code performs. Assuming your test results are correct, it may be that the CPUs are being limited by one or more of these other factors.

Sean
There has been some discussion (elsewhere) of the Mac Pro motherboard's inability to keep all eight cores supplied with data from memory at an adequate rate. Your processor cores may simply be starved for data. The amount of memory available doesn't matter much if there is no I/O bandwidth left between the CPUs and the DIMMs...

--e
Intel's SMP design is poor compared to its competitors' (not just the motherboard, the CPUs as well); that is well known. But I wouldn't be so sure that this is what hits you. Can you try running some big BLAS operations, for example:

set.seed(1)
a <- matrix(rnorm(4000000), 2000)
b <- matrix(rnorm(4000000), 2000)
system.time(for (i in 1:20) a %*% b)

This takes ca. 14.2s on a 2.66 GHz quad Mac Pro (46.5s user time). I wonder what the 8-core does with this. If you can't feed all 8 cores, then I'd say it's likely a bandwidth issue...

Cheers,
S
1 day later
Thank you to everyone who contributed to understanding the multi-core
problem better. I took Elijah's advice and purchased a Leopard
pre-release DVD, and will post performance results here when it arrives,
if the results are interesting (they should be!). I'll also post the
result of Simon's BLAS test in a few hours.

In the meantime, there is a speed problem to solve. I'd appreciate any
advice on approaches for speeding up the following function. Based on
previous comments, fewer trips to memory may be important...
results <- function(x) {
  fit <- lm(Y ~ get(cmb[x, 1]) + get(cmb[x, 2]) + get(cmb[x, 3]) +
                get(cmb[x, 4]) + get(cmb[x, 5]),
            data = data1)
  list(R2 = summary(fit)$adj.r.squared)
}
This call to the lm function is nested in a parSapply call that iterates
down the rows of the "cmb" matrix. Each row of cmb holds, in the example
above, 5 character values (such as "var1", "var2", ..., "var5")
corresponding to variable names in the "data1" dataframe. The function
iterates down the rows, generating regressions, each with a different
combination of variables. (x runs from 1 to the number of rows in cmb.)
Finally the function delivers the adjusted R2 for each combination.
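For context, a driver along these lines is what such a function assumes. This is a sketch with tiny synthetic cmb/data1 stand-ins; reformulate() replaces the get() calls for brevity, a SOCK cluster is shown so it runs without PVM, and clusterExport makes the objects visible on the workers:

```r
library(snow)

# Tiny synthetic stand-ins for the real cmb and data1 objects.
data1 <- data.frame(Y = rnorm(50), Var1 = rnorm(50), Var2 = rnorm(50))
cmb   <- matrix(c("Var1", "Var2"), nrow = 2)   # two one-variable combinations

results <- function(x) {
  fit <- lm(reformulate(cmb[x, ], response = "Y"), data = data1)
  summary(fit)$adj.r.squared
}

cl <- makeCluster(2, type = "SOCK")
clusterExport(cl, c("cmb", "data1"))           # workers need both objects
R2s <- parSapply(cl, 1:nrow(cmb), results)
stopCluster(cl)
```

Note that cmb and data1 must be exported to every worker; shipping a large data1 to each worker is itself a communication cost that grows with the number of workers.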
Any speed-up ideas?
David
Can you be more specific about what x is here? What you write makes it sound as if x is a single row, but you wouldn't be able to fit a linear model on a single row. It must be more than one row.

The immediate way to speed things up is to use lm.fit directly instead of going through lm. The lm function is a convenience function that takes a formula/data representation of a linear model, along with several optional arguments, and creates the model matrices. In this case you can create the model matrix for all the rows in a single call, provided that it fits into memory, then farm out the individual fits. Also, the call to summary does a lot more than calculate an adjusted R-squared. You can calculate this single statistic directly from the dimensions of the problem and the "effects" component of the lm fit.
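A sketch of this suggestion, with synthetic data1/cmb standing in for the real objects (it assumes all predictors are numeric, so the full model matrix can be built once and each fit just selects columns; the adjusted R-squared is computed here from residuals rather than the "effects" component, which gives the same value):

```r
set.seed(1)
data1 <- as.data.frame(matrix(rnorm(200 * 6), 200,
              dimnames = list(NULL, c("Y", paste("Var", 1:5, sep = "")))))
cmb <- matrix(paste("Var", 1:5, sep = ""), nrow = 1)  # one 5-variable combination

# Build the full model matrix once; each fit then just selects columns.
X <- model.matrix(~ ., data = data1[, setdiff(names(data1), "Y")])
y <- data1$Y
n <- length(y)

adj_r2 <- function(x) {
  cols <- c("(Intercept)", cmb[x, ])            # columns for combination x
  fit  <- lm.fit(X[, cols, drop = FALSE], y)
  rss  <- sum(fit$residuals^2)                  # residual sum of squares
  tss  <- sum((y - mean(y))^2)                  # total sum of squares
  1 - (rss / (n - fit$rank)) / (tss / (n - 1))  # adjusted R-squared
}
```

This skips the formula parsing and model-frame construction that lm repeats for every combination, as well as the full summary computation.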
_______________________________________________ R-SIG-Mac mailing list R-SIG-Mac at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/r-sig-mac
On 4/19/07, Marra, David <David.Marra at atkearney.com> wrote:
I will try to clarify. The purpose of the function is to create x different lm models and extract their adjusted R2s. If x is 1:500, that means 500 unique models, each with a different combination of predictors. The 500 unique combinations of variable names are stored in cmb, one combination per row. If there are 500 combinations of 4 variables each, the cmb matrix has 500 rows and 4 columns.

For example, row 29 might contain the following 4 character values: "Var2", "Var7", "Var18", "Var30". Literally, just characters. A text file, if you will. "Var2" would be in the first column, "Var7" in the second, "Var18" in the third, and "Var30" in the fourth.

The function I would like to speed up, if that is possible, then gets the variable names from cmb and the data from data1. data1 is a large dataframe with all the variables, Var1 to Var30, and their data.
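Since cmb holds plain variable names, one way to drop the get() calls entirely is to build each formula with reformulate(). This is a sketch; a synthetic data1/cmb pair stands in for the real 500 x 4 setup:

```r
set.seed(1)
data1 <- as.data.frame(matrix(rnorm(100 * 5), 100,
              dimnames = list(NULL, c("Y", paste("Var", 1:4, sep = "")))))
cmb <- matrix(paste("Var", 1:4, sep = ""), nrow = 1)  # row 1: "Var1" ... "Var4"

results <- function(x) {
  f   <- reformulate(cmb[x, ], response = "Y")  # Y ~ Var1 + Var2 + Var3 + Var4
  fit <- lm(f, data = data1)
  list(R2 = summary(fit)$adj.r.squared)
}
```

This works for any number of columns in cmb, so the same function covers 4- and 5-variable combinations without editing the formula.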