
[Rcpp-devel] Stack imbalance warning when using Rcpp and OpenMP

7 messages · Michael Braun, Davor Cubranic, Dirk Eddelbuettel

#
Hi.  Just one last clarifying question on this issue before I dive back in.

Suppose I declared a new Rcpp::List object in my C++ code, and copied the list elements from either the SEXP or the original Rcpp::List.  Since the new memory is allocated in C++, would I still have the same problem because of the way Rcpp allocated the memory?  Or would the copy be thread-safe?

Similarly, what if I were to create an STL container of Rcpp::Lists, and operate on each element of the container in parallel?  Same problem?
6 days later
#
Michael,
On 19 August 2011 at 14:09, Michael Braun wrote:
| Hi.  Just one last clarifying question on this issue before I dive back in.
| 
| Suppose I declared a new Rcpp::List object in my C++ code, and copied the list elements from either the SEXP or the original Rcpp::List.  Since the new memory is allocated in C++, would I still have the same problem because of the way Rcpp allocated the memory?  Or would the copy be thread-safe?
| 
| Similarly, what if I were to create an STL container of Rcpp::Lists, and operate on each element of the container in parallel?  Same problem?
| 
| From your helpful responses, it seems like the best alternative is to explicitly copy the contents of each SEXP in the list to a totally non-Rcpp object. I'm just wondering if keeping some of the data in the original classes might still work.
| 
| And finally, since I am still relatively new to C++, are there any standard classes that might make more sense than others?  I'm considering an STL vector of either Eigen or Armadillo matrices, for example.  Good idea, or bad?

OpenMP is tricky. I would definitely recommend reading up on tutorials _just
on the C++ side_ and working with some self-contained C++ examples.

Only once you feel you have a reasonable handle on that should you move on to
Rcpp and OpenMP. There are new issues there, such as the `Stack imbalance'
issue you have seen, which can arise when you return prematurely while other
threads are still chewing on R data structures.
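In the spirit of starting with self-contained C++, here is a minimal sketch (no Rcpp, no R API) of the container idea raised in the question: an STL vector of independent numeric blocks, processed in parallel with OpenMP. Plain std::vector<double> stands in for the Eigen or Armadillo matrices mentioned, and the function name scaleBlocks and the scaling operation are hypothetical. Because nothing inside the parallel region touches R objects, the R-specific problems cannot arise there. It assumes compilation with -fopenmp; without that flag the pragma is ignored and the loop runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: each block is independent, so each thread can
// modify its own block without locking. Nothing here calls into R.
void scaleBlocks(std::vector<std::vector<double> >& blocks, double factor) {
    // Signed loop index, as required by older OpenMP versions.
    const long n = static_cast<long>(blocks.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        std::vector<double>& b = blocks[i];   // block owned by this iteration
        for (std::size_t j = 0; j < b.size(); ++j) {
            b[j] *= factor;
        }
    }
}
```

For example, four blocks of three ones each, scaled by 2.0, all end up holding 2.0. Copying the R data into such a container before the parallel region, and back into R objects afterwards, keeps all R API calls on the main thread.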

Long story short, I just committed a new self-contained example to the Rcpp
source, which you can look at (and copy) via the URL

  https://r-forge.r-project.org/scm/viewvc.php/pkg/Rcpp/inst/examples/OpenMP/OpenMPandInline.r?view=markup&root=rcpp

It simply takes a vector of size two million, set up as the sequence 1 .. N,
and computes the log of each element. In other words, it is pretty light on
the actual computation.
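The workload just described can be sketched in standalone C++ with OpenMP. This is not the code from the committed example; the function name logSequence is hypothetical, and the sketch assumes compilation with -fopenmp (without it the pragma is ignored and the loop runs serially).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical analogue of the example's workload: fill x with 1 .. n,
// then replace each element by its natural log. The iterations are
// independent, so OpenMP can split the second loop across threads.
std::vector<double> logSequence(long n) {
    std::vector<double> x(static_cast<std::size_t>(n));
    for (long i = 0; i < n; ++i) {
        x[i] = static_cast<double>(i + 1);   // sequence 1 .. n
    }
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        x[i] = std::log(x[i]);               // each thread writes distinct elements
    }
    return x;
}
```

Each iteration does only one log call, so the per-thread work is small relative to the cost of starting and coordinating the threads.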

The results bear this out. On my (standard i7, four cores hyperthreaded) box:

edd@max:~$ r ~/svn/rcpp/pkg/Rcpp/inst/examples/OpenMP/OpenMPandInline.r
Loading required package: methods
              test replications elapsed relative user.self sys.self
2     funOpenMP(z)          100   3.219 1.000000     25.26     0.07
3 funSerialRcpp(z)          100   9.030 2.805219      9.43     0.32
4  funSugarRcpp(z)          100   9.423 2.927307      9.06     0.35
1     funSerial(z)          100   9.601 2.982603      9.59     0.00
edd@max:~$

So OpenMP 'wins', but the gain is sublinear at a factor of about three. The
fair comparison is with 'funSerial', which also uses a C++ vector. This
indicates some communication overhead between the threads.
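The arithmetic behind "sublinear" can be spelled out from the table above, taking funSerial as the baseline and assuming the four physical cores are the relevant count (hyperthreading blurs this):

```latex
S = \frac{t_{\mathrm{funSerial}}}{t_{\mathrm{funOpenMP}}} = \frac{9.601}{3.219} \approx 2.98,
\qquad
E = \frac{S}{4} \approx 0.75
```

The user.self column points the same way: 25.26 s of CPU time against 9.59 s serially, roughly 2.6 times as much CPU for the same work, although hyperthreading may account for part of that.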

Rcpp sugar has no real leg up on the manual loops, but it is the shortest
implementation, at two lines. Looping over an Rcpp vector is a little faster
than looping over a C++ STL vector (which incurs a copy).
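The cost of that copy can be illustrated without Rcpp. In this sketch a raw buffer stands in for the memory of an R vector; the assumption is that Rcpp::as<std::vector<double> > copies the data, while an Rcpp::NumericVector built from an existing numeric SEXP works on R's memory directly. Both function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical analogue of funSerial: copy the caller's data into a
// std::vector first, then transform the copy. The copy costs an extra
// allocation and an extra pass over memory.
std::vector<double> logViaCopy(const double* data, std::size_t n) {
    std::vector<double> x(data, data + n);   // explicit copy
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = std::log(x[i]);
    }
    return x;
}

// Hypothetical analogue of funSerialRcpp: work on the caller's buffer
// directly, with no intermediate copy. The input is overwritten.
void logInPlace(double* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = std::log(data[i]);
    }
}
```

The trade-off is that the in-place version destroys its input, so it only fits when the caller no longer needs the original values.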

Hope this helps,  Dirk
#
On August 25, 2011 05:32:45 PM Dirk Eddelbuettel wrote:
[...]
I know you wanted to keep all code looking the same, but with std::transform 
all serial code is very short:

serialStdAlgCode <- '
   std::vector<double> x = Rcpp::as<std::vector< double > >(xs);
   std::transform(x.begin(), x.end(), x.begin(), ::log);
   return Rcpp::wrap(x);
'
funSerialStdAlg <- cxxfunction(signature(xs="numeric"), body=serialStdAlgCode, 
plugin="Rcpp")

serialStdAlgRcppCode <- '
   Rcpp::NumericVector x = Rcpp::NumericVector(xs);
   std::transform(x.begin(), x.end(), x.begin(), ::log);
   return x;
'
funSerialStdAlgRcpp <- cxxfunction(signature(xs="numeric"), 
body=serialStdAlgRcppCode, plugin="Rcpp")
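The same std::transform idiom can be tried in standalone C++. This sketch passes a lambda instead of ::log, an adjustment for current toolchains: when several overloads of log are visible in the global namespace, taking the address of ::log can be ambiguous, and a lambda avoids the question. Lambdas require C++11, which gcc 4.5 (the compiler mentioned later in this thread) only offered via -std=c++0x. The function name logTransform is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Replace every element of x by its natural log, in place: the output
// iterator is the same as the input begin, as in the snippets above.
void logTransform(std::vector<double>& x) {
    std::transform(x.begin(), x.end(), x.begin(),
                   [](double v) { return std::log(v); });
}
```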

The results (without OpenMP because I'm currently on a single-core CPU):

                    test replications elapsed relative user.self sys.self
3 funSerialStdAlgRcpp(z)           20   4.236 1.000000     3.792    0.252
4       funSerialRcpp(z)           20   4.312 1.017941     3.792    0.272
5        funSugarRcpp(z)           20   4.537 1.071058     3.744    0.588
2     funSerialStdAlg(z)           20   5.514 1.301700     4.329    0.884
1           funSerial(z)           20   5.536 1.306893     4.480    0.808

Davor
#
On 26 August 2011 at 08:37, Davor Cubranic wrote:
| On August 25, 2011 05:32:45 PM Dirk Eddelbuettel wrote:
| > Long story short I just committed a new self-contained example to the Rcpp
| > source which you can look at (and copy) via the URL
| [...]
| > The results bear this out. On my (standard i7, four cores hyperthreaded)
| > box:
| > 
| > edd@max:~$ r ~/svn/rcpp/pkg/Rcpp/inst/examples/OpenMP/OpenMPandInline.r
| > Loading required package: methods
| >               test replications elapsed relative user.self sys.self
| > 2     funOpenMP(z)          100   3.219 1.000000     25.26     0.07
| > 3 funSerialRcpp(z)          100   9.030 2.805219      9.43     0.32
| > 4  funSugarRcpp(z)          100   9.423 2.927307      9.06     0.35
| > 1     funSerial(z)          100   9.601 2.982603      9.59     0.00
| > edd@max:~$
| > 
| [...]
| > Rcpp sugar has no real leg up on manual loops, but is the shortest
| > implementation in two lines. Looping over an Rcpp vector is a little faster
| > than looping over a C++ STL vector (which incurs a copy).
| 
| I know you wanted to keep all code looking the same, but with std::transform 
| all serial code is very short:

Good thinking, I'll make that change!

| serialStdAlgCode <- '
|    std::vector<double> x = Rcpp::as<std::vector< double > >(xs);
|    std::transform(x.begin(), x.end(), x.begin(), ::log);
|    return Rcpp::wrap(x);
| '
| funSerialStdAlg <- cxxfunction(signature(xs="numeric"), body=serialStdAlgCode, 
| plugin="Rcpp")
| 
| serialStdAlgRcppCode <- '
|    Rcpp::NumericVector x = Rcpp::NumericVector(xs);
|    std::transform(x.begin(), x.end(), x.begin(), ::log);
|    return x;
| '
| funSerialStdAlgRcpp <- cxxfunction(signature(xs="numeric"), 
| body=serialStdAlgRcppCode, plugin="Rcpp")

That's much nicer. And quicker. Nice :)
 
| The results (without OpenMP because I'm currently on a single-core CPU):
| 
|                     test replications elapsed relative user.self sys.self
| 3 funSerialStdAlgRcpp(z)           20   4.236 1.000000     3.792    0.252
| 4       funSerialRcpp(z)           20   4.312 1.017941     3.792    0.272
| 5        funSugarRcpp(z)           20   4.537 1.071058     3.744    0.588
| 2     funSerialStdAlg(z)           20   5.514 1.301700     4.329    0.884
| 1           funSerial(z)           20   5.536 1.306893     4.480    0.808


Thanks for the suggestion.

Dirk
#
That's what I get with Davor's additions:

edd@max:~/svn/rcpp/pkg/Rcpp/inst/examples/OpenMP$ r OpenMPandInline.r
Loading required package: methods
                    test replications elapsed relative user.self sys.self
2           funOpenMP(z)          100   3.996 1.000000     30.53     0.91
5       funSerialRcpp(z)          100   8.960 2.242242      8.57     0.39
3 funSerialStdAlgRcpp(z)          100   8.975 2.245996      9.39     0.34
6        funSugarRcpp(z)          100   9.225 2.308559      9.00     0.22
4     funSerialStdAlg(z)          100  10.003 2.503253      9.22     0.77
1           funSerial(z)          100  10.092 2.525526      9.42     0.67
edd@max:~/svn/rcpp/pkg/Rcpp/inst/examples/OpenMP$

OpenMP still wins on 'elapsed' time, and using the STL algorithm is good for
marginal gains over manual loops.

Dirk
#
On August 26, 2011 09:05:44 AM Dirk Eddelbuettel wrote:
Interesting -- using std::transform was consistently faster for me in both 
Rcpp and std::vector variants.

I was using gcc 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4), on Kubuntu 11.04.

Davor
#
On 26 August 2011 at 16:11, Davor Cubranic wrote:
| On August 26, 2011 09:05:44 AM Dirk Eddelbuettel wrote:
| > That's what I get with Davor's additions:
| > 
| > 5       funSerialRcpp(z)          100   8.960 2.242242      8.57     0.39
| > 3 funSerialStdAlgRcpp(z)          100   8.975 2.245996      9.39     0.34
| 
| Interesting -- using std::transform was consistently faster for me in both 
| Rcpp and std::vector variants.

Those two times are basically indistinguishable. And this is timed on my
moderately busy server, so there is some noise.

| I was using gcc 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4), on Kubuntu 11.04.

Also Ubuntu 11.04 here.

Dirk