
[Rcpp-devel] Conversion of Rcpp::Vector Rcpp::List and Rcpp::Matrix to std objects - OpenMP

6 messages · Simon Zehnder, Dirk Eddelbuettel, Asis Hallab +1 more

#
Dear Dirk and Simon,

thank you very much for your help.

I have made a few experiments and am very puzzled. Maybe I could
bother you for advice on this?

My experiments all concern a loop under a
#pragma omp parallel for

1) Use a private clause on all Rcpp objects that are only read from,
and a shared clause on the Rcpp List that collects the results.
-> segfault
2) Make everything private -> runs fine, but returns an empty result list
3) Use a std::map instead of the Rcpp List to collect the results ->
endless loop telling me "****snapping into wrong generation"
4) No private or shared clauses, and the above Rcpp List collecting the
results works fine on small test data (3 iterations), but on real-life
test data (1000 iterations) gives the above "****snapping into
wrong generation"

Do you have any idea what I could do?
I am still learning C++. Maybe I should use references instead of
objects? But aren't SEXPs pointers anyway?

Kind regards!
#
Hi Asis,

I'll write between the lines here:
On Jun 1, 2013, at 1:45 PM, Asis Hallab <asis.hallab at gmail.com> wrote:

First, you usually do not need a private clause on objects you only read from, but only on objects you manipulate inside a thread.
Second, I do not know how an Rcpp List is laid out in memory, but it could be that, depending on that layout,
one thread wants to write into a block where another one is already writing. I would do the following: first, try using a matrix instead
of a list; there the memory layout is clearer. Second, also try Armadillo objects instead of NumericMatrix. It could make a difference.
For more input on the memory layout, let us hope for a reply from the Rcpp devels.
That is clear: you let each thread create its own result list, so the result list outside the loop remains in the state it had before
you started the loop. The list MUST be shared.
Again, this could be the same issue as with the Rcpp List: memory layout. You should be able to find a lot about
OpenMP and generic lists (or, more generally, STL containers) on the web, as C++ gurus are much more into this than statistical programmers.
Same as above. In the case of only 3 iterations, the chance of several threads writing into the same memory is very low.
I wouldn't do this; references just make things more complicated here, and it should work with objects (for private objects, the pragma
directive usually calls the appropriate copy constructor). Maybe post your code here and I'll have a look at it.

One last question from my side: did you use nested loops? These can sometimes result in undefined behavior; I have seen such cases before.
Hope this helps

Best

Simon


P.S. If I do not answer in the next hours, it is because I have to give some exams... I'll be back later.
#
On 1 June 2013 at 14:05, Simon Zehnder wrote:
| On Jun 1, 2013, at 1:45 PM, Asis Hallab <asis.hallab at gmail.com> wrote:
| > 1) Use private directive on all Rcpp-Objects that are only accessed to
| > be read from, and a shared statement on the Rcpp List that collects
| > the results. -> segfault
| 
| First, you do usually not need a private clause on objects you read from, but only objects you want to manipulate inside a thread. 
| Second, I do not know how an Rcpp-List is embedded into memory, but it could be, that dependent on this memory structure,

Collection ("list") objects hold R objects which, in the case of vectors or matrices, are
themselves contiguous memory. I.e. a list is a collection of R objects.

In short: better not to write there.

Good discussion otherwise.

My general recommendation: split the issues. Try to solve your OpenMP design
questions in a standalone C++ program without any R.  If and when you're
happy (and this is no small task even for experienced C++ programmers) mix
with R(cpp).

Dirk
#
Dear Simon,

thank you very much for your insights and offer to help.

I prepared a very small project on github:
https://github.com/asishallab/PhyloFun_Rccp

You'll find installation and test instructions as well as references
to the file containing the OpenMP using C++ code.

Your help is much appreciated.

Kind regards!
#
Hi Asis,

I have cloned the git project and successfully installed it to R. The 'testCptsRealLive.R' does not work though. The 'load' command has problems reading the binary file. 

Nevertheless, I took a look at your OpenMP code: 

SEXP conditionalProbabilityTables( SEXP uniqueEdgeLengths, SEXP annos, SEXP
    stringifiedAnnotations, SEXP annotsMutationProbTableList, SEXP
    mutTblLengthColIndx, SEXP nThreads ) {
  BEGIN_RCPP

    //std::cout << "1" << "\n";
    NumericVector numberThreads = NumericVector( nThreads );
    //std::cout << "2" << "\n";
    omp_set_num_threads( numberThreads(0) );
    //std::cout << "3" << "\n";

    NumericVector edgeLengths = NumericVector( uniqueEdgeLengths );
    //std::cout << "4" << "\n";
    CharacterVector edgeLengthsAsStrs = as<CharacterVector>( edgeLengths );
    //std::cout << "5" << "\n";
    List cpts = List( 0 );
    /* std::map< std::string, std::vector<double> > cpts; */
    //std::cout << "6" << "\n";

    #pragma omp parallel for private( uniqueEdgeLengths, annos, stringifiedAnnotations, annotsMutationProbTableList, mutTblLengthColIndx, nThreads, edgeLengths ) shared( cpts )
    for ( int i = 0; i < edgeLengths.size(); i++ )   
    {
      NumericVector currBranchLength( 1, edgeLengths( i ) );
      NumericMatrix cpt = conditionalProbabilityTable( currBranchLength, annos,
          stringifiedAnnotations, annotsMutationProbTableList,
          mutTblLengthColIndx );
      cpts.push_back( cpt, std::string( edgeLengthsAsStrs( i ) ) );
      /* cpts[ std::string( edgeLengthsAsStrs( i ) ) ] = as<std::vector<double> >( cpt ); */
    }
    return( wrap( cpts ) );

  END_RCPP
}

My comments: 
1) Always write '++i', not 'i++'. The second form involves a copy, whereas the first one doesn't; this makes the first one faster.

2) You are using a function 'conditionalProbabilityTable' in your OpenMP-Loop. Have you tested this function?

3) The 'cpts' List object is shared in the loop, i.e. all threads can access it. But then you write to the list not by index or name, but via push_back(). Whether that is thread-safe I cannot tell you, but I wouldn't use it here.
It can cause undefined behavior, i.e. your program sometimes runs without errors and at other times you get a segfault. Try to access the list by index or name at least. Another possibility: wrap the push_back
call in a '#pragma omp critical' directive, which allows only one thread at a time to access the List (this can increase the runtime).

4) You are using in the called function 'conditionalProbabilityTable' another for-loop: 

SEXP conditionalProbabilityTable( SEXP branchLength, SEXP annos,
    SEXP stringifiedAnnotations, SEXP annotsMutationProbTables, SEXP
    mutTblLengthColIndx ) {
  BEGIN_RCPP

    List annosLst( annos );
    CharacterVector annosAsStrs( stringifiedAnnotations );
    NumericMatrix cpt = NumericMatrix( annosLst.size(), annosLst.size() );
    cpt.attr( "dimnames" ) = List::create( annosAsStrs, annosAsStrs );

    for ( int i = 0; i < annosLst.size(); i++ ) {
      CharacterVector compositeAnnotation = annosLst( i );
      double compAnnoMutProb = 1.0;
      std::string ua = "unknown";
      std::string caFirst = as<std::string>( compositeAnnotation( 0 ) );
      if ( ua != caFirst ) {
        compAnnoMutProb = mutationProbability( compositeAnnotation,
            branchLength, annotsMutationProbTables, mutTblLengthColIndx )( 0 );
      }
      double mutToOtherAnnosProb = compAnnoMutProb / ( annosLst.size() - 1 );
      NumericVector colmn( annosLst.size(), mutToOtherAnnosProb );
      colmn( i ) = 1.0 - compAnnoMutProb;
      cpt( _, i ) = colmn;
    }

    return( wrap( cpt ) );

  END_RCPP
}
In such cases you should always think about which loop to parallelize. Usually you parallelize the loop with the bigger workload; only if that loop has
just a few iterations does it make sense to parallelize the other one. Another possibility: create nested parallel loops (advanced level). For this you add another '#pragma omp parallel for' directive before the second for-loop, set the environment variable OMP_NESTED to true,
and set OMP_MAX_ACTIVE_LEVELS=2. Further, make sure the threads have enough stack space, e.g. OMP_STACKSIZE=1000000.
Do not even dream of an improvement when working with two cores! You need at least 8 to see the power of nesting!

5) Set the environment variable OMP_DYNAMIC=false. Dynamic scheduling does not work well in most cases, and static scheduling gives better performance.

6) The OpenMP library was intended for very simple objects, i.e. arrays and vectors. One reason for this is the so-called page placement, where you place the memory chunks used by a certain core
in that core's local memory (NUMA technology). With very complicated objects it is not always clear how the memory can be placed for optimal access.
So, to cut a long story short, I would always use very simple objects in an OpenMP directive. As NumericVectors and Armadillo vectors rely on an inner array for memory allocation
(e.g. you can initialize both objects from an already existing std::vector for memory reuse), these objects can still be considered very simple -> fast. Second, page placement
is more easily achieved, if wanted/needed. Third, write access by index is easy, and it is thread-safe as long as index ranges do not intersect, or we have read-only access.
Run over vectors and matrices, and stack them into a list later on.

7) It sometimes seems to me that you create a lot of objects. Check whether these objects are all needed. For example, consider in 'conditionalProbabilityTable':
colmn( i ) = 1.0 - compAnnoMutProb;
cpt( _, i ) = colmn;
I think you could write this immediately into 'cpt'.

8) Use instruments for profiling: a very good introduction and overview can be found on the website of the HPC cluster in Aachen (to which, by the way, you could get access from the University of Bonn), http://www.rz.rwth-aachen.de/global/show_document.asp?id=aaaaaaaaaaclexs. Identify the
big workload in your sequential program with such an instrument and try to parallelize that section. Then check where the next workload is, and so on. I would further suggest simplifying your program when you
encounter undefined behavior. This helps a lot in most cases.

I hope I could help here. I already saw your new thread on the list, and it seems everything is running. So congrats and keep it up!


Best

Simon

P.S. Of course we could meet in Bonn or Cologne (where I live and work now). I am also very interested in what is needed in bioinformatics; maybe we could exchange ideas. Let us keep in contact!
On Jun 1, 2013, at 8:53 PM, Asis Hallab <asis.hallab at gmail.com> wrote:

16 days later
#
On 13-06-02 08:09 AM, Simon Zehnder wrote:
Really? When just incrementing the for-loop index counter? Any current
compiler will do the efficient thing here. Have a look at the assembly
output ("-S" option for gcc) and compare the versions using pre- and
post-increment of "i".

Davor