Skip to content
Prev 243170 / 398500 Next

help: program efficiency

Hello,

Someone pointed out to me off list about this construct:

nodup_sort <- function(x, fun = nodup3){
     i <- sort.list(x)
     x[i] <- fun(x[i])
     x
}

which deals more efficiently with the reordering.

 > x <- sample( 1:100000, size = 300000, replace = TRUE )
 > system.time( nodup_cpp( x ) )
utilisateur     syst?me      ?coul?
       0.127       0.005       0.132
 > system.time( nodup_sort( x, nodup3 ) )
utilisateur     syst?me      ?coul?
       0.287       0.009       0.295
 > system.time( nodup_sort( x, nodup3a ) )
utilisateur     syst?me      ?coul?
       0.168       0.010       0.179
 > system.time( nodup_sort( x, nodup4 ) )
utilisateur     syst?me      ?coul?
       0.157       0.005       0.163
 > system.time( nodup_sort( x, nodup_cpp_assumingsorted ) )
utilisateur     syst?me      ?coul?
       0.096       0.001       0.097

So in this example, it seems more efficient to sort first and use the 
algorithm assuming that the data is sorted.

There is probably a way to be smarter in nodup_cpp where the bottleneck 
is likely to be related to map::find.

Another version taking some more internally :

nodup_cpp_hybrid <- cxxfunction( signature( x_ = "numeric", sort_ = 
"integer" ), '

     NumericVector input(x_) ;
     NumericVector x  = clone<NumericVector>( x_)  ;
     IntegerVector sort( sort_ ) ;

     int n = x.size() ;
     double current, previous = input[ sort[0] - 1 ] ;
     double index = 0.0 ;
     int si ;
     for( int i=1; i<n; i++){
         si = sort[i] - 1;
         current = input[ si ] ;
         if( current == previous ){
             index += .01 ;
             x[ si ] = current + index ;
         } else {
             index = 0.0 ;
         }
         previous = current ;
     }
     return x ;
', plugin = "Rcpp" )

no big difference:

 > system.time( res6 <- nodup_cpp_hybrid( x, sort.list(x) ) )
utilisateur     syst?me      ?coul?
       0.092       0.000       0.092


Profiling reveals this:

 > Rprof()
 > for(i in 1:100) {  res6 <- ( nodup_cpp_hybrid( x, sort.list(x) ) ) }
 > Rprof(NULL)
 > summaryRprof()
$by.self
               self.time self.pct total.time total.pct
"sort.list"        6.50    90.03       6.50     90.03
".Call"            0.42     5.82       0.42      5.82
"file.exists"      0.30     4.16       0.30      4.16

$by.total
                    total.time total.pct self.time self.pct
"nodup_cpp_hybrid"       7.22    100.00      0.00     0.00
"sort.list"              6.50     90.03      6.50    90.03
".Call"                  0.42      5.82      0.42     5.82
"file.exists"            0.30      4.16      0.30     4.16

$sample.interval
[1] 0.02

$sampling.time
[1] 7.22


The 4.16 % taken by file.exists indicates that someone in the inline 
project has to do some work (on my TODO list).

But otherwise sort.list dominates the time.

Romain

Le 26/11/10 21:22, Romain Francois a ?crit :