
hard to believe speed difference

12 messages · Michael Camann, A.J. Rossini, Jason Liao +6 more

#
First, I love R and am grateful to be using this free and extremely
high-quality software.

Recently I have been comparing two algorithms, and naturally I
programmed them in R first. It is so slow that I can almost feel its
pain. So I decided to do a comparison with Java. To draw 500,000
truncated normals, the Java program takes 2 seconds and R takes 72
seconds. Not a computer science major, I find it hard to understand
how R can be so slow.

R code:

normal.truncated <- function(mean, sd, lower, upper)
{
    repeat
    {
        deviate <- round(rnorm(1, mean, sd))
        if (deviate <= upper && deviate >= lower) break
    }
    temp <- c(deviate + .5, deviate - .5, upper + .5, lower - .5)
    result <- pnorm(temp, mean, sd)
    prob <- (result[1] - result[2]) / (result[3] - result[4])

    return(c(deviate, prob))
}

print(date())

n <- 500000
result <- numeric(n)

for (i in 1:n) result[i] <- normal.truncated(0, 1, -10, 10)[1]
print(date())
Java Code:

    package Mcdonald;
    import VisualNumerics.math.*;
    import java.util.*;

    public final class normal_univariate_truncated
    {
       public double harvest, prob;
       static Random ran_obj = new Random();

       public normal_univariate_truncated(double mean, double var, double lower, double upper)
       {
          double sd = Math.sqrt(var);

          lower -= .5;
          upper += .5;
          double left = (lower - mean)/sd;
          double right = (upper - mean)/sd;
          double prob2 = Statistics.normalCdf(right) - Statistics.normalCdf(left);

          while(true)
          {
              harvest = mean + ran_obj.nextGaussian()*sd;
              if(harvest>lower && harvest<upper) break;
          }

          harvest = Sfun.nearestInteger(harvest);
          left = (harvest - .5 - mean)/sd;
          right = (harvest + .5 - mean)/sd;
          double prob1 = Statistics.normalCdf(right) - Statistics.normalCdf(left);

          prob = prob1/prob2;
       }

       public static void main(String[] useless)
       {
           System.out.println(new Date());
           int n = 500000;
           double[] result = new double[n];
           for(int i=0; i<n; i++)
           {
               normal_univariate_truncated obj = new normal_univariate_truncated(0, 1, -10, 10);
               result[i] = obj.harvest;
           }
           System.out.println(new Date());
       }
    }


=====
Jason G. Liao, Ph.D.
Division of Biometrics
University of Medicine and Dentistry of New Jersey
335 George Street, Suite 2200
New Brunswick, NJ 08903-2688
phone (732) 235-8611, fax (732) 235-9777
http://www.geocities.com/jg_liao

__________________________________________________



-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
#
If you write R like C or Java, it will often be slow.  R is a vector
language, so learn to work with it rather than against it.

For a truncated (standard) normal, try qnorm(runif(500000, a, b)) for
suitable a and b (pnorm(lower) and pnorm(upper)?)
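Spelled out, that suggestion is just inverse-CDF sampling; a minimal sketch (with illustrative bounds, not the poster's) might look like:

```r
# Inverse-CDF sampling sketch: uniforms on (pnorm(lower), pnorm(upper))
# mapped back through qnorm() give normals restricted to (lower, upper),
# with no rejection loop at all.
lower <- -2; upper <- 2
x <- qnorm(runif(500000, pnorm(lower), pnorm(upper)))
range(x)  # every draw lies inside (lower, upper)
```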

And what are you going to do with 0.5m truncated normals that makes 70s
significant?

BTW, R has system.time() to time things.
On Tue, 6 Aug 2002, Jason Liao wrote:

It's not the language ....
#
On Tue, 6 Aug 2002, Jason Liao wrote:
Largely because you aren't using vectorised code. Loops in R are slow (to
some extent true with all interpreters).  The way around this is to use
R's ability to work on vectors or matrices instead of looping over their
elements.

Eg
rtruncnorm <- function(n, mean, sd, lower, upper) {

    rval <- numeric(n)
    done <- rep(FALSE, n)
    while (m <- sum(!done)) {
        rval[!done] <- round(rnorm(m, mean, sd))
        done <- rval >= lower & rval <= upper
    }

    prob <- (pnorm(rval + 0.5, mean, sd) - pnorm(rval - 0.5, mean, sd)) /
            (pnorm(upper + 0.5, mean, sd) - pnorm(lower - 0.5, mean, sd))
    cbind(rval, prob)

}

rtruncnorm2 <- function(n, mean, sd, lower, upper) {

    uval <- runif(n, pnorm(lower - 0.5, mean, sd), pnorm(upper + 0.5, mean, sd))
    rval <- round(qnorm(uval, mean, sd))
    prob <- (pnorm(rval + 0.5, mean, sd) - pnorm(rval - 0.5, mean, sd)) /
            (pnorm(upper + 0.5, mean, sd) - pnorm(lower - 0.5, mean, sd))
    cbind(rval, prob)

}

are two improved versions of your code that run more than 20 times
faster on my computer.
The first uses the same rejection sampling algorithm that you use, but
does it for the whole vector at once.

The second generates truncated uniforms (which is trivial) and transforms
them to Normal.

In addition to being a lot faster than your code I think they will handle
the case where mean, sd, lower or upper are also vectors.

Even faster, if you only want whole numbers and always the same
parameters, is to compute the probabilities once and then sample from the
resulting discrete distribution.

rtruncnorm3 <- function(n, mean, sd, lower, upper) {
    p <- diff(pnorm((lower:(upper + 1)) - 0.5, mean, sd))
    sample(lower:upper, n, replace = TRUE, prob = p)
}
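As a quick sanity check of the discrete-sampling version above (timings vary by machine; this just shows the intended call):

```r
# Usage sketch: 500,000 rounded truncated-normal draws in one call,
# then verify all draws respect the truncation bounds.
system.time(x <- rtruncnorm3(500000, mean = 0, sd = 1, lower = -10, upper = 10))
stopifnot(all(x >= -10 & x <= 10))
```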
This is about 400 times faster than your version (and the speed advantage
would increase if the truncation points were less extreme), apparently
faster than the Java program.

By the way, there's no point truncating standard Normals to [-10,10]. The
probability of getting anything outside [-10,10] in 500000 samples is
about 10^-17. Even for [-6, 6] it's less than 0.001.


	-thomas

1 day later
#
Forgive me if I'm adding to this thread a bit late, but I've just returned
to town.  Also, apologies in advance to those who might take offense that
this post is not specifically a help related response.

Like many of us on this list (I presume), I've been writing computer code
for twenty-five years or so in a wide variety of languages.  I started
using S-Plus in 1991 and switched to R when I took my present job in 1997.
The point of this is that despite a decade or so of S/R experience, I've
only recently, within the last couple of years, begun to write R code that
makes efficient use of the language, particularly with regard to
eliminating loops.  Old habits die hard, especially coding paradigms that
are efficient solutions in many environments but not in R.

Perhaps when the Intro to R undergoes its next modification cycle it
might be useful to say something about this *prominently*.  Of course,
I'll be the first to admit that it might be there already -- part of the
problem is that those of us with extensive coding experience probably just
apply what we know works in other environments to our R code, where it
also works, albeit less efficiently, and cruise the manual(s) only when we
need to solve a syntax problem, locate a function argument, etc.  Perhaps
an rlint() function or something similar is in order....

--Mike C.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Michael A. Camann                                  Voice: 707-826-3676
Associate Professor of Zoology                       Fax: 707-826-3201
Institute for Forest Canopy Research     Email: mac24 at axe.humboldt.edu
Department of Biology                            ifcr at axe.humboldt.edu
Humboldt State University
Arcata, CA 95521

                 URL:http://www.humboldt.edu/~mac24/
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#
    michael> Like many of us on this list (I presume), I've been writing computer code
    michael> for twenty-five years or so in a wide variety of languages.  I started
    michael> using S-Plus in 1991 and switched to R when I took my present job in 1997.
    michael> The point of this is that despite a decade or so of S/R experience, I've
    michael> only recently, within the last couple of years, begun to write R code that
    michael> makes efficient use of the language, particularly with regard to
    michael> eliminating loops.  Old habits die hard, especially coding paradigms that
    michael> are efficient solutions in many environments but not in R.

There's a wonderful book called S Programming, by Venables and Ripley
(familiar names, eh?), which should be a must-read in this case.  See:

        http://www.stats.ox.ac.uk/pub/MASS3/Sprog/

for more details.  It's not clear that one could write a "lint" to
check for inefficiency; lint doesn't do that for C!

best,
-tony
#
Thanks to Profs. Ripley, Lumley and Camann for replying to my
"complaint" about R speed. I have a few follow-up points:

1. They all pointed to the need to vectorize the code for efficiency.
As Michael just said, however, this can require a substantial paradigm
shift and rethinking, and may prove difficult for those of us who have
extensive experience in other languages and who may even consider
themselves good at programming. More importantly, some code cannot be
vectorized. In my "real" program, I need to generate a large number of
truncated normal deviates. This has to be done in a sequential manner
since the mean and variance of the later ones depend on the values of
the previous ones.
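Schematically, the sequential case described above might look like the following (a hypothetical AR(1)-style sketch, not the actual model): each draw's mean depends on the previous draw, so the rnorm() calls cannot be collapsed into one vectorized call.

```r
# Hypothetical sketch of a sequential dependence that defeats
# vectorization: draw i's mean depends on draw i-1, forcing a loop.
n <- 1000
x <- numeric(n)
x[1] <- rnorm(1, mean = 0, sd = 1)
for (i in 2:n)
    x[i] <- rnorm(1, mean = 0.9 * x[i - 1], sd = 1)
```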

2. I did another comparison between R and Java, this time computing
the covariance matrix of the multinomial distribution. The R function
is simple and fully vectorized:
 
var.multinomial = function(n, pi)
{
   n * ( -pi%*%t(pi) + diag(pi*(1-pi)+pi*pi) )
}

For n=20, the time for one operation is 57/100000 seconds based on
  system.time(for (i in 1:100000) var = var.multinomial(20, rep(1/20,20)))

Hand-coding this in Java, one operation takes 4/100000 seconds,
still 15 times faster.

3. I use Java for comparison because Java is also interpreted. Java
also started out very slow but has improved tremendously with the
introduction of just-in-time compilers and HotSpot. I am hoping that R
will catch up so that we statisticians will no longer need to struggle
with C, Fortran or Java.

Appendix:
My java code for variance of multinomial distribution:

 static double[][] var_multinomial(int n, double[] pi)
       {
           int p = pi.length;
           double[][] var = new double[p][p];
           for(int i=0; i<p; i++)
               for(int j=0; j<p; j++)
               {
                   if(i==j) var[i][j] = n*pi[i]*(1-pi[i]);
                   else var[i][j] = - n*pi[i]*pi[j];
               }
           return var;
       }

#
On Thu, 8 Aug 2002, Jason Liao wrote:
I don't think it will. Java is byte-compiled (which makes quite a big
difference), and was designed from the ground up to be byte-compiled.  R
is much harder to optimise (as Luke Tierney will tell you), partly because
there are so many things that cannot be known until run-time.  Some Java
enthusiasts say that there is no reason in principle that Java should be
slower than C++. I don't think this is true for R.

In addition, there has probably been more programmer time spent on
optimising Java virtual machines than has been spent on writing all
statistical software ever.  This is partly a matter of resources but
partly that there are very few tasks in R where a 20% speedup would pay
off the effort required to achieve it.

I think it's reasonable to hope that there will be less and less need for
writing Fortran or even Java, but this will come partly because computers
will be faster and partly because the Fortran will already be written.


	-thomas

#
At 09:31 AM 8/8/2002 -0700, A.J. Rossini wrote:
Can someone tell me if Edition 4 ("MASS4") is actually shipping yet?

-jday


#
No, but it is published and will be at the `JSM' next week.  I do have a
copy ....
On Thu, 8 Aug 2002, John Day wrote:

#
At 8/8/2002 at 03:34 PM, Brian D. Ripley wrote:
Congratulations to V and R.  I feel it only fair to warn them that, given 
the rate at which MASS is evolving, I am going to stop buying new editions 
and start leasing them.

...MHP

#
Dear helplist, I am trying to visualise a data matrix together with its
marginal densities.
Suppose dat <- matrix( rnorm(1000), nrow=25 )

I can make the main plot using image, levelplot, etc. But how do I
attach the plots for row sums and column sums to the top/bottom and
right-hand side of the original plot?

The only way I can think of is by using the par(mfrow=c(2,2)) command,
but the result is not desirable. Is there a useful command to do this?
If successful, I am thinking of affixing the boxplots by row and column
as well. I have gone through the lattice help but am still confused.
Thank you.

2 days later
#
If I correctly understand what you are trying to do, the way to do this
would be to use par("fig"), which adjusts the size of the plot region
within the current device.

par(mfrow) and par(mfcol) result in the device being split up into
equally sized plot regions, which does not achieve what I think you
want.

Below is some code as an example.  I am using barplot() for the row and
column sum plots in the example.  This may or may not be how you want
these plotted.

I will leave the fine tuning of the plot types, sizes, axis ranges,
labels, row/column alignments, etc. for you. This should give you a
conceptual feel for how to go about it.

HTH.

Marc

------------------

# define the matrix
dat <- matrix(rnorm(1000), nrow = 25)

# save the current parameters 
op <- par(no.readonly = TRUE) 


# open a new plot window 
plot.new() 


# adjust the plot region size for each area
# the LL and UR corners of the default region are (0,0) and (1,1) 
# par("fig") is (x1, x2, y1, y2) 


# first set the region for the larger plot at the bottom 
# note that the dimensions will overlap a bit in each region
par(fig = c(0.0, 0.75, 0.0, 0.75)) 


# now draw the larger plot 
image(dat)


# do not overwrite the current plot 
par(new = TRUE) 


# now set the second region for the small plot at the top 
par(fig = c(0.0, 0.75, .65, 1.0)) 

# plot the colSums
barplot(colSums(dat))


# do not overwrite the current plot 
par(new = TRUE) 


# now set the third region for the small plot to the right 
par(fig = c(0.65, 1.0, 0.0, 0.75)) 

# plot the rowSums
barplot(rowSums(dat), horiz = TRUE)


# restore the original parameters 
par(op) 


