
Fourteen patches to speed up R

Thanks very much for the patches.  I have spent a couple of days
working through them, and others have looked at some of them as well
and may continue to do so. Here are some notes on the individual
patches describing things I have done or decided not to do and things
others have done that I know about.

patch-transpose

 	Applied by Martin Maechler.


patch-for

 	Applied by Duncan Murdoch; revised by me. Some cosmetic
 	revisions, including going back to PROTECT_WITH_INDEX. Also
 	placed two variables 'n' and 'bgn' back under volatile
 	declarations.  Theoretically this shouldn't be needed, but gcc
 	-O2 -Wclobbered warns about them, so to be safe and eliminate
 	the warnings they are declared volatile as well.

 	The current byte code compiler actually stores the binding
 	cell rather than using setVar or defineVar -- this eliminates
 	the search and does not have the destructive effect of
	modifying an outer variable if the loop variable is removed;
	however, once the loop variable has been removed, subsequent
	iterations will reference an outer variable if one is
	available, or do other strange things. It doesn't
 	actually make much performance difference (at least in simple
 	examples) -- for that we would probably need to eliminate a
 	number of the function calls involved at the moment.  A better
 	solution for preserving the semantics in the case of user code
 	removing the loop variable might be to disallow removing the
 	loop variable, or to allow removal to be detected easily
 	(e.g. by having rm() put R_UnboundValue in the value cell).


patch-parens

 	Should not be applied.  `(` is conceptually a BUILTIN, i.e. a
 	strict function that evaluates all arguments in order. Making
 	it a SPECIAL is conceptually wrong and confuses issues of code
 	analysis and compilation. It is true that calling of BUILTINs
 	is currently less efficient than calling of SPECIALS because
 	the evaluated arguments for BUILTINs are stored in freshly
 	allocated lists, but it would be better to work towards making
 	that calling mechanism more efficient for all BUILTINs than to
 	confuse internal semantics by converting BUILTINs to SPECIALs.

 	We have currently a few things that are SPECIAL even though
 	they really have BUILTIN semantics, but they are SPECIAL
 	because of issues like needing to support missing arguments,
 	which BUILTINs do not. We should be moving these to BUILTIN
 	status, though perhaps not until BUILTIN calling performance
 	is improved.

 	Whether working on BUILTIN calling performance in the
 	interpreter makes sense depends on where the byte code
 	compiler gets to.  The current compiler is much more efficient
 	about the handling of inlined BUILTINs; the revised one in
	progress is likely to be much more efficient for all BUILTINs.

 	I would rather not make the proposed change for braces
 	(do_begin) as it makes it harder to find the relevant bits to
 	remove if we want to change this. Source references are very
 	useful, but we should be able to find a way of having them
 	without incurring runtime overhead unless they are actually
 	used. I have added an R_INLINE designation to getSrcref to
 	encourage the compiler to do the inlining. Timing differences
 	for test-parens.r are in the right direction but in the noise
 	level on an Ubuntu laptop:

 			   inline   byte
 		    devel    decl   comp

 	    curly:  10.25   10.13   1.94
 	    parens: 11.21   10.91   1.91

 	The byte comp column is for the current byte code engine and
 	compiler and illustrates that this approach has much more
 	promise.


patch-sum-prod

 	I had looked at this a while back and had an uncommitted
 	change along very similar lines.  I think the reason I didn't
 	commit this change is that I didn't like the code expansion
 	that resulted, and I still don't.  Looking at this again it
 	turns out there is a very simple code change that preserves
 	the code structure and achieves the same improvement in the
 	narm == FALSE case: reverse the test order from

             if (!ISNAN(x[i]) || !narm)  ...

 	to

             if (!narm || !ISNAN(x[i])) ...

         That way the expensive ISNAN test is only done when the result
         might matter. This has been done for real and complex sum and
         prod. It provides the same level of improvement for the narm
         == FALSE case as the patch, and for the narm == TRUE cases the
         differences are in the noise level on my system. This has been
         committed as r52925.

         The specific six timings from test-sum-prod.r on my Ubuntu
         laptop are

 	    R 2.11   devel   patch  switch

 	      3.25   10.27    2.50    2.47
 	     10.69   10.39   10.25    9.99
 	     10.73   10.53   10.37   10.02
 	      3.80   10.77    3.60    3.57
 	     10.57   10.93   10.32   10.89
 	     10.54   11.07   10.45   11.09


patch-evalList

 	This looks fine (standard Lisp idiom) and has been applied as
 	r52930. Needed to add initial values for 'tail' variables to
 	turn off uninitialized variable warnings (r52935). Some
 	timings:

 			    R        R    patch  byte
 			 2.11    devel evalList  comp

 	    test-em     18.49    15.13    14.08  4.34
 	    p1          39.52    30.80    28.72  8.80

 	Again a compilation approach should produce much more
 	substantial gains for code dominated by interpreter overhead.
 	Here p1 is the example

             p1 <- function(x) {
                 for (i in seq_along(x))
                     x[i] <- x[i] + 1
                 x
             }
             x <- rep(1, 1e7)
             system.time(p1(x))


patch-square

 	The analysis provided with this patch needs fleshing out.  It
 	is useful to try to understand where the speed gains come from
 	and to make changes that can help other code as well.

 	The y == 2.0 test is fairly cheap.  The speed gain of the
 	patch comes mostly from avoiding the overhead that comes
 	before the y == 2.0 test, mainly the call into R_pow and two
 	calls to FINITE. r52937 (r52967 for 2.12) moves the test for y
 	== 2.0 to the top of the R_pow function, thus avoiding the
 	FINITE calls; this cuts the per value cost roughly in half on
 	one test platform at least. r52938 (r52968 for 2.12) defines
 	R_POW as an inline function that handles the y == 2.0 case in
 	line and only calls R_pow in the general case. This cuts the
 	time again by roughly a third.

 	On some platforms further improvement comes from avoiding the
 	overhead in mod_iterate for cases where one argument is a
 	scalar or the arguments are of equal length. r52965 (r52970
 	for 2.12) addresses this using the same approach as for
 	addition, etc. from patch-vec-arith; this should be abstracted
 	into a macro and used consistently in a few more cases.
	Special handling of the scalar exponent case only speeds
	things up by a few percent on my laptop and one other
	machine, and actually slows things down on a third platform
	(presumably a code/optimizer interaction), though it does
	help some on a fourth platform.  To keep the code simpler I
	prefer not to
 	make this change now, at least until we have had a chance to
 	look at abstracting the iteration process into a macro.


patch-matprod

 	I don't have particularly strong views on this one and will
 	leave it to others in R-core for now. One note: on my x86_64
 	Ubuntu laptop replacing ISNAN with

 	#undef ISNAN
 	static R_INLINE double ISNAN(double x) { return x != x; }

 	produces a fairly substantial improvement.  Dropping the ISNAN
 	test entirely speeds things up some more, and going to an
 	inline version of matrix multiply helps more for the smaller
 	cases but not much for the larger ones in the test-matprod
 	examples if the inline uses LDOUBLE for accumulation.  It
 	helps in all these cases if the inline uses double.  Here are
 	some timings that might be useful:

                                          R  inline    drop  inline  inline
                 		     devel   ISNAN   ISNAN LDOUBLE  double
 	V-V, length 1000:             8.64    4.71    3.05    1.67    1.36
 	M-V, 5x1000 times 1000x1:     4.95    2.60    1.61    1.80    1.44
 	M-V, 50x1000 times 1000x1:    3.72    1.75    0.91    2.06    1.64
 	M-M, 2x1000 times 1000x3:     5.72    3.60    2.87    1.57    1.16
 	M-M, 5x1000 times 1000x3:     8.99    5.87    4.55    5.13    4.03
 	M-M, 10x1000 times 1000x10:  10.05    7.71    6.75    7.61    5.96
 	M-M, 10x1000 times 1000x11:  10.87    8.40    7.38    8.35    6.51

 	On one of our lab machines, where we are still running 2.10.1
 	but also have MKL BLAS versions available I got

                                          R      MKL      MKL
                                     2.10.1      seq   thread
 	V-V, length 1000:            6.265    5.250    5.255
 	M-V, 5x1000 times 1000x1:    3.197    2.832    2.837
 	M-V, 50x1000 times 1000x1:   2.525    2.260    2.263
 	M-M, 2x1000 times 1000x3:    3.478    2.654    2.648
 	M-M, 5x1000 times 1000x3:    5.598    4.258    4.272
 	M-M, 10x1000 times 1000x10:  6.693    3.749    3.796
 	M-M, 10x1000 times 1000x11:  7.221    3.981    4.042


patch-fast-base
patch-fast-spec

 	I tried the patch-fast-spec patch and did not see consistent
 	performance improvement -- slightly faster on one example,
 	slightly slower on another. So it is not at all clear to me
 	that this provides any real benefit.  The code is certainly
 	made more complex, so I do not think these should be applied.
 	Optimizing access to base functions in general and operators
 	in particular is one of the things a byte code compiler will
 	do, at least at reasonable optimization levels. The current
 	byte code compiler speeds up the EM example by about a factor
 	of three; the revised one I am working on will hopefully do
 	even better.

 		      R    fast    byte
 		  devel    spec    code
 	    em    13.83   12.75    4.28
 	    p1    28.30   29.91    9.04


patch-vec-arith

 	Looks OK.  Applied to trunk as r52946 (r52969 in 2-12-branch).
 	Eventually it may make sense to revisit this and maybe use a
 	macro to abstract out common code and apply it to some other
 	cases as well.


patch-save-alloc

 	This is similar to something I have been experimenting with in
 	the context of byte code compilation. In that setting there is
 	more opportunity for optimization by looking at where the
 	result of a computation is being used and possibly overwriting
 	the target. I'm not sure this is worth doing in the
 	interpreter, and my timings give somewhat mixed results (that
 	also vary for non-obvious reasons with seemingly small code
 	changes). I would prefer to defer committing to this idea in
 	the interpreter until more is learned from the byte code
 	experiments and about exactly where gains, if any, might be
 	coming from.


patch-vec-subset
patch-dollar

 	I would prefer if someone in R-core who is more familiar with
 	the subsetting/dollar code than I am could have a look at
 	these.


patch-protect

	As I mentioned in my initial reply, I've tried this before on
	the theory that it should make a difference, but it didn't
	then and it still doesn't now, at least not relative to the
	noise level on my machines on the tests I ran.  So I don't
	think this is worth doing now, but it is worth keeping in
	mind and trying again as other factors improve.

luke
On Fri, 3 Sep 2010, Radford Neal wrote: