I have been lurking on this list for a while and searching the archives to find out how one learns to write fast R code. One solution seems to be to write part of the code not in R but in C. However, after finding a benchmark article (http://www.sciviews.org/other/benchmark.htm) I have become more interested in making the R code itself more efficient. I would like to find more info about this. I have tried to mail the contact person for the benchmark, but I have so far received no reply. I am not an R programmer (or statistician), so I do not know R well. I am looking for some advice about writing fast R code. What about the different data types, for example? Is there some good place to start looking for more info about this? Thanks for any pointers, Lennart
How to write efficient R code
9 messages · Lennart.Borgman@astrazeneca.com, Brian Ripley, Tom Blackwell +6 more
`S Programming' (see the FAQ) has a whole chapter with case studies. Beware that what is efficient under one version of S is not necessarily so under another, and that applies to R today vs R in 1999 (when those examples were done). However, the general principles are good for all time.
On Tue, 17 Feb 2004 Lennart.Borgman at astrazeneca.com wrote:
I have been lurking on this list for a while and searching the archives to find out how one learns to write fast R code. One solution seems to be to write part of the code not in R but in C. However, after finding a benchmark article (http://www.sciviews.org/other/benchmark.htm) I have become more interested in making the R code itself more efficient. I would like to find more info about this. I have tried to mail the contact person for the benchmark, but I have so far received no reply. I am not an R programmer (or statistician), so I do not know R well. I am looking for some advice about writing fast R code. What about the different data types, for example? Is there some good place to start looking for more info about this?
Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Lennart -

My two rules are:

1. Be straightforward. Don't try to be too fancy. Don't worry about execution time until you have the WHOLE thing programmed and DOING everything you want it to. Then profile it, if it's really going to be run more than 1000 times. Execution time is NOT the issue. Code maintainability IS.

2. Use vector operations wherever possible. Avoid explicit loops. However, the admonition to avoid loops is probably much less important now than it was with the Splus of 10 or 15 years ago.

(Not that I succeed in obeying these rules myself, all the time.)

Remember: execution time is not the issue. memory size may be. clear, maintainable code definitely is.

In my opinion, the occasional questions you will see on this list about incorporating C code, or trying to specify one data type over another, come up only in very unusual, special cases. Almost everything can be done without loops in straight R, if you think about it first.

- tom blackwell - u michigan medical school - ann arbor -
______________________________________________ R-help at stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Lennart.Borgman at astrazeneca.com
Sent: Wednesday, February 18, 2004 3:36 AM
To: r-help at stat.math.ethz.ch
Subject: [R] How to write efficient R code

I have been lurking on this list for a while and searching the archives to find out how one learns to write fast R code. One solution seems to be to write part of the code not in R but in C. However, after finding a benchmark article (http://www.sciviews.org/other/benchmark.htm) I have become more interested in making the R code itself more efficient. I would like to find more info about this. I have tried to mail the contact person for the benchmark, but I have so far received no reply.
One way to make your code more efficient is to use "vectorisation" -- vectorise your code. I'm not sure where you can find more information about it, but an example would be to use the apply() function on a data frame instead of using a loop. Avoid loops if you can.

Kevin
--------------------------------------------
Ko-Kang Kevin Wang, MSc(Hon)
SLC Stats Workshops Co-ordinator
The University of Auckland
New Zealand
On Tue, 2004-02-17 at 12:21, Tom Blackwell wrote:
Lennart -
My two rules are:
1. Be straightforward. Don't try to be too fancy. Don't worry
about execution time until you have the WHOLE thing programmed
and DOING everything you want it to. Then profile it, if it's
really going to be run more than 1000 times. Execution time
is NOT the issue. Code maintainability IS.
2. Use vector operations wherever possible. Avoid explicit loops.
However, the admonition to avoid loops is probably much less
important now than it was with the Splus of 10 or 15 years ago.
(Not that I succeed in obeying these rules myself, all the time.)
Remember: execution time is not the issue. memory size may be.
clear, maintainable code definitely is.
I've been using R for maybe 6 months or less and am by no means an R expert. But the above two points are extremely valid - my policy is to always write code that I can read 2 months later without comments (though in the end I do add them) - even if it requires loops. However, once I'm sure the results are right I spend time trying to vectorise the code. And that has improved performance by orders of magnitude (IMO, it's also more elegant to have a one-line vector operation rather than a loop). Of course, as I progress towards the status of R expert I hope to be able to write vectorised code on the fly :)

-------------------------------------------------------------------
Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
So the Zen master asked the hot-dog vendor, "Can you make me one with everything?" - TauZero on Slashdot
You may also be interested in reading the latest article on artima.com (http://www.artima.com/intv/abstreffi.html) where Bjarne Stroustrup (the creator of C++) discusses some of the benefits and costs of abstraction, as well as premature vs. prudent optimisation. It is important to remember that the key to improving execution speeds is profiling your running code - we're not good at anticipating what parts of a program will be slow. It's much better to run the program and see. Hadley
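The measure-first advice above can be sketched with base R's system.time(). This is only an illustration: the two functions and the input size are hypothetical, not taken from the thread. For longer-running code, Rprof() and summaryRprof() give a per-function breakdown instead of a single total.

```r
# Two hypothetical implementations of the same task: a running total.
running_total_loop <- function(x) {
  out <- numeric(length(x))   # result vector allocated once, up front
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    out[i] <- total
  }
  out
}

running_total_vec <- function(x) cumsum(x)   # built-in vectorised equivalent

x <- as.numeric(1:200000)

# Time each version rather than guessing which one is slow.
t_loop <- system.time(r_loop <- running_total_loop(x))
t_vec  <- system.time(r_vec  <- running_total_vec(x))

print(t_loop)
print(t_vec)
```

The timings depend on the machine and the R version, so none are quoted here; the point is to check that both versions agree and then compare the measured times.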
Lennart To learn about "data types" take a look at the early chapters of An Introduction To R available at http://cran.r-project.org/manuals.html Richard
Richard E. Remington III Statistician KERN Statistical Services, Inc. PO Box 1046 Boise, ID 83701 Tel: 208.426.0113 KernStat.com
On Wed, 18 Feb 2004, Ko-Kang Kevin Wang wrote:
One way to make your code more efficient is to use "vectorisation" -- vectorise your code. I'm not sure where you can find more information about it, but an example would be to use the apply() function on a data frame instead of using a loop. Avoid loops if you can.
Umm. No. Vectorization is definitely a good thing -- just about the only coding change that improves both clarity and speed -- but replacing a loop with apply() is not vectorisation in that sense. Except for some cases of lapply, the apply functions are mostly clarity optimisations rather than speed optimisations. -thomas
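To illustrate the distinction drawn above, here is a small sketch with made-up data. The vector x, the matrix m and the computations are hypothetical. sapply() and apply() still call an R function once per element or per row, whereas the vectorised forms hand the whole object to a single call.

```r
x <- as.numeric(1:100000)

# Explicit loop over a preallocated result
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  loop_result[i] <- sqrt(x[i]) + 1
}

# sapply(): tidier than the loop, but still one function call per element
sapply_result <- sapply(x, function(v) sqrt(v) + 1)

# Vectorised: sqrt() and + operate on the whole vector at once
vec_result <- sqrt(x) + 1

# The same idea for row sums of a matrix
m <- matrix(as.numeric(1:20), nrow = 4)
apply_sums <- apply(m, 1, sum)   # calls sum() once per row
row_sums   <- rowSums(m)         # single vectorised call
```

All three versions of each computation give the same values; only the vectorised ones avoid the per-element function calls.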
Rajarshi Guha <rxg218 at psu.edu> writes:
I've been using R for maybe 6 months or less and am by no means an R expert. But the above two points are extremely valid - my policy is to always write code that I can read 2 months later without comments (though in the end I do add them) - even if it requires loops. However, once I'm sure the results are right I spend time trying to vectorise the code. And that has improved performance by orders of magnitude (IMO, it's also more elegant to have a one-line vector operation rather than a loop).
All true. A couple of additional remarks:
1) Some constructs are spectacularly inefficient, as you'll realize
when you think about what they have to do. One standard example is
for (i in 1:10000)
x[i] <- f(i)
which becomes much faster if you preallocate x <- numeric(10000)
(never mind that sapply will do it more neatly). Without
preallocation, R will need to extend the array on every iteration,
which requires the whole array to be copied to a new location. It is
a very good idea to keep your eyes open for these situations and
try to avoid them.
2) On the other hand, don't be trapped by efficiency differences that
might be "accidental" and go away in later releases. We've seen a
couple of cases where the Wrong Way was actually faster than the
Right Way (details elude me -- something with deparse/reparse vs.
symbolic computations, I suspect), but this easily leads to
code that is hard to read, and may have subtle bugs.
O__ ---- Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
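The preallocation point in remark 1 above can be sketched as follows. The function f here is a hypothetical stand-in for the f(i) in the message, and how large the speed difference is will depend on the R version and on n.

```r
f <- function(i) i^2   # hypothetical stand-in for f(i)
n <- 10000

# Growing: x_grow starts empty and has to be extended as the loop runs
x_grow <- numeric(0)
for (i in 1:n) {
  x_grow[i] <- f(i)
}

# Preallocated: the full-length vector exists before the loop starts
x_pre <- numeric(n)
for (i in 1:n) {
  x_pre[i] <- f(i)
}

# sapply() builds the result without an explicit loop or manual allocation
x_sapply <- sapply(1:n, f)
```

All three produce the same vector; wrapping each variant in system.time() is one way to see what the difference amounts to on a given installation.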