[ Resending to the list as I fell foul of the too many recipients rule ]
On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:
Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline) for your comments and suggestions.

I noticed that two of the vectors were named, so I removed the names (names(vec) <- NULL). That pushed the execution time for the function from c. 40 seconds to c. 115 seconds, and all of the time was taken within the data.frame(...) call. So having names *on* some of the vectors seemed to help things along, which was the opposite of what I had expected.

If I use the cbind method of Marc, then the execution time for the function drops to c. 1 second (most of which is in the calculation of one of the vectors). So I guess I can work round this now.

What I find interesting is that:

> test.dat <- rnorm(4471)
> system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
+   col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+   col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

Whereas doing exactly the same thing with different data in the function gives the following timings:

> system.time(fab <- data.frame(lc.ratio, Q,
+   fNupt,
+   rho.n, rho.s,
+   net.Nimm,
+   net.Nden,
+   CLminN,
+   CLmaxN,
+   CLmaxS))
[1] 173.415 0.260 192.192 0.000 0.000

Most of that was without a change in memory, but towards the end, for c. 5 seconds, memory use by R increased by 200-300 MB.

and...

> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+   fNupt = fNupt,
+   rho.n = rho.n, rho.s = rho.s,
+   net.Nimm = net.Nimm,
+   net.Nden = net.Nden,
+   CLminN = CLminN,
+   CLmaxN = CLmaxN,
+   CLmaxS = CLmaxS))
[1] 99.966 0.140 114.091 0.000 0.000

Again with a slight increase in memory usage in the last 5 seconds.

So now, having stripped the names from two of the vectors (so that all are now un-named), the un-named version of the data.frame call is almost twice as slow as the named data.frame call.

If I leave the names on the two vectors that had them, I get the following timings for those same calls:

> system.time(fab <- data.frame(lc.ratio, Q,
+   fNupt,
+   rho.n, rho.s,
+   net.Nimm,
+   net.Nden,
+   CLminN,
+   CLmaxN,
+   CLmaxS))
[1] 96.234 0.244 101.706 0.000 0.000

> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+   fNupt = fNupt,
+   rho.n = rho.n, rho.s = rho.s,
+   net.Nimm = net.Nimm,
+   net.Nden = net.Nden,
+   CLminN = CLminN,
+   CLmaxN = CLmaxN,
+   CLmaxS = CLmaxS))
[1] 13.597 0.088 15.868 0.000 0.000

So having the 2 named vectors and using the named version of the data.frame call is the fastest combination.

This is all done within the debugger at the point where I would be generating fab, and if I do

> system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
+   col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+   col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

(as above) at this point in the debugger, it is exceedingly quick.

I just don't understand what is going on with data.frame. I have yet to try Prof. Ripley's suggestion of being a bit naughty with R - I'll see if that is any quicker.

Once again, thanks to you all for your suggestions.

All the best,

G

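For anyone wanting to reproduce the comparison, here is a minimal sketch that times a direct data.frame() call against one possible reading of the cbind approach (bind the columns into a matrix, then convert). The columns are synthetic stand-ins generated with rnorm(); the real vectors are not shown in this thread, so the timings will not match the slow case above, and the exact form of Marc's cbind method is an assumption.

```r
# Synthetic stand-ins for the ten columns (assumption: plain numeric vectors).
set.seed(1)
n <- 4471
cols <- replicate(10, rnorm(n), simplify = FALSE)
names(cols) <- c("lc.ratio", "Q", "fNupt", "rho.n", "rho.s",
                 "net.Nimm", "net.Nden", "CLminN", "CLmaxN", "CLmaxS")

# Route 1: direct call to data.frame() with named arguments.
t.df <- system.time(fab1 <- do.call(data.frame, cols))

# Route 2: bind the columns into a numeric matrix first, then convert.
t.cbind <- system.time(fab2 <- as.data.frame(do.call(cbind, cols)))

print(t.df)
print(t.cbind)

# Both routes should yield the same column names and dimensions.
stopifnot(identical(names(fab1), names(fab2)),
          identical(dim(fab1), dim(fab2)))
```

Swapping the synthetic list for the real vectors (for example list(lc.ratio = lc.ratio, Q = Q, ...)) would show whether the slowdown follows the data or the call.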
Gavin,

One more note, which is that even timing the direct data frame creation on my system with colnames, again using the same 10 numeric columns, I get:

> system.time(DF1 <- data.frame(lc.ratio = Col1, Q = Col2, fNupt = Col3,
+   rho.n = Col4, rho.s = Col5,
+   net.Nimm = Col6, net.Nden = Col7,
+   CLminN = Col8, CLmaxN = Col9,
+   CLmaxS = Col10))
[1] 0.012 0.000 0.028 0.000 0.000

> str(DF1)
'data.frame': 4471 obs. of 10 variables:
 $ lc.ratio: num 0.1423 0.1873 -1.8129 0.0255 -1.7650 ...
 $ Q       : num 0.8340 -0.2387 -0.0864 -1.1184 -0.3368 ...
 $ fNupt   : num -0.1718 -0.0549 1.5194 -1.6127 -1.2019 ...
 $ rho.n   : num -0.740 0.240 0.522 -1.492 1.003 ...
 $ rho.s   : num -0.2363 -1.6248 -0.3045 0.0294 0.1240 ...
 $ net.Nimm: num -0.774 0.947 -1.098 0.809 1.216 ...
 $ net.Nden: num -0.198 -0.135 -0.300 -0.618 -0.784 ...
 $ CLminN  : num 0.924 -3.265 0.211 0.813 0.262 ...
 $ CLmaxN  : num 0.3212 -0.0502 -0.9978 0.9005 -1.6535 ...
 $ CLmaxS  : num -0.520 0.278 -0.546 -0.925 1.507 ...

So there is something else going on, either with your code or some other conflict, unless my assumptions about your data are incorrect.

HTH,

Marc

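One way to check those assumptions about the data might be to tabulate, for each candidate column, the properties that could plausibly change how data.frame() treats it. The sketch below is only illustrative: describe_cols() is a hypothetical helper, not an existing R function, it expects every argument to be named, and the two example vectors are made up.

```r
# Hypothetical helper: one row per column, reporting class, length,
# whether names or a dim attribute are present, and size in bytes.
describe_cols <- function(...) {
  cols <- list(...)  # arguments should all be named
  data.frame(
    class     = vapply(cols, function(x) class(x)[1], character(1)),
    length    = vapply(cols, length, integer(1)),
    has.names = vapply(cols, function(x) !is.null(names(x)), logical(1)),
    has.dim   = vapply(cols, function(x) !is.null(dim(x)), logical(1)),
    bytes     = vapply(cols, function(x) as.numeric(object.size(x)), numeric(1)),
    row.names = names(cols),
    stringsAsFactors = FALSE
  )
}

# Made-up example: one plain vector and one named vector.
x <- rnorm(5)
y <- setNames(rnorm(5), letters[1:5])
res <- describe_cols(plain = x, named = y)
print(res)
```

Calling it on the real vectors, e.g. describe_cols(lc.ratio = lc.ratio, Q = Q, ...), would show at a glance whether any of them is something other than a plain numeric vector of length 4471.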
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Gavin Simpson                 [t] +44 (0)20 7679 0522
 ECRC & ENSIS, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%