Skip to content
Prev 105373 / 398513 Next

Quicker way of combining vectors into a data.frame

[ Resending to the list as I fell foul of the too many recipients rule ]
On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:
Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline)
for your comments and suggestions.

I noticed that two of the vectors were named and so I removed the names
(names(vec) <- NULL) and that pushed the execution time for the function
from c. 40 seconds to c. 115 seconds and all the time was taken within
the data.frame(...) call. So having names *on* some of the vectors
seemed to help things along, which was the opposite of what i had
expected.

If I use the cbind method of Marc, then the execution time for the
function drops to c. 1 second (most of which is in the calculation of
one of the vectors). So I guess I can work round this now.

What I find interesting is that:

test.dat <- rnorm(4471)
test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

Whereas doing exactly the same thing with different data in the function
gives the following timings:

system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1] 173.415   0.260 192.192   0.000   0.000

Most of that was without a change in memory, but towards the end for c.
5 seconds memory use by R increased by 200-300 MB.

and...
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1]  99.966   0.140 114.091   0.000   0.000

Again with a slight increase in memory usage in last 5 seconds. So now,
having stripped the names of two of the vectors (so now all are
un-named), the un-named version of the data.frame call is almost twice
as slow as the named data.frame call.

If I leave the names on the two vectors that had them, I get the
following timings for those same calls
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1]  96.234   0.244 101.706   0.000   0.000
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1] 13.597  0.088 15.868  0.000  0.000

So having the 2 named vectors and using the named version of the
data.frame call is the fastest combination.

This is all done within the debugger at the time when I would be
generating fab, and if I do,

system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 =
test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

(as above) at this point in the debugger it is exceedingly quick.

I just don't understand what is going on with data.frame.

I have yet to try Prof. Ripley's suggestion of being a bit naughty with
R - I'll see if that is any quicker.

Once again, thanks to you all for your suggestions.

All the best,

G