Quicker way of combining vectors into a data.frame

3 messages · Gavin Simpson, Peter Dalgaard

#
[ Resending to the list as I fell foul of the too many recipients rule ]
On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:
Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline)
for your comments and suggestions.

I noticed that two of the vectors were named and so I removed the names
(names(vec) <- NULL) and that pushed the execution time for the function
from c. 40 seconds to c. 115 seconds and all the time was taken within
the data.frame(...) call. So having names *on* some of the vectors
seemed to help things along, which was the opposite of what I had
expected.

If I use the cbind method of Marc, then the execution time for the
function drops to c. 1 second (most of which is in the calculation of
one of the vectors). So I guess I can work round this now.
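Marc's exact code is not quoted in this message, so the following is only a minimal sketch of what a cbind-based approach might look like. The three vectors are made-up stand-ins for the ones used in the function.

```r
# Hypothetical stand-ins for three of the vectors in the thread
n <- 4471
lc.ratio <- rnorm(n)
Q <- rnorm(n)
fNupt <- rnorm(n)

# cbind() builds a numeric matrix whose column names come from the
# argument names; as.data.frame() then converts that matrix
fab <- as.data.frame(cbind(lc.ratio, Q, fNupt))
str(fab)
```

Because cbind() coerces everything to a common type, this route is only a safe substitute for data.frame() when all the inputs are plain vectors of the same type.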

What I find interesting is that:

test.dat <- rnorm(4471)
system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 =
test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

Whereas doing exactly the same thing with different data in the function
gives the following timings:

system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1] 173.415   0.260 192.192   0.000   0.000

Most of that time passed with no change in memory use, but for c. 5
seconds towards the end, memory use by R increased by 200-300 MB.

and...
system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1]  99.966   0.140 114.091   0.000   0.000

Again with a slight increase in memory usage in the last 5 seconds. So now,
having stripped the names of two of the vectors (so now all are
un-named), the un-named version of the data.frame call is almost twice
as slow as the named data.frame call.

If I leave the names on the two vectors that had them, I get the
following timings for those same calls:

system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1]  96.234   0.244 101.706   0.000   0.000
system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1] 13.597  0.088 15.868  0.000  0.000

So having the 2 named vectors and using the named version of the
data.frame call is the fastest combination.

This is all done within the debugger at the time when I would be
generating fab, and if I do,

system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 =
test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

(as above) at this point in the debugger it is exceedingly quick.
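For readers comparing the timings, the five numbers printed by system.time() in these transcripts are the elements of a "proc_time" object: user CPU, system CPU, elapsed (wall-clock) time, and the user and system times of any child processes. A small sketch, using a cut-down version of the test call above, of how to pick out the elapsed time by name:

```r
test.dat <- rnorm(4471)

# Keep the timing object instead of just printing it
pt <- system.time(z <- data.frame(col1 = test.dat, col2 = test.dat))

names(pt)         # user.self, sys.self, elapsed, user.child, sys.child
pt[["elapsed"]]   # wall-clock seconds for the call
```

Recent versions of R print only a three-number summary (user, system, elapsed), but the underlying object still has five elements.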

I just don't understand what is going on with data.frame.

I have yet to try Prof. Ripley's suggestion of being a bit naughty with
R - I'll see if that is any quicker.

Once again, thanks to you all for your suggestions.

All the best,

G
#
Gavin Simpson wrote:
I think there is something about the data you're not telling us...

Could you e.g. do something like

str(data.frame(lc.ratio, Q,
                      fNupt,
                      rho.n, rho.s,
                      net.Nimm,
                      net.Nden,
                      CLminN,
                      CLmaxN,
                      CLmaxS))


and

str(list(lc.ratio, Q,
                      fNupt,
                      rho.n, rho.s,
                      net.Nimm,
                      net.Nden,
                      CLminN,
                      CLmaxN,
                      CLmaxS))
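A compact alternative to reading through the full str() output is to tabulate the class, length and atomicity of each input. The sketch below uses two made-up inputs (`good` and `bad` are hypothetical names, not objects from the thread) to show how an accidental list would stand out:

```r
# Hypothetical inputs: one well-formed vector and one accidental list
good <- c(1.5, 2.5, 3.5)
bad  <- list(a = 1.5, b = 2.5, c = 3.5)

cols <- list(good = good, bad = bad)

# One row per input: its class, its length, and whether it is a plain
# atomic vector (lists are not atomic)
check <- data.frame(
  class  = vapply(cols, function(x) class(x)[1], character(1)),
  length = vapply(cols, length, integer(1)),
  atomic = vapply(cols, is.atomic, logical(1)),
  stringsAsFactors = FALSE
)
print(check)
```

Any input reported as non-atomic, or with an unexpected length, deserves a closer look with str().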
#
On Fri, 2006-12-01 at 12:13 +0100, Peter Dalgaard wrote:
<snip />
Yes, that I was doing something very, very silly that I thought would
work (produce a vector CLmaxN of the required length), but was in fact
blowing out to a huge named list. It was this that was causing the
massive increase in computation time in data.frame over cbind.
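The actual mistake is not shown in the thread, so the following is only an illustration of one way a named list can hurt data.frame() while leaving cbind() fast. It assumes a list with one element per value; `CLmaxN.bad` and the "site" names are hypothetical, and n is kept tiny.

```r
n <- 5                       # small n for illustration; the thread used 4471
x <- rnorm(n)

# Hypothetical bad input: a named list with one element per value
CLmaxN.bad <- as.list(x)
names(CLmaxN.bad) <- paste0("site", seq_len(n))

# data.frame() turns every list element into its own column and recycles
# each one to n rows, so the result has 1 + n columns
wide <- data.frame(x, CLmaxN.bad)
dim(wide)                    # 5 6

# cbind() just builds a list-mode matrix with one column per argument
m <- cbind(x, CLmaxN.bad)
dim(m)                       # 5 2

# Flattening the list gives the plain vector that was intended
CLmaxN <- unname(unlist(CLmaxN.bad))
ok <- data.frame(x, CLmaxN)
dim(ok)                      # 5 2
```

With n = 4471, a result of this shape would have several thousand columns of 4471 rows each, which would be in line with the slow timings and memory growth reported earlier, though the thread does not confirm that this is exactly what happened.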

After correcting my mistake, timings for data.frame are:

system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1] 0.012 0.000 0.011 0.000 0.000
Browse[1]> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1] 0.008 0.000 0.018 0.000 0.000

One vector has names for some reason; removing them brings the un-named
data.frame version down to the named version's timing and makes no
difference to the named version:

Browse[1]> names(CLmaxS) <- NULL
Browse[1]> system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1] 0.008 0.000 0.016 0.000 0.000
Browse[1]> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1] 0.008 0.000 0.009 0.000 0.000
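One documented way in which names on an input vector affect data.frame() is that row names are taken from the first component that has suitable names. Whether this accounts for the small timing difference above is not established in the thread; the sketch below, with a made-up vector `v`, only shows the behaviour itself.

```r
v <- c(a = 10, b = 20, c = 30)            # hypothetical named vector

df.named <- data.frame(val = v)
rownames(df.named)                        # "a" "b" "c"

df.plain <- data.frame(val = unname(v))   # same effect as names(v) <- NULL
rownames(df.plain)                        # "1" "2" "3"
```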

Apologies to the list for bothering you all with my stupidity, and thank
you again to everyone who replied. I knew it was I who was doing
something wrong but couldn't see it; thanks to your comments,
suggestions and queries I was able to work out what that was.

All the best,

G