Quicker way of combining vectors into a data.frame

[ Resending to the list as I fell foul of the too many recipients rule ]
Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline)
for your comments and suggestions.

I noticed that two of the vectors were named and so I removed the names
(names(vec) <- NULL) and that pushed the execution time for the function
from c. 40 seconds to c. 115 seconds and all the time was taken within
the data.frame(...) call. So having names *on* some of the vectors
seemed to help things along, which was the opposite of what I had
expected.

If I use the cbind method of Marc, then the execution time for the
function drops to c. 1 second (most of which is in the calculation of
one of the vectors). So I guess I can work round this now.
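
For anyone following along, here is a minimal sketch of the cbind route, assuming every vector is numeric and of the same length. The names x1, x2 and x3 are made up for illustration and are not the vectors from the function discussed here.

```r
# Hypothetical stand-ins for the real vectors: all numeric, same length
n  <- 4471
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

# cbind() builds a numeric matrix whose column names come from the
# argument names; as.data.frame() then converts the matrix in one step
m   <- cbind(x1 = x1, x2 = x2, x3 = x3)
fab <- as.data.frame(m)

str(fab)
```

Note that cbind() coerces its arguments to a common type, so this only suits cases where all columns are meant to be of one type (here, numeric).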

What I find interesting is that:

test.dat <- rnorm(4471)
system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

Whereas doing exactly the same thing with different data in the function
gives the following timings:

system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1] 173.415   0.260 192.192   0.000   0.000

Most of that time passed without a change in memory use, but for c. 5
seconds towards the end, memory use by R increased by 200-300 MB.

and...
system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1]  99.966   0.140 114.091   0.000   0.000

Again with a slight increase in memory usage in the last 5 seconds. So now,
having stripped the names of two of the vectors (so now all are
un-named), the un-named version of the data.frame call is almost twice
as slow as the named data.frame call.

If I leave the names on the two vectors that had them, I get the
following timings for those same calls:
system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1]  96.234   0.244 101.706   0.000   0.000
system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1] 13.597  0.088 15.868  0.000  0.000

So having the 2 named vectors and using the named version of the
data.frame call is the fastest combination.

This is all done within the debugger at the time when I would be
generating fab, and if I do,

system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

(as above) at this point in the debugger it is exceedingly quick.

I just don't understand what is going on with data.frame.

I have yet to try Prof. Ripley's suggestion of being a bit naughty with
R - I'll see if that is any quicker.

Once again, thanks to you all for your suggestions.

All the best,

G

Gavin,

One more note: even when I time the direct data frame creation on my
system with column names, again using the same 10 numeric columns, I
get:

system.time(DF1 <- data.frame(lc.ratio = Col1, Q = Col2, fNupt = Col3,
                                rho.n = Col4, rho.s = Col5, 
                                net.Nimm = Col6, net.Nden = Col7, 
                                CLminN = Col8, CLmaxN = Col9, 
                                CLmaxS = Col10))
[1] 0.012 0.000 0.028 0.000 0.000

str(DF1)
'data.frame':   4471 obs. of  10 variables:
 $ lc.ratio: num   0.1423  0.1873 -1.8129  0.0255 -1.7650 ...
 $ Q       : num   0.8340 -0.2387 -0.0864 -1.1184 -0.3368 ...
 $ fNupt   : num  -0.1718 -0.0549  1.5194 -1.6127 -1.2019 ...
 $ rho.n   : num  -0.740  0.240  0.522 -1.492  1.003 ...
 $ rho.s   : num  -0.2363 -1.6248 -0.3045  0.0294  0.1240 ...
 $ net.Nimm: num  -0.774  0.947 -1.098  0.809  1.216 ...
 $ net.Nden: num  -0.198 -0.135 -0.300 -0.618 -0.784 ...
 $ CLminN  : num   0.924 -3.265  0.211  0.813  0.262 ...
 $ CLmaxN  : num   0.3212 -0.0502 -0.9978  0.9005 -1.6535 ...
 $ CLmaxS  : num  -0.520  0.278 -0.546 -0.925  1.507 ...

So there is something else going on, either with your code or some other
conflict, unless my assumptions about your data are incorrect.

HTH,

Marc

%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Gavin Simpson                 [t] +44 (0)20 7679 0522
 ECRC & ENSIS, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:

I just don't understand what is going on with data.frame.

I think there is something about the data you're not telling us...

Could you e.g. do something like

str(data.frame(lc.ratio, Q,
                      fNupt,
                      rho.n, rho.s,
                      net.Nimm,
                      net.Nden,
                      CLminN,
                      CLmaxN,
                      CLmaxS))

and

str(list(lc.ratio, Q,
                      fNupt,
                      rho.n, rho.s,
                      net.Nimm,
                      net.Nden,
                      CLminN,
                      CLmaxN,
                      CLmaxS))
O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
Gavin Simpson wrote:
<snip />
I just don't understand what is going on with data.frame.

I think there is something about the data you're not telling us...
Yes, that I was doing something very, very silly that I thought would
work (produce a vector CLmaxN of the required length), but was in fact
blowing out to a huge named list. It was this that was causing the
massive increase in computation time in data.frame over cbind.
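
A quick check along the lines Peter suggested would have shown the problem. The sketch below is only an illustration under assumed data: 'good' and 'bad' are hypothetical stand-ins, with 'bad' playing the role of a vector that has accidentally become a list.

```r
# Hypothetical inputs: 'good' is an ordinary numeric vector, 'bad' has
# accidentally become a list with one element per value
good <- rnorm(5)
bad  <- as.list(rnorm(5))

args <- list(good = good, bad = bad)

# Summarise each input; an atomic vector should report is.list == FALSE
check <- data.frame(
  class   = sapply(args, function(x) class(x)[1]),
  length  = sapply(args, length),
  is.list = sapply(args, is.list)
)
print(check)
```

Any row with is.list equal to TRUE, or with an unexpected length, points at the argument worth inspecting with str() before it goes into data.frame().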

After correcting my mistake, timings for data.frame are:

system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1] 0.012 0.000 0.011 0.000 0.000
Browse[1]> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1] 0.008 0.000 0.018 0.000 0.000

One vector has names for some reason; removing them brings the un-named
data.frame version down to the named version's timing and makes no
difference to the named version:

Browse[1]> names(CLmaxS) <- NULL
Browse[1]> system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1] 0.008 0.000 0.016 0.000 0.000
Browse[1]> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1] 0.008 0.000 0.009 0.000 0.000
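
As a small aside on stripping names, here is a sketch using a made-up named vector 'v' (a hypothetical stand-in, not the real CLmaxS). Both forms below drop the names attribute; which one to use is a matter of taste.

```r
# Hypothetical named vector standing in for CLmaxS
v <- c(a = 1.5, b = 2.5, c = 3.5)

# unname() returns a copy without names and leaves 'v' untouched,
# whereas names(v) <- NULL modifies 'v' itself
v.plain <- unname(v)
names(v) <- NULL

is.null(names(v))
identical(v, v.plain)
```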

Apologies to the list for bothering you all with my stupidity and thank
you again to everyone who replied - I knew it was I who was doing
something wrong, but couldn't see it and thanks to your comments,
suggestions and queries I was able to work out what that was.

All the best,

G