
Why is there no c.factor?

How do you know?  Maybe it's used a lot but the users had no need to tell you
what they were using. The exact opposite might in fact be the case, i.e.
because concat is so good in S-PLUS, you just never hear of problems with it
from the users. That might be a very good sign.
I'd be happy to test it. I'm a bit concerned about performance though, given
what you said about repeated recursive calls and dispatch. Could you run
the following test in S-PLUS please and post back the timing?  If this small
100MB example is fine, then we could proceed to a 64-bit 10GB test. This is
quite nippy at the moment in R (1.1 sec). I'd be happy with a better way as
long as speed isn't compromised.

set.seed(1)
L = as.vector(outer(LETTERS, LETTERS, paste, sep=""))   # union set of 676 levels
F = lapply(1:100, function(i) {                         # create 100 factors
   # each factor 1MB large (262144 integers), plus small amount for the levels
   f = sample(1:100, 1*1024^2 / 4, replace=TRUE)
   levels(f) = sample(L, 100)                           # pick 100 levels from the union set
   class(f) = "factor"
   f
})
[1] RT DM CO JV BG KU
100 Levels: YC FO PN IL CB CY HQ ...
[1] RK PD FE SG SJ CQ
100 Levels: JV FV DX NL XB ND CY QQ ...
With c.factor from data.table, as posted, placed in .GlobalEnv
user  system elapsed
   0.81    0.32    1.12
[1] RT DM CO JV BG KU        # looks right, comparing to F[[1]] above
676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
[1] RK PD FE SG SJ CQ        # looks right, comparing to F[[2]] above
676 Levels: AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ BA BB BC BD BE BF ... ZZ
[1] TRUE
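
For anyone reading without the earlier post to hand, the idea of combining factors over the union of their levels can be sketched roughly as below. This is only an illustration and not the data.table c.factor itself; the name c_factor_sketch is made up here, and sorting the union of levels is an assumption based on the sorted 676 levels printed above.

```r
# Hypothetical sketch, NOT the data.table implementation:
# remap each factor's integer codes onto the sorted union of all levels.
c_factor_sketch <- function(...) {
  args <- list(...)
  # Sorted union of the levels of every input factor
  all_levels <- sort(unique(unlist(lapply(args, levels), use.names = FALSE)))
  # For each factor, translate old codes to positions in all_levels
  codes <- unlist(lapply(args, function(f) {
    match(levels(f), all_levels)[as.integer(f)]
  }), use.names = FALSE)
  structure(codes, levels = all_levels, class = "factor")
}

a <- factor(c("x", "y"))
b <- factor(c("z", "x"))
g <- c_factor_sketch(a, b)
```

With the test data above, the equivalent call would be something like do.call(c_factor_sketch, F). Only the integer codes are remapped, so no character vector the length of the data is created.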

So I guess this would be compared to the following in S-PLUS?

system.time(G <- do.call("concat", F))

or maybe it's just the following:

system.time(G <- concat(F))

I don't have S-PLUS so I can't test that myself.


"William Dunlap" <wdunlap at tibco.com> wrote in message 
news:77EB52C6DD32BA4D87471DCD70C8D7000275B4CA at NA-PA-VBE03.na.tibco.com...
Yes, c() should have been put on the deprecated list a couple
of decades ago, since people expect it to do too many
incompatible things.  And factor should have been a virtual
class, with subclasses "FixedLevels" (e.g., Sex) or "AdHocLevels"
(e.g., FamilyName), so c() and [()<- could do the appropriate
thing in either case.

Back to reality, S+ has a concat(...) function, whose comments say:
# This function works like c() except that names of arguments are
# ignored.  That is, it concatenates its arguments into a single
# S vector object, without considering the names of the arguments,
# in the order that the arguments are given.
#
# To make this function work for new classes, it is only necessary
# to make methods for the concat.two function, which concatenates
# two vectors; recursion will take care of the rest.
concat() is not generic but it repeatedly calls concat.two(x,y), an
SV4-generic that dispatches on the classes of x and y.  Thus you
can easily predict the class of concat(x,y,z), although it may not
be the same as the class of concat(z,y,x), given suitably bizarre
methods for concat.two().

concat() doesn't get a lot of use but I think the idea is sound.
Perhaps that model would work well for a concatenation function in R.
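
A minimal sketch of how that model might look in R, assuming S4 dispatch from the methods package. The names concat2 and concat_sketch are hypothetical stand-ins for concat.two and concat, and the factor method (union of levels in order of appearance) is just one possible choice, not what S+ does.

```r
library(methods)

# Hypothetical two-argument generic, dispatching on the classes of x and y
setGeneric("concat2", function(x, y) standardGeneric("concat2"))

# Default method: fall back to c()
setMethod("concat2", signature("ANY", "ANY"), function(x, y) c(x, y))

# Method for two factors: keep the union of the levels
setMethod("concat2", signature("factor", "factor"), function(x, y) {
  lv <- union(levels(x), levels(y))
  factor(c(as.character(x), as.character(y)), levels = lv)
})

# Not generic itself: folds concat2 over the arguments left to right,
# ignoring argument names
concat_sketch <- function(...) Reduce(concat2, unname(list(...)))

g <- concat_sketch(factor("a"), factor("b"), factor("a"))
n <- concat_sketch(x = 1L, y = 2L)
```

Folding pairwise like this copies the accumulated result at every step, so it would likely be slow when there are many large arguments.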

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com