c.factor

11 messages · Marc Schwartz, Brian Ripley, Bill Dunlap +4 more

#
Hi,

Given factors x and y, c(x,y) does not seem to return a useful result:
> x
[1] a b c d e
Levels: a b c d e
> y
[1] d e f g h
Levels: d e f g h
> c(x,y)
[1] 1 2 3 4 5 1 2 3 4 5
Is there a case for a new method c.factor as follows?  Does something
similar exist already?  Is there a better way to write the function?
c.factor = function(x,y)
{
    newlevels = union(levels(x),levels(y))
    m = match(levels(y), newlevels)
    ans = c(unclass(x),m[unclass(y)])
    levels(ans) = newlevels
    class(ans) = "factor"
    ans
}
> c.factor(x,y)
[1] a b c d e d e f g h
Levels: a b c d e f g h
> as.integer(c.factor(x,y))
[1] 1 2 3 4 5 4 5 6 7 8
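
To make the remapping step concrete, here is a minimal walk-through of the intermediate values. It assumes x and y are built from letters so that they match the printed output; newlevels and m follow the proposed function, while codes and combined are illustrative names only.

```r
x <- factor(letters[1:5])
y <- factor(letters[4:8])

# Union of the two level sets, in order of first appearance.
newlevels <- union(levels(x), levels(y))

# Position of each of y's levels within the combined level set.
m <- match(levels(y), newlevels)

# x's codes are already valid; y's codes are translated through m.
codes <- c(as.integer(x), m[as.integer(y)])

# Attach the combined levels and the class, without going via character.
combined <- structure(codes, levels = newlevels, class = "factor")
print(combined)
```

The only work proportional to the data length is the integer indexing m[as.integer(y)]; the match() call touches the level sets alone.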
Regards,
Matthew
_                           
platform       x86_64-unknown-linux-gnu    
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          2                           
minor          4.0                         
year           2006                        
month          10                          
day            03                          
svn rev        39566                       
language       R                           
version.string R version 2.4.0 (2006-10-03)
#
On Tue, 2006-11-14 at 16:36 +0000, Matthew Dowle wrote:
I'll defer to others as to whether or not there is a basis for c.factor,
however:

c.factor <- function(...)
{
  args <- list(...)

  # this could be optional
  if (!all(sapply(args, is.factor)))
   stop("All arguments must be factors")

  factor(unlist(lapply(args, function(x) as.character(x))))
}


x <- factor(letters[1:5])
y <- factor(letters[4:8])
z <- factor(letters[9:14])
> x
[1] a b c d e
Levels: a b c d e
> y
[1] d e f g h
Levels: d e f g h
> z
[1] i j k l m n
Levels: i j k l m n
> c(x, y)
[1] a b c d e d e f g h
Levels: a b c d e f g h
> c(x, y, z)
[1] a b c d e d e f g h i j k l m n
Levels: a b c d e f g h i j k l m n
> c(x, 1:5)
Error in c.factor(x, 1:5) : All arguments must be factors


HTH,

Marc Schwartz
#
On Tue, 2006-11-14 at 11:51 -0600, Marc Schwartz wrote:
That last line can even be cleaned up, as I was doing something else
initially:

c.factor <- function(...)
{
  args <- list(...)

  if (!all(sapply(args, is.factor)))
   stop("All arguments must be factors")

  factor(unlist(lapply(args, as.character)))
}


Marc
#
Well, R has managed without a factor method for c() for most of its decade 
of existence (not that it originally had factors as we know them).

I would argue that factors are best viewed as an enumeration type, and 
anything which silently changes their level set is a bad idea.  I can see 
a case for a c() method for factors that combines factors with the same 
level sets, but I can also see this is best done by users who know the 
level sets are the same (c.factor would have to expend considerable effort 
to check).

You also need to consider the dispatch rules.  c.factor will be called 
whenever the first argument is a factor, whatever the others are. S4 (I 
think, definitely S4-based versions of S-PLUS) has an alternative concat() 
that works differently (recursively) and seems a more natural model.
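
The dispatch point can be seen with a small sketch. The method below is a stub written only to record whether it was reached (it is not a proposed implementation, and it masks any method of the same name in the calling environment):

```r
# Stub method: returns a marker string instead of combining anything.
c.factor <- function(...) {
  "c.factor was called"
}

x <- factor(letters[1:5])

# Factor first: the method is called even though 1:5 is not a factor.
first <- c(x, 1:5)

# Integer first: the default method runs and the factor method is never consulted.
second <- c(1:5, x)

print(first)
print(second)
```

So a c.factor method has to cope with arbitrary later arguments, and it gets no say at all when a factor appears in any position but the first.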
#
On Tue, 14 Nov 2006, Prof Brian Ripley wrote:

            
In addition, c() has always had a double meaning of
  (a) turning an object into a simple "vector" (an object
      without "attributes"), as in
      > c(factor(c("Cat","Dog","Cat")))
      [1] 1 2 1
      > c(data.frame(x=1:2,y=c("Dog","Cat")))
      $x
      [1] 1 2

      $y
      [1] Dog Cat
      Levels: Cat Dog

  (b) concatenating several such vectors into one.

The proposed c.factor does only (b).  Should we just
throw c() into the ash heap and use as.vector() or
concat() instead?

The whole concept of concatenating objects of disparate
types is suspect.
----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."
#
On Tue, 14 Nov 2006, Bill Dunlap wrote:

            
To my surprise that was not documented at all on the R help page, and I've
clarified it.  (BTW, at least in R it does not remove names, just all
other attributes.)

(Strictly not, as a factor is not a vector.)

But the help page explicitly only describes the default method, and some
of the other methods do preserve some attributes, AFAIR.

I think working on a concat() for R would be helpful.  I vaguely recalled
something like it in the Green Book, but the index does not help (but then
it is not very complete).

Brian
#
It does not remove names in Splus either, just all
other attributes.  I see c() used in several Splus
functions as a way to convert a matrix into a vector
(by removing the .Dims and .Dimnames attributes).

Splus does have a concat().  I believe it is modelled
after the Green Book example.  It uses a helper function
called concat.two(x,y), which is generic and has 2 arguments
to make it easier to write methods for.  concat(x,y,z)
calls concat.two(concat.two(x,y),z).  concat() is not used much,
but it is used in the Summary group functions: min, max, sum, etc.
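
A rough R sketch of that pairwise design, for illustration only: it uses S3 dispatch rather than the S-PLUS implementation, and the method bodies are assumptions, not the S-PLUS code.

```r
# Two-argument generic: methods only ever have to handle a pair.
concat.two <- function(x, y) UseMethod("concat.two")

# Fallback: plain c().
concat.two.default <- function(x, y) c(x, y)

# Assumed factor method: combine level sets, keep order of first appearance.
concat.two.factor <- function(x, y) {
  stopifnot(is.factor(y))
  lev <- union(levels(x), levels(y))
  factor(c(as.character(x), as.character(y)), levels = lev)
}

# concat(x, y, z) folds left: concat.two(concat.two(x, y), z).
concat <- function(...) Reduce(concat.two, list(...))

x <- factor(letters[1:5])
y <- factor(letters[4:8])
z <- factor(letters[9:14])
r <- concat(x, y, z)
print(r)
```

Because each step sees the result of the previous one, dispatch depends on the accumulated object and not only on the first argument of the original call.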

----------------------------------------------------------------------------
Bill Dunlap
#
I don't see the logic in certain attribute names (?attr lists 'class', 
'comment', 'dim', 'dimnames', 'names', 'row.names' and 'tsp') being 
'special'. Why are matrices  vectors with some 'magic' attribute 'dim', 
and possibly some non-magic attributes like 'umbongo' and 'fnord'? I'd 
have the dim (and other essential properties of data structures) more 
tightly bound to the data. attr(M,'dim')=NULL seems a very odd way of 
turning a matrix into a vector.

  If I was the R bdfl[1] I'd deprecate using 'attr' to access these 
aspects of data objects and insist on using "dim(M)" or 
"row.names(M)=foo". Then I can use 'attr' for any old annotations, for 
example in my light bulb brightness study I could do:

  sample1 = c(4.2,4.5,4.8,4.1)
  attr(sample1,'dim') = TRUE

  sample2 = c(5.6,5.8,6.7,6.5,9.3)
  attr(sample2,'dim') = FALSE

Barry

[1] http://en.wikipedia.org/wiki/Benevolent_Dictator_for_Life
#
On Wed, 15 Nov 2006, Barry Rowlingson wrote:

            
I have a less than vague feeling that you as bdfl are describing new-style 
classes:

library(Matrix)
x <- Matrix(1:6, nrow=2, ncol=3)
getSlots(class(x))
slot(x, "Dim")
slot(x, "Dim") <- NULL

but perhaps unfortunately:

slot(x, "Dim") <- as.integer(c(0,0,0))

works! Maybe not quite what was intended?

  
    
#

        
      Roger> On Wed, 15 Nov 2006, Barry Rowlingson wrote:

      BillD> It does not remove names in Splus either, just all
      BillD> other attributes.  I see c() used in several Splus
      BillD> functions as a way to convert a matrix into a vector
      BillD> (by removing the .Dims and .Dimnames attributes).

      BaRow> I don't see the logic in certain attribute names
      BaRow> (?attr lists 'class', 'comment', 'dim', 'dimnames',
      BaRow> 'names', 'row.names' and 'tsp') being 'special'. Why
      BaRow> are matrices vectors with some 'magic' attribute
      BaRow> 'dim', and possibly some non-magic attributes like
      BaRow> 'umbongo' and 'fnord'? I'd have the dim (and other
      BaRow> essential properties of data structures) more tightly
      BaRow> bound to the data. attr(M,'dim')=NULL seems a very
      BaRow> odd way of turning a matrix into a vector.
      BaRow> 
      BaRow> If I was the R bdfl[1] I'd deprecate using 'attr' to
      BaRow> access these aspects of data objects and insist on
      BaRow> using "dim(M)" or "row.names(M)=foo".

    Roger> I have a less than vague feeling that you as bdfl are
    Roger> describing new-style classes:

indeed, thank you, Roger!

    Roger> library(Matrix)
    Roger> x <- Matrix(1:6, nrow=2, ncol=3)
    Roger> getSlots(class(x))
    Roger> slot(x, "Dim")
    Roger> slot(x, "Dim") <- NULL

which gives

Error in checkSlotAssignment(object, name, value) : 
	assignment of an object of class "NULL" is not valid for slot "Dim" in an object of class "dgeMatrix"; is(value, "integer") is not TRUE

    Roger> but perhaps unfortunately:

    Roger> slot(x, "Dim") <- as.integer(c(0,0,0))

    Roger> works! Maybe not quite what was intended?

Indeed it does not signal an error,  surprisingly to you and others.
Note however that subsequently

 > validObject(x)
 Error in validObject(x) : invalid class "dgeMatrix" object: Dim slot must be of length 2

. . . . . 

This seems to have become a QFAQ (quite frequently asked question):

 Q: Why does R not signal an error when I assign invalid contents
    to a slot of an S4 object?

 A: Calling validObject() on every slot assignment would
    - inhibit building S4 objects incrementally ``until they are valid''
    - potentially lead to too slowly executing code.

Consequence:  

 Only ``low-level'' programming should use
   slot(A, "sn") <- ....  or the equivalent
        A @ sn   <- ....
 and then the programmeR is solely responsible for constructing a
 valid object -- or needs to explicitly call validObject() at
 the end of the construction process.

 Normal construction of S4 objects should use

 - either  new("..", .....)  
   which calls validObject() 
   [unless for new("...") with no extra args]

 - or higher level constructors 
   such as you did in the example by calling Matrix().
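
The behaviour described above can be reproduced without the Matrix package. The sketch below uses a made-up class, "Dims2", whose validity rule imitates the "Dim slot must be of length 2" check; the class and its rule are assumptions for illustration only.

```r
library(methods)

# Hypothetical class with a validity function.
setClass("Dims2",
         slots = c(Dim = "integer"),
         validity = function(object) {
           if (length(object@Dim) != 2L) "Dim slot must be of length 2" else TRUE
         })

# new() with arguments runs the validity check.
ok <- new("Dims2", Dim = c(2L, 3L))
bad_new <- tryCatch(new("Dims2", Dim = 1:3), error = function(e) "rejected")

# Slot assignment only checks the class of the value, not validity.
ok@Dim <- c(0L, 0L, 0L)

# An explicit validObject() call catches the problem afterwards.
late <- tryCatch(validObject(ok), error = function(e) "rejected")

print(c(bad_new = bad_new, late = late))
```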
   

John Chambers has much more to say on this, but I think most of
it has already been written in the Green Book.

--
Martin Maechler, ETH Zurich
6 days later
#
I noticed that a new feature in R 2.4 is that unlist of a list of factors 
already does the operation that I proposed :
> unlist(list(x,y))
[1] a b c d e d e f g h
Levels: a b c d e f g h
Therefore, does it not make sense that c(x,y) should return the same as 
unlist(list(x,y))?

Also, the factor-specific "if" branch inside the definition of unlist, not 
surprisingly, uses a very similar method to those previously posted. 
However, it first coerces the factors with as.character before matching to 
the new level set, which is inefficient. Here is the c.factor method again 
that I proposed, which avoids the as.character and is therefore more 
efficient. Leaving aside the discussion about c.factor, or concat, or 
whatever, could 'unlist' be changed to use this method instead? After 
all, one of the key advantages of factors is to save main memory; anything 
which coerces back to character is going to defeat the benefit.
c.factor = function(...)
{
    args <- list(...)
    if (!all(sapply(args, is.factor))) stop("all arguments must be factor")
    newlevels = unique(unlist(lapply(args,levels)))
    ans = unlist(lapply(args, function(x) {
        m = match(levels(x), newlevels)
        m[as.integer(x)]
    }))
    levels(ans) = newlevels
    class(ans) = "factor"
    ans
}
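
As a hedged side-by-side, the sketch below puts the code-remapping approach next to the as.character-based one posted earlier in the thread. The function names combine_codes and combine_chars are made up for this comparison. The values agree, but the level order can differ: the remapping version keeps levels in order of first appearance, while factor() on character data sorts them.

```r
# Remap integer codes through the combined level set (no character coercion).
combine_codes <- function(...) {
  args <- list(...)
  stopifnot(all(vapply(args, is.factor, logical(1))))
  newlevels <- unique(unlist(lapply(args, levels)))
  codes <- unlist(lapply(args, function(f) {
    match(levels(f), newlevels)[as.integer(f)]
  }))
  structure(codes, levels = newlevels, class = "factor")
}

# Coerce to character and rebuild the factor.
combine_chars <- function(...) {
  factor(unlist(lapply(list(...), as.character)))
}

x <- factor(letters[4:8])
y <- factor(letters[1:5])
a <- combine_codes(x, y)
b <- combine_chars(x, y)

same_values <- identical(as.character(a), as.character(b))
print(same_values)
print(levels(a))
print(levels(b))
```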
[1] TRUE
_
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          4.0
year           2006
month          10
day            03
svn rev        39566
language       R
version.string R version 2.4.0 (2006-10-03)
"Brian Ripley" <ripley at stats.ox.ac.uk> wrote in message 
news:Pine.LNX.4.64.0611150926070.19618 at auk.stats...