
problems with coercing a factor to be numeric

11 messages · Francesco Sarracino, Dimitris Rizopoulos, David Winsemius +3 more

#
Check R FAQ 7.10: How do I convert factors to numeric?


I hope it helps.

Best,
Dimitris
On 1/23/2013 10:33 AM, Francesco Sarracino wrote:

  
    
#
check also

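pp <- rep(0:1, 10)
pp <- factor(pp, levels = 0:1, labels = c("no", "yes"))

unclass(pp)      # internal integer codes: 1 for "no", 2 for "yes"
unclass(pp) - 1  # shift the codes back to the original 0/1 coding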


Best,
Dimitris
On 1/23/2013 10:48 AM, Francesco Sarracino wrote:

  
    
#
On Jan 23, 2013, at 1:58 AM, Francesco Sarracino wrote:

            
I think it is rather strange that you are criticising R because the  
mean or sum functions won't coerce factors to numeric class. R is  
already very loosely typed. It has a fairly limited number of object  
classes and there is widespread class coercion when it is appropriate.  
Can you explain why you believed factors or by logical extension  
character classed variables should get implicitly coerced by all  
mathematical functions?
#
To find the proportion of "yes"s in pp you can use
   mean(pp == "yes")
and avoid the conversion of a factor to integer (and
subtracting 1).  The above works for character and factor
pp.
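
A minimal sketch of the comparison approach above, reusing the pp built earlier in the thread (20 values, half of them "yes"). The name pp_chr is only illustrative.

```r
# Factor with labels "no"/"yes", as constructed earlier in the thread
pp <- factor(rep(0:1, 10), levels = 0:1, labels = c("no", "yes"))

# pp == "yes" is a logical vector; its mean is the proportion of TRUE values
prop_factor <- mean(pp == "yes")

# The same expression works unchanged on a character vector
pp_chr <- as.character(pp)
prop_chr <- mean(pp_chr == "yes")

print(c(prop_factor, prop_chr))  # both 0.5
```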

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Given that your labels are "no" and "yes", what do you expect R to
do?  To quote a well-known fortune, "R is lacking a mind_read() function!"

     cheers,

         Rolf Turner
On 01/23/2013 10:58 PM, Francesco Sarracino wrote:
#
On 23 Jan 2013, at 21:36, "Francesco Sarracino" <f.sarracino at gmail.com> wrote:

            
Such options do exist, but at modelling time, not factor creation/conversion time.

When created, by calls to 'factor' or in functions like 'read.table', factors are stored internally as integers with a list of labels (what you see as factor levels) that go with each integer. Those internal integers start at 1 and go up. You can set the ordering of those labels (by specifying the "levels" argument in factor()) so that, for example, yes and no can be associated with (numeric) factor levels 1 and 2 respectively instead of the default ordering which would put 'no' alphabetically before 'yes'. (I find this choice particularly useful for orderings like "high", "medium", "low" for which the alphabetic ordering is not exactly intuitive; similarly alphabetic ordering puts '1', '2', '10' in the order '1', '10', '2' and so on, so that often needs specifying manually. It's also useful to specify levels if you want things like boxplots to come out in a particular order, as boxplots by default use the order of the factor levels).
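
As a rough illustration of the level ordering described above (the example vectors are made up for this sketch), compare the default alphabetical levels with levels given explicitly:

```r
x <- c("low", "high", "medium", "low")

# Default: levels are the sorted unique values, so "high" comes first
f_default <- factor(x)
levels(f_default)      # "high" "low" "medium"
as.integer(f_default)  # 2 1 3 2

# Explicit levels: the internal integers now follow the intended order
f_ordered <- factor(x, levels = c("low", "medium", "high"))
levels(f_ordered)      # "low" "medium" "high"
as.integer(f_ordered)  # 1 3 2 1

# Number-like labels are sorted as character strings
n <- factor(c("1", "2", "10"))
levels(n)              # "1" "10" "2"
```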
The internal integer values are returned by 'as.numeric'. If your factor level labels - which are always character - are also interpretable as numbers, you need 'as.character' to return the character strings and then 'as.numeric' to convert those.
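
A small sketch of the difference, using a made-up factor f whose labels happen to look like numbers. The last line is the alternative form given in R FAQ 7.10.

```r
f <- factor(c("10", "20", "10", "5"))

# Levels are sorted as strings: "10" "20" "5"
codes <- as.numeric(f)                  # internal codes: 1 2 1 3

# Go through the labels to recover the numbers they spell out
values <- as.numeric(as.character(f))   # 10 20 10 5

# Equivalent: convert the levels once, then index by the factor
values_faq <- as.numeric(levels(f))[f]  # 10 20 10 5
```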

Now, up to this point you just have more or less arbitrary integers associated with the original factor levels (the degree of arbitrariness depends on whether you specified the level order or let R use its default). These integers are not the contrasts used in model fitting. Contrasts are set at model matrix building time; they are not a fixed attribute of the factor. The internal numbering of levels affects contrasts only to the extent that the numerical values used in setting contrasts are usually in the same order as the factor levels. You can inspect the functions used to associate contrasts with factor levels by using options("contrasts"). You can inspect the numerical values that would currently be used for a given factor with a call to contrasts(). You can change the contrast assignments globally using options() or explicitly in some model calls (lm, for example, has a contrasts argument) and if you like you can write your own contrast functions to set any values you like. The most common are probably treatment contrasts, which set the first factor level as intercept and the rest as (unit) differences from that, and sum to zero contrasts which do what they say, setting contrasts that sum to zero by choosing a set like (-1, 0, 1).
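
A hedged sketch of inspecting and overriding contrasts. The data frame d and its values are invented for illustration; the printed matrices depend on your current options("contrasts"), so the code calls contr.treatment and contr.sum directly where it needs a known result.

```r
# Toy data: three groups, two observations each (illustrative values only)
d <- data.frame(y = c(1, 2, 4, 2, 3, 5),
                g = factor(rep(c("a", "b", "c"), 2)))

options("contrasts")  # contrast functions currently in use
contrasts(d$g)        # values that would be used for g right now

trt <- contr.treatment(3)  # first level is the baseline
smz <- contr.sum(3)        # each column sums to zero

# Same model, two codings: the fitted values agree, the coefficients differ
fit_trt <- lm(y ~ g, data = d, contrasts = list(g = "contr.treatment"))
fit_sum <- lm(y ~ g, data = d, contrasts = list(g = "contr.sum"))

coef(fit_trt)  # intercept is the mean of group "a"
coef(fit_sum)  # intercept is the mean of the three group means
```

Under treatment contrasts the remaining coefficients are differences from the first level; under sum-to-zero contrasts they are deviations from the overall intercept.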

So you actually have a great deal of control over both the order in which labels are associated with factor levels and the (separate) values of contrasts associated with those factor levels at modelling time. 

The cost of that control is some complexity, and the time needed to learn what's going on to use it all properly. 

Hope that helps ...


S Ellison
