Need a more efficient way to implement this type of logic in R

6 messages · Walter Anderson, Duncan Murdoch, Joshua Wiley +3 more

#
I have cobbled together the following logic.  It works but is very 
slow.  I'm sure that there must be a better R-specific way to implement 
this kind of thing, but I have been unable to find or understand one.  
Any help would be appreciated.

hh.sub <- households[c("HOUSEID","HHFAMINC")]
for (indx in 1:length(hh.sub$HOUSEID)) {
   if ((hh.sub$HHFAMINC[indx] == '01') | (hh.sub$HHFAMINC[indx] == '02') 
| (hh.sub$HHFAMINC[indx] == '03') | (hh.sub$HHFAMINC[indx] == '04') | 
(hh.sub$HHFAMINC[indx] == '05'))
     hh.sub$CS_FAMINC[indx] <- 1 # Less than $25,000
   if ((hh.sub$HHFAMINC[indx] == '06') | (hh.sub$HHFAMINC[indx] == '07') 
| (hh.sub$HHFAMINC[indx] == '08') | (hh.sub$HHFAMINC[indx] == '09') | 
(hh.sub$HHFAMINC[indx] == '10'))
     hh.sub$CS_FAMINC[indx] <- 2 # $25,000 to $50,000
   if ((hh.sub$HHFAMINC[indx] == '11') | (hh.sub$HHFAMINC[indx] == '12') 
| (hh.sub$HHFAMINC[indx] == '13') | (hh.sub$HHFAMINC[indx] == '14') | 
(hh.sub$HHFAMINC[indx] == '15'))
     hh.sub$CS_FAMINC[indx] <- 3 # $50,000 to $75,000
   if ((hh.sub$HHFAMINC[indx] == '16') | (hh.sub$HHFAMINC[indx] == '17'))
     hh.sub$CS_FAMINC[indx] <- 4 # $75,000 to $100,000
   if ((hh.sub$HHFAMINC[indx] == '18'))
     hh.sub$CS_FAMINC[indx] <- 5 # More than $100,000
   if ((hh.sub$HHFAMINC[indx] == '-7') | (hh.sub$HHFAMINC[indx] == '-8') 
| (hh.sub$HHFAMINC[indx] == '-9'))
     hh.sub$CS_FAMINC[indx] = 0
}
#
On 06/04/2011 4:02 PM, Walter Anderson wrote:
The answer is to think in terms of vectors and logical indexing.  The 
first test in the loop above is equivalent to

hh.sub$CS_FAMINC[ hh.sub$HHFAMINC %in% c('01', '02', '03', '04', '05') ] <- 1

I've left off the rest of the loop, but I think it's similar.
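
Spelled out for every income group, the same pattern might look like the
sketch below.  The toy hh.sub data frame is an assumption standing in for
the poster's data (which is not shown), and sprintf("%02d", ...) is only
a shorthand for writing out the zero-padded codes.

```r
# Toy stand-in for the poster's data; the real `households` is not shown.
hh.sub <- data.frame(
  HOUSEID  = 1:8,
  HHFAMINC = c("01", "05", "06", "12", "17", "18", "-7", "-9"),
  stringsAsFactors = FALSE
)

# Start with every row unclassified, then fill one group per statement.
hh.sub$CS_FAMINC <- NA_real_
hh.sub$CS_FAMINC[hh.sub$HHFAMINC %in% sprintf("%02d", 1:5)]   <- 1  # Less than $25,000
hh.sub$CS_FAMINC[hh.sub$HHFAMINC %in% sprintf("%02d", 6:10)]  <- 2  # $25,000 to $50,000
hh.sub$CS_FAMINC[hh.sub$HHFAMINC %in% sprintf("%02d", 11:15)] <- 3  # $50,000 to $75,000
hh.sub$CS_FAMINC[hh.sub$HHFAMINC %in% c("16", "17")]          <- 4  # $75,000 to $100,000
hh.sub$CS_FAMINC[hh.sub$HHFAMINC %in% "18"]                   <- 5  # More than $100,000
hh.sub$CS_FAMINC[hh.sub$HHFAMINC %in% c("-7", "-8", "-9")]    <- 0  # missing/refused codes

print(hh.sub)
```

Each statement tests the whole column at once, so no explicit loop over
rows is needed.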

Duncan Murdoch
#
Hi Walter,

Take a look at the function ?cut.  It is designed to take a continuous
variable and categorize it, and will be much simpler and faster.  The
only qualification is that your data would need to be numeric, not
character.  However, if your only values are the ones you put in
quotes in your code ('02' etc), a simple call to
as.numeric(variablename) ought to do the trick.  Beyond being faster,
you can probably get down to one line of code, which should be much
easier on the eyes.  To see some examples with cut(), type (at the
console):

example(cut)
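
One caveat on the as.numeric() step, sketched below with made-up values:
if the column was read in as a factor rather than as character, 
as.numeric() returns the internal level codes, not the values, so it is
safer to go through as.character() first.  The vector names here are
hypothetical.

```r
# Character codes convert directly to the numbers they spell out.
codes_chr <- c("02", "18", "-7")
num_from_chr <- as.numeric(codes_chr)

# A factor stores integer level codes internally; as.numeric() on the
# factor itself returns those codes, not the original values.
codes_fac <- factor(codes_chr)
level_codes <- as.numeric(codes_fac)

# Converting through character recovers the intended numbers.
num_from_fac <- as.numeric(as.character(codes_fac))

print(num_from_chr)
print(level_codes)
print(num_from_fac)
```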

Hope this helps,

Josh

P.S. If you are planning on doing any modelling with this data, why
not leave it continuous?
On Wed, Apr 6, 2011 at 1:02 PM, Walter Anderson <wandrson01 at gmail.com> wrote:

  
    
#
Walter -
    Since your codes represent numbers, you could use something like
this:

chk = as.numeric(hh.sub$HHFAMINC)
hh.sub$CS_FAMINC = cut(chk,c(-10,0,5,10,15,17,18),labels=c(0,1:5))
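
To see how those breaks map the codes, here is a small check on made-up
values (the codes vector is an assumption, not the poster's data).  Note
that cut() returns a factor, so the sketch converts through character if
plain numbers are wanted.

```r
# Made-up income codes covering every group in the original loop.
codes <- c("01", "05", "06", "10", "15", "16", "17", "18", "-7", "-9")
chk <- as.numeric(codes)

# Intervals are (-10,0], (0,5], (5,10], (10,15], (15,17], (17,18].
grp <- cut(chk, c(-10, 0, 5, 10, 15, 17, 18), labels = c(0, 1:5))
print(data.frame(codes, grp))

# cut() gives a factor; go through character to get numeric groups back.
grp_num <- as.numeric(as.character(grp))
```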

 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu
On Wed, 6 Apr 2011, Walter Anderson wrote:

            
#
On 06.04.2011 22:02, Walter Anderson wrote:
Hi,
the for-loop is entirely unnecessary. You can, as a first step, rewrite 
the code like this:

# if() tests only a single value, so build a logical vector and index with it
sel <- (hh.sub$HHFAMINC == '01') | (hh.sub$HHFAMINC == '02') |
  (hh.sub$HHFAMINC == '03') | (hh.sub$HHFAMINC == '04') |
  (hh.sub$HHFAMINC == '05')
hh.sub$CS_FAMINC[sel] <- 1 # Less than $25,000

This very basic concept is called "vectorization" in R. You should read 
about it, it rocks.

In this case, though, you don't even need to do that.
If you convert the variable HHFAMINC to a number like this:

hh.sub$HHFAMINC <- as.numeric(hh.sub$HHFAMINC)

then you can apply the cut() function to create a factor variable:

hh.sub$myawesomefactor <- cut(hh.sub$HHFAMINC,
                              breaks=c(0.5, 5.5, 10.5, 15.5, 17.5, 18.5))

or something like that should do the trick. The breaks need an outer 
bound on each side; otherwise codes 1 to 5 and 18 fall outside every 
interval and come back as NA. You will then have to rename the factor 
levels with levels(), or pass labels= to cut() directly.

Also, this might be my OCD speaking, but I would use NA instead of 0 for 
non-available values.
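
A minimal sketch of that NA suggestion, assuming the negative codes
(-7, -8, -9) all mean "not available" and using the bracket descriptions
from the comments in the original loop as labels.  The vector names and
the sample values are made up for illustration.

```r
# Made-up codes, already converted to numbers.
inc <- as.numeric(c("03", "08", "13", "17", "18", "-7", "-8"))

# Assumption: every negative code stands for a missing answer.
inc[inc < 0] <- NA

bracket <- cut(inc,
               breaks = c(0.5, 5.5, 10.5, 15.5, 17.5, 18.5),
               labels = c("Less than $25,000", "$25,000 to $50,000",
                          "$50,000 to $75,000", "$75,000 to $100,000",
                          "More than $100,000"))

# useNA = "ifany" adds a count for the missing values.
print(table(bracket, useNA = "ifany"))
```

With NA in place of 0, functions such as table() and most modelling
functions can handle the missing rows explicitly instead of treating 0
as a sixth income group.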

Have fun,
  Alex
#
Hi

r-help-bounces at r-project.org wrote on 06.04.2011 22:02:29:
Take advantage of factors. If hh.sub$HHFAMINC were a factor, you could 
recode it with

levels(hh.sub$HHFAMINC) <- appropriate vector of new levels with the same 
length as levels(hh.sub$HHFAMINC)

Something like
[1] a b c d e
Levels: a b c d e
[1] 1 1 2 2 1
Levels: 1 2
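
The commands behind the output above are not shown.  A minimal sketch
that reproduces it, assuming a five-level factor with the hypothetical
name x:

```r
# A factor with five distinct levels: a b c d e
x <- factor(c("a", "b", "c", "d", "e"))
print(x)

# Assign new level names position by position; levels that receive the
# same new name are merged, so five levels collapse into two.
levels(x) <- c(1, 1, 2, 2, 1)
print(x)
```

The replacement vector is matched to levels(x) in order, so it is worth
printing levels(x) first to check which old level each new value lands
on.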
Regards
Petr