dummy variable encoding

4 messages · news at aleblanc.cotse.net, Richard Cotton

#
Hi,
   can anyone tell me why an encoding of 1/2 for a dummy variable for
   two groups (e.g. gender) seems to be preferred over 0/1?
   It's been bugging me for a while, 0/1 seems more natural, but I have
   been told (without explanation) that 1/2 is better. Why?
#
The best encoding depends upon which language you would like to manipulate
the variable in.  In R, genders are most naturally represented as factors.
That means that in an external data source (like a spreadsheet of data),
you should ideally have the gender recorded as human-understandable text
("male" and "female", or "M" and "F").  Once the data is read into R, by
default R will convert the strings to a factor (keeping the human-readable
labels).  This way you avoid having to remember that 1 means male (or
whatever).
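
A minimal sketch of the above, assuming a hypothetical CSV with a text
gender column (the data and the names csv_text and survey are invented for
illustration).  From R 4.0.0 onwards read.csv no longer converts strings to
factors by default, so the conversion is requested explicitly here:

```r
# Hypothetical data as it might look in a CSV export of a spreadsheet.
csv_text <- "id,gender
1,male
2,female
3,female
4,male"

# stringsAsFactors = TRUE is passed explicitly so that the text column
# becomes a factor regardless of the R version's default.
survey <- read.csv(text = csv_text, stringsAsFactors = TRUE)

levels(survey$gender)      # the human-readable labels: "female" "male"
as.integer(survey$gender)  # the integer codes stored behind them: 2 1 1 2
table(survey$gender)       # counts are reported by label, not by code
```

The integer codes behind the labels start at 1, but you never need to
handle them directly.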

If you were manipulating the data in a different language that didn't have
factors, then it might be more appropriate to use an integer.  Which
integers you use doesn't matter; whatever you choose, you need a look-up
table to know what each number refers to.
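
If integer-coded data does reach R, the look-up table can be applied once
at import.  A rough sketch, assuming two hypothetical codings of the same
five people (1 = female, 2 = male versus 0 = female, 1 = male); the vectors
are invented for illustration:

```r
# The same five people under two hypothetical integer codings.
coded_12 <- c(2, 1, 1, 2, 1)   # 1 = female, 2 = male
coded_01 <- c(1, 0, 0, 1, 0)   # 0 = female, 1 = male

# levels/labels act as the look-up table from code to meaning.
gender_a <- factor(coded_12, levels = c(1, 2), labels = c("female", "male"))
gender_b <- factor(coded_01, levels = c(0, 1), labels = c("female", "male"))

# Once labelled, the two codings give the same values.
all(as.character(gender_a) == as.character(gender_b))
```

After this step the original choice of integers no longer matters to any
later code.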

Regards,
Richie.

Mathematical Sciences Unit
HSL
#
Richard.Cotton at hsl.gov.uk writes:
Yes, that's what I thought. However somebody told me that it is better
to use 1/2 rather than 0/1 for a 2 level factor such as gender, and I've
no idea why. I told them it didn't matter, but have since seen quite a
few examples where they use 1/2 (admittedly in SPSS).
#
The only benefit that I can see of using 1/2 instead of 0/1 is fairly 
minor.

If you have cases where there are missing values, and you are working in a 
language that doesn't support NA values for integers (or factors; I'm 
thinking of something like C), then you could encode your genders as

0: not recorded
1: female
2: male

Then you can include logic like

if (gender)
{
   /* gender was recorded; do something */
}

The alternative encoding of 0/1 would be something like

-1: not recorded
0: female
1: male

This makes the code slightly less pretty.

if (gender != -1)
{
   /* gender was recorded; do something */
}

Again, none of this really applies to R, since you should be using factors 
for this sort of variable.
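
For comparison, a small sketch of the same check with a factor in R, where
NA plays the role of the "not recorded" code.  The vector is made up for
illustration:

```r
# Hypothetical data with one unrecorded value.
gender <- factor(c("male", NA, "female", "male"), levels = c("female", "male"))

# NA marks a value that was not recorded; no sentinel integer is needed.
recorded <- !is.na(gender)

for (i in seq_along(gender)) {
  if (recorded[i]) {
    # do something with the recorded value
    message("row ", i, ": ", as.character(gender[i]))
  }
}

# Missing values can be counted alongside the labelled levels.
table(gender, useNA = "ifany")
```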

Regards,
Richie.