dummy variable encoding
The best encoding depends on which language you would like to manipulate the variable in. In R, genders are most naturally represented as factors. That means that in an external data source (like a spreadsheet of data), you should ideally have the gender recorded as human-understandable text ("male" and "female", or "M" and "F"). Once the data is read into R, the strings can be converted to factors, keeping the human-readable labels (read.table() did this by default before R 4.0.0; in later versions you need stringsAsFactors = TRUE or an explicit call to factor()). This way you avoid having to remember that 1 means male (or whatever).
If you were manipulating the data in a different language that didn't have factors, then it might be more appropriate to use an integer. Which integers you use doesn't matter: whatever you choose, you need a look-up table to know what each number refers to.
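For a language without factors, the look-up table idea could be sketched roughly as below in C. The codes 1 and 2, the names gender_code, GENDER_LABELS and gender_label, and the "unknown" fallback are all hypothetical choices made for illustration; any other set of integers would work equally well.

```c
#include <stddef.h>

/* Hypothetical integer codes; the specific values are arbitrary. */
enum gender_code { GENDER_FEMALE = 1, GENDER_MALE = 2 };

/* Look-up table mapping each code to its human-readable label.
   Index 0 is left unset (NULL) because no code uses it. */
static const char *const GENDER_LABELS[] = {
    [GENDER_FEMALE] = "female",
    [GENDER_MALE]   = "male",
};

/* Return the label for a code, or "unknown" for any code not in the table. */
const char *gender_label(int code)
{
    size_t n = sizeof GENDER_LABELS / sizeof GENDER_LABELS[0];
    if (code < 0 || (size_t)code >= n || GENDER_LABELS[code] == NULL)
        return "unknown";
    return GENDER_LABELS[code];
}
```

Calling gender_label(GENDER_MALE) gives back the text label, so the rest of the program never has to remember which number means what.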
Yes, that's what I thought. However, somebody told me that it is better to use 1/2 rather than 0/1 for a two-level factor such as gender, and I've no idea why. I told them it didn't matter, but I have since seen quite a few examples that use 1/2 (admittedly in SPSS).
The only benefit that I can see of using 1/2 instead of 0/1 is fairly minor. If you have cases where there are missing values, and you are working in a language that doesn't support NA values for integers (or factors; I'm thinking of something like C), then you could encode your genders as
0: not recorded
1: female
2: male
Then you can include logic like
if (gender)
{
    /* gender was recorded: do something */
}
The alternative encoding of 0/1 would be something like
-1: not recorded
0: female
1: male
This makes the code slightly less pretty.
if (gender != -1)
{
    /* gender was recorded: do something */
}
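To make the comparison concrete, here is a minimal C sketch of both encodings, each counting how many genders were actually recorded. The constant names and the two count_recorded_* functions are hypothetical, introduced only for this illustration.

```c
#include <stddef.h>

/* Encoding A: 0 marks a missing value, so a plain truth test works. */
enum { A_NOT_RECORDED = 0, A_FEMALE = 1, A_MALE = 2 };

/* Encoding B: 0 and 1 are real values, so missing needs a sentinel. */
enum { B_NOT_RECORDED = -1, B_FEMALE = 0, B_MALE = 1 };

/* Count recorded values under encoding A. */
size_t count_recorded_a(const int *gender, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (gender[i])                      /* nonzero means recorded */
            count++;
    return count;
}

/* Count recorded values under encoding B. */
size_t count_recorded_b(const int *gender, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (gender[i] != B_NOT_RECORDED)    /* explicit sentinel check */
            count++;
    return count;
}
```

Both functions do the same job; the only difference is whether the missing-value test can be written as a bare if (gender) or needs an explicit comparison.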
Again, none of this really applies to R, since you should be using factors for this sort of variable.
Regards,
Richie.
Mathematical Sciences Unit
HSL