turning comma separated string from multiple choices into flags

5 messages · June Kim, Peter Dalgaard, Henrique Dallazuanna

#
Hello,

I use Google Docs' Forms to conduct surveys online. Multiple-choice
questions are coded as comma-separated values.

For example,

if the question is like:

1. What magazines do you currently subscribe to? (you can choose
multiple choices)
1) Fast Company
2) Havard Business Review
3) Business Week
4) The Economist

And if the subject chose 1) and 3), the data is coded as a cell in a
spreadsheet as,

"Fast Company, Business Week"

I read the data into R with read.csv. To analyze the data, I have to
change that string into something like flags (indicator variables?).
That is, there should be 4 variables whose values are either 1 or
0, indicating chosen or not chosen, respectively.

Suppose the data is something like,

  age                                    favorite_magazine
1  29                                         Fast Company
2  31                          Fast Company, Business Week
3  32 Havard Business Review, Business Week, The Economist

Then I have to chop the string in favorite_magazine column to turn
that data into something like,

  age Fast Company Havard Business Review Business Week The Economist
1  29            1                      0             0             0
2  31            1                      0             1             0
3  32            0                      1             1             1

Actually I have many more multiple choice questions in the survey.

What is an easy, elegant, and natural way to do this in R?
#
June Kim wrote:
I'd look into something like as.data.frame(lapply(strings, grep,
x=favorite_magazine, fixed=TRUE)), where strings <- c("Fast Company",
"Havard Business Review", ...).

(I take it that the mechanism is such that you can rely on at least
having everything misspelled in the same way? If it is alternatingly
"Havard" and "Harvard", then things get a bit trickier.)
#
Thank you. The misspelling of Harvard wasn't intended. The data are
spelled consistently.

2008/9/30 Peter Dalgaard <P.Dalgaard at biostat.ku.dk>:
#
June Kim wrote:
OK. One other potential problem: if some of the strings are substrings of
each other (as in "Science" and "Statistical Science"), then you may need
more care.
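
One way to sidestep the substring problem, sketched here under the assumption that choices are always separated by a comma plus a space: split each answer into its individual choices and test exact membership instead of searching for a pattern. The `answers` vector is made-up illustration data, not from the survey.

```r
# Hypothetical answers in which one title is a substring of another.
answers <- c("Science", "Statistical Science", "Science, Statistical Science")
strings <- c("Science", "Statistical Science")

# Split each answer into its individual choices (assumes a ", " separator).
choices <- strsplit(answers, ", ", fixed = TRUE)

# For each respondent, flag the options that appear as a whole choice.
flags <- t(vapply(choices,
                  function(ch) as.integer(strings %in% ch),
                  integer(length(strings))))
colnames(flags) <- strings
print(flags)
```

With plain substring matching, the second respondent would also be flagged for "Science"; with exact membership only "Statistical Science" is set.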

And I misremembered: It is probably better to use regexpr(....) != -1
than grep(....) for this purpose because the latter returns indices
rather than a value for each element.
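
A minimal sketch of the regexpr(....) != -1 idea, using the three rows from the original post. The names `x`, `strings`, `flags` and `result` are chosen here for illustration, and the column is assumed to be character rather than factor.

```r
# Toy data copied from the original post ("Havard" is spelled as in the data).
x <- data.frame(
  age = c(29, 31, 32),
  favorite_magazine = c(
    "Fast Company",
    "Fast Company, Business Week",
    "Havard Business Review, Business Week, The Economist"
  ),
  stringsAsFactors = FALSE
)

strings <- c("Fast Company", "Havard Business Review",
             "Business Week", "The Economist")

# regexpr() returns one value per element, -1 where the option is absent,
# so comparing against -1 gives one 0/1 flag per respondent.
flags <- sapply(strings, function(s) {
  as.integer(regexpr(s, x$favorite_magazine, fixed = TRUE) != -1)
})

# check.names = FALSE keeps the spaces in the magazine names.
result <- data.frame(age = x$age, flags, check.names = FALSE)
print(result)
```

This is still substring matching, so the caveat about options that contain one another applies.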
#
Try this:

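# Split on ", " so names carry no leading blank; as.character() guards
# against read.csv having turned the column into a factor.
magazines <- strsplit(as.character(x$favorite_magazine), ", ", fixed = TRUE)
table(rep(x$age, sapply(magazines, length)), unlist(magazines))
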
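
Note that the table() call above labels rows by age, so two respondents of the same age would be pooled into one row. A variant sketch keyed by row number instead, assuming a ", " separator; the data here are made up so that two respondents share an age, and `choices`, `respondent` and `result` are illustrative names.

```r
# Made-up data: respondents 2 and 3 have the same age.
x <- data.frame(
  age = c(29, 31, 31),
  favorite_magazine = c("Fast Company",
                        "Fast Company, Business Week",
                        "The Economist"),
  stringsAsFactors = FALSE
)

choices <- strsplit(x$favorite_magazine, ", ", fixed = TRUE)

# Repeat each row number once per choice; the factor levels keep
# respondents who chose nothing as all-zero rows.
respondent <- factor(rep(seq_len(nrow(x)), sapply(choices, length)),
                     levels = seq_len(nrow(x)))

# Cross-tabulate respondent by magazine, then bind back to the age column.
counts <- table(respondent, unlist(choices))
result <- cbind(x["age"], as.data.frame.matrix(counts))
print(result)
```

The magazine columns come out in alphabetical order, since table() sorts the names.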
On Mon, Sep 29, 2008 at 11:45 AM, June Kim <juneaftn at gmail.com> wrote: