Basic question: Reading in multiple choice question responses to a single column in data frame

You might look at the mChoice function in the Hmisc package for some 
indirect help.

Frank
I'm using read.delim to successfully read in tab-delimited data, but some
columns' values are comma separated, reflecting the fact that the user chose a
few answers on a multi-select question.  I understand that each answer is
its own category and so could be represented as a separate column in the
data set, but I'd like the option of reading in the data column and
converting it to a vector in which every row value (comma separated or not)
has its own vector entry, so that the "table(columnData)" function
counts correctly.

So some code:

    myData = read.delim(myDataFile, row.names=1, header=TRUE, skip=10)  # works fine
    myColumn = myData[[question]]  # works fine, selects correct question column data

myColumn data is now e.g.:

    1
    0
    2
    0,2
    0
    3
    2
    2,1

with the comma separated values looking like atomic string values I guess.
But I would like:

    1
    0
    2
    0
    2
    0
    3
    2
    2
    1

I've tried various things, e.g. grep to recognize and expand the comma
separated values, but since vector functions are at work, I can only replace
one value back into the myColumn data, e.g. a "0,2" entry becomes "0" or "2" if
I use

    myColumn = gsub("^([0-9]+),([0-9]+)$", c('\\1'), myColumn, perl=TRUE)  # or replace with c('\\2')

but I can't replace into c('\\1','\\2').

Any elegant or otherwise ways to do this?

Much appreciated,

Damion

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University
Are you looking for something like this?

 > d      = data.frame(a=1:5,b=c("1","2,3","2","3,4","1"))
 > d
   a   b
1 1   1
2 2 2,3
3 3   2
4 4 3,4
5 5   1
 > multis = strsplit(d$b,",")
 > counts = sapply(strsplit(d$b,","),length )
 > d2 = data.frame( a=rep(d$a,counts), b=unlist(multis) )
 > d2
   a b
1 1 1
2 2 2
3 2 3
4 3 2
5 4 3
6 4 4
7 5 1

Best,
Magnus

Magnus,

Looks like that solution should work, and I like the flexibility of your
data output, but I get an "Error in strsplit(d$b, ",") : non-character
argument" at:

	multis = strsplit(d$b,",")

Seems like the c() function converts integer looking items like "1" into
integers and then strsplit fails on them?  I was running into this earlier
when attempting strsplit directly on column values.

Damion 

Hi, Magnus,

I discovered that

	multis = strsplit(as.character(d$b),",")

works in the example you gave.  Thanks very much, looks like that's the way
I'll go for now.  P.S. For others who may want it, my selected column was
plugged in as

	myData=read.delim(myDataFile etc. etc....);
	myColumn = myData[[myQuestion]]; #myQuestion is name of column
	d = data.frame(a=seq_along(myColumn),b=myColumn);
	multis = strsplit(as.character(d$b),",");
	etc. as per Magnus's code.

And thank you Frank for pointing me to mChoice, which will require further
study on my part.
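
A likely reason the as.character() call is needed: in R versions before
4.0.0, data.frame() and read.delim() convert string columns to factors by
default, and strsplit() accepts only character vectors.  The sketch below
reproduces this on the toy data from Magnus's example.  It sets
stringsAsFactors = TRUE explicitly, which is an assumption made here so the
behaviour is the same on current R, and it uses lengths(), which needs
R 3.2.0 or later.

```r
# Toy data from the example above; stringsAsFactors = TRUE is set explicitly
# here (an assumption) to mimic the default of R versions before 4.0.0.
d <- data.frame(a = 1:5, b = c("1", "2,3", "2", "3,4", "1"),
                stringsAsFactors = TRUE)

is_factor <- is.factor(d$b)   # the string column is stored as a factor

# strsplit() only accepts character vectors, so a factor triggers an error;
# tryCatch() captures the message instead of stopping the script.
err <- tryCatch(strsplit(d$b, ","),
                error = function(e) conditionMessage(e))

# Converting to character first makes the split work.
multis <- strsplit(as.character(d$b), ",")
counts <- lengths(multis)     # number of selections per row
d2 <- data.frame(a = rep(d$a, counts), b = unlist(multis))
print(d2)
```

Wrapping the column in as.character() is harmless when it is already a
character vector, so the same line should work regardless of R version.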

Regards,

Damion

Damion Dooley  .   LearningPoint.ca  Website Technology   .   604 877 0304

Slight addendum.  Working from your code, I found that one line of code does
the conversion:

	myColumn = unlist(strsplit(as.character(myData[[myQuestion]]),","));

But the dataframe you set up may prove more useful.
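
As a minimal check of the one-liner, here it is applied to the sample column
from the original question.  The values are typed in by hand as a character
vector rather than read from a file, so the data-loading step is left out.

```r
# Sample column from the original question, held as character strings.
myColumn <- c("1", "0", "2", "0,2", "0", "3", "2", "2,1")

# Split every entry on commas, then flatten the resulting list to one vector.
flat <- unlist(strsplit(as.character(myColumn), ","))

# Each selected answer is now its own element, so table() counts them all.
answerCounts <- table(flat)
print(answerCounts)
```

For this sample the counts come to three 0s, two 1s, four 2s and one 3.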

Regards,

Damion

I'm glad my suggestion was useful. My more comprehensive example assumed 
that you needed to be able to match individual multi-choice selections 
with other questions through the observation ID after the processing.
If that is not needed, the one-liner should be adequate.
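
To illustrate what keeping the observation ID buys, here is a rough sketch
that cross-tabulates the expanded selections against a second question.  The
column names id, choice and group, and the values of group, are hypothetical
and made up for this example only.

```r
# Hypothetical survey: one multi-select question (choice) and one
# single-answer question (group), keyed by an observation id.
responses <- data.frame(id     = 1:5,
                        choice = c("1", "2,3", "2", "3,4", "1"),
                        group  = c("x", "y", "x", "y", "x"),
                        stringsAsFactors = FALSE)

# Expand to one row per selected answer, repeating the id as needed.
parts <- strsplit(as.character(responses$choice), ",")
long  <- data.frame(id     = rep(responses$id, lengths(parts)),
                    choice = unlist(parts),
                    stringsAsFactors = FALSE)

# Bring the other question back in through the shared id, then cross-tabulate.
long <- merge(long, responses[, c("id", "group")], by = "id")
xt   <- table(long$group, long$choice)
print(xt)
```

With the one-liner the id is gone after unlist(), so a cross-tabulation like
this would not be possible.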

Best,
Magnus