Basic question: Reading in multiple choice question responses to a single column in data frame
7 messages · Damion Dooley, Frank E Harrell Jr, Magnus Torfason
You might look at the mChoice function in the Hmisc package for some indirect help. Frank
Damion Dooley wrote:
I'm using read.delim to successfully read in tab delimited data, but some
columns' values are comma separated, reflecting the fact that the user chose
a few answers on a multi-select question. I understand that each answer is
its own category and so could be represented as a separate column in the
data set, but I'd like the option of reading in the data column and
converting it to a vector in which every value (comma separated or not)
has its own entry, so that the "table(columnData)" function counts
correctly.
So some code:
myData = read.delim(myDataFile, row.names=1, header=TRUE, skip=10)  # works fine
myColumn = myData[[question]]  # works fine, selects correct question column data
myColumn data is now e.g.:
1
0
2
0,2
0
3
2
2,1
with the comma separated values looking like atomic string values, I guess.
But I would like:
1
0
2
0
2
0
3
2
2
1
I've tried various things, e.g. grep to recognize and expand the comma
separated values, but since vector functions are at work, I can only put
one value back into the myColumn data, e.g. a "0,2" entry becomes "0" or "2"
if I use
myColumn = gsub("^([0-9]+),([0-9]+)$", c('\\1'), myColumn, perl=TRUE)  # or replace with c('\\2')
but I can't replace into c('\\1','\\2')
Any elegant or otherwise ways to do this?
Much appreciated,
Damion
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University
Are you looking for something like this?
> d = data.frame(a=1:5,b=c("1","2,3","2","3,4","1"))
> d
  a   b
1 1   1
2 2 2,3
3 3   2
4 4 3,4
5 5   1
> multis = strsplit(d$b,",")
> counts = sapply(strsplit(d$b,","),length )
> d2 = data.frame( a=rep(d$a,counts), b=unlist(multis) )
> d2
  a b
1 1 1
2 2 2
3 2 3
4 3 2
5 4 3
6 4 4
7 5 1
Best,
Magnus
On 8/19/2009 3:12 PM, Damion Dooley wrote:
I'm using read.delim to successfully read in tab delimited data, but some columns' values are comma separated, reflecting the fact that the user chose a few answers on a multi-select question. ...
Magnus,
Looks like that solution should work, and I like the flexibility of your
data output, but I get an "error in strsplit(d$b,","): non-character
argument" at:
multis = strsplit(d$b,",")
Seems like the c() function converts integer looking items like "1" into
integers and then strsplit fails on them? I was running into this earlier
when attempting strsplit directly on column values.
Damion
-----Original Message-----
From: Magnus Torfason [mailto:zulutime.net at gmail.com]
Sent: August 19, 2009 12:33 PM
To: Damion Dooley
Cc: r-help at r-project.org
Subject: Re: [R] Basic question: Reading in multiple choice question
responses to a single column in data frame
Are you looking for something like this?
> d = data.frame(a=1:5,b=c("1","2,3","2","3,4","1"))
> d
  a   b
1 1   1
2 2 2,3
3 3   2
4 4 3,4
5 5   1
> multis = strsplit(d$b,",")
> counts = sapply(strsplit(d$b,","),length )
> d2 = data.frame( a=rep(d$a,counts), b=unlist(multis) )
> d2
  a b
1 1 1
2 2 2
3 2 3
4 3 2
5 4 3
6 4 4
7 5 1
Best,
Magnus
Hi, Magnus,
I discovered that
multis = strsplit(as.character(d$b),",")
works in the example you gave. Thanks very much; it looks like that's the
way I'll go for now. P.S. For others who may want it, my selected column
was plugged in as
myData=read.delim(myDataFile etc. etc....);
myColumn = myData[[myQuestion]]; #myQuestion is name of column
d = data.frame(a=1:length(myColumn),b=myColumn);
multis = strsplit(as.character(d$b),",");
etc. as per Magnus's code.
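A likely reason as.character() is needed here: in R versions before 4.0.0, data.frame() and read.delim() default to stringsAsFactors = TRUE, so a column such as "2,3" is stored as a factor, and strsplit() only accepts character vectors. The sketch below assumes that explanation and sets stringsAsFactors = TRUE explicitly so the behaviour can be reproduced on any R version; the variable names follow Magnus's example.

```r
# Force factor conversion so the example behaves the same on old and new R.
d <- data.frame(a = 1:5,
                b = c("1", "2,3", "2", "3,4", "1"),
                stringsAsFactors = TRUE)

# The column is a factor, not a character vector.
is.factor(d$b)

# strsplit() refuses a factor; capture the error message instead of stopping.
res <- tryCatch(strsplit(d$b, ","),
                error = function(e) conditionMessage(e))

# Converting to character first makes the split work.
multis <- strsplit(as.character(d$b), ",")
counts <- lengths(multis)  # number of answers chosen in each row
d2 <- data.frame(a = rep(d$a, counts), b = unlist(multis))
d2
```

Passing stringsAsFactors = FALSE to read.delim() would be another way to avoid the conversion, at the cost of changing how every text column in the file is read.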
And thank you Frank for pointing me to mChoice, which will require further
study on my part.
Regards,
Damion
Damion Dooley . LearningPoint.ca Website Technology . 604 877 0304
Slight addendum. Working from your code, I found 1 line of code does the
conversion:
myColumn = unlist(strsplit(as.character(myData[[myQuestion]]),","));
But the dataframe you set up may prove more useful.
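As a small check of the one-liner, here is a sketch using the example values from the original question. The vector myColumn below is a hypothetical stand-in for myData[[myQuestion]], and the trimws() call is an added assumption to cope with entries typed as "0, 2".

```r
# Hypothetical stand-in for myData[[myQuestion]], using the values from the question.
myColumn <- c("1", "0", "2", "0,2", "0", "3", "2", "2,1")

# Split on commas, flatten to one vector, and strip stray whitespace.
answers <- trimws(unlist(strsplit(as.character(myColumn), ",")))

# Every selected answer is now counted once.
counts <- table(answers)
print(counts)
```

With these values the flattened vector has ten entries, so table() counts each choice in "0,2" and "2,1" separately.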
Regards,
Damion
On 8/19/2009 11:06 PM, Damion Dooley wrote:
Slight addendum. Working from your code, I found 1 line of code does the conversion: myColumn = unlist(strsplit(as.character(myData[[myQuestion]]),",")); But the dataframe you set up may prove more useful.
I'm glad my suggestion was useful. My more comprehensive example assumed that you needed to be able to match individual multi-choice selections with other questions through the observation ID after the processing. If that is not needed, the one-liner should be adequate.
Best,
Magnus
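To illustrate the point about matching selections to other questions through the observation ID, here is a minimal sketch. The survey data frame and its columns id, q1 and q2 are hypothetical and not taken from the thread.

```r
# Hypothetical survey: q1 is multi-select, q2 is a single-answer question.
survey <- data.frame(id = 1:5,
                     q1 = c("1", "2,3", "2", "3,4", "1"),
                     q2 = c("yes", "no", "yes", "yes", "no"),
                     stringsAsFactors = FALSE)

# One row per individual selection, keeping the observation ID.
multis <- strsplit(survey$q1, ",")
long <- data.frame(id = rep(survey$id, lengths(multis)),
                   q1 = unlist(multis),
                   stringsAsFactors = FALSE)

# Join each selection back to the respondent's other answers.
joined <- merge(long, survey[, c("id", "q2")], by = "id")

# Cross-tabulate the q1 selections against q2.
xt <- table(joined$q1, joined$q2)
print(xt)
```

The one-liner loses the ID, so it suits a simple frequency count; the long data frame is the form to keep when selections need to be related to other columns.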