How to create a new column based on the values from multiple columns which are matching a particular string?
While Eric's solution is correct( mod "corner" cases like all NA's in
a row), it can be made considerably more efficient.
One minor improvement can be made by using the idiom
any(x == "A")
instead of matching via %in% for the simple case of matching just a
single value.
However, a considerable improvement can be made by getting fancy,
taking advantage of do.call() and the pmax() function to mostly
vectorize the calculation. Here are the details and timing on a large
data frame.
(Note: I removed the names in the %in% approach for simplicity. It has
almost no effect on timings.
I also moved the as.integer() call out of the function so that it is
called only once at the end, which improves efficiency a bit)
1. Eric's original:
fun1 <-function(df,what)
{
as.integer(unname(apply(df,MARGIN = 1,function(v) { what %in% v })))
}
2. Using any( x == "A") instead:
fun2 <- function(df,what)
{
as.integer(unname(apply(df,MARGIN =1, function(x)any(x == what,
na.rm=TRUE))))
}
3. Getting fancy to use pmax()
fun3 <- function(df,what)
{
z <- lapply(df,function(x)as.integer((x==what)))
do.call(pmax,c(z,na.rm=TRUE))
}
Here are the timings:
bigdf <- df[rep(1:10,1e4), rep(1:5, 50)] dim(bigdf)
[1] 100000 250
system.time(res1 <- fun1(bigdf, "A"))
user system elapsed 2.204 0.432 2.637
system.time(res2 <- fun2(bigdf, "A"))
user system elapsed 1.898 0.403 2.302
system.time(res3 <- fun3(bigdf, "A"))
user system elapsed 0.187 0.048 0.235 ## 10 times faster!
all.equal(res1,res2)
[1] TRUE
all.equal(res1,res3)
[1] TRUE NB: I freely admit that Eric's original solution may well be perfectly adequate, and the speed improvement is pointless. In that case, maybe this is at least somewhat instructive for someone. Nevertheless, I would welcome further suggestions for improvement, as I suspect my "fancy" approach is still a ways from what one can do (in R code, without resorting to C++). Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Mon, Jul 29, 2019 at 12:38 PM Eric Berger <ericjberger at gmail.com> wrote:
Read the help for apply and %in% ?apply ?%in% Sent from my iPhone
On 29 Jul 2019, at 22:23, Ana Marija <sokovic.anamarija at gmail.com> wrote: Thank you so much! Just to confirm here MARGIN=1 indicates that "A" should appear at least once per row?
On Mon, Jul 29, 2019 at 1:53 PM Eric Berger <ericjberger at gmail.com> wrote:
df$case <- apply(df,MARGIN = 1,function(v) { as.integer("A" %in% v) })
On Mon, Jul 29, 2019 at 9:02 PM Ana Marija <sokovic.anamarija at gmail.com> wrote:
sorry my bad, here is the edited version:
so the data frame is this:
df=data.frame(
eye_problemsdisorders_f6148_0_1=c("A","C","D",NA,"D","A","C",NA,"B","A"),
eye_problemsdisorders_f6148_0_2=c("B","C",NA,"A","C","B",NA,NA,"A","D"),
eye_problemsdisorders_f6148_0_3=c("C","A","D","D","B","A",NA,NA,"A","B"),
eye_problemsdisorders_f6148_0_4=c("D","D",NA,"B","A","C",NA,"C","A","B"),
eye_problemsdisorders_f6148_0_5=c("C","C",NA,"D","B","C",NA,"D","D","B")
and I would need to put inside the column which would be named "case" and values inside would be: 1,1,0,1,1,1,0,0,1,1
so "case" column is where value "A" can be found in any column.
On Mon, Jul 29, 2019 at 12:53 PM Eric Berger <ericjberger at gmail.com> wrote: You may have a typo/misstatement in your question. You define a data frame with 5 columns, each of which has 10 elements, so your data frame has dimensions 10 x 5. Then you request a new COLUMN which will have only 5 elements, which is not allowed. All columns of a data frame must have the same length.
On Mon, Jul 29, 2019 at 8:42 PM Ana Marija <sokovic.anamarija at gmail.com> wrote:
I have data frame which looks like this:
df=data.frame(
eye_problemsdisorders_f6148_0_1=c(A,C,D,NA,D,A,C,NA,B,A),
eye_problemsdisorders_f6148_0_2=c(B,C,NA,A,C,B,NA,NA,A,D),
eye_problemsdisorders_f6148_0_3=c(C,A,D,D,B,A,NA,NA,A,B),
eye_problemsdisorders_f6148_0_4=c(D,D,NA,B,A,C,NA,C,A,B),
eye_problemsdisorders_f6148_0_5=c(C,C,NA,D,B,C,NA,D,D,B))
In reality I have much more columns and they don't always match
"eye_problemsdisorders_f6148" this string, and there is much more rows.
What I would like to do is create a new column, say named "case" where I
would have value "1" for every row where string "A" appears at least once
in any column, if not the value would be "0". So in the above example
column "case" would have these values: 1,1,1,1,0
Thanks
Ana
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.