detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe
Yup, that does it. Let grep figure out what's a word rather than doing it manually. Forgot about "\b".

Cheers,
Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
   -- Clifford Stoll

On Thu, Jul 9, 2015 at 10:30 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
Just add a word break marker before and after:
zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
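
A small self-contained check of the word-break idea may help here. This is only a sketch: the `txt` vector is made up for illustration and is not part of the thread's data, and `loose`/`strict` are hypothetical names.

```r
alarm.words <- c("red", "green", "turtle", "gandalf")

# Hypothetical test strings: a substring-only hit, a whole-word hit,
# a whole-word hit after a newline, and another substring-only hit
txt <- c("the gate was barred", "red sparks", "blue\nred", "greenish")

loose  <- paste0(alarm.words, collapse = "|")  # "red|green|turtle|gandalf"
strict <- paste0("\\b(", loose, ")\\b")        # same alternation, wrapped in word breaks

loose.hits  <- grepl(loose,  txt)  # TRUE  TRUE TRUE TRUE  (matches inside longer words too)
strict.hits <- grepl(strict, txt)  # FALSE TRUE TRUE FALSE (whole words only)
```

Because "\\b" treats any non-word character as a break, the newline case needs no special handling.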
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
On July 9, 2015 10:12:23 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
Jeff:

Well, it would be much better (no loops!) except, I think, for one issue: "red" would match "barred", and I don't think that this is what is wanted: the matches should be on whole "words", not just string patterns.

So you would need to fix up the matching pattern to make this work, but it may be a little tricky, as arbitrary whitespace characters, e.g. " " or "\n", could separate the words to be matched or end the "sentence." I'm sure it can be done, but I'll leave it to you or others to figure it out.

Of course, if my diagnosis is wrong or silly, please point this out.

Cheers,
Bert

Bert Gunter

On Thu, Jul 9, 2015 at 9:34 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
I think grep is better suited to this:

zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )
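
To make the two building blocks of that one-liner visible, here is a minimal sketch on a cut-down data frame. The three rows are borrowed from the original example; the intermediate names `pattern` and `combined` are introduced here for illustration only.

```r
# Three rows borrowed from the original example data
zz <- data.frame(v2 = c("white bird", "green turtle", "black bear"),
                 v3 = c("harry potter", "ronald weasley", "gandalf the grey"),
                 stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# One regular expression that matches any of the keywords
pattern <- paste0(alarm.words, collapse = "|")

# do.call() passes each selected column to paste() as a separate argument,
# giving one combined string per row
combined <- do.call(paste, zz[, c("v2", "v3")])

# TRUE for every row whose combined text contains at least one keyword
zz$v5 <- grepl(pattern, combined)
```

Using do.call() rather than naming the columns in paste() means the same line works for any number of selected columns.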
Sent from my phone. Please excuse my brevity.

On July 9, 2015 8:51:10 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
Here's a way to do it that uses %in% (i.e. match() ) and uses only a single, not a double, loop. It should be more efficient.
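
Before the full call, a toy sketch of the split-then-match idea on made-up strings (the `txt` vector is hypothetical). It also shows one limitation worth knowing about: splitting on whitespace alone leaves punctuation attached to words, so "red," is not recognised as "red" unless punctuation is removed first.

```r
alarm.words <- c("red", "green", "turtle", "gandalf")

# Hypothetical strings; note the comma attached to "red," in the last one
txt <- c("red sparks", "the gate was barred",
         "gandalf  the\tgrey", "red, white and blue")

# Split each string into words on runs of whitespace, then test membership
words <- strsplit(txt, "[[:space:]]+")
hits  <- vapply(words, function(w) any(w %in% alarm.words), logical(1))
# hits: TRUE FALSE TRUE FALSE

# Optional: turn punctuation into spaces before splitting
clean.words <- strsplit(gsub("[[:punct:]]+", " ", txt), "[[:space:]]+")
clean.hits  <- vapply(clean.words, function(w) any(w %in% alarm.words), logical(1))
# clean.hits: TRUE FALSE TRUE TRUE
```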
sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
       function(x)any(x %in% alarm.words))
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE

The idea is to paste the strings in each row (do.call allows an arbitrary number of columns) into a single string and then use strsplit to break the string into individual "words" on whitespace. Then the matching is vectorized with the any( %in% ... ) call.

Cheers,
Bert

Bert Gunter

On Thu, Jul 9, 2015 at 6:05 AM, John Fox <jfox at mcmaster.ca> wrote:
Dear Chris,

If I understand correctly what you want, how about the following?
rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words,
grepl, x=x)))
zz[rows, ]
          v1                              v2                v3 v4
3  -1.022329                    green turtle    ronald weasley  2
6   0.336599              waffle the hamster        red sparks  1
9  -1.631874 yellow giraffe with a long neck gandalf the white  1
10  1.130622                      black bear  gandalf the grey  2
I hope this helps,
John
------------------------------------------------
John Fox, Professor
McMaster University
Hamilton, Ontario, Canada
http://socserv.mcmaster.ca/jfox/
On Wed, 08 Jul 2015 22:23:37 -0400
"Christopher W. Ryan" <cryan at binghamton.edu> wrote:
Running R 3.1.1 on Windows 7.

I want to identify as a case any record in a dataframe that contains any of several keywords in any of several variables.
Example:
# create a dataframe with 4 variables and 10 records
v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox",
        "big black dog", "waffle the hamster", "benny likes food a lot",
        "hello world", "yellow giraffe with a long neck", "black bear")
v3 <- c("harry potter", "hermione grainger", "ronald weasley",
        "ginny weasley", "dudley dursley", "red sparks", "blue sparks",
        "white dress robes", "gandalf the white", "gandalf the grey")
zz <- data.frame(v1=rnorm(10), v2=v2, v3=v3, v4=rpois(10, lambda=2),
                 stringsAsFactors=FALSE)
str(zz)
zz
# here are the keywords
alarm.words <- c("red", "green", "turtle", "gandalf")
# For each row/record, I want to test whether the string in v2 or the
# string in v3 contains any of the strings in alarm.words. And then if so,
# set zz$v5=TRUE for that record.

# I'm thinking the str_detect function in the stringr package ought to
# be able to help, perhaps with some use of apply over the rows, but I
# obviously misunderstand something about how str_detect works

library(stringr)

str_detect(zz[,2:3], alarm.words)    # error: the target of the search
                                     # must be a vector, not multiple
                                     # columns

str_detect(zz[1:4,2:3], alarm.words) # same error

str_detect(zz[,2], alarm.words)      # error, length of alarm.words
                                     # is less than the number of
                                     # rows I am using for the
                                     # comparison

str_detect(zz[1:4,2], alarm.words)   # works as hoped when
length(alarm.words)                  # confining nrows
                                     # to the length of alarm.words

str_detect(zz, alarm.words)          # obviously not right
# maybe I need apply() ?
my.f <- function(x){str_detect(x, alarm.words)}
apply(zz[,2], 1, my.f) # again, a mismatch in lengths
# between alarm.words and that
# in which I am searching for
# matching strings
apply(zz, 2, my.f) # now I'm getting somewhere
apply(zz[1:4,], 2, my.f) # but still only works with 4
# rows of the dataframe
# perhaps %in% could do the job?
Appreciate any advice.
--Chris Ryan
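
The length mismatches described above appear to come from str_detect() pairing strings and patterns element by element rather than testing every keyword against every string. One possible workaround, sketched here under assumptions, is to collapse the keywords into a single pattern and search one pasted string per row. The three-row `zz` is a cut-down stand-in for the data above, `pattern` and `combined` are illustrative names, and the base R fallback is only there so the sketch runs when stringr is not installed.

```r
# Cut-down stand-in for the example data (three of the ten rows)
zz <- data.frame(v2 = c("white bird", "waffle the hamster", "black bear"),
                 v3 = c("harry potter", "red sparks", "gandalf the grey"),
                 stringsAsFactors = FALSE)
alarm.words <- c("red", "green", "turtle", "gandalf")

# A single pattern: any keyword, matched as a whole word
pattern  <- paste0("\\b(", paste(alarm.words, collapse = "|"), ")\\b")

# One string per row, so string and pattern lengths no longer need to agree
combined <- paste(zz$v2, zz$v3)

if (requireNamespace("stringr", quietly = TRUE)) {
  zz$v5 <- stringr::str_detect(combined, pattern)
} else {
  zz$v5 <- grepl(pattern, combined)  # base R fallback
}
```

With a length-one pattern, the result has one logical value per row regardless of how many keywords there are.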
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.