
Matching multiple search criteria (Unlisting a nested dataset, take 2)

OK, as no one else has offered a solution, I'll take a whack at it.

Caveats: This is a brute force attempt using R's basic regular expression
engine. It is inelegant and barely tested, so likely to be at best
incomplete and buggy, and at worst, incorrect. But maybe Nathan or someone
else on the list can fix it up. So if (when) it breaks, complain on the
list to give someone (almost certainly not me) the opportunity.

The basic idea is that the tweets are just character strings and the search
phrases are just character vectors all of whose elements must match
"appropriately" -- i.e. they must match whole words -- in the character
strings. So my desired output from the code is a list indexed by the search
phrases, each of whose components is a logical vector with one element per
tweet; an element is TRUE iff all the words in the search phrase match
somewhere in that tweet.
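
To make the target output concrete, here is a minimal sketch of that idea. It is not the code from this post: the data are made up (not Nathan's), the helper name match_all_words is hypothetical, and it uses the \b word-boundary escape rather than separate patterns for the start, middle, and end of a tweet.

```r
## Made-up example data (not Nathan's): two phrases, three "tweets"
phrases <- c("i feel worthless", "me hurt depressed")
tweets <- c("i feel so worthless today",
            "feeling worthless",
            "it hurt me and i am depressed")

## Hypothetical helper: TRUE iff every word of `phrase` occurs as a
## whole word in a tweet; returns one logical value per tweet
match_all_words <- function(phrase, tweets) {
  words <- strsplit(phrase, " +")[[1]]
  ## one column of hits per word
  hits <- vapply(words,
                 function(w) grepl(paste0("\\b", w, "\\b"), tweets),
                 logical(length(tweets)))
  ## force a matrix so that a single tweet also works
  hits <- matrix(hits, nrow = length(tweets))
  apply(hits, 1, all)
}

## list indexed by the search phrases
result <- lapply(setNames(phrases, phrases), match_all_words, tweets = tweets)
```

In this sketch "feeling worthless" does not match "i feel worthless", because "feel" and "i" only count when they appear as whole words.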

Here's the code (using the data Nathan provided):
## convert the phrases to a list of character vectors of the words
## Result:
$`me abused depressed`
[1] "me"        "abused"    "depressed"

$`me hurt depressed`
[1] "me"        "hurt"      "depressed"

$`feel hopeless depressed`
[1] "feel"      "hopeless"  "depressed"

$`feel alone depressed`
[1] "feel"      "alone"     "depressed"

$`i feel helpless`
[1] "i"        "feel"     "helpless"

$`i feel worthless`
[1] "i"         "feel"      "worthless"
"),x, c(" "," "," *$")))
## function to create regexes for words when they are at the beginning,
## middle, or end of tweets
##Result
## too lengthy to include
##
##extract the tweets
findin <- function(x, y)
   ## x is a vector of regex patterns
   ## y is a character vector
   ## value = vector, vec, with length(vec) == length(y) and vec[i] == TRUE
   ## iff any of x matches y[i]
{ apply(sapply(x, function(z) grepl(z, y)), 1, any)
}
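
As a quick check of what this helper is meant to do, here is a small sketch. The function is restated so the block runs on its own; its header is assumed from the call findin(x,tweets) further down, and the patterns and tweets are made up for illustration.

```r
## assumed header; body as in the post
findin <- function(x, y) {
  apply(sapply(x, function(z) grepl(z, y)), 1, any)
}

## made-up patterns for the word "feel" at the start, middle, or end
pats <- c("^ *feel ", " feel ", " feel *$")
tw <- c("feel bad today", "i feel bad", "how i feel", "feeling fine")

## one logical value per tweet: does any of the patterns match?
res <- findin(pats, tw)
```

One caveat: with a single tweet, sapply() returns a plain vector instead of a matrix and apply() then fails, so this helper assumes at least two tweets.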

## add a matching "tweet" to the tweet vector:
lapply(wordregex,function(z)apply(sapply(z,function(x)findin(x,tweets)), 1,
all))
## Result:
$`me abused depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`me hurt depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`feel hopeless depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`feel alone depressed`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`i feel helpless`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$`i feel worthless`
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

## None of the tweets match any of the phrases except for the last tweet
## that I added.

## Note: you need to add capabilities to handle upper and lower case. See,
## e.g., ?casefold
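
A minimal sketch of two ways to handle case, assuming a made-up tweet vector and a single illustrative pattern: either pass ignore.case = TRUE to grepl(), or fold the tweets to lower case before matching.

```r
## Made-up tweets with mixed case
tweets <- c("I FEEL Worthless", "nothing to see here")
pattern <- "\\bfeel\\b"

## case-sensitive match misses the upper-case "FEEL"
strict <- grepl(pattern, tweets)

## option 1: let the regex engine ignore case
loose1 <- grepl(pattern, tweets, ignore.case = TRUE)

## option 2: fold the text to lower case first
## (casefold(tweets) does the same as tolower(tweets))
loose2 <- grepl(pattern, tolower(tweets))
```

If the search phrases themselves may contain capitals, they would need the same folding as the tweets.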

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Tue, Oct 16, 2018 at 3:03 PM Bert Gunter <bgunter.4567 at gmail.com> wrote: