Subsetting data where the condition is that the value of some column contains some substring
grep and regexpr return different values. regexpr returns a vector of the same length as the input (the position of the match in each element, or -1 where there is no match), and this can be used to construct a logical subscript. grep returns a vector containing only the indices of the elements that match, so it has length zero if there are no matches. That makes it harder to create the subsets: you have to test for zero length and then do something special.
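To make the difference in return values concrete, here is a minimal sketch on a made-up character vector (`inputs` is an illustration, not the poster's data). It also shows `grepl`, which was added to R after this thread and returns the logical vector directly, so it assumes a reasonably recent R.

```r
# Hypothetical example vector, not taken from the thread's data
inputs <- c("give(her, it)", "give(my sister, it)", "donate(her, it)")

# grep returns the indices of the elements that match
hits <- grep("her", inputs)       # elements 1 and 3 match
none <- grep("zzz", inputs)       # zero-length integer vector

# regexpr returns one value per element:
# the position of the first match, or -1 where there is no match
pos <- regexpr("her", inputs)     # 6, -1, 8
matched <- pos != -1              # TRUE, FALSE, TRUE

# grepl (assumes a later R version) gives the same logical vector
matched2 <- grepl("her", inputs)
```

Because `matched` has one entry per element, it can be used directly as a row subscript, and `!matched` gives the complement without any special case for zero matches.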
On Fri, Mar 20, 2009 at 9:20 PM, Max Bane <max.bane at gmail.com> wrote:
Thanks, Jim (and Mark, who replied off-list) -- that does the trick. I had tried using an index expression with grep, but that failed in the same way as the subset method. It is still rather mysterious why this works with regexpr but not with grep :)

-Max

On Fri, Mar 20, 2009 at 7:57 PM, jim holtman <jholtman at gmail.com> wrote:
Try using regexpr instead:
x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
give(mysister,theoldbook)       P  47.0  56016 0.1543651
donate(her,thebook)             P  48.7  68928 0.1899471
give(mysister,thebook)          P  73.4  80136 0.2208333
donate(mysister,theoldbook)     P  79.0  57024 0.1571429
give(mysister,it)               P 100.0 132408 0.3648810
give(her,it)                    P 100.0 157248 0.4333333
donate(mysister,it)             P 100.0 130720 0.3602293
give(her,thebook)               P   5.7  65232 0.1797619
donate(her,it)                  P 100.0 152064 0.4190476
give(mylittlesister,thebook)    P  91.8 112032 0.3087302
donate(mylittlesister,thebook)  P  98.4 114048 0.3142857
donate(mysister,thebook)        P  94.4  82800 0.2281746"), header=TRUE)
# use regexpr
matched <- regexpr("her", x$input) != -1
notMatched <- !matched
x[matched,]
                input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook)      P       48.7  68928   0.1899471
6        give(her,it)      P      100.0 157248   0.4333333
8   give(her,thebook)      P        5.7  65232   0.1797619
9      donate(her,it)      P      100.0 152064   0.4190476
x[notMatched,]
                            input output corpusFreq pvolOT pvolRatioOT
1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
3          give(mysister,thebook)      P       73.4  80136   0.2208333
4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
5               give(mysister,it)      P      100.0 132408   0.3648810
7             donate(mysister,it)      P      100.0 130720   0.3602293
10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
12       donate(mysister,thebook)      P       94.4  82800   0.2281746
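For comparison, the same split can be sketched with grep indices. This is only an illustration on a small made-up data frame (`df` and its values are assumptions, not the data above). Selecting the matching rows works directly, but building the complement with a negative index goes wrong when nothing matches, which is the zero-length special case; `setdiff` over the row numbers is one way around it.

```r
# Small hypothetical data frame standing in for x
df <- data.frame(input = c("give(her,it)", "give(mysister,it)", "donate(her,it)"),
                 output = c("P", "P", "P"),
                 stringsAsFactors = FALSE)

idx <- grep("her", df$input)     # indices of the matching rows
with_her <- df[idx, ]            # rows 1 and 3

# Complement that stays correct even when idx is empty
without_her <- df[setdiff(seq_len(nrow(df)), idx), ]

# Pitfall: with no matches, a negated empty index is still empty,
# so it selects no rows instead of all rows
empty_idx <- grep("zzz", df$input)
wrong <- df[-empty_idx, ]                                # zero rows
right <- df[setdiff(seq_len(nrow(df)), empty_idx), ]     # all rows
```

The logical subscript from regexpr avoids this bookkeeping, since negating it with `!` always has the right length.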
On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.bane at gmail.com> wrote:
I have some data that looks like this:
dataP
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

I would like to extract the subset of this data in which the value of the "input" column contains the substring "her". I was thinking I could use the grep function to test for the presence of this substring. For instance, if a string does not contain it, then grep returns a zero-length integer vector:
grep("her", "give(my sister, it)")
integer(0)

And if the string does contain the substring, grep returns a vector of the indices where the substring is located:
grep("her", "give(her, it)")
[1] 1

I can thus test for the presence of the substring by converting the length of the result of grep into a boolean:
as.logical(length(grep("her", "give(my sister, it)")))
[1] FALSE
as.logical(length(grep("her", "give(her, it)")))
[1] TRUE
as.logical(length(grep("her", "give(her, it)"))) == TRUE
[1] TRUE
as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
[1] FALSE

I would like to use this test as a criterion for constructing a subset of my data. Unfortunately, it does not work:
subset(dataP, as.logical(length(grep("her", input)))==TRUE)
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

As you can see, I get back the whole data set, rather than just the subset where the input column contains "her". And if I invert the test, which I would expect to give the subset *not* containing "her", I instead get the empty subset, rather mysteriously:
subset(dataP, as.logical(length(grep("her", input)))==FALSE)
[1] input       output      corpusFreq  pvolOT      pvolRatioOT
<0 rows> (or 0-length row.names)

The type of the input column is definitely character. To be double sure:
subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
does the same thing. Could somebody with more R experience than I have please explain what I am doing wrong here? I'll be much obliged.

--
Max Bane
PhD Student, Linguistics
University of Chicago
bane at uchicago.edu
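One reading of the behaviour described in this question, sketched on a hypothetical three-row data frame (`d` is a stand-in, not dataP): inside subset, grep is applied to the whole column at once, so `length(grep(...))` is a single number and the condition is a single TRUE or FALSE. A length-one condition is recycled across all rows, which would explain getting either every row or none.

```r
# Hypothetical stand-in for dataP
d <- data.frame(input = c("give(her, it)", "give(my sister, it)", "donate(her, it)"),
                output = c("P", "P", "P"),
                stringsAsFactors = FALSE)

# Applied to the whole column, grep returns the indices of all matching rows
n_hits <- length(grep("her", d$input))   # number of matching rows
cond <- as.logical(n_hits)               # one TRUE, not one value per row

# The length-one condition is recycled: every row is kept, or none is
all_rows <- subset(d, as.logical(length(grep("her", input))) == TRUE)
no_rows  <- subset(d, as.logical(length(grep("her", input))) == FALSE)

# A condition with one logical value per row gives the intended subset
her_rows <- subset(d, regexpr("her", input) != -1)
```

The single-string tests earlier in the question worked because each call passed grep exactly one string, so the length of the result happened to be 0 or 1.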
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?