using package tm to find phrases - R-help

Thu, Aug 13, 2009 12:36 PM


Ingo Feinerer

Thu, Aug 13, 2009 3:11 PM

On Thu, Aug 13, 2009 at 03:36:22PM -0400, Mark Kimpel wrote:

Yes.

* In case you only need to find instances, you could use full text
  search on your corpus, e.g.

  R> tmIndex(yourCorpus, "gene regulatory protein 1")

  would return the indices of all documents in your corpus containing
  this phrase.

* If you need tokens (in a term-document matrix) of length 4, you could
  use an n-gram tokenizer (n = 4). See, e.g.,
  http://tm.r-forge.r-project.org/faq.html#Bigrams. Then you can use
  the dictionary argument to store only your selection of gene
  names, i.e., something like

  R> yourTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 4, max = 4))
  R> TermDocumentMatrix(crude, control = list(tokenize = yourTokenizer, dictionary = yourDictionary))

  where yourDictionary contains the gene names (a character vector
  suffices) to be included in the term-document matrix.

* If you want to extract arbitrary patterns of different length that
  could match some gene names (and build a dictionary from that), you
  need some custom functionality. Regular expressions might be a good
  starting point ...
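
A minimal sketch of the regular-expression route from the last point, using only base R. The example documents and the pattern are assumptions made up for illustration (two lowercase words, then "protein", then a number); real gene names will need a pattern tailored to their naming scheme.

```r
# Hypothetical example documents; in practice these would be the text of
# the documents in your corpus.
docs <- c(
  "Expression of gene regulatory protein 1 increased in the treated group.",
  "Neither heat shock protein 70 nor gene regulatory protein 1 changed.",
  "No gene names are mentioned in this document."
)

# Assumed pattern: two lowercase words, the word "protein", then a number.
pattern <- "\\b[a-z]+ [a-z]+ protein [0-9]+\\b"

# gregexpr() locates every match in each document and regmatches()
# extracts the matched substrings, giving one character vector per document.
matches <- regmatches(docs, gregexpr(pattern, docs, perl = TRUE))

# The unique matched phrases form a candidate dictionary.
yourDictionary <- sort(unique(unlist(matches)))

# Indices of the documents that contain at least one match.
hits <- which(lengths(matches) > 0)

print(yourDictionary)
print(hits)
```

A character vector built this way could then be passed as the dictionary in the term-document matrix call shown earlier, provided the matched phrases have the same number of words as the tokens the tokenizer produces.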

Best regards, Ingo

Ingo Feinerer
Vienna University of Technology
http://www.dbai.tuwien.ac.at/staff/feinerer

Mark Kimpel

Fri, Aug 14, 2009 7:18 AM
