using package tm to find phrases
3 messages · Mark Kimpel, Ingo Feinerer
On Thu, Aug 13, 2009 at 03:36:22PM -0400, Mark Kimpel wrote:
I am using the package "tm" for text-mining of abstracts and would like to use it to find instances of gene names that may contain white space. For instance "gene regulatory protein 1". The default behavior of tm is to parse this into 4 separate words, but I would like to use the class constructor "dictionary" to define phrases such as just mentioned. Is this possible? If so, how?
Yes.

* If you only need to find instances, you can run a full-text search on your corpus, e.g.

    R> tmIndex(yourCorpus, "gene regulatory protein 1")

  returns the indices of all documents in your corpus containing this phrase.

* If you need tokens (in a term-document matrix) of length 4, you can use an n-gram tokenizer (n = 4). See e.g. http://tm.r-forge.r-project.org/faq.html#Bigrams. Then you can use the dictionary argument to store only your selection of gene names, i.e., something like

    R> yourTokenizer <- function(x) RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 4, max = 4))
    R> TermDocumentMatrix(crude, control = list(tokenize = yourTokenizer, dictionary = yourDictionary))

  where yourDictionary contains the gene names (a character vector suffices) to be included in the term-document matrix.

* If you want to extract arbitrary patterns of different lengths that could match some gene names (and build a dictionary from that), you need some custom functionality. Regular expressions might be a good starting point ...

Best regards, Ingo
Ingo Feinerer Vienna University of Technology http://www.dbai.tuwien.ac.at/staff/feinerer
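As a rough illustration of the regular-expression route mentioned in the last point, here is a minimal base-R sketch. It is not part of tm. The toy abstracts, the variable names, and the pattern itself (one or two lowercase words, then "protein", then a number) are assumptions made up for this example, and real gene names will need a more careful pattern.

```r
# Toy abstracts, invented for illustration only.
abstracts <- c(
  "We show that gene regulatory protein 1 binds the promoter.",
  "No candidate genes were identified in this cohort.",
  "Both gene regulatory protein 1 and heat shock protein 70 were induced."
)
lower <- tolower(abstracts)

# Fixed phrase: indices of the documents that contain it verbatim.
phrase <- "gene regulatory protein 1"
docs_with_phrase <- which(grepl(phrase, lower, fixed = TRUE))

# Variable-length pattern (an assumption, not a general gene-name grammar):
# one or two lowercase words, the word "protein", then a number.
pattern <- "\\b(?:[a-z]+ ){1,2}protein [0-9]+\\b"
hits <- regmatches(lower, gregexpr(pattern, lower, perl = TRUE))

# Collapse the per-document matches into a character vector of candidate
# phrases that could serve as a dictionary.
candidate_dictionary <- unique(unlist(hits))

print(docs_with_phrase)
print(candidate_dictionary)
```

The resulting character vector could then be passed as the dictionary in the TermDocumentMatrix call above, provided the tokenizer produces n-grams of matching lengths.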